Following qq_慕神6513837's reply, I changed the following 4 places:
1. Modify settings.py
Change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.
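For reference, a minimal sketch of the relevant line in settings.py (assuming the default Scrapy project layout):

# settings.py -- stop Scrapy from obeying robots.txt, which would otherwise filter the cnblogs requests
ROBOTSTXT_OBEY = False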
2. Set PyCharm's default browser to Chrome: File -> Settings -> Tools -> Web Browsers and Preview, check only Chrome, then set Default Browser to "First listed" and confirm.
3. Match the version number passed to uc.Chrome() to the Chrome version in the user-agent.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
Because the Chrome version in my user-agent header is 97, I pinned uc.Chrome to major version 97: driver = uc.Chrome(version_main=97).
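If you prefer not to hardcode the number, here is a minimal sketch (reusing the same headers dict; the regex-based extraction is my own addition, not from the original reply) of keeping version_main in sync with the user-agent:

import re
import undetected_chromedriver.v2 as uc

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}

# pull the major Chrome version (97 here) out of the user-agent string
chrome_major = int(re.search(r'Chrome/(\d+)\.', headers['user-agent']).group(1))

# pin undetected_chromedriver to the same major version as the declared user-agent
driver = uc.Chrome(version_main=chrome_major)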
4. Modify main.py; after the change it looks like this:
from multiprocessing import freeze_support

if __name__ == '__main__':  # add this guard around the entry point
    freeze_support()  # the imported helper must actually be called; needed on Windows when child processes are spawned
    from scrapy.cmdline import execute
    import sys
    import os
    # put the project root on sys.path so Scrapy can locate the settings module
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(["scrapy", "crawl", "cnblogs"])
After making these changes, the browser opened successfully for me. Hope this helps!
The modified cnblogs spider code is as follows:
import scrapy


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']
    start_urls = ['http://news.cnblogs.com/']
    custom_settings = {
        "COOKIES_ENABLED": True
    }

    def start_requests(self):
        # entry point: log in manually through the browser to obtain cookies
        import undetected_chromedriver.v2 as uc
        driver = uc.Chrome(version_main=97)
        driver.get("https://account.cnblogs.com/signin")
        input("Press Enter to continue after logging in: ")
        cookies = driver.get_cookies()
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie['name']] = cookie['value']
        for url in self.start_urls:
            # hand the cookies over to Scrapy
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
            }
            yield scrapy.Request(url, cookies=cookie_dict, headers=headers, dont_filter=True)

    def parse(self, response):
        # three ways of extracting news links: a single entry by id, an xpath over the list, and the equivalent css selector
        url = response.xpath('//*[@id="entry_712436"]/div[2]/h2/a/@href').extract_first("not found")
        url2 = response.xpath('//div[@id="news_list"]//h2[@class="news_entry"]/a/@href').extract()
        url3 = response.css('#news_list .news_entry a::attr(href)').extract()
        print(url)
        print(url2)
        print(url3)
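If you want to go one step further and actually follow the extracted links instead of only printing them, here is a possible sketch of parse (the parse_detail callback is hypothetical and not part of the original code):

    def parse(self, response):
        post_urls = response.css('#news_list .news_entry a::attr(href)').extract()
        for post_url in post_urls:
            # the hrefs are relative (e.g. /n/712436/), so join them against the response URL
            yield scrapy.Request(response.urljoin(post_url), callback=self.parse_detail)

    def parse_detail(self, response):
        # hypothetical callback: extract whatever fields you need from the article page here
        print(response.url)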