I have already obtained the cookies, and the console keeps crawling data, but the program never runs the parse_job function. What could be the reason?
2019-01-04 15:59:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Flvshi%2F3%2F&t=1546588794&_ti=1> from <GET https://www.lagou.com/zhaopin/lvshi/3/>
2019-01-04 15:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fshuiwu%2F3%2F&t=1546588794&_ti=1> (referer: https://www.lagou.com/zhaopin/shuiwu/)
2019-01-04 15:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Ffengkong%2F2%2F&t=1546588794&_ti=1> (referer: https://www.lagou.com/zhaopin/fengkong/)
2019-01-04 15:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fshuiwu%2F2%2F&t=1546588794&_ti=1> (referer: https://www.lagou.com/zhaopin/shuiwu/)
2019-01-04 15:59:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Flvshi%2F2%2F&t=1546588794&_ti=1> from <GET https://www.lagou.com/zhaopin/lvshi/2/>
2019-01-04 15:59:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fzhuanli%2F10%2F&t=1546588794&_ti=1> from <GET https://www.lagou.com/zhaopin/zhuanli/10/>
This is part of the output.
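For what it's worth, the repeated 302 redirects to utrack/trackMid.html in the log look like Lagou's anti-crawler check firing on the list pages. A common mitigation (an assumption on my part, not something shown in the post) is to send a browser-like User-Agent via `custom_settings` and keep cookies enabled, for example:

```python
from scrapy.spiders import CrawlSpider

class LagouSpider(CrawlSpider):
    name = 'lagou'
    # Assumed tweak, not from the original post: a browser-like UA string
    # plus cookies enabled, so follow-up requests look like the logged-in browser.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/71.0.3578.98 Safari/537.36'),
        'COOKIES_ENABLED': True,  # keep the login cookies across requests
    }
```

This is only a config sketch; whether it avoids the trackMid redirect would need to be verified against the live site.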
# -*- coding: utf-8 -*-
import time
import os
import pickle
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from selenium import webdriver
from article_spider.settings import BASE_DIR


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']
    rules = (
        Rule(LinkExtractor(allow=r'gongsi/j\d+.html'), follow=True),
        Rule(LinkExtractor(allow=r'zhaopin/.*'), follow=True),
        Rule(LinkExtractor(allow=r'jobs/\+d.html'), callback='parse_job', follow=True),
    )

    def start_requests(self):
        cookies = {}
        if os.path.exists(BASE_DIR + '/cookies/lagou.cookie'):
            cookies = pickle.load(open(BASE_DIR + '/cookies/lagou.cookie', 'rb'))
        if not cookies:
            browser = webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver')
            browser.get('https://passport.lagou.com/login/login.html')
            browser.find_element_by_xpath("//div[@data-view='passwordLogin']//div[@data-propertyname='username']/input"). \
                send_keys('15091891365')
            browser.find_element_by_xpath("//div[@data-view='passwordLogin']//div[@data-propertyname='password']/input"). \
                send_keys('4419565')
            browser.find_element_by_xpath("//div[@data-view='passwordLogin']//div[@data-propertyname='submit']/input"). \
                click()
            time.sleep(10)
            cookies = browser.get_cookies()
            pickle.dump(cookies, open(BASE_DIR + '/cookies/lagou.cookie', 'wb'))
        cookies_dict = {}  # convert the cookies into a {name: value} dict so they can be passed on to Request
        for cookie in cookies:
            cookies_dict[cookie['name']] = cookie['value']
        for url in self.start_urls:
            yield Request(url, dont_filter=True, cookies=cookies_dict)

    def parse_job(self, response):
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
This is my code. Could the instructor please point out what is wrong? Thanks!
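One thing worth double-checking (my reading of the code, not confirmed against the site): the third Rule's pattern `r'jobs/\+d.html'` escapes the `+` instead of the `d`, so it only matches URLs containing the literal text `jobs/+d`. Real job-detail URLs never match that, so the Rule that carries the `parse_job` callback never fires; presumably `r'jobs/\d+\.html'` was intended. A quick standalone check (the sample URL is made up for illustration):

```python
import re

# Pattern as written in the spider: \+ matches a literal '+', then a literal 'd'.
buggy = re.compile(r'jobs/\+d.html')
# Presumed intent: one or more digits followed by a literal '.html'.
fixed = re.compile(r'jobs/\d+\.html')

url = 'https://www.lagou.com/jobs/4804175.html'  # hypothetical job-detail URL
print(buggy.search(url))                # None -> the parse_job Rule never fires
print(fixed.search(url) is not None)    # True
```

With the corrected pattern the LinkExtractor would start queuing the job-detail pages, though the trackMid redirects in the log may still need to be dealt with separately.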