Only crawling, never parsing

I have already obtained the cookies, and the console keeps logging crawled pages, but the program never runs the parse_job function. What could be the reason?

2019-01-04 15:59:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Flvshi%2F3%2F&t=1546588794&_ti=1> from <GET https://www.lagou.com/zhaopin/lvshi/3/>
2019-01-04 15:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fshuiwu%2F3%2F&t=1546588794&_ti=1> (referer: https://www.lagou.com/zhaopin/shuiwu/)
2019-01-04 15:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Ffengkong%2F2%2F&t=1546588794&_ti=1> (referer: https://www.lagou.com/zhaopin/fengkong/)
2019-01-04 15:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fshuiwu%2F2%2F&t=1546588794&_ti=1> (referer: https://www.lagou.com/zhaopin/shuiwu/)
2019-01-04 15:59:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Flvshi%2F2%2F&t=1546588794&_ti=1> from <GET https://www.lagou.com/zhaopin/lvshi/2/>
2019-01-04 15:59:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fzhaopin%2Fzhuanli%2F10%2F&t=1546588794&_ti=1> from <GET https://www.lagou.com/zhaopin/zhuanli/10/>

This is part of the output. Note that every listing page gets 302-redirected to utrack/trackMid.html, with the original URL carried, percent-encoded, in the f query parameter.

# -*- coding: utf-8 -*-
import os
import pickle
import time

from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from selenium import webdriver

from article_spider.settings import BASE_DIR


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    rules = (
        # Company pages and job listing pages: follow only, no callback.
        Rule(LinkExtractor(allow=r'gongsi/j\d+\.html'), follow=True),
        Rule(LinkExtractor(allow=r'zhaopin/.*'), follow=True),
        # Job detail pages, e.g. /jobs/4776950.html: parse with parse_job.
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'), callback='parse_job', follow=True),
    )

    def start_requests(self):
        # Reuse cookies cached from a previous Selenium login if available.
        cookies = {}
        if os.path.exists(BASE_DIR + '/cookies/lagou.cookie'):
            with open(BASE_DIR + '/cookies/lagou.cookie', 'rb') as f:
                cookies = pickle.load(f)
        if not cookies:
            # No cached cookies: log in through a real browser and save them.
            browser = webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver')
            browser.get('https://passport.lagou.com/login/login.html')
            browser.find_element_by_xpath("//div[@data-view='passwordLogin']//div[@data-propertyname='username']/input"). \
                send_keys('15091891365')
            browser.find_element_by_xpath("//div[@data-view='passwordLogin']//div[@data-propertyname='password']/input"). \
                send_keys('4419565')
            browser.find_element_by_xpath("//div[@data-view='passwordLogin']//div[@data-propertyname='submit']/input"). \
                click()
            time.sleep(10)  # wait for the login (and any verification step) to complete
            cookies = browser.get_cookies()
            with open(BASE_DIR + '/cookies/lagou.cookie', 'wb') as f:
                pickle.dump(cookies, f)
        # Convert the cookies into a {name: value} dict so they can be passed to Request.
        cookies_dict = {}
        for cookie in cookies:
            cookies_dict[cookie['name']] = cookie['value']
        for url in self.start_urls:
            yield Request(url, dont_filter=True, cookies=cookies_dict)

    def parse_job(self, response):
        # Template stub: extraction rules still to be filled in.
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

This is my code. Please take a look, teacher. Thank you!
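A quick, illustrative check of the job Rule above: the allow pattern can be tested directly against a sample job URL (this URL shape is quoted later in the thread). A subtle typo such as \+d instead of \d+ makes the rule match nothing, in which case parse_job is never invoked:

import re

job_url = 'https://www.lagou.com/jobs/4776950.html'  # URL shape quoted later in this thread

assert re.search(r'jobs/\d+\.html', job_url)          # intended pattern: matches
assert re.search(r'jobs/\+d.html', job_url) is None   # a \+d typo matches a literal '+', so it never fires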


3 Answers

Add me on QQ: 442421039

  • blowwind (asker) #1
    Teacher, please accept my friend request. Thank you!
    2019-01-07 17:29:47
  • blowwind (asker) #2
    Thank you very much!
    2019-01-09 20:29:53
何杨233 2019-07-05 18:58:25

I also log in with cookies, use a download delay of 10 plus a random delay and random user agents, but after crawling for a while it still hits the 302 redirect. At this point I'm quite confused about how Lagou actually detects crawlers.

  • bobby #1
    You could try adding an IP proxy on top of that (see the middleware sketch below).
    2019-07-06 14:07:55
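For illustration only, a minimal random-proxy downloader middleware, which is the usual way to wire a proxy into Scrapy. The proxy addresses and the middlewares module path are placeholders based on this project's layout (article_spider), not something from the thread:

import random

# Placeholder proxy pool -- substitute a real proxy service.
PROXIES = [
    'http://127.0.0.1:8888',
    'http://127.0.0.1:8889',
]


class RandomProxyMiddleware:
    """Attach a randomly chosen proxy to every outgoing request."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta['proxy'] = random.choice(PROXIES)


# Enable it in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {
#     'article_spider.middlewares.RandomProxyMiddleware': 543,
# }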
慕码人5330596 2019-01-06 21:57:31

I ran into this problem too: every crawled URL gets 302-redirected to a URL like this one: https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F4776950.html&t=1546782084&_ti=1

  • blowwind (asker) #1
    Lagou's strategy has changed again: for logged-in users, all job pages are now redirected to one place for unified collection and processing, so crawling while logged in means redefining the Rule set (see the sketch at the end of this thread). For now, though, crawling works again without logging in.
    2019-01-09 20:33:42
  • blowwind (asker) #2
    Found the cause: setting COOKIES_ENABLED = True in settings.py fixes it.
    2019-02-15 17:28:15
  • slairmy replying to blowwind (asker) #3
    May I ask: I'm seeing the same redirect when crawling now. After logging in I save the cookies to a file and crawl with them, but after a few pages the same redirect comes back. And with an IP proxy configured, the redirect happens from the very first request. What could the problem be?
    2019-02-23 17:38:36
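For reference, the fix from reply #2 as it would appear in settings.py (COOKIES_ENABLED is the actual Scrapy setting name; it defaults to True, so this only matters if it had been switched off):

COOKIES_ENABLED = True  # keep and resend the session cookies passed in start_requests

And a sketch of the rule rework hinted at in reply #1. It assumes, based only on the redirect URLs quoted in this thread, that the trackMid interstitial carries the real target in its f query parameter; unwrapping that in a downloader middleware is one way to keep the existing rules usable. This is an illustration, not a confirmed fix:

from urllib.parse import parse_qs, urlparse


class TrackMidUnwrapMiddleware:
    """Sketch: when a response is a 302 pointing at utrack/trackMid.html,
    re-request the original URL carried in the `f` query parameter."""

    def process_response(self, request, response, spider):
        if response.status == 302:
            location = response.headers.get(b'Location', b'').decode('utf-8')
            if 'utrack/trackMid.html' in location:
                # parse_qs already percent-decodes the value of `f`.
                target = parse_qs(urlparse(location).query).get('f', [''])[0]
                # Retry each URL at most once to avoid a redirect loop.
                if target and not request.meta.get('trackmid_retried'):
                    return request.replace(
                        url=target,
                        meta={**request.meta, 'trackmid_retried': True},
                        dont_filter=True,
                    )
        return response


# Register it closer to the downloader than the built-in RedirectMiddleware
# (priority 600) so it sees the raw 302 first, e.g. in settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     'article_spider.middlewares.TrackMidUnwrapMiddleware': 650,
# }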