
Using the dont_filter parameter of scrapy.http.Request

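While crawling Zhihu user profiles, I wrote the spider shown at the very bottom of this post.
The approach:
Log in to Zhihu and obtain the cookies.
Start from Kai-Fu Lee's (李开复) profile page, extract the current user's information, then extract that user's followers.
Then, for each follower extracted so far, extract that user's information and that user's followers, and keep looping like this.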

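While debugging, however, I found that the function that extracts user data was entered only once, for the first user (Kai-Fu Lee), and never again for any later user. I suspect I have seriously misunderstood how callback works.
The breakpoints below mark the yield logic in the code (the parse_detail callback is never invoked):
[image]
Appendix: the JSON structure of Kai-Fu Lee's followers: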
https://www.zhihu.com/api/v4/members/kaifulee/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20
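
For readers who cannot open that endpoint, here is a minimal sketch of how a response of that shape could be turned into follower profile URLs plus the offset of the next page. The sample payload is invented for illustration; only the keys paging.is_end, data[].type and data[].url_token are taken from the spider code in this post, and the helper name extract_follower_links is hypothetical.

```python
import json

# Hypothetical sample payload; only the keys the spider reads are included.
SAMPLE = json.dumps({
    "paging": {"is_end": False},
    "data": [
        {"type": "people", "url_token": "user-a"},
        {"type": "people", "url_token": "user-b"},
        {"type": "people", "url_token": "user-a"},  # duplicate entry
    ],
})


def extract_follower_links(text, offset, page_size=20):
    """Return (sorted profile URLs, next offset or None) for one API page."""
    payload = json.loads(text)
    # A set comprehension removes duplicate followers within the page.
    links = {
        "https://www.zhihu.com/{}/{}/followers".format(item["type"], item["url_token"])
        for item in payload["data"]
    }
    # json.loads turns JSON false into Python False, so test truthiness
    # instead of comparing with the string "false".
    is_end = payload["paging"]["is_end"]
    next_offset = None if is_end else offset + page_size
    return sorted(links), next_offset


links, next_offset = extract_follower_links(SAMPLE, offset=0)
print(links)
print(next_offset)
```

Note that the offset has to be an integer before page_size is added to it; a value captured from a URL with a regular expression is a string and needs int() first.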

import time
import os
import pickle
from urllib import parse
import re
import json


import scrapy
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

from jobbole_zhihu_lagou.settings import project_dir


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['zhihu.com']
    start_urls = ['https://www.zhihu.com/people/kaifulee/followers']

    load_follower_url_format = "https://www.zhihu.com/api/v4/members/{user_id}/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset={num}&limit=20"

    custom_settings = {
        "COOKIES_ENABLED": True,
        "COOKIES_DEBUG": True,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
        },
        "DOWNLOAD_DELAY": 3,
    }

    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    def start_requests(self):
        option = webdriver.ChromeOptions()
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        browser = webdriver.Chrome(executable_path='I:\\谷歌下载\\chromedriver_win32\\chromedriver.exe', options=option)

        browser.get("https://www.zhihu.com/signin")
        time.sleep(5)

        browser.find_element_by_xpath('//input[@name="username"]').send_keys(Keys.CONTROL, "a")
        time.sleep(1)
        browser.find_element_by_xpath('//input[@name="username"]').send_keys("YOUR_USERNAME")  # placeholder
        browser.find_element_by_xpath('//input[@name="password"]').send_keys(Keys.CONTROL, "a")
        time.sleep(1)
        browser.find_element_by_xpath('//input[@name="password"]').send_keys("YOUR_PASSWORD")  # placeholder
        time.sleep(1)
        browser.find_element_by_xpath('//button[@type="submit"]').click()
        time.sleep(5)

        # Get the cookies and save them
        cookies = browser.get_cookies()

        with open(os.path.join(os.path.dirname(project_dir), 'cookies\\zhihu.cookie'), "wb") as f:
            pickle.dump(cookies, f)
  
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie["name"]] = cookie["value"]
        return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]

    def parse(self, response):
        # For the current user: extract the profile, then extract the followers
        m = re.match("https://www.zhihu.com/(.*?)/(.*?)/followers.*", response.url)
        if m:
            # Extract the profile
            user_id = m.group(2)
            yield Request(url=response.url, meta={"user_id": user_id}, headers=self.headers, callback=self.parse_detail)

            # Extract the followers link
            follow = response.xpath('//strong[@class="NumberBoard-itemValue"]/text()').extract()
            followers_count = follow[1]
            if followers_count != '0':
                followers_json_url = self.load_follower_url_format.format(user_id=user_id, num=0)
                yield Request(url=followers_json_url, headers=self.headers, callback=self.parse_follower_link)
        pass

    def parse_follower_link(self, response):
        # Raw string so that \d and \? reach the regex engine unchanged
        m = re.match(r".*members/(.*?)/followers\?.*offset=(\d+)&limit=20", response.url)
        user_id = m.group(1)
        current_num = int(m.group(2))  # the captured offset is a string

        followers_json = json.loads(response.text)
        is_end = followers_json["paging"]["is_end"]
        followers_link = set()
        for data in followers_json["data"]:
            user_type = data["type"]
            url_token = data["url_token"]
            url = 'https://www.zhihu.com/' + user_type + '/' + url_token + '/followers'
            followers_link.add(url)
        for link in followers_link:
            yield Request(url=link, headers=self.headers, callback=self.parse)

        # json.loads yields a bool here, so test truthiness instead of comparing with "false"
        if not is_end:
            current_num += 20
            yield Request(url=self.load_follower_url_format.format(user_id=user_id, num=current_num),
                          headers=self.headers, callback=self.parse_follower_link)

    def parse_detail(self, response):
        user_id = response.meta.get("user_id")
        # Extract the user's information from the page's embedded JSON
        txt = response.xpath('//script[@id="js-initialData"]/text()').extract_first()
        init_json = json.loads(txt)
        entities = init_json["initialState"]["entities"]
        users_data = entities["users"][user_id]
        name = users_data["name"]
        avatarUrl = users_data["avatarUrl"]
        url = users_data["url"]
        isActive = users_data["isActive"]
        description = users_data["description"]
        gender = users_data["gender"]
        followerCount = users_data["followerCount"]
        followingCount = users_data["followingCount"]
        answerCount = users_data["answerCount"]
        questionCount = users_data["questionCount"]
        articlesCount = users_data["articlesCount"]
        favoriteCount = users_data["favoriteCount"]
        favoritedCount = users_data["favoritedCount"]
        # Build an item and yield it to the pipeline (not implemented yet)
        yield
        pass

Question: why does the code above never enter the parse_detail function?

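P.S. After spending half an hour carefully writing up this question, I figured it out at the last moment.
The cause: the code above has one kind of URL that has to be requested twice. For a URL that must be requested more than once, every Request except the last one should set dont_filter=True. So changing the single line below made the code work.
[image]
Since it took some effort to write up, I decided to post it anyway for everyone to see.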
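
The behaviour described in this P.S. can be reproduced without Scrapy. The sketch below is a deliberately simplified model of a scheduler with a duplicate filter: it records each URL it accepts and drops repeats, while a request marked dont_filter skips both the check and the recording. ToyRequest, ToyScheduler and run are hypothetical names, and Scrapy's real filter compares request fingerprints rather than raw URL strings, so treat this as an illustration of the idea rather than Scrapy's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    url: str
    callback: str
    dont_filter: bool = False


class ToyScheduler:
    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, request):
        """Queue the request; return False if the duplicate filter drops it."""
        if not request.dont_filter:
            if request.url in self.seen:
                return False
            self.seen.add(request.url)
        # Requests marked dont_filter are neither checked nor recorded.
        self.queue.append(request)
        return True


def run(requests):
    """Feed the requests to a fresh scheduler and report which were accepted."""
    sched = ToyScheduler()
    return [sched.enqueue(r) for r in requests]


url = "https://www.zhihu.com/people/some-user/followers"

# Same URL requested twice, neither marked: the second request is dropped.
unmarked = run([ToyRequest(url, "parse"), ToyRequest(url, "parse_detail")])
# Earlier request marked: it is never recorded, so the later one passes.
first_marked = run([ToyRequest(url, "parse", dont_filter=True),
                    ToyRequest(url, "parse_detail")])
# Later request marked: it bypasses the check.
second_marked = run([ToyRequest(url, "parse"),
                     ToyRequest(url, "parse_detail", dont_filter=True)])
print(unmarked, first_marked, second_marked)
```

Under this model, the second case would explain why parse_detail ran for the first user only: the request returned from start_requests already carries dont_filter=True. Either marking works for a single profile, but marking the earlier request also switches off deduplication for that URL, which matters if the same profile can be reached again through another follower list.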


1 answer

bobby 2019-06-03 11:32:35

Well done! Working through the analysis yourself is when you grow the most.
