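While scraping Zhihu user profiles, I wrote the spider below (full code at the bottom).
The idea is:
Log in to Zhihu and grab the cookies.
Start from Kai-Fu Lee's (kaifulee) profile page, extract the current user's information, then extract that user's followers.
Then, for each follower collected so far, extract that user's information and then their followers, and so on in a loop.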
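As a rough illustration of the loop described above, here is a minimal sketch of the traversal over a small in-memory follower graph. The graph, every user name other than kaifulee, and the function name crawl_order are made up for illustration; the real spider gets followers from Zhihu's API rather than from a dict.

```python
from collections import deque

# Hypothetical follower graph: user -> list of followers (not real data).
FOLLOWERS = {
    "kaifulee": ["alice", "bob"],
    "alice": ["bob", "carol"],
    "bob": ["kaifulee"],
    "carol": [],
}


def crawl_order(start, followers):
    """Visit each user once, breadth-first, by following follower links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)  # stands in for "extract the current user's information"
        for follower in followers.get(user, []):
            if follower not in seen:  # skip users that are already scheduled
                seen.add(follower)
                queue.append(follower)
    return order


print(crawl_order("kaifulee", FOLLOWERS))
```

The seen set plays the same role that a crawler's duplicate filter plays: without it, the cycle between bob and kaifulee would never terminate.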
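But while debugging, I found that the function that extracts user data (parse_detail) was entered only once, for the very first user (Kai-Fu Lee), and never again for any other user. I suspect I may have seriously misunderstood how callback works.
The breakpoints below mark the yield logic in the code (the parse_detail callback is never invoked):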
Appendix: the JSON structure of Kai-Fu Lee's followers:
https://www.zhihu.com/api/v4/members/kaifulee/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20
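To make the shape of that response concrete, here is a small sketch of how one page of it could be turned into follower page links and the next offset. The sample payload is a hypothetical, trimmed-down stand-in that contains only the fields the spider reads (paging.is_end, data[].type, data[].url_token); the helper name follower_pages and the user tokens are assumptions.

```python
import json

# Hypothetical, trimmed-down payload; only the fields used by the spider appear.
SAMPLE = """
{
  "paging": {"is_end": false},
  "data": [
    {"type": "people", "url_token": "alice"},
    {"type": "people", "url_token": "bob"}
  ]
}
"""


def follower_pages(payload_text, offset, page_size=20):
    """Return (follower page links, next offset or None) for one API page."""
    payload = json.loads(payload_text)
    links = [
        "https://www.zhihu.com/{}/{}/followers".format(d["type"], d["url_token"])
        for d in payload["data"]
    ]
    # json.loads maps JSON true/false to Python True/False, not to strings,
    # so is_end must be tested as a boolean.
    next_offset = None if payload["paging"]["is_end"] else offset + page_size
    return links, next_offset


links, next_offset = follower_pages(SAMPLE, offset=0)
print(links)
print(next_offset)
```

Keeping the offset as an integer avoids the string arithmetic problem that appears when the offset is taken straight from a regex group.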
import time
import os
import pickle
from urllib import parse
import re
import json
import scrapy
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from jobbole_zhihu_lagou.settings import project_dir
class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['zhihu.com']
    start_urls = ['https://www.zhihu.com/people/kaifulee/followers']
    load_follower_url_format = "https://www.zhihu.com/api/v4/members/{user_id}/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset={num}&limit=20"
    custom_settings = {
        "COOKIES_ENABLED": True,
        "COOKIES_DEBUG": True,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 2,
        },
        "DOWNLOAD_DELAY": 3,
    }
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }
    def start_requests(self):
        # Log in through a real browser so that Zhihu issues session cookies.
        option = webdriver.ChromeOptions()
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        browser = webdriver.Chrome(executable_path='I:\\谷歌下载\\chromedriver_win32\\chromedriver.exe', options=option)
        browser.get("https://www.zhihu.com/signin")
        time.sleep(5)
        # Select any prefilled text first so that typing replaces it.
        browser.find_element_by_xpath('//input[@name="username"]').send_keys(Keys.CONTROL, "a")
        time.sleep(1)
        browser.find_element_by_xpath('//input[@name="username"]').send_keys("YOUR_PHONE_NUMBER")
        browser.find_element_by_xpath('//input[@name="password"]').send_keys(Keys.CONTROL, "a")
        time.sleep(1)
        browser.find_element_by_xpath('//input[@name="password"]').send_keys("YOUR_PASSWORD")
        time.sleep(1)
        browser.find_element_by_xpath('//button[@type="submit"]').click()
        time.sleep(5)
        # Get the cookies and save them
        cookies = browser.get_cookies()
        cookie_path = os.path.join(os.path.dirname(project_dir), 'cookies\\zhihu.cookie')
        with open(cookie_path, "wb") as f:
            pickle.dump(cookies, f)
        cookie_dict = {}
        for cookie in cookies:
            cookie_dict[cookie["name"]] = cookie["value"]
        return [scrapy.Request(url=self.start_urls[0], dont_filter=True, cookies=cookie_dict)]
    def parse(self, response):
        # For the current user: extract the profile data, then the followers
        m = re.match(r"https://www\.zhihu\.com/(.*?)/(.*?)/followers.*", response.url)
        if m:
            # Extract the profile data
            user_id = m.group(2)
            yield Request(url=response.url, meta={"user_id": user_id}, headers=self.headers, callback=self.parse_detail)
            # Extract the followers link
            follow = response.xpath('//strong[@class="NumberBoard-itemValue"]/text()').extract()
            followers_count = follow[1]
            if followers_count != '0':
                followers_json_url = self.load_follower_url_format.format(user_id=user_id, num=0)
                yield Request(url=followers_json_url, headers=self.headers, callback=self.parse_follower_link)
    def parse_follower_link(self, response):
        m = re.match(r".*members/(.*?)/followers\?.*offset=(\d+)&limit=20", response.url)
        user_id = m.group(1)
        # The regex group is a string; convert it so the offset can be incremented.
        current_num = int(m.group(2))
        followers_json = json.loads(response.text)
        # json.loads returns a Python bool here, not the string "false".
        is_end = followers_json["paging"]["is_end"]
        followers_link = set()
        for data in followers_json["data"]:
            user_type = data["type"]
            url_token = data["url_token"]
            url = 'https://www.zhihu.com/' + user_type + '/' + url_token + '/followers'
            followers_link.add(url)
        for link in followers_link:
            yield Request(url=link, headers=self.headers, callback=self.parse)
        if not is_end:
            # Request the next page of followers
            current_num += 20
            yield Request(url=self.load_follower_url_format.format(user_id=user_id, num=current_num),
                          headers=self.headers, callback=self.parse_follower_link)
    def parse_detail(self, response):
        user_id = response.meta.get("user_id")
        # Extract the user information
        txt = response.xpath('//script[@id="js-initialData"]/text()').extract_first()
        init_json = json.loads(txt)
        entities = init_json["initialState"]["entities"]
        users_data = entities["users"][user_id]
        name = users_data["name"]
        avatarUrl = users_data["avatarUrl"]
        url = users_data["url"]
        isActive = users_data["isActive"]
        description = users_data["description"]
        gender = users_data["gender"]
        followerCount = users_data["followerCount"]
        followingCount = users_data["followingCount"]
        answerCount = users_data["answerCount"]
        questionCount = users_data["questionCount"]
        articlesCount = users_data["articlesCount"]
        favoriteCount = users_data["favoriteCount"]
        favoritedCount = users_data["favoritedCount"]
        # Build an item and yield it to the pipeline (not implemented yet)
        yield
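The last step of parse_detail (building an item for the pipeline) is left unfinished in the code above. One possible way to do it, sketched here without Scrapy, is to copy the wanted keys into a plain dict. The sample JSON, the user id example-user, and the helper name build_item are assumptions made for illustration; only the key names come from the spider.

```python
import json

# Hypothetical, heavily trimmed stand-in for the js-initialData script text.
SAMPLE_INIT = json.dumps({
    "initialState": {"entities": {"users": {"example-user": {
        "name": "Example User",
        "gender": 1,
        "followerCount": 3,
        "answerCount": 5,
    }}}}
})

# Key names taken from the spider's parse_detail method.
FIELDS = (
    "name", "avatarUrl", "url", "isActive", "description", "gender",
    "followerCount", "followingCount", "answerCount", "questionCount",
    "articlesCount", "favoriteCount", "favoritedCount",
)


def build_item(init_text, user_id):
    """Collect the profile fields for user_id into a plain dict."""
    users = json.loads(init_text)["initialState"]["entities"]["users"]
    data = users[user_id]
    # .get() leaves missing keys as None instead of raising KeyError.
    item = {field: data.get(field) for field in FIELDS}
    item["user_id"] = user_id
    return item


item = build_item(SAMPLE_INIT, "example-user")
print(item["name"], item["followerCount"], item["avatarUrl"])
```

Inside the spider, a dict like this could be yielded from parse_detail, since Scrapy accepts plain dicts as items.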
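Question: why does the code above never enter the parse_detail function?
P.S. After spending half an hour carefully writing up this question, I figured it out myself at the last moment.
The cause: in the code above, one kind of URL (each user's followers page) has to be requested twice, once for parse and once for parse_detail, and Scrapy's duplicate filter silently drops a request for a URL it has already seen. For a URL that must be requested more than once, every Request except the last one should be created with dont_filter=True. So a one-line change to the code above was enough to make it work.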
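To see why this happens, here is a simplified model of request de-duplication. It is not Scrapy's real scheduler code, and the class name TinyScheduler and the URL are made up; it only mirrors the behavior as I understand it: a request with dont_filter=True skips the seen check and is not recorded, while any other request is dropped if its URL was already recorded.

```python
class TinyScheduler:
    """Simplified model of request de-duplication (not Scrapy's real code)."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, url, callback, dont_filter=False):
        if not dont_filter:
            if url in self.seen:
                return False  # dropped as a duplicate; the callback never runs
            self.seen.add(url)
        self.queue.append((url, callback))
        return True


page = "https://www.zhihu.com/people/some-user/followers"  # hypothetical user

# Original flow: both requests for the same page use the default filter.
s = TinyScheduler()
first = s.enqueue(page, "parse")
second = s.enqueue(page, "parse_detail")
print(first, second)  # True False

# Variant described in the post: every request but the last bypasses the filter.
s = TinyScheduler()
s.enqueue(page, "parse", dont_filter=True)
last_only_filtered = s.enqueue(page, "parse_detail")
print(last_only_filtered)  # True

# Alternative: let the repeated request itself bypass the filter.
s = TinyScheduler()
s.enqueue(page, "parse")
repeat_unfiltered = s.enqueue(page, "parse_detail", dont_filter=True)
print(repeat_unfiltered)  # True
```

This also suggests why the first user worked: the request in start_requests already had dont_filter=True, so the later parse_detail request for the same URL was the first one the filter saw. In the spider, the fix amounts to adding dont_filter=True to one of the Request(...) calls that target a user's followers page. Note that bypassing the filter on the requests that lead to parse means the same user page can be scheduled again each time that user appears in someone's follower list.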
Since it took some effort to write up, I decided to post it anyway for everyone to see.