Hi teacher, while scraping cnblogs, after roughly 100+ items (DOWNLOAD_DELAY not set), I ran into:
2019-09-06 11:14:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://account.cnblogs.com/signin?returnUrl=https%3A%2F%2Fnews.cnblogs.com%2Fn%2F630237%2F> from <GET https://passport.cnblogs.com/user/signin?ReturnUrl=https%3A%2F%2Fnews.cnblogs.com%2Fn%2F630237%2F>
I set a random user-agent (fake-useragent) and maintain a pool of 50 paid proxy IPs, so every request goes out from a random IP. How is the site still identifying me as a crawler and redirecting me to the sign-in page? If I don't want to continue crawling by logging in, how should I solve this? Or how can I go about figuring out this site's anti-scraping mechanism? My setup is sketched below for reference.
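This is a minimal sketch of the middleware setup I'm describing, assuming a downloader middleware in myproject/middlewares.py (hypothetical module path); fake-useragent supplies the random User-Agent and PROXY_POOL is a placeholder standing in for the 50 paid proxies:

```python
import random

from fake_useragent import UserAgent

# Hypothetical placeholder for the 50 paid proxies
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


class RandomUAProxyMiddleware:
    """Attach a random User-Agent and a random proxy to every outgoing request."""

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Pick a fresh User-Agent and proxy for each request
        request.headers["User-Agent"] = self.ua.random
        request.meta["proxy"] = random.choice(PROXY_POOL)
        return None  # let Scrapy continue handling the request
```

Enabled in settings roughly like this (the delay values are assumptions, since my current run has no DOWNLOAD_DELAY at all):

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUAProxyMiddleware": 543,
}
DOWNLOAD_DELAY = 0.5              # not set in my current run
RANDOMIZE_DOWNLOAD_DELAY = True   # adds jitter to the delay
```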