Sometimes the spider writes 8,000 records to the database before it hangs; other times it hangs after only 1,000. Below is the console log at the point where it gets stuck, and there are no other errors. What could be causing this?
......
......
sales: 677
sales: 671
2018-09-13 18:13:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.taobao.com/search?data-key=s&data-value=0&ajax=true&_ksTS=1532158365171_1326&callback=jsonp1327&q=%E6%B4%97%E9%A2%9C%E4%B8%93%E7%A7%91&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180721&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44>
None
sales: 668
2018-09-13 18:13:01 [scrapy.core.scraper] DEBUG: Scraped from <200 https://s.taobao.com/search?data-key=s&data-value=0&ajax=true&_ksTS=1532158365171_1326&callback=jsonp1327&q=%E6%B4%97%E9%A2%9C%E4%B8%93%E7%A7%91&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20180721&ie=utf8&sort=sale-desc&bcoffset=0&p4ppushleft=%2C44>
None
2018-09-13 18:13:08 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1532158365171_2206&callback=jsonp2207&q=jmsolution&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_222012208721&ie=utf8&sort=sale-desc&bcoffset=220&p4ppushleft=%2C220> (failed 1 times): User timeout caused connection failure: Getting https://s.taobao.com/search?data-key=s&data-value=220&ajax=true&_ksTS=1532158365171_2206&callback=jsonp2207&q=jmsolution&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_222012208721&ie=utf8&sort=sale-desc&bcoffset=220&p4ppushleft=%2C220 took longer than 10.0 seconds..
get ip from ip api
2018-09-13 18:13:08 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): piping.mogumiao.com:80
2018-09-13 18:13:08 [urllib3.connectionpool] DEBUG: http://piping.mogumiao.com:80 "GET /proxy/api/get_ip_al?appKey=b828f9952ec847fca9c12d48833c93ba&count=1&expiryDate=0&format=1&newLine=2 HTTP/1.1" 200 57
-------r: {"code":"0","msg":[{"port":"38156","ip":"49.87.117.74"}]}
2018-09-13 18:13:09 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 49.87.117.74:38156
The start_requests method is shown below; the for loop yields 184 start_urls in total:
def start_requests(self):
    # Pass in request headers (and cookies) to imitate a real user.
    # Without headers, the Chrome-simulated Zhihu login returns a 400 error and parse is never reached.
    for word in self.search_key_words:
        start_url = self.goods['start_urls']
        yield scrapy.Request(
            start_url.format(self.start_data, word),
            headers=self.headers,
            meta={'next_data': 0, 'counts': self.counts, 'word': word}
        )
        time.sleep(1)
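One detail worth noting about this method: time.sleep(1) runs on the same thread as the Twisted reactor, so every time the engine pulls the next start request, the whole crawler (downloads, pipelines, database writes) pauses for a second. A minimal sketch of the same loop without the blocking sleep, leaving the pacing to the DOWNLOAD_DELAY / AutoThrottle settings shown below:

def start_requests(self):
    # Same loop as above, minus the blocking time.sleep(1); Scrapy's own
    # DOWNLOAD_DELAY / AutoThrottle settings already space the requests out.
    for word in self.search_key_words:
        start_url = self.goods['start_urls']
        yield scrapy.Request(
            start_url.format(self.start_data, word),
            headers=self.headers,
            meta={'next_data': 0, 'counts': self.counts, 'word': word}
        )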
settings.py is configured as follows:
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = True  # (although the program does not actually use cookies)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_DEBUG = True
RETRY_ENABLED = True
RETRY_TIMES = 5
DOWNLOAD_TIMEOUT = 10
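A rough back-of-the-envelope reading of how these values interact (an estimate only, ignoring the extra DOWNLOAD_DELAY / AutoThrottle waits between attempts; the retry at 18:13:08 in the log above is exactly this 10-second timeout firing):

DOWNLOAD_TIMEOUT = 10                 # seconds before one download attempt is abandoned
RETRY_TIMES = 5                       # retries allowed on top of the first attempt
attempts = 1 + RETRY_TIMES
print(attempts * DOWNLOAD_TIMEOUT)    # -> 60 seconds of pure timeouts for one URL that keeps landing on dead proxies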
custom_setting.py is configured as follows:
'DOWNLOADER_MIDDLEWARES': {
    'SpiderProjects.middlewares.RandomUserAgentMiddlware': 490,
    'SpiderProjects.middlewares.RandomProxyMiddleware': 400,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 500,
},
'RANDOM_UA_TYPE': 'random',
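For context, these keys live in the spider's custom_settings attribute; a minimal sketch of how that presumably sits inside the spider class (the class name and spider name are assumptions, since they are not shown in the question):

import scrapy

class TaobaoSearchSpider(scrapy.Spider):     # hypothetical class name
    name = 'taobao_search'                   # assumption: not shown in the question
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'SpiderProjects.middlewares.RandomUserAgentMiddlware': 490,
            'SpiderProjects.middlewares.RandomProxyMiddleware': 400,
            'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 500,
        },
        'RANDOM_UA_TYPE': 'random',
    }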
The middleware is as follows:
class RandomProxyMiddleware(object):
    # Dynamically assign a proxy IP to each request
    def process_request(self, request, spider):
        get_ip = GetIP()
        print('get ip from ip api')
        ip = get_ip.get_random_ip()
        request.meta["proxy"] = ip
        # request.meta["proxy"] = 'HTTP://114.229.139.176:35573'

    def process_exception(self, request, exception, spider):
        # On an exception (e.g. timeout), switch to a new proxy and retry the request
        print("Exception occurred, retrying with a new proxy....")
        get_ip = GetIP()
        print('get ip from ip api')
        ip = get_ip.get_random_ip()
        request.meta['proxy'] = ip
        return request
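GetIP itself is not shown in the question. Judging from the proxy-API response that appears in the log above, a minimal sketch of what get_random_ip() presumably does would be (the URL and appKey are copied from the log; everything else here is an assumption, not the project's actual code):

import requests

class GetIP(object):
    # Hypothetical reconstruction based on the log output above;
    # the real implementation used by the spider is not shown in the question.
    api_url = ('http://piping.mogumiao.com/proxy/api/get_ip_al'
               '?appKey=b828f9952ec847fca9c12d48833c93ba'
               '&count=1&expiryDate=0&format=1&newLine=2')

    def get_random_ip(self):
        resp = requests.get(self.api_url, timeout=5)
        data = resp.json()   # e.g. {"code":"0","msg":[{"port":"38156","ip":"49.87.117.74"}]}
        proxy = data['msg'][0]
        # Scrapy expects request.meta['proxy'] to be a full URL string
        return 'http://{}:{}'.format(proxy['ip'], proxy['port'])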