Teacher, when I save the answer's content to Elasticsearch, I get this error:
ValueError: too many values to unpack (expected 2)
Also, for questions whose content is empty, I added a check, but it still raises an error:
Traceback (most recent call last):
  File "/Users/yujialian/.virtualenvs/article_spider/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/yujialian/Documents/project/crawler/ArticalSpider/ArticalSpider/pipelines.py", line 122, in process_item
    item.save_to_es()
  File "/Users/yujialian/Documents/project/crawler/ArticalSpider/ArticalSpider/items.py", line 187, in save_to_es
    if self["content"]:
  File "/Users/yujialian/.virtualenvs/article_spider/lib/python3.6/site-packages/scrapy/item.py", line 59, in __getitem__
    return self._values[key]
KeyError: 'content'
Here's my code for the answer item:
def save_to_es(self):
    # convert the item into its ES document type
    answer = ZhihuAnswerType()
    answer.zhihu_id = self["zhihu_id"]
    answer.url = self["url"]
    answer.question_id = self["question_id"]
    answer.author_id = self["author_id"]
    answer.answer_excerpt = self["answer_excerpt"]
    answer.content = self["content"]
    answer.praise_num = self["praise_num"]
    answer.comments_num = self["comments_num"]
    answer.create_time = datetime.datetime.fromtimestamp(self["create_time"]).strftime(SQL_DATETIME_FORMAT)
    answer.update_time = datetime.datetime.fromtimestamp(self["update_time"]).strftime(SQL_DATETIME_FORMAT)
    answer.crawl_time = self["crawl_time"]
    answer.suggest = get_suggests(ZhihuAnswerType._doc_type.index, ((answer.answer_excerpt, 5)))
    answer.save()
    return
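A possible cause of the ValueError, sketched below: in Python, `((answer.answer_excerpt, 5))` is not a tuple containing one pair, it is just the pair itself, because the outer parentheses do nothing without a trailing comma. `get_suggests` is not shown in this post, so `iter_pairs` is a hypothetical stand-in that assumes the helper loops over `(text, weight)` pairs.

```python
# Hypothetical stand-in for the get_suggests helper, which is not shown in
# this post. Assumption: it loops over (text, weight) pairs.
def iter_pairs(info_tuple):
    pairs = []
    for text, weight in info_tuple:
        pairs.append((text, weight))
    return pairs


excerpt = "some answer excerpt"

# ((excerpt, 5)) is just (excerpt, 5): the outer parentheses add nothing.
not_nested = ((excerpt, 5))
assert not_nested == (excerpt, 5)

# Iterating it yields the string first, and unpacking a long string into
# two names raises a ValueError like the one in the post.
error_message = None
try:
    iter_pairs(not_nested)
except ValueError as exc:
    error_message = str(exc)

# A trailing comma makes a one-element tuple of pairs, which unpacks cleanly.
nested = ((excerpt, 5),)
pairs = iter_pairs(nested)
```

If that assumption about `get_suggests` holds, writing the argument as `((answer.answer_excerpt, 5),)` should avoid the error. The question code passes three pairs, which is why the same call pattern works there.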
Code for the question item:
def save_to_es(self):
    # convert the item into its ES document type
    question = ZhihuQuestionType()
    question.zhihu_id = self["zhihu_id"][0]
    question.topics = ",".join(self["topics"])
    question.url = self["url"][0]
    question.title = "".join(self["title"])
    if self["content"]:
        question.content = "".join(self["content"])
    else:
        question.content = "EMPTY"
    question.answer_num = extract_num("".join(self["answer_num"]))
    question.comments_num = extract_num("".join(self["comments_num"]))
    if len(self["watch_user_num"]) == 2:
        question.watch_user_num = int(self["watch_user_num"][0])
        question.click_num = int(self["watch_user_num"][1])
    else:
        question.watch_user_num = int(self["watch_user_num"][0])
        question.click_num = 0
    question.crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
    question.suggest = get_suggests(ZhihuQuestionType._doc_type.index, ((question.title, 10), (question.topics, 7), (question.content, 5)))
    question.save()
    return
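On the KeyError: the traceback shows that `self["content"]` itself raises before the `if` can evaluate anything, which is what happens when the field was never populated on the item. A minimal sketch of a check that tolerates a missing field follows. A plain dict stands in for the Scrapy item so the snippet runs on its own, and `content_or_default` is a hypothetical helper name, not part of the original code.

```python
# A plain dict stands in for the Scrapy item here so the sketch runs without
# Scrapy installed; both raise KeyError for a key that was never set.
item = {"title": ["Some question title"]}  # "content" was never populated


def content_or_default(item, default="EMPTY"):
    # .get() returns None instead of raising when the field is missing,
    # so the truthiness check covers both "missing" and "empty list".
    parts = item.get("content")
    if parts:
        return "".join(parts)
    return default


missing = content_or_default(item)
empty = content_or_default({"content": []})
filled = content_or_default({"content": ["<p>", "body", "</p>"]})
```

Scrapy items support the same dict-style `.get()`, so in `save_to_es` the equivalent change would be `if self.get("content"):` in place of `if self["content"]:`.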