I packaged my local code and deployed it to scrapyd. The project is a scrapy-redis distributed spider. Running it on the same server with scrapy runspider works without any problem (so the environment itself should be fine), but every job scheduled through scrapyd dies with an error within a few seconds. Neither Baidu nor Google turned up a fix for this error.
I did find the following, which may be relevant:
In fact, if you add settings for things like distributed or large-scale crawling to settings.py, but the configuration is incomplete and the follow-up steps are missing, the crawl will fail with errors.
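For context, a complete scrapy-redis setup normally needs only a few lines in settings.py. The first two assignments below match the "Overridden settings" entry in my log; REDIS_URL and SCHEDULER_PERSIST are a sketch of what the missing "follow-up steps" might look like, not a copy of my actual config:

# settings.py -- minimal scrapy-redis wiring (sketch)
# These two lines are confirmed by the "Overridden settings" log entry:
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

# Assumed extras that a complete setup would normally add:
REDIS_URL = 'redis://127.0.0.1:6379'  # where the shared queue and dupefilter live
SCHEDULER_PERSIST = True              # keep the queue between runs so workers can resume

The log does show the spider set up to read start URLs from the redis key 'csdn:start_urls', so the scrapy-redis side looks wired up; the crash below happens during engine startup, before any request is made.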
The spider's log file records the following:
2017-10-21 23:44:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: Blog)
2017-10-21 23:44:33 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'Blog', 'DOWNLOAD_DELAY': 0.5, 'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter', 'LOG_FILE': 'logs/Blog/csdn/b9b36972b67611e79c6100163e001058.log', 'NEWSPIDER_MODULE': 'Blog.spiders', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'SPIDER_MODULES': ['Blog.spiders']}
2017-10-21 23:44:33 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2017-10-21 23:44:33 [csdn] INFO: Reading start URLs from redis key 'csdn:start_urls' (batch size: 16, encoding: utf-8
2017-10-21 23:44:33 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'Blog.middlewares.RandomUserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-21 23:44:33 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-21 23:44:33 [twisted] CRITICAL: Unhandled error in Deferred:
2017-10-21 23:44:33 [twisted] CRITICAL: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/Twisted-17.9.1.dev0-py3.6-linux-x86_64.egg/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.6/site-packages/scrapy/crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 71, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/media.py", line 68, in from_crawler
    pipe = cls.from_settings(crawler.settings)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 95, in from_settings
    return cls(store_uri, settings=settings)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 52, in __init__
    download_func=download_func)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 234, in __init__
    self.store = self._get_store(store_uri)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 270, in _get_store
    return store_cls(uri)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 48, in __init__
    self._mkdir(self.basedir)
  File "/usr/local/lib/python3.6/site-packages/scrapy/pipelines/files.py", line 77, in _mkdir
    os.makedirs(dirname)
  File "/usr/local/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/local/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
NotADirectoryError: [Errno 20] Not a directory: '/tmp/Blog-1508600650-duu8m0jk.egg/Blog'
The Blog in the last line is the name of my project (nothing in my code refers to such a directory).
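Reading the traceback, the crash happens while ImagesPipeline is being constructed: scrapy/pipelines/files.py calls os.makedirs() on its storage directory, and the path it tries to create sits inside the packaged .egg, which is a file rather than a directory. My guess (an assumption, the log does not prove it) is that the store path is computed from the project's own location, e.g. via __file__, which points inside /tmp/Blog-....egg when the project runs under scrapyd. A sketch of the change I would try:

# settings.py -- sketch of the suspected cause and a possible fix.
# IMAGES_STORE is my assumption: the traceback passes through
# scrapy/pipelines/images.py, which builds its store from that setting.

# Suspected culprit (hypothetical): a store path derived from the project's
# own location. Under scrapyd the code lives inside /tmp/Blog-....egg, so
# this resolves to a path inside the egg *file* and os.makedirs() raises
# NotADirectoryError:
#
#   import os
#   IMAGES_STORE = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'images')

# Possible fix: hard-code an absolute directory on the real filesystem.
IMAGES_STORE = '/data/Blog/images'  # hypothetical path; any writable directory works

The same reasoning would apply to FILES_STORE if the files pipeline is enabled; the point is that the storage directory must be creatable on the real filesystem, not inside the egg.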
scrapyd is installed under /package/scrapyd.
Below is the scrapyd run log:
2017-10-22T15:30:38+0800 [-] Loading /usr/local/lib/python3.6/site-packages/scrapyd/txapp.py...
2017-10-22T15:30:40+0800 [-] Scrapyd web console available at http://127.0.0.1:6800/
2017-10-22T15:30:40+0800 [-] Loaded.
2017-10-22T15:30:40+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.9.1dev0 (/usr/local/bin/python3.6 3.6.2) starting up.
2017-10-22T15:30:40+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-10-22T15:30:40+0800 [-] Site starting on 6800
2017-10-22T15:30:40+0800 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7fbb7cb72898>
2017-10-22T15:30:40+0800 [Launcher] Scrapyd 1.2.0 started: max_proc=4, runner='scrapyd.runner'
2017-10-22T15:37:09+0800 [twisted.python.log#info] "127.0.0.1" - - [22/Oct/2017:07:37:04 +0000] "POST /addversion.json HTTP/1.1" 200 108 "-" "Python-urllib/3.6"
2017-10-22T15:37:25+0800 [twisted.python.log#info] "127.0.0.1" - - [22/Oct/2017:07:37:21 +0000] "POST /schedule.json HTTP/1.1" 200 95 "-" "curl/7.47.0"
2017-10-22T15:37:25+0800 [-] Process started: project='Blog' spider='csdn' job='d8f46e28b6fb11e7ba4d525400db770c' pid=32734 log='logs/Blog/csdn/d8f46e28b6fb11e7ba4d525400db770c.log' items=None
2017-10-22T15:37:29+0800 [Launcher,32734/stderr] Unhandled error in Deferred:
2017-10-22T15:37:29+0800 [-] Process finished: project='Blog' spider='csdn' job='d8f46e28b6fb11e7ba4d525400db770c' pid=32734 log='logs/Blog/csdn/d8f46e28b6fb11e7ba4d525400db770c.log' items=None
^C2017-10-22T16:00:59+0800 [-] Received SIGINT, shutting down.
2017-10-22T16:00:59+0800 [-] (TCP Port 6800 Closed)
2017-10-22T16:00:59+0800 [twisted.web.server.Site#info] Stopping factory <twisted.web.server.Site object at 0x7fbb7cb72898>
2017-10-22T16:00:59+0800 [-] Main loop terminated.
2017-10-22T16:00:59+0800 [twisted.scripts._twistd_unix.UnixAppLogger#info] Server Shut Down.