I wrote a program that crawls a single website and scrapes certain data from it. I would like to speed up its execution by using ProcessPoolExecutor. However, I am having trouble understanding how to convert the single-threaded version into a concurrent one.
Specifically, when creating a job (via ProcessPoolExecutor.submit()), can I pass a class/object plus arguments instead of a function plus arguments?
And, if so, how do I get the data back from those jobs, both into a queue that tracks visited pages and into a structure that holds the scraped content?
I have been using this as a jumping-off point, and I have reviewed the Queue and concurrent.futures documentation (the latter, frankly, goes a bit over my head). I have also Googled/YouTubed/searched Stack Overflow quite a bit, to no avail.
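For what it's worth, here is the kind of minimal experiment I have in mind (the Widget class and its doubling logic are made-up placeholders, not part of my real code). As far as I can tell a class is just another callable, so the worker process should build the instance and send it back as the future's result, but I do not know whether that idea holds up for my actual Scraper:

# Minimal sketch (hypothetical names): submit a class as the callable and
# read the constructed instance back from future.result().
from concurrent.futures import ProcessPoolExecutor

class Widget:
    def __init__(self, value):
        # work that would normally happen inside the worker process
        self.value = value * 2

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        future = pool.submit(Widget, 21)  # the class itself is the callable
        widget = future.result()          # the instance sent back from the worker
        print(widget.value)               # prints 42

My current attempt is below: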
from queue import Queue, Empty
from concurrent.futures import ProcessPoolExecutor


class Scraper:
"""
Scrapes a single url
"""
def __init__(self, url):
self.url = url # url of page to scrape
self.internal_urls = None
self.content = None
self.scrape()

    def scrape(self):
"""
Method(s) to request a page, scrape links from that page
to other pages, and finally scrape actual content from the current page
"""
# assume that code in this method would yield urls linked in current page
self.internal_urls = set(scraped_urls)
# and that code in this method would scrape a bit of actual content
self.content = {'content1': content1, 'content2': content2, 'etc': etc}


class CrawlManager:
"""
Manages a multiprocess crawl and scrape of a single site
"""
def __init__(self, seed_url):
self.seed_url = seed_url
self.pool = ProcessPoolExecutor(max_workers=10)
        self.processed_urls = set()
self.queued_urls = Queue()
self.queued_urls.put(self.seed_url)
self.data = {}

    def crawl(self):
while True:
try:
# get a url from the queue
target_url = self.queued_urls.get(timeout=60)
# check that the url hasn't already been processed
if target_url not in self.processed_urls:
# add url to the processed list
self.processed_urls.add(target_url)
print(f'Processing url {target_url}')
# passing an object to the
# ProcessPoolExecutor... can this be done?
                    job = self.pool.submit(Scraper, target_url)
            except Empty:
                # queue stayed empty for 60 seconds: assume the crawl is done
                print('Queue is empty; stopping crawl.')
                break
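And to show what I mean by getting the data back out, here is a tiny self-contained toy of the callback pattern I think I am reaching for (fake_scrape and the example.com URL are made up purely for illustration). I do not know whether Future.add_done_callback is the right way to feed each job's results back into queued_urls and self.data, or whether there is a better pattern for this:

# Toy illustration (hypothetical fake_scrape): use Future.add_done_callback
# to push each job's results into a queue and a dict in the parent process.
from queue import Queue
from concurrent.futures import ProcessPoolExecutor

def fake_scrape(url):
    # stand-in for Scraper: pretend every page links to exactly one more page
    return {'url': url, 'links': [url + '/next'], 'content': 'content of ' + url}

if __name__ == '__main__':
    queued_urls = Queue()
    data = {}

    def scrape_finished(future):
        # called in the parent process once the worker finishes
        result = future.result()
        for link in result['links']:
            queued_urls.put(link)                 # feed new links back into the queue
        data[result['url']] = result['content']  # keep the scraped content

    with ProcessPoolExecutor(max_workers=2) as pool:
        job = pool.submit(fake_scrape, 'https://example.com')
        job.add_done_callback(scrape_finished)

    print(queued_urls.get())  # -> https://example.com/next
    print(data)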