Combining aiohttp with multiprocessing

I am writing a script that fetches the HTML of nearly 20,000 pages and parses each one to extract just a portion of it.

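I managed to get the content of the 20,000 pages into a dataframe with asynchronous requests, using asyncio and aiohttp, but the script still waits for all the pages to be fetched before it starts parsing them.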

import asyncio
import aiohttp

# HEADERS (request headers) and asyncio_loop (the event loop) are defined elsewhere.

async def get_request(session, url, params=None):
    # Fetch one page and return its body as text.
    async with session.get(url, headers=HEADERS, params=params) as response:
        return await response.text()

async def get_html_from_url(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(get_request(session, url))
        # gather returns only once every request has finished.
        html_page_response = await asyncio.gather(*tasks)
    return html_page_response

html_pages_list = asyncio_loop.run_until_complete(get_html_from_url(urls))
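To illustrate the waiting problem described above, here is a minimal sketch of handling each page as soon as it arrives instead of after the whole batch, using asyncio.as_completed. The names fake_fetch and handle_pages_as_they_arrive are hypothetical, and fake_fetch only simulates get_request so that the sketch runs without a network.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for get_request(session, url): no network involved.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def handle_pages_as_they_arrive(urls, fetch, handle):
    """Call handle(html) on each page as soon as its download finishes."""
    results = []
    # Start every download right away so they run concurrently.
    tasks = [asyncio.create_task(fetch(url)) for url in urls]
    # as_completed yields awaitables in completion order, not in input order.
    for next_done in asyncio.as_completed(tasks):
        html = await next_done
        results.append(handle(html))
    return results
```

Note that handle is called synchronously here, so a CPU-heavy parser would still block the event loop while it runs; the results also come back in completion order rather than in the order of urls.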

Once I have the content of every page, I parallelize the parsing with a multiprocessing Pool.

import re
from multiprocessing import Pool
from bs4 import BeautifulSoup

def get_whatiwant_from_html(html_content):
    parsed_html = BeautifulSoup(html_content, "html.parser")
    clean = parsed_html.find("div", class_="class").get_text()
    # Some re.subs
    clean = re.sub("", "", clean)
    return clean

pool = Pool(4)
what_i_want = pool.map(get_whatiwant_from_html, html_pages_list)

The following code interleaves fetching and parsing asynchronously, but I would like to integrate multiprocessing into it:

async def process(url, session):
    html = await get_request(session, url)
    # get_whatiwant_from_html is a regular function, so it is called without
    # await; it runs in the event loop thread while it parses.
    return get_whatiwant_from_html(html)

async def dispatch(urls):
    async with aiohttp.ClientSession() as session:
        coros = (process(url, session) for url in urls)
        return await asyncio.gather(*coros)

result = asyncio.get_event_loop().run_until_complete(dispatch(urls))

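Is there an obvious way to do this? I thought about creating 4 processes that each run the asynchronous calls, but the implementation looks a bit complicated, and I am wondering whether there is another way.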
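One alternative to running a separate event loop in each process is to keep a single loop for the downloads and hand only the parsing to a process pool through loop.run_in_executor. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: fetch_and_parse, fake_fetch, fake_parse and main are hypothetical names, and the two fake functions stand in for get_request and get_whatiwant_from_html so that the example runs without aiohttp or a network.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


async def fetch_and_parse(url, fetch, parse, executor):
    """Fetch one page, then parse it in the executor without blocking the loop."""
    html = await fetch(url)
    loop = asyncio.get_running_loop()
    # With a ProcessPoolExecutor, parse and html are pickled and sent to a
    # worker process, so parse must be a module-level function.
    return await loop.run_in_executor(executor, parse, html)


async def dispatch(urls, fetch, parse, executor=None):
    """Run fetch_and_parse for every URL; results keep the order of urls."""
    # executor=None falls back to the loop's default thread pool.
    coros = (fetch_and_parse(url, fetch, parse, executor) for url in urls)
    return await asyncio.gather(*coros)


async def fake_fetch(url):
    # Hypothetical stand-in for get_request(session, url).
    await asyncio.sleep(0.01)
    return f"<div class='class'>{url}</div>"


def fake_parse(html):
    # Hypothetical stand-in for get_whatiwant_from_html.
    return html.upper()


async def main(urls):
    # Four worker processes, mirroring Pool(4) in the question.
    with ProcessPoolExecutor(max_workers=4) as executor:
        return await dispatch(urls, fake_fetch, fake_parse, executor)
```

To try it with real processes, call asyncio.run(main(urls)) under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that spawn workers. In the real script, fetch would be a small wrapper around get_request that shares one aiohttp.ClientSession, and parse would be get_whatiwant_from_html.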

I am very new to asyncio and aiohttp, so if you can recommend anything to read to understand them better, I would be glad to hear it.

忽然笑
