I've written a script using the asyncio and aiohttp libraries to parse a website's content asynchronously. In the script below I tried to apply the same logic that is usually applied in scrapy.

However, when I execute the script it behaves like synchronous libraries such as requests or urllib.request do. It is therefore very slow and doesn't serve the purpose.

I know I could get around this by defining all of the next-page links in the link variable. But am I already doing the task the right way with my existing script?

In the script, the processing_docs() function collects all the links to the individual posts and passes those refined links to the fetch_again() function, which fetches the title from each target page. processing_docs() also contains a piece of logic that collects the next_page link and feeds it back to the fetch() function to repeat the whole process. This next_page call is what makes the script slow, whereas we usually do the same thing in scrapy and get the expected performance.

My question is: how can I achieve the same goal while keeping the existing logic intact?
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session, title)

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()
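For reference, here is a rough sketch of the kind of change I experimented with (my own assumption, not necessarily the idiomatic fix, which is partly what I'm asking): it keeps the same fetch() / processing_docs() / fetch_again() structure, but schedules all of the per-post requests at once with asyncio.gather() instead of awaiting each one in turn.

import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
        return await processing_docs(session, text)

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href'])
              for title in tree.cssselect(".summary .question-hyperlink")]
    # Sketch: fan out all per-post requests concurrently instead of
    # awaiting them one after another.
    await asyncio.gather(*(fetch_again(session, title) for title in titles))
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        await fetch(urljoin(link, next_page[0].attrib['href']))

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
    tree = fromstring(text)
    print(tree.cssselect("h1[itemprop='name'] a")[0].text)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch(link))
    loop.close()

With this version the posts on one listing page are fetched concurrently, while the next_page recursion still walks the listing pages one by one, so the pagination logic itself is unchanged.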