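I'm trying to scrape the URLs of Instagram posts tagged "foody". Using Selenium and BeautifulSoup, I can collect about 2,160 post URLs.
However, I can't get past that point (there are more than 4,000,000 posts). Is there another way to scrape all the posts with the "foody" hashtag? Or at least the URLs of the posts published in 2018-2019?
Here is my scraping code.
Thanks!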
import time

from bs4 import BeautifulSoup
from selenium import webdriver

instagram_url = "https://www.instagram.com"
tag_url = "https://www.instagram.com/explore/tags"
ads = "foody"  # hashtag
pause_time = 2  # seconds to wait for new posts to load after each scroll
reallink = []  # unique post URLs collected so far

# driver
driver = webdriver.Chrome("chromedriver.exe")
# go to the hashtag page
driver.get(f"{tag_url}/{ads}")
time.sleep(pause_time)

# scroll to the bottom once and record the page height
lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match = False
i = 0
while not match:
    # collect the post URLs currently present in the page
    html = driver.page_source
    bs_html = BeautifulSoup(html, "lxml")
    for roots in bs_html.find_all(name="div", attrs={"class": "Nnq7C weEfm"}):
        for link in roots.select("a"):
            real = link.attrs["href"]
            if real not in reallink:
                reallink.append(real)
                print("appended data: ", len(reallink))
    # scroll down again; stop once the page height no longer grows
    lastCount = lenOfPage
    print(f"scrolling down {i}")
    i += 1
    time.sleep(pause_time)
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount == lenOfPage:
        match = True