I am scraping this site: http://www.legorafi.fr/. It works for every category (politics, etc.), but for each category I loop through the same fixed number of pages. Instead, I would like to scrape all the pages of each category, based on how many pages that category actually has on the site.
This is how I currently loop through the pages:
import time
import requests
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import newspaper
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['france/politique', 'france/societe', 'monde-libre', 'france/economie/',
              'culture', 'people', 'sports', 'hi-tech', 'sciences']
papers = []
urls_set = set()
driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")

for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    driver.get(url)
    time.sleep(2)
    pagesToGet = 120  # hard-coded: every category gets the same number of pages
    title = []
    content = []
    for page in range(1, pagesToGet + 1):
        print('Processing page :', page)
        page_url = url + '/page/' + str(page)
        print(driver.current_url)
        raw_html = requests.get(page_url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        # collect every article link on the page, skipping the comment anchors
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])
I want to loop through all of these categories, scraping each one according to its own number of pages:
categories = ['france/politique','france/societe', 'monde-libre', 'france/economie/', 'culture', 'people', 'sports', 'hi-tech', 'sciences']
How can I do that?
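One way to do this, since the site does not tell you the page count up front, is to keep requesting /page/N/ for each category until the site signals there are no more pages, instead of hard-coding pagesToGet. Below is a minimal sketch using only requests and BeautifulSoup (Selenium is not needed just to fetch the listing pages, since your loop already downloads them with requests). It assumes that requesting a page past the last one either returns a non-200 status code or yields no article links inside the div.articles container; the helper name scrape_category is mine, not part of your code.

import requests
from bs4 import BeautifulSoup

BASE = 'http://www.legorafi.fr/category/'
categories = ['france/politique', 'france/societe', 'monde-libre', 'france/economie/',
              'culture', 'people', 'sports', 'hi-tech', 'sciences']

def scrape_category(category):
    """Collect article URLs for one category, paging until the site runs out.

    Assumption: past the last page the server answers with a non-200 status,
    or with a listing page that contains no article links.
    """
    urls = []
    page = 1
    while True:
        page_url = BASE + category.strip('/') + '/page/' + str(page)
        response = requests.get(page_url)
        if response.status_code != 200:  # e.g. 404 past the last page
            break
        soup = BeautifulSoup(response.text, 'html.parser')
        links = [a['href']
                 for div in soup.find_all('div', {'class': 'articles'})
                 for a in div.find_all('a', href=True)
                 if not a['href'].endswith('#commentaires')]
        if not links:  # no articles on this page -> we are past the end
            break
        urls.extend(links)
        print('%s: page %d, %d links' % (category, page, len(links)))
        page += 1
    return urls

papers = []
for category in categories:
    papers.extend(scrape_category(category))

As in your code, you may want to add the links to a set to deduplicate them. Alternatively, the last page number can often be read directly from a category's pagination links, but the exact markup of that element on legorafi.fr is an assumption you would have to verify in the page source.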