
Web scraping text returns an empty set

When using Beautiful Soup's findAll, the code doesn't scrape any text because it returns an empty set. There are other problems in the code after this point, but at this stage I'm trying to solve this first one. I'm very new to this, so I understand the code structure may not be ideal; I come from a VBA background.

import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\Users\mmanenica\Documents\chromedriver.exe")

#click the search button on Austenders to return all Awarded Contracts
import time

#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')

#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    exit;
print('Search Button found')
Search_Results.click()

#Pause code to prevent blocking by website
time.sleep(1)
i = 0
Awarded = []

#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        exit;
    i = i + 1
    time.sleep(2)

1 Answer

梦里花落0921

As mentioned, you are not actually feeding the html source into BeautifulSoup. So the first change is from

soup = BeautifulSoup(driver.current_url, features='lxml')

to

soup = BeautifulSoup(driver.page_source, features='lxml')

Second issue: some elements have no <a> tag with class=detail, so you cannot take an href from a NoneType. I added a try/except to skip those cases (though I'm not sure that produces the result you want). You could also drop the class and just write Details_Page = each_Contract.find('a').get('href').

Next, that href is only the path part of the url, so you need to prepend the root: driver.get('https://www.tenders.gov.au' + Details_Page).

I also can't see where class=Contact-Heading is supposed to come from. Likewise, you refer to 'class': 'list-desc-inner' in one place and 'class': 'list_desc_inner' in another; again, I don't see any class=list_desc_inner.

Next: to append a list to a list, you want Awarded.append(Combined), not Awarded.append[Combined]. I also added .strip() in there to clean up some of the whitespace in the text.

In any case, there is still a lot for you to fix and clean up, and I don't know what your expected output should be, but hopefully this gets you started. Also, as mentioned in the comments, you could just click the download button and get the results immediately, but maybe you're doing this for practice...

import requests
from requests import get
from selenium import webdriver
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
#import chromedriver_binary  # Adds chromedriver binary to path

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")

#click the search button on Austenders to return all Awarded Contracts
import time

#define the starting point: Austenders Awarded Contracts search page
driver.get('https://www.tenders.gov.au/cn/search')

#Find the Search Button and return all search results
Search_Results = driver.find_element_by_name("SearchButton")
if 'inactive' in Search_Results.get_attribute('name'):
    print("Search Button not found")
    raise SystemExit  # stop the script if the search button is missing
print('Search Button found')
Search_Results.click()

#Pause code to prevent blocking by website
time.sleep(1)
i = 0
Awarded = []

#Move to the next search page by finding the Next button at the bottom of the page
#This code will need to be refined as the last search will be skipped currently.
while True:
    Next_Page = driver.find_element_by_class_name('next')
    if 'inactive' in Next_Page.get_attribute('class'):
        print("End of Search Results")
        break  # the Next button goes inactive on the last page, so stop paging
    i = i + 1
    time.sleep(2)

    #Loop through all the Detail links on the current Search Results Page
    print("Checking search results page " + str(i))
    print(driver.current_url)
    soup = BeautifulSoup(driver.page_source, features='lxml')

    #Find all Contract detail links in the current search results page
    Details = soup.findAll('div', {'class': 'list-desc-inner'})
    for each_Contract in Details:
        #Loop through each Contract details link and scrape the detailed
        #Contract information page
        try:
            Details_Page = each_Contract.find('a', {'class': 'detail'}).get('href')
            driver.get('https://www.tenders.gov.au' + Details_Page)

            #Scrape all the data in the Awarded Contract page
            #r = requests.get(driver.current_url)
            soup = BeautifulSoup(driver.page_source, features='lxml')

            #find a list of all the Contract Info (contained in the 'Contact-Heading'
            #class of the span element)
            Contract = soup.find_all('span', {'class': 'Contact-Heading'})
            Contract_Info = [span.text.strip() for span in Contract]

            #find a list of all the Summary Contract info which is in the text of
            #the 'list-desc-inner' class
            Sub = soup.find_all('div', {'class': 'list-desc-inner'})
            Sub_Info = [div.text.strip() for div in Sub]

            #Combine the lists into a unified list and append to the Awarded table
            Combined = [Contract_Info, Sub_Info]
            Awarded.append(Combined)

            #Go back to the Search Results page (from the Detailed Contract page)
            driver.back()
        except:
            continue

    #Go to the next Search Page by clicking on the Next button at the bottom of the page
    Next_Page.click()
    #time.sleep(3)

driver.close()
print(len(Awarded))  # a plain list has no .Shape attribute
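As a side note, the script imports pandas but never uses it, and the original print(Awarded.Shape) would fail because a list has no Shape attribute. If the goal is a table, a minimal sketch along these lines might help; the sample rows, the ' | ' separator, the column names, and the output filename below are all made up for illustration, based only on the [Contract_Info, Sub_Info] shape that the loop above builds:

import pandas as pd

# Hypothetical sample in the same shape the loop builds:
# each entry is [Contract_Info, Sub_Info], two lists of stripped strings.
Awarded = [
    [['Agency: Dept A', 'CN ID: CN123'], ['Contract summary text']],
    [['Agency: Dept B', 'CN ID: CN456'], ['Another summary']],
]

# One row per contract; join the scraped fragments so each cell holds a
# single string instead of a Python list.
df = pd.DataFrame(
    [(' | '.join(info), ' | '.join(sub)) for info, sub in Awarded],
    columns=['Contract_Info', 'Sub_Info'],
)
print(df.shape)  # a DataFrame does have a .shape attribute
df.to_csv('awarded_contracts.csv', index=False)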