How do I scrape data that is in an HTML table?

I am trying to scrape data from the site https://www.msamb.com/ApmcDetail/ArrivalPriceInfo.

This is the data I want to scrape. The highlighted dropdown select box contains 148 commodities.

At the moment I am copying the data manually by selecting each commodity one at a time, which takes a lot of manual work.

(Screenshot: http://img4.mukewang.com/64127080000145de12880508.jpg)

So, to automate this, I started using Python. These are the libraries I am using with Python (3.7.8):

  1. BeautifulSoup

  2. pandas

Here is my Python code.

```python
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path='G:/data/depend/chromedriver.exe')
driver.get('https://www.msamb.com/ApmcDetail/ArrivalPriceInfo/')

commodity = Select(driver.find_element_by_id("CommoditiesId"))

# able to select commodities by value
commodity.select_by_value('08005')

# Iterate over all the commodities and fetch the <td> elements
for option in commodity.options:
    soup = BeautifulSoup(option.text)
    print(soup)
    rows = soup.select('tr')
    print(rows)
    for row in rows[1:]:
        td = row.find_all('td')
        print(td)
        APMC = td[0].text.strip()
        print(APMC)
```

Here I am able to get the commodities from the dropdown select box via the id CommoditiesId.
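For reference, the 148 (value, name) pairs can also be read straight out of the `<select>` markup without clicking anything. A minimal sketch against a trimmed, hand-written copy of the dropdown (only the id `CommoditiesId` and the value `08005` come from the question; the option name `APPLE` is purely illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the real dropdown; the live page has 148 options.
html = """
<select id="CommoditiesId">
  <option value="">--Select--</option>
  <option value="08005">APPLE</option>
</select>
"""
soup = BeautifulSoup(html, 'html.parser')
# Skip the placeholder option, which has an empty value attribute.
pairs = [(o['value'], o.get_text(strip=True))
         for o in soup.select('#CommoditiesId option') if o['value']]
print(pairs)   # -> [('08005', 'APPLE')]
```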


After getting the list of 148 commodities, I try to parse the HTML table content fetched for that particular commodity. Here I am able to print the commodity name on each iteration, but I cannot print the APMC, Variety, Unit, Quantity, Lrate, Hrate, and Modal column data.
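A likely cause: `option.text` is just the commodity name string, not HTML, so `BeautifulSoup(option.text)` never contains any `<tr>` and the inner loop never runs. The table has to be parsed from the page itself (e.g. from `driver.page_source` after each selection). A sketch of that parsing step, run here against a tiny hand-written table whose column layout is assumed from the question:

```python
from bs4 import BeautifulSoup

def parse_rows(page_html):
    """Extract the <td> cell texts of every data row in the table."""
    soup = BeautifulSoup(page_html, 'html.parser')
    rows = []
    for tr in soup.select('tr')[1:]:          # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:
            rows.append(cells)
    return rows

# In the Selenium loop the HTML would come from driver.page_source after
# selecting each option; this hand-written snippet stands in for the page.
sample = """
<table>
  <tr><th>APMC</th><th>Variety</th><th>Unit</th><th>Quantity</th>
      <th>Lrate</th><th>Hrate</th><th>Modal</th></tr>
  <tr><td>KOLHAPUR</td><td>----</td><td>QUINTAL</td><td>17</td>
      <td>8500</td><td>14500</td><td>11500</td></tr>
</table>
"""
rows = parse_rows(sample)
print(rows[0][0])   # -> KOLHAPUR
```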


Once the above is solved, I want the output in a ~|~-delimited format, with two added columns, Date and Commodity. A sample of the output is below (at the moment I prepare this data file manually).


```
Date~|~Commodity~|~APMC~|~Variety~|~Unit~|~Quantity~|~Lrate~|~Hrate~|~Modal
2020-07-11~|~APPLE~|~KOLHAPUR~|~QUINTAL~|~17~|~8500~|~14500~|~11500
2020-07-11~|~APPLE~|~CHANDRAPUR-GANJWAD~|~QUINTAL~|~9~|~15000~|~17000~|~16000
2020-07-11~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~60~|~9500~|~16000~|~13000
2020-07-11~|~AMBAT CHUKA~|~PANDHARPUR~|~~|~NAG~|~7~|~10~|~10~|~10
2020-07-10~|~AMBAT CHUKA~|~PUNE-MANJRI~|~~|~NAG~|~400~|~3~|~6~|~4
2020-07-10~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~1300~|~4~|~5~|~4
```
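Producing those lines is plain string joining once the per-row cells are available. A minimal sketch (the values are copied from the sample above; an empty string stands in for a missing Variety):

```python
DELIM = '~|~'
header = ['Date', 'Commodity', 'APMC', 'Variety', 'Unit',
          'Quantity', 'Lrate', 'Hrate', 'Modal']

def format_line(date, commodity, cells):
    """Prepend the Date and Commodity columns and join with the delimiter."""
    return DELIM.join([date, commodity] + cells)

line = format_line('2020-07-11', 'APPLE',
                   ['KOLHAPUR', '', 'QUINTAL', '17', '8500', '14500', '11500'])
print(line)
# -> 2020-07-11~|~APPLE~|~KOLHAPUR~|~~|~QUINTAL~|~17~|~8500~|~14500~|~11500
```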


肥皂起泡泡

2 Answers

交互式爱情

This script goes through all the commodities and saves them to a standard csv file and to a ~|~-delimited text file:

```python
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'
detail_url = 'https://www.msamb.com/ApmcDetail/DataGridBind?commodityCode={code}&apmcCode=null'
headers = {'Referer': 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'}

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

values = [(o['value'], o.text) for o in soup.select('#CommoditiesId option') if o['value']]

all_data = []
for code, code_name in values:
    print('Getting info for code {} {}'.format(code, code_name))
    soup = BeautifulSoup(requests.get(detail_url.format(code=code), headers=headers).content, 'html.parser')
    current_date = ''
    for row in soup.select('tr'):
        if row.select_one('td[colspan]'):
            current_date = row.get_text(strip=True)
        else:
            row = [td.get_text(strip=True) for td in row.select('td')]
            all_data.append({
                'Date': current_date,
                'Commodity': code_name,
                'APMC': row[0],
                'Variety': row[1],
                'Unit': row[2],
                'Quantity': row[3],
                'Lrate': row[4],
                'Hrate': row[5],
                'Modal': row[6],
            })

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')                                  # <-- saves standard csv
np.savetxt('data.txt', df, delimiter='~|~', fmt='%s')  # <-- saves .txt file with '~|~' delimiter
```

Prints:

```
...
Getting info for code 08071 TOMATO
Getting info for code 10006 TURMERIC
Getting info for code 08075 WAL BHAJI
Getting info for code 08076 WAL PAPDI
Getting info for code 08077 WALVAD
Getting info for code 07011 WATER MELON
Getting info for code 02009 WHEAT(HUSKED)
Getting info for code 02012 WHEAT(UNHUSKED)
            Date        Commodity          APMC Variety     Unit Quantity Lrate Hrate Modal
0     18/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
1     16/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
2     15/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG      100     9     9     9
3     13/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       16     7     7     7
4     13/07/2020      AMBAT CHUKA          PUNE   LOCAL      NAG     2400     4     7     5
...          ...              ...           ...     ...      ...      ...   ...   ...   ...
4893  12/07/2020    WHEAT(HUSKED)        SHIRUR   No. 2  QUINTAL        2  1400  1400  1400
4894  17/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      863  4000  4600  4300
4895  16/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      475  4000  4500  4250
4896  15/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      680  3900  4400  4150
4897  13/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL     1589  3900  4450  4175

[4898 rows x 9 columns]
```

Saved data.txt:

```
0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
1~|~16/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
2~|~15/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~100~|~9~|~9~|~9
3~|~13/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~16~|~7~|~7~|~7
4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5
5~|~12/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~1700~|~3~|~8~|~5
6~|~19/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~3~|~9000~|~14000~|~11500
7~|~18/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~12~|~8500~|~15000~|~11750
8~|~18/07/2020~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~110~|~9000~|~16000~|~13000
9~|~18/07/2020~|~APPLE~|~SANGLI-PHALE BHAJIPALAM~|~LOCAL~|~QUINTAL~|~8~|~12000~|~16000~|~14000
10~|~17/07/2020~|~APPLE~|~MUMBAI-FRUIT MARKET~|~----~|~QUINTAL~|~264~|~9000~|~12000~|~10500
...
```

Screenshot of the csv file from LibreOffice (image not preserved in this dump).

慕后森

You can save them to a txt file and then read them back with df = pd.read_csv("out.txt", delimiter='~|~'), and take columns with date = df['Date'] and commodity = df['Commodity']. You can append each APMC to a list and read it as a DataFrame at the end.
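One caveat with that read-back suggestion: pandas treats a multi-character separator as a regular expression (python engine only), and `|` is a regex metacharacter, so the literal `'~|~'` would split on every single `~`. Escaping the pipe works; a small sketch on inline data:

```python
import io
import pandas as pd

data = ("Date~|~Commodity~|~APMC\n"
        "2020-07-11~|~APPLE~|~KOLHAPUR\n")

# Multi-character separators are regexes, so the '|' must be escaped;
# they also require the python parser engine.
df = pd.read_csv(io.StringIO(data), sep=r'~\|~', engine='python')
print(list(df.columns))   # -> ['Date', 'Commodity', 'APMC']
print(df.loc[0, 'APMC'])  # -> KOLHAPUR
```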