How do I scrape data in HTML table format?

This script goes through all the pages and saves the results to a standard CSV file and to a text file delimited by ~|~:

    import requests
    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup

    url = 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'
    detail_url = 'https://www.msamb.com/ApmcDetail/DataGridBind?commodityCode={code}&apmcCode=null'
    headers = {'Referer': 'https://www.msamb.com/ApmcDetail/ArrivalPriceInfo'}

    # Collect every (commodity code, commodity name) pair from the dropdown.
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    values = [(o['value'], o.text) for o in soup.select('#CommoditiesId option') if o['value']]

    all_data = []
    for code, code_name in values:
        print('Getting info for code {} {}'.format(code, code_name))
        soup = BeautifulSoup(requests.get(detail_url.format(code=code), headers=headers).content, 'html.parser')

        current_date = ''
        for row in soup.select('tr'):
            if row.select_one('td[colspan]'):
                # A row with a colspan cell holds the date for the rows that follow.
                current_date = row.get_text(strip=True)
            else:
                cells = [td.get_text(strip=True) for td in row.select('td')]
                if len(cells) < 7:
                    continue  # skip header rows (no td cells) and malformed rows
                all_data.append({
                    'Date': current_date,
                    'Commodity': code_name,
                    'APMC': cells[0],
                    'Variety': cells[1],
                    'Unit': cells[2],
                    'Quantity': cells[3],
                    'Lrate': cells[4],
                    'Hrate': cells[5],
                    'Modal': cells[6],
                })

    df = pd.DataFrame(all_data)
    print(df)

    df.to_csv('data.csv')                                       # <-- saves standard csv
    np.savetxt('data.txt', df, delimiter='~|~', fmt='%s')       # <-- saves .txt file with '~|~' delimiter

Prints:

    ...
    Getting info for code 08071 TOMATO
    Getting info for code 10006 TURMERIC
    Getting info for code 08075 WAL BHAJI
    Getting info for code 08076 WAL PAPDI
    Getting info for code 08077 WALVAD
    Getting info for code 07011 WATER MELON
    Getting info for code 02009 WHEAT(HUSKED)
    Getting info for code 02012 WHEAT(UNHUSKED)
                Date        Commodity          APMC Variety     Unit Quantity Lrate Hrate Modal
    0     18/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
    1     16/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       50     5     5     5
    2     15/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG      100     9     9     9
    3     13/07/2020      AMBAT CHUKA    PANDHARPUR    ----      NAG       16     7     7     7
    4     13/07/2020      AMBAT CHUKA          PUNE   LOCAL      NAG     2400     4     7     5
    ...          ...              ...           ...     ...      ...      ...   ...   ...   ...
    4893  12/07/2020    WHEAT(HUSKED)        SHIRUR   No. 2  QUINTAL        2  1400  1400  1400
    4894  17/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      863  4000  4600  4300
    4895  16/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      475  4000  4500  4250
    4896  15/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL      680  3900  4400  4150
    4897  13/07/2020  WHEAT(UNHUSKED)  SANGLI-MIRAJ    ----  QUINTAL     1589  3900  4450  4175

    [4898 rows x 9 columns]

Saves data.txt:

    0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
    1~|~16/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5
    2~|~15/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~100~|~9~|~9~|~9
    3~|~13/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~16~|~7~|~7~|~7
    4~|~13/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~2400~|~4~|~7~|~5
    5~|~12/07/2020~|~AMBAT CHUKA~|~PUNE~|~LOCAL~|~NAG~|~1700~|~3~|~8~|~5
    6~|~19/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~3~|~9000~|~14000~|~11500
    7~|~18/07/2020~|~APPLE~|~KOLHAPUR~|~----~|~QUINTAL~|~12~|~8500~|~15000~|~11750
    8~|~18/07/2020~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~110~|~9000~|~16000~|~13000
    9~|~18/07/2020~|~APPLE~|~SANGLI-PHALE BHAJIPALAM~|~LOCAL~|~QUINTAL~|~8~|~12000~|~16000~|~14000
    10~|~17/07/2020~|~APPLE~|~MUMBAI-FRUIT MARKET~|~----~|~QUINTAL~|~264~|~9000~|~12000~|~10500
    ...

[Screenshot of the csv file opened in LibreOffice]
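Because the separator in data.txt is three characters long, the standard csv module (which only accepts a one-character delimiter) cannot read it back directly. The following is a minimal sketch of one way to do it with plain string splitting. The helper name `read_delimited` and the `FIELDS` list are hypothetical, and the ten-field layout (a leading row index followed by the nine columns) is an assumption taken from the sample lines shown in the answer, not something the original answer specifies.

```python
import io

DELIMITER = "~|~"

# Assumed layout, inferred from the sample data.txt lines: a leading row
# index followed by the nine DataFrame columns. Adjust if your file differs.
FIELDS = ["Index", "Date", "Commodity", "APMC", "Variety", "Unit",
          "Quantity", "Lrate", "Hrate", "Modal"]


def read_delimited(lines, delimiter=DELIMITER, fields=FIELDS):
    """Yield one dict per non-empty line, splitting on a multi-character delimiter."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        parts = line.split(delimiter)
        if len(parts) != len(fields):
            raise ValueError(
                f"expected {len(fields)} fields, got {len(parts)}: {line!r}"
            )
        yield dict(zip(fields, parts))


# Two lines copied from the sample output, wrapped in a file-like object.
sample = io.StringIO(
    "0~|~18/07/2020~|~AMBAT CHUKA~|~PANDHARPUR~|~----~|~NAG~|~50~|~5~|~5~|~5\n"
    "8~|~18/07/2020~|~APPLE~|~NASHIK~|~DILICIOUS- No.1~|~QUINTAL~|~110~|~9000~|~16000~|~13000\n"
)
records = list(read_delimited(sample))
print(records[1]["Variety"])  # prints "DILICIOUS- No.1"
```

To read a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open('data.txt', encoding='utf-8')`. All values come back as strings, so numeric columns such as Quantity still need converting. If pandas is already in use, `pd.read_csv` with a regular-expression `sep` and `engine='python'` should be an alternative.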
