I'm trying to scrape multiple tables from 30 similar links using Python

I have 10 company links:


https://www.zaubacorp.com/company/ASHRAFI-MEDIA-NETWORK-PRIVATE-LIMITED/U22120GJ2019PTC111757,

https://www.zaubacorp.com/company/METTLE-PUBLICATIONS-PRIVATE-LIMITED/U22120MH2019PTC329729,

https://www.zaubacorp.com/company/PRINTSCAPE-INDIA-PRIVATE-LIMITED/U22120MH2020PTC335354,

https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665,

https://www.zaubacorp.com/company/BHOOKA-NANGA-FILMS-PRIVATE-LIMITED/U22130DL2019PTC353194,

https://www.zaubacorp.com/company/WHITE-CAMERA-SCHOOL-OF-PHOTOGRAPHY-PRIVATE-LIMITED/U22130JH2019PTC013311,

https://www.zaubacorp.com/company/RLE-PRODUCTIONS-PRIVATE-LIMITED/U22130KL2019PTC059208,

https://www.zaubacorp.com/company/CATALIZADOR-MEDIA-PRIVATE-LIMITED/U22130KL2019PTC059793,

https://www.zaubacorp.com/company/TRIPPLED-MEDIAWORKS-OPC-PRIVATE-LIMITED/U22130MH2019OPC333171,

https://www.zaubacorp.com/company/KRYSTAL-CINEMAZ-PRIVATE-LIMITED/U22130MH2019PTC330391

Now I'm trying to scrape the tables from these links and save the data in well-formatted CSV columns. I want to scrape the "Company Details", "Share Capital & Number of Employees", "Listing and Annual Compliance Details", "Contact Details", and "Director Details" tables. If any table has no data or is missing a column, I want that column to be blank in the output CSV. I wrote some code but can't get any output. What am I doing wrong here? Please help.


import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import csv
import lxml

url_file = "Zaubalinks.txt"

with open(url_file, "r") as url:
    url_pages = url.read()

# we need to split the URLs into a list to make them iterable
pages = url_pages.split("\n")  # split by lines using \n

# now we run a for loop to visit the URLs one by one
data = []
for single_page in pages:
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')

    table = soup.find_all('table')  # finds all tables on the page
    table_top = pd.read_html(str(table))[0]  # the top table



DIEA
2 Answers

UYOU

import requests
from bs4 import BeautifulSoup
import pandas as pd

companies = {
    'ASHRAFI-MEDIA-NETWORK-PRIVATE-LIMITED/U22120GJ2019PTC111757',
    'METTLE-PUBLICATIONS-PRIVATE-LIMITED/U22120MH2019PTC329729',
    'PRINTSCAPE-INDIA-PRIVATE-LIMITED/U22120MH2020PTC335354',
    'CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665',
    'BHOOKA-NANGA-FILMS-PRIVATE-LIMITED/U22130DL2019PTC353194',
    'WHITE-CAMERA-SCHOOL-OF-PHOTOGRAPHY-PRIVATE-LIMITED/U22130JH2019PTC013311',
    'RLE-PRODUCTIONS-PRIVATE-LIMITED/U22130KL2019PTC059208',
    'CATALIZADOR-MEDIA-PRIVATE-LIMITED/U22130KL2019PTC059793',
    'TRIPPLED-MEDIAWORKS-OPC-PRIVATE-LIMITED/U22130MH2019OPC333171',
    'KRYSTAL-CINEMAZ-PRIVATE-LIMITED/U22130MH2019PTC330391'
}

def main(url):
    with requests.Session() as req:
        goal = []
        for company in companies:
            r = req.get(url.format(company))
            df = pd.read_html(r.content)
            target = pd.concat([df[x].T for x in [0, 3, 4]], axis=1)
            goal.append(target)
        new = pd.concat(goal)
        new.to_csv("data.csv")

main("https://www.zaubacorp.com/company/{}")
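One wrinkle worth noting about this answer: each of those tables is a two-column key/value listing, so transposing with .T leaves pandas' numeric row labels as the column headers. Below is a minimal sketch of one way to get real headers plus blank cells for missing fields, assuming the tables parse with default integer column labels; the field names in WANTED are illustrative placeholders, not the site's full list.

import requests
import pandas as pd

WANTED = ["CIN", "Company Name", "Company Status", "Email ID", "Address"]  # illustrative, not exhaustive

def tidy(url):
    dfs = pd.read_html(requests.get(url).content)
    # set_index(0).T turns each key/value table into a single row
    # whose column names are the keys from the first column
    wide = pd.concat([dfs[i].set_index(0).T for i in (0, 3, 4)], axis=1)
    wide = wide.loc[:, ~wide.columns.duplicated()]  # reindex requires unique labels
    # reindex forces a fixed column set; any field a page lacks comes out as NaN,
    # which to_csv writes as a blank cell - the behaviour the question asks for
    return wide.reindex(columns=WANTED)

Swapping a call like this in for the pd.concat line inside the answer's loop keeps its structure while adding the blank-column behaviour.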

largeQ

Fortunately, it looks like you can get there with a simpler approach. Taking one link at random, it would go like this:

url = 'https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665'

import pandas as pd
tables = pd.read_html(url)

From here, your tables are in tables[0], tables[3], tables[4], tables[15], and so on. Just use a for loop to cycle through all the URLs.
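Since the question reads its URLs from a file, here is a minimal sketch of that loop, assuming the table positions (0, 3, 4, 15) from this answer hold on every page and the links sit one per line in Zaubalinks.txt; the try/except is there so a page with missing tables still produces a blank row instead of crashing.

import pandas as pd

with open("Zaubalinks.txt") as f:
    pages = [line.strip().rstrip(",") for line in f if line.strip()]  # drop blanks and trailing commas

frames = []
for page in pages:
    try:
        tables = pd.read_html(page)  # fetches the page and parses every <table>
        wide = pd.concat([tables[i].set_index(0).T for i in (0, 3, 4, 15)], axis=1)
        frames.append(wide.loc[:, ~wide.columns.duplicated()])  # duplicate keys would break the final concat
    except (ValueError, IndexError):
        # ValueError: no parseable tables; IndexError: fewer tables than expected
        frames.append(pd.DataFrame(index=[0]))  # keep a blank row for this page

# concat aligns on column names, so fields missing from one page become NaN,
# and NaN is written out as an empty cell in the CSV
pd.concat(frames, ignore_index=True).to_csv("companies.csv", index=False)

If the site turns away pandas' default user agent, fetching each page with requests and passing r.content to read_html, as in the first answer, works the same way.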
