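Problem saving web-scraped data as a clean CSV table

I am trying to scrape some data from a table. I get the results I expect, but I can't find a way to save them as a clean CSV table. The code is below, followed by the output I get and the output I want. Any suggestions?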


from bs4 import BeautifulSoup
import urllib.request  # web access
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"

try:
    page = urllib.request.urlopen(url)  # connect to website
except:
    print("Ups!")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})

for li in content_lis:
    con = li.get_text("#", strip=True).split("\n")[0]
    print(con)

This gives me the following output, which looks fine:


Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil
Senoculus barroanus#Chickering, 1941#|#| Panama
Senoculus bucolicus#Chickering, 1941#|#| Panama

But I need something like this (a CSV delimited by semicolons or tabs):


Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama

How can I remove the "|" characters and the extra spaces? Any suggestions?


肥皂起泡泡
2 answers

幕布斯6054654

Try this:

from bs4 import BeautifulSoup
import urllib.request  # web access
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"

try:
    page = urllib.request.urlopen(url)  # connect to website
except:
    print("Ups!")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})

file = ''
for cl in content_lis:
    a = cl.select_one('div a strong i')          # species name
    b = cl.find(text=True, recursive=False)      # first direct text node: author and year
    c = cl.select_one('span')                    # distribution
    cc = re.findall(r"\w+", c.text)[0]           # first word of the distribution only
    file += f'{a.get_text(strip=True)};{b.strip()};{cc}\n'

with open('file.csv', 'w', encoding='utf-8') as f:
    f.write(file)

This saves a file containing:

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
Senoculus cambridgei;Mello-Leitão, 1927;Brazil
Senoculus canaliculatus;F. O. Pickard-Cambridge, 1902;Mexico
Senoculus carminatus;Mello-Leitão, 1927;Brazil
Senoculus darwini;(Holmberg, 1883);Argentina
Senoculus fimbriatus;Mello-Leitão, 1927;Brazil
Senoculus gracilis;(Keyserling, 1879);Guyana
Senoculus guianensis;Caporiacco, 1947;j
Senoculus iricolor;(Simon, 1880);Brazil
Senoculus maronicus;Taczanowski, 1872;French

and so on...
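
As a possible variation on the file-writing step above, here is a minimal sketch that uses the standard csv module instead of concatenating a string by hand. The function name write_species_csv and the hard-coded rows are assumptions made for illustration; in the real script the rows would come from the scraping loop. It writes to an in-memory buffer so it can run without network access.

```python
import csv
import io


def write_species_csv(rows, fileobj, delimiter=";"):
    """Write (name, author, distribution) tuples as delimited rows."""
    # lineterminator="\n" keeps the output identical on every platform.
    writer = csv.writer(fileobj, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


# Hypothetical rows, copied from the sample output in the question.
rows = [
    ("Senoculus albidus", "(F. O. Pickard-Cambridge, 1897)", "Brazil"),
    ("Senoculus barroanus", "Chickering, 1941", "Panama"),
]

buffer = io.StringIO()
write_species_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer could be replaced with open('file.csv', 'w', newline='', encoding='utf-8'). One reason to prefer csv.writer is that it quotes any field that happens to contain the delimiter, which plain string concatenation does not.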

慕哥6287543

This code is based on your sample data:

lst = ['Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil',
       'Senoculus barroanus#Chickering, 1941#|#| Panama',
       'Senoculus bucolicus#Chickering, 1941#|#| Panama']

# remove the "|" characters, then split each record on "#"
lst2 = [s.replace('|', "").split('#') for s in lst]

lst3 = []
for s in lst2:
    # strip each field and drop the empty ones left where "|" used to be
    lst3.append(';'.join(sx.strip() for sx in s if sx.strip()))

for s in lst3:
    print(s)

Output:

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama

--- Update, based on the asker's comment ---

Add one line to your final loop:

for li in content_lis:
    con = li.get_text("#", strip=True).split("\n")[0]
    con = ';'.join(sx.strip() for sx in con.replace('|', "").split('#') if sx.strip())  # add this line
    print(con)
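
Since the question also mentions tab-delimited output, the same cleanup could be wrapped in a small reusable function. This is only a sketch: clean_record and its sep and out_sep parameters are names introduced here, and it assumes every record looks like the three sample strings from the question.

```python
import re


def clean_record(raw, sep="#", out_sep=";"):
    """Split a '#'-joined record, strip '|' and whitespace, drop empty fields."""
    # Remove any leading run of '|' and whitespace from each field.
    parts = (re.sub(r"^[|\s]+", "", part).strip() for part in raw.split(sep))
    # Fields that consisted only of '|' are now empty and get skipped.
    return out_sep.join(part for part in parts if part)


# Sample strings taken from the question.
samples = [
    "Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil",
    "Senoculus barroanus#Chickering, 1941#|#| Panama",
    "Senoculus bucolicus#Chickering, 1941#|#| Panama",
]

semicolon_rows = [clean_record(s) for s in samples]
tab_rows = [clean_record(s, out_sep="\t") for s in samples]

for row in semicolon_rows:
    print(row)
```

Filtering out empty fields, rather than collapsing doubled separators afterwards, gives the same result however many "|" fragments appear in a row.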