Extracting character roles from Tom Holland's IMDB page with BeautifulSoup

I extracted the following data from Tom Holland's IMDB page and assigned it to "movie_contents":


[<div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
 </div>, <div class="filmo-row even" id="actor-tt1464335">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt1464335/">Uncharted</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
 <br/>
 Nathan Drake
 </div>, <div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
 </div>, <div class="filmo-row even" id="actor-tt9130508">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt9130508/">Cherry</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Nico Walker
 </div>, <div class="filmo-row odd" id="actor-tt7395114">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt7395114/">The Devil All the Time</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
 <br/>
 Arvin Russell
 </div>, <div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
 </div>, <div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
 </div>

How can I extract all of the role names, such as "Peter Parker / Spider-Man", "Nathan Drake", "Todd Hewitt", and so on?


慕丝7291255
2 answers

白板的微信

This script prints all of the actor's roles:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.imdb.com/name/nm4043618/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # Remember roles already printed so that each one appears only once
    seen = set()
    # Each role is the text that follows the <br> inside a .filmo-row of the actor section
    for row in soup.select('#filmo-head-actor + div .filmo-row > br'):
        role = row.find_next(text=True).strip()
        if role not in seen:
            seen.add(role)
            print(role)

Prints:

    Peter Parker / Spider-Man
    Nathan Drake
    Todd Hewitt
    Nico Walker
    Arvin Russell
    Ian Lightfoot (voice)
    Jip (voice)
    Walter (voice)
    Samuel Insull
    Brother Diarmuid - The Novice
    Jack Fawcett
    Bradley Baker
    Thomas Nickerson
    Tom
    Gregory Cromwell
    Former Billy (Encore) (uncredited)
    Isaac
    Eddie (voice)
    Boy
    Lucas
    Shô (UK version, voice)

EDIT: To get the roles into a DataFrame, you can do this:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    url = "https://www.imdb.com/name/nm4043618/"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # Collect each distinct role once, in page order
    seen = set()
    all_data = []
    for row in soup.select("#filmo-head-actor + div .filmo-row > br"):
        role = row.find_next(text=True).strip()
        if role not in seen:
            seen.add(role)
            all_data.append(role)

    df = pd.DataFrame(all_data, columns=["Role"])
    print(df)

Prints:

                                      Role
    0            Peter Parker / Spider-Man
    1                         Nathan Drake
    2                          Todd Hewitt
    3                          Nico Walker
    4                        Arvin Russell
    5                Ian Lightfoot (voice)
    6                          Jip (voice)
    7                       Walter (voice)
    8                        Samuel Insull
    9        Brother Diarmuid - The Novice
    10                        Jack Fawcett
    11                       Bradley Baker
    12                    Thomas Nickerson
    13                                 Tom
    14                    Gregory Cromwell
    15  Former Billy (Encore) (uncredited)
    16                               Isaac
    17                       Eddie (voice)
    18                                 Boy
    19                               Lucas
    20             Shô (UK version, voice)
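The same idea can be tried offline. The following is only a minimal sketch that runs against a two-row excerpt of the markup from the question rather than the live page. The names SAMPLE_HTML and extract_rows are made up for this illustration, and it assumes every row keeps the layout shown in the question: the year in span.year_column, the title in the b > a link, and the role as the text node directly after the br.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt: two rows copied from the markup in the question.
SAMPLE_HTML = """
<div class="filmo-row odd" id="actor-tt10872600">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
 <br/>
 Peter Parker / Spider-Man
</div>
<div class="filmo-row even" id="actor-tt7146812">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt7146812/">Onward</a></b>
 <br/>
 Ian Lightfoot (voice)
</div>
"""


def extract_rows(html):
    """Return one dict per filmography row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.select("div.filmo-row"):
        br = div.find("br")
        # Assumption: the role is the text node right after <br/>.
        sibling = br.next_sibling if br is not None else None
        role = sibling.strip() if isinstance(sibling, str) else ""
        rows.append({
            "year": div.select_one("span.year_column").get_text(strip=True),
            "title": div.select_one("b > a").get_text(strip=True),
            "role": role,
        })
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Since each row comes back as a dict, the resulting list can be passed straight to pd.DataFrame if the year and title are wanted next to the role.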

HUX布斯

Try:

    from bs4 import BeautifulSoup

    html = '''<html>
     <div class="filmo-row odd" id="actor-tt10872600">
     <span class="year_column">
      2021
     </span>
     <b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
     (<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
     <br/>
     Peter Parker / Spider-Man
     </div>, <div class="filmo-row even" id="actor-tt1464335">
     <span class="year_column">
      2021
     </span>
     <b><a href="/title/tt1464335/">Uncharted</a></b>
     (<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
     <br/>
     Nathan Drake
     </div>, <div class="filmo-row odd" id="actor-tt2076822">
     <span class="year_column">
      2021
     </span>
     <b><a href="/title/tt2076822/">Chaos Walking</a></b>
     (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
     <br/>
     Todd Hewitt
     </div>, <div class="filmo-row even" id="actor-tt9130508">
     <span class="year_column">
      2020/I
     </span>
     <b><a href="/title/tt9130508/">Cherry</a></b>
     (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
     <br/>
     Nico Walker
     </div>, <div class="filmo-row odd" id="actor-tt7395114">
     <span class="year_column">
      2020
     </span>
     <b><a href="/title/tt7395114/">The Devil All the Time</a></b>
     (<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
     <br/>
     Arvin Russell
     </div>, <div class="filmo-row even" id="actor-tt7146812">
     <span class="year_column">
      2020/I
     </span>
     <b><a href="/title/tt7146812/">Onward</a></b>
     <br/>
     Ian Lightfoot (voice)
     </div>, <div class="filmo-row odd" id="actor-tt6673612">
     <span class="year_column">
      2020
     </span>
     <b><a href="/title/tt6673612/">Dolittle</a></b>
     <br/>
     Jip (voice)
     </div>
     '''

    soup = BeautifulSoup(html, 'html.parser')
    # Select the rows that carry both the "filmo-row" and "odd" classes
    divs = soup.select('div.filmo-row.odd')
    for div in divs:
        # Direct text children only; the length check skips the short
        # whitespace and parenthesis fragments around the tags
        text = div.find_all(text=True, recursive=False)
        print(*[t.strip() for t in text if len(t) > 3])

Prints:

    Peter Parker / Spider-Man
    Todd Hewitt
    Arvin Russell
    Jip (voice)
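The selector div.filmo-row.odd matches only the rows marked "odd", which is why four roles are printed. If every row is wanted, one possible variation (a sketch, not the only way) is to select div.filmo-row and keep the last non-empty direct text node of each row. SAMPLE_HTML and role_names are hypothetical names for this example, and the approach assumes the role is always the last piece of text in a row, as it is in the markup from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt: three rows copied from the markup in the question.
SAMPLE_HTML = """
<div class="filmo-row odd" id="actor-tt2076822">
 <span class="year_column">
  2021
 </span>
 <b><a href="/title/tt2076822/">Chaos Walking</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Todd Hewitt
</div>
<div class="filmo-row even" id="actor-tt9130508">
 <span class="year_column">
  2020/I
 </span>
 <b><a href="/title/tt9130508/">Cherry</a></b>
 (<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
 <br/>
 Nico Walker
</div>
<div class="filmo-row odd" id="actor-tt6673612">
 <span class="year_column">
  2020
 </span>
 <b><a href="/title/tt6673612/">Dolittle</a></b>
 <br/>
 Jip (voice)
</div>
"""


def role_names(html):
    """Return the role text of every filmo-row, odd and even (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    roles = []
    for div in soup.select("div.filmo-row"):
        # Direct child text nodes only, stripped, with empty strings dropped.
        texts = [t.strip() for t in div.find_all(string=True, recursive=False)]
        texts = [t for t in texts if t]
        if texts:
            # Assumption: the role is the last non-empty text in the row.
            roles.append(texts[-1])
    return roles


print(role_names(SAMPLE_HTML))
```

Filtering on the stripped text instead of a fixed length also keeps the result independent of how the HTML happens to be indented.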