Python requests.get does not return the text inside one of the tags of the HTML document

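I am trying to parse job descriptions from Djinni for a personal project. I am using Python 3.6, BeautifulSoup4, and the requests library. When I use requests.get to fetch the HTML of a vacancy page, the returned HTML lacks the most important part: the description text. For example, take the URL of this page (example) and the following code that I wrote: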

import requests
from bs4 import BeautifulSoup


def scrape_job_desc(self, url):
    job_desc_html = self._get_search_page_html(url)
    soup = BeautifulSoup(job_desc_html, features='html.parser')
    try:
        short_desc = str(soup.find('p', {'class': 'job-teaser svelte-a3rpl2'}).getText())
        full_desc = soup.find('div', {'class': 'job-description-wrapper svelte-a3rpl2'}).find('p').getText()
    except AttributeError:
        # soup.find() returned None for at least one of the lookups
        short_desc = None
        full_desc = None
    return short_desc, full_desc


def _get_search_page_html(self, url):
    html = requests.get(url=url, headers={'User-Agent': 'Mozilla/5.0 CK={} (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
    return html.text

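It returns short_desc but not full_desc. Moreover, the text of the required <p> tag is not present in the HTML at all. But when I download the page with a browser, everything is there. What causes this?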


眼眸繁星
2 answers

动漫人物

The full job description is stored inside the page as a JavaScript variable. You can extract it with selenium or with the re module:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://djinni.co/jobs2/144172-data-scientist'

html_data = requests.get(url).text

full_desc = re.search(r'fullDescription:"(.*?)",', html_data).group(1).replace(r'\r\n', '\n')
short_desc = BeautifulSoup(html_data, 'html.parser').select_one('.job-teaser').get_text()

print(short_desc)
print('-' * 80)
print(full_desc)

Prints:

Together Networks is looking for an experienced Data Scientist to join our Agile team. Together Networks is a worldwide leader in the online dating niche with millions of users across more than 45 countries. Our brands are BeNaughty, CheekyLovers, Flirt, Click&Flirt, Flirt Spielchen.
--------------------------------------------------------------------------------
What you get to deal with:
- Active collaboration with stakeholders throughout the organization;
- User experience modelling;
- Advanced segmentation;
- User behavior analytics;
- Anomaly detection, fraud detection;
- Looking for bottlenecks;
- Churn prediction.

You need to have (required):
- Master's or PHD in Statistics, Mathematics, Computer Science or another quantitative field;
- 2+ years of experience manipulating data sets and building statistical models;
- Strong knowledge in a wide range of machine learning methods and algorithms for classification, regression, clustering, and others;
- Knowledge and experience in statistical and data mining techniques;
- Experience using statistical computer languages (Python, SLQ, etc.) to manipulate data and draw insights from large data sets.
- Knowledge of a variety of machine learning techniques and their real-world advantages\u002Fdrawbacks;
- Experience visualizing\u002Fpresenting insights for stakeholders;
- Independent, creative thinking, and ability to learn fast.
Would be a great plus:
- Experience dealing with end to end machine learning projects: data exploration, feature engineering\u002Fdefinition, model building, production, maintenance;
- Experience in data visualization with Tableau;
- Experience in dating, game dev, social projects.
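The printed output above still contains JavaScript escapes such as \u002F. As a minimal sketch, and assuming the embedded string literal only uses escapes that are also valid in JSON, the captured text can be decoded with json.loads. The SAMPLE_HTML fragment and the extract_full_description helper are hypothetical stand-ins for illustration, not part of the original answer:

```python
import json
import re

# Hypothetical page fragment standing in for requests.get(url).text
SAMPLE_HTML = r'<script>var data = {fullDescription:"Pros\u002Fcons:\r\n- fast\r\n- small",id:1};</script>'


def extract_full_description(html_data):
    """Return the decoded fullDescription value, or None if it is absent."""
    match = re.search(r'fullDescription:"(.*?)",', html_data)
    if match is None:
        return None
    raw = match.group(1)
    # Assumption: the literal contains only JSON-compatible escapes,
    # so wrapping it in quotes yields a valid JSON string.
    text = json.loads('"' + raw + '"')
    # Normalize Windows line endings after decoding
    return text.replace('\r\n', '\n')


print(extract_full_description(SAMPLE_HTML))
```

If the literal contains escapes that JSON does not allow, json.loads raises json.JSONDecodeError, so real pages may need extra handling.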

富国沪深

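This is a typical mistake in web scraping. You probably looked at the source of the HTML as rendered in the browser and tried to get the p inside the job-description-wrapper div. However, if you load just the page itself (the first request the browser makes) and look at its content, you will find that the paragraph is not there initially. A script fills in its content afterwards, but this happens so quickly that you as a user will hardly notice it. Check this output:

print(requests.get(url='https://djinni.co/jobs2/144172-data-scientist').text)

That is what causes the problem. How to solve it is another matter. One approach is to use a headless browser from Python, which runs the scripts after the page loads, so that you fetch the content only once the page has finished loading everything. You can look at a tool like selenium for this.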
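To make the check on the raw response more concrete, here is a small diagnostic sketch that uses only the standard library. It reports whether a piece of text appears in the ordinary markup, only inside a <script> element, or not at all. The TextLocator class, the locate function, and the SAMPLE_HTML fragment are all hypothetical and only illustrate the idea; in practice you would pass requests.get(url).text instead of the sample:

```python
from html.parser import HTMLParser


class TextLocator(HTMLParser):
    """Track whether a needle occurs in markup text or inside <script>."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.in_script = False
        self.found_in_markup = False
        self.found_in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.needle in data:
            if self.in_script:
                self.found_in_script = True
            else:
                self.found_in_markup = True


def locate(html, needle):
    """Return 'markup', 'script', or 'missing' for the given needle."""
    parser = TextLocator(needle)
    parser.feed(html)
    parser.close()
    if parser.found_in_markup:
        return "markup"
    if parser.found_in_script:
        return "script"
    return "missing"


# Hypothetical stand-in for the raw HTML returned by requests.get(url).text
SAMPLE_HTML = """
<html><body>
<p class="job-teaser">Short teaser text</p>
<div class="job-description-wrapper"></div>
<script>var job = {fullDescription:"Full description text",id:1};</script>
</body></html>
"""

print(locate(SAMPLE_HTML, "Short teaser text"))
print(locate(SAMPLE_HTML, "Full description text"))
print(locate(SAMPLE_HTML, "Not on the page"))
```

A result of "script" suggests the text is embedded as data and filled in by JavaScript, which is when a regular expression or a headless browser becomes necessary.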