
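How to split a string at a certain point without knowing its position

I'm currently pulling a weather forecast from the TFL API. When I extract the JSON for "today's forecast", random symbols show up in the middle of the paragraph; I think they may be formatting coming from the API.

Here is what gets extracted: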


Bank holiday Monday will stay dry with some long sunny spells. Temperatures will remain warm for the time of year.<br/><br/>PM2.5 particle pollution increased rapidly overnight. Increases began across Essex and spread across south London.  Initial chemical analysis suggests that this is composed mainly of wood burning particles but also with some additional particle pollution from agriculture and traffic. This would be consistent with an air flow from the continent where large bonfires are part of the Easter tradition. This will combine with our local emissions today and 'high' PM2.5 is possible.<br/><br/>The sunny periods, high temperatures and east winds will bring additional ozone precursors allowing for photo-chemical generation of ozone to take place. Therefore 'moderate' ozone is likely.<br/><br/>Air pollution should remain 'Low' through the forecast period for the following pollutants:<br/><br/>Nitrogen Dioxide<br/>Sulphur Dioxide.

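This paragraph is more detailed than necessary; the first two sentences are all I need. I thought .split would be a good idea, running it through a for loop until it reaches the string "<br/><br/>PM2.5".

However, I can't be sure that this string will be the same every day, or that the simplified forecast will always be just the first two sentences.

Does anyone have any ideas on how I can solve this?

For reference, here's the code I have so far; it isn't part of anything else yet.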


import urllib.parse
import requests

main_api = "https://api.tfl.gov.uk/AirQuality?"

idno = "1"
# url is built here but not used below; the request goes to main_api
url = main_api + urllib.parse.urlencode({"$id": idno})

json_data = requests.get(main_api).json()

disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)

print()

# Take the forecast text and swap the double line breaks for spaces
today_weather = json_data['currentForecast'][0]['forecastText']
print("Today's forecast: " + today_weather.replace("<br/><br/>"," "))


宝慕林4294392
3 answers

陪伴而非守候

I believe that if you clean out the HTML tags and then tokenize the paragraph with NLTK's sentence tokenizer, you should be good to go.

from nltk.tokenize import sent_tokenize
import urllib.parse
import requests
import re

main_api = "https://api.tfl.gov.uk/AirQuality?"

idno = "1"
url = main_api + urllib.parse.urlencode({"$id": idno})

json_data = requests.get(main_api).json()

disclaimer = json_data['disclaimerText']
print("Disclaimer: " + disclaimer)

print()

# Clean out HTML tags; each tag becomes a space so that the text on either
# side of a <br/> does not fuse into one token such as "year.PM2.5"
today_weather_str = re.sub(r'<.*?>', ' ', json_data['currentForecast'][0]['forecastText'])

# Get the first two sentences out of the list
# (sent_tokenize needs NLTK's Punkt tokenizer data, downloaded once via nltk.download)
today_weather = ' '.join(sent_tokenize(today_weather_str)[:2])
print("Today's forecast: {}".format(today_weather))
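The tag-cleaning step can also be done without a regular expression. Below is a minimal sketch using only the standard library; `TextExtractor` and `strip_tags` are hypothetical names chosen for illustration, and it assumes the forecast only contains simple tags such as <br/>.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and turns <br> tags into spaces."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <br/> tags are also routed here by HTMLParser.
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with tags removed and whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())
```

The cleaned string can then be handed to whichever sentence splitter you prefer.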

慕侠2389804

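If you want a script that isn't explicitly coded for each dataset, you need to find some kind of pattern. If the pattern is that the string you want is always the first two lines, you can use a for loop:

# your_variable_here must be an iterable of lines (iterating over a plain string yields characters)
data = [line for line in your_variable_here]
data = data[:2]

If there seems to be a pattern to the simplified forecast, you could also try a regular expression. Without more information about what the dataset looks like, though, I think this is the best I can come up with.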
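As one possible regular-expression pattern, here is a rough sketch that keeps the first n sentences of text that has already had its HTML tags removed. `first_sentences` is a hypothetical helper, and the splitter is deliberately naive: it assumes a sentence ends with ., ! or ? followed by whitespace, so abbreviations such as "e.g. " would be split incorrectly.

```python
import re

# A sentence boundary: whitespace that directly follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def first_sentences(text, n=2):
    """Return the first n sentences of plain text, joined by single spaces."""
    sentences = SENTENCE_BOUNDARY.split(text.strip())
    return " ".join(sentences[:n])
```

Because "PM2.5" has no whitespace after its dot, it is not treated as a sentence boundary.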

素胚勾勒不出你

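Those "random symbols", &lt;br/&gt;, are an HTML-encoded <br/>, i.e. a line break in HTML, so they look like a reliable thing to split on:

lines = today_weather.split('&lt;br/&gt;')

I think it's reasonable to assume that the first line is what you're after:

short_forecast = lines[0]

Time will tell whether that holds, but you can easily adjust it to include more or less.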
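If you are not sure whether the API hands back the encoded form (&lt;br/&gt;) or a literal <br/>, a small sketch like the following handles both. `first_paragraph` and `BR_TAG` are hypothetical names, and the assumption that the short forecast is everything before the first line break comes from the sample text above, not from any API documentation.

```python
import html
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def first_paragraph(forecast_text):
    """Return the text before the first line break, or '' if there is none."""
    # Decode entities so that &lt;br/&gt; and <br/> are treated alike.
    decoded = html.unescape(forecast_text)
    for chunk in BR_TAG.split(decoded):
        chunk = chunk.strip()
        if chunk:
            return chunk
    return ""
```

For example, `short_forecast = first_paragraph(today_weather)` would replace the split and index steps.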