Trying to remove HTML formatting from the source column in a DataFrame

3 Answers

手掌心

You can get the URL by first turning the tag into a BeautifulSoup object. If it is already a BeautifulSoup object, you can call .find("a").get("href") on it directly; if not, create one first.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    a_tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
    soup = BeautifulSoup(a_tag, "html5lib")  # pip install html5lib
    print(soup.find("a").get("href"))
    # output -> http://twitter.com/download/iphone

Then use this function to strip the HTML tags so that only the text remains.

    import re

    def remove_html_tags(raw_html):
        # Non-greedy match of anything between "<" and ">"
        cleanr = re.compile("<.*?>")
        clean_text = re.sub(cleanr, '', raw_html)
        return clean_text

    output = remove_html_tags(a_tag)
    print(output)
    # output -> Twitter for iPhone
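If installing BeautifulSoup is not an option, the same two steps (read the href, keep only the text) can be sketched with the standard library's html.parser. This swaps in a different technique from the answer above; the names AnchorParser and split_source are hypothetical, and the sketch assumes each value contains at most one link of interest.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects the href of the first <a> tag and all visible text."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        # Called for text found outside of tags
        self.text_parts.append(data)


def split_source(raw_html):
    """Return (href, text) for one HTML snippet (hypothetical helper)."""
    parser = AnchorParser()
    parser.feed(raw_html)
    parser.close()
    return parser.href, "".join(parser.text_parts).strip()


sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    "plain text without markup",
]
results = [split_source(s) for s in sources]
print(results)
```

With a pandas DataFrame, a function like this could be passed to Series.apply, for example df["source"].apply(split_source), assuming the column is named source.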


BIG阳

You can use the Python urlextract module to extract URLs from any string:

    from urlextract import URLExtract

    text = '''<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'''
    text = text.replace(' ', '').replace('=', '')
    extractor = URLExtract()
    print(extractor.find_urls(text))

Output:

    ['http://twitter.com/download/iphone']


慕姐4208626

You can split the string on the double-quote character (") and take the second element:

    .split('"')[1]

https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split
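A minimal sketch of the split approach, with a guard for values that contain no quoted segment. The helper name first_quoted is hypothetical, and the approach assumes href is the first quoted attribute in the tag; if another attribute came first, its value would be returned instead.

```python
def first_quoted(value):
    """Return the text between the first pair of double quotes, or None."""
    parts = value.split('"')
    # A string with a complete quoted segment splits into at least 3 parts
    return parts[1] if len(parts) >= 3 else None


a_tag = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(first_quoted(a_tag))
print(first_quoted("no quotes here"))
```

The length check avoids the IndexError that a bare .split('"')[1] raises on a string without any double quotes.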
