Extracting text from a series of strings, some enclosed in HTML tags and some without tags

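Based on your question and comments, I think getting the indices of the substrings and then operating on the whole subset of the HTML would do what you need. Let's first create a function that retrieves all indices of a substring (see @AkiRoss's answer):

```python
def findall(p, s):
    """Yield every index at which substring p starts in s."""
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i + 1)
```

Then use it to find the occurrences of `<b>` and `</b>`:

```python
opening_b_occurrences = [i for i in findall('<b>', html)]
# has the value of [21, 40, 58]
closing_b_occurrences = [i for i in findall('</b>', html)]
# has the value of [28, 44, 67]
```

Now you can use that information to take the substring of the HTML that you want to extract text from:

```python
first_br = opening_b_occurrences[0]
last_br = closing_b_occurrences[-1]  # getting the last one from the list
text_inside_br = html[first_br:last_br]
```

The text in `text_inside_br` should now be `'<b>This</b>\n" is "\n<b>a</b>\n" test "\n<b>string'`. Note that the slice starts at the first opening tag and stops just before the last closing tag, so the final `</b>` is missing. You can now clean it up, for example by appending `</b>` back to it and using BeautifulSoup to extract the values, or by just using a regular expression.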
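The regular-expression cleanup mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: the `html` sample is made up (the question's real input is not reproduced here), the tags are taken to be plain `<b>` and `</b>`, and the pattern `<[^>]+>` is a naive tag matcher that is good enough for simple markup like this but not for arbitrary HTML.

```python
import re


def findall(p, s):
    """Yield every index at which substring p starts in s."""
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i + 1)


# Hypothetical sample input, not the HTML from the question.
html = 'intro <b>This</b> is <b>a</b> test <b>string</b> outro'

opening = list(findall('<b>', html))
closing = list(findall('</b>', html))

# Slice from the first opening tag up to (but not including) the last
# closing tag, then append that closing tag back so the fragment is balanced.
fragment = html[opening[0]:closing[-1]] + '</b>'

# Naive cleanup: drop anything that looks like a tag and keep the text.
text = re.sub(r'<[^>]+>', '', fragment)
print(text)  # This is a test string
```

For messier markup (attributes containing `>`, nested tags, entities), passing `fragment` to an HTML parser such as BeautifulSoup is the safer route than a regular expression.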
