What is the best way to match text that sits outside an HTML tag in Go?

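I have a bunch of HTML that I'm parsing, and I need to remove <a> links if they contain certain text. Normally I would use Goquery, but the text I'm searching for is often not inside the HTML tag itself. For example, this HTML: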

This is the start.

<a href="http://example.com/path">We don't want to match this text.</a>

<a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"></a> match this text.<a href="blah">We also don't want to match this text</a>

</body></html>

I'm using this regular expression, but it fails and matches text I don't want it to match:

(?is)<a[^>]+href=["'](?P<link>.*?)["']*.?> match this text\.

https://regex101.com/r/iEXpqc/1

小唯快跑啊


1 Answer

回首忆惘然

Like this, using XPath (this is not Go, but the logic can be re-implemented):

xmlstarlet ed -d '//a[contains(text(), "want to match")]' file.html

Output:

<?xml version="1.0"?>
<html>
  <body>This is the start.
  <a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"/> match this text.</body>
</html>

Note: add the -L switch if you want to edit the file in place.
