How to replace all HTML tags with an empty string in Golang

4 Answers

波斯汪

For anyone landing here looking for a quick solution, there is a library that does this: bluemonday.

The bluemonday package provides a way to describe a whitelist of HTML elements and attributes as a policy, and to apply that policy to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist are stripped.

```go
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	// Do this once for each unique policy, and use the policy for the life of the program.
	// Policy creation/editing is not safe to use in multiple goroutines.
	p := bluemonday.StripTagsPolicy()

	// The policy can then be used to sanitize lots of input, and it is safe
	// to use the policy in multiple goroutines.
	html := p.Sanitize(
		`<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
	)

	// Output:
	// Google
	fmt.Println(html)
}
```

https://play.golang.org/p/jYARzNwPToZ


蝴蝶刀刀

The problem with RegEx

This is a very simple RegEx replace method that removes HTML tags from well-formed HTML in a string.

strip_html_regex.go

```go
package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
	r := regexp.MustCompile(regex)
	return r.ReplaceAllString(s, "")
}
```

Note: this does not work with malformed HTML. Don't use this.

A better way

Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the parts that are not inside an HTML tag. Whenever we identify a valid part of the string, we can simply take a slice of that part and append it using a strings.Builder.

strip_html.go

```go
package main

import (
	"strings"
	"unicode/utf8"
)

const (
	htmlTagStart = 60 // Unicode `<`
	htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
	// Set up a string builder and allocate enough memory for the new string.
	var builder strings.Builder
	builder.Grow(len(s) + utf8.UTFMax)

	in := false // True if we are inside an HTML tag.
	start := 0  // The index of the previous start tag character `<`
	end := 0    // The index of the previous end tag character `>`

	for i, c := range s {
		// If this is the last character and we are not in an HTML tag, save it.
		if (i+1) == len(s) && end >= start {
			builder.WriteString(s[end:])
		}

		// Keep going if the character is not `<` or `>`
		if c != htmlTagStart && c != htmlTagEnd {
			continue
		}

		if c == htmlTagStart {
			// Only update the start if we are not in a tag.
			// This makes sure we strip out `<<br>` not just `<br>`.
			if !in {
				start = i
				// Write the valid string between the close and start of the two tags.
				// Doing this only when entering a tag avoids writing the same
				// text twice when another `<` appears before the closing `>`.
				builder.WriteString(s[end:start])
			}
			in = true
			continue
		}

		// else c == htmlTagEnd
		in = false
		end = i + 1
	}
	s = builder.String()
	return s
}
```

If we run both functions against the OP's text and some malformed HTML, you can see that the results are not consistent.

main.go

```go
package main

import "fmt"

func main() {
	s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	res := stripHtmlTags(s)
	fmt.Println(res)

	// Malformed HTML examples
	fmt.Println("\n:: stripHTMLTags ::\n")

	fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
	fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
	fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
	fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

	// Regex Malformed HTML examples
	fmt.Println(":: stripHtmlRegex ::\n")

	fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
	fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
	fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
	fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}
```

Output:

```
afsdf4534534!@@!!#345345afsdf4534534!@@!!#

:: stripHTMLTags ::

Do something bold.
I broke this
This is broken link.
start this tag
:: stripHtmlRegex ::

Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.
```

Note: the RegEx method does not strip all HTML tags consistently. To be honest, I'm not good enough at RegEx to write a match string that handles stripping HTML properly.

Benchmarks

Besides being safer and more aggressive at stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.

```
> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8          51516             22726 ns/op
BenchmarkStripHtmlTags-8          230678              5135 ns/op
```
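One caveat when reading the benchmark: stripHtmlRegex calls regexp.MustCompile on every invocation, so part of its measured time is probably pattern compilation rather than matching. A minimal sketch of a variant that compiles the pattern once at package level follows. The names tagPattern and stripHTMLRegexCached are hypothetical and not part of the original answer, and the benchmark has not been re-run with this variant, so no claim is made about how the numbers would change.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern (hypothetical name) holds the same pattern as above,
// compiled once when the package is initialised instead of on every call.
var tagPattern = regexp.MustCompile(`<.*?>`)

// stripHTMLRegexCached (hypothetical name) behaves like stripHtmlRegex
// but reuses the precompiled pattern.
func stripHTMLRegexCached(s string) string {
	return tagPattern.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripHTMLRegexCached("Do something <strong>bold</strong>."))
	fmt.Println(stripHTMLRegexCached("h1>I broke this</h1>"))
}
```

This only changes where the pattern is compiled; it still fails on malformed HTML in the same way as the original regex version.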


萧十郎

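If you want to replace all HTML tags, use a package that strips HTML tags. Using a regular expression to match HTML tags is not a good idea.

```go
package main

import (
	"fmt"

	// Explicit alias so the identifier used below is unambiguous.
	strip "github.com/grokify/html-strip-tags-go"
)

func main() {
	text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

	stripped := strip.StripTags(text)

	fmt.Println(text)
	fmt.Println(stripped)
}
```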


至尊宝的传说

We tried this in production, and in some edge cases none of the proposed solutions really worked. If you need something robust, check the unexported function in Go's internal library (the html-strip-tags-go pkg basically exports it, under the BSD-3 license). Alternatively, https://github.com/microcosm-cc/bluemonday is a very popular library (also BSD-3), and it is what we ended up using.

=================================================

The only difference here comes from how len evaluates strings containing UTF-8 characters: it counts bytes, so each character contributes between 1 and 4. len("è") therefore actually evaluates to 2. To solve this, we convert the string to a []rune.

https://go.dev/play/p/xo7Mrx5qw-_J

```go
package main

import (
	"strings"
	"unicode/utf8"
)

const (
	htmlTagStart = 60 // Unicode `<`
	htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHTMLTags(s string) string {
	// Supports UTF-8, since some characters take more than 1 byte, e.g. len("è") -> 2
	d := []rune(s)

	// Set up a string builder and allocate enough memory for the new string.
	var builder strings.Builder
	builder.Grow(len(s) + utf8.UTFMax)

	in := false // True if we are inside an HTML tag.
	start := 0  // The index of the previous start tag character `<`
	end := 0    // The index of the previous end tag character `>`

	// i, start and end are rune indices into d, so all slicing is done
	// on d rather than on the byte-indexed string s.
	for i, c := range d {
		// If this is the last character and we are not in an HTML tag, save it.
		if (i+1) == len(d) && end >= start {
			builder.WriteString(string(d[end:]))
		}

		// Keep going if the character is not `<` or `>`
		if c != htmlTagStart && c != htmlTagEnd {
			continue
		}

		if c == htmlTagStart {
			// Only update the start if we are not in a tag.
			// This makes sure we strip out `<<br>` not just `<br>`.
			if !in {
				start = i
				// Write the valid text between the close and start of the two tags,
				// once per tag, so a repeated `<` does not write it again.
				builder.WriteString(string(d[end:start]))
			}
			in = true
			continue
		}

		// else c == htmlTagEnd
		in = false
		end = i + 1
	}
	s = builder.String()
	return s
}
```
