
An efficient way to remove all non-alphanumeric characters from large text

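I need to process a large volume of text, and one of the steps is to remove all non-alphanumeric characters. I am trying to find an efficient way to do this.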


So far I have two functions:


func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) < 0 {
            return r
        }
        return -1
    }, str)
}

Here I actually have to supply a string containing every non-alphanumeric character that should be removed.
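One way to avoid listing every unwanted character is to pass strings.Map a predicate that describes what to keep instead. The following is only a sketch: the names keepASCIIAlnum and stripMapPredicate are made up for illustration, and it assumes the keep set is ASCII letters, digits, and the space character, matching the regular expression used later in the question.

```go
package main

import (
	"fmt"
	"strings"
)

// keepASCIIAlnum (hypothetical helper) reports whether r is an ASCII
// letter, an ASCII digit, or a space.
func keepASCIIAlnum(r rune) bool {
	return ('a' <= r && r <= 'z') ||
		('A' <= r && r <= 'Z') ||
		('0' <= r && r <= '9') ||
		r == ' '
}

// stripMapPredicate (hypothetical name) drops every rune that the
// predicate rejects. Returning -1 from the mapping function tells
// strings.Map to omit that rune from the result.
func stripMapPredicate(s string) string {
	return strings.Map(func(r rune) rune {
		if keepASCIIAlnum(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	fmt.Println(stripMapPredicate("Hello, World! 123")) // prints "Hello World 123"
}
```

Whether this is faster than the strings.IndexRune version would need to be measured on the real data.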


And a plain old regular expression:


func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}
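Note that stripRegex compiles the pattern on every call. A common adjustment is to compile it once at package level with regexp.MustCompile, which also panics at startup on a bad pattern instead of silently discarding the error. The sketch below uses the made-up names nonAlnum and stripRegexPrecompiled; it has not been benchmarked here, so how much of the gap it closes is an open question.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonAlnum is compiled once, when the package is initialized.
// The name is hypothetical; the pattern is the one from the question.
var nonAlnum = regexp.MustCompile(`[^a-zA-Z0-9 ]+`)

// stripRegexPrecompiled (hypothetical name) reuses the compiled pattern
// instead of calling regexp.Compile on every invocation.
func stripRegexPrecompiled(in string) string {
	return nonAlnum.ReplaceAllString(in, "")
}

func main() {
	fmt.Println(stripRegexPrecompiled("Hello, World! 123")) // prints "Hello World 123"
}
```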

The regular expression version appears to be much slower:


BenchmarkStripMap-8        30000     37907 ns/op     8192 B/op     2 allocs/op
BenchmarkStripRegex-8      10000    131449 ns/op    57552 B/op    35 allocs/op

I am looking for suggestions. Is there a better way to do this, or a way to improve the functions above?


沧海一幻觉
2 answers

不负相思意

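Because every surviving rune is less than utf8.RuneSelf, this problem can be solved by operating on bytes. Any byte outside the set [a-zA-Z0-9 ] is part of a rune to remove.

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

A variation of this function preallocates the result by calling result.Grow:

func strip(s string) string {
    var result strings.Builder
    result.Grow(len(s))
    ...

This ensures that the function makes a single memory allocation, but if the ratio of surviving runes to source runes is low, that allocation may be much larger than needed.

The strip function in this answer is written to work with string argument and result types because those are the types used in the question.

If the application is working with a []byte of source text and is allowed to modify that source text, then it is more efficient to update the []byte in place. To do this, copy the surviving bytes to the start of the slice and reslice when done. This avoids the memory allocations and the overhead of strings.Builder. This variation is similar to peterSO's answer to this question.

func strip(s []byte) []byte {
    n := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[n] = b
            n++
        }
    }
    return s[:n]
}

Depending on the actual data used, one of the approaches in this answer may be faster than the approaches in the question.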
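To make the byte-level argument concrete, here is a small sketch that runs the strings.Builder version of strip on input containing multi-byte UTF-8 characters. In UTF-8, every byte of a multi-byte encoding is at least utf8.RuneSelf (0x80), so none of those bytes can match the ASCII ranges and the whole rune is dropped. The sample input is an assumption chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// strip keeps ASCII letters, digits, and spaces, working byte by byte.
// It repeats the strings.Builder version so the example is self-contained.
func strip(s string) string {
	var result strings.Builder
	for i := 0; i < len(s); i++ {
		b := s[i]
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			result.WriteByte(b)
		}
	}
	return result.String()
}

func main() {
	// Both bytes of the two-byte encoding of U+00E9 are >= utf8.RuneSelf.
	e := "\u00e9"
	for i := 0; i < len(e); i++ {
		fmt.Println(e[i] >= utf8.RuneSelf) // prints "true" twice
	}

	// Illustrative input: "H", U+00E9, "llo, ", U+4E16 U+754C, " 42!".
	in := "H\u00e9llo, \u4e16\u754c 42!"
	out := strip(in)
	fmt.Printf("%q\n", out)            // prints "Hllo  42"
	fmt.Println(utf8.ValidString(out)) // prints "true"
}
```

The output keeps both spaces around the removed characters, so runs of spaces can appear where non-ASCII words were dropped.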

墨色风雨

"An efficient way to remove all non-alphanumeric characters from large text."

In Go, "an efficient way" means that we run a Go testing package benchmark.

Your description of large text is vague. Let's assume that it starts as text from a file or some other byte slice.

You might have the overhead of a string([]byte) conversion, several make([]byte) allocations, and another string([]byte) conversion.

You could use strings.Builder to reduce that to a string([]byte) and several make([]byte).

You could reduce it further to a single string([]byte) by starting with a clean([]byte) string function.

For example,

func clean(s []byte) string {
    j := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[j] = b
            j++
        }
    }
    return string(s[:j])
}

For a large text, the complete works of Shakespeare as a []byte,

$ go fmt && go test strip_test.go -bench=. -benchmem
BenchmarkSendeckyMap-8         20     65988121 ns/op    11730958 B/op      2 allocs/op
BenchmarkSendeckyRegex-8        5    242834302 ns/op    40013144 B/op    130 allocs/op
BenchmarkThunder-8            100     21791532 ns/op    34682926 B/op     43 allocs/op
BenchmarkPeterSO-8            100     16172591 ns/op     5283840 B/op      1 allocs/op
$

strip_test.go:

package main

import (
    "io/ioutil"
    "regexp"
    "strings"
    "testing"
)

// stripMap keeps only the runes that appear in chr.
func stripMap(str, chr string) string {
    return strings.Map(func(r rune) rune {
        if strings.IndexRune(chr, r) >= 0 {
            return r
        }
        return -1
    }, str)
}

var alphanum = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

func BenchmarkSendeckyMap(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        stripMap(strShakespeare, alphanum)
    }
}

func stripRegex(in string) string {
    reg, _ := regexp.Compile("[^a-zA-Z0-9 ]+")
    return reg.ReplaceAllString(in, "")
}

func BenchmarkSendeckyRegex(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        stripRegex(strShakespeare)
    }
}

func strip(s string) string {
    var result strings.Builder
    for i := 0; i < len(s); i++ {
        b := s[i]
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            result.WriteByte(b)
        }
    }
    return result.String()
}

func BenchmarkThunder(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        strShakespeare = string(bytShakespeare)
        strip(strShakespeare)
    }
}

// clean filters s in place and converts the surviving prefix to a string.
func clean(s []byte) string {
    j := 0
    for _, b := range s {
        if ('a' <= b && b <= 'z') ||
            ('A' <= b && b <= 'Z') ||
            ('0' <= b && b <= '9') ||
            b == ' ' {
            s[j] = b
            j++
        }
    }
    return string(s[:j])
}

func BenchmarkPeterSO(b *testing.B) {
    for N := 0; N < b.N; N++ {
        b.StopTimer()
        bytShakespeare := []byte(strShakespeare)
        b.StartTimer()
        clean(bytShakespeare)
    }
}

var strShakespeare = func() string {
    // The Complete Works of William Shakespeare by William Shakespeare
    // http://www.gutenberg.org/files/100/100-0.txt
    data, err := ioutil.ReadFile(`/home/peter/shakespeare.100-0.txt`)
    if err != nil {
        panic(err)
    }
    return string(data)
}()
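One consequence of the in-place design is worth spelling out: clean overwrites the front of the slice it is given, which is why the benchmark builds a fresh []byte for every iteration. The sketch below illustrates the side effect and one possible way for a caller to keep the original bytes; the sample input and the copy step are assumptions for illustration, not part of the answer.

```go
package main

import "fmt"

// clean is the in-place filter from the answer, repeated here so the
// example is self-contained.
func clean(s []byte) string {
	j := 0
	for _, b := range s {
		if ('a' <= b && b <= 'z') ||
			('A' <= b && b <= 'Z') ||
			('0' <= b && b <= '9') ||
			b == ' ' {
			s[j] = b
			j++
		}
	}
	return string(s[:j])
}

func main() {
	// Calling clean directly overwrites the start of the input slice;
	// the bytes past the kept prefix are left as they were.
	src := []byte("a-b-c")
	fmt.Println(clean(src), string(src)) // prints "abc abc-c"

	// A caller that still needs the original can filter a copy instead,
	// at the cost of one extra allocation.
	orig := []byte("a-b-c")
	cp := append([]byte(nil), orig...)
	fmt.Println(clean(cp), string(orig)) // prints "abc a-b-c"
}
```

If the source bytes are not needed afterwards, skipping the copy keeps the single allocation reported for BenchmarkPeterSO.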