
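How to speed up reading a large file line by line in Go

I'm trying to find the fastest way to read a large file line by line and check whether each line contains a string. The file I'm testing with is about 680 MB: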


    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        // Check the error before using f.
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if strings.Contains(scanner.Text(), "Iforgotmypassword") {
                fmt.Println(scanner.Text())
            }
        }
    }

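After building the program and timing it on my machine, it runs for over 3 seconds:

    ./speed  3.13s user 1.25s system 122% cpu 3.563 total

After increasing the buffer

    buf := make([]byte, 64*1024)
    scanner.Buffer(buf, bufio.MaxScanTokenSize)

it gets a little better:

    ./speed  2.47s user 0.25s system 104% cpu 2.609 total

I know it can get better, because other tools manage it in under a second without any kind of indexing. What seems to be the bottleneck with this approach?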


0.33s user 0.14s system 94% cpu 0.501 total


喵喔喔
3 Answers

元芳怎么了

Last edit

This is a "line-by-line" solution to the problem that takes a trivial amount of time, and it prints the entire matching line.

    package main

    import (
        "bytes"
        "fmt"
        "io/ioutil"
    )

    func main() {
        dat, err := ioutil.ReadFile("./jumble.txt")
        if err != nil {
            panic(err)
        }
        i := bytes.Index(dat, []byte("Iforgotmypassword"))
        if i != -1 {
            // Line start: one past the previous newline (0 if there is none).
            x := bytes.LastIndexByte(dat[:i], '\n') + 1
            // Line end: the next newline, or the end of the data.
            y := bytes.IndexByte(dat[i:], '\n')
            if y == -1 {
                y = len(dat)
            } else {
                y += i
            }
            fmt.Println(string(dat[x:y]))
        }
    }

    real    0m0.421s
    user    0m0.068s
    sys     0m0.352s

Original answer

If you only need to see whether the string is in the file, why not use a regular expression? Note: I keep the data as a byte slice rather than converting it to a string.

    package main

    import (
        "fmt"
        "io/ioutil"
        "regexp"
    )

    var regex = regexp.MustCompile(`Ilostmypassword`)

    func main() {
        dat, _ := ioutil.ReadFile("./jumble.txt")
        if regex.Match(dat) {
            fmt.Println("Yes")
        }
    }

jumble.txt is an 859 MB jumble of text containing newlines. Running time ./code I get:

    real    0m0.405s
    user    0m0.064s
    sys     0m0.340s

To try to answer your comment: I don't think the bottleneck inherently comes from searching line by line. Go uses an efficient algorithm to search strings/runes.

I think the bottleneck comes from the IO reads. When the program reads from a file, it is usually not first in line in the read queue, so it has to wait until it can read before the actual comparison can start. So when you read over and over again, you are forced to wait your turn for IO each time.

To give you some math: if your buffer size is 64 * 1024 (65,536 bytes) and your file is 1 GB, then 1 GB / 65,536 bytes comes to roughly 15,259 reads to check the entire file. In my approach, I read the whole file "at once" and check the resulting array.

The other thing I can think of is the total number of loop iterations needed to walk the file and the time each iteration takes. Given the following code:

    dat, _ := ioutil.ReadFile("./jumble.txt")
    sdat := bytes.Split(dat, []byte{'\n'})
    for _, l := range sdat {
        if bytes.Equal([]byte("Iforgotmypassword"), l) {
            fmt.Println("Yes")
        }
    }

I calculated that each iteration takes 32 nanoseconds on average. The string Iforgotmypassword is on line 100000000 of my file, so the execution time of this loop is about 32 ns * 100000000 ~= 3.2 seconds.
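The index-then-expand idea from the last edit stops at the first hit. A minimal sketch of how it might be extended to collect every matching line from an in-memory buffer follows. The helper name matchingLines and the sample data are made up for illustration, and it assumes the search string contains no newline.

```go
package main

import (
	"bytes"
	"fmt"
)

// matchingLines returns every line of data that contains needle.
// Hypothetical helper for illustration; it assumes needle has no '\n'.
// The returned slices alias data rather than copying it.
func matchingLines(data, needle []byte) [][]byte {
	var lines [][]byte
	if len(needle) == 0 {
		return lines
	}
	pos := 0
	for pos < len(data) {
		i := bytes.Index(data[pos:], needle)
		if i == -1 {
			break
		}
		i += pos
		// Line start: one past the previous newline (0 if there is none).
		start := bytes.LastIndexByte(data[:i], '\n') + 1
		// Line end: the next newline, or the end of the data.
		end := bytes.IndexByte(data[i:], '\n')
		if end == -1 {
			end = len(data)
		} else {
			end += i
		}
		lines = append(lines, data[start:end])
		pos = end + 1 // resume the search after this line
	}
	return lines
}

func main() {
	// In real use, data would come from os.ReadFile (Go 1.16+) or ioutil.ReadFile.
	data := []byte("hunter2\nIforgotmypassword1\nletmein\nxxIforgotmypassword\n")
	for _, l := range matchingLines(data, []byte("Iforgotmypassword")) {
		fmt.Printf("%s\n", l)
	}
}
```

Because the search jumps from match to match with bytes.Index, lines without a match are never split out individually, which is where the per-line loop spends its time.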

暮色呼如

H. Ross's answer is awesome, but it reads the whole file into memory, which may not be feasible if your file is too large. If you still want to scan line by line, perhaps because you are searching for multiple items, I found that using scanner.Bytes() instead of scanner.Text() improved the speed slightly on my machine, from 2.244s for the code in the original question to 1.608s. bufio's scanner.Bytes() method doesn't allocate any additional memory, whereas Text() creates a string from its buffer.

    package main

    import (
        "bufio"
        "bytes"
        "fmt"
        "os"
    )

    // uses scanner.Bytes to avoid allocation.
    func main() {
        f, err := os.Open("./crackstation-human-only.txt")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        toFind := []byte("Iforgotmypassword")
        for scanner.Scan() {
            if bytes.Contains(scanner.Bytes(), toFind) {
                // Text() allocates, but only for lines that match.
                fmt.Println(scanner.Text())
            }
        }
        if err := scanner.Err(); err != nil {
            panic(err)
        }
    }

慕侠2389804

Using my own 700 MB test file with your original code, the time was just over 7 seconds.

With grep it was 0.49 seconds.

With this program (it doesn't print the line, it just says yes) it was 0.082 seconds.

    package main

    import (
        "bytes"
        "fmt"
        "io/ioutil"
        "os"
    )

    func check(e error) {
        if e != nil {
            panic(e)
        }
    }

    func main() {
        // The search string is taken from the first command-line argument.
        find := []byte(os.Args[1])
        dat, err := ioutil.ReadFile("crackstation-human-only.txt")
        check(err)
        if bytes.Contains(dat, find) {
            fmt.Print("yes")
        }
    }