Sudden jump in memory consumption leading to an OutOfMemoryException while processing a huge text file

I need to process a very large text file (6–8 GB). I wrote the code attached below. Unfortunately, every time the output file (created next to the source file) reaches ~2 GB, I observe a sudden jump in memory consumption (from ~100 MB up to several GB) and, as a result, an OutOfMemoryException.


The debugger indicates that the OOM occurs at while ((tempLine = streamReader.ReadLine()) != null). I am targeting only .NET 4.7 and the x64 architecture. A single line is at most 50 characters long.


I could work around this by splitting the source file into smaller parts, processing them without running into trouble, and then merging the results back into a single file, but I would prefer not to do that.


Code:


public async Task PerformDecodeAsync(string sourcePath, string targetPath)
{
    var allLines = CountLines(sourcePath);
    long processedlines = default;
    using (File.Create(targetPath)) ; // create (or truncate) the target file
    var streamWriter = File.AppendText(targetPath);
    var decoderBlockingCollection = new BlockingCollection<string>(1000);
    var writerBlockingCollection = new BlockingCollection<string>(1000);

    var producer = Task.Factory.StartNew(() =>
    {
        using (var streamReader = new StreamReader(File.OpenRead(sourcePath), Encoding.Default, true))
        {
            string tempLine;
            while ((tempLine = streamReader.ReadLine()) != null)
            {
                decoderBlockingCollection.Add(tempLine);
            }
            decoderBlockingCollection.CompleteAdding();
        }
    });

    var consumer1 = Task.Factory.StartNew(() =>
    {
        foreach (var line in decoderBlockingCollection.GetConsumingEnumerable())
        {
            short decodeCounter = 0;
            StringBuilder builder = new StringBuilder();
            foreach (var singleChar in line)
            {
                var positionInDecodeKey = decodingKeysList[decodeCounter].IndexOf(singleChar);

                if (positionInDecodeKey > 0)
                    builder.Append(model.Substring(positionInDecodeKey, 1));
                else
                    builder.Append(singleChar);

                if (decodeCounter > 18)
                    decodeCounter = 0;
                else ++decodeCounter;
            }
            writerBlockingCollection.Add(builder.ToString());
        }
        writerBlockingCollection.CompleteAdding();
    });

    // Writer side of the pipeline, reconstructed from the declared collections
    // (the posted code was cut off at this point).
    var writer = Task.Factory.StartNew(() =>
    {
        foreach (var line in writerBlockingCollection.GetConsumingEnumerable())
        {
            streamWriter.WriteLine(line);
            processedlines++;
        }
    });

    await Task.WhenAll(producer, consumer1, writer);
    streamWriter.Dispose();
}


A solution would be much appreciated, as would any suggestions on how to optimize this further.


青春有我
2 Answers

慕村225694

As I said, I'd probably go with something much simpler first, unless or until it's shown to perform badly. As Adi says in their answer, this work looks I/O-bound, so there seems to be little benefit in creating multiple tasks for it.

public void PerformDecode(string sourcePath, string targetPath)
{
    File.WriteAllLines(targetPath, File.ReadLines(sourcePath).Select(line =>
    {
        short decodeCounter = 0;
        StringBuilder builder = new StringBuilder();
        foreach (var singleChar in line)
        {
            var positionInDecodeKey = decodingKeysList[decodeCounter].IndexOf(singleChar);
            if (positionInDecodeKey > 0)
                builder.Append(model.Substring(positionInDecodeKey, 1));
            else
                builder.Append(singleChar);
            if (decodeCounter > 18)
                decodeCounter = 0;
            else ++decodeCounter;
        }
        return builder.ToString();
    }));
}

Now, of course, this code actually blocks until it has finished, which is why I haven't marked it async. But then so does yours, and it should already have been warning about that.

(You could try using PLINQ instead of LINQ for the Select part, but honestly, the amount of processing we're doing here looks trivial; profile first before applying any such change.)
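For reference, the PLINQ variant mentioned in the parenthetical is a small change. This is only a sketch: DecodeLine is a hypothetical helper standing in for the per-line decode loop, and AsOrdered() is needed so that output lines keep the input order.

```csharp
// Sketch: parallelize only the CPU part of the per-line transformation.
// DecodeLine is a hypothetical string -> string helper wrapping the decode loop.
public void PerformDecodeParallel(string sourcePath, string targetPath)
{
    File.WriteAllLines(targetPath,
        File.ReadLines(sourcePath)
            .AsParallel()   // spread the Select across worker threads
            .AsOrdered()    // preserve the original line order in the output
            .Select(DecodeLine));
}
```

File.WriteAllLines accepts any IEnumerable&lt;string&gt;, so the parallel query can be consumed directly; but as the answer says, if the job is truly I/O-bound this adds overhead without speeding anything up, so measure first.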

qq_笑_17

Since the work you're doing is mostly I/O-bound, you don't really gain anything from parallelizing it. It also seems to me (correct me if I'm wrong) that your transformation algorithm doesn't depend on reading the file line by line, so I would recommend doing something like this instead:

void Main()
{
    //Setup streams for testing
    using (var inputStream = new MemoryStream())
    using (var outputStream = new MemoryStream())
    using (var inputWriter = new StreamWriter(inputStream))
    using (var outputReader = new StreamReader(outputStream))
    {
        //Write test string and rewind stream
        inputWriter.Write("abcdefghijklmnop");
        inputWriter.Flush();
        inputStream.Seek(0, SeekOrigin.Begin);

        var inputBuffer = new byte[5];
        var outputBuffer = new byte[5];
        int inputLength;
        while ((inputLength = inputStream.Read(inputBuffer, 0, inputBuffer.Length)) > 0)
        {
            for (var i = 0; i < inputLength; i++)
            {
                //transform each character
                outputBuffer[i] = ++inputBuffer[i];
            }
            //Write to output
            outputStream.Write(outputBuffer, 0, inputLength);
        }

        //Read for testing
        outputStream.Seek(0, SeekOrigin.Begin);
        var output = outputReader.ReadToEnd();
        Console.WriteLine(output);
        //Outputs: "bcdefghijklmnopq"
    }
}

Obviously, you would use FileStreams instead of MemoryStreams, and you can increase the buffer length to something much larger (this was just a demonstration). Also, since your original method is async, you can use the asynchronous variants of Stream.Write and Stream.Read.
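The asynchronous variant suggested at the end might look like the sketch below. Assumptions are flagged in comments: the byte-increment stands in for the real transformation, and the buffer size and FileStream options are illustrative only.

```csharp
// Sketch of the async, fixed-buffer version (assumed names; adapt to your transform).
public async Task TransformFileAsync(string sourcePath, string targetPath)
{
    var buffer = new byte[81920]; // one fixed-size buffer; size is illustrative
    using (var input = new FileStream(sourcePath, FileMode.Open, FileAccess.Read,
                                      FileShare.Read, 4096, useAsync: true))
    using (var output = new FileStream(targetPath, FileMode.Create, FileAccess.Write,
                                       FileShare.None, 4096, useAsync: true))
    {
        int read;
        while ((read = await input.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            for (var i = 0; i < read; i++)
                buffer[i]++; // placeholder transform, as in the demo above

            await output.WriteAsync(buffer, 0, read);
        }
    }
}
```

Because only one fixed-size buffer is ever allocated, memory usage stays flat no matter how large the file is, which is exactly the property the question is after.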