How do I parse a very large file in Python?

I have a log file, "internet.log", that is about 10GB. When I parse it in Python, I get a "MemoryError" exception. The log file looks like this:


Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107

Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123

Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124

Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106

Jun 15 16:26:21 dnsmasq[1979]: query[A] fd-geoycpi-uno.gycpi.b.yahoodns.net from 192.168.1.33

Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106

Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124

Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123

Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107

Jun 15 16:26:23 dnsmasq[1979]: query[A] armdl.adobe.com from 192.168.1.24

I am currently using this function to parse the log file:


def parse():
    Date = []
    IPAddress = []
    DomainsVisited = []
    with open("internet.log", "r") as file:
        content = file.readlines()
        for items in content:
            if 'query[A]' in items:
                getDate(Date, items)
                getIPAddress(IPAddress, items)
                getDomainsVisited(DomainsVisited, items)
    finalResult = [[i, j, k] for i, j, k in zip(Date, IPAddress, DomainsVisited)]
    return display(finalResult)

If I parse a log file of, say, 10MB, the output is displayed, but when I parse the 10GB log file I get the error. How can I fix this? Thank you.


白板的微信
2 answers

qq_遁去的一_1

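You should not use file.readlines(). It reads the entire file into memory at once, which is most likely what exhausts your memory. Instead, iterate over the file object:

with open("internet.log", "r") as file:
    for items in file:
        ...  # process each line here

(Of course, depending on what you do with the data, this can still fail while you iterate over the file.)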
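A minimal sketch of the line-by-line approach, assuming the dnsmasq format shown in the question. The helper name `iter_query_lines` is hypothetical and not part of the original code; it filters as it reads, so only one line is held in memory at a time.

```python
def iter_query_lines(path):
    """Yield only the 'query[A]' lines of the file at `path`, one at a time."""
    # Iterating over the file object reads one line per step,
    # so memory use does not grow with the size of the file.
    with open(path, "r") as log:
        for line in log:
            if "query[A]" in line:
                yield line.rstrip("\n")
```

A caller could then write `for line in iter_query_lines("internet.log"):` and handle each matching line as it arrives.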

守着一只汪

You are reading the whole file into memory with readlines. You can read one line at a time with for items in file. With better variable names and a list comprehension to tidy the code up a little, the result can be produced like this:

with open("internet.log") as log:
    finalResults = [[getDate(line), getIPAddress(line), getDomainsVisited(line)]
                    for line in log
                    if 'query[A]' in line]

I would extract the result into a function:

def parse_log_line(line):
    return [getDate(line),
            getIPAddress(line),
            getDomainsVisited(line)]

Then your code would be:

with open("internet.log") as log:
    finalResults = [parse_log_line(line)
                    for line in log
                    if 'query[A]' in line]
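Note that a list comprehension still keeps every matching row in memory, which may itself be large for a 10GB log. One possible variation is to use a generator and aggregate while streaming. The sketch below is an assumption-laden illustration: this `parse_log_line` is a hypothetical stand-in for the question's getDate/getIPAddress/getDomainsVisited helpers (whose bodies were not shown), and `iter_queries` and `count_domains` are invented names.

```python
from collections import Counter

def parse_log_line(line):
    # Hypothetical stand-in for getDate/getIPAddress/getDomainsVisited,
    # assuming the whitespace-separated dnsmasq layout from the question:
    # "Jun 15 16:26:23 dnsmasq[1979]: query[A] <domain> from <ip>"
    parts = line.split()
    return [" ".join(parts[:3]), parts[7], parts[5]]

def iter_queries(lines):
    """Lazily yield [date, ip, domain] for each 'query[A]' line."""
    return (parse_log_line(line) for line in lines if "query[A]" in line)

def count_domains(lines):
    """Aggregate while streaming, so only the counts are kept in memory."""
    counts = Counter()
    for _date, _ip, domain in iter_queries(lines):
        counts[domain] += 1
    return counts
```

Passing an open file object, as in `count_domains(log)` inside a `with open("internet.log") as log:` block, would process the log one line at a time.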