Extracting data from the Freebase data dump in Python

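After downloading the Freebase Triples data dump (freebase-rdf-latest.gz) from the website, what is the best way to open and read this file in Python to extract information, for example information about companies and businesses?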


As far as I know, there are packages that can do this: open the gz file in Python and read the RDF file. I'm just not sure how to put it together...


My failed attempt (Python 3.6):


import gzip

with gzip.open('freebase-rdf-latest.gz', 'r') as uncompressed_file:
    for line in uncompressed_file.read():
        print(line)

After that, I could use the XML structure to get the information by parsing, but I can't even read the file.


Cats萌萌
1 Answer

慕斯709654

The problem is that calling read() on the gzip file object decompresses the whole file at once and keeps the uncompressed data in memory. For a file this large, a more practical approach is to decompress it a little at a time and stream the result.

#!/usr/bin/env python3
import zlib


def stream_unzipped_bytes(filename):
    """
    Generator function, reads gzip file `filename` and yields
    uncompressed bytes.

    This function answers your original question, how to read the file,
    but its output is a generator of bytes so there's another function
    below to stream these bytes as text, one line at a time.
    """
    with open(filename, 'rb') as f:
        wbits = zlib.MAX_WBITS | 16  # 16 requires gzip header/trailer
        decompressor = zlib.decompressobj(wbits)
        fbytes = f.read(16384)
        while fbytes:
            yield decompressor.decompress(fbytes)
            fbytes = f.read(16384)
        # emit anything still buffered inside the decompressor
        yield decompressor.flush()


def stream_text_lines(gen):
    """
    Generator wrapper function, `gen` is a bytes generator.
    Yields one line of text at a time.
    """
    buf = b''
    for chunk in gen:
        buf += chunk
        lines = buf.splitlines(keepends=True)
        if not lines:
            continue
        # yield all but the last line, because this may still be incomplete
        # and waiting for more data from gen
        for line in lines[:-1]:
            yield line.decode()
        # keep the (possibly incomplete) last line for the next chunk
        buf = lines[-1]
    # yield the final data
    if buf:
        yield buf.decode()


# Sample usage, using the stream_text_lines generator to stream
# one line of RDF text at a time
bytes_generator = stream_unzipped_bytes('freebase-rdf-latest.gz')
for line in stream_text_lines(bytes_generator):
    # do something with `line` of text
    print(line, end='')
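A shorter alternative to the zlib approach, offered as a rough sketch rather than a definitive solution: iterating over the file object returned by gzip.open in text mode also decompresses incrementally, one line at a time. The example below assumes each line of the dump is a tab-separated triple (subject, predicate, object, then a trailing "."), which is not XML. The function name iter_matching_triples and the needle parameter are made up for illustration.

```python
import gzip


def iter_matching_triples(path, needle):
    """Yield (subject, predicate, object) for lines that contain `needle`.

    Assumption: each line is tab-separated as
    subject <TAB> predicate <TAB> object <TAB> .
    """
    # 'rt' opens the archive in text mode; iterating the file object
    # decompresses incrementally instead of loading everything at once.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            # cheap substring filter before doing any splitting
            if needle not in line:
                continue
            parts = line.rstrip('\n').split('\t')
            if len(parts) >= 3:
                yield parts[0], parts[1], parts[2]
```

To look for companies, you could loop over iter_matching_triples('freebase-rdf-latest.gz', 'business.business_operation'). That type string is an assumption about how Freebase labels businesses, so check a few lines of the dump first and adjust it.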