
StringIO class does not return the expected result in Python 3

Code that works in Python 2 fails in Python 3 with this error:


AttributeError: '_io.StringIO' object has no attribute 'name'

Here is the code:


!pip install warc3-wet

import warc
import requests
from contextlib import closing
from io import StringIO

def get_partial_warc_file(url, num_bytes=1024 * 10):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.read(num_bytes).decode('utf-8'))
    return warc.WARCFile(fileobj=buf, compress=True)

urls = {
    'warc': 'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz',
    'wat':  'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/wat/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.wat.gz',
    'wet':  'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/wet/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.wet.gz'
}

files = {file_type: get_partial_warc_file(url=url) for file_type, url in urls.items()}

The code comes from this post:


https://dmorgan.info/posts/common-crawl-python/
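One Python 2 versus 3 difference bears on get_partial_warc_file above: in Python 3 a .gz download is bytes, io.StringIO accepts only str, and io.BytesIO is the in-memory binary file. The sketch below uses only the standard library (not warc) to show gzip data held in a BytesIO and a truncated gzip prefix being decompressed. The sample data and the helper name decompress_prefix are made up for illustration. Whether warc.WARCFile accepts a BytesIO the same way depends on the installed warc3-wet version and is not verified here.

```python
import gzip
import io
import zlib


def decompress_prefix(partial_gzip: bytes) -> bytes:
    """Decompress as much as possible from a truncated gzip byte string."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return decompressor.decompress(partial_gzip)


# Made-up stand-in for the contents of a .warc.gz file.
original = b"".join(b"record %d: hello\r\n" % i for i in range(2000))
compressed = gzip.compress(original)

# StringIO rejects bytes in Python 3.
try:
    io.StringIO(compressed)
    string_io_error = None
except TypeError as exc:
    string_io_error = exc

# A complete gzip stream can be read from an in-memory binary buffer.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    full = gz.read()

# Only the first half of the stream, like a partial download.
partial = compressed[: len(compressed) // 2]
prefix = decompress_prefix(partial)
```

Here prefix holds the leading part of the original data. The last record in such a prefix may be cut off, so code that parses it has to tolerate an incomplete final record.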


Update:


This code returns only the metadata of each record. How do I read the news article itself?


aws s3 cp --no-sign-request s3://commoncrawl/crawl-data/CC-NEWS/2019/08/CC-NEWS-20190824001636-00982.warc.gz /tmp/


import warc

var = 0

with warc.open("/tmp/CC-NEWS-20190824001636-00982.warc") as f:
    for record in f:
        if var > 1:
            break
        else:
            print(record.date, record.from_response, record.header, record.ip_address, record.offset, record.payload, record.type, record.url, record.write_to)
        var = var + 1


aluckdog
1 answer

蝴蝶刀刀

This code returns the source of each news article along with its metadata.

# wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
# gunzip CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
#!pip install warc3-wet

import warc

# var starts at -10, so the loop prints the first 12 records before stopping.
var = -10

with warc.open("CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc") as f:
    for record in f:
        if var > 1:
            break
        else:
            # record.payload.read() returns the record body; the remaining fields are metadata.
            print(record.payload.read(), record.date, record.from_response, record.header, record.ip_address, record.offset, record.payload, record.type, record.url, record.write_to)
        var = var + 1
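In a WARC record of type response, the payload is the raw HTTP response: a status line and headers, a blank line, then the body, which for a news page is the article HTML. The sketch below separates the two. It assumes payload_bytes is what record.payload.read() returned as bytes. The function name split_http_payload and the sample bytes are hypothetical, and the sketch ignores chunked transfer encoding and compressed bodies.

```python
def split_http_payload(payload_bytes: bytes):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    head, sep, body = payload_bytes.partition(b"\r\n\r\n")
    if not sep:
        # No blank line found: treat the whole payload as the body.
        return "", {}, payload_bytes
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up payload standing in for record.payload.read().
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html><body><p>Article text</p></body></html>"
)
status_line, headers, body = split_http_payload(sample)
html = body.decode("utf-8", errors="replace")
```

Inside the loop above, this would be applied only to records whose record.type is response, since other record types such as warcinfo or request carry no page body.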