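How do I load downloaded and unzipped text files into a pandas DataFrame?

The following code downloads and unzips an archive containing thousands of text files: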


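import io
import zipfile

import requests

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"

res = requests.get(zip_file_url, stream=True)  # request the data
print("fazendo o download...")

# wrap the downloaded bytes in a file-like object so ZipFile can read them
z = zipfile.ZipFile(io.BytesIO(res.content))
print("extraindo os dados")

# extract everything into the current directory
z.extractall("./")
print("ok..")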

How can I load these files into a pandas DataFrame?


繁花不似锦
1 answer

莫回无

See the inline comments in the code for explanations.

The code uses the pathlib module to find the files that have been unzipped.
There are 20 article types, which means the dict of dataframes, dd, has 20 keys.
The value of each key is a dataframe containing all the articles of that type.
Each dataframe has 1000 rows, 1 row per article.
There are 20000 articles in total.
This implementation preserves the shape of the articles. When a row is printed from a dataframe, the article appears in readable form, with line breaks and punctuation.

To create a single dataframe from the individual dataframes:

dfc = pd.concat(dd.values()).reset_index(drop=True)

This is why the 'type' column is added when the dataframes are first created. In the combined dataframe, the article type remains identifiable.

This answers the question of how to load all the files into a dataframe. For further questions about processing the text, please open a new question.

from pathlib import Path
from io import BytesIO
import requests
import pandas as pd
from collections import defaultdict
from zipfile import ZipFile

######################################################################
# download and save zipped files

# location to save files; this creates a pathlib object of the path, and
# pathlib objects have methods, like rglob, parts, and is_file
save_path = Path('data/zipped')

zip_file_url = "https://docsia-temp.s3-sa-east-1.amazonaws.com/docsia-desafio-dataset.zip"
res = requests.get(zip_file_url, stream=True)

with ZipFile(BytesIO(res.content), 'r') as zip_ref:
    zip_ref.extractall(save_path)

######################################################################

# find all the files; the methods in this list comprehension are pathlib methods
files = [file for file in save_path.rglob('*') if file.is_file()]

# dict to save dataframes for each file
dd = defaultdict(list)

for file in files:

    # extract the type of article from the path (the parent folder name)
    article_type = file.parts[-2].replace('.', '_')

    # open the file
    with file.open(mode='r', encoding='utf-8', errors='ignore') as f:
        # read the non-blank lines and combine them into one string inside a list
        text = [' '.join([line for line in f.readlines() if line.strip()])]

    # create a one-row dataframe from the text
    df = pd.DataFrame(text, columns=['article'])

    # add a column for the article type
    df['type'] = article_type

    # add the dataframe to the default dict
    dd[article_type].append(df.copy())

# each value of the dict is a list of dataframes; iterate through all keys
# and create a single dataframe for each key
for k, v in dd.items():
    # for each article type, combine all the dataframes into a single dataframe
    dd[k] = pd.concat(v).reset_index(drop=True)

print(dd.keys())

[out]:
dict_keys(['alt_atheism', 'comp_graphics', 'comp_os_ms-windows_misc', 'comp_sys_ibm_pc_hardware', 'comp_sys_mac_hardware', 'comp_windows_x', 'misc_forsale', 'rec_autos', 'rec_motorcycles', 'rec_sport_baseball', 'rec_sport_hockey', 'sci_crypt', 'sci_electronics', 'sci_med', 'sci_space', 'soc_religion_christian', 'talk_politics_guns', 'talk_politics_mideast', 'talk_politics_misc', 'talk_religion_misc'])

# print the first article for the alt_atheism key
print(dd['alt_atheism'].iloc[0, 0])

[out]:
Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126
 Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew
 From: mathew <mathew@mantis.co.uk>
 Newsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers
 Subject: Alt.Atheism FAQ: Atheist Resources
 Summary: Books, addresses, music -- anything related to atheism
 Keywords: FAQ, atheism, books, music, fiction, addresses, contacts
 Message-ID: <19930329115719@mantis.co.uk>
 Date: Mon, 29 Mar 1993 11:57:19 GMT
 Expires: Thu, 29 Apr 1993 11:57:19 GMT
 Followup-To: alt.atheism
 Distribution: world
 Organization: Mantis Consultants, Cambridge. UK.
 Approved: news-answers-request@mit.edu
 Supersedes: <19930301143317@mantis.co.uk>
 Lines: 290
 Archive-name: atheism/resources...
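If only the combined dataframe is needed, a variation on the approach above is to collect one record per file in a list and build the dataframe once, instead of creating a one-row dataframe per file. The following is a minimal sketch under stated assumptions, not the code from the answer: the function name load_articles is hypothetical, a temporary directory with three tiny files stands in for the extracted archive (in practice root would be something like Path('data/zipped')), and the text is kept exactly as read rather than dropping blank lines.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_articles(root: Path) -> pd.DataFrame:
    """Read every file under root into one row of a single dataframe.

    Hypothetical helper: the parent folder name is used as the article type,
    mirroring file.parts[-2] in the answer above.
    """
    records = []
    for file in sorted(root.rglob('*')):
        if not file.is_file():
            continue  # skip the folders themselves
        text = file.read_text(encoding='utf-8', errors='ignore')
        records.append({
            'article': text,
            'type': file.parent.name.replace('.', '_'),
        })
    return pd.DataFrame(records, columns=['article', 'type'])


# Stand-in data (assumption): a few small files laid out like the archive,
# with one folder per article type.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = [
        ('alt.atheism', '1', 'first\nsecond\n'),
        ('sci.space', '2', 'orbit\n'),
        ('sci.space', '3', 'launch\n'),
    ]
    for folder, name, body in samples:
        (root / folder).mkdir(exist_ok=True)
        (root / folder / name).write_text(body, encoding='utf-8')

    dfc = load_articles(root)

# number of articles per type
print(dfc['type'].value_counts())
```

Building the dataframe from a list of dicts avoids thousands of small pd.concat inputs. The per-type dataframes can still be recovered afterwards with dfc.groupby('type') if they are needed.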