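Finding similarities in a dictionary

I am trying to find similarities between multiple .txt files. I have put all of these files into a dictionary, with the filename as the key.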


Current code:


import pandas as pd
from os import listdir, chdir

path = r'C:\...path'
chdir(path)

# Read every .txt file in the folder into a dict keyed by filename
files_dict = {}
for filename in listdir(path):
    if filename.lower().endswith('.txt'):
        # 'split' orientation gives {'index': [...], 'columns': [...], 'data': [...]}
        files_dict[filename] = pd.read_csv(filename).to_dict('split')

for key, value in files_dict.items():
    print(key + str(value) + '\n')

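In this case the key is the filename, and the value holds the headers and the data. I want to find out whether any values are repeated across multiple files, so that I can join them in SQL. I don't know how to do this.


Edit: sample file: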


timestamp,Name,Description,Default Column Layout,Analysis View Name

00000000B42852FA,ADM_EIG,Administratief eigenaar,ADM_EIG,ADM_EIG

000000005880959E,OPZ,Opzeggingen,STANDAARD,

And the output from the code:


Acc_ Schedule Name.txt{'index': [0, 1], 'columns': ['timestamp', 'Name', 'Description', 'Default Column Layout', 'Analysis View Name'], 'data': [['00000000B42852FA', 'ADM_EIG', 'Administratief eigenaar', 'ADM_EIG', 'ADM_EIG'], ['000000005880959E', 'OPZ', 'Opzeggingen', 'STANDAARD', nan]]}

Edit 2: suggested code


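from collections import Counter

for key, value in files_dict.items():
    data = value['data']
    # Flatten the rows of this file and count each value
    counter = Counter([item for sublist in data for item in sublist])
    # No count filter here, so every distinct value is printed
    print([item for item, count in counter.items()])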

Output: ['00000000B99BD831', 5050, 'CK102', '0,00000000000000000000', 'Thuiswonend', 0, '00000000B99BD832', ........


湖上湖

翻阅古今

Counter counts how often each item occurs, so it can tell you which values appear more than once. Taking data from your dictionary:

from collections import Counter
from math import nan

data = [
    ['00000000B42852FA', 'ADM_EIG', 'Administratief eigenaar', 'ADM_EIG', 'ADM_EIG'],
    ['000000005880959E', 'OPZ', 'Opzeggingen', 'STANDAARD', nan]
]

You need to flatten the list of lists:

[item for sublist in data for item in sublist]

Counter will then give you the frequency of each item:

>>> Counter([item for sublist in data for item in sublist])
Counter({'ADM_EIG': 3, '00000000B42852FA': 1, 'Administratief eigenaar': 1, '000000005880959E': 1, 'OPZ': 1, 'Opzeggingen': 1, 'STANDAARD': 1, nan: 1})

You can then filter for what you need:

counter = Counter([item for sublist in data for item in sublist])
[value for value, count in counter.items() if count > 1]

This gives ['ADM_EIG'].

Edit to match the question edit

To look across all rows of all files, collect all the data first and then look for duplicates:

data = []
for key, value in files_dict.items():
    data.extend(value['data'])

counter = Counter([item for sublist in data for item in sublist])
print([value for value, count in counter.items() if count > 1])
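One limitation of counting everything in a single Counter is that a value repeated inside one file (like 'ADM_EIG' above) also shows up as a duplicate, and the result does not say which files or columns share it. Since the goal is to find SQL join candidates, a possible variation is to collect the set of values per (file, column) and intersect those sets across files. The following is only a sketch under assumptions: the second file, 'Example Lines.txt', and its contents are made up for illustration, and the helper names is_missing, column_values and shared_columns are hypothetical, not part of pandas.

```python
from collections import defaultdict
from itertools import combinations
import math

# Hypothetical stand-in for files_dict: same 'split' layout as in the question.
# The second file and its contents are invented for this example.
files_dict = {
    'Acc_ Schedule Name.txt': {
        'index': [0, 1],
        'columns': ['timestamp', 'Name', 'Description'],
        'data': [['00000000B42852FA', 'ADM_EIG', 'Administratief eigenaar'],
                 ['000000005880959E', 'OPZ', 'Opzeggingen']],
    },
    'Example Lines.txt': {
        'index': [0, 1],
        'columns': ['Schedule Name', 'Line No_'],
        'data': [['ADM_EIG', 10000], ['ADM_EIG', 20000]],
    },
}


def is_missing(value):
    """True for NaN, which pandas uses for empty cells."""
    return isinstance(value, float) and math.isnan(value)


def column_values(files):
    """Map (filename, column) -> set of non-missing values in that column."""
    result = defaultdict(set)
    for filename, table in files.items():
        for row in table['data']:
            for column, value in zip(table['columns'], row):
                if not is_missing(value):
                    result[(filename, column)].add(value)
    return result


def shared_columns(files):
    """Map pairs of columns from different files to the values they share."""
    values = column_values(files)
    overlaps = {}
    for (key_a, vals_a), (key_b, vals_b) in combinations(values.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # skip pairs of columns inside the same file
        common = vals_a & vals_b
        if common:
            overlaps[(key_a, key_b)] = common
    return overlaps


overlaps = shared_columns(files_dict)
for (col_a, col_b), common in overlaps.items():
    print(col_a, '<->', col_b, sorted(common, key=str))
```

Each reported pair is a candidate join key, for example Name in one table against Schedule Name in the other. Using sets means a value repeated within a single file is counted only once, and NaN cells are skipped so that empty fields are not reported as matches.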