I am writing a detailed file-validation script for my project in Python 2.7, using only the core Python API. It compares the source and target files of another ETL job, and covers line-by-line metadata validation, count validation, duplicate checks, null checks and full data validation. The script is finished and runs fine on 100k datasets (I did a few test runs on 100k and 200k volumes). But when I run it against millions of records, the duplicate-check method runs forever (I mean, it takes a huge amount of time). I debugged the code and found that the duplicate-check method below is causing the problem.
import time

def dupFind(dup_list=[], output_path=""):
    # dup_list is the list that may contain duplicates; it holds the contents of a file, one entry per line
    # output_path is the directory where the records and their duplicate counts are saved as a single report file
    # duplicates is a set of 2-tuples: the first element is the duplicated record, the second is its duplicate count
    t0 = time.time()
    duplicates = set((x, dup_list.count(x)) for x in filter(lambda rec: dup_list.count(rec) > 1, dup_list))
    t1 = time.time()
    print "time taken for preparing duplicate list is {}".format(str(t1 - t0))
    dup_report = "{}\dup.{}".format(output_path, int(time.time()))
    print "Please find the duplicate records in {}".format(dup_report)
    print ""
    with open(dup_report, 'w+') as f:
        f.write("RECORD|DUPLICATE_COUNT\n")
        for line in duplicates:
            f.write("{}|{}\n".format(line[0], line[1]))
First, I read both files and turn them into lists like this (this part runs fast):
import sys

with open(sys.argv[1]) as src, open(sys.argv[2]) as tgt:
    src = map(lambda x: x.strip(), list(src))
    tgt = map(lambda x: x.strip(), list(tgt))
The duplicate output file looks like this:
RECORD|DUPLICATE_COUNT
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT|2
68835,2014-05-02 00:00:00.0,764,COMPLETE|2
68878,2014-07-08 00:00:00.0,6753,COMPLETE|2
68834,2014-05-01 00:00:00.0,6938,COMPLETE|2
Can anyone help me modify this logic, or write a new one, so that I can handle millions of records at a time? In my project the files can go up to 40M or 50M.
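For the largest files I am also not sure that keeping both full lists in memory will hold up, so another option I considered is streaming the counts straight from each file (again only a sketch; count_records is a throwaway name, not something in my current script):

from collections import Counter

def count_records(path):
    # Stream the file line by line instead of materializing the whole list first
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[line.strip()] += 1
    return counts

# e.g. duplicates in the source file:
# src_counts = count_records(sys.argv[1])
# dups = [(rec, cnt) for rec, cnt in src_counts.iteritems() if cnt > 1]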