呼啦一阵风
You may have better luck doing all of the heavy lifting before the data goes into pandas, where you have more control over how many comparisons get made, even though that means giving up some of the numpy acceleration inside pandas. I worked up the example below with namedtuples for convenience and did all of the comparisons before building the data frame. For 200K x 200K rows of fake data it finishes in about 30 seconds on my machine and yields roughly 10 million matching rows; that count depends entirely on the diversity of the random data I used. YMMV. There is probably more "left on the floor" here as well. Some smart sorting (beyond the binning by 'chr' that I did) could likely take this further.

```python
import pandas as pd
from collections import namedtuple, defaultdict
from random import randint

# structures
rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')
cage = namedtuple('cage', 'ID chr peak_start peak_stop')
row = namedtuple('row', 'name chr promoter_start promoter_stop info ID peak_start peak_stop')

# some data entry from post to check...
rnas = [rna('inc1',1,1,10,'x'), rna('inc2',1,11,20,'y'), rna('inc1',1,21,30,'z')]
cages = [cage('peak1',1,3,7), cage('peak2',1,15,17), cage('peak3',1,4,6), cage('peak4',2,6,9)]

# keep every (rna, cage) pair on the same chr where the peak lies inside the promoter
result_rows = [row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)
               for r in rnas for c in cages
               if r.chr == c.chr and r.promoter_start <= c.peak_start and r.promoter_stop >= c.peak_stop]

df = pd.DataFrame(data=result_rows)
print(df)
print()

# stress test
# big fake data (start/stop are drawn independently, so start may exceed stop)
rnas = [rna('xx', randint(1,1000), randint(1,50), randint(10,150), 'yy') for _ in range(200_000)]
cages = [cage('pk', randint(1,1000), randint(1,50), randint(10,150)) for _ in range(200_000)]

# group by chr to expedite comparisons
rna_dict = defaultdict(list)
cage_dict = defaultdict(list)
for r in rnas:
    rna_dict[r.chr].append(r)
for c in cages:
    cage_dict[c.chr].append(c)
print('fake data made')

# use the chr's that are keys in the rna dictionary and make all comparisons...
result_rows = []
for k in rna_dict.keys():
    # .get(k, []) avoids iterating over None when no cage shares this chr
    result_rows.extend([row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)
                        for r in rna_dict.get(k, []) for c in cage_dict.get(k, [])
                        if r.promoter_start <= c.peak_start and r.promoter_stop >= c.peak_stop])

df = pd.DataFrame(data=result_rows)
print(df.head(5))
print(df.info())  # info() prints directly and returns None, hence the trailing "None" below
```

Output:

```
   name  chr  promoter_start  promoter_stop info     ID  peak_start  peak_stop
0  inc1    1               1             10    x  peak1           3          7
1  inc1    1               1             10    x  peak3           4          6
2  inc2    1              11             20    y  peak2          15         17

fake data made
  name  chr  promoter_start  promoter_stop info  ID  peak_start  peak_stop
0   xx  804              34             35   yy  pk          36         11
1   xx  804              34             35   yy  pk          39         11
2   xx  804              34             35   yy  pk          37         14
3   xx  804              34             35   yy  pk          34         28
4   xx  804              34             35   yy  pk          39         20
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10280046 entries, 0 to 10280045
Data columns (total 8 columns):
name              object
chr               int64
promoter_start    int64
promoter_stop     int64
info              object
ID                object
peak_start        int64
peak_stop         int64
dtypes: int64(5), object(3)
memory usage: 627.4+ MB
None
[Finished in 35.4s]
```

From DataFrame --> namedtuple

A couple of options below. I looked into the same thing and picked up a few examples. You can use the `itertuples` method of the DataFrame, as shown below, to strip the rows out and put them into named tuples. However, it appears to match by position only, so be careful. Note that the second example is jacked up: its fields are declared in the wrong order, so the values land under the wrong names.

Pandas also seems to do its own named-row thing, which may work just as well (last example). I haven't tinkered with it much, but the rows appear to be addressable by name, which is just as good as a namedtuple.

```
In [22]: df
Out[22]:
   name  chr  promoter_start  promoter_stop info
0  lnc1    1               1             10    x
1  lnc2    1              11             20    y
2  lnc3    1              21             30    z

In [23]: rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')

In [24]: rows = [rna(*t) for t in df.itertuples(index=False)]

In [25]: rows
Out[25]:
[rna(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),
 rna(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),
 rna(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]

In [26]: rna = namedtuple('rna', 'name chr info promoter_start promoter_stop')  # note: wrong

In [27]: rows = [rna(*t) for t in df.itertuples(index=False)]

In [28]: rows
Out[28]:
[rna(name='lnc1', chr=1, info=1, promoter_start=10, promoter_stop='x'),
 rna(name='lnc2', chr=1, info=11, promoter_start=20, promoter_stop='y'),
 rna(name='lnc3', chr=1, info=21, promoter_start=30, promoter_stop='z')]

In [29]: # note the above is mis-aligned!!!

In [32]: rows = [t for t in df.itertuples(name='row', index=False)]

In [33]: rows
Out[33]:
[row(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),
 row(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),
 row(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]

In [34]: type(rows[0])
Out[34]: pandas.core.frame.row

In [35]: rows[0].chr
Out[35]: 1

In [36]: rows[0].info
Out[36]: 'x'
```
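Since unpacking `itertuples` rows into your own namedtuple matches by position, one possible guard against the misalignment shown above is to reorder the DataFrame columns to the namedtuple's own field order first. The following is only a minimal sketch of that idea, not something from the session above: the small `df` is rebuilt by hand from the displayed values, and it assumes every namedtuple field exists as a column (a missing one would raise a `KeyError`).

```python
from collections import namedtuple

import pandas as pd

# Hand-built copy of the small example frame (assumed values from the session above).
df = pd.DataFrame({
    'name': ['lnc1', 'lnc2', 'lnc3'],
    'chr': [1, 1, 1],
    'promoter_start': [1, 11, 21],
    'promoter_stop': [10, 20, 30],
    'info': ['x', 'y', 'z'],
})

# Field order deliberately differs from the DataFrame's column order.
rna = namedtuple('rna', 'name chr info promoter_start promoter_stop')

# Select the columns in the namedtuple's field order, then unpack positionally.
# name=None makes itertuples yield plain tuples instead of pandas' own row type.
ordered = df[list(rna._fields)]
rows = [rna._make(t) for t in ordered.itertuples(index=False, name=None)]

for r in rows:
    print(r.name, r.info, r.promoter_start, r.promoter_stop)
```

With the columns selected by name before unpacking, the declared field order no longer has to match the column order of the frame.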