比 iterrows 更好的解决方案,用于比较不同数据集之间的值

我一直在使用iterrows()比较两个数据集之间的列值并合并满足某些条件的行,但是这需要很长时间。有没有更好的方法可以在不迭代的情况下做到这一点?


这是完整的功能


def find_peak_matches(lncRNA, CAGE):

"""isolates CAGE peaks that match an lncRNA"""


lncRNA['promoter_start'] = lncRNA['promoter_start'].apply(pd.to_numeric).astype('int32')

lncRNA['promoter_stop'] = lncRNA['promoter_stop'].apply(pd.to_numeric).astype('int32')

CAGE['peak_start'] = CAGE['peak_start'].apply(pd.to_numeric).astype('int32')

CAGE['peak_stop'] = CAGE['peak_stop'].apply(pd.to_numeric).astype('int32')

peak_matches = pd.DataFrame()

for i, row in lncRNA.iterrows():

    mask = (

        (CAGE['chr'] == row['chr']) & \

        (row['promoter_start'] <= CAGE['peak_start']) & \

        (row['promoter_stop'] >= CAGE['peak_stop'])

    )#finds peaks in lncRNA promoters

    matches = CAGE[mask].dropna() #isolates only the peak matches

    if len(matches) == 0: #if no matches found continue

        continue

    merged = pd.merge(

        row.to_frame().T, matches, 

        on=['chr']

    ) #merges rows that meet mask conditions

    peak_matches = pd.concat(

        [peak_matches, merged],

        ignore_index=True

    ) #creates a new df from all the merged rows   


logging.debug('found peak matches')

return (peak_matches)

这是一个示例 lncRNA 数据集:


name   chr    promoter_start    promoter_stop    info

lnc1    1        1                10              x

lnc2    1       11                20              y

lnc3    1       21                30              z

样品笼:


ID      chr    peak_start    peak_stop 

peak1   1        3             7            

peak2   1        15            17            

peak3   1        4             6            

peak4   2        6             9 


真实的数据集每个包含大约 20 万行,所以我当前的代码花费的时间太长了。我正在尝试合并峰值/启动子具有相同 chr 值且峰值开始/停止落在启动子启动/停止之间的行。有什么优化建议吗?我对 python 相当了解,所以我不知道最好的做事方式是什么。


子衿沉夜
浏览 90回答 2
2回答

呼啦一阵风

在将数据放入pandas可以对需要进行的比较次数进行更多选择的地方之前,您可能会更幸运地完成所有繁重的工作——尽管事实上您会放弃一些 numpy 加速剂在pandas。namedtuples为了方便起见,我编写了下面的示例,并在制作数据框之前进行了所有比较。对于 200K x 200K 的虚假数据,它在我的机器上大约需要 30 秒就可以完成,并获得 1000 万行匹配项,这完全取决于我使用的随机数据的多样性。YMMV。这里可能还有更多“留在地板上”。一些智能排序(除了我所做的按“chr”分箱)可能会更进一步。import pandas as pdfrom collections import namedtuple, defaultdictfrom random import randintfrom itertools import product# structuresrna = namedtuple('rna', 'name chr promoter_start promoter_stop info')cage = namedtuple('cage', 'ID chr peak_start peak_stop')row = namedtuple('row', 'name chr promoter_start promoter_stop info ID peak_start peak_stop')# some data entry from post to check...rnas = [rna('inc1',1,1,10,'x'), rna('inc2',1,11,20,'y'), rna('inc1',1,21,30,'z')]cages = [cage('peak1',1,3,7), cage('peak2',1,15,17), cage('peak3',1,4,6), cage('peak4',2,6,9)]result_rows = [row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; for r in rnas for c in cages if&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; r.chr == c.chr and&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; r.promoter_start <= c.peak_start and&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; r.promoter_stop&nbsp; >= c.peak_stop]df = pd.DataFrame(data=result_rows)print(df)print()# stress test# big fake datarnas = [rna('xx', randint(1,1000), randint(1,50), randint(10,150), 'yy') for t in range(200_000)]cages = [cage('pk', randint(1,1000), randint(1,50), randint(10,150))&nbsp; &nbsp; &nbsp;for t in range(200_000)]# group by chr to expedite comparisonsrna_dict = defaultdict(list)cage_dict = defaultdict(list)for r in rnas:&nbsp; &nbsp; rna_dict[r.chr].append(r)for c in cages:&nbsp; &nbsp; cage_dict[c.chr].append(c)print('fake data made')# use the chr's that are keys in the rna dictionary and make all comparisions...result_rows = []for k in rna_dict.keys():&nbsp; &nbsp; result_rows.extend([row(r.name, r.chr, r.promoter_start, r.promoter_stop, r.info, c.ID, c.peak_start, c.peak_stop)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; for r in rna_dict.get(k) for c in cage_dict.get(k) if&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; r.promoter_start <= c.peak_start and&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; r.promoter_stop&nbsp; >= c.peak_stop])df = pd.DataFrame(data=result_rows)print(df.head(5))print(df.info())输出:&nbsp; &nbsp;name&nbsp; chr&nbsp; promoter_start&nbsp; promoter_stop info&nbsp; &nbsp; &nbsp;ID&nbsp; peak_start&nbsp; peak_stop0&nbsp; inc1&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;10&nbsp; &nbsp; x&nbsp; peak1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 71&nbsp; inc1&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;10&nbsp; &nbsp; x&nbsp; peak3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 62&nbsp; inc2&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 11&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;20&nbsp; &nbsp; y&nbsp; peak2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 15&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;17fake data made&nbsp; name&nbsp; chr&nbsp; promoter_start&nbsp; promoter_stop info&nbsp; ID&nbsp; peak_start&nbsp; peak_stop0&nbsp; &nbsp;xx&nbsp; 804&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 34&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;35&nbsp; &nbsp;yy&nbsp; pk&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 36&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;111&nbsp; &nbsp;xx&nbsp; 804&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 34&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;35&nbsp; &nbsp;yy&nbsp; pk&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 39&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;112&nbsp; &nbsp;xx&nbsp; 804&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 34&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;35&nbsp; &nbsp;yy&nbsp; pk&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 37&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;143&nbsp; &nbsp;xx&nbsp; 804&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 34&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;35&nbsp; &nbsp;yy&nbsp; pk&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 34&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;284&nbsp; &nbsp;xx&nbsp; 804&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 34&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;35&nbsp; &nbsp;yy&nbsp; pk&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 39&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;20<class 'pandas.core.frame.DataFrame'>RangeIndex: 10280046 entries, 0 to 10280045Data columns (total 8 columns):name&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; objectchr&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;int64promoter_start&nbsp; &nbsp; int64promoter_stop&nbsp; &nbsp; &nbsp;int64info&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; objectID&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; objectpeak_start&nbsp; &nbsp; &nbsp; &nbsp; int64peak_stop&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;int64dtypes: int64(5), object(3)memory usage: 627.4+ MBNone[Finished in 35.4s]来自 DataFrame --> namedtuple下面的几个选项......研究了同样的事情并选择了几个例子。您可以使用pd.itertuples下面的方法将它们剥离出来并将它们放入命名元组中。但是,它似乎只进行位置匹配。所以要小心。注意第二个例子是顶起的。Pandas 似乎也做自己的命名行事情,这可能同样有效。(最后一个例子)。我没有对它进行太多修改,但它似乎可以在内部通过名称寻址,这就像 namedtuple 一样好。In [22]: df&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Out[22]:&nbsp;&nbsp; &nbsp;name&nbsp; chr&nbsp; promoter_start&nbsp; promoter_stop info0&nbsp; lnc1&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;10&nbsp; &nbsp; x1&nbsp; lnc2&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 11&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;20&nbsp; &nbsp; y2&nbsp; lnc3&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 21&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;30&nbsp; &nbsp; zIn [23]: rna = namedtuple('rna', 'name chr promoter_start promoter_stop info')&nbsp;&nbsp;In [24]: rows = [rna(*t) for t in df.itertuples(index=False)]&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;In [25]: rows&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Out[25]:&nbsp;[rna(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),&nbsp;rna(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),&nbsp;rna(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]In [26]: rna = namedtuple('rna', 'name chr info promoter_start promoter_stop')&nbsp; # note:&nbsp; wrongIn [27]: rows = [rna(*t) for t in df.itertuples(index=False)]&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;In [28]: rows&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Out[28]:&nbsp;[rna(name='lnc1', chr=1, info=1, promoter_start=10, promoter_stop='x'),&nbsp;rna(name='lnc2', chr=1, info=11, promoter_start=20, promoter_stop='y'),&nbsp;rna(name='lnc3', chr=1, info=21, promoter_start=30, promoter_stop='z')]In [29]: # note the above is mis-aligned!!!&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;In [32]: rows = [t for t in df.itertuples(name='row', index=False)]&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;In [33]: rows&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Out[33]:&nbsp;[row(name='lnc1', chr=1, promoter_start=1, promoter_stop=10, info='x'),&nbsp;row(name='lnc2', chr=1, promoter_start=11, promoter_stop=20, info='y'),&nbsp;row(name='lnc3', chr=1, promoter_start=21, promoter_stop=30, info='z')]In [34]: type(rows[0])&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;Out[34]: pandas.core.frame.rowIn [35]: rows[0].chr&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;Out[35]: 1In [36]: rows[0].info&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Out[36]: 'x'

守着星空守着你

您可以合并所有lncRNA和CAGE数据框,然后使用df.query.例如:df = lncRNA.merge(CAGE, on='chr')df = df.query('(promoter_start <= peak_start) & (promoter_stop >= peak_stop)')print(df)印刷:&nbsp; &nbsp;name&nbsp; chr&nbsp; promoter_start&nbsp; promoter_stop info&nbsp; &nbsp; &nbsp;ID&nbsp; peak_start&nbsp; peak_stop0&nbsp; lnc1&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;10&nbsp; &nbsp; x&nbsp; peak1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 72&nbsp; lnc1&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;10&nbsp; &nbsp; x&nbsp; peak3&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;4&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 64&nbsp; lnc2&nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 11&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;20&nbsp; &nbsp; y&nbsp; peak2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 15&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;17
打开App,查看更多内容
随时随地看视频慕课网APP

相关分类

Python