-
慕沐林林
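It looks like you want the Jaccard distance between pairs of strings. Here is one way to do it using groupby and scipy.spatial.distance.jaccard:

    from scipy.spatial.distance import jaccard

    # Group rows by the first character of the name (a1/a2 -> 'a', ...)
    g = df.groupby(df.name.str[0])

    # For each group of two rows, emit NaN for the first row and the
    # distance between the two sequences for the second row
    df['diff'] = [sim for _, seqs in g.seq
                  for sim in [float('nan'), jaccard(*map(list, seqs))]]
    print(df)

      name  seq  diff
    1   a1  bbb   NaN
    2   a2  bbc   1.0
    3   b1  fff   NaN
    4   b2  fff   0.0
    5   c1  aaa   NaN
    6   c2  acg   2.0

This assumes every group contains exactly two rows. Note that the Jaccard distance is a ratio between 0 and 1 rather than a count of differing characters, so values such as 2.0 in the table above appear to be the desired counts rather than raw jaccard output.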
-
饮歌长啸
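An alternative using the Levenshtein distance:

    import Levenshtein

    # Group key: the first character of the name
    s = df['name'].str[0]

    # Compute the distance between the first and last seq of each group,
    # then map it onto the last row of each group; other rows become NaN
    out = df.assign(Diff=s.drop_duplicates(keep='last')
                          .map(df.groupby(s)['seq']
                                 .apply(lambda x: Levenshtein.distance(x.iloc[0], x.iloc[-1]))))
    print(out)

      name  seq  Diff
    1   a1  bbb   NaN
    2   a2  bbc   1.0
    3   b1  fff   NaN
    4   b2  fff   0.0
    5   c1  aaa   NaN
    6   c2  acg   2.0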
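If installing the third-party Levenshtein package is not an option, the same distance can be computed in plain Python. The following is a minimal sketch of the standard dynamic-programming approach, not the package's own implementation; the function name levenshtein and the pairs list are illustrative names that do not appear in the answer above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] is the distance between the already processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The three sequence pairs from the sample data
pairs = [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]
distances = [levenshtein(a, b) for a, b in pairs]
print(distances)  # [1, 0, 2]
```

To use it in the pandas expression above, levenshtein could be passed in place of Levenshtein.distance inside the lambda.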
-
鸿蒙传说
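As a first step, I recreated your data as follows:

    #!/usr/bin/env python3
    import pandas as pd

    # Setup
    data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'},
            'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
    df = pd.DataFrame(data)

Solution

You can iterate over the dataframe and compare the seq value from the previous iteration with the value of the current one. To compare two strings (stored in the seq column of the dataframe), you can use a simple generator expression, as in this function:

    def diff_letters(a, b):
        # Count the positions at which the two strings differ
        # (assumes b is at least as long as a)
        return sum(a[i] != b[i] for i in range(len(a)))

Iterating over the dataframe rows

    diff = ['NA']
    row_iterator = df.iterrows()
    _, last = next(row_iterator)

    # Iterate over the df and populate a list with the result of the comparison
    for i, row in row_iterator:
        if i % 2 == 0:
            # even index labels: second row of a pair, compare with the previous row
            diff.append(diff_letters(last['seq'], row['seq']))
        else:
            # odd index labels: first row of a pair, append the NA value
            diff.append("NA")
        last = row

    df['diff'] = diff

The result looks like this:

      name  seq diff
    1   a1  bbb   NA
    2   a2  bbc    1
    3   b1  fff   NA
    4   b2  fff    0
    5   c1  aaa   NA
    6   c2  acg    2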
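The loop above relies on the index labels alternating between odd and even. As a hedged variation, the pairs can instead be formed from the name prefix, which mirrors the groupby idea from the first answer. This is only a plain-Python sketch: the rows list stands in for the dataframe, and it assumes the rows are already sorted so that names with the same first character are adjacent.

```python
from itertools import groupby

# Stand-in for the (name, seq) columns of the sample dataframe
rows = [("a1", "bbb"), ("a2", "bbc"), ("b1", "fff"),
        ("b2", "fff"), ("c1", "aaa"), ("c2", "acg")]


def diff_letters(a, b):
    # Count differing positions; zip stops at the shorter string,
    # so trailing characters of a longer string are ignored
    return sum(x != y for x, y in zip(a, b))


diff = []
# groupby only merges adjacent items, hence the sorted-input assumption
for _, group in groupby(rows, key=lambda r: r[0][0]):
    seqs = [seq for _, seq in group]
    diff.append(None)  # first row of a group has nothing to compare against
    for prev, curr in zip(seqs, seqs[1:]):
        diff.append(diff_letters(prev, curr))

print(diff)  # [None, 1, None, 0, None, 2]
```

With this layout a group of three or more rows would still work: each row is compared with the one directly before it in the same group.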
-
侃侃尔雅
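Check this out:

    import numpy as np
    import pandas as pd

    data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
            'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']}
    df = pd.DataFrame(data, columns=['name', 'seq'])

    df['diff'] = np.nan
    i = 0
    # Walk through the rows two at a time: row i is the reference,
    # row i+1 is compared against it
    while i < len(df) - 1:
        item = df.at[i, 'seq']
        df.at[i, 'diff'] = np.nan
        diffCntr = 0
        # Count the characters of the second string that do not occur
        # anywhere in the first string
        for j in df.at[i + 1, 'seq']:
            if item.find(j) < 0:
                diffCntr += 1
        df.at[i + 1, 'diff'] = diffCntr
        i += 2
    print(df)

The result looks like this:

      name  seq  diff
    0   a1  bbb   NaN
    1   a2  bbc   1.0
    2   b1  fff   NaN
    3   b2  fff   0.0
    4   c1  aaa   NaN
    5   c2  acg   2.0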
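One thing worth noting about the find-based check: it tests whether each character occurs anywhere in the other string, which is not the same as comparing the strings position by position. The sketch below contrasts the two on the sample pair ("aaa", "acg") and on a made-up pair ("abc", "cba"); both function names and the second pair are illustrative assumptions, not part of the answer.

```python
def chars_not_in_first(first, second):
    """Count characters of `second` that do not occur anywhere in `first`."""
    return sum(ch not in first for ch in second)


def positional_mismatches(first, second):
    """Count positions at which the two strings differ."""
    return sum(x != y for x, y in zip(first, second))


results = {}
for first, second in [("aaa", "acg"), ("abc", "cba")]:
    results[(first, second)] = (chars_not_in_first(first, second),
                                positional_mismatches(first, second))
    print(first, second, results[(first, second)])
# aaa acg (2, 2)
# abc cba (0, 2)
```

The two measures agree on the sample data, but they diverge once the same characters appear in a different order, so which one is appropriate depends on what "difference" should mean for the sequences.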