Applying a function to *staggered* groups of a Pandas DataFrame

I have a DataFrame of 3-axis data with a membership label that I use for grouping:

import pandas as pd

df = pd.DataFrame([[0, 1, 2, 0],
                   [-1, 0, 1, 0],
                   [-2, 0, 3, 1],
                   [1, 1, 3, 1],
                   [1, 0, 2, 2],
                   [1, 0, 3, 2],
                   [6, 2, 1, 5],
                   [-4, 3, 0, 5],
                   [1, 0, -1, 6],
                   [0, 0, 3, 6]], columns=['x', 'y', 'z', 'member'])

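My goal is a bit contrived: I want to find the pairwise distances between the points of each group and the points of the next n_skip groups, ordered from smallest to largest. This is what I mean by staggered.

For example, for n_skip=2, I want to find the distances for: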

  • Rows with member == 0 --> against member == 1, 2

  • Rows with member == 1 --> against member == 2, 5

  • Rows with member == 2 --> against member == 5, 6

  • Rows with member == 5 --> against member == 6

  • Nothing is computed for member == 6.
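
The pairing rule in the list above can be sketched as follows. This is only an illustration of which groups get compared, assuming the skip counts labels that actually occur (so 3 and 4 are ignored); the names `labels` and `staggered_pairs` are made up for the example.

```python
import pandas as pd

# Membership labels from the example DataFrame
members = pd.Series([0, 0, 1, 1, 2, 2, 5, 5, 6, 6], name="member")
n_skip = 2

# Sorted labels that actually occur; the position in this list acts as a dense rank
labels = sorted(members.unique())

# Pair each label with up to n_skip labels that follow it
staggered_pairs = [
    (int(src), int(dst))
    for pos, src in enumerate(labels)
    for dst in labels[pos + 1 : pos + 1 + n_skip]
]
print(staggered_pairs)
```

The last label (6) has no followers, so it never appears as a source group.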

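Is there a performant way to do this without nested for loops? This was mentioned in the answers to this question. Intuitively, I cannot use the traditional apply method to parallelize a function over a Pandas DataFrame. What is a fast way to apply a function to a set of staggered groups?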


EDIT1 My solution (works for one axis only):

    import numpy as np
    from scipy.spatial.distance import cdist

    # Organize by group membership
    groups = df.groupby('member')

    # Define constants
    max_member = 6
    n_skip = 2
    start_row = 0
    matrix = np.zeros((df.shape[0], df.shape[0]))

    # Iterate for each group (the last group has nothing after it)
    for i in range(max_member):
        try:
            pts_curr = groups.get_group(i)
        except KeyError:
            continue

        # Save end row index
        end_row = start_row + pts_curr.shape[0]

        # Save start col index
        start_col = end_row

        # Grab the destination group nodes.
        # The upper bound is max_member + 1 so that the last group can be a destination.
        # Note: j walks over label values, so missing labels (3, 4) still count toward n_skip.
        for j in range(i + 1, min(i + n_skip, max_member) + 1):
            try:
                pts_next = groups.get_group(j)
            except KeyError:
                continue

            # Save end col index
            end_col = start_col + pts_next.shape[0]

            # Calculate cdist (z axis only)
            z_sq = cdist(pts_curr[['z']], pts_next[['z']])

            # Save results in matrix at right positions
            matrix[start_row:end_row, start_col:end_col] = z_sq

            # Update col index
            start_col = end_col

        # Update row index
        start_row = end_row
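
A loop-free way to fill a matrix of the same shape is to compute all pairwise distances once and mask out the pairs that fall outside the staggered window. The sketch below is not the code above but a swapped-in technique: it uses all three axes and a dense rank of the labels (so missing labels do not consume the skip). It also computes the full N x N distance matrix, which may be too much memory for large inputs.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame(
    [[0, 1, 2, 0], [-1, 0, 1, 0], [-2, 0, 3, 1], [1, 1, 3, 1], [1, 0, 2, 2],
     [1, 0, 3, 2], [6, 2, 1, 5], [-4, 3, 0, 5], [1, 0, -1, 6], [0, 0, 3, 6]],
    columns=["x", "y", "z", "member"],
)
n_skip = 2

# Dense rank: consecutive integers for the labels that actually occur
rank = df["member"].rank(method="dense").to_numpy()

# All-pairs Euclidean distances over the three axes (N x N)
pts = df[["x", "y", "z"]].to_numpy(dtype=float)
full = cdist(pts, pts)

# Keep entry (r, c) only if row c's group is 1..n_skip groups after row r's group
gap = rank[None, :] - rank[:, None]
mask = (gap > 0) & (gap <= n_skip)
matrix = np.where(mask, full, 0.0)
```

Because masked entries are set to zero, a zero in `matrix` can mean either "not compared" or "identical points"; `mask` distinguishes the two.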


qq_笑_17
1 Answer

慕哥6287543

A cross merge of 4K rows is not too bad (it yields about 16M rows). Let's try a cross merge and query:

    n = 2

    # dummy key
    df['dummy'] = 1

    # this is the member group number
    df['rank'] = df['member'].rank(method='dense')

    # cross merge and filter
    new_df = (df.merge(df, on='dummy')
                .query('rank_x<rank_y<=rank_x+@n')
             )

    # euclidean distance
    dist = (new_df[['x_x','y_x','z_x']].sub(new_df[['x_y','y_y','z_y']].values)**2).sum(1)**.5

    # output dataframe with member label
    pd.DataFrame({'member1':new_df['member_x'], 'member2':new_df['member_y'],
                  'dist':dist})

Output:

        member1  member2      dist
    2         0        1  2.449490
    3         0        1  1.414214
    4         0        2  1.414214
    5         0        2  1.732051
    12        0        1  2.236068
    13        0        1  3.000000
    14        0        2  2.236068
    15        0        2  2.828427
    24        1        2  3.162278
    25        1        2  3.000000
    26        1        5  8.485281
    27        1        5  4.690416
    34        1        2  1.414214
    35        1        2  1.000000
    36        1        5  5.477226
    37        1        5  6.164414
    46        2        5  5.477226
    47        2        5  6.164414
    48        2        6  3.000000
    49        2        6  1.414214
    56        2        5  5.744563
    57        2        5  6.557439
    58        2        6  4.000000
    59        2        6  1.000000
    68        5        6  5.744563
    69        5        6  6.633250
    78        5        6  5.916080
    79        5        6  5.830952

Option 2: If the DataFrame is larger, a loop may not be too bad:

    import numpy as np
    from scipy.spatial.distance import cdist

    ret = []
    for i in sorted(set(df['rank'])):
        this_group = df['rank']==i
        # (i, i+n]: excludes the group itself, includes the n-th next group,
        # matching the query in the first option (needs pandas >= 1.3)
        other_groups = df['rank'].between(i, i+n, inclusive='right')
        t = df.loc[this_group,['x','y','z']].values
        o = df.loc[other_groups,['x','y','z']].values
        ret.append(cdist(t,o).ravel())

    dist = np.concatenate(ret)
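
Neither option sorts the distances, which the question also asks for. A minimal sketch of one way to add that step is shown below. It assumes "smallest to largest" means ordering within each (member1, member2) pair, and it uses `how="cross"` (available in pandas 1.2 and later) in place of the dummy key; the names `pairs` and `result` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2, 0], [-1, 0, 1, 0], [-2, 0, 3, 1], [1, 1, 3, 1], [1, 0, 2, 2],
     [1, 0, 3, 2], [6, 2, 1, 5], [-4, 3, 0, 5], [1, 0, -1, 6], [0, 0, 3, 6]],
    columns=["x", "y", "z", "member"],
)
n = 2
df["rank"] = df["member"].rank(method="dense")

# how="cross" forms every row pair without needing a dummy key column
pairs = df.merge(df, how="cross", suffixes=("_x", "_y")).query(
    "rank_x < rank_y <= rank_x + @n"
)

# Euclidean distance between the two points of each pair
diff = pairs[["x_x", "y_x", "z_x"]].to_numpy() - pairs[["x_y", "y_y", "z_y"]].to_numpy()

# Assumption: sort ascending within each (member1, member2) pair
result = pd.DataFrame({
    "member1": pairs["member_x"].to_numpy(),
    "member2": pairs["member_y"].to_numpy(),
    "dist": np.sqrt((diff ** 2).sum(axis=1)),
}).sort_values(["member1", "member2", "dist"], ignore_index=True)
print(result)
```

To order all distances globally instead, sort on `"dist"` alone.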