Speeding up a double iterrows() in pandas

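I have several large datasets (about 3,000 rows, 100 columns) that I need to process with pandas. Each row represents a point on a map, with a bunch of data associated with that point. I'm doing spatial calculations (and may bring in more variables in the future), so for each row I only use data from 1-4 columns. The problem is that I have to compare every row against every other row: essentially, I'm trying to work out the spatial relationship between every pair of points. At this stage of the project, I'm calculating how many points lie within a given radius of each point in the table. I have to do this 5 or 6 times (i.e. run the distance calculation function for several radius sizes). That means I end up doing roughly 10-50 million calculations. It's slow. Very slow (as in 9+ hours of compute time).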


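After running all these calculations, I need to append them as new columns to the original (very large) dataframe. At the moment I pass the entire dataframe to my function, which may slow things down further.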


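I know many people run calculations of this scale on supercomputers or dedicated multi-core machines, but I'd like to optimize my code as far as I can so that it runs as efficiently as possible regardless of hardware.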


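I'm currently using a double for loop with .iterrows(). I've stripped out as many unnecessary steps as I can. I could pare the dataframe down to a subset, pass that to the function, and append the results to the original data in a separate step, if that would help speed things up. I've also considered using .apply() to eliminate the outer loop (e.g. have .apply() run the inner loop over all rows in the dataframe...?)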
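The subsetting idea above can be sketched roughly as follows. This is only an illustration, not the code from the project: the function name `count_within_radius` and the toy frame are hypothetical, and the sketch assumes the coordinate columns are numeric. It pulls just the two coordinate columns out as NumPy arrays and replaces the inner .iterrows() loop with one vectorized distance computation per point, keeping the original behaviour of counting the point itself.

```python
import numpy as np
import pandas as pd


def count_within_radius(data, rad_crit, col_x, col_y):
    """Count, for each row, how many rows (itself included) lie within rad_crit."""
    # Only the two coordinate columns are extracted, as plain NumPy arrays.
    x = data[col_x].to_numpy(dtype=float)
    y = data[col_y].to_numpy(dtype=float)
    counts = np.empty(len(data), dtype=int)
    for i in range(len(data)):
        # One vectorized computation replaces the inner iterrows() loop.
        dist = np.hypot(x - x[i], y - y[i])
        counts[i] = np.count_nonzero(dist < rad_crit)
    # Same column naming as spacing_calc, e.g. '2640.0 radius'.
    return pd.Series(counts, index=data.index, name=str(rad_crit) + ' radius')


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({'MID_X': [0.0, 1.0, 5.0],
                    'MID_Y': [0.0, 0.0, 0.0],
                    'other': ['a', 'b', 'c']})
toy = toy.join(count_within_radius(toy, 2.0, 'MID_X', 'MID_Y'))
print(toy)
```

The outer Python loop is still there, so this is a halfway step rather than a fully vectorized solution.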


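Below is the function I'm using. It's probably the simplest application I have in this project... there are others that do more calculations or return pairs or groups of points based on certain spatial criteria, but it's the best example to show the basic idea of what I'm doing.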


import math

import pandas as pd

# specify file to be read into pandas
df = pd.read_csv('input_file.csv', low_memory=False)


# function to return distance between two points w/ (x,y) coordinates
def xy_distance_calc(x1, x2, y1, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)


# function to calculate number of points inside a given radius for each point
def spacing_calc(data, rad_crit, col_x, col_y):
    count_list = list()

    for index, row in data.iterrows():
        x_row_current = row[col_x]
        y_row_current = row[col_y]
        count = 0

        # compare the current point against every row, itself included
        for index1, row1 in data.iterrows():
            x1 = row1[col_x]
            y1 = row1[col_y]
            dist = xy_distance_calc(x_row_current, x1, y_row_current, y1)

            if dist < rad_crit:
                count += 1

        count_list.append(count)

    df_list = pd.DataFrame(data=count_list, columns=[str(rad_crit) + ' radius'])
    return df_list


# call the function for each radius in question, append new data
df_2640 = spacing_calc(df, 2640.0, 'MID_X', 'MID_Y')
df = df.join(df_2640)

df_1320 = spacing_calc(df, 1320.0, 'MID_X', 'MID_Y')
df = df.join(df_1320)

There are no errors and everything works; I just don't think it's as efficient as it could be.


守候你守候我
1 Answer

婷婷同学_

Your problem is that you loop too many times. At the very least, you should compute a distance matrix and count how many points in that matrix fall within the radius. The fastest solution, however, is to use numpy's vectorized functions, which are highly optimized C code.

As with most learning experiences, it's best to start with a small problem:

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.spatial import distance_matrix

# Create a dataframe with two columns, MID_X and MID_Y, assigned at random
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.uniform(1, 10, size=(5, 2)), columns=['MID_X', 'MID_Y'])
>>> df.index.name = 'PointID'

            MID_X     MID_Y
PointID
0        4.370861  9.556429
1        7.587945  6.387926
2        2.404168  2.403951
3        1.522753  8.795585
4        6.410035  7.372653

# Calculate the distance matrix
>>> cols = ['MID_X', 'MID_Y']
>>> d = distance_matrix(df[cols].values, df[cols].values)

array([[0.        , 4.51542241, 7.41793942, 2.94798323, 2.98782637],
       [4.51542241, 0.        , 6.53786001, 6.52559479, 1.53530446],
       [7.41793942, 6.53786001, 0.        , 6.4521226 , 6.38239593],
       [2.94798323, 6.52559479, 6.4521226 , 0.        , 5.09021286],
       [2.98782637, 1.53530446, 6.38239593, 5.09021286, 0.        ]])

# The radii for which you want to measure. They need to be raised
# up 2 extra dimensions to prepare for array broadcasting later
>>> radii = np.array([3,6,9])[:, None, None]

array([[[3]],
       [[6]],
       [[9]]])

# Count how many points fall within a certain radius from another
# point using numpy's array broadcasting. `d < radii` will return
# an array of `True/False` and we can count the number of `True`
# by `sum` over the last axis.
#
# The distance between a point to itself is 0 and we don't want
# to count that hence the -1.
>>> count = (d < radii).sum(axis=-1) - 1

array([[2, 1, 0, 1, 2],
       [3, 2, 0, 2, 3],
       [4, 4, 4, 4, 4]])

# Putting everything together for export
>>> result = pd.DataFrame(count, index=radii.flatten()).stack().to_frame('Count')
>>> result.index.names = ['Radius', 'PointID']

                Count
Radius PointID
3      0            2
       1            1
       2            0
       3            1
       4            2
6      0            3
       1            2
       2            0
       3            2
       4            3
9      0            4
       1            4
       2            4
       3            4
       4            4

The final result means that within radius 3, point #0 has 2 neighbours, point #1 has 1 neighbour, point #2 has 0 neighbours, and so on. Reshape and format the frame to your liking.

Scaling this up to thousands of points should be no problem.
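To get back to the one-column-per-radius layout used in the question, the steps above could be wrapped up roughly like this. It is a minimal sketch, not part of the original answer: the helper name `add_radius_counts` is made up for illustration, and the column names simply follow the question's `str(rad_crit) + ' radius'` pattern. It uses the same `distance_matrix` call and broadcasting trick and the same toy data.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix


def add_radius_counts(data, radii, col_x='MID_X', col_y='MID_Y'):
    """Return a copy of data with one neighbour-count column per radius."""
    xy = data[[col_x, col_y]].to_numpy(dtype=float)
    d = distance_matrix(xy, xy)                          # shape (n, n)
    r = np.asarray(radii, dtype=float)[:, None, None]    # shape (k, 1, 1)
    # Broadcasting gives shape (k, n, n); summing the last axis counts the
    # points within each radius, and the -1 drops each point's zero
    # distance to itself.
    counts = (d < r).sum(axis=-1) - 1                    # shape (k, n)
    out = data.copy()
    for radius, row in zip(radii, counts):
        out[str(radius) + ' radius'] = row
    return out


# Same toy data as in the answer above.
np.random.seed(42)
df = pd.DataFrame(np.random.uniform(1, 10, size=(5, 2)), columns=['MID_X', 'MID_Y'])
wide = add_radius_counts(df, [3, 6, 9])
print(wide)
```

Note that the loop in the question counts each point as its own neighbour, so dropping the `- 1` would reproduce those numbers instead. The intermediate boolean array has shape (number of radii, n, n), which stays modest for a few thousand points but grows quadratically with n.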