PySpark - Show the count of each column data type in a DataFrame

How can I view the count of each data type in a Spark DataFrame, the way I can with a pandas DataFrame?


For example, suppose df is a pandas DataFrame:


>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
int_col      5 non-null int64
text_col     5 non-null object
float_col    5 non-null float64
**dtypes: float64(1), int64(1), object(1)**
memory usage: 200.0+ bytes

Here we can clearly see the count of each data type. How can I do something similar with a Spark DataFrame? That is, how can I see how many columns are float, how many are int, and how many are object (string)?


素胚勾勒不出你
Viewed 261 times · 3 Answers

茅侃侃

The code below should give you the result you want:

# create a data frame
df = sqlContext.createDataFrame(
    [(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
     (2,'N','Y',2,1,2,3,'N','Y','Y','N'),
     (3,'Y','N',3,1,0,0,'N','N','N','N'),
     (4,'N','Y',5,0,1,0,'N','N','N','Y'),
     (5,'Y','N',2,2,0,1,'Y','N','N','Y'),
     (6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
     (7,'N','N',1,1,3,4,'N','Y','N','Y'),
     (8,'Y','Y',1,1,2,0,'Y','Y','N','N')],
    ('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices',
     'customer', 'subscriber', 'circle', 'smb'))

# find the data types of the data frame
datatypes_List = df.dtypes

# querying datatypes_List gives each column and its data type as a tuple:
# [('id', 'bigint'), ('compatible', 'string'), ('product', 'string'),
#  ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'),
#  ('devices', 'bigint'), ('customer', 'string'), ('subscriber', 'string'),
#  ('circle', 'string'), ('smb', 'string')]

# create an empty dictionary to store the counts
dict_count = {}

# count the number of times each data type appears in the data frame
for x, y in datatypes_List:
    dict_count[y] = dict_count.get(y, 0) + 1

# dict_count now maps each data type to the number of columns that have it
dict_count
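As a side note, the counting loop above is equivalent to using `collections.Counter` from the standard library. A minimal sketch, using a sample list of (name, type) tuples in place of a live Spark DataFrame's `df.dtypes` (the counting itself does not need Spark):

```python
from collections import Counter

# df.dtypes returns a list of (column_name, type_string) tuples;
# this sample list stands in for a real Spark DataFrame's dtypes
dtypes_list = [('id', 'bigint'), ('compatible', 'string'),
               ('ios', 'bigint'), ('customer', 'string'),
               ('score', 'double')]

# count how many columns share each Spark SQL type string
dict_count = dict(Counter(dtype for _, dtype in dtypes_list))
print(dict_count)  # {'bigint': 2, 'string': 2, 'double': 1}
```

With a real DataFrame you would pass `df.dtypes` instead of the sample list; the result is the same type-to-count mapping the loop builds.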
