How do I remove duplicates when using write.partitionBy on a PySpark DataFrame?


I have a DataFrame that looks like this:

|------------|-----------|---------------|---------------|
| Name       | Type      | Attribute 1   | Attribute 2   |
|------------|-----------|---------------|---------------|
| Roger      | A         | X             | Y             |
| Roger      | A         | X             | Y             |
| Roger      | A         | X             | Y             |
| Rafael     | A         | G             | H             |
| Rafael     | A         | G             | H             |
| Rafael     | B         | G             | H             |
|------------|-----------|---------------|---------------|

I want to partition this DataFrame by Name and Type and save it to disk.

The current line of code looks like this:

df.write.partitionBy("Name", "Type").mode("append").csv("output/", header=True)

The output is saved correctly, but it contains duplicate rows, as shown below.

In the folder /output/Roger/A:

|---------------|---------------|
| Attribute 1   | Attribute 2   |
|---------------|---------------|
| X             | Y             |
| X             | Y             |
| X             | Y             |
|---------------|---------------|

In the folder /output/Rafael/A:

|---------------|---------------|
| Attribute 1   | Attribute 2   |
|---------------|---------------|
| G             | H             |
| G             | H             |
|---------------|---------------|

In the folder /output/Rafael/B:

|---------------|---------------|
| Attribute 1   | Attribute 2   |
|---------------|---------------|
| G             | H             |
|---------------|---------------|

As you can see, these CSV files contain duplicates. How can I remove them when using write.partitionBy?

慕桂英3389331

1 Answer

狐的传说

Use .distinct() before writing:

df.distinct().write.partitionBy("Name", "Type").mode("append").csv("output/", header=True)
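To make the idea above a bit more concrete, here is a minimal sketch that wraps the deduplicate-then-write step in a helper. The function name `write_deduplicated` and its parameters `partition_cols` and `subset` are hypothetical, introduced only for illustration; the `subset` branch uses `dropDuplicates`, which is not part of the answer above but may be useful if rows should count as duplicates when only some columns match.

```python
from typing import TYPE_CHECKING, Optional, Sequence

if TYPE_CHECKING:  # imported for type hints only, so pyspark is not needed at import time
    from pyspark.sql import DataFrame


def write_deduplicated(
    df: "DataFrame",
    path: str,
    partition_cols: Sequence[str] = ("Name", "Type"),
    subset: Optional[Sequence[str]] = None,
) -> None:
    """Drop duplicate rows, then write partitioned CSV output.

    Hypothetical helper for this sketch; not part of the PySpark API.
    """
    if subset is None:
        # Rows are duplicates only if every column matches.
        deduped = df.distinct()
    else:
        # Rows are duplicates if the listed columns match.
        deduped = df.dropDuplicates(list(subset))

    (
        deduped.write
        .partitionBy(*partition_cols)  # one directory level per partition column
        .mode("append")                # keeps files already under `path`
        .csv(path, header=True)
    )
```

With the DataFrame from the question, this would be called as `write_deduplicated(df, "output/")`. Note that deduplication only applies to the rows in the DataFrame being written: because the mode is "append", rows written to the same folders by an earlier run stay on disk, so running the job twice can still leave duplicates across files.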
