Spark Java: How do you compare schemas when the columns are in a different order?

Following on from this question, I am now running this code:


List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
StructType schema1 = DataTypes.createStructType(fields);
Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);

fields = new ArrayList<>();
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
StructType schema2 = DataTypes.createStructType(fields);
Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);

finalDf1.printSchema();
finalDf2.printSchema();
System.out.println(finalDf1.schema());
System.out.println(finalDf2.schema());
System.out.println(finalDf1.schema().equals(finalDf2.schema()));

This is the output:


root
 |-- A: long (nullable = true)
 |-- B: double (nullable = true)

root
 |-- B: double (nullable = true)
 |-- A: long (nullable = true)

StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
false

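The columns are arranged in a different order, but the two datasets have exactly the same columns and column types. What comparison is needed here to get true?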


一只萌萌小番薯
3 answers

慕沐林林

This assumes that the column order does not match, that identical names carry identical semantics, and that both sides must have the same number of columns. Here is an example in Scala, which you should be able to adapt to Java:

import spark.implicits._

val df = sc.parallelize(Seq(
        ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
        ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
        )).toDF("c1", "c2", "Val1", "Val2")
val names = df.columns

val df2 = sc.parallelize(Seq(
       ("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
val names2 = df2.columns

names.sortWith(_ < _) sameElements names2.sortWith(_ < _)

This returns true or false; experiment with the inputs.
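A Java adaptation of the sorted-names comparison above might look like the following sketch. It works on plain `String[]` arrays so that it runs without Spark; in a Spark job the arrays would presumably come from `Dataset.columns()` or `StructType.fieldNames()`. The class and method names (`ColumnNameCompare`, `sameColumnNames`) are made up for illustration, and like the Scala version this checks names only, not types.

```java
import java.util.Arrays;

class ColumnNameCompare {
    // Returns true when both arrays hold the same names, regardless of order.
    // The inputs are cloned so the callers' arrays are not reordered.
    static boolean sameColumnNames(String[] first, String[] second) {
        String[] a = first.clone();
        String[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        String[] names1 = {"A", "B"};
        String[] names2 = {"B", "A"};
        System.out.println(sameColumnNames(names1, names2));                  // true
        System.out.println(sameColumnNames(names1, new String[] {"A", "C"})); // false
    }
}
```

Sorting both sides before comparing also keeps duplicate names significant, which a set-based check would silently collapse.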

慕标5832272

If they are not in the same order, they are not the same, even if both have the same number of columns and the same names. If you want to check whether two schemas have the same column names, get the field names from both DataFrames' schemas as lists and write code to compare them. See the Java example below:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;

// ConfigConstants is the answerer's own class holding the schema fields (not shown here).
public static void main(String[] args)
{
    List<String> firstSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
    List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());

    if(schemasHaveTheSameColumnNames(firstSchema,secondSchema))
    {
        System.out.println("Yes, schemas have the same column names");
    }else
    {
        System.out.println("No, schemas do not have the same column names");
    }
}

// True when both lists have the same size and every name in the second appears in the first.
private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema)
{
    if(firstSchema.size() != secondSchema.size())
    {
        return false;
    }else
    {
        for (String column : secondSchema)
        {
            if(!firstSchema.contains(column))
                return false;
        }
    }
    return true;
}

梦里花落0921

Building on the previous answers, the quickest way to compare the StructFields (columns and types) rather than just the names seems to be the following:

Set<StructField> set1 = new HashSet<>(Arrays.asList(schema1.fields()));
Set<StructField> set2 = new HashSet<>(Arrays.asList(schema2.fields()));
boolean result = set1.equals(set2);
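One thing to keep in mind with the set comparison is that `StructField` equality also takes nullability and metadata into account, so two schemas whose columns differ only in `nullable` would compare as different. If only names and types should matter, a name-to-type map is one possible alternative. The sketch below uses a hypothetical `Field` class in place of `StructField` so that it runs without Spark; with Spark, the two strings could presumably be taken from `StructField.name()` and `StructField.dataType().simpleString()`. All class and method names here are illustrative assumptions, not part of Spark.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchemaTypeCompare {
    // Hypothetical stand-in for Spark's StructField (name, type, nullable).
    static final class Field {
        final String name;
        final String type;
        final boolean nullable;

        Field(String name, String type, boolean nullable) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
        }
    }

    // Builds a name -> type map; nullability is deliberately left out.
    // Duplicate column names are rejected rather than silently merged.
    static Map<String, String> nameToType(List<Field> fields) {
        Map<String, String> result = new HashMap<>();
        for (Field f : fields) {
            if (result.put(f.name, f.type) != null) {
                throw new IllegalArgumentException("duplicate column: " + f.name);
            }
        }
        return result;
    }

    // True when both schemas have the same column names with the same types,
    // in any order and regardless of nullability.
    static boolean sameNamesAndTypes(List<Field> first, List<Field> second) {
        return nameToType(first).equals(nameToType(second));
    }

    public static void main(String[] args) {
        List<Field> schema1 = Arrays.asList(
                new Field("A", "bigint", true), new Field("B", "double", true));
        List<Field> schema2 = Arrays.asList(
                new Field("B", "double", false), new Field("A", "bigint", true));
        System.out.println(sameNamesAndTypes(schema1, schema2)); // true
    }
}
```

Whether nullability should count is a judgment call that depends on what the comparison is for; when it should, the set-based check above already covers it.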