Requirement
A table has an abtest column holding a JSON string with nested structure, and its key:value pairs may repeat. The task: flatten all key:value pairs inside abtest, deduplicate them, strip the double quotes, and return them joined by commas. An example is required.
The input abtest looks like this:
{"trip_ab_deal_packagerankst":"B", "trip_ab_poivideo":"C", "trip_ab_group_fengchao_zhinan":"A", "trip_ab_BookingProduct":"A", "trip_ab_OptimalGoods_2018":"B", "trip_ab_poitoubuyouhua":"B", "trip_ab_group_fengchao":{"trip_ab_group_fengchao_zhinan":"A"}, "trip_ab_group_BookingProduct":{"trip_ab_BookingProduct":"A"}, "trip_ab_group_common":{"trip_ab_poivideo":"C"}, "trip_ab_group_common":{"trip_ab_poitoubuyouhua":"B"}, "trip_ab_group_common":{"trip_ab_deal_packagerankst":"B"}, "trip_ab_group_OptimalGoods_2018":{"trip_ab_OptimalGoods_2018":"B"}}
The expected output is:
trip_ab_deal_packagerankst:B,trip_ab_poivideo:C,trip_ab_group_fengchao_zhinan:A,trip_ab_BookingProduct:A,trip_ab_OptimalGoods_2018:B,trip_ab_poitoubuyouhua:B
Solution
First, split the abtest JSON string on commas and flatten the resulting array into one row per element:
lateral view explode(split(abtest,',')) b as f2
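To see what this step yields, the split can be run on its own. The sketch below is illustrative and not from the original post; the placeholder '<abtest json>' stands for the sample string above:

select f2
from (select '<abtest json>' as abtest) a
lateral view explode(split(abtest, ',')) b as f2;

-- With the (space-free) test-case input, the rows include fragments such as:
--   {"trip_ab_deal_packagerankst":"B"
--   "trip_ab_group_fengchao":{"trip_ab_group_fengchao_zhinan":"A"}
--   "trip_ab_group_OptimalGoods_2018":{"trip_ab_OptimalGoods_2018":"B"}}
-- Fragments coming from nested objects carry stray braces, which is exactly
-- what the regular expression in the next step has to cope with.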
Then parse each flattened element with a regular expression to pull out the "key":"value" fragment, and strip the double quotes:
regexp_replace(
  -- Regex design notes:
  -- To match "key":"value" pairs, start from (".*":".*").
  -- One element may contain several pairs, so the value must be matched
  -- non-greedily: (".*":".*?").
  -- Elements like "trip_ab_group_OptimalGoods_2018":{"trip_ab_OptimalGoods_2018":"B"}}
  -- also occur; the pair we actually want there is "trip_ab_OptimalGoods_2018":"B",
  -- so the key must not contain quotes: ("[^"]*"\:".*?")
  regexp_extract(f2, '("[^"]*"\:".*?")', 0),
  '"', ''
)
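As a sanity check on the trickiest fragment, the expression can be run standalone (this call is an illustration, not from the original post):

select regexp_replace(
         regexp_extract('"trip_ab_group_OptimalGoods_2018":{"trip_ab_OptimalGoods_2018":"B"}}',
                        '("[^"]*"\:".*?")', 0),
         '"', '');
-- → trip_ab_OptimalGoods_2018:B
-- The outer key is skipped because its value starts with { rather than ",
-- so the first full match is the inner "trip_ab_OptimalGoods_2018":"B" pair.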
Finally, aggregate the extracted values with collect_set (an aggregate function, used here with GROUP BY), which also removes duplicates, and join them with commas via concat_ws:
concat_ws(',',collect_set(regexp_replace(regexp_extract(f2,'("[^"]*"\:".*?")',0),'"','')))
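The deduplication comes from collect_set itself, which keeps only distinct elements (collect_list would keep duplicates). A minimal standalone illustration, not from the original post:

select concat_ws(',', collect_set(x))
from (select explode(array('a:1', 'b:2', 'a:1')) as x) t;
-- → a:1,b:2  (collect_set does not guarantee element order)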
Test case
hive> select concat_ws(',',collect_set(regexp_replace(regexp_extract(f2,'("[^"]*"\:".*?")',0),'"','')))
    > from (
    >   select 1 as rk,
    >   '{"trip_ab_deal_packagerankst":"B","trip_ab_poivideo":"C","trip_ab_group_fengchao_zhinan":"A","trip_ab_BookingProduct":"A","trip_ab_OptimalGoods_2018":"B","trip_ab_poitoubuyouhua":"B","trip_ab_group_fengchao":{"trip_ab_group_fengchao_zhinan":"A"},"trip_ab_group_BookingProduct":{"trip_ab_BookingProduct":"A"},"trip_ab_group_common":{"trip_ab_poivideo":"C"},"trip_ab_group_common":{"trip_ab_poitoubuyouhua":"B"},"trip_ab_group_common":{"trip_ab_deal_packagerankst":"B"},"trip_ab_group_OptimalGoods_2018":{"trip_ab_OptimalGoods_2018":"B"}}' as abtest
    > ) a lateral view explode(split(abtest,',')) b as f2
    > GROUP BY a.rk;
Query ID = gaowenfeng_20180927192959_991bcfd6-ead7-4ca6-a7b0-c542828f2d0e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1537411935549_0008, Tracking URL = http://gaowenfengdeMacBook-Pro.local:8088/proxy/application_1537411935549_0008/
Kill Command = /Users/gaowenfeng/software/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1537411935549_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-27 19:30:04,677 Stage-1 map = 0%, reduce = 0%
2018-09-27 19:30:08,794 Stage-1 map = 100%, reduce = 0%
2018-09-27 19:30:14,951 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1537411935549_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 12371 HDFS Write: 158 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
trip_ab_deal_packagerankst:B,trip_ab_poivideo:C,trip_ab_group_fengchao_zhinan:A,trip_ab_BookingProduct:A,trip_ab_OptimalGoods_2018:B,trip_ab_poitoubuyouhua:B
Time taken: 16.953 seconds, Fetched: 1 row(s)
hive>
Final SQL
select a.event_identifier,
       concat_ws(',',collect_set(regexp_replace(regexp_extract(f2,'("[^"]*"\:".*?")',0),'"',''))),
       MAX(abtest)
from our_table a
lateral view explode(split(abtest,',')) b as f2
WHERE a.datekey = 20180926
GROUP BY a.event_identifier
LIMIT 100
Author: Meet相识_bfa5
Link: https://www.jianshu.com/p/7cc068ad9e44