Merge rows with matching keys

I have a text file with the following structure:


ID,operator,a,b,c,d,true

WCBP12236,J1,75.7,80.6,65.9,83.2,82.1

WCBP12236,J2,76.3,79.6,61.7,81.9,82.1

WCBP12236,S1,77.2,81.5,69.4,84.1,82.1

WCBP12236,S2,68.0,68.0,53.2,68.5,82.1

WCBP12234,J1,63.7,67.7,72.2,71.6,75.3

WCBP12234,J2,68.6,68.4,41.4,68.9,75.3

WCBP12234,S1,81.8,82.7,67.0,87.5,75.3

WCBP12234,S2,66.6,67.9,53.0,70.7,75.3

WCBP12238,J1,78.6,79.0,56.2,82.1,84.1

WCBP12239,J2,66.6,72.9,79.5,76.6,82.1

WCBP12239,S1,86.6,87.8,23.0,23.0,82.1

WCBP12239,S2,86.0,86.9,62.3,89.7,82.1

WCBP12239,J1,70.9,71.3,66.0,73.7,82.1

WCBP12238,J2,75.1,75.2,54.3,76.4,84.1

WCBP12238,S1,65.9,66.0,40.2,66.5,84.1

WCBP12238,S2,72.7,73.2,52.6,73.9,84.1

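Each ID corresponds to a dataset that an operator analyses several times, i.e. J1 and J2 are the first and second attempts by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value is given in the column true.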


What I would like to do is create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2:


ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true

WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1

WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3

where a1 is measure a for J1, a2 is measure a for J2, and so on.


Another example, for S1 vs S2:


ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true

WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1

WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3

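The IDs will not be in alphanumeric order, nor will the operators be clustered together for the same ID. I'm not sure how best to approach this task: with Linux tools or with a scripting language like Perl or Python.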


My initial attempt using Linux tools quickly hit a brick wall.


First, find all unique IDs (sorted):


awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids

Then loop through these IDs and sort J1, J2:


foreach i (`more unique_ids`)

    grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2

end

This gives me the data sorted:


WCBP12234,J1,63.7,67.7,72.2,71.6,75.3

WCBP12234,J2,68.6,68.4,41.4,68.9,80.4

WCBP12236,J1,75.7,80.6,65.9,83.2,82.1

WCBP12236,J2,76.3,79.6,61.7,81.9,82.1

WCBP12238,J1,78.6,79.0,56.2,82.1,82.1

WCBP12238,J2,75.1,75.2,54.3,76.4,82.1

WCBP12239,J1,70.9,71.3,66.0,73.7,75.3

WCBP12239,J2,66.6,72.9,79.5,76.6,75.3

I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional awk pipe inside the foreach loop: awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'


Any ideas? I'm sure this can be done in a less cumbersome way using awk, although a proper scripting language may be the better choice.


一只名叫tom的猫
3 answers

红糖糍粑

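You can use the Perl CSV module Text::CSV to extract the fields, and then store them in a hash, where the ID is the main key, the second field is the secondary key, and all the fields are stored as the value. It should then be straightforward to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.

use strict;
use warnings;
use Text::CSV;

my %data;
my $csv = Text::CSV->new({
    binary => 1,      # safety precaution
    eol    => $/,     # important when using $csv->print()
});

while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;   # first two fields
    $data{$id}{$J} = $row;  # store line
}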
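For comparison, the same two-level lookup (ID first, then operator) can be sketched in Python with the standard csv module. This is only an illustration of the idea, not a port of the Perl answer: the function name `load_rows`, the `SAMPLE` text, the decision to skip the header line, and the separate `order` list for remembering first-seen order are all assumptions made for this sketch.

```python
import csv
import io

# Sample rows copied from the question; a real script would open the file instead.
SAMPLE = """\
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
"""

def load_rows(handle):
    """Return (data, order): data[id][operator] -> full row, order -> IDs as first seen."""
    data = {}
    order = []
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first line is a header and is skipped
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        row_id, operator = row[0], row[1]
        if row_id not in data:
            data[row_id] = {}
            order.append(row_id)  # remember the original order of the IDs
        data[row_id][operator] = row
    return data, order

data, order = load_rows(io.StringIO(SAMPLE))
print(order)
print(data["WCBP12234"]["J2"])
```

With the rows stored this way, any pair of operators for a given ID can be looked up directly, regardless of where the lines appeared in the file.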

SMILET

I didn't use Text::CSV as TLP did. You could if you needed to, but for this example I thought a simple split on ',' would do, since there are no embedded commas in the fields. Also, the true fields from both operators are listed (instead of just 1), because I think special-casing the last value complicates the solution.

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/ mesh /;

my %data;
while (<DATA>) {
    chomp;
    my ($id, $op, @vals) = split /,/;
    $data{$id}{$op} = \@vals;
}

my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]);

for my $id (sort keys %data) {
    for my $comb (@ops) {
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        print $fh join(",", $id, mesh(@$a1, @$a2)), "\n";
        close $fh or die $!;
    }
}

__DATA__
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1

The output files produced are as follows.

J1 J2.txt

WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1

S1 S2.txt

WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1

J1 S1.txt

WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1

Update: to get only 1 true value, the for loop could be written as:

for my $id (sort keys %data) {
    for my $comb (@ops) {
        local $" = '';   # join the operator names without a space in the file name
        open my $fh, ">>", "@$comb.txt" or die $!;
        my $a1 = $data{$id}{ $comb->[0] };
        my $a2 = $data{$id}{ $comb->[1] };
        pop @$a2;        # drop the second operator's true value
        my @mesh = grep defined, mesh(@$a1, @$a2);
        print $fh join(",", $id, @mesh), "\n";
        close $fh or die $!;
    }
}

Update: added 'defined' to the grep expression, as that is the proper test (rather than testing just '$_', which could be 0 and be wrongly excluded from the list by grep).
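The interleaving step with a single true column can also be sketched in Python. This is a hedged illustration rather than a translation of the Perl above: the helper names `interleave_pair` and `compare` are made up for this example, the small `data` dictionary is typed in from the question's rows, and skipping IDs that lack one of the two operators is an assumption about the desired behaviour. Unlike `pop`, the unpacking used here leaves the stored lists untouched, so the same data can be reused for every operator pair.

```python
def interleave_pair(vals1, vals2):
    """Interleave two operators' measures; keep the shared 'true' column once."""
    *m1, true1 = vals1    # measures a..d plus the true value for operator 1
    *m2, _true2 = vals2   # the second true value is dropped
    merged = [v for pair in zip(m1, m2) for v in pair]
    return merged + [true1]

def compare(data, op1, op2):
    """Build output lines for op1 vs op2, skipping IDs that lack either operator."""
    lines = []
    for row_id in sorted(data):
        ops = data[row_id]
        if op1 in ops and op2 in ops:
            lines.append(",".join([row_id] + interleave_pair(ops[op1], ops[op2])))
    return lines

# data[id][operator] -> values after the operator column (kept as strings)
data = {
    "WCBP12236": {
        "J1": ["75.7", "80.6", "65.9", "83.2", "82.1"],
        "J2": ["76.3", "79.6", "61.7", "81.9", "82.1"],
    },
    "WCBP12234": {
        "J1": ["63.7", "67.7", "72.2", "71.6", "75.3"],
        "J2": ["68.6", "68.4", "41.4", "68.9", "75.3"],
    },
}

for line in compare(data, "J1", "J2"):
    print(line)
```

Writing the three comparison files would then amount to calling `compare` once per operator pair and writing the returned lines to a file of your choosing.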

慕哥6287543

The Python way (the code below uses Python 2 syntax):

import os,sys, re, itertools

info=["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
  "WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
  "WCBP12236,S1,77.2,81.5,69.4,84.1,82.1",
  "WCBP12236,S2,68.0,68.0,53.2,68.5,82.1",
  "WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
  "WCBP12234,J2,68.6,68.4,41.4,68.9,80.4",
  "WCBP12234,S1,81.8,82.7,67.0,87.5,75.3",
  "WCBP12234,S2,66.6,67.9,53.0,70.7,72.7",
  "WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
  "WCBP12239,J2,66.6,72.9,79.5,76.6,75.3",
  "WCBP12239,S1,86.6,87.8,23.0,23.0,82.1",
  "WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
  "WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
  "WCBP12238,J2,75.1,75.2,54.3,76.4,82.1",
  "WCBP12238,S1,65.9,66.0,40.2,66.5,80.4",
  "WCBP12238,S2,72.7,73.2,52.6,73.9,72.7" ]

def extract_data(operator_1, operator_2):
    operator_index=1
    id_index=0
    data={}
    result=[]
    ret=[]
    for line in info:
        conv_list=line.split(",")
        if len(conv_list) > operator_index and ((operator_1.strip().upper() == conv_list[operator_index].strip().upper()) or (operator_2.strip().upper() == conv_list[operator_index].strip().upper()) ):
            if data.has_key(conv_list[id_index]):
                iters = [iter(conv_list[int(operator_index)+1:]), iter(data[conv_list[id_index]])]
                data[conv_list[id_index]]=list(it.next() for it in itertools.cycle(iters))
                continue
            data[conv_list[id_index]]=conv_list[int(operator_index)+1:]
    return data

ret=extract_data("j1", "s2")
print ret

O/P:

{'WCBP12239': ['70.9', '86.0', '71.3', '86.9', '66.0', '62.3', '73.7', '89.7', '75.3', '82.1'], 'WCBP12238': ['72.7', '78.6', '73.2', '79.0', '52.6', '56.2', '73.9', '82.1', '72.7', '82.1'], 'WCBP12234': ['66.6', '63.7', '67.9', '67.7', '53.0', '72.2', '70.7', '71.6', '72.7', '75.3'], 'WCBP12236': ['68.0', '75.7', '68.0', '80.6', '53.2', '65.9', '68.5', '83.2', '82.1', '82.1']}
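The answer above depends on Python 2 features (`has_key`, `it.next()`, the `print` statement), and its interleaving order follows whichever operator happens to appear later in the input, as the output for WCBP12238 shows. A possible Python 3 variant is sketched below. It is an assumption-laden rewrite, not the answerer's code: it takes the lines as a parameter, always puts `operator_1` first, and leaves out any ID that lacks one of the two operators. The shortened `info` list reuses four rows from the answer.

```python
from itertools import chain

def extract_data(lines, operator_1, operator_2):
    """Map each ID to operator_1/operator_2 values interleaved, operator_1 first."""
    wanted = (operator_1.strip().upper(), operator_2.strip().upper())
    per_id = {}
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3 or fields[1].upper() not in wanted:
            continue  # not one of the two operators being compared
        per_id.setdefault(fields[0], {})[fields[1].upper()] = fields[2:]
    result = {}
    for row_id, ops in per_id.items():
        if all(op in ops for op in wanted):  # both operators are needed
            pairs = zip(ops[wanted[0]], ops[wanted[1]])
            result[row_id] = list(chain.from_iterable(pairs))
    return result

info = [
    "WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
    "WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
    "WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
    "WCBP12238,S2,72.7,73.2,52.6,73.9,72.7",
]

print(extract_data(info, "j1", "s2"))
```

Collecting both operators per ID before interleaving keeps the column order independent of the order of the lines in the file, which matters here because the question states that the operators are not clustered or sorted.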