我有一个像这样的大文本文件:
#RefName Pos Coverage
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 0 0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 1 0
lcl|LGDX01000053.1_cds_KOV95322.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 2 1
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 3 0
lcl|LGDX01000053.1_cds_KOV95323.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 4 0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 5 0
lcl|LGDX01000053.1_cds_KOV95324.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 6 101
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 7 10
lcl|LGDX01000053.1_cds_KOV95325.1_1 [locus_tag=ADL02_09560] [protein=MerR family transcriptional regulator] [protein_id=KOV95322.1] [location=complement(1866..2243)] [gbkey=CDS] 8 0
第一行是标题,可以忽略或删除。我有两个不同的目标:
1) 我想提取最后一列中值不是 0 的所有行。 2) 我想按第一列分组,并在分组文件中:删除第二列,并对最后一列求和。
我知道如何在 Pandas 中执行这些操作,但是文件大于 10G,加载到 Pandas 本身很痛苦。
有没有干净的方法来做这些?喜欢使用 bash 或 awk 什么的?
谢谢!
一只名叫tom的猫
慕尼黑5688855
相关分类