使用 ValidateSamFile确定bam文件是否有问题？

Picard ValidateSamFile

Validates a SAM or BAM file.

ValidateSamFile下有两种mode:
VERBOSE : 检查到100个错误之后退出，并且输出错误到终端；
SUMMARYL: 输出结果是一个表格，展示errors和warnings的数目；

$ java -jar picard.jar ValidateSamFile \
   I=input.bam \
   MODE=SUMMARY   
Error Type  Count
ERROR:MISSING_READ_GROUP    1WARNING:RECORD_MISSING_READ_GROUP   1287903

问题在bam文件 MISSING_READ_GROUP；请自动屏蔽WARNING结果；

获取GATK格式bam文件

方法法1

bwa 比对使用参数-R

-R STR: Complete read group header line. ’\t’ can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’. [null]

bwa -R 参数的使用

# 不加参数Rbwa mem ref.fa read1.fq read2.fq > aln-pe.sam# 加参数Rbwa mem -R '@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n' ref.fa read1.fq read2.fq > aln-pe.R.sam'@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n'# ID:每一个Read group独有的ID；# LB:DNA preparation library identifier# PL:测序使用的平台： ILLUMINA, SOLID, LS454, HELICOS and PACBIO;# PU:Platform Unit,由三部分组成： {FLOWCELL_BARCODE}.{LANE};# SM:样品名read group参数详细信息见本文后半部分

注：GATK 使用时，PU不是必须要求的；但是PU与ID同时存在时，PU优先级高于ID。# 查看文件aln-pe.sam与aln-pe.R.sam中的groupsamtools view -H aln-pe.sam  | grep '@RG'# 不会有输出samtools view -H aln-pe.R.sam  | grep '@RG'@RG ID:group_n  LB:library_n    PL:illumina PU:unit1    SM:sample_n# 多个比对的时候，使用循环ls *.gRNA.fastq | while read id ;dobwa mem -R "@RG\tID:group_${id}\tLB:library_${id}\tPL:illumina\tSM:sample_${id}" ref.fa $id > ${id}.R.samdone

方法2
Picard AddOrReplaceReadGroups

Replace read groups in a BAM file.This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.

$ java -jar picard.jar AddOrReplaceReadGroups \
       I=input.bam \
       O=output.bam \
       RGID=4 \
       RGLB=library1 \
       RGPL=illumina \
       RGPU=unit1 \
       RGSM=sample1
Option  Description
INPUT (String)  Input file (BAM or SAM or a GA4GH url). Required.
OUTPUT (File)   Output file (BAM or SAM). Required.
SORT_ORDER (SortOrder)  Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}
RGID (String)   Read Group ID Default value: 1. This option can be set to 'null' to clear the default value.
RGLB (String)   Read Group library Required.
RGPL (String)   Read Group platform (e.g. illumina, solid) Required.
RGPU (String)   Read Group platform unit (eg. run barcode) Required.
RGSM (String)   Read Group sample name Required.

Read groups详细介绍

英文原版查看

There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

测序时：
样本建一个库在同一条lane上完成测序产生reads sets可定义为一个Read group;
样本建成分开独立的库测序得到的reads sets也就分属于不同的Read groups;
存在于SAM/BAM /CRAM 文件中Read groups是由一系列标签组成；这些标签代表着测序中的一些技术特征；有了这些数据之后，大家就可以对bam文件进行重复序列标识和碱基质量重新矫正。
GATK要求输入的bam文件包含Read groups，如果没有就会报错。

例子：

samtools view -H sample.bam | grep '@RG'@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

GATk 要求read group的格式
Read group是@RG开始。

ID = Read group identifier
每一个Read group独有的ID；
Illumina 测序数据中，read group IDs由flowcell ，lane name 和number组成。
在矫正碱基质量时，read group IDs对区分技术批次效应是必须的；在这过程中，同一read group的reads假定为有一样的技术误差。

PU = Platform Unit
Platform Unit由三部分组成： {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
{FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell;
The {LANE} indicates the lane of the flow cell ;
The {SAMPLE_BARCODE} is a sample/library-specific identifier;
GATK 使用时，PU不是必须要求的；但是PU与ID同时存在时，PU优先级高于ID。

SM = Sample
reads属于的样品名；SM要设定正确，因为GATK产生的VCF文件也使用这个名字。

PL= Platform/technology used to produce the read
测序使用的平台： ILLUMINA, SOLID, LS454, HELICOS and PACBIO。

LB = DNA preparation library identifier
对一个read group的reads进行重复序列标记时，需要使用LB来区分reads来自那条lane;有时候，同一个库可能在不同的lane上完成测序;为了加以区分，同一个或不同库只要是在不同的lane产生的reads都要单独给一个ID。

从read names中提取ID与PU

H0164ALXX140820:2:1101:10003:23460H0164ALXX140820:2:1101:15118:25288H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane

例子-Multi-sample and multiplexed example
三个样品：MOM, DAD, KID；
建库：每个样品分别建两个库，一个insert为200bp，一个insert为400bp；
上机：每个测序库使用Illumina HiSeq的两条lane；
reads：来自 3 x 2 x 2 = 12条lane,可以产生12个bam文件，结果如下：

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

#解释
样本DAD建了两个库LIB-DAD-1(200bp)和LIB-DAD-2(400bp)，使用了FLOWCELL1上面的4条lane;

Mom's data:@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

作者：_eason_
链接：https://www.jianshu.com/p/9a29bfc87a50

GATK - Read groups

使用 ValidateSamFile确定bam文件是否有问题？

获取GATK格式bam文件

Read groups详细介绍