使用 ValidateSamFile确定bam文件是否有问题?
Picard ValidateSamFile
Validates a SAM or BAM file.
ValidateSamFile下有两种mode:
VERBOSE : 检查到100个错误之后退出,并且输出错误到终端;
SUMMARYL: 输出结果是一个表格,展示errors和warnings的数目;
$ java -jar picard.jar ValidateSamFile \ I=input.bam \ MODE=SUMMARY Error Type Count ERROR:MISSING_READ_GROUP 1WARNING:RECORD_MISSING_READ_GROUP 1287903
问题在bam文件 MISSING_READ_GROUP;请自动屏蔽WARNING结果;
获取GATK格式bam文件
方法法1
bwa 比对使用参数-R
-R STR: Complete read group header line. ’\t’ can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is ’@RG\tID:foo\tSM:bar’. [null]
bwa -R 参数的使用
# 不加参数Rbwa mem ref.fa read1.fq read2.fq > aln-pe.sam# 加参数Rbwa mem -R '@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n' ref.fa read1.fq read2.fq > aln-pe.R.sam'@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n'# ID:每一个Read group独有的ID;# LB:DNA preparation library identifier# PL:测序使用的平台: ILLUMINA, SOLID, LS454, HELICOS and PACBIO;# PU:Platform Unit,由三部分组成: {FLOWCELL_BARCODE}.{LANE};# SM:样品名read group参数详细信息见本文后半部分 注:GATK 使用时,PU不是必须要求的;但是PU与ID同时存在时,PU优先级高于ID。# 查看文件aln-pe.sam与aln-pe.R.sam中的groupsamtools view -H aln-pe.sam | grep '@RG'# 不会有输出samtools view -H aln-pe.R.sam | grep '@RG'@RG ID:group_n LB:library_n PL:illumina PU:unit1 SM:sample_n# 多个比对的时候,使用循环ls *.gRNA.fastq | while read id ;dobwa mem -R "@RG\tID:group_${id}\tLB:library_${id}\tPL:illumina\tSM:sample_${id}" ref.fa $id > ${id}.R.samdone
方法2
Picard AddOrReplaceReadGroups
Replace read groups in a BAM file.This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.
$ java -jar picard.jar AddOrReplaceReadGroups \ I=input.bam \ O=output.bam \ RGID=4 \ RGLB=library1 \ RGPL=illumina \ RGPU=unit1 \ RGSM=sample1 Option Description INPUT (String) Input file (BAM or SAM or a GA4GH url). Required. OUTPUT (File) Output file (BAM or SAM). Required. SORT_ORDER (SortOrder) Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown} RGID (String) Read Group ID Default value: 1. This option can be set to 'null' to clear the default value. RGLB (String) Read Group library Required. RGPL (String) Read Group platform (e.g. illumina, solid) Required. RGPU (String) Read Group platform unit (eg. run barcode) Required. RGSM (String) Read Group sample name Required.
Read groups详细介绍
There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.
测序时:
样本建一个库在同一条lane上完成测序产生reads sets可定义为一个Read group;
样本建成分开独立的库测序得到的reads sets也就分属于不同的Read groups;
存在于SAM/BAM /CRAM 文件中Read groups是由一系列标签组成;这些标签代表着测序中的一些技术特征;有了这些数据之后,大家就可以对bam文件进行重复序列标识和碱基质量重新矫正。
GATK要求输入的bam文件包含Read groups,如果没有就会报错。
例子:
samtools view -H sample.bam | grep '@RG'@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI
GATk 要求read group的格式
Read group是@RG开始。
ID = Read group identifier
每一个Read group独有的ID;
Illumina 测序数据中,read group IDs由flowcell ,lane name 和number组成。
在矫正碱基质量时,read group IDs对区分技术批次效应是必须的;在这过程中,同一read group的reads假 定为有一样的技术误差。
PU = Platform Unit
Platform Unit由三部分组成: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
{FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell;
The {LANE} indicates the lane of the flow cell ;
The {SAMPLE_BARCODE} is a sample/library-specific identifier;
GATK 使用时,PU不是必须要求的;但是PU与ID同时存在时,PU优先级高于ID。
SM = Sample
reads属于的样品名;SM要设定正确,因为GATK产生的VCF文件也使用这个名字。
PL= Platform/technology used to produce the read
测序使用的平台: ILLUMINA, SOLID, LS454, HELICOS and PACBIO。
LB = DNA preparation library identifier
对一个read group的reads进行重复序列标记时,需要使用LB来区分reads来自那条lane;有时候,同一个库可能在不同的lane上完成测序;为了加以区分,同一个或不同库只要是在不同的lane产生的reads都要单独给一个ID。
从read names中提取ID与PU
H0164ALXX140820:2:1101:10003:23460H0164ALXX140820:2:1101:15118:25288H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell _____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run _______________:2 #portion of @RG ID and PU fields indicating flow cell lane
例子-Multi-sample and multiplexed example
三个样品:MOM, DAD, KID;
建库:每个样品分别建两个库,一个insert为200bp,一个insert为400bp;
上机:每个测序库使用Illumina HiSeq的 两条lane;
reads:来自 3 x 2 x 2 = 12条lane,可以产生12个bam文件,结果如下:
Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 #解释 样本DAD建了两个库LIB-DAD-1(200bp)和LIB-DAD-2(400bp),使用了FLOWCELL1上面的4条lane; Mom's data:@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
作者:_eason_
链接:https://www.jianshu.com/p/9a29bfc87a50