Bioconductor 分析基因芯片数据(转帖）-原创手记-慕课网

原文地址：https://www.jianshu.com/p/50bcf4ba9d8a

（代码太多，请看原贴）

还有系统的使用affy处理芯片，看原贴：

https://www.jianshu.com/p/859a7a7d040f

https://www.jianshu.com/p/ccafb8a184c3

https://www.jianshu.com/p/f44dc5d39bed

https://www.jianshu.com/p/e867ed33f43e

https://www.jianshu.com/p/cc171c5faf46

1. 快速入门

安装并加载所需R包

source("http://bioconductor.org/biocLite.R");

biocLite(“CLL”)

# 载入CLL包，CLL包会自动调用affy包，该包含有一系列处理函数

library(CLL)# read example dataset，（CLL包附带的示例数据集）data("CLLbatch")# pre-process using RMA methodCLLrma<- rma(CLLbatch)# read expression value after pre-processinge <- exprs(CLLrma)# 查看部分数据e[1:5,1:5]

webp

image.png

2. 基因芯片数据预处理

2.1 数据输入

# 载入CLL包，CLL包会自动调用affy包，该包含有一系列处理函数library(CLL)# read example dataset，（CLL包附带的示例数据集）data("CLLbatch")# 查看数据类型data.class(CLLbatch)# 读入所有样本的状态信息data(disease)# 查看所有样本的状态信息disease

webp

image.png

# 查看"AffyBatch"的详细介绍help(AffyBatch)phenoData(CLLbatch)featureData(CLLbatch)protocolData(CLLbatch)annotation(CLLbatch)exprs_matrix <- assayData(CLLbatch)[[1]]exprs_matrix[1:5,1:5]exprs(CLLbatch)[1:5,1:5]

webp

image.png

2.2 质量控制

质量控制分三步，直观观察，平均值方法，数据拟合方法。这三个层次的质量控制分别由image 函数、simpleaffy 包和affyPLM包实现

2.2.1 用image包对芯片数据进行质量评估

# 查看第一张芯片的灰度图像image(CLLbatch[,1])

如果图像特别黑，说明型号强度低；如果图像特别亮，说明信号强度很有可能过饱和

webp

image.png

2.2.2 用simpleaffy包对芯片数据进行质量评估

# 安装simpleaffy包source('http://Bioconductor.org/biocLite.R')biocLite("simpleaffy")library(BiocInstaller)biocLite("simpleaffy")library(simpleaffy)library(CLL)data(CLLbatch)# 获取质量分析报告Data.qc <- qc(CLLbatch)# 图型化显示报告plot(Data.qc)

webp

image.png

第一列是所有样品的名称

第二列检出率和平均背景噪声（往往较高的平均背景噪声都伴随着较低的检出率）

第三列

蓝色实现为尺度因子，取值（-3，3）

"o" 不能超过1.25，否则数据质量不好

“三角型“不能超过3，否则数据质量不好

bioB 说明芯片检测没有达标

2.2.3 用affyPLM包对芯片数据进行质量评估

source('http://Bioconductor.org/biocLite.R')biocLite('affyPLM')library(affyPLM)data(CLLbatch)# 对数据集做回归分析，结果为一个PLMset类型的对象Pset <- fitPLM(CLLbatch)image(CLLbatch[,1])# 根据计算结果，画权重图image(Pset,type="weights",which=1, main="Weights")# 根据计算结果，画残差图image(Pset,type="resids",which=1, main="Residuals")# 根据计算结果，画残差符号图image(Pset,type="sign.resids",which=1, main="Residuals.sign")

webp

image.png

webp

image.png

2.2.4 查看总体质量

一个样品的所有探针组的RLE分布可以用一个统计学中常用的箱型图形表示

source('http://Bioconductor.org/biocLite.R')biocLite("RColorBrewer")library(affyPLM)library(RColorBrewer)library(CLL)# 读入数据data("CLLbatch")# 对数据集做回归计算Psel<-fitPLM(CLLbatch)# 载入一组颜色colors<-brewer.pal(12,"Set3")# 绘制RLE箱线图Mbox(Pset, ylim=c(-1,1),col=colors,main="RLE",las=3);# 绘制NUSE箱线图boxplot(Pset,ylim=c(0.95,1.22),col=colors,main="NUSE",las=3)

从以下两幅图中可以看出CLL1，CLL10 的质量明显有别与其他样品，需要去除

webp

image.png

webp

image.png

2.2.5 查看RNA降解曲线

source('http://Bioconductor.org/biocLite.R')biocLite("affy")library(affy)library(RColorBrewer)library(CLL)data("CLLbatch")# 获取降解数据data.deg <- AffyRNAdeg(CLLbatch)# 载入一组颜色colors <- brewer.pal(12,"Set3")# 绘制RNA降解图plotAffyRNAdeg(data.deg, col=colors)# 在左上部位加注图注legend("topleft", rownames(pData(CLLbatch)), col=colors, lwd=1, inset=0.05, cex=0.5)

webp

image.png

去掉三个质量差的

CLLbatch<-CLLbatch[, -match(c("CLL10.CEL","CLL1.CEL","CLL13.CEL"), sampleNames(CLLbatch))]

2.2.6 从聚类分析的角度看数据质量

source('http://Bioconductor.org/biocLite.R')biocLite("gcrma")biocLite("graph")biocLite("affycoretools")library(CLL)library(gcrma)library(graph)library(affycoretools)data(CLLbatch)data("disease")# 使用gcrma算法来预处理数据CLLgcrma<-gcrma(CLLbatch)# 提取基因表达矩阵eset<-exprs(CLLgcrma)# 计算样品两两之间的Pearson相关系数pearson_cor<-cor(eset)# 得到Pearson距离的下三角矩阵dist.lower<-as.dist(1-pearson_cor)# 聚类分析hc<-hclust(dist.lower,"ave")plot(hc)# PCA分析samplenames<-sub(pattern="\\.CEL", replacement ="",colnames(eset))groups<-factor(disease[,2])plotPCA(eset,addtext=samplenames,groups=groups,groupnames=levels(groups))

从聚类分析的结果来看，稳定组（红框）和恶化组分不开

webp

image.png

从主成成份分析来看，也分不开

webp

image.png

2.3 背景校正、标准化和汇总

芯片数据通过质量控制，剔除不合格的样品，留下的样品数据往往要通过三步处理（背景校正、标准化和汇总）才能得到下一步分析所需要的基因表达矩阵

去除背景噪声的过程叫背景校正

标准化目的是使各次/组测量或各种实验条件下的测量可以相互比较

使用一定的统计方法将前面得到的荧光强度值从探针水平汇总到探针组水平

> bgcorrect.methods()[1]"bg.correct""mas""none""rma"> normalize.methods(CLLbatch) [1]"constant""contrasts""invariantset""loess""methods""qspline"[7]"quantiles""quantiles.robust""quantiles.probeset""scaling"> pmcorrect.methods()[1]"mas""methods""pmonly""subtractmm"> express.summary.stat.methods()[1]"avgdiff""liwong""mas""medianpolish""playerout"

参数说明

afbatch输入数据的类型必须是AffyBatch对象

bgcorrect.method背景校正方法

bgcorrect.param指定背景校正方法的参数

normalize.method标准化方法

normalize.param指定标准化方法的参数

pmcorrect.methodPM调整方法

pmcorrect.param指定PM方法参数

summary.method汇总方法

summary.param指定汇总方法的参数

芯片内标准化（Loess）前后MA图

# 使用mas方法做背景校正>CLLmas5<- bg.correct(CLLbatch, method ="mas")# 使用constant方法标准化> data_mas5 <- normalize(CLLmas5, method="constant")#　查看每个样品的缩放系数> head(pm(data_mas5)/pm(CLLmas5),5)CLL11.CELCLL12.CELCLL14.CELCLL15.CELCLL16.CELCLL17.CELCLL18.CELCLL19.CELCLL20.CELCLL21.CELCLL22.CELCLL23.CEL17521811.1558491.0238731.4931931.5493692.0002991.4515761.7765010.98251080.70708280.99587331.43236535668911.1558491.0238731.4931931.5493692.0002991.4515761.7765010.98251080.70708280.99587331.43236522769611.1558491.0238731.4931931.5493692.0002991.4515761.7765010.98251080.70708280.99587331.43236523791911.1558491.0238731.4931931.5493692.0002991.4515761.7765010.98251080.70708280.99587331.43236527517311.1558491.0238731.4931931.5493692.0002991.4515761.7765010.98251080.70708280.99587331.432365CLL24.CELCLL2.CELCLL3.CELCLL4.CELCLL5.CELCLL6.CELCLL7.CELCLL8.CELCLL9.CEL1752181.7060261.1563780.84254190.97750820.98165731.1829631.1149761.136390.89392483566891.7060261.1563780.84254190.97750820.98165731.1829631.1149761.136390.89392482276961.7060261.1563780.84254190.97750820.98165731.1829631.1149761.136390.89392482379191.7060261.1563780.84254190.97750820.98165731.1829631.1149761.136390.89392482751731.7060261.1563780.84254190.97750820.98165731.1829631.1149761.136390.8939248# 查看第二个样品的缩放系数是怎么计算来的> mean(intensity(CLLmas5)[,1])/mean(intensity(CLLmas5)[,2])[1]1.155849

2.4 预处理的一体化算法

预处理方法

方法背景校正方法标准化方法汇总方法

MASSmasconstantmas

dChipmasinvariantsetliwong

RMArmaquantilemedianpolish

MAS5 每个芯片可以单独进行标准化；RMA 采用多芯片模型，需要对所有芯片一起进行标准化。

MAS5 利用MM探针的信息去除背景噪声，基本思路是MP-MM；RMA 不使用MM信息，而是基于PM信号分布采用的随机模型来估计表达值。

RMA处理后的数据是经过2为底的对数转换的，而MAS5不是，这一点非常重要，因为大多数芯片分析软件和函数的输入数据必须经过对数变换。

比较不同算法的处理效果

library(affy)library(gcrma)library(affyPLM)library(RColorBrewer)library(CLL)data("CLLbatch")colors <- brewer.pal(12,"Set3")# use MAS5CLLmas5<- mas5(CLLbatch)# use rma CLLrma<- rma(CLLbatch)# use gcrmaCLLgcrma<- gcrma(CLLbatch)## hist plothist(CLLbatch, main="orignal", col=colors)legend("topright", rownames(pData(CLLbatch)), col=colors, lwd=1, inset=0.05, cex=0.5, ncol=3)hist(CLLmas5, main="MAS 5.0", xlim=c(-150,2^10), col=colors)hist(CLLrma, main="RMA", col=colors)hist(CLLgcrma, main="gcRMA", col=colors)## boxplotboxplot(CLLbatch, col=colors, las=3, main="original")boxplot(CLLmas5, col=colors, las=3, ylim=c(0,1000), main="MAS 5.0")boxplot(CLLrma, col=colors, las=3, main="RMA")boxplot(CLLgcrma, col=colors, las=3, main="gcRMA")

RMA算法将多条曲线重合到了一起，有利于进一步的差异分析，但却出现了双峰现象，不符合高斯正态分布。很显然gcRMA算法在这里表现的更好。

webp

image.png

webp

image.png

webp

image.png

webp

image.png

三个算法处理后的各样品的中值都十分接近。MAS5算法总体而言还不错，有一定拖尾现象；而gcRMA的拖尾现象比RMA要明显得多。这说明针对低表达量的基因，RMA算法比gcRMA算法表现更好一些。

webp

image.png

webp

image.png

webp

image.png

webp

image.png

通过MA图来查看标准化处理的效

library(gcrma)library(RColorBrewer)library(CLL)library(affy)data("CLLbatch")colors<-brewer.pal(12."Set3")CLLgcrma<-gcrma(CLLbatch)MAplot(CLLbatch[,1:4],pairs=TRUE,plot.method="smoothScatter",cex=0.8,main="original MA plot")MAplot(CLLgcrma[,1:4],pairs=TRUE,plot.method="smoothScatter",cex=0.8,main="gcRMA MA plot")

原始数据中，中值（红线）偏离0，经过gcRMA处理后，中值基本保持在零线上。

webp

image.png

webp

image.png

3 基因芯片数据分析

3.1 选取差异表达基因

######## DEG analysis#　如果没有安装limma包，请取消下面两行注释# library(BiocInstaller)# biocLite("limma")# 导入包library(limma)library(gcrma)library(CLL)# 导入CLL数据data("CLLbatch")data(disease)# 移除 CLL1, CLL10, CLL13CLLbatch<-CLLbatch[, -match(c("CLL10.CEL","CLL1.CEL","CLL13.CEL"), sampleNames(CLLbatch))]# 用gcRMA算法进行预处理CLLgcrma<- gcrma(CLLbatch)# remove .CEL in sample namessampleNames(CLLgcrma) <- gsub(".CEL$","", sampleNames(CLLgcrma))# remove record in data disease about CLL1, 10 and 13.CELdisease <- disease[match(sampleNames(CLLgcrma), disease[,"SampleID"]), ]# 获得余下21个样品的基因表达矩阵eset <- exprs(CLLgcrma)# 提取实验条件信息disease <- factor(disease[,"Disease"])# 构建实验设计矩阵design <- model.matrix(~-1+disease)# 构建对比模型，比较两个实验条件下表达数据#　这里的.是简写而不是运算符号contrast.matrix <- makeContrasts(contrasts ="diseaseprogres. - diseasestable", levels=design)# 线性模型拟合fit <- lmFit(eset, design)# 根据对比模型进行差值计算 fit1 <- contrasts.fit(fit, contrast.matrix)# 贝叶斯检验fit2 <- eBayes(fit1)# 生成所有基因的检验结果报告dif <- topTable(fit2, coef="diseaseprogres. - diseasestable", n=nrow(fit2), lfc=log2(1.5))# 用P.Value进行筛选，得到全部差异表达基因dif <- dif[dif[,"P.Value"]<0.01,]# 显示一部分报告结果head(dif)

>head(dif)logFCAveExprtP.Valueadj.P.ValB39400_at-0.99978505.634004-5.7273291.482860e-050.10345442.44583541303_at-1.34303064.540225-5.5968131.974284e-050.10345442.154635033791_at1.96479626.8379035.4004993.047498e-050.10345441.713558336131_at0.95742149.9453345.3677413.277762e-050.10345441.639622337636_at2.05340935.4786835.1205195.699454e-050.14391121.078831336122_at0.80086047.1462934.7767211.241402e-040.26121180.2922106

下面逐次介绍这个分析过程的六个关键步骤：构建基因表达矩阵、构建实验设计矩阵、构建对比模型（对比矩阵）、线性模型拟合、贝叶斯检验和生成结果报表。

design

webp

image.png

3.2 注释

# 加载注释工具包library(annotate)# 获得基因芯片注释包名称affydb <- annPkgName(CLLbatch@annotation,type="db")affydb# 如果没有安装，先安装biocLite(affydb)#　加载该包library(affydb, character.only = TRUE)# 根据每个探针组的ID获取对应基因Gene Symbol，并作为新的一列dif$symbols<- getSYMBOL(rownames(dif), affydb)# 根据探针ID获取对应基因Entrez IDdif$EntrezID<- getEG(rownames(dif), affydb)# 显示前几行head(dif)

webp

image.png

3.3 统计分析及可视化

# 加载包library(GOstats)# 提取HG_U95Av2芯片中所有探针组对应的EntrezID，注意保证uniqentrezUniverse <- unique(unlist(mget(rownames(eset), hgu95av2ENTREZID)))# 提取差异表达基因及其对应的EntrezIDentrezSelected <- unique(dif[!is.na(dif$EntrezID),"EntrezID"])# 设置GO富集分析的所有参数params <-new("GOHyperGParams", geneIds=entrezSelected, universeGeneIds=entrezUniverse, annotation=affydb, ontology="BP", pvalueCutoff=0.001, conditional=FALSE, testDirection="over")# 对所有的GOterm根据params参数做超几何检验hgOver <- hyperGTest(params)# 生成所有GOterm的检验结果报表bp <- summary(hgOver)# 同时生成所有GOterm的检验结果文件，每个GOterm都有指向官方网站的链接，以获得详细信息htmlReport(hgOver, file="ALL_go.html")# 显示结果前几行head(bp)

webp

image.png

# 安装并加载所需包biocLite("GeneAnswers")library(GeneAnswers)# 选取dif中的三列信息构成新的矩阵，新一列必须是EntrezIDhumanGeneInput <- dif[, c("EntrezID","logFC","P.Value")]# 获得humanGeneInput中基因的表达值humanExpr <- eset[match(rownames(dif), rownames(eset)), ]humanExpr <- cbind(humanGeneInput[,"EntrezID"], humanExpr)# 去除NA数据humanGeneInput <- humanGeneInput[!is.na(humanGeneInput[,1]),]humanExpr <- humanExpr[!is.na(humanExpr[,1]),]# KEGG通路的超几何检验y <- geneAnswersBuilder(humanGeneInput,"org.Hs.eg.db", categoryType ="KEGG", testType ="hyperG", pvalueT=0.1, geneExpressionProfile=humanExpr, verbose = FALSE)getEnrichmentInfo(y)[1:6]

webp

image.png

biocLite("pheatmap")library(pheatmap)# 从基因表达矩阵中，选取差异表达基因对应的数据selected <- eset[rownames(dif), ]# 将selected矩阵每行的名称由探针组ID转换为对应的基因symbolrownames(selected) <- dif$symbols# 考虑显示问题，这里只画前20个基因的热图pheatmap(selected[1:20,], color = colorRampPalette(c("green","black","red"))(100), fontsize_row = 4, scale ="row", border_color = NA)

webp

image.png

biocLite("Rgraphviz")library(Rgraphviz)# 显著富集的GO term的DAG关系图ghandle <- goDag(hgOver)# 这个图太大，这里只能选一部分数据构建局部图subGHandle <- subGraph(snodes=as.character(summary(hgOver)[,1]), graph = ghandle)plot(subGHandle)

webp

image.png

source("https://bioconductor.org/biocLite.R")biocLite("KEGG.db")yy<- geneAnswersReadable(y)geneAnswersConceptNet(yy, colorValueColumn ="logFC", centroidSize ="pvalue", output="interactive")

这里我报错了，这个网站解决问题http://planspace.org/2013/01/17/fix-r-tcltk-dependency-problem-on-mac/

webp

image.png

yyy<- geneAnswersSort(yy,sortBy ="pvalue")geneAnswersHeatmap(yyy)

webp

作者：苏慕晨枫
链接：https://www.jianshu.com/p/393657ffd850