生信编程实战第2题（python、R、shell）-原创手记-慕课网

webp

image.png

统计人类参考基因组的每条染色体长度，每条染色体N的含量，GC含量
因为我下载的是hg38,所以这里用hg38的版本

life is short,I use python
所以我还是主要用python来做
再就是用R，这是我第二想掌握的
最后用shell来做

这个模式也是接下来做其他题目的模式

1.python
首先是我自己写的脚本，这个脚本不太好看，但还是好用的

import sysimport collections
args=sys.argv
filename=args[1]
count_all=collections.OrderedDict() #构建有顺序的字典Length=0N_count=0GC_count=0print("chr","N_count","GC_count","Length",sep="\t")for line in open(filename):   if line.startswith(">"):      if Length:               #N和GC可能不一定统计的到，但是length肯定是有的
         result=str(N_count)+"\t"+str(GC_count)+"\t"+str(Length) #将所有的结果建立成一个以\t分割的字符串，以该字符串和ID构建字典
         count_all[ID]=result
         N_count=0
         GC_count=0
         Length=0
      ID=line[1:].strip()   else:
      line=line.upper()
      N_count+=line.strip().count("N")
      GC_count+=line.strip().count("G")+line.strip().count("C")
      Length+=len(line.strip())
result=str(N_count)+"\t"+str(GC_count)+"\t"+str(Length)
count_all[ID]=resultfor ID,result in count_all.items():
   print(ID,result,sep="\t")

之所以不好看，主要还是我之前不会用一个key对应value，并且这个value中又形成多个键值对的用法，比如：

chr_1
 {'c': 7, 'n': 4, 't': 10, 'a': 13, 'g': 10}
chr_2
 {'c': 5, 'n': 4, 't': 6, 'a': 11, 'g': 8}
chr_3
 {'c': 3, 'n': 4, 't': 6, 'a': 10, 'g': 10}
chr_4
 {'c': 2, 'n': 3, 't': 4, 'a': 9, 'g': 7}

所以接下来给出这种用法的代码
代码很好看，语法很简洁，速度也很快，这就是我比较喜欢python的原因

import sysimport collections
args=sys.argv
filename=args[1]
count_ATCG=collections.OrderedDict() #构建有顺序的字典Bases=["a","t","g","c","n"]for line in open(filename):   if line.startswith(">"):
      chr_id = line[1:]
      count_ATCG[chr_id] = {}      for base in Bases:
        count_ATCG[chr_id][base] = 0 ##字典中，chr_id属于key，base和0共同构成#value，而value中，base又成了key，0成了value
   else:
      line = line.lower()      for base in Bases:
        count_ATCG[chr_id][base] += line.count(base)for chr_id, ATCG_count in count_ATCG.items():
    count_sum = sum(ATCG_count.values())
    count_GC = ATCG_count['g'] + ATCG_count['c']
    print(chr_id)    for base in Bases:
        print("%s: %s" % (base, ATCG_count[base])) #注意占位符的使用

    print("Sum: %s" % count_sum)
    print("N %%: %.2f%%" % (ATCG_count['n']*100.00/count_sum))
    print("GC %%: %.2f%%" % (count_GC*100.00/count_sum))

2.R
先给出R的脚本

seq <- readLines("~/WGS_new/input/genome/new/hg38.fa")  
# 逐行读取文件is_ID <- regexpr("^>",seq,perl=T)  
# 对多有文件进行操作，匹配到>开头的定位1，不是则为-1, 1是指匹配到"^>"的位置，第一个位置seq_ID <- seq[which(is_ID==1)]# 将seq中is_ID为1的取出来，也就是将ID行取出来seq_content <- seq[which(is_ID==-1)]

start <- which(is_ID==1) 
#取出is_ID=1的位置end <- start[2:length(start)]-1#取出除了最后一个元素外，其他的每个非ID元素的最后一个位置end <- c(end,length(seq))# 把最后一个位置补上distance <- end - start# 这算的是每一个ID对应的多少行序列seq_num_position <- 1:length(start) 
# 算出有多少个ID，按1234...排序index <- rep(seq_num_position,distance)# 构建tapply中的因子，用于分组用seq_content <- tapply(seq_content,index,paste,collapse="")# 将同一组的序列合并在一起seq_content <- as.character(seq_content)# seq_content本来是数组,as.character()之后转换成了向量seq_length <- nchar(seq_content)# 统计字符串的长度，也就是对应序列的长度# 计算GC含量和N含量G_count=""C_count=""N_count=""for ( i in 1:length(seq_content))
{   G_count[i]<-length(gregexpr("[Gg]",seq_content,perl = T)[[i]])+length(gregexpr("[Cc]",seq_content,perl = T)[[i]])
    N_count[i]<-length(gregexpr("[Nn]",seq_content,perl = T)[[i]])
}#gregexpr与regexpr不一样，会将所有的匹配的位置输出result <- data.frame(seq_ID,G_count,N_count,seq_length)

这里先总结几个知识点
1.R中逐行读取数据用的是readLines（）
2.which返回的是判断结果的坐标
3.tapply()函数，tapply(x,f,g)
需要向量 x (x不可以是数据框)，因子或因子列表 f 以及函数 g
tapply（）执行的操作是：暂时将x分组，每组对应一个因子水平，得到x的子向量，然后这些子向量应用函数 g
举个简单的例子

a <- c(24,25,36,37)
b <- c('q', 'w', 'q','w')
tapply(a, b, mean)
 q  w 
30 31

4.nchar()函数用来计算字符串的长度
5.regexpr()函数
匹配的函数，返回的是匹配的第一个位置
举个例子

string <- "abbcbcbcbdbdbdbcbcb" #定义一个字符串regexpr("c",string,perl=T)    #在string中匹配“c”，返回的是匹配的第一个“c”的位置# [1] 4# attr(,"match.length")# [1] 1# attr(,"index.type")# [1] "chars"# attr(,"useBytes")# [1] TRUE

如果是匹配一个不存在字符或字符串，返回值为-1

6.gregexpr()函数
相比regexpr()函数，gregexpr()函数表面上就是多了一个“g”,我没有查，意思应该就是global，也就是把所有的匹配的位置都给返回出来，这样的话，用于测序的序列，就可以求个length就能算出匹配了多少个，也就是匹配字符的数量
举个例子

string <- "abbcbcbcbdbdbdbcbcb"gregexpr("c",string,perl=T)# [[1]]# [1]  4  6  8 16 18# attr(,"match.length")# [1] 1 1 1 1 1# attr(,"index.type")# [1] "chars"# attr(,"useBytes")# [1] TRUElength(gregexpr("c",string,perl=T)[[1]])#[1] 5

我不知道有没有什么其他的方法，但是这个R脚本处理大的数据集会很吃力。
原因很简单这里面很多步骤都重复遍历了整个文件再做出处理，所以会很慢。而python写的脚本，是逐行读取并处理，所以很快，python大概就2~3min就可以处理完，而这个R脚本跑了30多分钟还没有结束。

webp

image.png

3.shell(awk)
尝试用shell主要还是强化一下相关的知识，而且awk我只能算出length
几个概念:
R:record 行
F:field 列
NR:number of record 行的数目
NF:number of field 列的数目
RS:record split 行分割
FS:field split 列分割

time awk 'BEGIN{RS=">";FS="\n"}{if(NR>1){seq="";for(i=2;i<=NF;i++) seq=seq$i; print $1"\t"length(seq)}}' ~/WGS_new/input/genome/new/hg38.fa >count_awk.txt#seq=""是因为每次循环的时候，seq值要初始化为""

作者：天秤座的机器狗
链接：https://www.jianshu.com/p/240d33bd81b3