Snakemake 从输入文件名派生多个变量

首页课程实战体系课手记专栏慕课教程

Snakemake 从输入文件名派生多个变量

我在从输入文件名派生变量时遇到问题 - 特别是如果您想根据分隔符进行拆分。我尝试了不同的方法（我无法开始工作），到目前为止唯一有效的方法最终失败了，因为它寻找变量的所有可能变化（以及因此不存在的输入文件）。

我的问题 - 输入文件按以下模式命名： 18AR1376_S57_R2_001.fastq.gz

我一开始对变量的初步定义：

SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")

但这最终导致我的文件全部被命名18AR1376_S57，我想删除 _S57 （指的是示例表 ID）。

我在搜索时发现的一种可行的方法是：

SAMPLES,SHEETID, = glob_wildcards("../run_links/{sample}_{SHEETID}_R1.001.fastq.gz"}

但它会查找样本和sheetid的所有可能组合，因此会查找不存在的输入文件。

然后我尝试了一种更基本的 python 方法：

SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")

ID,=expand([{sample}.split("_")[0]], sample=SAMPLES)``

但这根本不起作用

然后我尝试保留原来的通配符 glob SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") 但定义新的输出文件名称（基于我在另一个论坛中找到的说明） - 但这给了我一个我无法弄清楚的语法错误。

rule trim:

input:

R1 = "../run_links/{sample}_R1_001.fastq.gz",

R2 = "../run_links/{sample}_R2_001.fastq.gz"

params:

prefix=lambda wildcards, output:

output:

R1_Pair = ["./output/trimmed/{}_R1_trim_PE.fastq".format({sample}.split("_")[0])], #or the below version

R1_Sing = ["./output/trimmed/{}_R1_trim_SE.fastq".format(a) for a in {sample}.split("_")],

R2_Pair = ["./output/trimmed/{}_R2_trim_PE.fastq".format(a) for a in {sample}.split("_")],

R2_Sing = ["./output/trimmed/{}_R2_trim_SE.fastq".format(a) for a in {sample}.split("_")]

resources:

cpus=8

log:

"../logs/trim_{sample}.log"

conda:

"envs/trim.yaml"

shell:

"""

trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 ILLUMINACLIP:scripts/Truseq-Adapter.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50

"""

所以，有两个选择，

工作流程开始：将样本拆分为样本和样本表 ID
定义新的输出名称并在_{sample} 的分隔符上使用 split

有人对如何解决这个问题有建议吗？谢谢

慕斯王

浏览 278回答 1

1回答

开满天机

glob_wildcards我将使用一个简单的 python 字典来定义附加到 fastq 文件的示例名称，而不是使用：import osimport red = dict()fastqPath = "."for fastqF in [f for f in os.listdir(fastqPath) if(re.match(r"^[\w-]+_R1_001\.fastq\.gz", f))]:        s = re.search(r"(^[\w-]+)_(S\d+)_R1_(001.fastq.gz)", fastqF)        samplename = s.group(1)        fastqFfile = os.path.join(fastqPath, fastqF)        fastqRfile = os.path.join(fastqPath, s.group(1) + "_" + s.group(2) + "_R2_" + s.group(3))        if(os.path.isfile(fastqRfile)):                d[samplename] = {"read1":os.path.abspath(fastqFfile),"read2":os.path.abspath(fastqRfile)}fastq 输入文件非常易于使用：rule all:        input:                expand("output/trimmed/{sample}_R1_trim_PE.fastq",sample=d)rule trim:        input:                R1 = lambda wildcards: d[wildcards.sample]["read1"],                R2 = lambda wildcards: d[wildcards.sample]["read2"]        output:                R1_Pair="output/trimmed/{sample}_R1_trim_PE.fastq",                R1_Sing="output/trimmed/{sample}_R1_trim_SE.fastq",                R2_Pair="output/trimmed/{sample}_R2_trim_PE.fastq",                R2_Sing="output/trimmed/{sample}_R2_trim_SE.fastq"        resources:                cpus=8        log:                "../logs/trim_{sample}.log"        conda:                "envs/trim.yaml"        shell:                """                trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 I>                """注意：我删除了params您的规则中未使用的部分trim。

0 0

随时随地看视频慕课网APP

相关分类

Python