
How do I generate a recursive tree-like dictionary from a flat file (a Gene Ontology OBO file)?

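I'm trying to write code that parses a Gene Ontology (GO) OBO file and pushes the GO term IDs (e.g. GO:0003824) into a tree-like nested dictionary. The hierarchy in the OBO file is expressed with the "is_a" tag, which marks each parent of a GO term. A GO term may have multiple parents, and the topmost GO terms in the hierarchy have no parent.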


A small example of a GO OBO file looks like this:


    [Term]
    id: GO:0003674
    name: molecular_function
    namespace: molecular_function
    alt_id: GO:0005554
    def: "A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process." [GOC:pdt]
    comment: Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. When this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code "no data" (ND), is used to indicate this. Despite its name, this is not a type of 'function' in the sense typically defined by upper ontologies such as Basic Formal Ontology (BFO). It is instead a BFO:process carried out by a single gene product or complex.
    subset: goslim_aspergillus
    subset: goslim_candida
    subset: goslim_chembl
    subset: goslim_generic
    subset: goslim_metagenomics
    subset: goslim_pir
    subset: goslim_plant
    subset: goslim_yeast
    synonym: "molecular function" EXACT []
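For reference, here is a minimal sketch of how the `is_a` lines of such a file could be collected into a parent map before any tree is built. The function name `parse_obo_parents` and the two-term sample text are illustrative assumptions, not part of any library. The sketch assumes that each stanza starts with a bracketed header such as `[Term]` and that each `is_a` value begins with the parent ID, optionally followed by ` ! name`.

```python
def parse_obo_parents(text):
    """Return {term_id: [parent_id, ...]} for every [Term] stanza in text."""
    parents = {}
    current_id = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('['):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = (line == '[Term]')
            current_id = None
        elif in_term and line.startswith('id: '):
            current_id = line[len('id: '):]
            parents[current_id] = []
        elif in_term and current_id and line.startswith('is_a: '):
            # Keep only the ID, dropping the trailing " ! name" part.
            parent_id = line[len('is_a: '):].split(' ! ', 1)[0].strip()
            parents[current_id].append(parent_id)
    return parents


# Hypothetical two-term sample, shortened from the example above.
sample = """\
[Term]
id: GO:0003674
name: molecular_function

[Term]
id: GO:0005488
name: binding
is_a: GO:0003674 ! molecular_function
"""

print(parse_obo_parents(sample))
```

Because every `is_a` line is appended to a list, a term with several parents keeps all of them, and a top-level term ends up with an empty list.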



慕无忌1623718
2 Answers

繁花不似锦

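You wrote

    if (parent_go_id in parent_list):
        go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)

The correct version is

    if (parent_go_id in parent_list):
        go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)[go_id]

After this change, it produces:

    {
        'GO:0003674': {
            'GO:0003824': {},
            'GO:0005198': {},
            'GO:0005488': {
                'GO:0005515': {},
                'GO:0005549': {
                    'GO:0005550': {}
                }
            }
        }
    }

But I would suggest a completely different approach. Create a class that parses the terms and builds the dependency tree as it does so. For convenience I derive it from dict, so you can write term.id instead of term['id']:

    class Term(dict):
        # Allow attribute-style access to the dict entries (term.id, term.name, ...)
        __getattr__ = dict.__getitem__
        __setattr__ = dict.__setitem__
        __delattr__ = dict.__delitem__

        registry = {}  # maps every id and alt_id to its Term
        # is_obsolete is included so that is_valid() can actually see it
        single_valued = 'id name namespace alt_id def comment synonym is_a is_obsolete'.split()
        multi_valued = 'subset xref'.split()

        def __init__(self, text):
            self.children = []
            self.parent = None
            for line in text.splitlines():
                if ': ' not in line:
                    continue
                key, val = line.split(': ', 1)
                if key in Term.single_valued:
                    self[key] = val
                elif key in Term.multi_valued:
                    if key not in self:
                        self[key] = [val]
                    else:
                        self[key].append(val)
                else:
                    print('unclear property: %s' % line)
            if 'id' in self:
                Term.registry[self.id] = self
            if 'alt_id' in self:
                Term.registry[self.alt_id] = self
            if 'is_a' in self:
                # The value looks like "GO:0003674 ! molecular_function"; keep the ID
                key = self.is_a.split(' ! ', 1)[0]
                # Link to the parent only if it has already been parsed
                if key in Term.registry:
                    Term.registry[key].children.append(self)
                    self.parent = Term.registry[key]

        def is_top(self):
            return self.parent is None

        def is_valid(self):
            # self.get avoids a KeyError for stanzas that have no id
            return self.get('is_obsolete') != 'true' and self.get('id') is not None

Note that a term is linked to its parent while it is being parsed, so each parent has to appear in the file before its children, and only one is_a value per term is kept.

Now you can read the file in one go:

    with open('tiny_go.obo', 'rt') as f:
        contents = f.read()

    terms = [Term(text) for text in contents.split('\n\n')]

and recursing over the tree becomes easy. For example, a simple "print" function that outputs only the non-obsolete nodes:

    def print_tree(terms, indent=''):
        valid_terms = [term for term in terms if term.is_valid()]
        for term in valid_terms:
            print(indent + 'Term %s - %s' % (term.id, term.name))
            print_tree(term.children, indent + '  ')

    top_terms = [term for term in terms if term.is_top()]
    print_tree(top_terms)

This prints:

    Term GO:0003674 - molecular_function
      Term GO:0003824 - catalytic activity
      Term GO:0005198 - structural molecule activity
      Term GO:0005488 - binding
        Term GO:0005515 - protein binding
        Term GO:0005549 - odorant binding
          Term GO:0005550 - pheromone binding

You can also do things like Term.registry['GO:0005549'].parent.name, which gives "binding".

I'll leave generating nested dicts of GO IDs (as in your own example) as an exercise, but you may not even need it, since Term.registry is already very similar to that.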
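One possible take on the nested-dict exercise is sketched below. To stay self-contained it uses a hypothetical stand-in class `Node` that has only an `id` and a `children` list; the assumption is that any object exposing those two attributes, such as the parsed terms above, could be passed in the same way. The helper name `nested_ids` is likewise made up for illustration.

```python
class Node:
    """Hypothetical stand-in for a parsed term: only an id and its children."""

    def __init__(self, id):
        self.id = id
        self.children = []


def nested_ids(node):
    """Return {child_id: {grandchild_id: ...}} for everything below node."""
    return {child.id: nested_ids(child) for child in node.children}


# Tiny hand-built hierarchy mirroring part of the printed tree.
root = Node('GO:0003674')
binding = Node('GO:0005488')
protein_binding = Node('GO:0005515')
root.children.append(binding)
binding.children.append(protein_binding)

tree = {root.id: nested_ids(root)}
print(tree)
```

The recursion bottoms out naturally: a node without children yields an empty dict, which matches the `{}` leaves in the expected output.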

侃侃无极

You can use recursion for a shorter solution:

    import itertools, re, json

    # Read the non-empty lines and group them into runs separated by '[Term]'
    with open('filename.txt') as f:
        content = list(filter(None, [i.strip('\n') for i in f]))
    entries = [[a, list(b)] for a, b in itertools.groupby(content, key=lambda x:x== '[Term]')]
    # Turn each stanza into a dict and reduce is_a to the bare parent ID
    terms = [(lambda x:x if 'is_a' not in x else {**x, 'is_a':re.findall(r'^GO:\d+', x['is_a'])[0]})(dict(i.split(': ', 1) for i in b)) for a, b in entries if not a]
    # Put the term without an is_a (the root) first
    terms = sorted(terms, key=lambda x:'is_a' in x)

    def tree(d, _start):
      t = [i for i in d if i.get('is_a') == _start]
      return {} if not t else {i['id']:tree(d, i['id']) for i in t}

    print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))

Output:

    {
        "GO:0003674": {
            "GO:0003824": {},
            "GO:0005198": {},
            "GO:0005488": {
                "GO:0005515": {},
                "GO:0005549": {
                    "GO:0005550": {}
                }
            }
        }
    }

This also works if a parent is not defined before its children. For example, when the parent is moved three positions below its original place, the same result is still produced (see the file):

    print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))

Output:

    {
        "GO:0003674": {
            "GO:0003824": {},
            "GO:0005198": {},
            "GO:0005488": {
                "GO:0005515": {},
                "GO:0005549": {
                    "GO:0005550": {}
                }
            }
        }
    }
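The question mentions that a GO term may have several parents, while building a dict from the stanza lines keeps only the last `is_a` value. The following is a sketch of one way to handle that case, not the answerer's method: it assumes the parents have already been collected as a mapping from term ID to a list of parent IDs, indexes the children once, and then recurses. The name `build_forest` and the ID `EX:0000001` are made up for illustration, and the sketch assumes the `is_a` relations contain no cycles.

```python
from collections import defaultdict


def build_forest(parents):
    """Turn {term_id: [parent_id, ...]} into nested dicts, one per root."""
    # Index children by parent once, instead of rescanning all terms per node.
    children = defaultdict(list)
    for term_id, parent_ids in parents.items():
        for parent_id in parent_ids:
            children[parent_id].append(term_id)

    def subtree(term_id):
        return {child: subtree(child) for child in children.get(term_id, [])}

    # Roots are the terms that list no parent at all.
    roots = [term_id for term_id, parent_ids in parents.items() if not parent_ids]
    return {root: subtree(root) for root in roots}


parents = {
    'GO:0005515': ['GO:0005488'],               # child listed before its parent
    'GO:0003674': [],
    'GO:0005488': ['GO:0003674'],
    'EX:0000001': ['GO:0005488', 'GO:0003674'], # made-up ID with two parents
}

forest = build_forest(parents)
print(forest)
```

A term with two parents appears once under each of them, so the result is a tree-shaped view of what is really a directed acyclic graph. Whether that duplication is acceptable depends on how the nested dictionary will be used.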