Extracting information from my XML files and assigning a vector to it

I want to parse some XML files on my computer with Python and extract some information from each file.

Here is one of my XML files:

https://img1.sycdn.imooc.com/656ecaee0001f6c810620393.jpg

(If you want the text, it is here: https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b002.xml)


As a first step, I have done this:


import os
import xml.etree.ElementTree as ET

myList = []  # one parsed ElementTree per XML file

# path is the directory that holds the XML files
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

myList now holds the parsed XML files, each one an object like <xml.etree.ElementTree.ElementTree at 0x1f0fb1f8430>


Now, for the "edge" elements that do not have type="seg":


<edge id="c1" src="a1" trg="a3" type="sup"/>
<edge id="c2" src="a2" trg="a3" type="sup"/>
<edge id="c4" src="a4" trg="a3" type="reb"/>
<edge id="c5" src="a5" trg="c4" type="und"/>

I want to extract the "src" attribute:


src="a1"
src="a2"
src="a4"
src="a5"

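Then I want to flag the ids that do not appear in src, because that sentence is called the premise. For example, here I want to say that "a3" is the so-called "premise" (because it never appears as a src).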


For this example, (0,0,1,0,0) should be the result of my process: since a3 is never used as a src, I set the third element to 1 and the rest to zero.


In general, I want to extract information to annotate my text, which has already been annotated in XML.


蛊毒传说
3 answers

德玛西亚99

Not everything in your question is clear... Here is the data extraction part:

import xml.etree.ElementTree as ET

xml = '''<?xml version='1.0' encoding='UTF-8'?>
<arggraph id="micro_b002" topic_id="higher_dog_poo_fines" stance="pro">
  <edu id="e1"><![CDATA[One can hardly move in Friedrichshain or Neukölln these days without permanently scanning the ground for dog dirt.]]></edu>
  <edu id="e2"><![CDATA[And when bad luck does strike and you step into one of the many 'land mines' you have to painstakingly scrape the remains off your soles.]]></edu>
  <edu id="e3"><![CDATA[Higher fines are therefore the right measure against negligent, lazy or simply thoughtless dog owners.]]></edu>
  <edu id="e4"><![CDATA[Of course, first they'd actually need to be caught in the act by public order officers,]]></edu>
  <edu id="e5"><![CDATA[but once they have to dig into their pockets, their laziness will sure vanish!]]></edu>
  <adu id="a1" type="pro"/>
  <adu id="a2" type="pro"/>
  <adu id="a3" type="pro"/>
  <adu id="a4" type="opp"/>
  <adu id="a5" type="pro"/>
  <edge id="c6" src="e1" trg="a1" type="seg"/>
  <edge id="c7" src="e2" trg="a2" type="seg"/>
  <edge id="c8" src="e3" trg="a3" type="seg"/>
  <edge id="c9" src="e4" trg="a4" type="seg"/>
  <edge id="c10" src="e5" trg="a5" type="seg"/>
  <edge id="c1" src="a1" trg="a3" type="sup"/>
  <edge id="c2" src="a2" trg="a3" type="sup"/>
  <edge id="c4" src="a4" trg="a3" type="reb"/>
  <edge id="c5" src="a5" trg="c4" type="und"/>
</arggraph>'''

root = ET.fromstring(xml)
# keep the src of every edge whose type is not "seg"
interesting_edges_src = [e.attrib['src'] for e in root.findall('.//edge') if e.attrib['type'] != 'seg']
print(interesting_edges_src)

Output:

['a1', 'a2', 'a4', 'a5']
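Going one step further than the extraction, here is a minimal sketch of how the vector from the question could be built for a single file. It assumes the vector should have one position per `<adu>` element, in document order, with 1 for every ADU that never appears as the `src` of a non-seg edge. The function name `premise_vector` and the trimmed-down `sample` document are made up for illustration; they are not part of the corpus or of any library.

```python
import xml.etree.ElementTree as ET


def premise_vector(root):
    """Return one 0/1 entry per <adu>: 1 if it is never the src of a non-seg edge."""
    # ADU ids in document order, e.g. ['a1', 'a2', 'a3', 'a4', 'a5']
    adu_ids = [adu.attrib["id"] for adu in root.findall(".//adu")]
    # ids that appear as src on edges other than the segmentation edges
    used_as_src = {
        e.attrib["src"]
        for e in root.findall(".//edge")
        if e.attrib["type"] != "seg"
    }
    return tuple(0 if adu_id in used_as_src else 1 for adu_id in adu_ids)


# Hypothetical, shortened version of micro_b002.xml (ADUs and edges only).
sample = """<arggraph id="demo">
  <adu id="a1" type="pro"/>
  <adu id="a2" type="pro"/>
  <adu id="a3" type="pro"/>
  <adu id="a4" type="opp"/>
  <adu id="a5" type="pro"/>
  <edge id="c8" src="e3" trg="a3" type="seg"/>
  <edge id="c1" src="a1" trg="a3" type="sup"/>
  <edge id="c2" src="a2" trg="a3" type="sup"/>
  <edge id="c4" src="a4" trg="a3" type="reb"/>
  <edge id="c5" src="a5" trg="c4" type="und"/>
</arggraph>"""

print(premise_vector(ET.fromstring(sample)))  # (0, 0, 1, 0, 0)
```

Reading the length from the `<adu>` elements means the function does not have to guess how many ADUs a file has, and it still works if more than one ADU is never a `src`.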

手掌心

This can be considered something close to a final answer:

import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList = []

# parse every XML file found under path
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

# collect the src of every non-seg edge, one list per file
for k in myList:
    Edge = [e.attrib['src'] for e in k.findall('.//edge') if e.attrib['type'] != 'seg']
    myEdgesList.append(Edge)

This gives ['a1', 'a2', 'a4', 'a5'] for the example above, and a list of lists for all the other examples:

[['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a4', 'a5'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a1', 'a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3', 'a4', 'a5'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a1', 'a2', 'a3'], ['a2', 'a3', 'a4', 'a5'],...

All that remains is to convert these lists into vectors:

(0,0,0,0,1) <----- ['a1', 'a2', 'a3', 'a4']  # as a5 is missing
(0,0,1,0,0) <----- ['a1', 'a2', 'a4', 'a5']  # as a3 is missing
...
(1,0,0) <----- ['a2', 'a3']  # as a1 is missing

and so on. If you have any ideas, please let me know; I am working on it too.
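Since the list of lists above no longer says which file each entry came from, one possible variation keeps the file path next to each src list. This is only a sketch: the function name `src_lists_by_file` and its `corpus_dir` argument are hypothetical, and it assumes every `.xml` file under the directory follows the same `<edge src=... type=...>` layout as micro_b002.xml.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def src_lists_by_file(corpus_dir):
    """Map each XML file (path relative to corpus_dir) to its non-seg edge sources."""
    corpus_dir = Path(corpus_dir)
    result = {}
    # rglob walks sub-directories too, like os.walk in the loop above
    for xml_path in sorted(corpus_dir.rglob("*.xml")):
        root = ET.parse(xml_path).getroot()
        key = xml_path.relative_to(corpus_dir).as_posix()
        result[key] = [
            e.attrib["src"]
            for e in root.findall(".//edge")
            if e.attrib["type"] != "seg"
        ]
    return result
```

Calling `src_lists_by_file(path)` with the same `path` as above would return a dictionary such as `{'micro_b002.xml': ['a1', 'a2', 'a4', 'a5'], ...}`, so each vector can later be traced back to its text.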

牧羊人nacy

For the next question:

# myEdgesList is the list of src lists built in the previous answer
myEdgtlistmap = []
for lst in myEdgesList:
    tp = []
    # turn each id "a1".."a6" into its number 1..6
    for el in lst:
        if el == "a1":
            tp.append(1)
        if el == "a2":
            tp.append(2)
        if el == "a3":
            tp.append(3)
        if el == "a4":
            tp.append(4)
        if el == "a5":
            tp.append(5)
        if el == "a6":
            tp.append(6)
    myEdgtlistmap.append(tp)

label = []
for le in myEdgtlistmap:
    # assumes exactly one ADU is missing from each src list
    b = [1] * (len(le) + 1)
    for v in le:
        b[v - 1] = 0
    label.append(b)

# flatten all per-file vectors into one list
y = [l for lab in label for l in lab]
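The same conversion can be sketched more compactly. This version keeps the assumption of the loop above (exactly one ADU is missing from each src list, so the vector is one longer than the list) unless a length is passed in, and it assumes every id has the form "a" followed by a number. The name `labels_from_src` and the `n_adus` parameter are made up for this example.

```python
def labels_from_src(src_ids, n_adus=None):
    """Return one entry per ADU: 1 if its id is absent from src_ids, else 0."""
    if n_adus is None:
        # same assumption as above: exactly one ADU never appears as src
        n_adus = len(src_ids) + 1
    labels = [1] * n_adus
    for adu_id in src_ids:
        index = int(adu_id[1:]) - 1  # "a3" -> position 2
        if not 0 <= index < n_adus:
            raise ValueError(f"{adu_id} does not fit in a vector of length {n_adus}")
        labels[index] = 0
    return labels


examples = [
    ['a1', 'a2', 'a3', 'a4'],
    ['a1', 'a2', 'a4', 'a5'],
    ['a2', 'a3'],
]
for src_ids in examples:
    print(src_ids, "->", labels_from_src(src_ids))
# ['a1', 'a2', 'a3', 'a4'] -> [0, 0, 0, 0, 1]
# ['a1', 'a2', 'a4', 'a5'] -> [0, 0, 1, 0, 0]
# ['a2', 'a3'] -> [1, 0, 0]
```

Raising an error when an id does not fit makes it visible when a file breaks the one-missing assumption, instead of failing with a bare IndexError.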