How do I parse possibly malformed XML into a DataFrame?

I have XML returned by an API that looks like this.


    import requests
    import pandas as pd
    import lxml.etree as et
    from lxml import etree

    url = 'abc.com'
    xml_data1 = requests.get(url).content
    print(xml_data1)

xml_data1:


    <?xml version="1.0" encoding="utf-8"?>
    <Leads>
      <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test">
        <Campaign CampaignId="123" CampaignTitle="abc" />
        <Status StatusId="123" StatusTitle="test" />
        <Agent AgentId="123" AgentName="test, test" AgentEmail="a@a.com">
          <AgentCustomFields custom1="test test, test" custom2="test" custom3="" custom4="" />
        </Agent>
        <Fields>
          <Field FieldId="7" Value="a@a.com" FieldTitle="test" FieldType="test" />
          <Field FieldId="8" Value="test" FieldTitle="test 1" FieldType="test" />
          <Field FieldId="9" Value="test" FieldTitle="City" FieldType="Text" />
          <Field FieldId="10" Value="test" FieldTitle="State" FieldType="State" />
          <Field FieldId="11" Value="test" FieldTitle="test" FieldType="Zip" />
          <Field FieldId="950" Value="test." FieldTitle="Business Name" FieldType="Text" />
          <Field FieldId="1261" Value="Intuit Desktop" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1262" Value="test" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1263" Value="test" FieldTitle="test" FieldType="Number" />

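Because of work constraints I can't post the entire XML string, but it follows the structure above. According to an XML validator, that XML is well-formed. However, when I make another API call, it returns a different XML string, and passing that possibly malformed string to my parsing function raises an error: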

AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getroottree'

Because the possibly malformed XML has several values inside the same tag, I suspect the function cannot parse it.

I would like to flatten the possibly malformed XML into a single flat DataFrame.
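A note on the traceback: the class named in the message, xml.etree.ElementTree.Element, belongs to the standard library, whereas getroottree() is a method that only lxml elements provide, so the element was probably not parsed by lxml. The sketch below checks both observations with the standard library alone. The two sample strings are made up for illustration and are not the real API response.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: one well-formed, one cut off before its closing tags.
well_formed = '<Leads><Lead Id="123"><Status StatusId="1" /></Lead></Leads>'
truncated = '<Leads><Lead Id="123"><Status StatusId="1" />'

# A standard-library element has no getroottree() method.
root = ET.fromstring(well_formed)
has_getroottree = hasattr(root, "getroottree")

# The standard-library parser rejects truncated input instead of repairing it.
try:
    ET.fromstring(truncated)
    truncated_parsed = True
except ET.ParseError:
    truncated_parsed = False

print(has_getroottree, truncated_parsed)
```

If the function does need lxml elements, one option is to parse with lxml.etree.fromstring(xml_data1, parser=lxml.etree.XMLParser(recover=True)), which returns lxml elements and attempts to recover from broken markup.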



明月笑刀无情
2 Answers

九州编程

Since you updated your question, I decided to post another answer that uses the new XML.

    from bs4 import BeautifulSoup
    import pandas as pd

    xml = """<?xml version="1.0" encoding="utf-8"?>
    <Leads>
      <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test">
        <Campaign CampaignId="123" CampaignTitle="abc" />
        <Status StatusId="123" StatusTitle="test" />
        <Agent AgentId="123" AgentName="test, test" AgentEmail="a@a.com">
          <AgentCustomFields custom1="test test, test" custom2="test" custom3="" custom4="" />
        </Agent>
        <Fields>
          <Field FieldId="7" Value="a@a.com" FieldTitle="test" FieldType="test" />
          <Field FieldId="8" Value="test" FieldTitle="test 1" FieldType="test" />
          <Field FieldId="9" Value="test" FieldTitle="City" FieldType="Text" />
          <Field FieldId="10" Value="test" FieldTitle="State" FieldType="State" />
          <Field FieldId="11" Value="test" FieldTitle="test" FieldType="Zip" />
          <Field FieldId="950" Value="test." FieldTitle="Business Name" FieldType="Text" />
          <Field FieldId="1261" Value="Intuit Desktop" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1262" Value="test" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1263" Value="test" FieldTitle="test" FieldType="Number" />
          <Field FieldId="1267" Value="test" FieldTitle="test" FieldType="Text" />
          <Field FieldId="1310" Value="test" FieldTitle="test" FieldType="Phone" />
          <Field FieldId="1319" Value="test" FieldTitle="test" FieldType="Number" />
          <Field FieldId="1485" Value="test" FieldTitle="tst" FieldType="State" />
        </Fields>
        <Logs>
          <StatusLog>
            <Status LogId="123" LogDate="01/04/2017 03:08:44" StatusId="28" StatusTitle="test" AgentId="19" AgentName="test" AgentEmail="test@test.com" />
          </StatusLog>
          <ActionLog>
            <Action LogId="123" ActionTypeId="73" ActionTypeName="test" MilestoneId="1" ActionDate="01/04/2017 03:08:44" ActionNote="test" AgentId="19" AgentName="test,test" AgentEmail="test@test.com" />
          </ActionLog>
          <EmailLog>
            <Email LogId="123" SendDate="01/01/2017 20:53:39" EmailTemplateId="1" EmailTemplateName="test " AgentId="1" AgentName="test" AgentEmail="test@test.com" />
          </EmailLog>
          <DistributionLog>
            <Distribution LogId="1" LogDate="01/01/2017 10:10:08" DistributionProgramId="1" DistributionProgramName="test" AssignedAgentId="1" AssignedAgentName="test,test" AssignedAgentEmail="test@test.com" />
          </DistributionLog>
          <CreationLog LogId="1" LogDate="01/01/2017 10:10:05" Imported="true" CreatedByAgentId="1" CreatedByAgentName="test, test" CreatedByAgentEmail="test@test.com" />
        </Logs>
      </Lead>
    </Leads>"""

    soup = BeautifulSoup(xml, "xml")

    # Collect the attributes of every node.
    # soup() is equivalent to soup.find_all().
    attrs = [elm.attrs for elm in soup()]

    # Since you want a DataFrame, each Field becomes one row that also
    # carries the attributes of all the other nodes.
    fields_attribute_list = [x for x in attrs if 'FieldId' in x]
    other_attribute_list = [x for x in attrs if x and 'FieldId' not in x]

    # Merge the attributes of all non-Field nodes into a single dictionary.
    # setdefault keeps the first value seen for each attribute name.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)

    # Add the shared attributes to each Field row.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)

    # Build the DataFrame.
    df = pd.DataFrame(full_list)

Note, however, that attributes sharing a name across nodes (LogId, for example, appears on several log nodes in your XML) end up in a single column, and because of setdefault only the first value encountered is kept. In any case, this code should help you get started.
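One possible way around the name collisions mentioned above is to prefix every shared attribute with the tag it came from, and to number tags that repeat. The following is only a sketch under assumptions: it uses the standard library parser (so the input must be well-formed), the helper name flatten_lead and the shortened sample XML are hypothetical, and the "Tag.Attribute" column naming is just one convention.

```python
import xml.etree.ElementTree as ET

def flatten_lead(lead):
    """Return one dict per <Field>, plus all other attributes prefixed by tag."""
    shared = {}
    seen = {}
    for node in lead.iter():
        if node.tag == "Field" or not node.attrib:
            continue
        # Number repeated tags (e.g. a second <Status>) so nothing is dropped.
        seen[node.tag] = seen.get(node.tag, 0) + 1
        count = seen[node.tag]
        prefix = node.tag if count == 1 else f"{node.tag}{count}"
        for key, value in node.attrib.items():
            shared[f"{prefix}.{key}"] = value
    return [{**field.attrib, **shared} for field in lead.iter("Field")]

# Hypothetical, shortened sample that mirrors the structure in the question.
sample = """<Leads>
  <Lead Id="123">
    <Status StatusId="5" StatusTitle="open" />
    <Fields>
      <Field FieldId="9" Value="Austin" FieldTitle="City" />
      <Field FieldId="10" Value="TX" FieldTitle="State" />
    </Fields>
    <Logs>
      <StatusLog>
        <Status LogId="1" StatusId="28" />
      </StatusLog>
      <EmailLog>
        <Email LogId="2" />
      </EmailLog>
    </Logs>
  </Lead>
</Leads>"""

root = ET.fromstring(sample)
rows = [row for lead in root.iter("Lead") for row in flatten_lead(lead)]
print(rows[0])
```

Passing rows to pd.DataFrame(rows) would then give one row per Field, with separate columns such as Status.StatusId and Status2.StatusId instead of a single StatusId column.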

白衣染霜花

I think you will find XML/HTML parsing much easier with BeautifulSoup. It also copes well with malformed XML and HTML.

    pip install beautifulsoup4

Here is how to parse the XML you provided with BeautifulSoup. (The "xml" parser relies on lxml, which your code already imports.)

    from bs4 import BeautifulSoup
    import pandas as pd

    xml = """<?xml version="1.0" encoding="utf-8"?>
    <Leads>
        <Lead Id="123" LeadTitle="test, test.,  , (123) 456-7890, " CreateDate="01/01/2017 11:11:11" ModifyDate="01/04/2017 03:03:03" ACount="1" LCount="4" RCount="0" ROnly="false" Flagged="false" LastDistributionDate="01/01/2017 10:10:10" LeadFormType="test test"></Lead>
        <Lead Id="123" />
        <Lead Id="456" />
    </Leads>"""

    soup = BeautifulSoup(xml, "xml")

    # One row per Lead, built from that node's attributes.
    leads = soup.find_all('Lead')
    lead_list = [lead.attrs for lead in leads]

    df = pd.DataFrame(lead_list)
    df

Output (depending on the pandas version, the columns may instead appear in attribute order):

       ACount           CreateDate  Flagged   Id  LCount  LastDistributionDate  LeadFormType                       LeadTitle           ModifyDate  RCount  ROnly
    0       1  01/01/2017 11:11:11    false  123       4   01/01/2017 10:10:10     test test  test, test., , (123) 456-7890,  01/04/2017 03:03:03       0  false
    1     NaN                  NaN      NaN  123     NaN                   NaN           NaN                             NaN                  NaN     NaN    NaN
    2     NaN                  NaN      NaN  456     NaN                   NaN           NaN                             NaN                  NaN     NaN    NaN
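The one-row-per-Lead layout above can also carry the Field values if each Field is turned into its own column. This is a minimal sketch, not part of the original answer: it uses the standard library parser for brevity (so it assumes well-formed input), the helper lead_to_row and the sample data are hypothetical, and the columns are keyed on FieldId because FieldTitle is not unique in the question's XML.

```python
import xml.etree.ElementTree as ET

def lead_to_row(lead):
    """Build one flat dict per <Lead>: its own attributes plus one column per Field."""
    row = dict(lead.attrib)
    for field in lead.iter("Field"):
        # Column naming is an assumption: "Field_<FieldId>" holds the Field's Value.
        row[f"Field_{field.get('FieldId')}"] = field.get("Value")
    return row

# Hypothetical sample: the second Lead has no Fields at all.
sample = """<Leads>
  <Lead Id="123">
    <Fields>
      <Field FieldId="9" Value="Austin" FieldTitle="City" />
      <Field FieldId="10" Value="TX" FieldTitle="State" />
    </Fields>
  </Lead>
  <Lead Id="456" />
</Leads>"""

root = ET.fromstring(sample)
rows = [lead_to_row(lead) for lead in root.iter("Lead")]
print(rows)
```

pd.DataFrame(rows) fills the columns a Lead lacks with NaN, just as in the output above.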