Parsing and sorting HTML tags with Beautiful Soup

I have the following HTML file, which contains bbox (bounding box) information extracted from a PDF file:


<flow>

  <block xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">

    <line xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">

      <word xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">10</word>

    </line>

  </block>

</flow>

<flow>

  <block xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">

    <line xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">

      <word xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">20</word>

    </line>

  </block>

</flow>

<flow>

  <block xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">

    <line xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">

      <word xMin="111.351361" yMin="369.965298" xMax="116.331548" yMax="380.991433">1</word>

      <word xMin="121.909358" yMin="369.965298" xMax="134.220382" yMax="380.991433">PC</word>

    </line>

  </block>

</flow>

The above are the bounding box regions for the words: 10 20 1 PC


In the original document, they are laid out like this:


10 1 PC

20

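So I want to parse the HTML file above, extract all <line> tags, and sort them by their yMin value. The final output for the example above would then be 10 1 PC 20 rather than 10 20 1 PC.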


What I have tried so far

I haven't gotten very far, as I'm still learning Python. I'm using BeautifulSoup4:


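from bs4 import BeautifulSoup

with open("test.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# The lxml HTML parser lowercases attribute names, so yMin is read as "ymin".
for line in soup.find_all("line", attrs={"ymin": True}):
    print(line.get('ymin'))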

The above only prints the yMin value of each <line> tag.


I'm not sure how to sort the <line> tags.


Any help would be greatly appreciated.


小怪兽爱吃肉
2 answers

小唯快跑啊

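You can sort the result of soup.find_all with BeautifulSoup, using each line's yMin as the sort key:

from bs4 import BeautifulSoup as soup

# html is a string holding the HTML shown in the question.
# html.parser lowercases attribute names, so yMin is read as 'ymin'.
r = [i.find_all('word') for i in sorted(soup(html, 'html.parser').find_all('line'), key=lambda x: float(x['ymin']))]
result = [i.text for b in r for i in b]

Output:

['10', '1', 'PC', '20']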
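The sort in the answer above orders lines by yMin only and relies on Python's stable sort to keep lines with equal yMin in document order. A minimal sketch of a variant that also breaks ties by xMin, so lines on the same row come out left to right, might look like this. The function name `words_in_reading_order` and the trimmed `HTML` sample are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample input: the three <line> elements from the question, trimmed to the
# attributes this sketch reads (an assumption made to keep the example short).
HTML = """
<flow><block><line xMin="53.879997" yMin="369.965298"><word>10</word></line></block></flow>
<flow><block><line xMin="53.879997" yMin="417.965298"><word>20</word></line></block></flow>
<flow><block><line xMin="111.351361" yMin="369.965298"><word>1</word><word>PC</word></line></block></flow>
"""


def words_in_reading_order(markup):
    """Return word texts with lines ordered top to bottom, then left to right."""
    soup = BeautifulSoup(markup, "html.parser")
    # html.parser lowercases attribute names, so yMin/xMin become ymin/xmin.
    lines = sorted(
        soup.find_all("line"),
        key=lambda line: (float(line["ymin"]), float(line["xmin"])),
    )
    return [word.get_text() for line in lines for word in line.find_all("word")]


print(words_in_reading_order(HTML))  # ['10', '1', 'PC', '20']
```

Converting the attributes with float() matters here: the raw attribute values are strings, and comparing them as strings would not give a reliable numeric order.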

冉冉说

Try the code below. It takes the yMin of the first <line> as a reference value and compares every line against it.

from bs4 import BeautifulSoup

html = '''<flow>
  <block xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
    <line xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
      <word xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">10</word>
    </line>
  </block>
</flow>
<flow>
  <block xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
    <line xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
      <word xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">20</word>
    </line>
  </block>
</flow>
<flow>
  <block xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
    <line xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
      <word xMin="111.351361" yMin="369.965298" xMax="116.331548" yMax="380.991433">1</word>
      <word xMin="121.909358" yMin="369.965298" xMax="134.220382" yMax="380.991433">PC</word>
    </line>
  </block>
</flow>'''

soup = BeautifulSoup(html, 'lxml')

# Reference value: the yMin of the first <line> in the document.
pricemin = soup.select_one('line[yMin]')['ymin']
list1 = []       # words from lines at or above the reference yMin
list_last = []   # words from lines below the reference yMin

for item in soup.select('line[yMin]'):
    if float(pricemin) < float(item['ymin']):
        for w in item.select('word'):
            list_last.append(w.text)
    else:
        for w in item.select('word'):
            list1.append(w.text)

print(list1 + list_last)

Output:

['10', '1', 'PC', '20']

To print the words as a single string:

print(' '.join(list1 + list_last))

Output:

10 1 PC 20
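The approach above splits lines into just two buckets relative to the first line's yMin, so it would not order a page with three or more rows. One possible generalization, sketched below, groups lines into rows whenever their yMin values fall within a small tolerance of each other. It assumes the (yMin, xMin, text) triples have already been extracted from the <line> tags; the function name `group_into_rows` and the default tolerance of 1.0 are hypothetical choices for illustration.

```python
def group_into_rows(lines, tolerance=1.0):
    """Group (ymin, xmin, text) triples into rows of text.

    A line joins the current row when its yMin is within `tolerance`
    of the yMin of the first line in that row; otherwise it starts a new row.
    """
    rows = []        # each entry is a list of (xmin, text) pairs for one row
    row_ymin = None  # yMin of the first line in the current row
    for ymin, xmin, text in sorted(lines):
        if row_ymin is None or ymin - row_ymin > tolerance:
            rows.append([])
            row_ymin = ymin
        rows[-1].append((xmin, text))
    # Order each row left to right and join its pieces with spaces.
    return [" ".join(text for _, text in sorted(row)) for row in rows]


# The three lines from the question as (yMin, xMin, text).
lines = [
    (369.965298, 53.879997, "10"),
    (417.965298, 53.879997, "20"),
    (369.965298, 111.351361, "1 PC"),
]
print(group_into_rows(lines))  # ['10 1 PC', '20']
```

This reproduces the layout described in the question, with "10 1 PC" on the first row and "20" on the second. The tolerance would likely need tuning for real PDF output, where baselines on the same visual row can differ slightly.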
