How to remove single-character or very short entries from a list in Python when pdfminer cannot render the rupee symbol

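I am extracting text from a PDF with pdfminer and converting it to HTML, then pulling the text out of the HTML with BeautifulSoup. I am running into a problem with symbols such as the currency (rupee) sign. The rupee symbol comes out looking like a tilde: ['``']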


[['Amid', '41'], ['``', '41'], ['3L cr shortfall, GST cess to continue beyond June 2022', '41'], ['Cong clips wings of ‘letter writers’ in new appointments ', '32'], ['MVA aims to cut guv’s power to choose VC ', '28']]


Current output:


 1. Amid

 2. 3L cr shortfall, GST cess to continue beyond June 2022

 3. Cong clips wings of ‘letter writers’ in new appointments

 4. MVA aims to cut guv’s power to choose VC

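I want to output only the text with a larger font size, and I also want to remove entries from the list that consist of a stray character or symbol, such as ['``', '41'].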


The output I want should look like this:


 1. Amid  3L cr shortfall, GST cess to continue beyond June 2022 

 2. Cong clips wings of ‘letter writers’ in new appointments 

 3. MVA aims to cut guv’s power to choose VC

My full code:

import sys, os, re, operator, tempfile, fileinput
from io import StringIO

from bs4 import BeautifulSoup, Tag, UnicodeDammit
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp


def convert_html(filename):
    """Convert the PDF at `filename` to an HTML string with pdfminer."""
    output = StringIO()
    with open(filename, 'rb') as fin:
        # codec=None makes pdfminer write str (not bytes) into the StringIO buffer.
        extract_text_to_fp(fin, output, laparams=LAParams(), output_type='html', codec=None)
    return output.getvalue()


def get_the_start_of_font(x, attr):
    """Return the index of the first occurrence of pattern `x` in `attr`, or None."""
    match = re.search(x, attr)
    if match is not None:
        return match.start()
    return None


def get_font_size_from(attr):
    """Return the font size as a string if it is larger than 25, else None."""
    font_start_i = get_the_start_of_font('font-size:', attr)
    if font_start_i is not None:
        # Take everything between 'font-size:' and the following 'px'.
        font_size = str(attr[font_start_i + len('font-size:'):].split('px')[0])
        if int(font_size) > 25:
            return font_size
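For reference, the two helpers above could probably be collapsed into a single regular expression. This is only a sketch: `parse_font_size` and `FONT_SIZE_RE` are hypothetical names, the sample style strings assume the HTML output writes sizes in the form `font-size:41px`, and the threshold of 25 is taken from the code above.

```python
import re

# Matches "font-size:41px" or "font-size: 41px"; assumes whole-pixel sizes.
FONT_SIZE_RE = re.compile(r'font-size:\s*(\d+)px')


def parse_font_size(style, minimum=25):
    """Return the font size as an int if it is larger than `minimum`, else None."""
    match = FONT_SIZE_RE.search(style or '')
    if match is None:
        return None
    size = int(match.group(1))
    return size if size > minimum else None


# Hypothetical style attributes, for illustration only.
print(parse_font_size('font-family: Times; font-size:41px'))
print(parse_font_size('font-family: Times; font-size:12px'))
```

Returning an int rather than a string avoids a second conversion when the sizes are compared or sorted later.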


            

米琪卡哇伊
2 answers

湖上湖

    headlines = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
                 ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k ', '28'],
                 ['O', '33'],
                 ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]

    for idx, line in enumerate(sorted([row for row in headlines if len(row[0]) > 1], key=lambda z: int(z[1]), reverse=True)):
        print("{}. {}".format(idx+1, line[0]))

Output:

    1. In bid to boost realty, state cuts stamp duty for 7 mths
    2. India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k
    3. Don’t hide behind RBI on loan interest waiver: SC to govt

A breakdown of what happens above:

    [row for row in headlines if len(row[0]) > 1]

This creates a new list containing every entry of headlines whose first element is longer than 1 character.

    sorted(<iterable>, key=lambda z: int(z[1]), reverse=True)

This sorts the given iterable using a lambda function that takes one argument and returns the element at index 1 of that argument (the font size) as an integer. Because of reverse=True, the result is in descending order.

    for idx, line in enumerate(<iterable>):

On each pass of the loop, enumerate yields a running count (starting at 0) together with the next value from the iterable.

    print("{}. {}".format(idx+1, line[0]))

Using string formatting, we build the new string inside the for loop.
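Note that `len(row[0]) > 1` only drops one-character entries, so the two-character '``' entry from the question would survive it. A possible variation, sketched here with a hypothetical helper `has_alnum`, keeps only entries that contain at least one letter or digit. The sample data is copied from the question.

```python
def has_alnum(text):
    """Return True if the text contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)


# Sample data copied from the question.
headlines = [['Amid', '41'],
             ['``', '41'],
             ['3L cr shortfall, GST cess to continue beyond June 2022', '41'],
             ['Cong clips wings of ‘letter writers’ in new appointments ', '32'],
             ['MVA aims to cut guv’s power to choose VC ', '28']]

# Drop symbol-only entries, then sort by font size, largest first.
# sorted() is stable, so entries with equal sizes keep their original order.
cleaned = sorted((row for row in headlines if has_alnum(row[0])),
                 key=lambda row: int(row[1]), reverse=True)

for idx, (text, size) in enumerate(cleaned, start=1):
    print("{}. {}".format(idx, text.strip()))
```

Testing for alphanumeric content rather than length also keeps genuine one-word headlines such as 'Amid'.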

呼如林

I can't really work out what you are trying to do or where your data comes from, but you need to add an if statement. For example:

    data = ['In bid to boost realty, state cuts stamp duty for 7 mths ', '42']
    if len(data[0].split()) > 2:
        print(data[0])

Any statement of 2 words or fewer will not be printed. If you have a list of lists:

    data = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
            ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k', '28'],
            ['O', '33'],
            ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]

    # Build a new list instead of calling data.remove() inside the loop:
    # removing items from a list while iterating over it can skip elements.
    data = [item for item in data if len(item[0].split()) > 2]
    print(*(item[0] for item in data), sep='\n')
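Neither of the filters in the answers above joins 'Amid' with the headline that follows it, which the desired output in the question seems to call for. One way to do that is `itertools.groupby`, on the assumption (not stated in the question) that fragments of one headline are adjacent and share a font size. `merge_fragments` is a hypothetical helper, and the sample data is copied from the question.

```python
from itertools import groupby


def merge_fragments(rows):
    """Drop symbol-only rows, then join adjacent rows that share a font size."""
    kept = [(text.strip(), size) for text, size in rows
            if any(ch.isalnum() for ch in text)]
    merged = []
    # groupby only groups consecutive items, which is what we want here.
    for size, group in groupby(kept, key=lambda row: row[1]):
        merged.append([' '.join(text for text, _ in group), size])
    return merged


# Sample data copied from the question.
rows = [['Amid', '41'],
        ['``', '41'],
        ['3L cr shortfall, GST cess to continue beyond June 2022', '41'],
        ['Cong clips wings of ‘letter writers’ in new appointments ', '32'],
        ['MVA aims to cut guv’s power to choose VC ', '28']]

for idx, (text, size) in enumerate(merge_fragments(rows), start=1):
    print("{}. {}".format(idx, text))
```

One caveat of this approach: two separate headlines that happen to be adjacent and use the same font size would also be joined, so it may need an extra rule for real pages.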