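I am extracting text from a PDF by converting it to HTML and then pulling the text out of the HTML with BeautifulSoup. I am running into a problem with symbols such as the currency (rupee) sign: the rupee symbol comes out as a pair of backticks, ['``'], in a list entry of its own: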
[['Amid', '41'], ['``', '41'], ['3L cr shortfall, GST cess to continue beyond June 2022', '41'], ['Cong clips wings of ‘letter writers’ in new appointments', '32'], ['MVA aims to cut guv’s power to choose VC', '28']]
Current output:
1. Amid
2. 3L cr shortfall, GST cess to continue beyond June 2022
3. Cong clips wings of ‘letter writers’ in new appointments
4. MVA aims to cut guv’s power to choose VC
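I want to output only the text with the larger font sizes, and I also want to remove stray entries from the list, such as ['``', '41'], so that the fragments on either side are joined into one line.
My desired output looks like this: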
1. Amid 3L cr shortfall, GST cess to continue beyond June 2022
2. Cong clips wings of ‘letter writers’ in new appointments
3. MVA aims to cut guv’s power to choose VC
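A minimal sketch of the post-processing I am after, assuming that adjacent fragments with the same font size belong to one headline and that an entry with no letters or digits is junk. The name `clean_headlines` and its `min_size` parameter are placeholders for illustration; the threshold of 25 is the one used in my code.

```python
def clean_headlines(items, min_size=25):
    """Merge adjacent same-size fragments and drop stray tokens and small text.

    `items` is a list of [text, font_size] pairs like the list shown above.
    `min_size` is an assumed threshold: only larger font sizes are kept.
    """
    merged = []
    for text, size in items:
        text = text.strip()
        # Skip entries with no letters or digits, e.g. the '``' left by the rupee sign.
        if not any(ch.isalnum() for ch in text):
            continue
        # Skip body text and anything else at or below the threshold.
        if int(size) <= min_size:
            continue
        # Assumption: a fragment with the same size as the previous kept one continues it.
        if merged and merged[-1][1] == size:
            merged[-1][0] += " " + text
        else:
            merged.append([text, size])
    return [text for text, _ in merged]


items = [
    ["Amid", "41"],
    ["``", "41"],
    ["3L cr shortfall, GST cess to continue beyond June 2022", "41"],
    ["Cong clips wings of letter writers in new appointments", "32"],
    ["MVA aims to cut guv power to choose VC", "28"],
]
for number, headline in enumerate(clean_headlines(items), start=1):
    print(f"{number}. {headline}")
```

One known limitation of this approach: two separate headlines that happen to share a font size and sit next to each other would be merged into one line.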
My full code:
import sys, os, re, operator, tempfile, fileinput
from bs4 import BeautifulSoup, Tag, UnicodeDammit
from io import StringIO
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp

def convert_html(filename):
    output = StringIO()
    with open(filename, 'rb') as fin:
        extract_text_to_fp(fin, output, laparams=LAParams(), output_type='html', codec=None)
    Out_txt = output.getvalue()
    return Out_txt

def get_the_start_of_font(x, attr):
    """ Return the index of the 'font-size' first occurrence or None. """
    match = re.search(x, attr)
    if match is not None:
        return match.start()
    return None

def get_font_size_from(attr):
    """ Return the font size as string or None if not found. """
    font_start_i = get_the_start_of_font('font-size:', attr)
    if font_start_i is not None:
        font_size = str(attr[font_start_i + len('font-size:'):].split('px')[0])
        if int(font_size) > 25:
            return font_size
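
For comparison, the font-size lookup above could also be done with a single regular expression. This is only a sketch: `parse_font_size` and `FONT_SIZE_RE` are hypothetical names, and it assumes the style string contains something like `font-size:41px`. Unlike the function above, it returns an int rather than a string, and it returns None instead of raising when the value is missing or is not followed by `px`.

```python
import re

# Matches e.g. "font-size:41px" or "font-size: 12.5px" inside a style string.
FONT_SIZE_RE = re.compile(r"font-size:\s*(\d+(?:\.\d+)?)px")


def parse_font_size(style, min_size=25):
    """Return the font size in px as an int if it is above min_size, else None."""
    match = FONT_SIZE_RE.search(style)
    if match is None:
        return None
    # Fractional sizes are truncated toward zero before the comparison.
    size = int(float(match.group(1)))
    return size if size > min_size else None


print(parse_font_size("font-family: Times; font-size:41px"))
print(parse_font_size("font-size:12px"))
```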