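How to search all file types in a directory for regular expressions

So, I want to search my entire directory for files that match any of a list of regular expressions. This includes directories, PDFs, and CSV files. I can do this successfully when searching only text files, but searching all file types is proving difficult. Here is my work so far: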


import glob
import re
import PyPDF2

#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)

#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
#Search for Locations
regex3 =re.compile("([A-Z]\w+), ([A-Z]{2})")

for file in folder_contents:
    read_file = open(file, 'rt').read()
if readile_file == pdf:
    pdfFileObj = open('pdf.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    content= pageObj.extractText())
    if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
        print ("YES, This file containts PHI")
        print(file)
    else:
        print("No, This file DOES NOT contain PHI")
        print(file)

When I run it, I get this error:


YES, This file containts PHI
/home/e136320/sample.txt
No, This file DOES NOT contain PHI
/home/e136320/medicalSample.txt

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-129-be0b68229c20> in <module>()
     19 
     20 for file in folder_contents:
---> 21     read_file = open(file, 'rt').read()
     22 if readile_file == pdf:
     23     # creating a pdf file object

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte

Any suggestions?


九州编程
1 Answer

UYOU

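You can't open a PDF file like that; open(file, 'rt') expects a plain-text file. You could use something like this:

import os
import PyPDF2

fn, ext = os.path.splitext(file)
if ext == '.pdf':
    # PDFs are binary, so open in 'rb' mode and let PyPDF2 parse the stream
    with open(file, 'rb') as open_file:
        pdf_reader = PyPDF2.PdfFileReader(open_file)
        pass  # Do something with pdf_reader...
else:  # plain text
    with open(file, 'rt') as open_file:
        pass  # Do something with open file...

This snippet checks the file extension and then chooses how to open the file based on what it finds. That is somewhat naive, and could be done better using a method similar to the one shown in this answer.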
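
To tie the extension check back to the original regex search, here is a minimal sketch under a few assumptions. It assumes the older PyPDF2 API used in the question (`PdfFileReader`, `getPage`, `extractText`); newer releases rename these to `PdfReader`, `pages`, and `extract_text`. The helper names `extract_text` and `scan_folder` are hypothetical. Decoding with `errors='replace'` is one possible way to sidestep the `UnicodeDecodeError`, not necessarily the right choice for every file type.

```python
import os
import re

# Patterns taken from the question.
PATTERNS = [
    re.compile(r'\S+@\S+'),                 # emails
    re.compile(r'\d{3}-\d{3}-\d{4}'),       # phone numbers like 555-123-4567
    re.compile(r'([A-Z]\w+), ([A-Z]{2})'),  # locations like "Boston, MA"
]


def extract_text(path):
    """Return the text of a file, dispatching on its extension (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == '.pdf':
        # Imported lazily so folders without PDFs do not need PyPDF2 installed.
        import PyPDF2
        with open(path, 'rb') as fh:
            reader = PyPDF2.PdfFileReader(fh)
            return '\n'.join(
                reader.getPage(i).extractText()
                for i in range(reader.getNumPages())
            )
    # Everything else is treated as text. errors='replace' substitutes U+FFFD
    # for bytes that are not valid UTF-8 instead of raising UnicodeDecodeError.
    with open(path, 'rt', encoding='utf-8', errors='replace') as fh:
        return fh.read()


def scan_folder(folder):
    """Map each file under `folder` (recursively) to True if any pattern matches."""
    results = {}
    # os.walk descends into subdirectories, so directories are never opened directly.
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            text = extract_text(path)
            results[path] = any(p.search(text) for p in PATTERNS)
    return results
```

Calling `scan_folder("/home/")` and looping over the returned dictionary would reproduce the YES/NO report from the question. Unreadable files (for example, ones that raise a permission error) and malformed PDFs are not handled here and would need their own `try`/`except`.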