Unzipping a .docx file with the zipfile library

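I'm trying to write an application that pulls information from the tables in a Word docx file so that I can analyze it by converting it into a pandas DataFrame. The first step is to read the docx file correctly, and for that I'm following Virantha Ekanayake's guide, Reading and writing Microsoft Word docx files with Python.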

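I'm on the first step, where the guide says to use ZipFile from the zipfile library to unzip the docx file into its xml files. I adapted the function definition from the guide into my code (included below), but when I run it I get an error saying the docx file is "not a zip file".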

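The author of the guide says, "Essentially, a docx file is just a zip file (try running unzip on it!)…" I tried renaming the docx file to a zip file, and WinZip unzipped it successfully. In my program, however, I want to be able to unzip the docx file without manually renaming it to a .zip file. Can I somehow unzip the docx file without renaming it? Or, if I do have to rename it to use ZipFile, how would I do that in my Python code?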
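As a minimal sketch of the "a docx is just a zip file" idea: zipfile.ZipFile accepts a filesystem path and looks at the file's contents rather than its extension, so no rename should be needed. The helper names read_docx_part and list_docx_parts are hypothetical, made up for this illustration; 'word/document.xml' is the part name used in the question.

```python
import zipfile


def list_docx_parts(docx_path):
    """Return the names of all parts stored inside the .docx package."""
    # ZipFile opens the path itself in binary mode; the extension is irrelevant.
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()


def read_docx_part(docx_path, part_name="word/document.xml"):
    """Return the raw bytes of one part (hypothetical helper, not from the guide)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read(part_name)
```

Calling read_docx_part(f'{FILE_PATH}/DocxFile.docx') would then return the XML as bytes, and zipfile.is_zipfile(path) can be used beforehand to check that the file really is a zip archive.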

import zipfile
from lxml import etree
import pandas as pd

FILE_PATH = 'C:/Users/user/Documents/Python Project'

class Application():
    def __init__(self):
        #debug print('Initialized!')
        xml_content = self.get_word_xml(f'{FILE_PATH}/DocxFile.docx')
        xml_tree = self.get_xml_tree(xml_content)

    def get_word_xml(self, docx_filename):
        with open(docx_filename) as f:
            zip = zipfile.ZipFile(f)
            xml_content = zip.read('word/document.xml')
        return xml_content

    def get_xml_tree(self, xml_string):
        return (etree.fromstring(xml_string))

a = Application()
a.mainloop()

Error:

Traceback (most recent call last):
  File "C:\Users\user\Documents\New_Tool.py", line 39, in <module>
    a = Application()
  File "C:\Users\user\Documents\New_Tool.py", line 27, in __init__
    xml_content = self.get_word_xml(f'{FILE_PATH}/DocxFile.docx')
  File "C:\Users\user\Documents\New_Tool.py", line 32, in get_word_xml
    zip = zipfile.ZipFile(f)
  File "C:\Progra~1\Anaconda3\lib\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Progra~1\Anaconda3\lib\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
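One likely cause, offered as a guess rather than a confirmed diagnosis: open(docx_filename) with no mode argument opens the file in text mode, and zipfile needs a binary, seekable stream to locate the archive's end-of-central-directory record. The sketch below is a standalone variant of get_word_xml that only changes the open mode to 'rb'. It assumes the docx itself is intact, which the successful WinZip extraction suggests.

```python
import zipfile


def get_word_xml(docx_filename):
    """Read word/document.xml from a .docx via a binary file object."""
    # 'rb' is the only change from the code in the question.
    with open(docx_filename, "rb") as f:
        with zipfile.ZipFile(f) as archive:
            # archive.read returns bytes, which etree.fromstring accepts.
            return archive.read("word/document.xml")
```

The file is never renamed here; ZipFile works on whatever binary stream or path it is given.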

Helenr
