
Converting a Word document to HTML without losing the original document

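I'm currently developing a program that needs to display a Word document as HTML while keeping track of which position in the HTML corresponds to which position in the original file.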


To do this, an ID is generated for every element in the document when the Word document is first loaded.


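// Give every table an ID of its own.
foreach (Table t in document.Tables)
{
    t.ID = GUID();

    // Cell IDs are prefixed with the parent table's ID so a cell can be traced back to its table.
    Range range = t.Range;
    foreach (Cell c in range.Cells)
    {
        c.ID = t.ID + TableIDSeparator + GUID();
    }
}

// Give every paragraph an ID of its own.
foreach (Paragraph p in document.Paragraphs)
{
    p.ID = GUID();
}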
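
To make the cell ID scheme above concrete, here is a minimal, self-contained sketch of how a composite cell ID could be built and later split back into its table part and cell part (for example, when an ID is read back from the HTML). The question does not show `GUID()` or the value of `TableIDSeparator`, so `NewId()`, the `"_"` separator, and the `ElementIds` class are assumptions made for illustration only.

```csharp
using System;

public static class ElementIds
{
    // Assumption: the question does not show the separator's value; "_" is a placeholder.
    public const string TableIDSeparator = "_";

    // Assumption: stands in for the question's GUID() helper, which is not shown.
    // The "N" format yields 32 hex digits without hyphens, so it never contains the separator.
    public static string NewId() => Guid.NewGuid().ToString("N");

    // Builds a cell ID in the same shape as the question: tableId + separator + new ID.
    public static string MakeCellId(string tableId) =>
        tableId + TableIDSeparator + NewId();

    // Splits a composite cell ID back into its table part and its cell part.
    public static (string TableId, string CellId) SplitCellId(string compositeId)
    {
        int pos = compositeId.IndexOf(TableIDSeparator, StringComparison.Ordinal);
        if (pos < 0)
            throw new ArgumentException("Not a composite cell ID.", nameof(compositeId));

        return (compositeId.Substring(0, pos),
                compositeId.Substring(pos + TableIDSeparator.Length));
    }
}
```

Splitting only works reliably if the separator can never occur inside the generated IDs, which is why this sketch pairs a hyphen-free ID format with a non-hex separator.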

I can then save the document as HTML like this:


document.SaveAs2(tempFileName, WdSaveFormat.wdFormatFilteredHTML);

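But then the document object becomes the HTML document instead of the original Word document (just as, when you use Save As from the Word menu, the current window shows the newly saved document rather than the original).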


So I tried saving the document as HTML this way:


Document temp = new Document();
string x = document.Range().XML;
temp.Range().InsertXML(x);
temp.SaveAs2(fn, WdSaveFormat.wdFormatFilteredHTML);
temp.Close(false);

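But now the new temp document lacks all the IDs I created in the original document, so I can't map positions in the HTML file back to the original document.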


Am I missing something important, or is there a way to save the Word document as HTML without losing the reference to the original file?


翻翻过去那场雪
2 Answers

RISEBY

Since the resulting documents are identical, I used the following method to copy the IDs to the new document. Note that the Paragraphs/Tables/etc. collections start at index 1, not 0.

string fn = Path.GetTempPath() + TmpPrefix + GUID() + ".html";
Document temp = new Document();

// Copy whole old document to new document
temp.Range().InsertXML(doc.Range().XML);

// Copy IDs, assuming the documents are identical and have the same number of elements
for (int i = 1; i <= temp.Tables.Count; i++)
{
    temp.Tables[i].ID = doc.Tables[i].ID;
    Range sRange = doc.Tables[i].Range;
    Range tRange = temp.Tables[i].Range;
    for (int j = 1; j <= tRange.Cells.Count; j++)
    {
        tRange.Cells[j].ID = sRange.Cells[j].ID;
    }
}
for (int i = 1; i <= temp.Paragraphs.Count; i++)
{
    temp.Paragraphs[i].ID = doc.Paragraphs[i].ID;
}

// Save new temp document as HTML
temp.SaveAs2(fn, WdSaveFormat.wdFormatFilteredHTML);
temp.Close();
return fn;

Since I don't need the IDs in the output DOCX file (I only use them to keep track of the DOCX file loaded in memory and the HTML representation shown in my application), this works well for my case.

互换的青春

However, the approach above is very slow on large documents, so I had to do it differently:

public static string RenderHTMLFile(Document doc)
{
    string fn = Path.GetTempPath() + TmpPrefix + GUID() + ".html";

    // Add a standard VBA module to the document and load the macro source into it
    var vba = doc.VBProject;
    var module = vba.VBComponents.Add(Microsoft.Vbe.Interop.vbext_ComponentType.vbext_ct_StdModule);
    var code = Properties.Resources.HTMLRenderer;
    module.CodeModule.AddFromString(code);

    // Run the macro, passing the target HTML path
    var dataMacro = Word.Run("renderHTMLCopy", fn);
    return fn;
}

Here Properties.Resources.HTMLRenderer is a txt file containing the following VB code:

Sub renderHTMLCopy(ByVal path As String)
'
' renderHTMLCopy Macro
'
    Selection.WholeStory
    Selection.Copy
    Documents.Add
    Selection.PasteAndFormat wdPasteDefault
    ActiveDocument.SaveAs2 path, WdSaveFormat.wdFormatFilteredHTML
    ActiveDocument.Close False
End Sub

The previous version took about 1500 ms to process a small document, whereas this version renders the same document in about 400 ms!