Reading checkbox values from a PDF converted from Word

When you convert a Word document that contains Word form fields to PDF (using Save As *.pdf), unfortunately no PDF form fields are created from them. (That would have been neat.) The checkboxes are stored as characters in the MS Gothic font, so if you want to extract them you need to extract the text of the PDF.

A checkbox can have two states, so there are two characters:

☐ - Unicode 2610
☒ - Unicode 2612

Some sample code (written against PDFBox 2.x):

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class CheckboxReader {
        public static void main(String[] args) throws IOException {
            // Pass the path to your PDF as the first argument.
            try (PDDocument pdDoc = PDDocument.load(new File(args[0]))) {
                String parsedText = new PDFTextStripper().getText(pdDoc);
                //System.out.println("Full text" + parsedText);
                for (int i = 0; i < parsedText.length(); i++) {
                    char c = parsedText.charAt(i);
                    if (c == '\u2612') {        // checked box, U+2612
                        System.out.println("Found a checked box at index " + i);
                    } else if (c == '\u2610') { // unchecked box, U+2610
                        System.out.println("Found an unchecked box at index " + i);
                    } else {
                        continue;               // any other character: skip
                    }
                    // Print the code point in \uXXXX form.
                    System.out.println("\\u" + Integer.toHexString(c | 0x10000).substring(1));
                }
            }
        }
    }

Update: You provided a sample PDF. In that file the checkbox is not text at all; it is stored as a "drawing" (path operators) in the page's content stream. If you look at the page object, the Contents entry points you in the right direction:

    3 0 obj
    <</Type /Page
    /Contents 4 0 R
    ...

There you will find the content in 4 0 obj, which starts with:

    4 0 obj
    <</Length 807>>
    stream
     /P <</MCID 0>> BDC q
    0.00000912 0 612 792 re
    W* n
    BT
    /F1 9.96 Tf
    1 0 0 1 72.024 710.62 Tm
    /GS7 gs
    0 g
    /GS8 gs
    0 G
    [( )] TJ
    ET
    Q
     EMC q
    0.000018243 0 612 792 re
    W* n
     /P <</MCID 1>> BDC 0.72 w
    0 G
     1 j
     73.104 696.34 9.24 9.24 re
    S
    0.48 w
    72.984 705.7 m
    82.464 696.22 l
    S
    82.464 705.7 m
    72.984 696.22 l
    S
    Q
     EMC  /P <</MCID 2>> BDC q
    0.00000912 0 612 792 re
    W* n
    BT
    /F1 9.96 Tf
    1 0 0 1 83.544 697.3 Tm
    0 g
    0 G
    [( )] TJ
    ET

This is basically how the checkbox is drawn. In the MCID 1 block, the re operator followed by S strokes a 9.24 x 9.24 square, and the two m ... l ... S sequences stroke its two diagonals, which together form the X of a checked box. You can read this content with PDFBox, but you have to interpret and recognize the shapes yourself. Look at how the PDF specification explains these drawing instructions...
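To illustrate what "interpret it yourself" could look like for a drawn checkbox, here is a minimal plain-Java sketch. It is not PDFBox code and not a general solution. It assumes the content stream is already available as decoded text with whitespace-separated tokens (in practice you would take the tokens from a real content-stream parser), it ignores coordinate transformations, and it only interprets the operators re, m, l, S and n. The class name CheckboxSketch, the Box type and the 0.5 unit tolerance are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: recognise stroked squares and the diagonals drawn inside them. */
final class CheckboxSketch {

    // Allowed overshoot in user-space units. An assumption; tune it per document.
    static final double TOLERANCE = 0.5;

    /** A stroked square plus the diagonal strokes found inside it. */
    static final class Box {
        final double x, y, width, height;
        boolean rising, falling; // the two strokes of an "X"

        Box(double x, double y, double width, double height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }

        boolean isChecked() { return rising && falling; }

        /** True if the point lies inside the box, allowing for stroke overshoot. */
        boolean contains(double px, double py) {
            return px >= x - TOLERANCE && px <= x + width + TOLERANCE
                && py >= y - TOLERANCE && py <= y + height + TOLERANCE;
        }
    }

    static List<Box> findBoxes(String contentStream) {
        List<Box> boxes = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double[] rect = null;  // pending "re" path: x, y, width, height
        double[] start = null; // last "m" point
        double[] line = null;  // pending segment: x1, y1, x2, y2

        for (String token : contentStream.trim().split("\\s+")) {
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.add(Double.parseDouble(token)); // numeric operand
                continue;
            }
            int n = operands.size();
            switch (token) {
                case "re":
                    if (n >= 4) rect = new double[] {operands.get(n - 4),
                        operands.get(n - 3), operands.get(n - 2), operands.get(n - 1)};
                    break;
                case "m":
                    if (n >= 2) start = new double[] {operands.get(n - 2), operands.get(n - 1)};
                    break;
                case "l":
                    if (n >= 2 && start != null) line = new double[] {start[0], start[1],
                        operands.get(n - 2), operands.get(n - 1)};
                    break;
                case "S": // stroke: either a new square or a line inside the last one
                    if (rect != null && Math.abs(rect[2] - rect[3]) <= TOLERANCE) {
                        boxes.add(new Box(rect[0], rect[1], rect[2], rect[3]));
                    } else if (line != null && !boxes.isEmpty()) {
                        markDiagonal(boxes.get(boxes.size() - 1), line);
                    }
                    rect = null; start = null; line = null;
                    break;
                case "n": case "f": case "F": case "f*":
                case "B": case "B*": case "b": case "b*": case "s":
                    // Other path-ending operators (n ends a clip path): not interpreted here.
                    rect = null; start = null; line = null;
                    break;
                default:
                    break; // anything else is ignored
            }
            operands.clear();
        }
        return boxes;
    }

    private static void markDiagonal(Box box, double[] l) {
        if (!box.contains(l[0], l[1]) || !box.contains(l[2], l[3])) return;
        double dx = l[2] - l[0], dy = l[3] - l[1];
        // Too short to be a corner-to-corner stroke.
        if (Math.abs(dx) < box.width / 2 || Math.abs(dy) < box.height / 2) return;
        if (dx * dy > 0) box.rising = true; else box.falling = true;
    }

    public static void main(String[] args) {
        // The MCID 1 block from the stream above, with tokens separated by spaces.
        String sample = "0.72 w 0 G 1 j 73.104 696.34 9.24 9.24 re S 0.48 w "
            + "72.984 705.7 m 82.464 696.22 l S 82.464 705.7 m 72.984 696.22 l S";
        for (Box b : findBoxes(sample)) {
            System.out.println((b.isChecked() ? "checked" : "unchecked")
                + " box at " + b.x + ", " + b.y);
        }
    }
}
```

A box counts as checked only when both a rising and a falling diagonal are stroked inside it, so a square with a single slash stays unchecked. Other producers may draw checkboxes differently (filled paths, curves, a check mark instead of an X), so treat the heuristic as a starting point for this particular file.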
