
itext7 - How to filter render events while writing

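I want to filter RENDER_TEXT events as they are written to the output file. I have a PDF containing some text that I want to filter out. I have found that I can walk through the document once and determine the characteristics of the render events I want to filter. Now I want to copy the pages of the source document and skip some RENDER_TEXT events so that their text does not appear in the destination document. I have an IEventFilter that accepts the right events. I just need to know how to attach this filter to the document writer.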


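The goal is to take a PDF created from Google Calendar in Agenda format and remove the "Created by:" and "Calendar:" lines. Each of these lines usually consists of 3 RENDER_TEXT events.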


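My current code is below. I have found that all RENDER_TEXT events sharing the same baseline y coordinate belong to one line, which identifies the events I want to remove.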


import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.itextpdf.kernel.geom.LineSegment;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.IEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;

public class Main {

    private static final Logger LOGGER = LogManager.getLogger();

    public static void main(String[] args) throws FileNotFoundException, IOException {
        final Path src = Paths.get("calendar_2018-08-04_2018-08-19.pdf");
        final Path dest = Paths.get("/home/jpschewe/Downloads/calendar_clean.pdf");

        final Main app = new Main(src, dest);

    }
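
The baseline idea described in the question can be illustrated without iText at all. The following is only a sketch under assumptions: `TextChunk` and `BaselineFilter` are hypothetical stand-ins (not iText classes), the tolerance value is arbitrary, and it assumes the first chunk of an unwanted line begins with the label text. In iText 7 the y value would presumably be read from the `LineSegment` returned by `TextRenderInfo.getBaseline()`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one RENDER_TEXT event: its text and baseline y.
final class TextChunk {
    final String text;
    final float baselineY;

    TextChunk(String text, float baselineY) {
        this.text = text;
        this.baselineY = baselineY;
    }
}

final class BaselineFilter {
    // Assumed tolerance for comparing baselines; tune for real documents.
    static final float EPSILON = 0.01f;

    // Pass 1: record the baseline y of every chunk that starts an unwanted line.
    static List<Float> findUnwantedBaselines(List<TextChunk> chunks) {
        List<Float> baselines = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            if (chunk.text.startsWith("Created by:") || chunk.text.startsWith("Calendar:")) {
                baselines.add(chunk.baselineY);
            }
        }
        return baselines;
    }

    // Pass 2: keep only chunks whose baseline matches none of the unwanted ones.
    static List<TextChunk> dropLines(List<TextChunk> chunks, List<Float> unwanted) {
        List<TextChunk> kept = new ArrayList<>();
        for (TextChunk chunk : chunks) {
            boolean onUnwantedLine = false;
            for (float y : unwanted) {
                if (Math.abs(chunk.baselineY - y) < EPSILON) {
                    onUnwantedLine = true;
                    break;
                }
            }
            if (!onUnwantedLine) {
                kept.add(chunk);
            }
        }
        return kept;
    }
}
```

Both passes work on the same list here; with a real PDF the first pass would be a parsing run and the second would have to happen while the content is rewritten.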


尚方宝剑之说
1 answer

喵喔喔

As suggested in the comments, you can use the PdfCanvasEditor from this answer to filter operations out of the content streams as needed. In fact, I extended that class slightly so that it properly supports the ' and " text-showing operators. You can find that class here.

Just as in your approach, the lines to clear are determined in a first run: I use a RegexBasedLocationExtractionStrategy instance for that. Afterwards, in the PdfCanvasEditor step, the instructions that draw text on those lines are changed to draw only an empty string.

However, because the text here is drawn by the more basic operator and operand structure rather than by the events you inspect, the exact mechanism is not derived from IEventFilter. The mechanism is nevertheless similar to your approach.

try (PdfDocument pdfDocument = new PdfDocument(SOURCE_PDF_READER, TARGET_PDF_WRITER)) {
    // Rectangles of the "Created by:" / "Calendar:" matches in the current XObject
    List<Rectangle> triggerRectangles = new ArrayList<>();

    PdfCanvasEditor editor = new PdfCanvasEditor()
    {
        {
            // The text matrix is private in PdfCanvasProcessor, so it is read via reflection
            Field field = PdfCanvasProcessor.class.getDeclaredField("textMatrix");
            field.setAccessible(true);
            textMatrixField = field;
        }

        @Override
        protected void nextOperation(PdfLiteral operator, List<PdfObject> operands) {
            // Remember the text matrix as it is before the operation is processed
            try {
                recentTextMatrix = (Matrix)textMatrixField.get(this);
            } catch (IllegalArgumentException | IllegalAccessException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
        {
            String operatorString = operator.toString();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                Matrix matrix = null;
                try {
                    matrix = recentTextMatrix.multiply(getGraphicsState().getCtm());
                } catch (IllegalArgumentException e) {
                    throw new RuntimeException(e);
                }
                // y coordinate of the text origin in the coordinate system of the XObject
                float y = matrix.get(Matrix.I32);
                if (triggerRectangles.stream().anyMatch(rect -> rect.getBottom() <= y && y <= rect.getTop())) {
                    // Replace the text argument by an empty array (TJ) or an empty string (Tj, ', ")
                    if ("TJ".equals(operatorString))
                        operands.set(0, new PdfArray());
                    else
                        operands.set(operands.size() - 2, new PdfString(""));
                }
            }

            super.write(processor, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
        final Field textMatrixField;
        Matrix recentTextMatrix;
    };

    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
    {
        PdfPage page = pdfDocument.getPage(i);
        Set<PdfName> xobjectNames = page.getResources().getResourceNames(PdfName.XObject);
        for (PdfName xobjectName : xobjectNames) {
            PdfFormXObject xobject = page.getResources().getForm(xobjectName);
            byte[] content = xobject.getPdfObject().getBytes();
            PdfResources resources = xobject.getResources();

            // First run: locate the lines to clear
            RegexBasedLocationExtractionStrategy regexLocator = new RegexBasedLocationExtractionStrategy("Created by:|Calendar:");
            new PdfCanvasProcessor(regexLocator).processContent(content, resources);
            triggerRectangles.clear();
            triggerRectangles.addAll(regexLocator.getResultantLocations().stream().map(loc -> loc.getRectangle()).collect(Collectors.toSet()));

            // Second run: rewrite the content stream with the text on those lines emptied
            PdfCanvas pdfCanvas = new PdfCanvas(new PdfStream(), resources, pdfDocument);
            editor.editContent(content, resources, pdfCanvas);
            xobject.getPdfObject().setData(pdfCanvas.getContentStream().getBytes());
        }
    }
}

(EditPageContent test testRemoveSpecificLinesCalendar)

Note that this is a proof of concept tailored specifically to the OP's use case: the PdfCanvasEditor is used here only to inspect and edit the first-level form XObjects of each page, because PDFs created from Google Calendar in Agenda format hold all of their page content in a form XObject which in turn is drawn in the page content stream. Furthermore, the text is expected to run parallel to the top of the page.
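
The index arithmetic in the write override is easier to follow in isolation. The sketch below mimics it with plain Java collections instead of iText objects; `TextOperandBlanker` and the two-element `float[]` bands are hypothetical, and it assumes (as the code above appears to) that the operand list ends with the operator itself. Under that assumption the string argument of Tj, ' and " always sits at `size() - 2`, while TJ carries its array at index 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the answer's logic with plain Java types.
final class TextOperandBlanker {
    static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");

    // True if y lies inside any band; each band is {bottom, top}.
    static boolean isTriggered(float y, List<float[]> bands) {
        for (float[] band : bands) {
            if (band[0] <= y && y <= band[1]) {
                return true;
            }
        }
        return false;
    }

    // operands holds the arguments followed by the operator name as last element.
    static void blank(List<Object> operands) {
        String operator = (String) operands.get(operands.size() - 1);
        if (!TEXT_SHOWING_OPERATORS.contains(operator)) {
            return; // not a text-showing operation, leave untouched
        }
        if ("TJ".equals(operator)) {
            // TJ takes one array of strings and kerning numbers: empty it
            operands.set(0, new ArrayList<Object>());
        } else {
            // Tj, ' and " have their string directly before the operator
            operands.set(operands.size() - 2, "");
        }
    }
}
```

For the " operator, which takes word spacing, character spacing and a string, only the string is emptied, so the spacing side effects of the operator are preserved.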