Converting a PDF to CSV using Java

I have already tried most of what I could find on Stack Overflow and elsewhere.


Problem: I have a PDF that contains both plain content and tables. I need to parse the tables as well as the content.


APIs: https://github.com/tabulapdf/tabula-java I am using tabula-java, but it skips some of the content, and the content inside table cells is not separated correctly.


My PDF has content like this:


 DATE :1/1/2018         ABCD                   SCODE:FFFT

                       --ACCEPTED--

    USER:ADMIN         BATCH:RR               EEE

    CON BATCH

    =======================================================================

    MAIN SNO SUB  VALUE DIS %

    R    12   rr1 0125  24.5

            SLNO  DESC  QTY  TOTAL  CODE   FREE

            1     ABD   12   90     BBNEW  -NILL-

            2     XDF   45   55     GHT55  MRP

            3     QWE   08   77     CAT    -NILL-

    =======================================================================

    MAIN SNO SUB  VALUE DIS %

    QW    14   rr2 0122  24.5

            SLNO  DESC  QTY  TOTAL  CODE   FREE

            1     ABD   12   90     BBNEW  -NILL-

            2     XDF   45   55     GHT55  MRP

            3     QWE   08   77     CAT    -NILL-

Code used to convert the table:


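import java.io.File;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.ParseException;

public static void toCsv() throws ParseException {
    // Same options as the command-line call: -p selects the page, -o the output
    String[] commandLineOptions = { "-p", "1", "-o", "$csv" };
    CommandLineParser parser = new DefaultParser();
    try {
        CommandLine line = parser.parse(TabulaUtil.buildOptions(), commandLineOptions);
        // Extract the tables of the input PDF into the CSV file
        new TabulaUtil(System.out, line).extractFileInto(
                new File("/home/sample/firstPage.pdf"),
                new File("/home/sample/onePage.csv"));
    } catch (Exception e) {
        e.printStackTrace();
    }
}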

tabula also supports a command-line interface:


java -jar TabulaJar/tabula-1.0.2-jar-with-dependencies.jar -p all  -o  $csv -b Pdfs

I tried the -c,--columns <COLUMNS> option, which splits cells at the X coordinates of the column boundaries.


But the problem is that the content of my PDFs is dynamic, i.e. the table size changes from file to file.
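One way to cope with boundaries that move, sketched here as an assumption and not as something tabula does itself, is to derive the boundaries from each table's own header row instead of passing fixed values. The sketch below works on character offsets in already extracted text, not on PDF X coordinates, and the class and method names (HeaderColumnSlicer, columnStarts, slice) are hypothetical. It also assumes the header and its data rows keep the same left indentation and that no header label contains a space.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: derive column boundaries from a header line, then slice data rows. */
class HeaderColumnSlicer {

    /** Returns the character offset at which each header word starts. */
    static List<Integer> columnStarts(String header) {
        List<Integer> starts = new ArrayList<>();
        boolean inWord = false;
        for (int i = 0; i < header.length(); i++) {
            boolean space = Character.isWhitespace(header.charAt(i));
            if (!space && !inWord) {
                starts.add(i); // first character of a new header word
            }
            inWord = !space;
        }
        return starts;
    }

    /** Cuts a row at the given offsets and trims each resulting cell. */
    static List<String> slice(String row, List<Integer> starts) {
        List<String> cells = new ArrayList<>();
        for (int c = 0; c < starts.size(); c++) {
            // Clamp to the row length so short rows yield empty cells
            int from = Math.min(starts.get(c), row.length());
            int to = (c + 1 < starts.size())
                    ? Math.min(starts.get(c + 1), row.length())
                    : row.length();
            cells.add(row.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Header and row taken from the sample page, with the indentation removed
        String header = "SLNO  DESC  QTY  TOTAL  CODE   FREE";
        String row    = "1     ABD   12   90     BBNEW  -NILL-";
        System.out.println(String.join(",", slice(row, columnStarts(header))));
    }
}
```

Because the offsets are recomputed for every header line, a table whose columns shift between files is handled without changing any parameter. A header label with an embedded space, such as "DIS %" in the sample, would still be seen as two columns.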


These links on Stack Overflow, and many more, did not work for me.


2 answers

守着一只汪

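There are a few relevant projects from the Apache Foundation:

Tika - supports a wide range of file formats, including pdf, ppt and xls. The supported formats are listed at https://tika.apache.org/1.24.1/formats.html
https://tika.apache.org/

PDFBox - specific to PDF-related functionality
https://pdfbox.apache.org/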
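Both libraries return text, not CSV, so a conversion step is still needed. The following is a minimal sketch of that step only, assuming the page text has already been extracted line by line with its layout preserved (for example with PDFBox's PDFTextStripper, which is not shown here). The class name LayoutTextToCsv and its methods are hypothetical. It splits on runs of whitespace, so it assumes that no cell contains a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: turn layout-preserved text lines into CSV rows. */
class LayoutTextToCsv {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Splits one text line on runs of whitespace and joins the cells with commas. */
    static String toCsvRow(String line) {
        List<String> escaped = new ArrayList<>();
        for (String cell : line.trim().split("\\s+")) {
            escaped.add(escape(cell));
        }
        return String.join(",", escaped);
    }

    /** Converts all lines, skipping blank lines and ===== separator rules. */
    static List<String> toCsv(List<String> lines) {
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.matches("=+")) {
                continue;
            }
            rows.add(toCsvRow(trimmed));
        }
        return rows;
    }

    public static void main(String[] args) {
        // A few lines in the shape of the sample page from the question
        List<String> page = Arrays.asList(
                "    =====",
                "            SLNO  DESC  QTY  TOTAL  CODE   FREE",
                "",
                "            1     ABD   12   90     BBNEW  -NILL-");
        toCsv(page).forEach(System.out::println);
    }
}
```

With this approach the item row from the question becomes 1,ABD,12,90,BBNEW,-NILL-. The MAIN/SNO/SUB header would not come out right, because "DIS %" contains a space and is split into two cells. Tables like that need position-based splitting instead.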

眼眸繁星

See here for an example of extracting a PDF to CSV with Java: https://github.com/pdftables/java-pdftables-api. Each page is considered independently, so the dynamic nature of your PDFs should not be a problem. You can use the free trial on their website.

package com.pdftables.examples;

import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.entity.mime.content.FileBody;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ConvertToFile {
    private static List<String> formats = Arrays.asList(new String[] { "csv", "xml", "xlsx-single", "xlsx-multiple" });

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.out.println("Command line: <API_KEY> <FORMAT> <PDF filename>");
            System.exit(1);
        }

        final String apiKey = args[0];
        final String format = args[1].toLowerCase();
        final String pdfFilename = args[2];

        if (!formats.contains(format)) {
            System.out.println("Invalid output format: \"" + format + "\"");
            System.exit(1);
        }

        // Avoid cookie warning with default cookie configuration
        RequestConfig globalConfig = RequestConfig.custom().setCookieSpec(CookieSpecs.STANDARD).build();

        File inputFile = new File(pdfFilename);

        if (!inputFile.canRead()) {
            System.out.println("Can't read input PDF file: \"" + pdfFilename + "\"");
            System.exit(1);
        }

        try (CloseableHttpClient httpclient = HttpClients.custom().setDefaultRequestConfig(globalConfig).build()) {
            HttpPost httppost = new HttpPost("https://pdftables.com/api?format=" + format + "&key=" + apiKey);
            FileBody fileBody = new FileBody(inputFile);

            HttpEntity requestBody = MultipartEntityBuilder.create().addPart("f", fileBody).build();
            httppost.setEntity(requestBody);

            System.out.println("Sending request");

            try (CloseableHttpResponse response = httpclient.execute(httppost)) {
                if (response.getStatusLine().getStatusCode() != 200) {
                    System.out.println(response.getStatusLine());
                    System.exit(1);
                }
                HttpEntity resEntity = response.getEntity();
                if (resEntity != null) {
                    final String outputFilename = getOutputFilename(pdfFilename, format.replaceFirst("-.*$", ""));
                    System.out.println("Writing output to " + outputFilename);

                    final File outputFile = new File(outputFilename);
                    FileUtils.copyToFile(resEntity.getContent(), outputFile);
                } else {
                    System.out.println("Error: file missing from response");
                    System.exit(1);
                }
            }
        }
    }

    private static String getOutputFilename(String pdfFilename, String suffix) {
        if (pdfFilename.length() >= 5 && pdfFilename.toLowerCase().endsWith(".pdf")) {
            return pdfFilename.substring(0, pdfFilename.length() - 4) + "." + suffix;
        } else {
            return pdfFilename + "." + suffix;
        }
    }
}