How can I annotate multiple Stanford CoreNLP CoreDocuments more efficiently?


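I am annotating a large number of Strings as CoreDocuments through Stanford CoreNLP. StanfordCoreNLP pipelines have an internal feature for multithreaded annotation to optimize the process, but as far as I can tell, CoreDocument objects cannot use that feature in the version I am running, which is stanford-corenlp-full-2018-10-05.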

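Since I could not make the pipeline annotate a collection of CoreDocuments, I instead tried to optimize the individual annotations by placing them inside multithreaded methods. I have no issues with the multithreaded environment: I receive all results as expected, and my only drawback is the time consumption. I tried about 7 different implementations, and these were the 3 fastest: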

    // ForkJoinPool is initialized in the main method in my application
    private static ForkJoinPool executor = new ForkJoinPool(Runtime.getRuntime().availableProcessors(), ForkJoinPool.defaultForkJoinWorkerThreadFactory, null, false);

    public static ConcurrentMap<String, CoreDocument> getMultipleCoreDocumentsWay1(Collection<String> str) {
        ConcurrentMap<String, CoreDocument> pipelineCoreDocumentAnnotations = new MapMaker().concurrencyLevel(2).makeMap();
        str.parallelStream().forEach((str1) -> {
            CoreDocument coreDocument = new CoreDocument(str1);
            pipeline.annotate(coreDocument);
            pipelineCoreDocumentAnnotations.put(str1, coreDocument);
            System.out.println("pipelineCoreDocumentAnnotations size1: " + pipelineCoreDocumentAnnotations.size() + "\nstr size: " + str.size() + "\n");
        });
        return pipelineCoreDocumentAnnotations;
    }

Time parallel1: 336562 ms.

Time parallel4: 391556 ms.

Time parallel7: 491639 ms.

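Honestly, the greatest benefit would be if the pipeline itself could annotate multiple documents somehow, but as long as I don't know how to achieve that, I hope someone can explain how I could optimize the CoreDocument annotations individually. PS: Mashing all the strings into one single CoreDocument for annotation is also not what I want, since I need the individual CoreDocuments for comparisons afterwards.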
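One detail of the snippet above may be worth checking: `parallelStream()` normally runs on the common ForkJoinPool rather than on the `executor` field declared next to it (unless the method is itself called from a task running inside that pool), and every task also prints to stdout. Below is a minimal sketch, not a measured improvement, of running the per-string work on an explicitly sized pool and collecting one result per input. The class name `BoundedParallelAnnotator`, the method `annotateAll`, and the `annotateOne` parameter are hypothetical names made up for illustration. The sketch assumes `annotateOne` is safe to call from several threads at once. In the setting of this question it would wrap `new CoreDocument(s)` followed by `pipeline.annotate(doc)`; here a plain string function stands in so the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class BoundedParallelAnnotator {

    /**
     * Applies annotateOne to every input on a fixed-size pool and returns
     * the results keyed by input string, in input order.
     * Assumption: annotateOne may be called from several threads at once.
     */
    public static <T> Map<String, T> annotateAll(Collection<String> inputs,
                                                 Function<String, T> annotateOne,
                                                 int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> keys = new ArrayList<>(inputs);
            List<Callable<T>> tasks = new ArrayList<>();
            for (String s : keys) {
                tasks.add(() -> annotateOne.apply(s));
            }
            // invokeAll blocks until every task has completed and returns
            // the futures in the same order as the task list.
            List<Future<T>> futures = pool.invokeAll(tasks);
            Map<String, T> results = new LinkedHashMap<>();
            for (int i = 0; i < keys.size(); i++) {
                results.put(keys.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while annotating", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Annotation task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real work: string length instead of a CoreDocument.
        Map<String, Integer> lengths =
                annotateAll(Arrays.asList("first text", "second", "third one"),
                            String::length, 4);
        System.out.println(lengths);
    }
}
```

Because the map is filled by the calling thread after all tasks finish, it does not need to be a concurrent map, and duplicate input strings collapse to one entry just as they do in the original method. Whether a fixed pool beats the timings listed above would have to be measured.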

回首忆惘然


1 Answer

潇湘沐

I haven't timed it, but you could try this sample code (add your test strings to the list of strings). It should work on 4 documents at the same time:

    package edu.stanford.nlp.examples;

    import edu.stanford.nlp.pipeline.*;

    import java.util.*;
    import java.util.function.*;
    import java.util.stream.*;

    public class MultiThreadStringExample {

        public static class AnnotationCollector<T> implements Consumer<T> {
            // Synchronized because accept() may be called from several annotation threads.
            List<T> annotations = Collections.synchronizedList(new ArrayList<T>());

            public void accept(T ann) {
                annotations.add(ann);
            }
        }

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse");
            // Let the pipeline annotate up to 4 documents concurrently.
            props.setProperty("threads", "4");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            AnnotationCollector<Annotation> annCollector = new AnnotationCollector<Annotation>();
            // Add the strings to annotate here; the list is empty as written.
            List<String> exampleStrings = new ArrayList<String>();
            for (String exampleString : exampleStrings) {
                // The collector is called back once the annotation is done.
                pipeline.annotate(new Annotation(exampleString), annCollector);
            }
            // Crude fixed wait for the asynchronous annotations to finish.
            Thread.sleep(10000);
            List<CoreDocument> coreDocs =
                    annCollector.annotations.stream().map(ann -> new CoreDocument(ann)).collect(Collectors.toList());
            for (CoreDocument coreDoc : coreDocs) {
                System.out.println(coreDoc.tokens());
            }
        }
    }
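The sample above waits for the callbacks with a fixed `Thread.sleep(10000)`, which can be too short for a large batch and wastefully long for a small one. A possible alternative, sketched here under assumptions, is to count the callbacks with a `CountDownLatch`. The class `CallbackBarrier`, the method `collectAll`, and the `fakeAnnotate` stand-in are hypothetical names invented for this illustration. The sketch assumes the asynchronous call invokes its callback exactly once per submitted item. With the pipeline from the answer, the `asyncAnnotate` argument would be something like `(ann, cb) -> pipeline.annotate(ann, cb)`, but the example uses a small thread pool instead so that it runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

public class CallbackBarrier {

    /**
     * Submits every item to asyncAnnotate and blocks until each one has
     * reported back through its callback, or until the timeout expires.
     * Assumption: the callback fires exactly once per submitted item.
     */
    public static <A> List<A> collectAll(List<A> items,
                                         BiConsumer<A, Consumer<A>> asyncAnnotate,
                                         long timeoutSeconds) {
        // Callbacks may arrive on different threads, so the list is synchronized.
        List<A> done = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch remaining = new CountDownLatch(items.size());
        for (A item : items) {
            asyncAnnotate.accept(item, finished -> {
                done.add(finished);
                remaining.countDown();
            });
        }
        try {
            if (!remaining.await(timeoutSeconds, TimeUnit.SECONDS)) {
                throw new IllegalStateException(
                        "Timed out with " + remaining.getCount() + " callbacks pending");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
        return new ArrayList<>(done);
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        // Hypothetical stand-in for an asynchronous annotate call.
        BiConsumer<StringBuilder, Consumer<StringBuilder>> fakeAnnotate =
                (doc, callback) -> workers.execute(() -> {
                    doc.append(" [annotated]");
                    callback.accept(doc);
                });
        List<StringBuilder> docs =
                Arrays.asList(new StringBuilder("one"), new StringBuilder("two"));
        // Results arrive in completion order, which may differ from input order.
        System.out.println(collectAll(docs, fakeAnnotate, 30));
        workers.shutdown();
    }
}
```

The results come back in completion order, so if each result has to be matched to its input string, the input would need to travel with the item (for example as a map key set inside the callback).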
