Is it possible to get confidence scores for spaCy named entity recognition?

I need to get confidence scores for the predictions made by spaCy NER.


CSV file


Text,Amount & Nature,Percent of Class
"T. Rowe Price Associates, Inc.","28,223,360 (1)",8.7% (1)
100 E. Pratt Street,Not Listed,Not Listed
"Baltimore, MD 21202",Not Listed,Not Listed
"BlackRock, Inc.","21,871,854 (2)",6.8% (2)
55 East 52nd Street,Not Listed,Not Listed
"New York, NY 10022",Not Listed,Not Listed
The Vanguard Group,"21,380,085 (3)",6.64% (3)
100 Vanguard Blvd.,Not Listed,Not Listed
"Malvern, PA 19355",Not Listed,Not Listed
FMR LLC,"20,784,414 (4)",6.459% (4)
245 Summer Street,Not Listed,Not Listed
"Boston, MA 02210",Not Listed,Not Listed

Code


import csv
import pandas as pd
import spacy

# Load the model once instead of reloading it for every row.
nlp = spacy.load('en_core_web_sm')

with open('/path/table.csv') as csvfile:
    reader1 = csv.DictReader(csvfile)
    data1 = [["Text", "Amount & Nature", "Prediction"]]
    for row in reader1:
        AmountNature = row["Amount & Nature"]
        doc1 = nlp(row["Text"])

        # Default label if no entity is found; otherwise keep the label of
        # the last entity in the text, as in the original output.
        label1 = "Not Found"
        for ent in doc1.ents:
            #output = [ent.text, ent.start_char, ent.end_char, ent.label_]
            label1 = ent.label_
            text1 = ent.text
        data1.append([str(doc1), AmountNature, label1])

my_df1 = pd.DataFrame(data1)
my_df1.columns = my_df1.iloc[0]
my_df1 = my_df1.drop(my_df1.index[[0]])
my_df1.to_csv('/path/output.csv', index=False, header=["Text", "Amount & Nature", "Prediction"])

Output CSV


Text,Amount & Nature,Prediction
"T. Rowe Price Associates, Inc.","28,223,360 (1)",ORG
100 E. Pratt Street,Not Listed,FAC
"Baltimore, MD 21202",Not Listed,CARDINAL
"BlackRock, Inc.","21,871,854 (2)",ORG
55 East 52nd Street,Not Listed,LOC
"New York, NY 10022",Not Listed,DATE
The Vanguard Group,"21,380,085 (3)",ORG
100 Vanguard Blvd.,Not Listed,FAC
"Malvern, PA 19355",Not Listed,DATE
FMR LLC,"20,784,414 (4)",ORG
245 Summer Street,Not Listed,CARDINAL
"Boston, MA 02210",Not Listed,GPE

For the output above, is it possible to get a confidence score for each spaCy NER prediction? If so, how can I achieve that?


Can someone help me with this?


紫衣仙女

3 Answers

慕雪6442864

No, unfortunately it is not possible to get per-prediction confidence scores out of spaCy. While the F1 score is useful for overall evaluation, I would prefer spaCy to provide individual confidence scores for its predictions, and it currently does not.

繁星点点滴滴

Either obtain a fully annotated dataset or annotate the data manually yourself (since you already have a CSV file, that is probably your preferred option). That way you can compare the ground truth against your spaCy predictions, and from there you can compute a confusion matrix. I would suggest using the F1 score as a measure of confidence. There are some great links discussing various publicly available datasets and annotation approaches (including CRF).
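To make that suggestion concrete, here is a minimal evaluation sketch (my own addition, not part of this answer) using spaCy v2's Scorer and GoldParse, assuming you have hand-annotated gold entity spans; the two example texts and their character offsets are made up for illustration.

# A minimal sketch, assuming spaCy v2.x and hand-annotated gold entity spans.
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

nlp = spacy.load('en_core_web_sm')
scorer = Scorer()

# Hypothetical gold annotations: (text, [(start_char, end_char, label), ...])
examples = [
    ("T. Rowe Price Associates, Inc.", [(0, 30, "ORG")]),
    ("The Vanguard Group", [(0, 18, "ORG")]),
]

for text, entity_offsets in examples:
    gold = GoldParse(nlp.make_doc(text), entities=entity_offsets)
    pred = nlp(text)          # run the pipeline to get predicted entities
    scorer.score(pred, gold)  # accumulate precision, recall and F1

print(scorer.scores['ents_p'], scorer.scores['ents_r'], scorer.scores['ents_f'])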

慕无忌1623718

There is no straightforward answer to this. First, spaCy's named entity recognizer actually optimizes two distinct objectives:

A greedy imitation-learning objective. This objective asks, "Starting from this state, which of the available actions will not introduce any new errors?"

A global beam-search objective. Instead of optimizing individual transition decisions, the global model asks whether the final parse is correct. To optimize this objective, we build sets of the top-k most likely incorrect parses and the top-k most likely correct parses.

Note: tested with spaCy v2.0.13

import spacy
import sys
from collections import defaultdict

nlp = spacy.load('en')
text = 'Hi there! Hope you are doing good. Greetings from India.'

with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# Number of alternate analyses to consider. More is slower, and not
# necessarily better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked
# action by this value, and use the result as a threshold. This prevents the
# parser from exploring options that look very unlikely, saving a bit of
# efficiency. Accuracy may also improve, because we've trained on the greedy
# objective.
beam_density = 0.0001

beams, _ = nlp.entity.beam_parse([doc], beam_width, beam_density)

entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

Output:

Label: GPE, Text: India, Score: 0.9999509961251819
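As a follow-up (my own sketch, not part of the answer above), the same beam-scoring logic could be wrapped in a small helper and applied to the "Text" column from the question's CSV. The helper name ner_with_scores is made up, and it assumes the same spaCy v2.x beam_parse API shown above.

# Hypothetical helper wrapping the beam-parse scoring above; assumes spaCy v2.x.
from collections import defaultdict
import spacy

nlp = spacy.load('en_core_web_sm')

def ner_with_scores(text, beam_width=16, beam_density=0.0001, threshold=0.2):
    # Parse without the greedy NER pass, then score entities with the beam.
    with nlp.disable_pipes('ner'):
        doc = nlp(text)
    beams, _ = nlp.entity.beam_parse([doc], beam_width, beam_density)
    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score
    # Keep only entities whose accumulated beam score clears the threshold.
    return [(doc[start:end].text, label, score)
            for (start, end, label), score in entity_scores.items()
            if score > threshold]

# Example usage on two rows from the question's "Text" column.
for row_text in ["T. Rowe Price Associates, Inc.", "Baltimore, MD 21202"]:
    print(row_text, ner_with_scores(row_text))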