spaCy - Tokenization of hyphenated words

Good day,


I am trying to post-process hyphenated words that the tokenizer splits into separate tokens when they should be a single token.


Example:


Sentence: "up-scaled"

Tokens: ['up', '-', 'scaled']

Expected: ['up-scaled']

Right now, my solution is to use the Matcher:


import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en")  # spaCy v2 shortcut name

matcher = Matcher(nlp.vocab)

# alphabetic token, a hyphen, then another alphabetic token
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]

matcher.add('HYPHENATED', None, pattern)  # spaCy v2 signature: (key, on_match, *patterns)

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    #print(doc)
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer

doc = nlp(text)  # text is the input string to process

However, this causes the following expected problem:


Example 2:


Sentence: "I know I will be back - I had a very pleasant time"

Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']

Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']

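Is there a way to process only words that are separated by a hyphen with no spaces between the characters? That way, words like "up-scaled" would be matched and combined into a single token, but ".. back - I .." would not.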


Thank you very much.


Edit: I have already tried the solution posted in: Why doesn't spaCy preserve intra-word hyphens during tokenization like Stanford CoreNLP does?


However, I did not use that solution because it resulted in incorrect tokenization of words with apostrophes (') and of numbers with decimals:


Sentence: "It's"

Tokens: ["I", "t's"]

Expected: ["It", "'s"]


Sentence: "1.50"

Tokens: ["1", ".", "50"]

Expected: ["1.50"]

That is why I used the Matcher instead of trying to edit the regex.


偶然的你
2 answers

慕后森

The Matcher is not really the right tool for this. You should modify the tokenizer instead.

If you want to preserve how everything else is handled and only change the behavior for hyphens, you should modify the existing infix patterns and keep all the other settings. The current English infix pattern definition is here:

https://github.com/explosion/spaCy/blob/58533f01bf926546337ad2868abe7fc8f0a3b3ae/spacy/lang/punctuation.py#L37-L49

You can add new patterns without defining a custom tokenizer, but there is no way to remove a pattern without defining one. So, if you comment out the hyphen pattern and define a custom tokenizer:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            # the hyphen infix pattern is commented out so hyphenated words stay whole
            #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )

    infix_re = compile_infix_regex(infixes)

    # reuse the existing prefix/suffix/token_match settings; only the infixes change
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp = spacy.load("en")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's '1.50', up-scaled haven't")])
# ['It', "'s", "'", '1.50', "'", ',', 'up-scaled', 'have', "n't"]

You do need to provide the current prefix/suffix/token_match settings when initializing the new Tokenizer in order to preserve the existing tokenizer behavior. See also (for German, but very similar):

https://stackoverflow.com/a/57304882/461847

Edited to add (since this does seem unnecessarily complicated, and you really should be able to redefine the infix patterns without loading a whole new custom tokenizer):

If you have just loaded the model (for v2.1.8) and you have not called nlp() yet, you can also assign infix_re.finditer to the tokenizer directly without creating a custom tokenizer:

nlp = spacy.load('en')
nlp.tokenizer.infix_finditer = infix_re.finditer

There is a caching bug, which should hopefully be fixed in v2.2, that prevents this from working at any point other than right after loading the model. (The behavior is extremely confusing otherwise, which is why creating a custom tokenizer is the better general-purpose recommendation for v2.1.8.)
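To see why dropping the hyphen infix pattern keeps "up-scaled" whole while a free-standing " - " is unaffected, here is a minimal standard-library sketch. It is not spaCy's tokenizer: `toy_tokenize`, `HYPHEN_INFIX` and `COMMA_INFIX` are hypothetical names, and the patterns are simplified ASCII-only stand-ins for the real infix rules. It only mimics the idea of splitting whitespace-delimited chunks wherever an infix regex matches.

```python
import re

# Simplified, ASCII-only stand-ins for two infix rules (assumptions, not spaCy's patterns).
HYPHEN_INFIX = r"(?<=[A-Za-z])-(?=[A-Za-z])"   # hyphen with a letter on both sides
COMMA_INFIX = r"(?<=[A-Za-z]),(?=[A-Za-z])"    # comma with a letter on both sides


def toy_tokenize(text, infix_patterns):
    """Split on whitespace, then split each chunk at every infix match."""
    infix_re = re.compile("|".join(infix_patterns))
    tokens = []
    for chunk in text.split():
        pos = 0
        for m in infix_re.finditer(chunk):
            if m.start() > pos:
                tokens.append(chunk[pos:m.start()])  # text before the infix
            tokens.append(m.group())                 # the infix itself
            pos = m.end()
        if pos < len(chunk):
            tokens.append(chunk[pos:])               # remainder of the chunk
    return tokens


sentence = "up-scaled back - I"
with_hyphen = toy_tokenize(sentence, [HYPHEN_INFIX, COMMA_INFIX])
without_hyphen = toy_tokenize(sentence, [COMMA_INFIX])

print(with_hyphen)     # ['up', '-', 'scaled', 'back', '-', 'I']
print(without_hyphen)  # ['up-scaled', 'back', '-', 'I']
```

The free-standing hyphen in "back - I" is already its own whitespace-delimited chunk, so it stays a separate token in both cases. Only the intra-word hyphen depends on the infix rule, which is why editing the infix list is a more precise fix than merging tokens afterwards.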

料青山看我应如是

If nlp = spacy.load('en') throws an error, use nlp = spacy.load("en_core_web_sm") instead.