How to get the offsets of a matched n-gram in a text

I want to match a string (an n-gram) in a text and get its offsets:


string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."


As a result, I would like to get a tuple like ("matched", 44, 75), where 44 is the start and 75 is the end.


This is the code I built, but it only works for unigrams.


def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets


def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match": # here string_to_match is only one word
            entities.append(elm)
    return entities


offsets = extract_offsets(text)
entities = get_entities(offsets) # [("matched", start, end)]

Any hints on making this work for a sequence of strings, i.e. n-grams, would be appreciated!


噜噜哒
1 Answer

鸿蒙传说

You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:

import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() makes the n-gram match literally, even if it contains regex metacharacters
    matches = re.finditer(re.escape(string_to_match), text)
    for x in matches:
        # x.span() returns the start and (exclusive) end indices of the matched substring as a tuple
        print(x.group(0), x.span())
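To get the result in the tuple format the question asks for, the re.finditer() approach can be wrapped in a small helper. This is only a minimal sketch: the function name find_ngram_offsets and its label parameter are made up for illustration, and it assumes the n-gram should match as a literal, case-sensitive substring with non-overlapping occurrences.

```python
import re


def find_ngram_offsets(text, ngram, label="matched"):
    """Return (label, start, end) for each non-overlapping occurrence of ngram.

    Hypothetical helper, not part of any library. The end index is exclusive,
    so text[start:end] == ngram.
    """
    # Escape the n-gram so characters such as '.' or '(' are matched literally.
    pattern = re.compile(re.escape(ngram))
    return [(label, m.start(), m.end()) for m in pattern.finditer(text)]


text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
string_to_match = "many workers are very underpaid"

print(find_ngram_offsets(text, string_to_match))
# [('matched', 44, 75)]
```

The end index follows the Python slice convention (exclusive). If an inclusive end is preferred, as in the question's extract_offsets, subtract 1 from the end value.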