I want to match a string (an n-gram) in a text and get the offsets of the match:
string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
As a result, I want to get a tuple like ("matched", 44, 75), where 44 is the start offset and 75 is the end offset.
Here is the code I built, but it only works for unigrams.
def extract_offsets(line, _len=len):
    # Record (word, start, end) for each whitespace-separated word; end is inclusive.
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == string_to_match:  # only works when string_to_match is a single word
            entities.append(("matched", elm[1], elm[2]))
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets)  # [("matched", start, end)]
Any tips on making this work for a sequence of words (an n-gram)?
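One possible direction, offered as a minimal sketch rather than a definitive answer: instead of splitting the text into words, search for the whole n-gram with the standard `re` module and read the offsets from each match object. The function name `find_ngram_offsets` and its `label` parameter are hypothetical names chosen for illustration. The sketch assumes the n-gram must appear verbatim (same spacing and case) and that the end offset is exclusive, so that `text[start:end]` is the matched string.

```python
import re

def find_ngram_offsets(text, ngram, label="matched"):
    """Return (label, start, end) for every occurrence of ngram in text.

    The end offset is exclusive, so text[start:end] == ngram.
    """
    # re.escape treats the n-gram as literal text; the lookarounds reject
    # matches that start or end in the middle of a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(ngram) + r"(?!\w)")
    return [(label, m.start(), m.end()) for m in pattern.finditer(text)]

string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

print(find_ngram_offsets(text, string_to_match))  # [('matched', 44, 75)]
```

If an inclusive end offset is preferred, as in the unigram code above, subtracting 1 from `m.end()` would give it. Matching regardless of case or of variable whitespace between words would need a different pattern and is not covered by this sketch.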