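Background
Anyone who has used the well-known NLP toolkit NLTK knows that the 'model' subpackage was removed when NLTK was updated to version 3.0. The interfaces it depended on had changed substantially, migrating 'model' ran into problems, and the maintainers removed it temporarily but have still not merged it back. That is a real pity, because it contained the Ngram model we use so often!
In its place, however, the maintainers provide a base class for Ngram models, BaseNgramModel, on the 'model' branch, and users can build their own models on top of it. Starting from this base class, the author implemented a recursive NgramCounter and then re-implemented the Katz backoff smoothed Ngram model from version 2.x. The code is hosted on github. Below, the author gives a brief walkthrough of the implementation.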
BaseNgramModel
Let's first see what BaseNgramModel looks like:
from math import log

from nltk import compat

# Log score returned for zero-probability events
NEG_INF = float("-inf")


@compat.python_2_unicode_compatible
class BaseNgramModel(object):
    """An example of how to consume NgramCounter to create a language model.

    This class isn't intended to be used directly, folks should inherit from it
    when writing their own ngram models.
    """

    def __init__(self, ngram_counter):
        self.ngram_counter = ngram_counter
        # for convenient access save top-most ngram order ConditionalFreqDist
        self.ngrams = ngram_counter.ngrams[ngram_counter.order]
        self._ngrams = ngram_counter.ngrams
        self._order = ngram_counter.order
        self._check_against_vocab = self.ngram_counter.check_against_vocab

    def check_context(self, context):
        """Makes sure context not longer than model's ngram order and is a tuple."""
        if len(context) >= self._order:
            raise ValueError("Context is too long for this ngram order: {0}".format(context))
        # ensures the context argument is a tuple
        return tuple(context)

    def score(self, word, context):
        """
        This is a dummy implementation. Child classes should define their own
        implementations.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        return 0.5

    def logscore(self, word, context):
        """
        Evaluate the log probability of this word in this context.

        This implementation actually works, child classes don't have to
        redefine it.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        score = self.score(word, context)
        if score == 0.0:
            return NEG_INF
        return log(score, 2)

    def entropy(self, text):
        """
        Calculate the approximate cross-entropy of the n-gram model for a
        given evaluation text.
        This is the average log probability of each word in the text.

        :param text: words to use for evaluation
        :type text: Iterable[str]
        """
        normed_text = (self._check_against_vocab(word) for word in text)
        H = 0.0  # entropy is conventionally denoted by "H"
        processed_ngrams = 0
        for ngram in self.ngram_counter.to_ngrams(normed_text):
            context, word = tuple(ngram[:-1]), ngram[-1]
            H += self.logscore(word, context)
            processed_ngrams += 1
        return - (H / processed_ngrams)

    def perplexity(self, text):
        """
        Calculates the perplexity of the given text.
        This is simply 2 ** cross-entropy for the text.

        :param text: words to calculate perplexity of
        :type text: Iterable[str]
        """
        return pow(2.0, self.entropy(text))

As we can see, to re-implement NgramModel by inheriting from this class, we have two main tasks:
Implement the class of the constructor argument ngram_counter
Override the score method in the derived class
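To make the division of labour concrete, here is a minimal sketch that uses only the standard library. ToyCounter and ToyMLEModel are hypothetical names invented for illustration, not part of NLTK or of the author's code; the toy model copies the shape of BaseNgramModel (logscore, entropy, perplexity) and supplies an unsmoothed maximum-likelihood score, which is an assumption made here to keep the example short.

```python
import math
from collections import Counter, defaultdict


class ToyCounter:
    """Hypothetical stand-in exposing the attributes the base class reads."""

    def __init__(self, order, sentences):
        self.order = order
        # ngrams[order] maps a context tuple to a Counter of following words.
        self.ngrams = {order: defaultdict(Counter)}
        for sent in sentences:
            for gram in self.to_ngrams(sent):
                self.ngrams[order][gram[:-1]][gram[-1]] += 1

    def check_against_vocab(self, word):
        return word  # no unknown-word handling in this sketch

    def to_ngrams(self, text):
        tokens = list(text)
        for i in range(len(tokens) - self.order + 1):
            yield tuple(tokens[i:i + self.order])


class ToyMLEModel:
    """Same shape as a BaseNgramModel subclass; only score() is model-specific."""

    def __init__(self, ngram_counter):
        self.ngram_counter = ngram_counter
        self.ngrams = ngram_counter.ngrams[ngram_counter.order]

    def score(self, word, context):
        # Unsmoothed relative frequency of `word` after `context`.
        counts = self.ngrams.get(tuple(context))
        if not counts:
            return 0.0
        return counts[word] / sum(counts.values())

    def logscore(self, word, context):
        p = self.score(word, context)
        return float("-inf") if p == 0.0 else math.log2(p)

    def entropy(self, text):
        grams = list(self.ngram_counter.to_ngrams(text))
        total = sum(self.logscore(g[-1], g[:-1]) for g in grams)
        return -total / len(grams)

    def perplexity(self, text):
        return 2.0 ** self.entropy(text)


counter = ToyCounter(2, [["a", "b", "a", "c"]])
model = ToyMLEModel(counter)
print(model.score("b", ("a",)))          # P(b | a)
print(model.perplexity(["a", "b", "a"]))
```

Everything except score is generic, which is why the derived class only has to override that one method once a suitable counter exists.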
NgramCounter
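As the code above shows, the class of the ngram_counter argument must provide the following attributes and methods:
order: attribute, int, the order of the model
ngrams: attribute, dict<int, ConditionalFreqDist>, the conditional frequency distributions of every order
vocabulary: attribute, set<tuple<str>>, the set of ngrams seen in training
to_ngrams: method, (list<str>) -> yield tuple<str>, generates ngrams from the input text
check_against_vocab: method, (str) -> str, maps a word against the vocabulary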
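As a rough illustration of what these members hold, the sketch below builds context-to-word count tables for every order from n down to 1 using only the standard library. The helper names padded_ngrams and count_all_orders are hypothetical, and the sketch uses a plain loop over the orders instead of a recursive constructor; it assumes left padding with n-1 empty-string symbols.

```python
from collections import Counter, defaultdict


def padded_ngrams(tokens, n, pad_left=True, left_pad_symbol=""):
    """Yield n-gram tuples, optionally left-padding with n-1 pad symbols."""
    seq = list(tokens)
    if pad_left:
        seq = [left_pad_symbol] * (n - 1) + seq
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])


def count_all_orders(sentences, order):
    """Return {k: {context: Counter(word)}} for every k from order down to 1."""
    tables = {}
    for k in range(order, 0, -1):
        table = defaultdict(Counter)
        for sent in sentences:
            for gram in padded_ngrams(sent, k):
                # Split each n-gram into (context, word) and count the word.
                table[gram[:-1]][gram[-1]] += 1
        tables[k] = table
    return tables


tables = count_all_orders([["the", "cat", "sat"]], order=2)
print(dict(tables[2]))  # bigram contexts, including the padded start
print(dict(tables[1]))  # unigram counts under the empty context ()
```

Keeping one table per order is what later lets a backoff model fall through from a long context to a shorter one.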
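This is a piece of cake. The only point that needs care is the recursive generation of the lower-order models, because the Katz backoff smoothing model relies on that data structure. As an aside, although Python makes no distinction between public and private attributes, it is best not to access an object's attributes directly from outside the class; protect them with @property and @xxx.setter instead, for the usual encapsulation reasons. The implementation is as follows: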
from nltk import compat
from nltk.probability import ConditionalFreqDist
from nltk.util import ngrams


class NgramCounter(object):
    """
    An NgramCounter written against the model base class 'BaseNgramModel' provided for NLTK 3.0.

    Required attributes and methods:
    - order: attribute, int, the order of the model
    - ngrams: attribute, dict<int, ConditionalFreqDist>, the conditional frequency distributions of every order
    - vocabulary: attribute, set<tuple<str>>, the set of ngrams seen in training
    - to_ngrams: method, (list<str>) -> yield tuple<str>, generates ngrams from the input text
    - check_against_vocab: method, (str) -> str, maps a word against the vocabulary
    """

    def __init__(self, order: int, train: list,
                 pad_left: bool = True, pad_right: bool = False,
                 left_pad_symbol: str = '', right_pad_symbol: str = '',
                 recursive: bool = True):
        """
        :param order: order of the model
        :param train: training samples
        :param pad_left: whether to pad on the left
        :param pad_right: whether to pad on the right
        :param left_pad_symbol: left padding symbol
        :param right_pad_symbol: right padding symbol
        :param recursive: whether to build the lower-order models
        """
        self._ngrams = dict()
        # The model order must be greater than 0
        assert (order > 0), order
        # Save the model order
        self._order = order

        # Padding settings
        assert (isinstance(pad_left, bool))
        assert (isinstance(pad_right, bool))
        self._pad_left = pad_left
        self._pad_right = pad_right
        self._left_pad_symbol = left_pad_symbol
        self._right_pad_symbol = right_pad_symbol

        cfd = ConditionalFreqDist()
        self._vocabulary = set()

        # Input adaptation: if the training data is not list<list<str>>, wrap it in a list
        if (train is not None) and isinstance(train[0], compat.string_types):
            train = [train]

        for sent in train:
            for ngram in self.to_ngrams(sent):
                self._vocabulary.add(ngram)
                context = tuple(ngram[:-1])
                token = ngram[-1]
                # NB: the ConditionalFreqDist interface has changed and no longer
                # has the 'inc' method, so the statement below is used instead
                cfd[context][token] += 1

        self._ngrams[self._order] = cfd

        # NB, key code: recursively build the lower-order NgramCounter.
        # When recursing, build the lower-order distributions, and remember to
        # fetch the distributions of orders order-2 down to 1 as well
        if recursive and not order == 1:
            self._backoff = NgramCounter(order - 1, train,
                                         pad_left=pad_left, left_pad_symbol=left_pad_symbol,
                                         pad_right=pad_right, right_pad_symbol=right_pad_symbol)
            # Walk the backoff chain and collect every lower-order distribution
            cursor = self._backoff
            while cursor is not None:
                self._ngrams[cursor.order] = cursor.ngrams[cursor.order]
                cursor = cursor.backoff
        else:
            self._backoff = None

    @property
    def order(self) -> int:
        return self._order

    @property
    def vocabulary(self) -> set:
        return self._vocabulary

    @property
    def ngrams(self) -> dict:
        return self._ngrams

    @property
    def backoff(self) -> 'NgramCounter':
        # None when this is the lowest-order counter
        return self._backoff

    def check_against_vocab(self, word) -> str:
        """
        Unknown words are currently not handled in any way.

        :param word:
        """
        return word

    def to_ngrams(self, text):
        # Returns an iterator over ngram tuples
        return ngrams(text, self._order,
                      pad_left=self._pad_left, pad_right=self._pad_right,
                      left_pad_symbol=self._left_pad_symbol, right_pad_symbol=self._right_pad_symbol)

NgramModel
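With a recursive NgramCounter in hand, we can inherit from BaseNgramModel and revive NgramModel. Two points need attention:
Call the parent class constructor first, because it initializes the various attributes
Mind the recursion into the lower-order models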
Talk is cheap, show me the code:
from nltk.probability import ConditionalProbDist


class NgramModel(BaseNgramModel):
    """
    A re-implementation of NgramModel that inherits from the model base class 'BaseNgramModel'.

    Note:
    1. The original methods 'prob' and 'logprob' have been renamed 'score' and 'logscore' respectively.
    2. The original 'entropy' padded the input text explicitly, whereas 'entropy' in the base class
       'BaseNgramModel' does not. However, the base class's 'entropy' calls 'to_ngrams' of
       'NgramCounter', which already applies the padding, so 'entropy' does not need to be overridden.
    """

    def __init__(self, ngram_counter, estimator=None, *estimator_args, **estimator_kwargs):
        super(NgramModel, self).__init__(ngram_counter)

        # Left padding used by text generation. Assumption: '_lpad' is not set anywhere
        # in the code shown, so this mirrors the counter's default empty-string pad symbol.
        self._lpad = ('',) * (self._order - 1)

        # Set the frequency smoother; fall back to the default when none is given.
        # '_estimator' is the module-level default estimator, which is not shown here.
        if estimator is None:
            estimator = _estimator

        # Apply the frequency smoother to build the ngram model
        if not estimator_args and not estimator_kwargs:
            self._model = ConditionalProbDist(self.ngrams, estimator, len(self.ngrams))
        else:
            self._model = ConditionalProbDist(self.ngrams, estimator, *estimator_args, **estimator_kwargs)

        # Recursively build the lower-order model
        if self._order > 1 and self.ngram_counter.backoff is not None:
            self._backoff = NgramModel(self.ngram_counter.backoff, estimator,
                                       *estimator_args, **estimator_kwargs)

    def score(self, word, context):
        """
        Evaluate the probability of this word in this context using Katz Backoff.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: list(str)
        """
        context = tuple(context)
        # NB: the base class 'BaseNgramModel' has already assigned the attribute '_ngrams'
        # to the collection of ConditionalFreqDist objects held by 'NgramCounter'.
        # The vocabulary is really the 'vocabulary' attribute of NgramCounter, hence the change:
        # if (context + (word,) in self._ngrams) or (self._n == 1):
        if (context + (word,) in self.ngram_counter.vocabulary) or (self._order == 1):
            return self[context].prob(word)
        else:
            return self._alpha(context) * self._backoff.score(word, context[1:])

    def _alpha(self, tokens):
        return self._beta(tokens) / self._backoff._beta(tokens[1:])

    def _beta(self, tokens):
        return self[tokens].discount() if tokens in self else 1

    def choose_random_word(self, context):
        """
        Randomly select a word that is likely to appear in this context.

        :param context: the context the word is in
        :type context: list(str)
        """
        return self.generate(1, context)[-1]

    # NB, this will always start with same word if the model
    # was trained on a single text
    def generate(self, num_words, context=()):
        """
        Generate random text based on the language model.

        :param num_words: number of words to generate
        :type num_words: int
        :param context: initial words in generated string
        :type context: list(str)
        """
        text = list(context)
        for i in range(num_words):
            text.append(self._generate_one(text))
        return text

    def _generate_one(self, context):
        # The 2.x attribute '_n' is called '_order' in the base class
        context = (self._lpad + tuple(context))[-self._order + 1:]
        if context in self:
            return self[context].generate()
        elif self._order > 1:
            return self._backoff._generate_one(context[1:])
        else:
            return '.'

    def __contains__(self, item):
        return tuple(item) in self._model

    def __getitem__(self, item):
        return self._model[tuple(item)]

    def __repr__(self):
        # '_ngrams' is now a dict keyed by order, so count the ngrams via the counter's vocabulary
        return '<NgramModel with %d %d-grams>' % (len(self.ngram_counter.vocabulary), self._order)
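The recursion that score, _alpha and _beta carry out can be summarised in one formula. The notation is introduced here only for illustration and is read off the code rather than taken from a textbook statement of Katz backoff: c is the context, c_{2:} is the context with its first word dropped, P*_n is the smoothed distribution of the order-n model, and beta_n(c) is the value returned by discount() for context c, or 1 when c is not a known context.

```latex
P_{\mathrm{bo}}^{(n)}(w \mid c) =
\begin{cases}
P^{*}_{n}(w \mid c) & \text{if the ngram } (c, w) \text{ is in the vocabulary, or } n = 1, \\[4pt]
\alpha_{n}(c)\, P_{\mathrm{bo}}^{(n-1)}(w \mid c_{2:}) & \text{otherwise,}
\end{cases}
\qquad
\alpha_{n}(c) = \frac{\beta_{n}(c)}{\beta_{n-1}(c_{2:})}
```

Each failed lookup therefore shortens the context by one word and rescales the lower-order score by alpha, until the unigram model is reached and the recursion stops.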
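Conclusion
The revived model produces exactly the same results as the original model in 2.x. You can test this yourself, or simply run the code on github.
Author: KAMIWei
Link: https://www.jianshu.com/p/75b96aae77be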