Warning: tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words

I am building a chatbot in Python. Here is the code:


import nltk
import numpy as np
import random
import string

# Requires the NLTK data packages 'punkt' and 'wordnet'
# (nltk.download('punkt'); nltk.download('wordnet'))
f = open('/home/hostbooks/ML/stewy/speech/chatbot.txt', 'r', errors='ignore')
raw = f.read()
raw = raw.lower()  # convert to lowercase

sent_tokens = nltk.sent_tokenize(raw)  # convert to a list of sentences
word_tokens = nltk.word_tokenize(raw)  # convert to a list of words

lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey", "hii")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def response(user_response):
    robo_response = ''
    sent_tokens.append(user_response)

    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]  # index of the most similar sentence
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]         # its similarity score

    if req_tfidf == 0:
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response
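
For context, a minimal interactive loop (a sketch, not part of the original post, assuming the usual pattern for this kind of bot) that shows how greeting() and response() are typically driven; each call to response() refits the TfidfVectorizer, which is the point where the warning in the title appears:

flag = True
print("ROBO: Hi! I will answer your queries. If you want to exit, type bye.")
while flag:
    user_response = input().lower()
    if user_response == 'bye':
        flag = False
        print("ROBO: Bye! Take care.")
    elif greeting(user_response) is not None:
        print("ROBO: " + greeting(user_response))
    else:
        print("ROBO: " + response(user_response))
        sent_tokens.remove(user_response)  # undo the append done inside response()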


1 Answer

慕码人2483693

The reason is that you have used a custom tokenizer together with the default stop_words='english', so when the features are extracted a check is made for any inconsistency between stop_words and tokenizer.

If you dig into the code, you will find that this snippet in sklearn/feature_extraction/text.py is performing the consistency check:

def _check_stop_words_consistency(self, stop_words, preprocess, tokenize):
    """Check if stop words are consistent

    Returns
    -------
    is_consistent : True if stop words are consistent with the preprocessor
                    and tokenizer, False if they are not, None if the check
                    was previously performed, "error" if it could not be
                    performed (e.g. because of the use of a custom
                    preprocessor / tokenizer)
    """
    if id(self.stop_words) == getattr(self, '_stop_words_id', None):
        # Stop words are were previously validated
        return None

    # NB: stop_words is validated, unlike self.stop_words
    try:
        inconsistent = set()
        for w in stop_words or ():
            tokens = list(tokenize(preprocess(w)))
            for token in tokens:
                if token not in stop_words:
                    inconsistent.add(token)
        self._stop_words_id = id(self.stop_words)

        if inconsistent:
            warnings.warn('Your stop_words may be inconsistent with '
                          'your preprocessing. Tokenizing the stop '
                          'words generated tokens %r not in '
                          'stop_words.' % sorted(inconsistent))

As you can see, it raises this warning whenever an inconsistency is found.
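
If you want to get rid of the warning rather than just understand it, one option (a sketch based on an assumption, not stated in the answer above) is to feed the vectorizer a stop word list that has already been run through the same LemNormalize tokenizer, so that lemmatized forms such as 'ha' and 'wa' are themselves stop words:

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Sketch of a possible fix (assumption): lemmatize sklearn's built-in English
# stop word list with the same LemNormalize used as the tokenizer, so the
# stop_words list and the tokenizer agree and the consistency check passes.
lemmatized_stop_words = set()
for w in ENGLISH_STOP_WORDS:
    lemmatized_stop_words.update(LemNormalize(w))  # e.g. 'was' -> 'wa', 'has' -> 'ha'

TfidfVec = TfidfVectorizer(tokenizer=LemNormalize,
                           stop_words=list(lemmatized_stop_words))

Alternatively, the message is only a UserWarning, so the original code keeps working even if you leave it as is.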