
ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error

I am running a very simple experiment in which a ColumnTransformer is meant to transform an array of columns, ["a"] in this example:


import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])], remainder="passthrough")
clmn.fit_transform(dataset)

This gives me:


ValueError: empty vocabulary; perhaps the documents only contain stop words

Clearly, TfidfVectorizer can run fit_transform() on its own:


tfidf.fit_transform(dataset.a)

<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

What could be causing this error, and how can I fix it?


萧十郎
2 Answers

小怪兽爱吃肉

That is because you are passing ["a"] instead of "a" to ColumnTransformer. According to the documentation:

A scalar string or int should be used where the transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

Now, TfidfVectorizer requires an iterable of strings as input (hence a 1-d array of strings). But since you are passing a list of column names to ColumnTransformer (even though that list contains only one column), a 2-d array is what gets passed to TfidfVectorizer. Hence the error.

Change it to:

clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")

For a better understanding, try selecting data from a pandas DataFrame in both ways. Check the format (dtype, shape) of the returned data when doing:

dataset['a']

vs

dataset[['a']]

Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. If you want to do tfidf on two columns, then you need to pass two transformers. Something like this:

tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"),
                          ("tfidf_2", tfidf_2, "b")
                         ],
                         remainder="passthrough")
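To make the difference concrete, here is a minimal sketch that reuses the question's data. It uses a default TfidfVectorizer() rather than the question's min_df=0, and the variable names as_series, as_frame and result are mine, not the asker's. As far as I can tell, the "empty vocabulary" message comes from the fact that iterating a DataFrame yields its column labels, so the vectorizer only ever sees the one-character "document" "a", which the default token pattern discards.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# A scalar key selects a 1-D Series; a list key selects a 2-D DataFrame.
as_series = dataset["a"]
as_frame = dataset[["a"]]
print(as_series.shape)  # (2,)
print(as_frame.shape)   # (2, 1)

# Iterating a DataFrame yields column labels, not rows.
print(list(as_frame))   # ['a']

# With the scalar key, the vectorizer receives the two strings themselves.
clmn = ColumnTransformer(
    [("tfidf", TfidfVectorizer(), "a")],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)
print(result.shape)     # (2, 6): 5 tf-idf features plus the passthrough column "c"
```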

德玛西亚99

We can create a custom tfidf transformer that accepts an array of columns and joins them before applying .fit() or .transform(). Try this!

from sklearn.base import BaseEstimator, TransformerMixin

class custom_tfidf(BaseEstimator, TransformerMixin):
    def __init__(self, tfidf):
        self.tfidf = tfidf

    def fit(self, X, y=None):
        # Join the text columns of each row into a single string
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        self.tfidf.fit(joined_X)
        return self

    def transform(self, X):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        return self.tfidf.transform(joined_X)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "word gone with wind"],
                        "b": [" gone fhgf wild", "gone with wind"],
                        "c": [1, 2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", custom_tfidf(tfidf), ['a', 'b'])], remainder="passthrough")
clmn.fit_transform(dataset)

#array([[0.36439074, 0.51853403, 0.72878149, 0.        , 0.        ,
#        0.25926702, 1.        ],
#       [0.        , 0.438501  , 0.        , 0.61629785, 0.61629785,
#        0.2192505 , 2.        ]])

PS: You may instead want to create one tfidf vectorizer per column, then build a dictionary with the column names as keys and the fitted vectorizers as values. That dictionary can then be used when transforming the corresponding columns.
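The dictionary idea from the PS could look roughly like the sketch below. It is only one way to read that suggestion: the names text_columns, vectorizers and features are hypothetical, the vectorizers use default settings, and stacking the per-column matrices with scipy.sparse.hstack is my own choice rather than anything the answer specifies.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({
    "a": ["word gone wild", "word gone with wind"],
    "b": [" gone fhgf wild", "gone with wind"],
    "c": [1, 2],
})
text_columns = ["a", "b"]  # hypothetical list of the text columns to vectorize

# One fitted vectorizer per text column, keyed by column name.
vectorizers = {col: TfidfVectorizer().fit(dataset[col]) for col in text_columns}

# Transform each column with its own vectorizer and stack the results side by side.
features = hstack([vectorizers[col].transform(dataset[col]) for col in text_columns])
print(features.shape)  # (2, 10): 5 terms from "a" plus 5 terms from "b"
```

Unlike the joined-columns transformer, this keeps a separate vocabulary per column, so the same word appearing in "a" and in "b" ends up as two different features.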