Using TF-IDF in a Keras model

I have read my training, test, and validation sentences into train_sentences, test_sentences, and val_sentences.

Then I applied a TF-IDF vectorizer to them:


from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)

X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)
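Note that `transform` returns a scipy sparse matrix, while Keras models expect dense NumPy arrays, so a `.toarray()` call is needed before the data can be fed in. A minimal, self-contained sketch (the two example sentences are made up for illustration):

```python
# Sketch: TfidfVectorizer.transform returns a scipy.sparse matrix,
# which Keras layers cannot consume directly, so convert it to a
# dense NumPy array first.
from sklearn.feature_extraction.text import TfidfVectorizer

train_sentences = ["the cat sat on the mat", "the dog ate my homework"]

vectorizer = TfidfVectorizer(max_features=300)
X_train = vectorizer.fit_transform(train_sentences)  # sparse matrix

X_train_dense = X_train.toarray()  # dense (n_samples, n_features) array
print(X_train_dense.shape)         # (2, 9): two sentences, nine distinct tokens
```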

My model looks like this:


from keras.models import Sequential
from keras.layers import Input, Flatten, Dense

model = Sequential()
model.add(Input(????))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='sigmoid'))

model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Normally, in the word2vec case, we would pass an embedding matrix to the Embedding layer.

How should I use TF-IDF in a Keras model? Please give me a usage example.

Thanks.


富国沪深

1 Answer

收到一只叮咚

I can't think of a compelling reason to combine TF/IDF values with embedding vectors, but here is a possible solution: use the functional API, multiple Inputs, and the concatenate function. To concatenate layer outputs, their shapes must align (except along the axis being concatenated). One approach is to average the embeddings and then concatenate the result with the vector of TF/IDF values.

Setup and some sample data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import keras
from keras.models import Model
from keras.layers import Dense, Activation, concatenate, Embedding, Input
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# some sample training data
bunch = fetch_20newsgroups()
all_sentences = []
for document in bunch.data:
    sentences = document.split("\n")
    all_sentences.extend(sentences)
all_sentences = all_sentences[:1000]

X_train, X_test = train_test_split(all_sentences, test_size=0.1)
len(X_train), len(X_test)

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(X_train)
df_train = vectorizer.transform(X_train)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
maxlen = 50
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_train = pad_sequences(sequences_train, maxlen=maxlen)

Model definition

vocab_size = len(tokenizer.word_index) + 1
embedding_size = 300

input_tfidf = Input(shape=(300,))
input_text = Input(shape=(maxlen,))
embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)

# this averaging method taken from:
# https://stackoverflow.com/a/54217709/1987598
mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)

concatenated = concatenate([input_tfidf, mean_embedding])
dense1 = Dense(256, activation='relu')(concatenated)
dense2 = Dense(32, activation='relu')(dense1)
dense3 = Dense(8, activation='sigmoid')(dense2)

model = Model(inputs=[input_tfidf, input_text], outputs=dense3)
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Model summary output

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_11 (InputLayer)           (None, 50)           0
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 50, 300)      633900      input_11[0][0]
__________________________________________________________________________________________________
input_10 (InputLayer)           (None, 300)          0
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 300)          0           embedding_5[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 600)          0           input_10[0][0]
                                                                 lambda_1[0][0]
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 256)          153856      concatenate_4[0][0]
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 32)           8224        dense_5[0][0]
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 8)            264         dense_6[0][0]
==================================================================================================
Total params: 796,244
Trainable params: 796,244
Non-trainable params: 0
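The shape bookkeeping behind the concatenation can be checked with plain NumPy: averaging the (batch, maxlen, embedding_size) embedding tensor over the sequence axis yields a (batch, embedding_size) matrix, which concatenated with the (batch, 300) TF-IDF block gives the (None, 600) shape shown in the summary. A small sketch with random stand-in data (no Keras required):

```python
import numpy as np

batch, maxlen, embedding_size, n_tfidf = 4, 50, 300, 300

# stand-ins for the layer outputs, filled with random values
embedding_out = np.random.rand(batch, maxlen, embedding_size)  # (4, 50, 300)
tfidf_out = np.random.rand(batch, n_tfidf)                     # (4, 300)

# what the Lambda layer computes: mean over the sequence axis
mean_embedding = embedding_out.mean(axis=1)                    # (4, 300)

# what the concatenate layer computes: join along the feature axis
concatenated = np.concatenate([tfidf_out, mean_embedding], axis=1)
print(concatenated.shape)  # (4, 600), matching the (None, 600) summary row
```

When actually training the two-input model, the sparse TF-IDF matrix has to be densified, e.g. something along the lines of model.fit([df_train.toarray(), sequences_train], y_train, ...), where y_train is an 8-column target matching the sigmoid output (the exact call depends on your labels).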