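I am trying to tokenize a single column in a TensorFlow dataset. The approach I have been using works well when there is only one feature column, for example: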
from collections import Counter

import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
df = pd.DataFrame({"text": text,
                   "target": target})

# Each dataset element is a (text, target) pair of scalar tensors.
training_dataset = (
    tf.data.Dataset.from_tensor_slices((
        tf.cast(df.text.values, tf.string),
        tf.cast(df.target, tf.int32))))

# Newer tensorflow_datasets releases expose this module as tfds.deprecated.text.
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for text, _ in training_dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Keep only the most frequent tokens.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)
However, when I try to do this with a set of feature columns, say the output of make_csv_dataset (where each feature column is named), the approach above fails (ValueError: Attempt to convert a value (OrderedDict([]) to a Tensor.).
Changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) raises a different error: TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>
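
Both errors seem consistent with the element structure changing: with named feature columns, each element looks like a (features, label) pair in which features maps column names to a batch of values, so the loop variable is no longer a single string. The following is a minimal sketch of one way to handle that, not a confirmed fix. It assumes the text column is called "text", and it uses a hypothetical regex-based tokenize helper in place of the tfds tokenizer so that it runs with the standard library alone. The batches here are plain Python stand-ins; with a real dataset, passing dataset.as_numpy_iterator() is assumed to give the same shape of data.

```python
import re
from collections import Counter


def tokenize(text):
    """Stand-in tokenizer (an assumption): lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(batches, column, vocab_size=5000):
    """Return the most frequent tokens found in one named column of (features, label) batches."""
    counts = Counter()
    for features, _ in batches:
        # `features` maps column names to a batch of values, so select the
        # column first, then walk the rows of that batch one at a time.
        for raw in features[column]:
            # TensorFlow strings come back as bytes once converted to NumPy.
            text = raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)
            counts.update(tokenize(text))
    return [token for token, _ in counts.most_common(vocab_size)]


# Plain-Python stand-in for batches shaped like those of a named-column dataset.
sample_batches = [
    ({"text": [b"It was fun, I liked it.", b"It's great."],
      "other": [1.0, 2.0]},
     [0, 1]),
]
vocabulary = build_vocabulary(sample_batches, column="text")
print(vocabulary)
```

The resulting list could then be passed to TokenTextEncoder in the same way as the vocabulary in the single-column version.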