
Referencing and tokenizing a single feature column in a multi-feature TensorFlow dataset

I am trying to tokenize a single column in a TensorFlow dataset. The approach I have been using works fine when there is only one feature column, for example:


text = ["I played it a while but it was alright. The steam was a bit of trouble."

        " The more they move these game to steam the more of a hard time I have"

        " activating and playing a game. But in spite of that it was fun, I "

        "liked it. Now I am looking forward to anno 2205 I really want to "

        "play my way to the moon.",

        "This game is a bit hard to get the hang of, but when you do it's great."]

target = [0, 1]


df = pd.DataFrame({"text": text,

                   "target": target})


training_dataset = (

    tf.data.Dataset.from_tensor_slices((

        tf.cast(df.text.values, tf.string), 

        tf.cast(df.target, tf.int32))))


tokenizer = tfds.features.text.Tokenizer()


lowercase = True

vocabulary = Counter()

for text, _ in training_dataset:

    if lowercase:

        text = tf.strings.lower(text)

    tokens = tokenizer.tokenize(text.numpy())

    vocabulary.update(tokens)



vocab_size = 5000

vocabulary, _ = zip(*vocabulary.most_common(vocab_size))



encoder = tfds.features.text.TokenTextEncoder(vocabulary,

                                              lowercase=True,

                                              tokenizer=tokenizer)
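
For completeness, the resulting encoder works as expected, e.g. (the sample sentence is just illustrative):

# Illustrative check: the encoder maps a sentence to a list of token ids.
print(encoder.encode("This game was fun"))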

However, when I try to do this when there is a set of feature columns, say coming out of make_csv_dataset (where each feature column is named), the above method fails with ValueError: Attempt to convert a value (OrderedDict([])) to a Tensor.

And changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) raises another error: TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>.
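
For reference, the failing setup looks roughly like this (the file name test.csv is illustrative; I write the same two-column DataFrame out and load it back with make_csv_dataset):

df.to_csv("test.csv", index=False)

dataset = tf.data.experimental.make_csv_dataset(
    "test.csv",
    batch_size=2,
    label_name="target",
    num_epochs=1)

# Each element is a (features, label) pair where `features` is an
# OrderedDict mapping column names to *batched* tensors, e.g.
# OrderedDict([('text', <tf.Tensor shape=(2,) dtype=string>)]).
# So `for text, _ in dataset` binds `text` to an OrderedDict, which is
# what tf.strings.lower() cannot convert to a tensor.
for features, _ in dataset:
    print(features)
    break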



2 Answers

宝慕林4294392

The error is simply that tokenizer.tokenize expects a single string, while you are giving it a list. The edit below fixes this: instead of handing the tokenizer a list of strings, loop over the batch and feed it one string at a time.

dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target',
    num_epochs=1)

tokenizer = tfds.features.text.Tokenizer()

lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    for t in text:
        tokens = tokenizer.tokenize(t.numpy())
        vocabulary.update(tokens)
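
After this loop, the rest of the original pipeline carries over unchanged; a minimal sketch reusing the vocabulary built above:

vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)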

哈士奇WWW

Each element of a dataset created by make_csv_dataset is a batch of rows of the CSV file, not a single row; that is why it takes batch_size as an input argument. The for loop currently used for processing and tokenizing the text feature, on the other hand, expects a single input sample (i.e. one row) at a time. Hence, tokenizer.tokenize fails when given a batch of strings and raises TypeError: Expected binary or unicode string, got array(...).

One way to resolve this with minimal change is to first unbatch the dataset, perform all the preprocessing on it, and then batch it again. Fortunately, there is a built-in unbatch method we can use here:

dataset = tf.data.experimental.make_csv_dataset(
    ...,
    # This change is **IMPORTANT**, otherwise the `for` loop would continue forever!
    num_epochs=1)

# Unbatch the dataset; this is required even if you have used `batch_size=1` above.
dataset = dataset.unbatch()

##############################################
# Do all the preprocessing on the dataset here...
##############################################

# When preprocessing is finished and you are ready to use your dataset:
#   1. Batch the dataset (only if needed for or applicable to your specific workflow)
#   2. Repeat the dataset (only if needed for or applicable to your specific workflow)
dataset = dataset.batch(BATCH_SIZE).repeat(NUM_EPOCHS or -1)

The alternative solution, as suggested in @NicolasGervais's answer, is to adapt and modify all the preprocessing code to work on a batch of samples instead of a single sample at a time.
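
To make the unbatch route concrete, here is a minimal sketch that rebuilds the vocabulary loop from the question on the unbatched dataset (the file name test.csv and the column name 'text' are assumptions carried over from the other answer):

dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target',
    num_epochs=1)

# After unbatch(), each element is a single row, so features['text'] is a
# scalar string tensor and the original single-sample code works unchanged.
dataset = dataset.unbatch()

tokenizer = tfds.features.text.Tokenizer()
vocabulary = Counter()
for features, _ in dataset:
    text = tf.strings.lower(features['text'])
    vocabulary.update(tokenizer.tokenize(text.numpy()))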