How to implement a generator function for one-hot encoding

I implemented a generator function to produce one-hot encoded vectors, but the generator throws an error.


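I am using a generator function to produce one-hot encoded vectors, which will later be used as input to a deep learning LSTM model. I am doing this to avoid excessive load and memory failures when one-hot encoding a very large data set. However, the generator function raises an error, and I need help figuring out where I went wrong.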


The earlier code:


X = np.zeros((len(sequences), seq_length, vocab_size), dtype=np.bool)
y = np.zeros((len(sequences), vocab_size), dtype=np.bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1

Here,


sequences = sentences generated from the data set
seq_length = length of each sentence (this is constant)
vocab_size = number of unique words in the dictionary


When run on the large data set, my program produces:


sequences = 44073315

seq_length = 30

vocab_size = 124958

So when the code above is run directly on this input, it gives the error below.


Traceback (most recent call last):
  File "1.py", line 206, in <module>
    X = np.zeros((len(sequences), seq_length, vocab_size), dtype=np.bool)
MemoryError
(my_env) [rjagannath1@login ~]$
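For a sense of scale, the arithmetic behind this MemoryError can be sketched as below. The only assumption is that each NumPy boolean element occupies one byte; the helper name `one_hot_bytes` is made up for illustration. With the sizes quoted above, the full X array would need roughly 165 TB, and even a single batch of 1000 sequences needs about 3.7 GB, so the batch size probably has to be much smaller than 1000 as well.

```python
def one_hot_bytes(n_sequences, seq_length, vocab_size, itemsize=1):
    """Return (X bytes, y bytes) for dense one-hot arrays.

    itemsize=1 assumes a one-byte boolean dtype, as NumPy uses for bool.
    """
    x_bytes = n_sequences * seq_length * vocab_size * itemsize
    y_bytes = n_sequences * vocab_size * itemsize
    return x_bytes, y_bytes


# Sizes taken from the question.
full_x, full_y = one_hot_bytes(44073315, 30, 124958)
# A single batch of 1000 sequences, as tried in the generator test.
batch_x, batch_y = one_hot_bytes(1000, 30, 124958)

print(f"full X:  {full_x / 1e12:.1f} TB, full y:  {full_y / 1e12:.1f} TB")
print(f"batch X: {batch_x / 1e9:.2f} GB, batch y: {batch_y / 1e9:.3f} GB")
```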

So I tried to create a generator function (for testing), as follows:


def gen(batch_size, no_of_sequences, seq_length, vocab_size):
    bs = batch_size
    ns = no_of_sequences
    X = np.zeros((batch_size, seq_length, vocab_size), dtype=np.bool)
    y = np.zeros((batch_size, vocab_size), dtype=np.bool)
    while(ns > bs):
        for i, sentence in enumerate(sequences):
            for t, word in enumerate(sentence):
                X[i, t, vocab[word]] = 1
            y[i, vocab[next_words[i]]] = 1
        print(X.shape())
        print(y.shape())
        yield(X, y)
        ns = ns - bs

for item in gen(1000, 44073315, 30, 124958):
    print(item)

But I get the following error:


File "path_of_file", line 247, in gen
    X[i, t, vocab[word]] = 1

IndexError: index 1000 is out of bounds for axis 0 with size 1000

What mistake did I make in the generator function?


九州编程
1 answer

森栏

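Modify your generator as follows:

batch_i = 0
while(ns > bs):
    s = batch_i*batch_size
    e = (batch_i+1)*batch_size
    for i, sentence in enumerate(sequences[s:e]):

Basically, you want a running window of size batch_size, so you take a running slice of sequences, which appears to be your entire data set. The original loop enumerates all of sequences, so i reaches 1000 and overruns the first axis of X, which has only batch_size rows. You also have to increment batch_i; put it after the yield, so add batch_i += 1.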
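Putting the windowing idea together, a complete generator might look like the sketch below. It is one possible arrangement rather than the answerer's exact code. The function name `one_hot_batches` and the toy data are hypothetical, and the sketch adds three things the answer does not mention, each an assumption about what the asker wants: X and y are re-created for every batch so ones from the previous batch do not carry over, next_words is indexed with the window offset because it spans the whole data set, and the final, shorter batch is yielded too.

```python
import numpy as np


def one_hot_batches(sequences, next_words, vocab, vocab_size, seq_length, batch_size):
    """Yield (X, y) one-hot batches over successive windows of `sequences`."""
    for start in range(0, len(sequences), batch_size):
        window = sequences[start:start + batch_size]
        # Fresh zeroed arrays per batch, sized to the window (the last one may be short).
        X = np.zeros((len(window), seq_length, vocab_size), dtype=bool)
        y = np.zeros((len(window), vocab_size), dtype=bool)
        for i, sentence in enumerate(window):
            # i counts within the window, so it never exceeds the batch axis.
            for t, word in enumerate(sentence):
                X[i, t, vocab[word]] = True
            # next_words covers the whole data set, so offset by the window start.
            y[i, vocab[next_words[start + i]]] = True
        yield X, y


# Toy data, purely illustrative.
vocab = {"a": 0, "b": 1, "c": 2}
sequences = [["a", "b"], ["b", "c"], ["c", "a"]]
next_words = ["c", "a", "b"]

for X, y in one_hot_batches(sequences, next_words, vocab,
                            vocab_size=3, seq_length=2, batch_size=2):
    print(X.shape, y.shape)
```

Note that shape is an attribute of a NumPy array, not a method, so the print(X.shape()) calls in the question would raise a TypeError once the IndexError is fixed; the sketch prints X.shape instead.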