Sequences
Introduction
In real life, a lot of data is not like an image, where a single glance reveals all the information.
A simple example:
Within a sentence, the order in which words appear affects how much information the sentence carries: **"dog bites man" conveys far less information than "man bites dog"**.
Sequence data is already everywhere in daily life:
Music, language, and video
Earthquakes: a large quake is often followed by smaller aftershocks
Modeling sequences statistically
We mentioned information content above, and that notion is closely tied to probability.
Observing $x_t$ at time $t$ gives $T$ random variables that are not independent: $(x_1, \ldots, x_T) \sim p(\mathbf{x})$.
Expanding with conditional probability: $p(a, b) = p(a)\,p(b \mid a) = p(b)\,p(a \mid b)$.
So $p(\mathbf{x})$ can be written as a product of conditional probabilities: $p(\mathbf{x}) = p(x_1)\,p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1})$.
$f(\cdot)$ denotes the model we train.
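Concretely, following the autoregressive formulation in the d2l notes, $f$ summarizes the history so that

$$p(x_t \mid x_1, \ldots, x_{t-1}) \approx p\bigl(x_t \mid f(x_1, \ldots, x_{t-1})\bigr).$$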
A. Markov assumption
The current observation depends only on a finite number of past observations (covered in probability courses, so not repeated here).
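Under a $\tau$-th order Markov assumption this becomes

$$p(x_t \mid x_1, \ldots, x_{t-1}) = p(x_t \mid x_{t-\tau}, \ldots, x_{t-1}),$$

so the model only ever needs a fixed-length window of the past.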
B. Latent variable models
Recurrent neural networks
We build a network on top of the latent variable model. Unlike a multilayer perceptron, the hidden variable is updated at every step: each output depends on the current input and the hidden variable from the previous time step.
The figure illustrates how a recurrent neural network works; the fourth layer is the recurrent layer. As shown, the input to that layer consists not only of the previous layer's output but also of its own output from the previous time step.
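Written out, a standard formulation of this update (notation may differ slightly from the figure; $\phi$ is an activation function such as $\tanh$) is

$$\mathbf{h}_t = \phi(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{hx}\mathbf{x}_t + \mathbf{b}_h), \qquad \mathbf{o}_t = \mathbf{W}_{ho}\mathbf{h}_t + \mathbf{b}_o,$$

where $\mathbf{h}_t$ is the hidden variable and $\mathbf{o}_t$ the output at time $t$.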
Perplexity
The quality of a language model can be measured by the average cross-entropy:
$$\pi = \frac{1}{n}\sum_{t=1}^{n} -\log p(x_t \mid x_{t-1}, \ldots)$$
Here $p$ is the model's predicted probability and $x_t$ is the true token; the perplexity is $\exp(\pi)$. A perfect model that always assigns the true token probability 1 has perplexity 1, while a model that guesses uniformly over a vocabulary of size $|V|$ has perplexity $|V|$.
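As a toy check (the probabilities here are made up purely for illustration):

```python
import math

# Hypothetical predicted probabilities of the true token at three time steps.
probs = [0.5, 0.25, 0.125]

# Average cross-entropy pi, then perplexity = exp(pi).
pi = sum(-math.log(p) for p in probs) / len(probs)
print(math.exp(pi))  # 4.0, the geometric mean of 1/p: (2 * 4 * 8) ** (1/3)
```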
Gradient clipping
Gradient clipping is one way to prevent exploding gradients. Let $\mathbf{g}$ be the gradient over all layers; if the norm of $\mathbf{g}$ exceeds $\theta$, it is scaled back to length $\theta$:
$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right)\mathbf{g}$$
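A minimal PyTorch sketch of this rule (essentially what d2l's `grad_clipping` helper does; the built-in `torch.nn.utils.clip_grad_norm_` implements the same idea):

```python
import torch

def grad_clipping(net, theta):
    """Rescale gradients so their global L2 norm is at most theta."""
    params = [p for p in net.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```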
Code
The code here is fairly involved, so not all of it is included in this article; the full notebooks are linked below and are worth studying carefully.
https://courses.d2l.ai/zh-v2/assets/notebooks/chapter_recurrent-neural-networks/sequence.slides.html#/
Generate sequence data with a sine function plus some additive noise, using time steps $1, 2, \ldots, 1000$:
```python
import torch
from torch import nn
from d2l import torch as d2l

T = 1000
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))
```
Map the data into pairs $y_t = x_t$ and $\mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]$:
```python
tau = 4
features = torch.zeros((T - tau, tau))
for i in range(tau):
    features[:, i] = x[i:T - tau + i]
labels = x[tau:].reshape((-1, 1))

batch_size, n_train = 16, 600
train_iter = d2l.load_array((features[:n_train], labels[:n_train]),
                            batch_size, is_train=True)
```
Use a fairly simple architecture: a multilayer perceptron with just two fully connected layers:
```python
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

def get_net():
    net = nn.Sequential(nn.Linear(4, 10), nn.ReLU(), nn.Linear(10, 1))
    net.apply(init_weights)
    return net

loss = nn.MSELoss()
```
Train the model:
```python
def train(net, train_iter, loss, epochs, lr):
    trainer = torch.optim.Adam(net.parameters(), lr)
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        print(f'epoch {epoch + 1}, '
              f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')

net = get_net()
train(net, train_iter, loss, 5, 0.01)
```
Have the model predict the next time step:
```python
onestep_preds = net(features)
d2l.plot(
    [time, time[tau:]],
    [x.detach().numpy(), onestep_preds.detach().numpy()],
    'time', 'x', legend=['data', '1-step preds'],
    xlim=[1, 1000], figsize=(6, 3))
```
Multi-step prediction: beyond the training range the model must feed its own predictions back in as input, so errors accumulate:
```python
multistep_preds = torch.zeros(T)
multistep_preds[:n_train + tau] = x[:n_train + tau]
for i in range(n_train + tau, T):
    # Each prediction is fed back in as input for the next step.
    multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))

d2l.plot([time, time[tau:], time[n_train + tau:]],
         [x.detach().numpy(), onestep_preds.detach().numpy(),
          multistep_preds[n_train + tau:].detach().numpy()],
         'time', 'x',
         legend=['data', '1-step preds', 'multistep preds'],
         xlim=[1, 1000], figsize=(6, 3))
```
Take a closer look at $k$-step-ahead predictions:
```python
max_steps = 64
features = torch.zeros((T - tau - max_steps + 1, tau + max_steps))
# Columns 0..tau-1 hold observed values; the rest hold model predictions.
for i in range(tau):
    features[:, i] = x[i:i + T - tau - max_steps + 1]
for i in range(tau, tau + max_steps):
    features[:, i] = net(features[:, i - tau:i]).reshape(-1)

steps = (1, 4, 16, 64)
d2l.plot([time[tau + i - 1:T - max_steps + i] for i in steps],
         [features[:, (tau + i - 1)].detach().numpy() for i in steps],
         'time', 'x', legend=[f'{i}-step preds' for i in steps],
         xlim=[5, 1000], figsize=(6, 3))
```
The second notebook builds a recurrent neural network language model with the high-level API:
https://courses.d2l.ai/zh-v2/assets/notebooks/chapter_recurrent-neural-networks/rnn-concise.slides.html#/
```python
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
```
Define the model:
```python
num_hiddens = 256
rnn_layer = nn.RNN(len(vocab), num_hiddens)
```
Initialize the hidden state with a tensor:
```python
# Shape: (num_layers * num_directions, batch_size, num_hiddens).
state = torch.zeros((1, batch_size, num_hiddens))
state.shape
```
Given a hidden state and an input, we can compute the output using the updated hidden state:
```python
X = torch.rand(size=(num_steps, batch_size, len(vocab)))
# Y holds the hidden state at every time step; state_new is the final one.
Y, state_new = rnn_layer(X, state)
Y.shape, state_new.shape
```
We define an RNNModel class for the complete recurrent neural network model:
```python
class RNNModel(nn.Module):
    """The RNN model."""
    def __init__(self, rnn_layer, vocab_size, **kwargs):
        super(RNNModel, self).__init__(**kwargs)
        self.rnn = rnn_layer
        self.vocab_size = vocab_size
        self.num_hiddens = self.rnn.hidden_size
        if not self.rnn.bidirectional:
            self.num_directions = 1
            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)
        else:
            self.num_directions = 2
            self.linear = nn.Linear(self.num_hiddens * 2, self.vocab_size)

    def forward(self, inputs, state):
        X = F.one_hot(inputs.T.long(), self.vocab_size)
        X = X.to(torch.float32)
        Y, state = self.rnn(X, state)
        # Flatten the time and batch dimensions before the output layer.
        output = self.linear(Y.reshape((-1, Y.shape[-1])))
        return output, state

    def begin_state(self, device, batch_size=1):
        if not isinstance(self.rnn, nn.LSTM):
            # nn.GRU and nn.RNN take a single tensor as hidden state.
            return torch.zeros((self.num_directions * self.rnn.num_layers,
                                batch_size, self.num_hiddens), device=device)
        else:
            # nn.LSTM takes a tuple of (hidden state, cell state).
            return (torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens), device=device),
                    torch.zeros((self.num_directions * self.rnn.num_layers,
                                 batch_size, self.num_hiddens), device=device))
```
Make a prediction with a model that still has random weights:
```python
device = d2l.try_gpu()
net = RNNModel(rnn_layer, vocab_size=len(vocab))
net = net.to(device)
d2l.predict_ch8('time traveller', 10, net, vocab, device)
```
Train the model with the high-level API:
```python
num_epochs, lr = 500, 1
d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)
```
Summary
A recurrent neural network's output depends on the current input and the hidden variable from the previous time step.
When applied to language modeling, a recurrent neural network predicts the word at the next time step from the current word.
Perplexity is commonly used to measure the quality of a language model.