TensorFlow source implementation of "Attention-based Extraction of Structured Information from Street View Imagery": https://github.com/tensorflow/models/tree/master/attention
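
Network structure diagram (node labels and tensor shapes; N is the batch size)

Legend:
  inputs or outputs that take part in the computation (no trainable parameters)
  network structure (contains trainable parameters)

Layers and parameters:
  CNN (densenet169): 4x downsampling
  1x1 convolution: (channel=1664)
  Linear1: (4682)
  Linear2: (4682)
  Linear3: (256)
  Linear4: (256)
  Linear4 and log_softmax: (256)
  Linear5: (1664)
  RNN (LSTM): hidden_size=256
  V: (1664,), initialized once for the whole network and trained as a network parameter (acts as channel attention over the cnn feature)
  Attention
  Decoder

Tensors:
  input image: (N,3,64,1024)
  cnn feature: (N,1664,1,32)
  hidden: (N,32,1,1664)
  y: (N,1664)
  temp: (N,32,1,1664)
  a: (N,32)
  s: (2,32)
  attns(in)
  attns(out): (N,1664)
  initial attns: (N,1664), obtained by feeding random values into Attention; initialized once per batch of images and not a network parameter
  query: (1,N,512)
  decoder_input: (N,4682), one-hot encoded
  rnn_out: (N,256)
  hidden_states: tuple(h,c), with h/c: (1,N,256); in the LSTM, hidden_states is the tuple (h, c) and is randomly initialized at the start character

Operations:
  dimension transform
  V*tanh(hidden+y)
  sum(dim=[2,3])
  softmax
  squeeze
  concat
  sum: (N,256)
  sum: (N,4682)
  global adaptive average pooling
  update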
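
The attention labels above (dimension transform, 1x1 convolution, V*tanh(hidden+y), sum(dim=[2,3]), softmax) can be sketched in NumPy to check that the listed shapes fit together. This is only a sketch under assumptions, not the repository's code: the function name attention_step, the argument conv_weight (a bias-free 1x1 kernel written as a matrix) and the variable names scores and weights are hypothetical, and the last step (a weighted sum of the features to get an (N,1664) result) is an assumed way to reach the attns(out) shape, since the diagram's wiring is not recoverable from the labels alone.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_step(cnn_feature, y, V, conv_weight):
    """One attention step over the feature-map columns.

    cnn_feature: (N, C, 1, W)  e.g. (N, 1664, 1, 32)
    y:           (N, C)        projection of the query (Linear5 in the diagram)
    V:           (C,)          learned vector
    conv_weight: (C, C)        hypothetical 1x1 conv kernel, no bias
    """
    # Dimension transform: (N, C, 1, W) -> (N, W, 1, C)
    feats = cnn_feature.transpose(0, 3, 2, 1)

    # A 1x1 convolution mixes channels only, so it is a matmul on the last axis.
    hidden = feats @ conv_weight                        # (N, W, 1, C)

    # V * tanh(hidden + y); y is broadcast over the W positions.
    temp = V * np.tanh(hidden + y[:, None, None, :])    # (N, W, 1, C)

    # sum(dim=[2,3]) followed by softmax over the W positions.
    scores = temp.sum(axis=(2, 3))                      # (N, W)
    weights = softmax(scores, axis=1)                   # (N, W)

    # Assumption: the attention output is the weighted sum of the features.
    attns = (weights[:, :, None, None] * feats).sum(axis=(1, 2))  # (N, C)
    return temp, scores, weights, attns
```

With C=1664 and W=32 the intermediate shapes match the labels temp: (N,32,1,1664), a: (N,32) and attns(out): (N,1664).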