The call function of our DecoderLayer is implemented as follows:
def call(self, x, encoding_outputs, training,
         decoder_mask, encoder_decoder_padding_mask):
    # decoder_mask: the combination of look_ahead_mask and decoder_padding_mask
    # x.shape: (batch_size, target_seq_len, d_model)
    # encoding_outputs.shape: (batch_size, input_seq_len, d_model)

    # Masked self-attention over the decoder input.
    # attn1, out1.shape: (batch_size, target_seq_len, d_model)
    attn1, attn_weights1 = self.mha1(x, x, x, decoder_mask)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layer_norm1(attn1 + x)

    # Encoder-decoder attention: out1 attends over the encoder outputs.
    # attn2, out2.shape: (batch_size, target_seq_len, d_model)
    attn2, attn_weights2 = self.mha2(
        out1, encoding_outputs, encoding_outputs,
        encoder_decoder_padding_mask)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layer_norm2(attn2 + out1)

    # Position-wise feed-forward network.
    # ffn_output, out3.shape: (batch_size, target_seq_len, d_model)
    ffn_output = self.ffn(out2)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layer_norm3(ffn_output + out2)

    return out3, attn_weights1, attn_weights2
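For context, here is a minimal sketch of the constructor that this call relies on. It is an assumption inferred from the layer names used above (mha1, mha2, ffn, layer_norm1-3, dropout1-3): MultiHeadAttention stands for the custom multi-head attention layer presumably defined earlier, and feed_forward_network is a hypothetical helper shown inline.

from tensorflow import keras

def feed_forward_network(d_model, dff):
    # hypothetical helper: two-layer position-wise feed-forward network
    return keras.Sequential([
        keras.layers.Dense(dff, activation='relu'),
        keras.layers.Dense(d_model),
    ])

class DecoderLayer(keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()
        # MultiHeadAttention is assumed to be the custom multi-head
        # attention layer defined earlier in the article.
        self.mha1 = MultiHeadAttention(d_model, num_heads)  # masked self-attention
        self.mha2 = MultiHeadAttention(d_model, num_heads)  # encoder-decoder attention
        self.ffn = feed_forward_network(d_model, dff)

        self.layer_norm1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm3 = keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = keras.layers.Dropout(rate)
        self.dropout2 = keras.layers.Dropout(rate)
        self.dropout3 = keras.layers.Dropout(rate)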
In self.mha2, out1 is the query, while encoding_outputs serves as both key and value. The query and key are used to compute the attention weights, and the value is then multiplied by those weights. The intuition is that, at each step, the decoder computes its relevance (attention weights) to every encoder output, then uses those relevances to take a weighted sum of the encoder outputs, extracting the encoder information that this decoder step needs before moving on to the next computation. A shape-level sketch of this idea follows below.
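To make the shapes concrete, here is a minimal single-head scaled dot-product attention sketch of this encoder-decoder attention (random tensors and illustrative dimensions; the multi-head version in the model splits d_model across num_heads heads, but the weight and context shapes follow the same pattern):

import tensorflow as tf

batch_size, target_seq_len, input_seq_len, d_model = 2, 5, 7, 16

query = tf.random.normal((batch_size, target_seq_len, d_model))  # decoder side (like out1)
key = tf.random.normal((batch_size, input_seq_len, d_model))     # encoder outputs
value = key                                                      # encoder outputs

# Relevance of every decoder step to every encoder step.
scores = tf.matmul(query, key, transpose_b=True) / tf.math.sqrt(tf.cast(d_model, tf.float32))
weights = tf.nn.softmax(scores, axis=-1)  # (batch_size, target_seq_len, input_seq_len)

# Each decoder step gathers a weighted mix of the encoder outputs.
context = tf.matmul(weights, value)       # (batch_size, target_seq_len, d_model)

print(weights.shape, context.shape)

Note that the attention-weight matrix has shape (batch_size, target_seq_len, input_seq_len): one row of relevances over the encoder's input_seq_len positions for each decoder step, which is exactly the "weight the encoder outputs by relevance" behavior described above.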