The highlighted text presents the equation for the attention mechanism used in the proposed network architecture. Attention lets the model focus on the most relevant parts of the input sequence when generating each element of the output sequence. The equation is written as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
Here, Q, K, and V are matrices representing the queries, keys, and values, respectively. These matrices are obtained by projecting the input sequence, and the dot products QK^T give the raw attention scores. The scores are divided by sqrt(d_k), where d_k is the dimension of the key vectors, to keep their magnitude from growing with the key dimension. The softmax function then converts each row of scaled scores into a probability distribution over the input positions. Finally, these attention weights multiply the value matrix V, producing a weighted sum of the values as the output of the attention mechanism.
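The computation above can be sketched directly in NumPy. This is a minimal illustration of the equation, not code from the paper; the function name and the example shapes are chosen for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for 2-D Q, K, V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # raw similarity scores, shape (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                   # weighted sum of values, plus the weights

# Example: 3 queries attending over 4 key/value pairs of dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the 4 input positions, and `out` has one output vector per query.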
The proposed network architecture is based solely on this attention mechanism, without any recurrent or convolutional layers. Because the attention computation over all positions can be performed at once, the model is more parallelizable and faster to train than recurrent alternatives. The reported results show that it outperforms previous state-of-the-art models on English-to-German and English-to-French machine translation tasks.