The Mechanism of Self-Attention

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.

Key Concept: Query, Key, and Value

In the self-attention layer, for each input word we create three vectors: a Query vector, a Key vector, and a Value vector. These are produced by multiplying the word's embedding by three weight matrices that are learned during training.
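A minimal NumPy sketch of this projection step, with illustrative dimensions (the sizes, variable names, and random stand-ins for the learned matrices are assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                  # illustrative embedding / projection sizes
x = rng.normal(size=(5, d_model))    # embeddings for a 5-word sentence

# In a trained model these matrices are learned; random stand-ins here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Each word's embedding is multiplied by all three matrices.
Q = x @ W_q   # Query vectors, one per word
K = x @ W_k   # Key vectors
V = x @ W_v   # Value vectors
print(Q.shape, K.shape, V.shape)  # → (5, 4) (5, 4) (5, 4)
```

Note that all words share the same three matrices; only the embeddings differ, so the projections can be computed for the whole sentence in one matrix multiplication each.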

When computing the self-attention for a given word, we score every word of the input sentence against it. The score determines how much focus to place on each part of the input sentence as we encode the word at that position.
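The scoring and weighting described above can be sketched as follows. This assumes scaled dot-product scores followed by a softmax, the common Transformer formulation; the text itself does not fix the scoring function, so treat this as one concrete choice:

```python
import numpy as np

def self_attention(Q, K, V):
    """Score each query against all keys, softmax the scores,
    and return the weighted sum of the value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # one score per (query word, key word) pair
    # Softmax over the key dimension turns scores into focus weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                      # each output row blends all value vectors

rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 4))  # toy queries, keys, values for 5 words
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out = self_attention(Q, K, V)
print(out.shape)  # → (5, 4)
```

Each row of `out` is the new representation of one word: a mixture of every word's Value vector, weighted by how strongly that word's Key matched the current word's Query.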
