The claims regarding attention heads in transformers can be understood through the concepts of QK (query-key) circuits and OV (output-value) circuits. Each of these circuits plays a distinct role in the attention mechanism, which is fundamental to how transformers process information.
Understanding QK and OV Circuits
1. QK Circuit
The QK circuit is responsible for computing the attention pattern: it determines how much attention each token should pay to every other token in the input sequence. This is done by taking the dot product between query vectors (Q) and key vectors (K); the resulting scores are scaled and passed through a softmax, yielding attention weights that indicate how strongly each token attends to every other token.
Example:
- Consider a sentence: "The cat sat on the mat."
- Suppose we have a query vector for "cat" and key vectors for all tokens.
- The QK circuit computes the attention scores between "cat" and all other tokens, determining which tokens "cat" should attend to based on how well its query matches their keys (see the sketch below).
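To make this concrete, here is a minimal NumPy sketch of a QK circuit turning query and key vectors into an attention pattern. The token representations, weight matrices, and dimensions are random placeholders for illustration, not values from a real model.

```python
import numpy as np

# Toy setup: 6 tokens ("The cat sat on the mat"), model width 8, head dimension 4.
# All matrices are random placeholders standing in for learned weights.
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_head = 8, 4

X = rng.normal(size=(len(tokens), d_model))   # token representations
W_Q = rng.normal(size=(d_model, d_head))      # query projection
W_K = rng.normal(size=(d_model, d_head))      # key projection

Q = X @ W_Q                                   # query vector for each token
K = X @ W_K                                   # key vector for each token

# QK circuit: query-key dot products give raw scores, which are scaled
# and softmax-normalized into an attention pattern (rows sum to 1).
scores = Q @ K.T / np.sqrt(d_head)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = attn / attn.sum(axis=-1, keepdims=True)

# The row for "cat" shows how much "cat" attends to every token in the sentence.
cat_row = attn[tokens.index("cat")]
for tok, w in zip(tokens, cat_row):
    print(f"cat -> {tok}: {w:.2f}")
```

With learned weights rather than random ones, the "cat" row would concentrate on whichever tokens are most relevant to "cat" in context.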
2. OV Circuit
The OV circuit determines how each attended token's value vector (V) contributes to the head's output. Once the QK circuit has produced the attention pattern, the OV circuit takes a weighted sum of the value vectors, with each weight given by the corresponding attention score, and projects the result back into the model's representation space to produce the head's output vector.
Example:
- Continuing with our previous example, if "cat" attends to "the" and "sat," the OV circuit will combine their value vectors based on the attention scores computed by the QK circuit.
- This combination yields a new representation that reflects the context of "cat" within the sentence.
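The sketch below shows the OV side of the same toy setup: each token's value vector is weighted by the attention pattern, summed, and projected back to model width. The value and output matrices, and the attention pattern itself, are again random placeholders rather than learned parameters.

```python
import numpy as np

# Same toy setup as the QK sketch: 6 tokens, width 8, head dimension 4,
# with random placeholder weights and a stand-in attention pattern.
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_head = 8, 4

X = rng.normal(size=(len(tokens), d_model))                    # token representations
attn = rng.dirichlet(np.ones(len(tokens)), size=len(tokens))   # placeholder attention pattern (rows sum to 1)

W_V = rng.normal(size=(d_model, d_head))   # value projection
W_O = rng.normal(size=(d_head, d_model))   # output projection

V = X @ W_V                                # value vector for each token

# OV circuit: weight each token's value vector by the attention it receives,
# sum the results, and project back to model width to get this head's output.
head_out = (attn @ V) @ W_O                # shape: (num_tokens, d_model)

# The row for "cat" is a new, context-mixed representation of "cat".
print(head_out[tokens.index("cat")].round(2))
```

In a trained model, the attention pattern would come from the QK circuit rather than being sampled, so the output row for "cat" would blend value information from exactly the tokens that "cat" attended to.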