Math behind Attention - Q, K, and V
- Authors
- Name
- Amit Shekhar
- Published on
I am Amit Shekhar, Founder @ Outcome School, I have taught and mentored many developers, and their efforts landed them high-paying tech jobs, helped many tech companies in solving their unique problems, and created many open-source libraries being used by top companies. I am passionate about sharing knowledge through open-source, blogs, and videos.
I teach AI and Machine Learning, and Android at Outcome School.
Join Outcome School and get high paying tech job:
In this blog, we will learn about the math behind Attention: Query(Q), Key(K), and Value(V) with a step-by-step numeric example.
We will cover the following:
- The Attention Formula
- Setting Up: From Words to Vectors
- Creating Q, K, and V Matrices
- Computing Attention Scores (Q x K^T)
- Scaling the Scores
- Applying Softmax
- Computing the Final Output (Attention Weights x V)
- Putting It All Together
Let's get started.
The Attention Formula
The core formula for Scaled Dot-Product Attention is:
Attention(Q, K, V) = softmax(Q x K^T / sqrt(d_k)) x V
Here:
Qis the Query matrixKis the Key matrixVis the Value matrixK^Tis the transpose of the Key matrixd_kis the dimension of the Key vectorsqrtmeans square root
Do not worry if this formula looks complex. We will break it down step by step with actual numbers. By the end of this blog, every part of this formula will be clear.
The idea is simple: we compare what each word is looking for (Query) with what every other word offers (Key), and then use those comparisons to collect the actual information (Value) from the most relevant words.
Now, let's see this with real numbers.
Setting Up: From Words to Vectors
Let's take a simple sentence with 3 words:
"I love AI"
In a Transformer, each word is first converted into a vector (i.e. a list of numbers) called an embedding. Just for the sake of understanding, we will use very small vectors of size 4 (i.e. d_emb = 4).
"I" → [1.0, 0.0, 1.0, 0.0]
"love" → [0.0, 1.0, 0.0, 1.0]
"AI" → [1.0, 1.0, 0.0, 0.0]
We stack these into an input matrix X of shape 3 x 4 (3 words, each with 4 numbers):
X = | 1.0 0.0 1.0 0.0 | ← "I"
| 0.0 1.0 0.0 1.0 | ← "love"
| 1.0 1.0 0.0 0.0 | ← "AI"
Here, each row is one word and each column is one number in the embedding. This is our starting point.
Creating Q, K, and V Matrices
Now, we need to create three separate matrices: Q (Query), K (Key), and V (Value).
How do we create them? We multiply the input matrix X with three separate weight matrices: W_Q, W_K, and W_V. These weight matrices are learned during training.
Q = X x W_Q
K = X x W_K
V = X x W_V
For our example, let's say d_k = 3 (i.e. we want Q, K, V vectors of size 3). So each weight matrix has the shape 4 x 3 (input dimension 4, output dimension 3).
Let's use the following weight matrices for the sake of understanding:
W_Q = | 1 0 1 |
| 0 1 0 |
| 1 0 0 |
| 0 1 1 |
W_K = | 0 1 0 |
| 1 0 1 |
| 0 0 1 |
| 1 1 0 |
W_V = | 1 0 0 |
| 0 1 0 |
| 0 0 1 |
| 1 1 0 |
Now, let's compute Q, K, and V.
Computing Q = X x W_Q:
Q = | 1.0 0.0 1.0 0.0 | | 1 0 1 |
| 0.0 1.0 0.0 1.0 | x | 0 1 0 |
| 1.0 1.0 0.0 0.0 | | 1 0 0 |
| 0 1 1 |
For the first row ("I"): [1*1 + 0*0 + 1*1 + 0*0, 1*0 + 0*1 + 1*0 + 0*1, 1*1 + 0*0 + 1*0 + 0*1] = [2, 0, 1]
For the second row ("love"): [0*1 + 1*0 + 0*1 + 1*0, 0*0 + 1*1 + 0*0 + 1*1, 0*1 + 1*0 + 0*0 + 1*1] = [0, 2, 1]
For the third row ("AI"): [1*1 + 1*0 + 0*1 + 0*0, 1*0 + 1*1 + 0*0 + 0*1, 1*1 + 1*0 + 0*0 + 0*1] = [1, 1, 1]
Q = | 2 0 1 | ← "I"
| 0 2 1 | ← "love"
| 1 1 1 | ← "AI"
Computing K = X x W_K:
For the first row ("I"): [1*0 + 0*1 + 1*0 + 0*1, 1*1 + 0*0 + 1*0 + 0*1, 1*0 + 0*1 + 1*1 + 0*0] = [0, 1, 1]
For the second row ("love"): [0*0 + 1*1 + 0*0 + 1*1, 0*1 + 1*0 + 0*0 + 1*1, 0*0 + 1*1 + 0*1 + 1*0] = [2, 1, 1]
For the third row ("AI"): [1*0 + 1*1 + 0*0 + 0*1, 1*1 + 1*0 + 0*0 + 0*1, 1*0 + 1*1 + 0*1 + 0*0] = [1, 1, 1]
K = | 0 1 1 | ← "I"
| 2 1 1 | ← "love"
| 1 1 1 | ← "AI"
Computing V = X x W_V:
For the first row ("I"): [1*1 + 0*0 + 1*0 + 0*1, 1*0 + 0*1 + 1*0 + 0*1, 1*0 + 0*0 + 1*1 + 0*0] = [1, 0, 1]
For the second row ("love"): [0*1 + 1*0 + 0*0 + 1*1, 0*0 + 1*1 + 0*0 + 1*1, 0*0 + 1*0 + 0*1 + 1*0] = [1, 2, 0]
For the third row ("AI"): [1*1 + 1*0 + 0*0 + 0*1, 1*0 + 1*1 + 0*0 + 0*1, 1*0 + 1*0 + 0*1 + 0*0] = [1, 1, 0]
V = | 1 0 1 | ← "I"
| 1 2 0 | ← "love"
| 1 1 0 | ← "AI"
Now, we have our Q, K, and V matrices ready.
Note: In real models, these weight matrices are not set by hand. They are learned during training. We are using simple numbers here for the sake of understanding.
Till now, we have learned how to create Q, K, and V from the input. Now, let's move to the next step.
Computing Attention Scores (Q x K^T)
The first step in the attention formula is to compute Q x K^T.
This tells us how much each word should attend to every other word. We multiply the Query of each word with the Key of every other word.
First, we need K^T (transpose of K). Transposing simply means we swap the rows and columns. The first row becomes the first column, the second row becomes the second column, and so on:
K^T = | 0 2 1 |
| 1 1 1 |
| 1 1 1 |
Now, let's compute Q x K^T:
Q x K^T = | 2 0 1 | | 0 2 1 |
| 0 2 1 | x | 1 1 1 |
| 1 1 1 | | 1 1 1 |
For "I" attending to all words:
- "I" → "I":
2*0 + 0*1 + 1*1=1 - "I" → "love":
2*2 + 0*1 + 1*1=5 - "I" → "AI":
2*1 + 0*1 + 1*1=3
For "love" attending to all words:
- "love" → "I":
0*0 + 2*1 + 1*1=3 - "love" → "love":
0*2 + 2*1 + 1*1=3 - "love" → "AI":
0*1 + 2*1 + 1*1=3
For "AI" attending to all words:
- "AI" → "I":
1*0 + 1*1 + 1*1=2 - "AI" → "love":
1*2 + 1*1 + 1*1=4 - "AI" → "AI":
1*1 + 1*1 + 1*1=3
"I" "love" "AI"
Scores = | 1 5 3 | ← "I"
| 3 3 3 | ← "love"
| 2 4 3 | ← "AI"
Here, we can see that the word "I" gives the highest attention score of 5 to "love", meaning "I" finds "love" the most relevant. The word "love" gives equal scores of 3 to all words. The word "AI" gives the highest score of 4 to "love".
This was all about computing the attention scores. Now, let's move to scaling.
Scaling the Scores
Now, we divide every score by sqrt(d_k).
In our example, d_k = 3 (the dimension of our Key vectors), so sqrt(3) = 1.732.
Now, the question is: why do we scale? The answer is: if the dot product values are too large, the softmax function in the next step will produce very extreme values (close to 0 or 1). This makes it difficult for the model to learn. Scaling keeps the values in a manageable range.
Scaled Scores = Scores / sqrt(3)
"I" "love" "AI"
Scaled Scores = | 0.577 2.887 1.732 | ← "I"
| 1.732 1.732 1.732 | ← "love"
| 1.155 2.309 1.732 | ← "AI"
Here, we can notice that the relative order of the scores has not changed. The word "I" still attends to "love" the most. But the values are now smaller, which is exactly what we wanted.
In the future, I plan to write about why we choose to divide by sqrt(d_k) instead of some other factor.
Now, let's apply softmax.
Applying Softmax
Now, we apply the softmax function to each row. Softmax converts the scores into probabilities that add up to 1.
The softmax formula for a value x_i in a row is:
softmax(x_i) = e^(x_i) / sum(e^(x_j) for all j in that row)
Here, e is the mathematical constant (approximately 2.718). Do not worry about the formula too much. We will compute it step by step.
For the "I" row: [0.577, 2.887, 1.732]
e^0.577= 1.781e^2.887= 17.940e^1.732= 5.651- Sum = 1.781 + 17.940 + 5.651 = 25.372
Attention Weights:
- "I" → "I": 1.781 / 25.372 = 0.070
- "I" → "love": 17.940 / 25.372 = 0.707
- "I" → "AI": 5.651 / 25.372 = 0.223
For the "love" row: [1.732, 1.732, 1.732]
All values are equal, so each word gets equal attention:
- "love" → "I": 0.333
- "love" → "love": 0.333
- "love" → "AI": 0.333
For the "AI" row: [1.155, 2.309, 1.732]
e^1.155= 3.174e^2.309= 10.063e^1.732= 5.651- Sum = 3.174 + 10.063 + 5.651 = 18.888
Attention Weights:
- "AI" → "I": 3.174 / 18.888 = 0.168
- "AI" → "love": 10.063 / 18.888 = 0.533
- "AI" → "AI": 5.651 / 18.888 = 0.299
The final Attention Weight Matrix is:
"I" "love" "AI"
Attention Weights = | 0.070 0.707 0.223 | ← "I"
| 0.333 0.333 0.333 | ← "love"
| 0.168 0.533 0.299 | ← "AI"
Here, we can see that:
- The word "I" pays 70.7% of its attention to "love", 22.3% to "AI", and only 7.0% to itself.
- The word "love" pays equal attention to all three words (33.3% each).
- The word "AI" pays 53.3% of its attention to "love", 29.9% to itself, and 16.8% to "I".
Each row adds up to 1.0 (i.e. 100%). This is what softmax ensures.
Now, we have the attention weights. It is time to compute the final output.
Computing the Final Output (Attention Weights x V)
Now, we multiply the Attention Weight Matrix with the Value matrix V to get the final output.
Output = Attention Weights x V
Output = | 0.070 0.707 0.223 | | 1 0 1 |
| 0.333 0.333 0.333 | x | 1 2 0 |
| 0.168 0.533 0.299 | | 1 1 0 |
For "I":
0.070*1 + 0.707*1 + 0.223*1=1.0000.070*0 + 0.707*2 + 0.223*1=1.6370.070*1 + 0.707*0 + 0.223*0=0.070
For "love":
0.333*1 + 0.333*1 + 0.333*1=0.9990.333*0 + 0.333*2 + 0.333*1=0.9990.333*1 + 0.333*0 + 0.333*0=0.333
For "AI":
0.168*1 + 0.533*1 + 0.299*1=1.0000.168*0 + 0.533*2 + 0.299*1=1.3650.168*1 + 0.533*0 + 0.299*0=0.168
Output = | 1.000 1.637 0.070 | ← "I"
| 1.000 1.000 0.333 | ← "love"
| 1.000 1.365 0.168 | ← "AI"
Here, we can see that each word now has a new vector that is a weighted combination of the Value vectors of all words. The word "I" gets a new representation that is heavily influenced by the Value of "love" (because it had the highest attention weight of 0.707 for "love"). This is the power of the attention mechanism.
The output for each word is no longer just about that word alone. It now contains information from all the other words, weighted by how relevant they are. This is how the model understands context and relationships between words.
This was all about computing the final output. Now, let's put it all together.
Putting It All Together
Let's summarize the entire computation in one place:
Step 1: Start with the input embeddings (X).
Step 2: Multiply X with weight matrices to get Q, K, and V.
Step 3: Compute attention scores by multiplying Q with the transpose of K (Q x K^T).
Step 4: Scale the scores by dividing by sqrt(d_k).
Step 5: Apply softmax to get attention weights (probabilities).
Step 6: Multiply attention weights with V to get the final output.
The entire process in one formula:
Attention(Q, K, V) = softmax(Q x K^T / sqrt(d_k)) x V
Now, every part of this formula is clear.
This is how the math behind Query, Key, and Value works in the Attention mechanism.
Every Large Language Model uses this exact computation at its core. The numbers we used were small for the sake of understanding, but in real models, the dimensions are much larger (512, 1024, or more) and the weight matrices are learned during training.
Now, we must have understood the math behind Attention: Q, K, and V.
Prepare yourself for AI Engineering Interview: AI Engineering Interview Questions
That's it for now.
Thanks
Amit Shekhar
Founder @ Outcome School
You can connect with me on:
Follow Outcome School on:
