Math
RMSNorm (Root Mean Square Layer Normalization)
In this blog, we will learn about RMSNorm, a faster and simpler alternative to Layer Normalization that powers most modern Large Language Models, including Llama, Mistral, Gemma, Qwen, PaLM, and DeepSeek.
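As a quick preview, here is a minimal NumPy sketch of the RMSNorm formula the post covers, y = x / RMS(x) · g; the function name, epsilon, and toy values are illustrative choices, not code from the post:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by the root mean square of the activations.
    # Unlike LayerNorm, there is no mean subtraction and no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.array([1.0, 2.0, 3.0, 4.0])
g = np.ones_like(x)        # learnable gain, initialized to 1
print(rms_norm(x, g))      # ~[0.365, 0.730, 1.095, 1.461]
```

Dropping the mean subtraction is what makes RMSNorm cheaper than LayerNorm while keeping the rescaling effect.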
Math Behind RoPE (Rotary Position Embedding)
In this blog, we will learn about the math behind Rotary Position Embedding (RoPE) and why it is used in modern Large Language Models.
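As a preview, here is a minimal NumPy sketch of the rotation RoPE applies: each (even, odd) pair of dimensions is rotated by a position-dependent angle, with one frequency per pair. The function name and toy vector are illustrative; the base of 10000 is the common default from the RoPE paper:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive (even, odd) dimension pairs of x by an angle
    # proportional to the token position `pos`.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
print(rope(q, pos=3))   # the same vector rotated by position-3 angles
```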
Math Behind Cross-Entropy Loss
In this blog, we will learn about the math behind Cross-Entropy Loss with a step-by-step numeric example.
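As a preview, here is a minimal NumPy example of the formula, H(p, q) = -Σᵢ pᵢ log qᵢ, applied to a single example with a one-hot target; the logits below are made-up toy values:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])   # raw model scores
target = 0                           # index of the true class

# Softmax turns logits into probabilities (shifted for stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# With a one-hot target, cross-entropy reduces to -log of the
# probability assigned to the true class.
loss = -np.log(probs[target])
print(probs)   # ~[0.659, 0.242, 0.099]
print(loss)    # ~0.417
```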
Math Behind Gradient Descent
In this blog, we will learn about the math behind gradient descent with a step-by-step numeric example.
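As a preview, here is a tiny worked example of the update rule w ← w − η∇f(w) on a one-dimensional quadratic; the objective and learning rate are illustrative choices:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
# Update rule: w <- w - lr * f'(w), with f'(w) = 2 * (w - 3).
w, lr = 0.0, 0.1
for step in range(25):
    grad = 2 * (w - 3)
    w -= lr * grad
print(w)   # close to 3.0
```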
Math Behind Backpropagation
In this blog, we will learn about the math behind backpropagation in neural networks.
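As a preview, here is a minimal sketch of backpropagation through a one-hidden-unit network, applying the chain rule by hand; the weights, input, and target are toy values:

```python
import numpy as np

# Tiny network: h = tanh(w1 * x), y = w2 * h, loss = (y - t)^2.
x, t = 1.5, 0.5
w1, w2 = 0.8, -0.4

# Forward pass.
h = np.tanh(w1 * x)
y = w2 * h
loss = (y - t) ** 2

# Backward pass: apply the chain rule layer by layer.
dy = 2 * (y - t)             # dloss/dy
dw2 = dy * h                 # dloss/dw2
dh = dy * w2                 # dloss/dh
dw1 = dh * (1 - h**2) * x    # dloss/dw1, using tanh'(z) = 1 - tanh(z)^2

print(loss, dw1, dw2)
```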
Math Behind the √dₖ Scaling Factor in Attention
In this blog, we will learn why we divide the attention dot products by √dₖ in the Transformer architecture with a step-by-step numeric example.
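As a preview, here is a small NumPy experiment showing the effect the post explains: dot products of random dₖ-dimensional vectors have variance roughly dₖ, so softmax saturates as dₖ grows, while dividing by √dₖ brings the variance back near 1. The sample size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
for dk in (4, 64, 1024):
    q = rng.standard_normal((2000, dk))
    k = rng.standard_normal((2000, dk))
    scores = (q * k).sum(axis=1)            # raw dot products
    print(dk, scores.var(), (scores / np.sqrt(dk)).var())
    # raw variance grows like dk; scaled variance stays near 1
```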
Math Behind Attention - Q, K, and V
In this blog, we will learn about the math behind Attention - Query (Q), Key (K), and Value (V) with a step-by-step numeric example.
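As a preview, here is a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√dₖ)V; the shapes and random inputs are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention:
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries, d_k = 4
K = rng.standard_normal((5, 4))   # 5 keys
V = rng.standard_normal((5, 4))   # 5 values
print(attention(Q, K, V).shape)   # (3, 4): one output row per query
```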