All Blogs
Decoding DeepSeek-V4
In this blog, we will learn about DeepSeek-V4, the new family of open Mixture-of-Experts language models that natively supports a one-million-token context with dramatically lower inference cost.
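To make the Mixture-of-Experts idea concrete before diving in, here is a minimal NumPy sketch of generic top-k expert routing. This illustrates MoE in general, not DeepSeek-V4's actual router; all names, shapes, and the toy experts are made up.

```python
import numpy as np

def moe_layer(x, experts, router_w, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x        : (tokens, d_model) token activations
    experts  : list of callables, each mapping (d_model,) -> (d_model,)
    router_w : (d_model, num_experts) router projection
    """
    logits = x @ router_w                       # (tokens, num_experts)
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        top = np.argsort(logits[i])[-k:]        # indices of the k best experts
        w = np.exp(logits[i][top] - logits[i][top].max())
        w /= w.sum()                            # softmax over just the chosen k
        for weight, e in zip(w, top):
            out[i] += weight * experts[e](tok)  # weighted mix of expert outputs
    return out

# Toy usage: four "experts" that are just scaled tanh functions.
rng = np.random.default_rng(0)
experts = [lambda t, s=s: np.tanh(t) * s for s in (0.5, 1.0, 2.0, 4.0)]
y = moe_layer(rng.normal(size=(3, 8)), experts, rng.normal(size=(8, 4)))
```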
LoRA - Low-Rank Adaptation of LLMs
In this blog, we will learn about LoRA - Low-Rank Adaptation of Large Language Models - a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices instead.
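As a preview of the core trick: instead of updating the full weight matrix W, LoRA learns a low-rank update BA. A minimal NumPy sketch with toy sizes and made-up names (the blog walks through the real details):

```python
import numpy as np

d, r = 512, 8                       # model width and LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init so the
                                    # adapter starts as a no-op

def lora_forward(x, alpha=16):
    # y = x W^T + (alpha / r) * x A^T B^T; only A and B receive gradients
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))
assert np.allclose(lora_forward(x), x @ W.T)  # holds because B is zero at init
```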
Math Behind RoPE (Rotary Position Embedding)
In this blog, we will learn about the math behind Rotary Position Embedding (RoPE) and why it is used in modern Large Language Models.
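As a preview, here is a small NumPy sketch of the rotation RoPE applies: each pair of feature dimensions is rotated by an angle proportional to the token's position. This is illustrative only; the interleaved pairing shown here is one common convention.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles.

    x: (seq_len, d) with d even; positions: (seq_len,) array of positions.
    Pair (x_{2i}, x_{2i+1}) is rotated by theta = pos * base^(-2i/d).
    """
    seq_len, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```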
Grouped Query Attention
In this blog, we will learn about Grouped-Query Attention (GQA) and how it differs from Multi-Head Attention (MHA).
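The one-line summary: MHA gives every query head its own K/V head, while GQA lets a group of query heads share one. A minimal NumPy sketch of that sharing, with toy shapes and made-up names (assumes n_q_heads is divisible by n_kv_heads):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(Q, K, V, n_kv_heads):
    """Q: (n_q_heads, seq, d); K, V: (n_kv_heads, seq, d).

    Each group of n_q_heads // n_kv_heads query heads shares one K/V head;
    n_kv_heads == n_q_heads recovers MHA, n_kv_heads == 1 recovers MQA.
    """
    n_q_heads, seq, d = Q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(Q)
    for h in range(n_q_heads):
        kv = h // group                          # shared K/V head for this query head
        scores = Q[h] @ K[kv].T / np.sqrt(d)     # (seq, seq)
        out[h] = softmax(scores) @ V[kv]
    return out
```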
Math Behind Cross-Entropy Loss
In this blog, we will learn about the math behind Cross-Entropy Loss with a step-by-step numeric example.
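As a taste of the kind of numeric example the post works through (toy logits, made up here):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])  # model scores for 3 classes
target = 0                          # index of the true class

# Softmax turns logits into probabilities (shift by max for stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy is the negative log-probability of the true class.
loss = -np.log(probs[target])
print(probs.round(3), loss.round(3))  # approx [0.659 0.242 0.099] and 0.417
```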
Math Behind Gradient Descent
In this blog, we will learn about the math behind Gradient Descent with a step-by-step numeric example.
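As a taste of the kind of numeric example the post works through, here is plain gradient descent on a one-dimensional toy function (made up for illustration):

```python
# Minimize f(x) = (x - 3)^2; its gradient is f'(x) = 2 * (x - 3).
x, lr = 0.0, 0.1
for step in range(25):
    grad = 2 * (x - 3)
    x -= lr * grad  # step against the gradient
print(x)  # approx 2.989: the error shrinks by a factor of (1 - 2*lr) = 0.8 per step
```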
Decoding Vision Transformer (ViT)
In this blog, we will learn about the Vision Transformer (ViT) by decoding how it splits an image into patches, turns those patches into tokens, and processes them with a Transformer encoder to classify the image.
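As a preview, here is a minimal NumPy sketch of the patch-splitting step, with toy sizes; the learned projection and position embeddings are only described in a comment:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    x = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (rows, cols, patch, patch, C)
    return x.reshape(rows * cols, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)  # (196, 768): 14x14 patches, each 16*16*3 values
# A learned linear projection of each row, plus a position embedding, gives
# the token sequence that a standard Transformer encoder then processes.
```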
Feed-Forward Networks in LLMs
In this blog, we will learn about Feed-Forward Networks in LLMs - understanding what they are, how they work inside the Transformer architecture, why every Transformer layer needs one, and what role they play in making Large Language Models so powerful.
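As a preview, the whole block is just two linear layers with a nonlinearity in between, applied to each token independently. A minimal NumPy sketch with made-up names (GELU shown via its common tanh approximation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, common in transformer implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward block: expand, apply nonlinearity, project back.

    x: (seq, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model),
    typically with d_ff = 4 * d_model. The same two-layer MLP is
    applied to every token independently.
    """
    return gelu(x @ W1 + b1) @ W2 + b2
```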
Decoding Flash Attention in LLMs
In this blog, we will learn about Flash Attention by decoding it piece by piece - understanding why standard attention is slow, what makes Flash Attention fast, how it uses GPU memory cleverly, and why it is used in almost every modern Large Language Model (LLM).
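As a preview of the central trick, here is a NumPy sketch of online softmax over blocks of keys for a single query. This is the numerical idea Flash Attention builds on, not the fused GPU kernel itself; names and the block size are made up.

```python
import numpy as np

def blockwise_attention(q, K, V, block=128):
    """Attention for one query vector, streaming K/V in blocks.

    Keeps only a running max m, normalizer l, and weighted sum acc, so the
    full (seq x seq) score matrix is never materialized. The real kernel
    also tiles queries and fuses everything into one GPU pass.
    """
    d = q.shape[0]
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)  # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                    # rescale the old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l                                   # equals softmax(K q / sqrt(d)) @ V
```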