All Blogs

Feed-Forward Networks in LLMs

In this blog, we will learn about Feed-Forward Networks in LLMs - understanding what they are, how they work inside the Transformer architecture, why every Transformer layer needs one, and what role they play in making Large Language Models so powerful.

Decoding Flash Attention in LLMs

In this blog, we will learn about Flash Attention by decoding it piece by piece - understanding why standard attention is slow, what makes Flash Attention fast, how it uses GPU memory cleverly, and why it is used in almost every modern Large Language Model (LLM).

Mixture of Experts Explained

In this blog, we will learn about the Mixture of Experts (MoE) architecture - understanding what experts are, how the router picks them, why MoE makes large models faster and cheaper, and why it powers many of today's most powerful Large Language Models (LLMs).

Decoding Transformer Architecture

In this blog, we will learn about the Transformer architecture by decoding it piece by piece - understanding what each component does, how they work together, and why this architecture powers almost every modern Large Language Model (LLM).

Math Behind Backpropagation

In this blog, we will learn about the math behind backpropagation in neural networks.

Math behind √dₖ Scaling Factor in Attention

In this blog, we will learn why the dot products in attention are divided by √dₖ in the Transformer architecture, with a step-by-step numeric example.

Math behind Attention - Q, K, and V

In this blog, we will learn about the math behind Attention - Query (Q), Key (K), and Value (V) - with a step-by-step numeric example.

Harness Engineering in AI

In this blog, we will learn about Harness Engineering in AI.

Byte Pair Encoding in LLMs

In this blog, we will learn about BPE (Byte Pair Encoding) - the tokenization algorithm used by most modern Large Language Models (LLMs) to break text into smaller pieces before processing it.