Math Behind Cross-Entropy Loss
In this blog, we will learn about the math behind Cross-Entropy Loss with a step-by-step numeric example.
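Before diving into the step-by-step math, here is a minimal sketch of the formula we will be working through: for a one-hot true label y and predicted probabilities p, cross-entropy is L = −Σᵢ yᵢ·log(pᵢ). The class names and numbers below are just an illustrative example, not from any real model.

```python
import math

# Cross-entropy loss for a single example:
#   L = -sum_i y_i * log(p_i)
# where y is the one-hot true label and p is the model's
# predicted probability distribution (e.g. a softmax output).

def cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# Hypothetical 3-class example: the true class is index 0.
y_true = [1, 0, 0]
y_pred = [0.7, 0.2, 0.1]   # predicted probabilities (sum to 1)

loss = cross_entropy(y_true, y_pred)
print(round(loss, 4))  # -ln(0.7) ≈ 0.3567
```

Because the label is one-hot, only the predicted probability of the true class matters: the loss collapses to −log(p_true), so confident correct predictions give a loss near 0 and confident wrong ones blow up.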