<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link><description>Digital workplace, artificial intelligence, cloud, security, automation, and enterprise technology notes by Ajay Walia.</description><language>en-au</language><managingEditor>Ajay Walia</managingEditor><webMaster>Ajay Walia</webMaster><copyright>Copyright 2026 Ajay Walia</copyright><lastBuildDate>Sun, 21 Jun 2026 05:46:10 +0000</lastBuildDate><atom:link href="https://curiousbit.netlify.app/tags/deep-learning/index.xml" rel="self" type="application/rss+xml"/><image><url>https://curiousbit.netlify.app/images/og-default.png</url><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link></image><item><title>LLMs Are Probability Engines, Not "Thinkers"</title><link>https://curiousbit.netlify.app/llms-are-probability-engines-not-ai/</link><guid isPermaLink="true">https://curiousbit.netlify.app/llms-are-probability-engines-not-ai/</guid><pubDate>Sun, 07 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&amp;family=JetBrains+Mono:wght@400;500&amp;display=swap');
.pe-article {
--bg: #070b14;
--bg2: #0d1423;
--bg3: #111827;
--cyan: #00e5ff;
--purple: #a855f7;
--gold: #fbbf24;
--text: #e2e8f0;
--muted: #94a3b8;
--border: #1e293b;
--danger: #f87171;
font-family: 'Space Grotesk', system-ui, sans-serif;
font-size: 1.08rem;
line-height: 1.85;
color: var(--text);
}
/* TOC */
.pe-toc {
background: var(--bg2);
border: 1px solid var(--border);
border-left: 3px solid var(--cyan);
border-radius: 10px;
padding: 1.25rem 1.75rem;
margin: 2rem 0;
}
.pe-toc h3 {
font-size: 0.95rem;
letter-spacing: 0.18em;
text-transform: uppercase;
color: var(--cyan);
margin: 0 0 1rem;
}
.pe-toc ol { padding-left: 1.3rem; margin: 0; }
.pe-toc li { margin-bottom: 0.6rem; }
.pe-toc a { color: var(--muted); text-decoration: none; font-size: 1.15rem; font-weight: 600; transition: color 0.2s; }
.pe-toc a:hover { color: var(--cyan); }
/* Video */
.pe-video { margin: 2rem 0; border-radius: 12px; overflow: hidden; border: 1px solid var(--border); background: #000; }
.pe-video video { width: 100%; display: block; }
.pe-video-header {
background: var(--bg2);
padding: 1rem 1.4rem;
font-size: 1.15rem;
font-weight: 600;
color: var(--cyan);
border-bottom: 1px solid var(--border);
line-height: 1.5;
}
/* Typography */
.pe-article h2 {
font-size: 1.75rem;
font-weight: 700;
color: #fff;
margin: 3rem 0 0.9rem;
padding-bottom: 0.45rem;
border-bottom: 1px solid var(--border);
}
.pe-sec-num { color: var(--cyan); font-size: 1rem; font-weight: 600; display: block; margin-bottom: 0.2rem; letter-spacing: 0.1em; }
.pe-article p { margin-bottom: 1.1rem; }
.pe-article strong { color: #fff; }
.pe-em { color: var(--gold); }
/* Callouts */
.pe-callout { background: var(--bg2); border-left: 4px solid var(--purple); border-radius: 0 8px 8px 0; padding: 1.4rem 1.8rem; margin: 1.5rem 0; font-size: 1.4rem; color: var(--muted); line-height: 1.75; }
.pe-callout.cy { border-color: var(--cyan); }
.pe-callout.gd { border-color: var(--gold); }
.pe-callout strong { color: var(--text); }
/* Compare table */
.pe-table { width: 100%; border-collapse: collapse; font-size: 1rem; margin: 1.25rem 0; }
.pe-table th { text-align: left; padding: 0.7rem 1rem; background: var(--bg2); color: var(--cyan); font-size: 0.85rem; letter-spacing: 0.08em; text-transform: uppercase; border-bottom: 1px solid var(--border); }
.pe-table td { padding: 0.85rem 1rem; border-bottom: 1px solid var(--border); color: var(--muted); vertical-align: top; line-height: 1.6; }
.pe-table td:first-child { color: var(--text); font-weight: 500; }
.pe-table tr:hover td { background: var(--bg2); }
/* Formula boxes */
.pe-box { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.75rem; margin: 1.75rem 0; }
.pe-box-title { font-size: 0.95rem; letter-spacing: 0.15em; text-transform: uppercase; color: var(--purple); margin-bottom: 1rem; }
/* Anim 1 — token prediction */
.pe-sentence { font-size: 1.3rem; font-family: 'JetBrains Mono', monospace; color: var(--text); min-height: 2rem; margin-bottom: 1.1rem; }
.pe-cursor { display: inline-block; width: 2px; height: 1em; background: var(--cyan); animation: pe-blink 0.8s infinite; vertical-align: middle; margin-left: 2px; }
@keyframes pe-blink { 0%,100%{opacity:1} 50%{opacity:0} }
.pe-prob-bars { display: flex; flex-direction: column; gap: 0.55rem; }
.pe-prob-row { display: flex; align-items: center; gap: 0.8rem; font-size: 1.1rem; }
.pe-prob-lbl { width: 80px; text-align: right; color: var(--muted); font-family: 'JetBrains Mono', monospace; flex-shrink: 0; }
.pe-prob-track { flex: 1; height: 26px; background: var(--bg3); border-radius: 5px; overflow: hidden; }
.pe-prob-fill { height: 100%; background: var(--cyan); border-radius: 5px; transition: width 0.55s cubic-bezier(0.4,0,0.2,1); width: 0; }
.pe-prob-fill.win { background: var(--gold); }
.pe-prob-pct { width: 50px; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--muted); }
/* Math display */
.pe-math { font-family: 'Georgia', serif; font-size: 1.45rem; color: var(--gold); text-align: center; padding: 1.3rem; background: var(--bg3); border-radius: 8px; margin-bottom: 0.9rem; }
.pe-term { display: inline; opacity: 0; transition: opacity 0.4s; cursor: help; position: relative; }
.pe-term.on { opacity: 1; }
.pe-term:hover::after { content: attr(data-tip); position: absolute; bottom: 115%; left: 50%; transform: translateX(-50%); background: var(--bg); border: 1px solid var(--purple); color: var(--text); padding: 0.4rem 0.85rem; border-radius: 6px; font-family: 'Space Grotesk', sans-serif; font-size: 0.88rem; white-space: nowrap; z-index: 20; }
.pe-anns { display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
.pe-ann { background: var(--bg3); border-radius: 6px; padding: 0.55rem 0.85rem; font-size: 0.93rem; opacity: 0; transition: opacity 0.5s; }
.pe-ann.on { opacity: 1; }
.pe-ann-sym { color: var(--gold); font-family: 'JetBrains Mono', monospace; font-weight: bold; }
.pe-ann-desc { color: var(--muted); }
/* Softmax */
.pe-sm-demo { display: flex; gap: 1rem; align-items: flex-start; flex-wrap: wrap; }
.pe-sm-col { flex: 1; min-width: 160px; }
.pe-col-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.75rem; }
.pe-logit-row { display: flex; align-items: center; gap: 0.6rem; margin-bottom: 0.55rem; font-size: 0.97rem; font-family: 'JetBrains Mono', monospace; }
.pe-logit-w { width: 60px; color: var(--text); }
.pe-logit-v { padding: 0.22rem 0.6rem; border-radius: 4px; font-size: 0.92rem; }
.pe-logit-v.neg { background: rgba(248,113,113,0.15); color: var(--danger); }
.pe-logit-v.pos { background: rgba(0,229,255,0.1); color: var(--cyan); }
.pe-sm-bar { height: 22px; border-radius: 4px; background: var(--purple); transition: width 0.75s cubic-bezier(0.4,0,0.2,1); width: 0; display: flex; align-items: center; padding-left: 7px; font-size: 0.86rem; color: #fff; overflow: hidden; white-space: nowrap; }
.pe-arrow { display: flex; align-items: center; justify-content: center; padding-top: 1.4rem; font-size: 1.6rem; color: var(--cyan); }
/* Loss */
.pe-loss-wrap { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: start; }
@media(max-width:500px) { .pe-loss-wrap { grid-template-columns: 1fr; } }
.pe-loss-num { font-size: 2.8rem; font-weight: 800; font-family: 'JetBrains Mono', monospace; color: var(--danger); transition: color 0.5s; line-height: 1; }
.pe-loss-num.good { color: #4ade80; }
.pe-loss-lbl { font-size: 0.88rem; color: var(--muted); margin-top: 0.3rem; }
.pe-loss-slider label { font-size: 0.92rem; color: var(--muted); display: block; margin: 0.8rem 0 0.3rem; }
input[type=range] { width: 100%; accent-color: var(--cyan); }
.pe-loss-formula { background: var(--bg3); border-radius: 8px; padding: 1.1rem; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--text); line-height: 2.1; }
.pe-lf-hl { color: var(--gold); }
.pe-lf-res { color: var(--cyan); font-weight: bold; }
/* Attention */
.pe-attn-words { display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 0.9rem; }
.pe-attn-word { padding: 0.4rem 0.8rem; border-radius: 6px; background: var(--bg3); border: 1px solid var(--border); font-size: 1rem; cursor: pointer; transition: all 0.2s; user-select: none; }
.pe-attn-word:hover { border-color: var(--cyan); }
.pe-attn-word.sel { background: rgba(0,229,255,0.12); border-color: var(--cyan); color: var(--cyan); }
/* Attention formula cards */
.pe-qkv { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
@media(max-width:480px) { .pe-qkv { grid-template-columns: 1fr; } }
.pe-qkv-card { border-radius: 6px; padding: 0.75rem 0.9rem; font-size: 0.93rem; }
/* Temperature */
.pe-temp-row-ctrl { display: flex; align-items: center; gap: 1rem; }
.pe-temp-big { font-size: 1.6rem; font-family: 'JetBrains Mono', monospace; font-weight: 700; color: var(--cyan); width: 56px; flex-shrink: 0; }
.pe-temp-lbl { font-size: 0.92rem; color: var(--muted); margin-top: 0.2rem; }
.pe-tbars { display: flex; flex-direction: column; gap: 0.45rem; margin-top: 0.9rem; }
.pe-trow { display: flex; align-items: center; gap: 0.65rem; font-size: 0.95rem; font-family: 'JetBrains Mono', monospace; }
.pe-tlbl { width: 64px; color: var(--muted); text-align: right; flex-shrink: 0; }
.pe-ttrack { flex: 1; height: 19px; background: var(--bg3); border-radius: 4px; overflow: hidden; }
.pe-tfill { height: 100%; border-radius: 4px; background: var(--purple); transition: width 0.5s cubic-bezier(0.4,0,0.2,1); }
.pe-tpct { width: 48px; text-align: right; color: var(--muted); }
/* Limits */
.pe-limits { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1rem; margin: 1.5rem 0; }
@media(max-width:640px) { .pe-limits { grid-template-columns: 1fr; } }
.pe-limit-card { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.5rem 1.6rem; transition: border-color 0.2s; display: flex; flex-direction: column; gap: 0.4rem; }
.pe-limit-card:hover { border-color: var(--purple); }
.pe-limit-icon { font-size: 2.2rem; line-height: 1; }
.pe-limit-title { font-weight: 700; color: var(--text); font-size: 1.35rem; margin: 0; }
.pe-limit-desc { font-size: 1.15rem; color: var(--muted); line-height: 1.65; margin: 0; }
/* Dual meters */
.pe-meters { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.25rem 0; }
.pe-meter { border-radius: 8px; padding: 1.1rem; text-align: center; }
.pe-meter-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; margin-bottom: 0.4rem; }
.pe-meter-val { font-size: 2.2rem; font-weight: 800; }
/* Buttons */
.pe-btn {
margin-top: 0.9rem;
background: var(--bg3);
border: 1px solid var(--cyan);
color: var(--cyan);
padding: 0.4rem 1.1rem;
border-radius: 6px;
cursor: pointer;
font-size: 0.82rem;
font-family: 'Space Grotesk', sans-serif;
transition: background 0.2s;
}
.pe-btn:hover { background: rgba(0,229,255,0.1); }
.pe-btn-pur { border-color: var(--purple); color: var(--purple); }
.pe-btn-pur:hover { background: rgba(168,85,247,0.1); }
/* Interactive badge */
.pe-interactive-header {
display: flex;
align-items: center;
justify-content: space-between;
margin-bottom: 1rem;
}
.pe-interactive-header .pe-box-title { margin-bottom: 0; }
.pe-interactive-badge {
display: inline-flex;
align-items: center;
gap: 0.4rem;
background: rgba(0,229,255,0.08);
border: 1px solid var(--cyan);
color: var(--cyan);
font-size: 0.72rem;
font-weight: 700;
letter-spacing: 0.12em;
text-transform: uppercase;
padding: 0.3rem 0.75rem;
border-radius: 999px;
flex-shrink: 0;
}
.pe-interactive-badge::before {
content: '';
width: 7px;
height: 7px;
border-radius: 50%;
background: var(--cyan);
animation: pe-pulse 1.6s ease-in-out infinite;
flex-shrink: 0;
}
@keyframes pe-pulse {
0%, 100% { opacity: 1; transform: scale(1); }
50% { opacity: 0.4; transform: scale(0.7); }
}
.pe-interact-hint {
display: flex;
align-items: center;
gap: 0.6rem;
margin-top: 0.9rem;
padding: 0.85rem 1.1rem;
background: rgba(0,229,255,0.05);
border: 1px dashed rgba(0,229,255,0.25);
border-radius: 8px;
font-size: 1.05rem;
color: var(--muted);
}
.pe-interact-hint span { font-size: 1.25rem; }
.pe-divider { border: none; border-top: 1px solid var(--border); margin: 2.5rem 0; }
@media(max-width:520px) {
.pe-anns { grid-template-columns: 1fr; }
.pe-sm-demo { flex-direction: column; }
.pe-arrow { transform: rotate(90deg); }
}
&lt;/style&gt;
&lt;div class="pe-article"&gt;
&lt;div class="pe-video"&gt;
&lt;div class="pe-video-header"&gt;▶ Full Video Explainer — covering how LLMs work, from next-token prediction to attention, training, and why hallucinations are inevitable&lt;/div&gt;
&lt;video controls poster="/images/llms-are-pe/hero.jpg"&gt;
&lt;source src="https://curiousbit.netlify.app/images/llms-are-pe/explainer.mp4" type="video/mp4" /&gt;
Your browser doesn't support HTML5 video.
&lt;/video&gt;
&lt;/div&gt;
&lt;nav class="pe-toc"&gt;
&lt;h3&gt;In this article&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="#pe-s1"&gt;What ChatGPT and Claude actually are&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s2"&gt;The one job every LLM does&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s3"&gt;The probability formula (interactive)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s4"&gt;Softmax: turning scores into probabilities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s5"&gt;How it learns: cross-entropy loss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s6"&gt;The Transformer architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s7"&gt;Self-attention: every word watches every word&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s8"&gt;How text is actually generated&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s9"&gt;Temperature: controlling randomness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s10"&gt;Why it sometimes lies (hallucinations)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s11"&gt;Key limitations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s12"&gt;What's next&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/nav&gt;
&lt;p&gt;You've used ChatGPT. You've heard the word "AI" a thousand times this year. But here's something almost nobody explains clearly: the thing powering these tools is &lt;strong&gt;not intelligent in any human sense&lt;/strong&gt;. It doesn't think. It doesn't understand. It doesn't have goals.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/llms-are-pe/hero.jpg" alt="Deep-Learning" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap');
.pe-article {
--bg: #070b14;
--bg2: #0d1423;
--bg3: #111827;
--cyan: #00e5ff;
--purple: #a855f7;
--gold: #fbbf24;
--text: #e2e8f0;
--muted: #94a3b8;
--border: #1e293b;
--danger: #f87171;
font-family: 'Space Grotesk', system-ui, sans-serif;
font-size: 1.08rem;
line-height: 1.85;
color: var(--text);
}
/* TOC */
.pe-toc {
background: var(--bg2);
border: 1px solid var(--border);
border-left: 3px solid var(--cyan);
border-radius: 10px;
padding: 1.25rem 1.75rem;
margin: 2rem 0;
}
.pe-toc h3 {
font-size: 0.95rem;
letter-spacing: 0.18em;
text-transform: uppercase;
color: var(--cyan);
margin: 0 0 1rem;
}
.pe-toc ol { padding-left: 1.3rem; margin: 0; }
.pe-toc li { margin-bottom: 0.6rem; }
.pe-toc a { color: var(--muted); text-decoration: none; font-size: 1.15rem; font-weight: 600; transition: color 0.2s; }
.pe-toc a:hover { color: var(--cyan); }
/* Video */
.pe-video { margin: 2rem 0; border-radius: 12px; overflow: hidden; border: 1px solid var(--border); background: #000; }
.pe-video video { width: 100%; display: block; }
.pe-video-header {
background: var(--bg2);
padding: 1rem 1.4rem;
font-size: 1.15rem;
font-weight: 600;
color: var(--cyan);
border-bottom: 1px solid var(--border);
line-height: 1.5;
}
/* Typography */
.pe-article h2 {
font-size: 1.75rem;
font-weight: 700;
color: #fff;
margin: 3rem 0 0.9rem;
padding-bottom: 0.45rem;
border-bottom: 1px solid var(--border);
}
.pe-sec-num { color: var(--cyan); font-size: 1rem; font-weight: 600; display: block; margin-bottom: 0.2rem; letter-spacing: 0.1em; }
.pe-article p { margin-bottom: 1.1rem; }
.pe-article strong { color: #fff; }
.pe-em { color: var(--gold); }
/* Callouts */
.pe-callout { background: var(--bg2); border-left: 4px solid var(--purple); border-radius: 0 8px 8px 0; padding: 1.4rem 1.8rem; margin: 1.5rem 0; font-size: 1.4rem; color: var(--muted); line-height: 1.75; }
.pe-callout.cy { border-color: var(--cyan); }
.pe-callout.gd { border-color: var(--gold); }
.pe-callout strong { color: var(--text); }
/* Compare table */
.pe-table { width: 100%; border-collapse: collapse; font-size: 1rem; margin: 1.25rem 0; }
.pe-table th { text-align: left; padding: 0.7rem 1rem; background: var(--bg2); color: var(--cyan); font-size: 0.85rem; letter-spacing: 0.08em; text-transform: uppercase; border-bottom: 1px solid var(--border); }
.pe-table td { padding: 0.85rem 1rem; border-bottom: 1px solid var(--border); color: var(--muted); vertical-align: top; line-height: 1.6; }
.pe-table td:first-child { color: var(--text); font-weight: 500; }
.pe-table tr:hover td { background: var(--bg2); }
/* Formula boxes */
.pe-box { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.75rem; margin: 1.75rem 0; }
.pe-box-title { font-size: 0.95rem; letter-spacing: 0.15em; text-transform: uppercase; color: var(--purple); margin-bottom: 1rem; }
/* Anim 1 — token prediction */
.pe-sentence { font-size: 1.3rem; font-family: 'JetBrains Mono', monospace; color: var(--text); min-height: 2rem; margin-bottom: 1.1rem; }
.pe-cursor { display: inline-block; width: 2px; height: 1em; background: var(--cyan); animation: pe-blink 0.8s infinite; vertical-align: middle; margin-left: 2px; }
@keyframes pe-blink { 0%,100%{opacity:1} 50%{opacity:0} }
.pe-prob-bars { display: flex; flex-direction: column; gap: 0.55rem; }
.pe-prob-row { display: flex; align-items: center; gap: 0.8rem; font-size: 1.1rem; }
.pe-prob-lbl { width: 80px; text-align: right; color: var(--muted); font-family: 'JetBrains Mono', monospace; flex-shrink: 0; }
.pe-prob-track { flex: 1; height: 26px; background: var(--bg3); border-radius: 5px; overflow: hidden; }
.pe-prob-fill { height: 100%; background: var(--cyan); border-radius: 5px; transition: width 0.55s cubic-bezier(0.4,0,0.2,1); width: 0; }
.pe-prob-fill.win { background: var(--gold); }
.pe-prob-pct { width: 50px; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--muted); }
/* Math display */
.pe-math { font-family: 'Georgia', serif; font-size: 1.45rem; color: var(--gold); text-align: center; padding: 1.3rem; background: var(--bg3); border-radius: 8px; margin-bottom: 0.9rem; }
.pe-term { display: inline; opacity: 0; transition: opacity 0.4s; cursor: help; position: relative; }
.pe-term.on { opacity: 1; }
.pe-term:hover::after { content: attr(data-tip); position: absolute; bottom: 115%; left: 50%; transform: translateX(-50%); background: var(--bg); border: 1px solid var(--purple); color: var(--text); padding: 0.4rem 0.85rem; border-radius: 6px; font-family: 'Space Grotesk', sans-serif; font-size: 0.88rem; white-space: nowrap; z-index: 20; }
.pe-anns { display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
.pe-ann { background: var(--bg3); border-radius: 6px; padding: 0.55rem 0.85rem; font-size: 0.93rem; opacity: 0; transition: opacity 0.5s; }
.pe-ann.on { opacity: 1; }
.pe-ann-sym { color: var(--gold); font-family: 'JetBrains Mono', monospace; font-weight: bold; }
.pe-ann-desc { color: var(--muted); }
/* Softmax */
.pe-sm-demo { display: flex; gap: 1rem; align-items: flex-start; flex-wrap: wrap; }
.pe-sm-col { flex: 1; min-width: 160px; }
.pe-col-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.75rem; }
.pe-logit-row { display: flex; align-items: center; gap: 0.6rem; margin-bottom: 0.55rem; font-size: 0.97rem; font-family: 'JetBrains Mono', monospace; }
.pe-logit-w { width: 60px; color: var(--text); }
.pe-logit-v { padding: 0.22rem 0.6rem; border-radius: 4px; font-size: 0.92rem; }
.pe-logit-v.neg { background: rgba(248,113,113,0.15); color: var(--danger); }
.pe-logit-v.pos { background: rgba(0,229,255,0.1); color: var(--cyan); }
.pe-sm-bar { height: 22px; border-radius: 4px; background: var(--purple); transition: width 0.75s cubic-bezier(0.4,0,0.2,1); width: 0; display: flex; align-items: center; padding-left: 7px; font-size: 0.86rem; color: #fff; overflow: hidden; white-space: nowrap; }
.pe-arrow { display: flex; align-items: center; justify-content: center; padding-top: 1.4rem; font-size: 1.6rem; color: var(--cyan); }
/* Loss */
.pe-loss-wrap { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: start; }
@media(max-width:500px) { .pe-loss-wrap { grid-template-columns: 1fr; } }
.pe-loss-num { font-size: 2.8rem; font-weight: 800; font-family: 'JetBrains Mono', monospace; color: var(--danger); transition: color 0.5s; line-height: 1; }
.pe-loss-num.good { color: #4ade80; }
.pe-loss-lbl { font-size: 0.88rem; color: var(--muted); margin-top: 0.3rem; }
.pe-loss-slider label { font-size: 0.92rem; color: var(--muted); display: block; margin: 0.8rem 0 0.3rem; }
input[type=range] { width: 100%; accent-color: var(--cyan); }
.pe-loss-formula { background: var(--bg3); border-radius: 8px; padding: 1.1rem; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--text); line-height: 2.1; }
.pe-lf-hl { color: var(--gold); }
.pe-lf-res { color: var(--cyan); font-weight: bold; }
/* Attention */
.pe-attn-words { display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 0.9rem; }
.pe-attn-word { padding: 0.4rem 0.8rem; border-radius: 6px; background: var(--bg3); border: 1px solid var(--border); font-size: 1rem; cursor: pointer; transition: all 0.2s; user-select: none; }
.pe-attn-word:hover { border-color: var(--cyan); }
.pe-attn-word.sel { background: rgba(0,229,255,0.12); border-color: var(--cyan); color: var(--cyan); }
/* Attention formula cards */
.pe-qkv { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
@media(max-width:480px) { .pe-qkv { grid-template-columns: 1fr; } }
.pe-qkv-card { border-radius: 6px; padding: 0.75rem 0.9rem; font-size: 0.93rem; }
/* Temperature */
.pe-temp-row-ctrl { display: flex; align-items: center; gap: 1rem; }
.pe-temp-big { font-size: 1.6rem; font-family: 'JetBrains Mono', monospace; font-weight: 700; color: var(--cyan); width: 56px; flex-shrink: 0; }
.pe-temp-lbl { font-size: 0.92rem; color: var(--muted); margin-top: 0.2rem; }
.pe-tbars { display: flex; flex-direction: column; gap: 0.45rem; margin-top: 0.9rem; }
.pe-trow { display: flex; align-items: center; gap: 0.65rem; font-size: 0.95rem; font-family: 'JetBrains Mono', monospace; }
.pe-tlbl { width: 64px; color: var(--muted); text-align: right; flex-shrink: 0; }
.pe-ttrack { flex: 1; height: 19px; background: var(--bg3); border-radius: 4px; overflow: hidden; }
.pe-tfill { height: 100%; border-radius: 4px; background: var(--purple); transition: width 0.5s cubic-bezier(0.4,0,0.2,1); }
.pe-tpct { width: 48px; text-align: right; color: var(--muted); }
/* Limits */
.pe-limits { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1rem; margin: 1.5rem 0; }
@media(max-width:640px) { .pe-limits { grid-template-columns: 1fr; } }
.pe-limit-card { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.5rem 1.6rem; transition: border-color 0.2s; display: flex; flex-direction: column; gap: 0.4rem; }
.pe-limit-card:hover { border-color: var(--purple); }
.pe-limit-icon { font-size: 2.2rem; line-height: 1; }
.pe-limit-title { font-weight: 700; color: var(--text); font-size: 1.35rem; margin: 0; }
.pe-limit-desc { font-size: 1.15rem; color: var(--muted); line-height: 1.65; margin: 0; }
/* Dual meters */
.pe-meters { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.25rem 0; }
.pe-meter { border-radius: 8px; padding: 1.1rem; text-align: center; }
.pe-meter-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; margin-bottom: 0.4rem; }
.pe-meter-val { font-size: 2.2rem; font-weight: 800; }
/* Buttons */
.pe-btn {
margin-top: 0.9rem;
background: var(--bg3);
border: 1px solid var(--cyan);
color: var(--cyan);
padding: 0.4rem 1.1rem;
border-radius: 6px;
cursor: pointer;
font-size: 0.82rem;
font-family: 'Space Grotesk', sans-serif;
transition: background 0.2s;
}
.pe-btn:hover { background: rgba(0,229,255,0.1); }
.pe-btn-pur { border-color: var(--purple); color: var(--purple); }
.pe-btn-pur:hover { background: rgba(168,85,247,0.1); }
/* Interactive badge */
.pe-interactive-header {
display: flex;
align-items: center;
justify-content: space-between;
margin-bottom: 1rem;
}
.pe-interactive-header .pe-box-title { margin-bottom: 0; }
.pe-interactive-badge {
display: inline-flex;
align-items: center;
gap: 0.4rem;
background: rgba(0,229,255,0.08);
border: 1px solid var(--cyan);
color: var(--cyan);
font-size: 0.72rem;
font-weight: 700;
letter-spacing: 0.12em;
text-transform: uppercase;
padding: 0.3rem 0.75rem;
border-radius: 999px;
flex-shrink: 0;
}
.pe-interactive-badge::before {
content: '';
width: 7px;
height: 7px;
border-radius: 50%;
background: var(--cyan);
animation: pe-pulse 1.6s ease-in-out infinite;
flex-shrink: 0;
}
@keyframes pe-pulse {
0%, 100% { opacity: 1; transform: scale(1); }
50% { opacity: 0.4; transform: scale(0.7); }
}
.pe-interact-hint {
display: flex;
align-items: center;
gap: 0.6rem;
margin-top: 0.9rem;
padding: 0.85rem 1.1rem;
background: rgba(0,229,255,0.05);
border: 1px dashed rgba(0,229,255,0.25);
border-radius: 8px;
font-size: 1.05rem;
color: var(--muted);
}
.pe-interact-hint span { font-size: 1.25rem; }
.pe-divider { border: none; border-top: 1px solid var(--border); margin: 2.5rem 0; }
@media(max-width:520px) {
.pe-anns { grid-template-columns: 1fr; }
.pe-sm-demo { flex-direction: column; }
.pe-arrow { transform: rotate(90deg); }
}</style><div class="pe-article"><div class="pe-video"><div class="pe-video-header">▶ Full Video Explainer — covering how LLMs work, from next-token prediction to attention, training, and why hallucinations are inevitable</div><video controls poster="/images/llms-are-pe/hero.jpg"><source src="/images/llms-are-pe/explainer.mp4" type="video/mp4"/>
Your browser doesn't support HTML5 video.</video></div><nav class="pe-toc"><h3>In this article</h3><ol><li><a href="#pe-s1">What ChatGPT and Claude actually are</a></li><li><a href="#pe-s2">The one job every LLM does</a></li><li><a href="#pe-s3">The probability formula (interactive)</a></li><li><a href="#pe-s4">Softmax: turning scores into probabilities</a></li><li><a href="#pe-s5">How it learns: cross-entropy loss</a></li><li><a href="#pe-s6">The Transformer architecture</a></li><li><a href="#pe-s7">Self-attention: every word watches every word</a></li><li><a href="#pe-s8">How text is actually generated</a></li><li><a href="#pe-s9">Temperature: controlling randomness</a></li><li><a href="#pe-s10">Why it sometimes lies (hallucinations)</a></li><li><a href="#pe-s11">Key limitations</a></li><li><a href="#pe-s12">What's next</a></li></ol></nav><p>You've used ChatGPT. You've heard the word "AI" a thousand times this year. But here's something almost nobody explains clearly: the thing powering these tools is <strong>not intelligent in any human sense</strong>. It doesn't think. It doesn't understand. It doesn't have goals.</p><p>It is, at its core, a <span class="pe-em">very sophisticated next-word predictor</span>: a probability engine trained on an enormous body of text, much of it drawn from the public internet. Once you understand this, everything else — its strengths, its failures, its weirdness — clicks into place.</p><div class="pe-callout cy"><strong>Interactive animations ahead:</strong> Press buttons and move sliders as you go — seeing the math move makes it stick.</div><h2 id="pe-s1"><span class="pe-sec-num">01 —</span> What ChatGPT and Claude actually are</h2><p>The term "Artificial Intelligence" conjures images of something that thinks, reasons, and understands — a mind in a machine. 
That framing is compelling, but misleading when applied to today's large language models (LLMs).</p><p>What you're actually talking to is an <strong>autoregressive probabilistic model</strong>. Every word it generates is the result of asking one question, over and over again:</p><div class="pe-callout gd"><strong>"Given everything written so far, what word is most likely to come next?"</strong></div><p>That's it. Do that billions of times on internet-scale text, and you get something that looks uncannily like reasoning. But it is, fundamentally, pattern matching at extraordinary scale — not understanding, consciousness, or genuine intelligence.</p><table class="pe-table"><thead><tr><th>What you see</th><th>What's actually happening</th><th>The catch</th></tr></thead><tbody><tr><td>It "reasons"</td><td>Pattern-matches reasoning traces from training data</td><td>Breaks on genuinely novel problems</td></tr><tr><td>It "knows facts"</td><td>Recalls high-frequency statistical associations</td><td>Hallucinates on rare edge cases</td></tr><tr><td>It's "creative"</td><td>Samples from learned creative pattern spaces</td><td>Derivative — remixes, doesn't invent</td></tr><tr><td>It has "opinions"</td><td>Outputs tokens shaped by training + alignment</td><td>No actual beliefs internally</td></tr></tbody></table><h2 id="pe-s2"><span class="pe-sec-num">02 —</span> The one job every LLM does</h2><p>Let's make this concrete. Below is a live simulation of next-token prediction. 
Press <strong>"Predict next token"</strong> and watch the model pick the next word based on probability scores.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">🎯 Next-Token Prediction</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-sentence" id="pe-sentence">The cat sat on the<span class="pe-cursor"></span></div><div class="pe-prob-bars" id="pe-prob-bars"></div><div class="pe-interact-hint"><span>👇</span> Press the button to watch the model predict — one token at a time.</div><button class="pe-btn" onclick="pePredict()">Predict next token →</button></div><p>Notice the bars: each candidate word gets a probability score. The model doesn't "decide" in any human sense — it samples from this distribution. The highest-probability word is chosen most often, but not always. That's where both creativity and errors come from.</p><h2 id="pe-s3"><span class="pe-sec-num">03 —</span> The probability formula</h2><p>Here's the mathematical heart of it. <strong>Hover each term</strong> for a plain-English tooltip, then press the button to reveal the full breakdown piece by piece.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">📐 Probability Formula</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>🖱️</span> Hover any term for a plain-English tooltip. 
Press the button to reveal the formula step by step.</div><div class="pe-math"><span class="pe-term" id="pet0" data-tip="P = Probability of">P</span><span class="pe-term" id="pet1" data-tip="wₜ = the specific token we're predicting">(w<sub>t</sub></span><span class="pe-term" id="pet2" data-tip="| = 'given all of this before it'"> |</span><span class="pe-term" id="pet3" data-tip="w&lt;t = every token that came before in the context"> w<sub>&lt;t</sub></span><span class="pe-term" id="pet4" data-tip="; θ = the model's billions of learned parameters"> ; θ)</span><span class="pe-term" id="pet5" data-tip="= the output we calculate"> =</span><span class="pe-term" id="pet6" data-tip="softmax converts raw scores into a proper probability distribution summing to 1"> softmax(logits<sub>t</sub>)</span></div><div class="pe-anns" id="pe-anns"><div class="pe-ann" id="pea0"><span class="pe-ann-sym">wₜ</span> — <span class="pe-ann-desc">The next token to predict</span></div><div class="pe-ann" id="pea1"><span class="pe-ann-sym">w&lt;t</span> — <span class="pe-ann-desc">All previous tokens (the context)</span></div><div class="pe-ann" id="pea2"><span class="pe-ann-sym">θ</span> — <span class="pe-ann-desc">Billions of learned parameters</span></div><div class="pe-ann" id="pea3"><span class="pe-ann-sym">softmax</span> — <span class="pe-ann-desc">Converts scores → probabilities (sum = 1)</span></div></div><button class="pe-btn" onclick="peRevealFormula()">Reveal formula step by step →</button></div><p>Plain English: <span class="pe-em">"Given everything typed so far, and everything the model learned during training, what is the probability of each possible next word?"</span> The model scores every token in its vocabulary (often 50,000 or more entries), and softmax turns those raw scores into probabilities that add up to exactly 1.0.</p><h2 id="pe-s4"><span class="pe-sec-num">04 —</span> Softmax: raw scores → probabilities</h2><p>The model internally produces a raw score (called a <strong>logit</strong>) for 
every possible next word. Logits can be any number — positive, negative, large, small. They're not probabilities yet. The <strong>softmax</strong> function converts them into a clean distribution. Press the button to watch the transformation.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">⚡ Softmax Transform</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>👇</span> Press the button to watch raw scores transform into probabilities.</div><div class="pe-sm-demo"><div class="pe-sm-col"><div class="pe-col-lbl">Raw Logits (scores)</div><div id="pe-logits"></div></div><div class="pe-arrow" id="pe-sm-arrow" style="opacity:0.3">→</div><div class="pe-sm-col"><div class="pe-col-lbl">After Softmax (probabilities)</div><div id="pe-softmax"></div></div></div><button class="pe-btn pe-btn-pur" onclick="peSoftmax()">Run softmax →</button></div><p>Notice: even the most negative logit still gets a small non-zero probability after softmax. The model never completely rules anything out — it just makes some words astronomically unlikely. This is partly why LLMs occasionally produce bizarre outputs: a 0.001% token still gets picked sometimes.</p><h2 id="pe-s5"><span class="pe-sec-num">05 —</span>How it learns: cross-entropy loss</h2><p>During training, the model sees a sentence with the last word hidden and makes a prediction. The training algorithm asks: <span class="pe-em">"How wrong were you?"</span> The measure of wrongness is <strong>cross-entropy loss</strong>.</p><p>The formula: <code style="color:var(--gold);background:var(--bg3);padding:2px 8px;border-radius:4px;font-family:'JetBrains Mono',monospace;">ℒ = −log P(correct word)</code>. If the model assigns 100% probability to the right word, loss = 0. 
If it assigns 1%, the loss is about 4.6 (using the natural log). <strong>Drag the slider</strong> to see this in action.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">📉 Cross-Entropy Loss</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>🎚️</span> Drag the slider to change the model's confidence and watch the loss recalculate live.</div><div class="pe-loss-wrap"><div><div class="pe-loss-num" id="pe-loss-num">1.47</div><div class="pe-loss-lbl">Loss ℒ = −log(p)</div><div class="pe-loss-slider"><label>Model's confidence in correct word: <strong id="pe-conf-lbl">23%</strong></label><input type="range" id="pe-conf-slider" min="1" max="99" value="23" oninput="peLoss(this.value)"/></div></div><div class="pe-loss-formula">
Correct word: <span class="pe-lf-hl">"lazy"</span><br>
P("lazy"): <span class="pe-lf-hl" id="pe-lf-p">0.23</span><br><br>
ℒ = −log(<span class="pe-lf-hl" id="pe-lf-p2">0.23</span>)<br>
ℒ = <span class="pe-lf-res" id="pe-lf-res">1.47</span><br><br><span style="color:var(--muted);font-size:0.76rem;" id="pe-lf-verdict">High loss → big update</span></div></div></div><h2 id="pe-s6"><span class="pe-sec-num">06 —</span>The Transformer: the machine inside</h2><p>The specific architecture that makes modern LLMs work is called the <strong>Transformer</strong>, introduced in a landmark 2017 Google paper. Every major LLM today — GPT-4, Claude, Gemini, Llama — is built on this design.</p><p>A Transformer processes your text through many stacked layers. Each layer has two main components:</p><div class="pe-callout"><strong>Multi-Head Self-Attention</strong> — Every word simultaneously looks at every other word, learning which relationships matter. This is the core insight.<br><br><strong>Feed-Forward Network</strong> — A dense neural network that processes each token's information independently, after attention has been applied.</div><p>A large model like GPT-3 stacks 96 of these layers; the layer counts of newer models such as GPT-4 have not been published. With enough layers, parameters, and training data, emergent abilities appear — code generation, translation, basic reasoning — that nobody explicitly programmed. They fall out of the math at scale.</p><h2 id="pe-s7"><span class="pe-sec-num">07 —</span>Self-attention: every word watches every word</h2><p>Before the Transformer, AI models processed text word by word in sequence, making it hard to connect things far apart in a sentence. Self-attention solves this by letting every word simultaneously evaluate its relationship to every other word. <strong>Click a word</strong> to see its attention weights.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">🔍 Self-Attention Weights</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-attn-words" id="pe-attn-words"></div><div id="pe-attn-grid"></div><div class="pe-interact-hint"><span>👆</span> Click any word above to see how it attends to every other word. 
Brighter = stronger attention — notice "it" lights up "animal."</div></div><div class="pe-box"><div class="pe-box-title">📐 The Attention Equation</div><div class="pe-math" style="font-size:1.05rem;">
Attention(Q, K, V) = softmax(<span style="color:var(--cyan);">QKᵀ</span> /<span style="color:var(--gold);">√d<sub>k</sub></span> ) ·<span style="color:var(--purple);">V</span></div><div class="pe-qkv"><div class="pe-qkv-card" style="background:rgba(0,229,255,0.07);border:1px solid rgba(0,229,255,0.2);"><div style="color:var(--cyan);font-weight:700;margin-bottom:0.3rem;font-size:1rem;">Q — Query</div><div style="color:var(--muted);font-size:0.92rem;">"What am I looking for?"</div></div><div class="pe-qkv-card" style="background:rgba(251,191,36,0.07);border:1px solid rgba(251,191,36,0.2);"><div style="color:var(--gold);font-weight:700;margin-bottom:0.3rem;font-size:1rem;">K — Key</div><div style="color:var(--muted);font-size:0.92rem;">"What does each word offer?"</div></div><div class="pe-qkv-card" style="background:rgba(168,85,247,0.07);border:1px solid rgba(168,85,247,0.2);"><div style="color:var(--purple);font-weight:700;margin-bottom:0.3rem;font-size:1rem;">V — Value</div><div style="color:var(--muted);font-size:0.92rem;">"What info do I retrieve?"</div></div></div></div><h2 id="pe-s8"><span class="pe-sec-num">08 —</span>How text is actually generated</h2><p>When you press Send in any AI chat app, here is exactly what happens:</p><ol style="padding-left:1.3rem;margin-bottom:1.1rem;"><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Tokenization</strong> — Your message splits into tokens (subwords). "unbelievable" → ["un","believ","able"].</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Embedding</strong> — Each token becomes a high-dimensional vector capturing meaning and position.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Forward pass</strong> — Vectors flow through all Transformer layers. 
Attention and feed-forward happen, repeatedly.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Logits → Probabilities</strong> — The final layer scores every vocabulary word. Softmax converts the scores to probabilities.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Sampling</strong> — One word is chosen based on those probabilities.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Repeat</strong> — That word is appended and the whole process runs again until the model emits an end-of-sequence token or hits a length limit.</li></ol><div class="pe-callout"><strong>KV Caching:</strong> The model caches the Key and Value vectors of earlier tokens so it doesn't recompute them for every new token — making long responses computationally feasible.</div><h2 id="pe-s9"><span class="pe-sec-num">09 —</span>Temperature: controlling randomness</h2><p>When sampling the next word, you can control how random the selection is with a parameter called <strong>temperature</strong>. Drag the slider to see how it reshapes the probability distribution in real time.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">🌡️ Temperature Sampling</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>🎚️</span> Drag the slider left for predictable outputs, right for creative (or chaotic) ones.</div><div class="pe-temp-row-ctrl"><div><div class="pe-temp-big" id="pe-temp-val">1.0</div><div class="pe-temp-lbl" id="pe-temp-lbl">Balanced</div></div><input type="range" id="pe-temp-slider" min="1" max="30" value="10" style="flex:1;accent-color:var(--cyan);" oninput="peTemp(this.value)"/></div><div class="pe-tbars" id="pe-tbars"></div><div style="font-size:0.92rem;color:var(--muted);margin-top:0.7rem;">Formula: p'ᵢ = pᵢ<sup>1/T</sup> / Σ(pⱼ<sup>1/T</sup>)</div></div><p>Low temperature (e.g. 
0.2) makes the model nearly deterministic — it almost always picks the top word. High temperature (e.g. 2.0) flattens the distribution, giving unusual words a real chance. Many chat deployments default to somewhere between 0.7 and 1.0.</p><h2 id="pe-s10"><span class="pe-sec-num">10 —</span>Why it sometimes lies (hallucinations)</h2><p>One of the most misunderstood LLM behaviors is <strong>hallucination</strong> — when the model confidently states something false. This isn't a bug to be patched away. It's a direct consequence of the architecture.</p><p>The model has no internal truth checker. On its own, it has no access to the real world. It only knows: <span class="pe-em">what sequence of words tends to follow this sequence of words?</span> When asked something rare or obscure, the model fills the gap with statistically plausible text — which may be completely wrong.</p><div class="pe-callout gd"><strong>Analogy:</strong> Imagine someone who has read every book in a library but never left the building. Ask what the weather is like outside — they'll give a confident, well-reasoned answer based on weather descriptions they've read. It might be completely wrong.</div><div class="pe-meters"><div class="pe-meter" style="background:rgba(248,113,113,0.08);border:1px solid var(--danger);"><div class="pe-meter-lbl" style="color:var(--danger);">Ground Truth Access</div><div class="pe-meter-val" style="color:var(--danger);">NONE</div></div><div class="pe-meter" style="background:rgba(74,222,128,0.08);border:1px solid #4ade80;"><div class="pe-meter-lbl" style="color:#4ade80;">Statistical Plausibility</div><div class="pe-meter-val" style="color:#4ade80;">HIGH</div></div></div><h2 id="pe-s11"><span class="pe-sec-num">11 —</span>Key limitations to know</h2><p>Understanding these isn't pessimism — it's how you use these tools well.</p><div class="pe-limits"><div class="pe-limit-card"><div class="pe-limit-icon">📏</div><div class="pe-limit-title">Context Window</div><div class="pe-limit-desc">Fixed memory. 
Older models: ~4K tokens. Newer ones: 1M tokens or more. Anything beyond the window is completely invisible to the model.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">🌀</div><div class="pe-limit-title">No Persistent Memory</div><div class="pe-limit-desc">By default, every conversation starts fresh. The model has no memory of past sessions unless that history is explicitly included in the prompt.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">🎲</div><div class="pe-limit-title">Stochasticity</div><div class="pe-limit-desc">Same prompt, potentially different outputs. Sampling is random by design, so outputs can vary even at low temperatures.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">🔓</div><div class="pe-limit-title">Jailbreaks</div><div class="pe-limit-desc">Safety training is pattern-based. Clever prompting can sometimes bypass it because the model is still a pattern matcher at heart.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">💭</div><div class="pe-limit-title">Hallucinations</div><div class="pe-limit-desc">Inevitable on low-frequency knowledge. No fact-checker means confident errors are always possible. Verify important claims.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">⚡</div><div class="pe-limit-title">Quadratic Cost</div><div class="pe-limit-desc">Attention cost grows quadratically with context length. 
Techniques like FlashAttention cut the memory overhead and speed things up, but the quadratic compute of standard attention remains a fundamental constraint.</div></div></div><h2 id="pe-s12"><span class="pe-sec-num">12 —</span>What's next</h2><p>The probability-engine core remains — but researchers are building powerful layers on top. <strong>RAG (Retrieval-Augmented Generation)</strong> gives the model access to real documents at query time, which can substantially reduce hallucination on factual tasks. <strong>Agentic systems</strong> let LLMs use tools, execute code, and iterate on their outputs. <strong>Reasoning models</strong> generate long internal chains of thought before answering, improving performance on math and logic. And <strong>multimodal models</strong> extend the same probabilistic core to images and audio.</p><p>None of these change the fundamental nature of what an LLM is. They all sit on top of the same next-token prediction engine. Understanding that foundation is what makes you a sharper thinker about where this technology is — and isn't — going.</p><div class="pe-callout cy"><strong>The bottom line:</strong> LLMs are extraordinary pattern-recognition engines that have scaled statistical prediction to the point of producing genuinely useful, sometimes astonishing outputs. They are not intelligent in any human sense. Knowing this — really knowing it — is what separates clear thinking about AI from hype.</div><hr class="pe-divider"/><p style="color:var(--muted);font-size:0.95rem;">Video generated with Grok Imagine. Animations built with vanilla JavaScript.</p></div><script>
(function() {
// ── Token Prediction ──
const peSeqs = [
{ prefix: "The cat sat on the", cands: [
{w:"mat",p:62,win:true},{w:"floor",p:18},{w:"rug",p:11},{w:"roof",p:5},{w:"couch",p:4}
]},
{ prefix: "The cat sat on the mat and", cands: [
{w:"looked",p:41,win:true},{w:"waited",p:28},{w:"purred",p:17},{w:"slept",p:9},{w:"yawned",p:5}
]},
{ prefix: "The cat sat on the mat and looked", cands: [
{w:"up",p:54,win:true},{w:"around",p:22},{w:"out",p:14},{w:"away",p:7},{w:"back",p:3}
]},
{ prefix: "The cat sat on the mat and looked up at", cands: [
{w:"the",p:58,win:true},{w:"me",p:22},{w:"nothing",p:11},{w:"her",p:6},{w:"him",p:3}
]},
];
var peIdx = 0;
function peRenderBars(cands) {
var c = document.getElementById('pe-prob-bars');
if (!c) return;
c.innerHTML = '';
cands.forEach(function(cd, i) {
var row = document.createElement('div');
row.className = 'pe-prob-row';
row.innerHTML = '<div class="pe-prob-lbl">'+cd.w+'</div><div class="pe-prob-track"><div class="pe-prob-fill'+(cd.win?' win':'')+'" id="pepf'+i+'"></div></div><div class="pe-prob-pct">'+cd.p+'%</div>';
c.appendChild(row);
});
setTimeout(function() {
cands.forEach(function(cd, i) {
var el = document.getElementById('pepf'+i);
if (el) el.style.width = cd.p+'%';
});
}, 60);
}
window.pePredict = function() {
var seq = peSeqs[peIdx % peSeqs.length];
peRenderBars(seq.cands);
setTimeout(function() {
var winner = seq.cands.find(function(c){return c.win;});
var el = document.getElementById('pe-sentence');
if (el) el.innerHTML = seq.prefix+' <span style="color:var(--gold);font-weight:bold;">'+winner.w+'</span><span class="pe-cursor"></span>';
}, 800);
peIdx++;
};
peRenderBars(peSeqs[0].cands);
// ── Formula Reveal ──
var peTerms = ['pet0','pet1','pet2','pet3','pet4','pet5','pet6'];
var peAnns = ['pea0','pea1','pea2','pea3'];
window.peRevealFormula = function() {
peTerms.forEach(function(id){var el=document.getElementById(id);if(el)el.classList.remove('on');});
peAnns.forEach(function(id){var el=document.getElementById(id);if(el)el.classList.remove('on');});
peTerms.forEach(function(id, i){ setTimeout(function(){ var el=document.getElementById(id);if(el)el.classList.add('on'); }, i*280); });
peAnns.forEach(function(id, i){ setTimeout(function(){ var el=document.getElementById(id);if(el)el.classList.add('on'); }, 2100+i*230); });
};
// ── Softmax ──
var peSMData = [
{w:'mat',l:4.2},{w:'floor',l:1.8},{w:'rug',l:0.9},{w:'table',l:-0.4},{w:'sky',l:-2.1}
];
(function initSM() {
var li = document.getElementById('pe-logits');
var si = document.getElementById('pe-softmax');
if (!li||!si) return;
li.innerHTML = '';
si.innerHTML = '';
peSMData.forEach(function(d,i) {
li.innerHTML += '<div class="pe-logit-row"><span class="pe-logit-w">'+d.w+'</span><span class="pe-logit-v '+(d.l<0?'neg':'pos')+'">'+(d.l>0?'+':'')+d.l+'</span></div>';
si.innerHTML += '<div class="pe-logit-row"><span class="pe-logit-w">'+d.w+'</span><div class="pe-sm-bar" id="pesm'+i+'"></div></div>';
});
})();
window.peSoftmax = function() {
var exps = peSMData.map(function(d){return Math.exp(d.l);});
var sum = exps.reduce(function(a,b){return a+b;},0);
var probs = exps.map(function(e){return e/sum;});
var arrow = document.getElementById('pe-sm-arrow');
if (arrow) arrow.style.opacity = '1';
probs.forEach(function(p,i){
setTimeout(function(){
var bar = document.getElementById('pesm'+i);
if (!bar) return;
bar.style.width = Math.max(p*150,0)+'px';
bar.textContent = (p*100).toFixed(1)+'%';
}, i*140);
});
};
// ── Loss ──
window.peLoss = function(val) {
var p = val/100;
var loss = -Math.log(p);
var cl = document.getElementById('pe-conf-lbl');
var ln = document.getElementById('pe-loss-num');
var lp = document.getElementById('pe-lf-p');
var lp2 = document.getElementById('pe-lf-p2');
var lr = document.getElementById('pe-lf-res');
var lv = document.getElementById('pe-lf-verdict');
if(cl) cl.textContent = val+'%';
if(ln){ ln.textContent = loss.toFixed(2); ln.classList.toggle('good', loss < 0.5); }
if(lp) lp.textContent = p.toFixed(2);
if(lp2) lp2.textContent = p.toFixed(2);
if(lr) lr.textContent = loss.toFixed(2);
if(lv) lv.textContent = loss < 0.5 ? 'Low loss → small parameter update ✓' : 'High loss → big parameter update ↑';
};
// ── Attention ──
var peAttnWords = ["The","animal","didn't","cross","the","street","because","it","was","tired"];
var peAttnW = {
"The": [0.5,0.1,0.05,0.05,0.1,0.05,0.05,0.05,0.03,0.02],
"animal": [0.05,0.55,0.05,0.05,0.05,0.05,0.05,0.1,0.03,0.02],
"didn't": [0.04,0.08,0.5,0.08,0.04,0.08,0.06,0.05,0.05,0.02],
"cross": [0.03,0.05,0.08,0.5,0.03,0.12,0.08,0.05,0.04,0.02],
"the": [0.08,0.05,0.04,0.05,0.5,0.12,0.05,0.04,0.05,0.02],
"street": [0.04,0.05,0.06,0.12,0.1,0.45,0.06,0.05,0.05,0.02],
"because": [0.03,0.06,0.08,0.08,0.03,0.07,0.45,0.1,0.07,0.03],
"it": [0.03,0.38,0.07,0.06,0.03,0.06,0.1,0.15,0.09,0.03],
"was": [0.03,0.06,0.05,0.05,0.03,0.05,0.07,0.1,0.5,0.06],
"tired": [0.03,0.08,0.05,0.05,0.03,0.05,0.07,0.12,0.1,0.42],
};
var peSel = "it";
function peRenderAttn() {
var wc = document.getElementById('pe-attn-words');
if (!wc) return;
wc.innerHTML = '';
peAttnWords.forEach(function(w) {
var el = document.createElement('div');
el.className = 'pe-attn-word'+(w===peSel?' sel':'');
el.textContent = w;
el.onclick = function(){ peSel = w; peRenderAttn(); };
wc.appendChild(el);
});
var weights = peAttnW[peSel] || peAttnW["it"];
var wrap = document.getElementById('pe-attn-grid');
if (!wrap) return;
wrap.innerHTML = '';
var n = peAttnWords.length;
var grid = document.createElement('div');
grid.style.display = 'grid';
grid.style.gridTemplateColumns = 'repeat('+n+', 1fr)';
grid.style.gap = '3px';
weights.forEach(function(wt, j) {
var cell = document.createElement('div');
cell.style.height = '26px';
cell.style.borderRadius = '3px';
cell.style.background = 'rgba(0,229,255,'+wt+')';
cell.style.transition = 'background 0.4s';
cell.title = peSel+' → '+peAttnWords[j]+': '+(wt*100).toFixed(0)+'%';
grid.appendChild(cell);
});
var labelRow = document.createElement('div');
labelRow.style.display = 'grid';
labelRow.style.gridTemplateColumns = 'repeat('+n+', 1fr)';
labelRow.style.gap = '3px';
labelRow.style.marginTop = '4px';
peAttnWords.forEach(function(w) {
var lbl = document.createElement('div');
lbl.textContent = w;
lbl.style.fontSize = '0.6rem';
lbl.style.color = 'var(--muted)';
lbl.style.textAlign = 'center';
lbl.style.overflow = 'hidden';
lbl.style.textOverflow = 'ellipsis';
labelRow.appendChild(lbl);
});
wrap.appendChild(grid);
wrap.appendChild(labelRow);
}
peRenderAttn();
// ── Temperature ──
var peTempBase = [
{w:'mat',p:0.52},{w:'floor',p:0.22},{w:'rug',p:0.13},{w:'table',p:0.08},{w:'sky',p:0.03},{w:'cloud',p:0.02}
];
(function initTemp() {
var c = document.getElementById('pe-tbars');
if (!c) return;
c.innerHTML = '';
peTempBase.forEach(function(d,i) {
c.innerHTML += '<div class="pe-trow"><span class="pe-tlbl">'+d.w+'</span><div class="pe-ttrack"><div class="pe-tfill" id="petf'+i+'"></div></div><span class="pe-tpct" id="petp'+i+'">—</span></div>';
});
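// Deferred: window.peTemp is only assigned after this IIFE has run, so a direct call here would throw.
setTimeout(function(){ window.peTemp(10); }, 0);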
})();
window.peTemp = function(val) {
var T = val/10;
var dv = document.getElementById('pe-temp-val');
var dl = document.getElementById('pe-temp-lbl');
if (dv) dv.textContent = T.toFixed(1);
if (dl) {
if (T<0.5) dl.textContent = '🧊 Deterministic';
else if (T<0.8) dl.textContent = '🔵 Conservative';
else if (T<1.3) dl.textContent = '⚖️ Balanced — sweet spot';
else if (T<2.0) dl.textContent = '🔥 Creative';
else dl.textContent = '🌋 Chaotic';
}
var scaled = peTempBase.map(function(d){return Math.pow(d.p,1/T);});
var sum = scaled.reduce(function(a,b){return a+b;},0);
var probs = scaled.map(function(s){return s/sum;});
probs.forEach(function(p,i){
var f = document.getElementById('petf'+i);
var t = document.getElementById('petp'+i);
if(f) f.style.width = (p*100)+'%';
if(t) t.textContent = (p*100).toFixed(1)+'%';
});
};
})();
</script>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/llms-are-pe/hero.jpg" medium="image"><media:title type="plain">Deep-Learning</media:title></media:content><category>artificial-intelligence</category><category>llm</category><category>machine-learning</category><category>deep-learning</category><category>architecture</category><category>Knowledge Base</category></item><item><title>Knowledge Distillation: From Massive Models to Efficient Intelligence</title><link>https://curiousbit.netlify.app/knowledge-distillation-from-massive-models-to-efficient-intelligence/</link><guid isPermaLink="true">https://curiousbit.netlify.app/knowledge-distillation-from-massive-models-to-efficient-intelligence/</guid><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;There is a scene you have probably seen in countless films: a master craftsman, decades of experience locked in his hands, patiently guiding a young apprentice. The master does not hand over a textbook. He transfers something richer — intuition, nuance, an understanding of &lt;em&gt;why&lt;/em&gt; certain choices matter. The apprentice, unburdened by the master&amp;rsquo;s size and slowness, eventually moves faster and in some cases surpasses the teacher entirely.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/kd-master-apprentice.jpg" alt="Deep-Learning" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>There is a scene you have probably seen in countless films: a master craftsman, decades of experience locked in his hands, patiently guiding a young apprentice. The master does not hand over a textbook. He transfers something richer — intuition, nuance, an understanding of <em>why</em> certain choices matter. 
The apprentice, unburdened by the master&rsquo;s size and slowness, eventually moves faster and in some cases surpasses the teacher entirely.</p><p>Knowledge Distillation is that scene, rendered in mathematics.</p><p>Introduced formally by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean at Google in 2015, Knowledge Distillation (KD) is a model compression technique where a large, expensive model — the <strong>teacher</strong> — transfers its learned intelligence to a compact, deployable model — the <strong>student</strong>. A well-distilled student can retain over 90% of the teacher&rsquo;s accuracy while being far smaller and faster, in some reported cases up to 100× smaller.</p><p>This article takes you from the intuition all the way through to the advanced variants that are reshaping AI deployment in 2026.</p><hr><h2 id="the-problem-intelligence-is-expensive">The Problem: Intelligence Is Expensive</h2><p>Modern AI models are enormous. GPT-4 is estimated to contain over a trillion parameters. BERT-large has 340 million. These models achieve stunning accuracy — but they are cumbersome to deploy. Running a trillion-parameter model for every user query would require data centres the size of small cities.</p><p>The engineering instinct is to train a smaller model directly. But smaller models trained from scratch on raw data consistently underperform large ones. Why?</p><p>Because raw training data is <em>hard</em>. A cat photo labelled simply &ldquo;cat&rdquo; gives a small model very little to work with. A large model, however, does not just see &ldquo;cat&rdquo; — it sees a distribution of confidence across thousands of classes. 
&ldquo;Cat: 0.92, Lynx: 0.06, Tabby: 0.02.&rdquo; That probability distribution is enormously richer than the hard label.</p><p>Hinton called this richer signal <strong>dark knowledge</strong> — the information encoded in what the model <em>almost</em> predicted.</p><hr><h2 id="the-teacher-student-paradigm">The Teacher-Student Paradigm</h2><p><img src="/images/kd-master-apprentice.jpg" alt="A Renaissance master transfers glowing knowledge orbs to his apprentice"/></p><p>The core idea is elegant. Instead of training the student only on raw labelled data, you train it to <strong>mimic the teacher&rsquo;s output distribution</strong>.</p><p>You run every training example through the large teacher model. For each example, instead of a hard label (0 or 1), you collect the teacher&rsquo;s full <strong>soft target</strong> — the probability it assigns to every possible class. You then train the student to produce those same soft probability distributions.</p><p>The student loss function becomes:</p><pre tabindex="0"><code>Loss = α × (cross-entropy with hard labels)
     + (1-α) × (KL divergence from teacher soft targets)</code></pre><p>The blending weight <code>α</code> controls how much the student learns from the raw data versus the teacher&rsquo;s guidance. In practice, a small <code>α</code> (more weight on teacher targets) often works best.</p><hr><h2 id="soft-targets-and-dark-knowledge">Soft Targets and Dark Knowledge</h2><p><img src="/images/kd-dark-knowledge.jpg" alt="A Flemish alchemist distils the essence of a massive Teacher Model flask into a tiny Student Model vial"/></p><p>Hard labels are binary. Soft targets are continuous. That difference is enormous.</p><p>Consider an image of a dog that slightly resembles a wolf. A hard label says &ldquo;dog: 1, wolf: 0.&rdquo; A teacher that has seen millions of examples says &ldquo;dog: 0.84, wolf: 0.13, fox: 0.03.&rdquo; That residual probability on <em>wolf</em> carries genuine information about the visual ambiguity in the image. The student trained on soft targets learns not just the answer, but the <em>shape of uncertainty</em> around the answer.</p><p>This is the dark knowledge. It lives in the tails of the distribution — the non-zero probabilities on wrong answers — and it typically helps the student generalise better than one trained on hard labels alone.</p><hr><h2 id="temperature-the-control-knob">Temperature: The Control Knob</h2><p><img src="/images/kd-temperature.jpg" alt="A Renaissance philosopher adjusts the Temperature T dial on a celestial orrery, sharpening planets on the left and softening them to probability clouds on the right"/></p><p>Soft targets, by default, tend to be very peaked — the teacher is often highly confident in its top prediction, assigning 0.99 to the correct class and tiny residuals to everything else. At that extreme, the soft target is barely different from a hard label, and the dark knowledge disappears.</p><p>Hinton&rsquo;s solution was <strong>temperature scaling</strong>. 
Before computing the softmax, you divide the logits by a temperature parameter T:</p><pre tabindex="0"><code>p_i = exp(z_i / T) / Σ_j exp(z_j / T)</code></pre><p>At <strong>T = 1</strong> (standard), outputs are sharp and peaked.
At <strong>T &gt; 1</strong> (high temperature), outputs become softer and more spread, revealing the relative confidence structure across all classes.</p><p>During distillation, both teacher and student use the same elevated temperature (typically T = 3–5). This &ldquo;warms up&rdquo; the teacher&rsquo;s output into a richer, more informative distribution for the student to learn from. After training, the student is deployed with T = 1.</p><p>The effect is striking. Higher temperatures expose more inter-class structure, giving the student a better map of the concept landscape rather than just a list of correct answers.</p><hr><h2 id="what-gets-transferred-three-flavours-of-distillation">What Gets Transferred? Three Flavours of Distillation</h2><p>Knowledge can flow from teacher to student in different ways. The research community has converged on three main categories:</p><p><strong>Response-based distillation</strong> — the original Hinton approach. The student matches the teacher&rsquo;s final output layer (soft targets). Simple, effective, widely used.</p><p><strong>Feature-based distillation</strong> — the student is trained to match not just the final output but intermediate representations — specific layers or attention maps inside the teacher. This transfers <em>how</em> the teacher thinks, not just what it concludes. The trade-off is complexity: the teacher and student must have compatible architectures or an adapter layer is needed.</p><p><strong>Relation-based distillation</strong> — the student learns to replicate the <em>relationships</em> between different training examples as the teacher sees them. If the teacher places cat images and dog images in nearby regions of its feature space, the student should too. 
This approach is particularly powerful for metric learning and few-shot tasks.</p><hr><h2 id="advanced-variants">Advanced Variants</h2><h3 id="multi-task-distillation">Multi-Task Distillation</h3><p><img src="/images/kd-polymath-student.jpg" alt="A Leonardo da Vinci polymath student simultaneously masters writing, painting, anatomy, and geometry with golden threads connecting all disciplines to a glowing brain"/></p><p>Microsoft&rsquo;s MT-DNN research showed that distillation composes naturally with multi-task learning. A teacher trained on nine different natural language tasks simultaneously was distilled into a single student model. The distilled MT-DNN outperformed the original on 7 of 9 GLUE benchmark tasks — pushing the single-model GLUE score to 83.7%, the state of the art at the time.</p><p>The insight: when a teacher has learned to generalise across many domains, its soft targets encode cross-task structure that a specialised student cannot discover on its own.</p><h3 id="the-teacher-assistant-bridge">The Teacher Assistant Bridge</h3><p>What happens when the teacher and student are so different in capacity that direct distillation fails? A very large teacher produces soft targets the tiny student simply cannot model well.</p><p>The solution is an intermediate <strong>Teacher Assistant (TA)</strong> — a medium-sized model that first distils from the large teacher, then acts as teacher to the small student. The TA bridges the capacity gap, giving the small student a more tractable target. 
Experiments with this staged approach have shown it can outperform direct large-to-small distillation when the capacity gap between teacher and student is large.</p><h3 id="when-the-student-surpasses-the-teacher">When the Student Surpasses the Teacher</h3><p><img src="/images/kd-student-exceeds.jpg" alt="A young apprentice stands triumphant as his glowing painting outshines the master&rsquo;s faded work, with the aged teacher bowing respectfully"/></p><p>One of the most counter-intuitive findings in knowledge distillation is that the student can sometimes <em>exceed</em> the teacher.</p><p>The 2022 <strong>Symbolic Knowledge Distillation</strong> paper demonstrated this dramatically. The researchers distilled commonsense reasoning from GPT-3 (175B parameters) into a purpose-built commonsense model at 100× smaller size. The resulting student — COMET-DISTIL — outperformed GPT-3 on commonsense benchmarks.</p><p>How? The distillation process acted as a filter. Rather than transferring all of GPT-3&rsquo;s knowledge, the researchers used a <strong>critic model</strong> to selectively distil only high-quality, high-confidence commonsense triples. The student was not burdened by GPT-3&rsquo;s off-topic knowledge or low-confidence noise. 
It received a curated, concentrated version of the teacher&rsquo;s relevant expertise.</p><p>This is the Renaissance apprentice story made literal: the student, given the master&rsquo;s best knowledge and freed from the master&rsquo;s constraints, eventually does better work.</p><hr><h2 id="real-world-results">Real-World Results</h2><p>The numbers behind knowledge distillation are worth anchoring:</p><p>In Hinton&rsquo;s original speech recognition experiments on a heavily used commercial system, a distilled single model <strong>matched the word error rate of a 10-model ensemble</strong> while requiring roughly one-tenth the compute at inference time.</p><p>In the speech recognition benchmark specifically:</p><table><thead><tr><th>Model</th><th>Frame Accuracy</th><th>Word Error Rate</th></tr></thead><tbody><tr><td>Baseline (single model)</td><td>58.9%</td><td>10.9%</td></tr><tr><td>10× model ensemble (teacher)</td><td>61.1%</td><td>10.7%</td></tr><tr><td>Distilled single student</td><td><strong>60.8%</strong></td><td><strong>10.7%</strong></td></tr></tbody></table><p>The student matches the ensemble&rsquo;s word error rate and comes within 0.3 points of its frame accuracy, at a fraction of the cost. This is the central promise of KD, and it has held up across vision, language, and speech for over a decade.</p><hr><h2 id="why-this-matters-in-2026">Why This Matters in 2026</h2><p>Knowledge distillation is no longer a research technique. It is infrastructure.</p><p>Many on-device AI models (the language models on your phone, the vision models in your camera, the wake-word detectors in your earbuds) are compressed from a much larger model, and distillation is one of the standard tools for doing so. DistilBERT, TinyBERT, and Distil-Whisper are all products of distillation.</p><p>The technique is also central to the LLM compression wave of the past two years. The smaller Gemma 2 models were trained with knowledge distillation from a larger teacher, and the Phi family relies heavily on synthetic data generated by larger models, a close cousin of the same idea.
The goal: deliver GPT-4-class reasoning in a model small enough to run locally, privately, and cheaply.</p><p>And symbolic distillation, which transfers knowledge as structured text rather than as neural activations, is opening entirely new territory, allowing language model intelligence to flow into specialised domain models that do not even share the same architecture.</p><hr><h2 id="a-practical-starting-point">A Practical Starting Point</h2><p>If you want to experiment with knowledge distillation today:</p><p><strong>For response-based KD in PyTorch</strong>, the training loop change is minimal: replace your standard cross-entropy loss with the blended loss described above and pass the teacher&rsquo;s logits alongside the hard labels.</p><p><strong>For NLP tasks</strong>, Hugging Face&rsquo;s <code>transformers</code> library includes DistilBERT as a reference distilled model with its training recipe documented.</p><p><strong>For vision</strong>, the official PyTorch knowledge distillation tutorial is the fastest on-ramp.</p><p>The key design decisions are: the temperature T (start at 4), the blending weight α (start at 0.5), and whether you need feature-based or response-based transfer (response-based first, feature-based if accuracy is still insufficient).</p><hr><p>The master-apprentice metaphor is more than decorative. Knowledge distillation encodes a genuine pedagogical insight: that the richer the guidance a learner receives, the more efficiently it reaches competence. The hard labels of raw data are the equivalent of telling a student the answer. The soft targets of a teacher model are the equivalent of showing them how to think.</p><p>That distinction, answer versus thinking, is what makes knowledge distillation one of the most elegant ideas in modern machine learning.</p>
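<p>For readers who want something concrete to start from, the blended response-based loss with the suggested starting values (T = 4, α = 0.5) might be sketched as follows. This is a framework-free illustration in plain Python, not a training recipe: the function names, the example logits, and the choice to let α weight the hard-label term are all assumptions made for the example, not details taken from any particular library.</p>

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    Assumption: alpha weights the hard-label term and (1 - alpha) the
    soft-target term; some write-ups use the opposite convention.
    """
    # Hard-label term: ordinary cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[label])

    # Soft-target term: KL(teacher || student), both softened by T.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_soft, student_soft)
        if p > 0.0
    )

    # Multiplying by T^2 keeps the soft-term gradients on a scale
    # comparable to the hard-label term as T changes.
    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss


# Hypothetical logits for a single three-class example.
teacher = [6.0, 2.0, -1.0]
student = [3.0, 1.5, 0.0]
loss = distillation_loss(student, teacher, label=0)
```

<p>In a PyTorch training loop the same two terms are usually computed on batched tensors with <code>torch.nn.functional.cross_entropy</code> and <code>torch.nn.functional.kl_div</code>; the arithmetic is unchanged. If the student's logits equal the teacher's, the soft term is zero, which makes a convenient sanity check.</p>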
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/kd-master-apprentice.jpg" medium="image"><media:title type="plain">Deep-Learning</media:title></media:content><category>artificial-intelligence</category><category>machine-learning</category><category>model-compression</category><category>deep-learning</category><category>Knowledge Base</category></item></channel></rss>