AI for Enterprises Session 01 — How LLMs Actually Work
Session 01 · Sunday, April 27 · 7:00 PM · Tonight

How LLMs Actually Work

Strip away the magic. Understand what a language model really does under the hood — tokens, embeddings, attention, and prediction — and why this changes how you think about AI.

Duration ~2.5 hours (can split)
Format Read → Try → Reflect
Pre-reading None required
Course AI for Enterprises
🎯

What This Course Is

5 min

This is a practitioner's course on AI — not a prompting workshop. You may already know how to talk to AI tools. This course teaches you how to think with AI — to understand what's happening under the hood, and to reason from first principles about how to build with it.

What this is

A ground-up course on how AI actually works — mechanics, architecture, prompting strategy, context, cost, and enterprise design patterns. Every concept connects directly to real work.

What it isn't

Theory for theory's sake. Every session has a direct application — the kind of problems that come up in operations, compliance, vendor management, and client workflows.

The end goal

By session 8, you should be able to design AI into a workflow, not just use it as a chat tool. That means understanding systems, not just prompts.

How to use this

Read each section, try the exercises and experiments at the end, and bring your questions and observations to the next session. The content builds progressively across all 8 weeks.

🚫

Myths vs Reality

8 min

Before getting into mechanics, it helps to clear the slate. Everyone arrives with a head full of AI assumptions shaped by science fiction, news headlines, and marketing copy. If these aren't named and addressed directly, they quietly distort everything else you'll learn. Read each one and check your own assumptions against the reality.

Myth
"AI understands what I'm saying — it reads my message and grasps my intent the way a human would."
Reality
The model has no comprehension. It converts your message into numbers and statistically predicts what tokens should come next. It has never "understood" anything. What looks like understanding is an extraordinarily good pattern match.
Myth
"AI gets smarter every time I use it — my conversations are training it to know me better over time."
Reality
At inference time, no learning happens. The model's weights are frozen. Your conversation has zero effect on the underlying model. Every new chat starts from exactly the same baseline. Memory features in products are engineering add-ons, not learning.
Myth
"AI is browsing the internet to answer my questions, like a really fast search engine."
Reality
By default, LLMs have no internet access. They generate responses purely from patterns learned during training. Web search is a tool that can be explicitly connected — but it has to be built in. Without it, the model is speaking entirely from memory.
Myth
"If AI gives a confident answer, it's probably correct — it would say 'I don't know' if it wasn't sure."
Reality
The model has no internal doubt mechanism. Confidence is a property of the token probability distribution, not of actual correctness. It will produce a fluent, confident-sounding answer whether it's right or completely fabricating. This is called hallucination.
Myth
"AI knows who I am — it remembers me from last time and builds a picture of me over our conversations."
Reality
The model is completely stateless. Each conversation starts from zero. It has no identity for you, no memory of previous sessions, and no persistent relationship with you. Products that simulate memory are injecting past context into your current prompt — it's text, not recollection.
Why this matters

Every single one of these myths leads, in practice, to a bad product decision or a failure of misplaced trust. The faster you internalise the real picture, the faster you can build genuinely useful things with AI — instead of building on top of a misunderstanding.

🧠

Part 1 — The Intuition

15 min

Before reading further — pause and ask yourself: when you send a message to ChatGPT or Claude, what do you think is actually happening inside the model? Hold that intuition in mind, then read what follows.

Text → Tokens. AI doesn't read words the way humans do. It reads chunks called tokens. A single word might be one token, or it might be split across several. The model never sees raw text — only a sequence of numbers.

You can see this yourself by visiting tiktokenizer.vercel.app — paste any text and watch how it gets split into tokens. Try a company name, a technical term, or a sentence from a document you work with, and notice how it tokenizes. Then paste a long paragraph and see the token count. This builds direct intuition for why token limits exist and why phrasing matters at scale.
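The same inspection works in code. Below is a minimal sketch using tiktoken, the open-source tokenizer library behind that site (Anthropic's models use a different tokenizer, so treat the counts as approximate when estimating for Claude):

```python
# Minimal tokenisation sketch using the open-source tiktoken library
# (pip install tiktoken). Anthropic uses a different tokenizer, so
# counts are approximate across providers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "Quarterly compliance documentation for vendor onboarding"
token_ids = enc.encode(text)

print(len(token_ids), "tokens")               # the number that drives cost and limits
print([enc.decode([t]) for t in token_ids])   # how the text was actually split
```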

Tokens → Numbers (Embeddings). Every token gets converted into a point in a very high-dimensional space — imagine a map with thousands of axes. Words that mean similar things cluster close together. The classic example: King − Man + Woman ≈ Queen. That's a real result from this vector space. The model has learned meaning through geometry.
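That analogy is plain vector arithmetic. Here is a toy sketch — the four-dimensional vectors below are invented for illustration (real embeddings are learned from data and have hundreds or thousands of dimensions):

```python
# Toy illustration of "meaning as geometry". These 4-d vectors are
# invented; real embedding spaces are learned, with thousands of axes.
import numpy as np

def cosine(a, b):
    # Cosine similarity: ~1.0 means pointing the same way, ~0 means unrelated.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented axes: [royalty, maleness, femaleness, person-ness]
king  = np.array([0.9, 0.8, 0.1, 0.7])
man   = np.array([0.1, 0.9, 0.1, 0.8])
woman = np.array([0.1, 0.1, 0.9, 0.8])
queen = np.array([0.9, 0.1, 0.8, 0.7])

result = king - man + woman           # the classic analogy, done as arithmetic
print(cosine(result, queen))          # ~0.99: lands almost exactly on "queen"
print(cosine(result, man))            # ~0.42: far from "man"
```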

The Prediction Machine. At its core, every LLM is doing one thing: predicting the next most likely token, over and over. Every word in a response is the output of this prediction, chained together. This is the single most important fact to internalise — because it explains both the power and the failure modes of AI.
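In code, the whole generation process is one loop. The sketch below uses a hand-written toy "model" with invented probabilities; a real LLM runs exactly this loop with a neural network and a vocabulary of roughly 100,000 tokens:

```python
# Autoregressive generation in miniature. `toy_model` stands in for the
# neural network: given the sequence so far, it returns a probability
# distribution over possible next tokens. All probabilities are invented.
import random

def toy_model(tokens):
    if tokens[-1] == "invoice":
        return {"is": 0.9, "was": 0.1}
    if tokens[-1] in ("is", "was"):
        return {"overdue": 0.6, "paid": 0.4}
    return {"<end>": 1.0}

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    while tokens[-1] != "<end>":
        probs = toy_model(tokens)                           # P(next | everything so far)
        choices, weights = zip(*probs.items())
        tokens.append(random.choices(choices, weights)[0])  # sample one token
    return " ".join(tokens[:-1])

print(generate(["The", "invoice"]))   # e.g. "The invoice is overdue"
```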

The punchline

If the model is "just predicting the next token," then it can be confidently wrong. It doesn't know the answer — it knows what a plausible answer looks like. This is why hallucinations happen. This is why you need to verify AI output in critical contexts.

📎

Part 2 — Multimodality: Beyond Text

6 min

Everything we've described so far — tokens, embeddings, prediction — applies to text. But modern enterprise models are not text-only. They can receive and process multiple types of input. This is called multimodality, and it's what makes AI genuinely useful across functions that don't primarily work in text.

For someone in finance reviewing invoices, HR processing resumes, or compliance checking contracts — their work is documents, tables, and images. The moment they realise the model can read those directly, the session becomes relevant to them.

Reflection

Look through the six input types a modern multimodal model can accept: text, images, documents, tables, audio, and video. Which one maps most directly to something you do in your job today? That's your starting point for thinking about where AI could add real value to your work — not in the abstract, but in that specific task.

🔍

Part 3 — The Attention Mechanism

10 min

This is the "secret sauce" of the Transformer architecture. Keep it conceptual — no matrix math needed. The goal is just the mental model.

Key idea: not all words in a sentence matter equally to each other.

Work through this example

"The worker submitted his documents but he forgot to attach his photo." — When the model reads the second "his", how does it know it refers to "the worker" and not "the documents"? That's attention. The model has learned to look back over the entire input and weigh which earlier tokens are most relevant to understanding the current one.

Attention is why models can handle long, complex sentences without losing track of what refers to what. And it's why longer, richer prompts often perform better — you're giving the model more context to "attend to" when forming its response.

This is also the computational reason why longer contexts are more expensive. Every token attends to every other token, so the work grows with the square of the input length: double the tokens, quadruple the computation. Quadratic, not linear.
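You can make the quadratic growth concrete by counting pairwise scores. A toy sketch with random vectors (the dimensions are invented for illustration; real models add multiple attention heads and many layers on top):

```python
# Why attention cost is quadratic: every token scores every other token,
# producing an n x n matrix. Dimensions here are invented for illustration.
import numpy as np

d = 64  # per-token vector size (illustrative)

for n in (1_000, 2_000, 4_000):       # tokens in context
    Q = np.random.randn(n, d)         # query vector, one per token
    K = np.random.randn(n, d)         # key vector, one per token
    scores = Q @ K.T / np.sqrt(d)     # scaled dot-product scores: shape (n, n)
    print(f"{n:>5} tokens -> {scores.size:,} pairwise scores")

# 1,000 tokens ->  1,000,000 pairwise scores
# 2,000 tokens ->  4,000,000 (2x tokens, 4x work)
# 4,000 tokens -> 16,000,000 (4x tokens, 16x work)
```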

Part 4 — Training vs Inference

10 min

Most people conflate these two phases. Separating them is one of the most clarifying things you can do early in this course.

Training

The model reads a massive corpus of text, predicts the next token billions of times, and adjusts its internal weights every time it's wrong. This is done once, by the company that built the model (Anthropic, OpenAI, etc.). It takes months and costs millions.

Inference

When you send a message, the model uses those frozen weights to generate a response. No learning is happening. It is not updating based on your conversation. The model is the same whether it's your first message or your ten-thousandth.

Key insight to land

"Every time you start a new chat, you're talking to the exact same frozen model. It has no idea who you are. What looks like memory is just the conversation history — text pasted into the context window. When you close the tab, it's gone."

This sets up Session 2 perfectly — because the next question is: how do companies build AI systems that actually do remember things? That's where RAG and vector databases come in. Plant that question tonight and let it sit.

🎯

Part 5 — RLHF: Why the Model Is Actually Helpful

6 min

Here's a question the session so far hasn't answered: if the base model is just predicting the next token from internet text, why does it behave helpfully instead of generating random noise, hate speech, or conspiracy theories — all of which exist in its training data?

The answer is Reinforcement Learning from Human Feedback (RLHF). It's the step that turns a raw text predictor into a useful assistant — and it's why Claude, ChatGPT, and Gemini feel like products and not just autocomplete engines.

Step 1 — Pre-training

The base model trains on a massive corpus of internet text. At this point it can generate coherent text, but it has no sense of what's "good" or "helpful." It might complete your sentence with misinformation just as readily as a correct answer.

Step 2 — Supervised Fine-tuning

Human trainers write examples of ideal responses to thousands of prompts. The model is fine-tuned on these examples — it starts learning the shape of a "good" answer. But this alone isn't enough.

Step 3 — Reward Model

Humans rank multiple model outputs from best to worst. A separate "reward model" is trained on these rankings — it learns to score how good a response is. This becomes the judge.

Step 4 — Reinforcement Learning

The main model generates responses, the reward model scores them, and the main model is updated to produce higher-scoring outputs over time. It learns helpfulness, harmlessness, and honesty through this feedback loop.

Why this matters for builders

Different companies apply RLHF with different values and priorities. This is why Claude, ChatGPT, and Gemini behave differently even though they're all LLMs. When you choose a model for an enterprise product, you're also choosing the values and guardrails baked into that RLHF process. It's not just a performance decision — it's a policy decision.

🧩

Part 6 — Anatomy of a Prompt

6 min

When you send a message to an AI, you're seeing one small piece of what the model actually receives. In production systems — the kind you'll build at work — a "prompt" is three distinct things, and conflating them is one of the most common engineering mistakes.

System Prompt

Instructions set by the developer, not the user. This is where you define who the model is, how it should behave, what it knows about your product, what it's allowed to say, and what it must never say. Users typically never see this. Think of it as the employee briefing before the customer walks in.

"You are a customer policy specialist for an enterprise SaaS platform. Your role is to help registered partners understand service agreements, billing schedules, and compliance requirements. Never speculate on legal matters — flag uncertainty explicitly. Adapt your language to the formality level of the person writing to you."

Conversation History

Every previous message in the current session — both the user's messages and the model's responses — is included in full, prepended to the new request. The model has no memory mechanism; it simply re-reads the entire conversation each time. This is why long conversations get expensive and why context limits matter.

User Message

What the user typed right now. This is the only part most people think about — but in a well-engineered system, it's actually the smallest piece. The system prompt and injected context often contain far more tokens than the user's actual question.
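Here is how the three layers look in an actual API call — a minimal sketch using Anthropic's Python SDK (the model name, key setup, and message contents are placeholders; check the provider's docs for current values):

```python
# The three prompt layers in one call, sketched with Anthropic's Python
# SDK (pip install anthropic). Model name and contents are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder: use a current model name
    max_tokens=500,
    # Layer 1: system prompt — written by the developer, unseen by the user
    system="You are a customer policy specialist for an enterprise SaaS platform. ...",
    messages=[
        # Layer 2: conversation history — re-sent in full on every call
        {"role": "user", "content": "What does my service agreement cover?"},
        {"role": "assistant", "content": "Your agreement covers the following tiers: ..."},
        # Layer 3: the user message — what was typed just now
        {"role": "user", "content": "And when is my next invoice due?"},
    ],
)
print(response.content[0].text)
```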

The engineering insight

In any enterprise AI product, your job as the builder is mostly to craft the system prompt and manage what gets injected into the conversation history. The user message is just the trigger. This framing completely changes how you think about AI product design — you're not just writing prompts, you're designing a context pipeline.

🏗️

Part 7 — How AI Products Are Actually Structured

6 min

There's a question every non-technical person in an organisation will eventually face: "should we use Claude.ai, buy a tool built on AI, or build something ourselves?" These are three completely different decisions — commercially, technically, and in terms of control — but they look identical from the outside. Here's the actual picture.

Foundation
The Foundation Model
The raw LLM trained by Anthropic, OpenAI, Google, or Meta. You cannot access or modify the weights directly. This layer knows everything in its training data and nothing about your business.
Claude 3.5 Sonnet · GPT-4o · Gemini Pro · Llama 3
API Layer
The API
A programmatic interface for developers. You pay per token, set the system prompt, choose the model version and temperature, and inject any context you want. This is where custom-built enterprise products live — products tailored to your workflows, your data, and your policies.
Anthropic API · OpenAI API · Google Vertex AI
Product Layer
Third-Party AI Products
Tools built by other companies on top of the API. Engineering is handled for you — you pay for the product, not raw tokens. Limited customisation but fast to deploy. Someone else controls the system prompt and your data handling.
Notion AI · Cursor · HubSpot AI · n8n AI nodes · Grammarly
End User
Consumer Interfaces
What most people use — claude.ai, ChatGPT, Gemini. You type, it responds. System prompt is set by the company. Almost no control over underlying behaviour. Great for personal productivity; not for building business systems.
claude.ai · ChatGPT · Gemini · Copilot
The decision framework for any room

Always evaluate a layer by asking three questions: Who controls the system prompt? Where does our data go? Can we customise the behaviour? Consumer interfaces answer none of these. The API answers all of them — at the cost of engineering. Third-party products sit in between — evaluate them on data handling and configurability before adoption.

🏢

Thinking Like an Enterprise

8 min

Using AI as an individual and deploying AI inside an organisation are fundamentally different problems. Most AI content online is written for the individual user — the person experimenting alone, optimising their own workflow. Enterprise AI is a different discipline entirely. The technical challenge is often the easier half.

Here is how the enterprise context changes every question you'll face when building or adopting AI systems:

1
Scale changes the failure mode
A prompt that works 80% of the time is fine for personal use. At 10,000 calls per day, that's 2,000 failures. Enterprise AI demands statistical reliability, not occasional success. Every design decision — model choice, prompt structure, output validation — must be evaluated at the scale it will actually run, not at the scale you tested it.
2
Governance: who owns the output?
When an individual uses AI, they own the output. When an organisation uses AI to generate a contract clause, a hiring decision summary, or a compliance flag — who is accountable? Enterprise AI requires answering: who approves AI output before it's acted on? Who's liable when it's wrong? What's the escalation path? These are not technical questions. They must be resolved before anything ships.
3
Build vs Buy vs Partner
Every AI capability decision is one of three options: build it on the API (full control, full cost, full responsibility), buy a third-party product built on AI (fast, limited customisation, someone else handles infrastructure), or partner with a specialist to build something custom (time-intensive, expensive, highest fit). There is no universally right answer — the decision depends on how differentiated the capability needs to be and how sensitive the data is.
Buy when: the problem is generic (email drafting, meeting summaries).
Build when: the problem is differentiated (proprietary data, unique workflow).
Partner when: the problem is high-stakes and complex (regulated decisions, custom integrations).
4
AI integrates into systems, not just workflows
Enterprise AI rarely sits alone. It connects upstream to data sources (CRMs, ERPs, databases, document stores) and downstream to action systems (email, ticketing, approval workflows, dashboards). The AI layer is often the smallest part of the engineering surface. Most of the work is data pipelines, system integrations, access controls, and audit logging — not the model itself.
5
Adoption is the hardest part
A technically correct AI system that nobody uses is a failed project. Enterprise AI adoption requires change management: clear communication of what the tool does and doesn't do, training for different skill levels, trust-building through transparency about limitations, and gradual rollout rather than organisation-wide launches. The resistance is rarely irrational — it's usually people protecting against uncertainty they've correctly identified.
6
Measurement comes before ROI
You cannot claim ROI from an AI system you aren't measuring. Before deploying anything, define the baseline metric (how long does this task take today? how often is it done wrong?), the measurement mechanism (how will you capture the AI-assisted version?), and the success threshold (what constitutes meaningful improvement?). Without this, every AI project becomes a perpetual "it seems to be helping" — which is impossible to defend when cost reviews happen.
The framing shift

Individual AI use is about augmenting yourself. Enterprise AI is about augmenting a system — one made of people, processes, data, and existing technology. Every decision you make about an enterprise AI product is really a decision about how that system changes. The model is just one variable.

🌡️

Part 8 — Temperature: The Real Explanation

5 min

You've heard "high temperature = creative, low temperature = consistent." That's true, but it's not the real explanation — and as engineers, you deserve the actual picture.

Every time the model predicts the next token, it doesn't just pick the single most likely one. It generates a probability distribution over every possible next token in its vocabulary. "The" might have a 40% probability, "A" might have 20%, "It" might have 15%, and thousands of other tokens share the remaining 25%.

Temperature controls how you sample from that distribution.

Low temperature (0.0 – 0.3)

The distribution gets "sharpened" — high probability tokens become even more dominant, low probability tokens get suppressed. The model almost always picks the most likely next token. Output is predictable, repetitive, conservative. Use for: data extraction, classification, factual Q&A, summarisation.

High temperature (0.7 – 1.2)

The distribution gets "flattened" — lower probability tokens get more of a chance. The model explores less obvious continuations. Output is varied, creative, sometimes surprising. Use for: brainstorming, copywriting, generating diverse options, creative tasks.

The failure mode

Very high temperatures (above 1.0) don't just make the model "more creative" — they start randomly surfacing low-probability tokens, which can produce incoherent, hallucinated, or nonsensical output. Creativity and reliability are genuinely in tension here. In enterprise systems, most production use cases sit between 0.0 and 0.7.
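Mechanically, temperature is a single division applied to the raw scores (logits) before they become probabilities. A sketch with invented logits for four candidate tokens:

```python
# How temperature reshapes the next-token distribution. The logits are
# invented; a real model emits one logit per token in its vocabulary.
import numpy as np

def next_token_probs(logits, temperature):
    scaled = np.array(logits) / temperature  # the entire mechanism is this division
    exp = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    return exp / exp.sum()                   # softmax turns scores into probabilities

logits = [2.0, 1.3, 1.0, -0.5]  # invented scores for "The", "A", "It", "Zebra"

for t in (0.2, 0.7, 1.2):
    print(f"T={t}:", np.round(next_token_probs(logits, t), 3))

# T=0.2: [0.964 0.029 0.007 0.   ]  sharpened: the top token dominates
# T=0.7: [0.611 0.225 0.146 0.017]  balanced
# T=1.2: [0.472 0.264 0.205 0.059]  flattened: unlikely tokens get a real chance
```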

📅

Part 9 — Training Data & the Knowledge Cutoff

5 min

Every LLM has a knowledge cutoff date — the point at which its training data stopped being collected. Anything that happened after that date is invisible to the model. It doesn't know about last week's news, a law that changed last month, or your company's Q2 results.

What was it actually trained on? Mostly: publicly available web text (Common Crawl), books, Wikipedia, code repositories, scientific papers, and curated datasets. This means the model is extraordinarily good at general knowledge, reasoning, writing, and code — but it has zero awareness of your proprietary business data, internal documents, or anything non-public.

What it doesn't know

Events after the training cutoff. Your company's internal data. Prices, regulations, or facts that have changed since training. Anything from private or paywalled sources that wasn't in the training corpus.

How engineers solve this

You inject fresh context into the prompt at runtime — documents, database records, API results, current date. This is the foundation of RAG (Retrieval Augmented Generation), which Session 2 covers in full. The model's frozen knowledge becomes a reasoning engine, not a source of truth.

Practical implication right now

Any AI product you build that needs current, accurate, or proprietary information must inject that information into the context. Relying on the model's training data alone for factual claims in a business context is an engineering mistake — not a prompting mistake. Always design for this from the start.

📐

Context Management — The Real Engineering Problem

8 min

The context window is the single most important constraint in practical AI engineering. Everything you need the model to "know" for a given request must fit inside it — and what doesn't fit, doesn't exist as far as the model is concerned. Understanding how it fills up, what happens when it overflows, and how to manage it deliberately separates working prototypes from production systems.

Context window anatomy — what fills it, and in what order (~200K tokens on current Claude Sonnet models):
1. System prompt
2. Injected context (docs, data via RAG)
3. Conversation history
4. User message
5. Remaining space
Context overflow

When input exceeds the window, many products and frameworks silently truncate the oldest content — typically the beginning of the conversation or the earliest injected document — while raw APIs usually reject the over-length request outright. Either way, the model never warns you; it simply can't see what was cut.

Context contamination

Irrelevant, noisy, or contradictory content in the context window actively harms output quality — even if there's still space. The model attends to everything. Garbage in the context competes with relevant signals.

Position matters

Models pay more attention to content near the start and end of the context ("primacy and recency bias"). Critical instructions should be in the system prompt or reiterated near the user message — not buried in the middle.

Four patterns for managing context in production:

1
Sliding window / rolling history
Keep only the last N turns of conversation in the context. Drop the oldest exchanges as new ones come in. Simple and effective for chat-style products where recent context matters most (see the sketch after this list). Risk: the model loses early context (e.g., user's stated preferences from turn 1).
2
Conversation summarisation
When history gets long, use the model to compress older turns into a summary, then inject the summary instead of the raw history. Preserves semantic content while drastically reducing token count. Requires an extra LLM call but extends effective memory indefinitely.
3
Selective retrieval (RAG)
Instead of injecting all documents into every call, retrieve only the most relevant chunks for the current query using vector similarity search. Keeps injected context lean, targeted, and within budget. This is the standard pattern for knowledge-base products.
4
Context pre-processing
Before injecting a document, clean it: remove headers/footers, strip redundant whitespace, extract only the relevant section. A 20-page PDF that gets cleaned down to 3 relevant paragraphs is infinitely better than dumping the full PDF. Pre-processing is cheap compute compared to wasted tokens.
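A minimal sketch of pattern 1, the sliding window. The message format follows the common role/content convention; MAX_TURNS is an invented tuning knob you would derive from your own token budget:

```python
# Pattern 1 (sliding window) in miniature: keep only the last N exchanges.
# MAX_TURNS is an invented knob; derive it from your real token budget.
MAX_TURNS = 6  # one turn = one user message + one assistant reply

def build_messages(history, new_user_message):
    recent = history[-(MAX_TURNS * 2):]  # last N pairs; older turns silently drop off
    return recent + [{"role": "user", "content": new_user_message}]

history = [
    {"role": "user", "content": "I prefer replies in formal English."},  # turn 1: at risk of dropping
    {"role": "assistant", "content": "Understood."},
    # ... many more turns ...
]
messages = build_messages(history, "What's the status of my compliance review?")
```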
The design principle

Every token in your context should earn its place. Ask of each piece: does the model need this to answer correctly? If not — cut it. Context hygiene is not optimisation; it's correctness. A model given clean, relevant context consistently outperforms the same model drowning in noise.

⚖️

Part 10 — Model Families & What "Bigger" Actually Means

5 min

You'll make model selection decisions when building AI products. "GPT-4 is better than GPT-3.5" — but what does that mean? What are you actually choosing between?

The most important dimension is parameter count. Parameters are the numerical weights that get updated during training — the millions or billions of numbers that collectively encode the model's knowledge and behavior. A 70B model has 70 billion of these weights. A 7B model has 7 billion.

More parameters → what you gain

Better reasoning on complex, multi-step problems. More nuanced instruction-following. Better performance on tasks that require synthesizing multiple pieces of information. More "coherent worldview" across a long conversation.

More parameters → what you pay

Slower inference (higher latency per response). Higher cost per API call. More compute required to run. For many tasks — classification, extraction, simple Q&A — a smaller model performs identically at a fraction of the cost.

The practical decision framework: Use the smallest model that reliably handles your task. Start with a mid-tier model, measure quality, then step up only if needed. In high-volume enterprise systems, the cost difference between a small and large model can be 10–50x. That's not a detail — it's the business case.

Current landscape — major model families
A

Anthropic — Claude family. Claude Opus (frontier, reasoning-heavy tasks), Claude Sonnet (balanced performance and cost), Claude Haiku (fast, lightweight, high-volume tasks). Strong on instruction-following and safety.

O

OpenAI — GPT family. GPT-4o (multimodal, strong general capability), o1/o3 (reasoning-optimized, slower but exceptional at logic and math). Largest ecosystem of integrations and tooling.

G

Google — Gemini family. Gemini Ultra, Pro, and Flash. Strong on multimodal tasks and Google Workspace integration. Flash tier is extremely cost-effective for high-volume inference.

M

Open-source — Llama, Mistral, Qwen, Falcon. Can be self-hosted. No API cost. Full control over data privacy. Trade-off: you own the infrastructure, the updates, and the guardrails. Critical for compliance-sensitive enterprise contexts.

R

Regional & domain-specific models. A growing category of models trained on specific languages, regions, or industries — Sarvam AI and Krutrim (South Asian languages), Mistral (strong French performance), Jais (Arabic), medical-specific models, legal-specific models. These frequently outperform frontier models on their target domain while being far cheaper to run. Don't assume GPT-4 is the right tool for every language and market.

💰

Context Economy — The Real Cost of AI at Scale

8 min

Tokens are the atomic unit of both capability and cost. Every API call is billed on two dimensions: how many tokens went in (input) and how many came out (output). Understanding this changes every design decision you make — from how you write system prompts to how long you let conversations run.

The deeper issue is that token costs are not linear. They compound. A conversation that starts cheap becomes expensive fast — and most teams don't notice until the invoice arrives.

How token count compounds across a conversation — each turn pays for ALL previous turns:

Turn 1 ≈ 500 tokens · Turn 2 ≈ 1,100 tokens · Turn 5 ≈ 3,500 tokens · Turn 10 ≈ 8,000 tokens, most of it history you've already paid for. Every turn re-sends the system prompt and the full prior history alongside the new message and its output.

This is the context tax. Every time you send a new message, you pay for the entire conversation history again — not just what you typed. By turn 10, the majority of your token spend is on context you've already paid for in previous turns. This compounds across thousands of daily users in a production system.

~750 — words per 1,000 tokens
A rough conversion: a 2-page document ≈ 1,000 tokens. A lean system prompt ≈ 200–600 tokens. A 10-turn chat conversation can easily reach 8,000–15,000 tokens in total input.

3–5× — output vs input cost ratio
Generating tokens costs significantly more than reading them. Asking for structured short outputs (JSON, bullet points, yes/no) is meaningfully cheaper than asking for long-form prose explanations.

10–50× — cost gap: frontier vs small model
At volume, model choice dominates your bill. A simple classification task consumes the same tokens whether you use a frontier model or a small one — but the price per token differs enormously. Never use a frontier model for a task a smaller one handles equally well.

The four levers of context economy:

1
System prompt efficiency
Your system prompt is sent on every single API call, not just the first one. A 1,000-token system prompt sent 10,000 times per day is 10 million input tokens daily — just in overhead. Audit your system prompts regularly. Every redundant sentence is a recurring cost. Target 200–500 tokens for most production system prompts.
2
Conversation history management
Never pass the full conversation history indefinitely. Implement a rolling window (last N turns), a summarisation strategy (compress old turns into a summary paragraph), or a relevance filter (only pass turns that contain information the current query actually needs). The cost difference between naive and managed history is often 60–80% at scale.
3
Prompt caching
If you're passing the same large system prompt or document repeatedly (e.g. a policy document that doesn't change), most providers offer prompt caching — they store the processed version of that prefix and only charge you for it once per cache period rather than on every call. On large system prompts, this alone can cut input costs by 80–90%. Always check whether your API provider supports this feature.
4
Output length control
Unguided, models will generate whatever length feels natural — which is usually longer than you need and always more expensive. Setting explicit output constraints ("respond in under 100 words", "return only a JSON object", "answer in one sentence") reduces output token cost directly. For classification tasks where the answer is one of N labels, the output cost should be near zero. Design for it.
Understanding your usage budget

API plans are tiered by monthly token volume and rate limits (requests per minute). When you're in development, low-volume usage is cheap. When you go to production with real users, costs scale with every conversation, every document, every retry. Model your production cost before you launch: average tokens per interaction × interactions per user per day × daily active users × 30 days. This number is often a surprise; the sketch below shows the arithmetic.
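Every number in this sketch is an invented placeholder — substitute your provider's current per-token rates and your own measured traffic before relying on the result:

```python
# Back-of-envelope production cost model. ALL numbers are invented
# placeholders: substitute your provider's current rates and your
# own measured traffic.
input_price_per_mtok  = 3.00    # $ per million input tokens (placeholder)
output_price_per_mtok = 15.00   # $ per million output tokens (placeholder)

avg_input_tokens  = 2_500       # system prompt + history + injected context per call
avg_output_tokens = 400
calls_per_user_per_day = 8
daily_active_users = 1_000

cost_per_call = (avg_input_tokens  / 1e6 * input_price_per_mtok +
                 avg_output_tokens / 1e6 * output_price_per_mtok)
monthly_cost = cost_per_call * calls_per_user_per_day * daily_active_users * 30

print(f"~${monthly_cost:,.0f}/month")  # ~$3,240 with these placeholder numbers
```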

Cost per task vs cost per token

Stop thinking in per-token costs and start thinking in per-task costs. What does it cost to classify one document? To generate one support reply? To summarise one meeting? Once you have a cost-per-task figure, you can compare it against the human cost of the same task, set a payback period, and make a defensible business case for the AI investment.

The non-obvious insight

Most AI projects are cheap during evaluation and expensive in production — not because the price per token changes, but because evaluation uses a handful of carefully chosen inputs and production uses everything. Design your context strategy in week one, not after your first invoice. The teams that treat token economics as an afterthought almost always rebuild their context pipeline before month three.

🧪

Try It Yourself — Three Exercises

15 min

The fastest way to internalise these concepts is to experience them directly. Each exercise below takes 3–5 minutes and demonstrates something you cannot fully grasp from reading alone. Open a browser tab and work through them.

Exercise 1 — See tokenisation in action (tiktokenizer.vercel.app)
1

Open tiktokenizer.vercel.app and paste a paragraph you've written — a vendor communication, an email, anything. Watch how it splits into tokens. Notice which words stay whole and which get split.

2

Now paste a long document and watch the token count climb. Consider: at 200,000 tokens (Claude's context window), roughly how many pages of text could you fit? How many of your typical documents would exceed that?

3

Try rewriting the same idea in fewer words. Notice how token count drops. At scale — thousands of API calls per day — this difference translates directly into cost. Cleaner prompts are cheaper prompts.

Exercise 2 — Experience the effect of temperature (claude.ai or any LLM)
1

Send the exact same prompt to Claude twice in separate conversations: "Give me three creative names for a B2B project management tool." Notice how the responses differ — same model, same prompt, different outputs. That variation is temperature at work.

2

Now try a factual extraction task twice: "What is the capital of Karnataka?" Notice the responses are nearly identical. Low-stakes factual queries converge because one answer dominates the probability distribution.

3

Reflect: for a classification task (flagging a vendor document as compliant or non-compliant), which behaviour do you want — the creative variation or the factual consistency? That's your temperature decision.

Exercise 3 — Find the confidence trap (any LLM)
1

Ask the model a specific factual question from your domain — something about a regulation, a technical standard, a market trend, or an industry requirement — where you already know the answer is complex or has changed recently.

2

Watch it answer with apparent confidence. Then probe it: "Are you certain about that? What's your source? Could this have changed recently?" Notice whether it backtracks, hedges, or doubles down.

3

This is the core insight to carry forward: the model generates plausible text, not verified truth. Confidence in the output is not evidence of correctness. Output validation is an engineering problem — not a prompting one.

🔗

Closing the Loop — Why Prompting Works

5 min

You may already know how to prompt — but likely by feel and intuition. Now that you understand the mechanics, you can reason from first principles about why the techniques that work, work. That shift — from intuition to reasoning — is what makes prompting a disciplined skill rather than guesswork.

Why chain-of-thought works

When you say "think step by step," you're not coaching the model to be more careful. You're forcing it to generate intermediate tokens — and those intermediate tokens become part of the context the model attends to when producing the final answer. The reasoning steps literally appear in the probability distribution of what comes next. Writing the steps out loud helps the model get to the right final token.

Why few-shot examples work

When you include 2–3 examples of input/output in a prompt, you're conditioning the token probability distribution. The model has seen similar patterns in training. Seeing them in context shifts the distribution toward outputs that match the format and style you've demonstrated. Examples are essentially sample data for real-time distribution shaping.

Why specificity beats vagueness

Vague prompts produce high-entropy outputs — many possible next tokens are roughly equally likely. Specific, detailed prompts narrow the distribution dramatically. When you say "summarize this in 3 bullet points for a non-technical executive audience," you've constrained the space of plausible next tokens at every step. That's not style advice — it's probability engineering.

Why role-playing the model works

"You are a senior compliance officer..." primes the model with a cluster of associated tokens from training. The attention mechanism picks up on "compliance officer" and shifts the distribution toward formal, precise, risk-aware language — because that's what correlates with those tokens in training data. Persona prompting is activating a learned statistical cluster.

The core insight

Every prompting technique is a mechanism for shaping token probability distributions. You're not convincing the model — you're conditioning what it statistically considers most likely next. Once you internalise this, you can reason from first principles about why a prompt isn't working, instead of guessing at rewrites.

⚖️

AI Ethics & Bias — Where They Come From

7 min

AI bias is not a design choice or a political statement — it's a mathematical consequence of training. A model learns from data. If the data reflects historical inequities, the model encodes them. If the data over-represents certain demographics, languages, or viewpoints, the model performs better for those groups and worse for everyone else. This is not fixable by prompting. It's a property of the training process.

Training data bias

Most LLMs were trained primarily on English-language text scraped from the internet. This means they perform significantly better on English than on other languages, better on Western cultural contexts than others, and encode the biases — including gender, racial, and socioeconomic ones — present in that training corpus.

Language & regional performance gaps

A model that performs excellently in English may produce noticeably weaker output in regional languages, mixed-script text, or vernacular dialects. This isn't a minor quality difference — it can mean factually incorrect outputs, loss of nuance, or culturally inappropriate responses that damage user trust.

Representation bias in outputs

When asked to generate content about professionals, leaders, or experts, models default toward representations that match patterns in their training data. This can result in systematically skewed outputs for resumes, role descriptions, or hiring-related tasks — and create liability for organisations that deploy these outputs in people decisions.

Amplification, not just reflection

Models don't just reflect bias — they can amplify it. Because outputs are generated at scale and may be treated as authoritative, a biased pattern in a model's output can influence many downstream decisions before anyone notices. Scale amplifies both capability and error.

What this means for enterprise deployment: Before deploying any AI system that makes or influences decisions about people — hiring, credit, healthcare, access to services — you need to audit the model's performance across your specific demographic and language groups, not just its average benchmark performance. Average performance can look excellent while hiding severe underperformance for specific subgroups.

The audit principle

Test your AI system on the populations it will actually serve — not on benchmark datasets that may not represent them. If your product operates across multiple languages or regions, measure quality separately for each. A system that is 95% accurate on average but 60% accurate for a specific group is not a good system for that group. It's a system that excludes them.

🎲

Probabilistic vs Deterministic Thinking

6 min

This is the single biggest mindset shift for engineers and product managers coming from traditional software. It changes how you test, how you debug, how you measure quality, and how you design systems. Until you make this shift, AI systems will feel unreliable and unmanageable. After it, they become tractable.

Traditional software is deterministic

Given the same input, you always get the same output. A bug either happens or it doesn't. You write unit tests that pass or fail. A release is either correct or broken. Debugging means finding the specific line of code that caused the failure.

AI systems are probabilistic

Given the same input, you get different outputs each time. Quality is a distribution, not a binary. You measure error rates across populations, not individual correctness. A "bug" might mean "this fails 12% of the time on this input type." There is no line of code to fix — you improve the distribution.

1
You test statistically, not with unit tests
A single test case tells you almost nothing about an AI system. You need an eval set of 50–500+ examples across the full range of inputs the system will encounter. Quality is measured as a rate — "this system produces acceptable output on 94% of inputs from our eval set" — not as a boolean pass/fail.
2
You iterate on data and prompts, not just code
In traditional software, improving quality means fixing code. In AI, improving quality means improving prompts, improving training data, improving context quality, or choosing a better model. The "code" is often the least important variable. Engineers who insist on treating AI like deterministic software will always struggle with it.
3
Failures are distributions, not incidents
When a traditional system breaks, there's usually a single root cause to find. When an AI system fails, it fails on a pattern of inputs — a specific category, a language, an edge case. Debugging means identifying the distribution of failures, not a single incident. "It gets confused when the input is very long" is a useful debugging observation. "It was wrong once" is not.
4
Acceptable error rate is a product decision, not a technical one
There is no "100% correct" for an AI system — only "what error rate is acceptable for this use case?" For a low-stakes brainstorming tool, 85% may be fine. For a medical triage assistant, 99.9% may still be insufficient. This threshold must be set by product and business stakeholders before a system ships, not discovered after deployment.
The practical implication

Before building any AI system, answer: what is the acceptable error rate for this specific use case, and how will you measure whether you've met it? If you can't answer both parts, you're not ready to deploy. The measurement infrastructure is as important as the AI itself.

👤

Human-in-the-Loop — When AI Must Not Act Alone

6 min

There are tasks where AI can produce output independently and tasks where a human must review that output before anything happens. The design decision about where this boundary sits is one of the most important — and most frequently skipped — decisions in enterprise AI deployment.

Getting it wrong in one direction means under-using AI (humans reviewing everything, defeating the efficiency gain). Getting it wrong in the other direction means AI acting autonomously in situations where the cost of error — legal, financial, reputational, or human — is too high.

AI can act independently when

The cost of error is low and easily reversible. The output is informational, not decisional. The task has high volume and low stakes per instance. There is a feedback loop to catch systematic errors before they compound. Examples: draft generation, summarisation, classification for routing.

Human review is required when

The decision affects a person's rights, access, employment, or financial position. The output will be communicated externally as your organisation's position. The regulatory environment imposes human accountability. The error cost is irreversible. Examples: hiring decisions, contract approvals, compliance flags, financial disbursements, healthcare recommendations.

The three review patterns in practice:

A
AI drafts, human approves
The AI generates a complete draft — a document, a recommendation, a response — and a human reviews and either approves, edits, or rejects it before it's used. This captures most of the efficiency gain while keeping a human in the accountability chain. Best for medium-stakes, moderate-volume tasks.
→ Contract clause generation, performance review summaries, external communications
B
AI flags, human decides
The AI identifies anomalies, risks, or items that need attention and surfaces them to a human, but takes no action itself. The human then decides what to do. This is the right pattern for risk and compliance workflows where AI is better than humans at scanning volume but humans must own the decision.
→ Fraud detection, document compliance review, quality assurance workflows
C
AI acts, human audits
The AI takes action autonomously (sends a message, updates a record, routes a request) and a human reviews a sample of these actions periodically. Only appropriate when the action is low-stakes, fully reversible, and there is strong monitoring. Must include automatic escalation when the AI encounters low-confidence situations.
→ FAQ responses, auto-categorisation, notification routing, draft creation
The design question every AI project must answer

For each action your AI system will take: what is the consequence of this being wrong, and is that consequence reversible? If the answer to either part makes you uncomfortable, the design needs a human checkpoint. This is not a limitation of AI — it is responsible system design. The goal is not maximum automation; it is maximum value at acceptable risk.

📝

Prompting Strategy — The PCTFE Framework

10 min

Most people prompt the way they'd text a colleague — casually, partially, assuming shared context. That works for quick personal tasks. It fails consistently when you're building something that needs to work reliably across hundreds of different inputs from different users. You need a structure.

The PCTFE framework is a five-element scaffold for writing prompts that are explicit, testable, and maintainable. Think of it the same way you'd think about writing a function: inputs, behaviour, outputs — specified completely. (A code sketch assembling all five elements follows the list below.)

P
Persona — Who the model should be
Define a role that activates the right cluster of knowledge, tone, and behaviour. Be specific — not just "expert" but what kind, for what audience, with what authority level. The persona anchors everything that follows.
You are a senior customer policy specialist at a B2B SaaS company. You help enterprise clients understand service agreements, billing, and compliance requirements. You are direct and precise, adapt your communication style to the client's seniority level, and never speculate on legal matters without explicitly flagging uncertainty.
C
Context — What the model needs to know
Supply the background information the model cannot infer: who the user is, what situation they're in, relevant data, prior decisions, constraints. Without context, the model fills gaps with plausible assumptions — which are frequently wrong.
The client (Account ID: ACC-4821) has been on the platform for 8 months. Their last invoice was $12,400 on March 15. They are currently flagged for a missing compliance declaration required for their industry tier. The following is their support query: {user_message}
T
Task — What you want done, precisely
State the action explicitly. Use verbs. Specify scope. Break compound tasks into ordered steps if needed. "Help the client" is not a task. "Identify the compliance issue, explain it in plain language, and list the exact documents required to resolve it" is a task.
1. Identify the specific compliance issue from the client's record above. 2. Explain the issue in plain language the client can understand. 3. List exactly what documents are required and where to submit them. 4. Estimate the resolution timeline based on standard processing times.
F
Format — How the output should look
Specify structure, length, tone, and language. Without format guidance, the model will make its own choices — which may not match your UI, your user's expectations, or your downstream parsing logic. Always be explicit.
Respond in the client's preferred language. Use this structure:
- Issue: [one sentence]
- What this means: [2-3 sentences, plain language]
- Required documents: [numbered list]
- Next step: [one clear action]
Keep the total response under 150 words. Do not use legal jargon.
E
Examples — Show, don't just tell
Include 1–3 worked examples of ideal input/output pairs whenever the task involves judgement, classification, or a specific output style. Examples are the highest-leverage element in a prompt — they shift the probability distribution more reliably than any amount of verbal instruction.
Example — Account missing compliance declaration: Issue: Required compliance documentation not on file. What this means: We cannot process transactions above the standard tier limit without verified compliance documentation. Your account activity is currently restricted. Required documents: 1. Signed compliance declaration (Form C-7), 2. Supporting certification from your industry body Next step: Upload documents via the account portal → Settings → Compliance → Upload Documents. --- [Add 1–2 more examples for other common account issue types]
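The sketch below assembles the five elements into the two strings a production call actually sends — system and user. Every constant and helper name here is illustrative, not a prescribed structure:

```python
# Assembling PCTFE into a production prompt. All names and content
# fragments are illustrative.
PERSONA = "You are a senior customer policy specialist at a B2B SaaS company. ..."
FORMAT_RULES = (
    "Respond using this structure:\n"
    "- Issue: [one sentence]\n"
    "- What this means: [2-3 sentences, plain language]\n"
    "- Required documents: [numbered list]\n"
    "- Next step: [one clear action]\n"
    "Keep the total response under 150 words."
)
EXAMPLES = "Example (account missing compliance declaration):\nIssue: ..."

def build_prompt(account_record: str, user_message: str) -> dict:
    system = f"{PERSONA}\n\n{FORMAT_RULES}\n\n{EXAMPLES}"  # Persona + Format + Examples
    user = (
        f"Account record:\n{account_record}\n\n"           # Context
        "1. Identify the specific compliance issue from the record above.\n"  # Task
        "2. Explain it in plain language.\n"
        "3. List exactly what documents are required.\n\n"
        f"Client query: {user_message}"
    )
    return {"system": system, "user": user}

prompt = build_prompt("ACC-4821: compliance declaration missing ...",
                      "Why is my account restricted?")
```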
Not every prompt needs all five

For a quick personal task, two elements might be enough. For a production system prompt that will run thousands of times a day, all five are mandatory. The framework scales — use as much of it as the stakes require. But if a prompt is misbehaving, the fix is almost always in a missing or underspecified element.

Additional techniques that compound on the framework:

Negative constraints

Explicitly tell the model what NOT to do. "Do not speculate", "Never mention competitor platforms", "Do not apologise more than once". Models follow negative constraints reliably — use them for compliance-critical outputs.

Output delimiters

Ask the model to wrap specific output in XML-like tags: <summary>, <action>, <confidence>. This makes programmatic parsing trivial and prevents the model from mixing reasoning with output (see the parsing sketch after this list).

Grounding with "only use the provided information"

For factual, document-based tasks, instruct the model to only use information from the provided context. Never invent. If the answer isn't in the document, say so. This is the single most effective anti-hallucination instruction.

Confidence flagging

Instruct the model to rate its confidence (High/Medium/Low) and explain uncertainty. "If you are not certain, say 'I'm not confident about this' and explain why." Turns a binary correct/hallucinated output into a graduated, auditable one.
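Delimiters pay off on the parsing side: each tagged field becomes a one-line extraction. A sketch — the tag names are whatever you instructed the model to use, and the response text here is invented:

```python
# Extracting delimited fields from a model response. Tag names match
# whatever your prompt asked for; the response text here is invented.
import re

response = """<summary>Account restricted: compliance declaration missing.</summary>
<action>Upload Form C-7 via the account portal.</action>
<confidence>High</confidence>"""

def extract(tag, text):
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None  # None means the model broke the format

print(extract("summary", response))
print(extract("confidence", response))  # feeds routing: e.g. "Low" -> human review
```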

⚠️

Common Prompting Mistakes — Before & After

8 min

These are the six patterns that consistently produce bad output — not because the model is broken, but because the prompt is underspecified. Each one has a simple fix once you know what to look for; a code sketch of the defensive layer behind Mistakes 3 and 4 closes this section.

Mistake 1 — The vague ask
Fixed
Weak prompt
"Write something about our service billing policy."
No audience, no length, no format, no purpose. The model will write anything — usually generic and too long.
Strong prompt
"Write a 100-word summary of our service billing policy for new clients who have never received an invoice. Use simple language. Focus on: billing frequency, payment terms, and what to do if there's a discrepancy."
Mistake 2 — Stacking multiple tasks
Fixed
Weak prompt
"Summarise this document, identify any compliance issues, suggest improvements, translate it into another language, and then write a follow-up email to the client."
Five unrelated tasks in one. The model will rush each one, conflate outputs, or drop tasks entirely.
Strong prompt
Break into sequential prompts: 1. "Summarise this document in 3 bullet points." 2. "Based on the summary, identify compliance issues." 3. "Draft a follow-up email based on issue [X]." Each call gets full attention.
Mistake 3 — No format instruction
Fixed
Weak prompt
"Extract the key dates from this contract."
Produces prose, bullets, a table, or a JSON object — randomly. Impossible to parse programmatically or display consistently.
Strong prompt
"Extract all dates from this contract. Return ONLY a JSON array: [{"event": "...", "date": "DD-MM-YYYY"}] If no date is found, return an empty array. Do not include any other text."
Mistake 4 — Trusting the model with maths
Fixed
Weak approach
"Calculate the total billing amount for 47 clients given this rate table."
LLMs are unreliable at multi-step arithmetic. The answer may look correct and be wrong by thousands. Never trust raw LLM output for financial calculations.
Strong approach
Use the model to write the formula or Python/spreadsheet code that does the calculation. Execute that code separately. The model writes logic reliably; it executes arithmetic unreliably.
Mistake 5 — Not grounding factual tasks
Fixed
Weak prompt
"What are the compliance requirements for contract workers under local employment law?"
The model will answer from training data — which may be outdated, jurisdiction-vague, or factually incorrect. No source, no auditability, no way to catch errors.
Strong prompt
"Using ONLY the compliance document below, answer: what requirements apply to contract workers? If the document doesn't cover this, say 'Not covered in the provided document.' [paste the actual document]"
Mistake 6 — Assuming the model remembers
Fixed
Weak approach
Referencing "the vendor we discussed earlier" or "the policy from last week" in a new conversation without re-providing the data.
Every new conversation is a blank slate. The model has no recollection of previous sessions. References to prior conversations are invisible to it.
Strong approach
Always re-inject necessary context at the start of each conversation: "Vendor: [name]. Issue from last session: [summary]. Today's query: [question]." Treat every call as stateless by design.
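Mistakes 3 and 4 share a remedy in code: demand machine-readable output, then verify it outside the model before acting on it. A sketch of that defensive layer (the validation rules and error policy are illustrative choices, not a standard):

```python
# Defensive handling of "return ONLY a JSON array" output (Mistake 3).
# Never assume the model obeyed the format: parse, validate, and fail
# loudly. Validation rules and error policy here are illustrative.
import json

def parse_contract_dates(model_output: str) -> list:
    try:
        data = json.loads(model_output.strip())
    except json.JSONDecodeError:
        raise ValueError("Model did not return valid JSON: retry or escalate")
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of events")
    for item in data:
        if not isinstance(item, dict) or not {"event", "date"} <= item.keys():
            raise ValueError(f"Malformed entry: {item!r}")
    return data

dates = parse_contract_dates('[{"event": "renewal notice", "date": "01-04-2026"}]')
print(dates)
```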
🔧

Treat Prompts Like Code

5 min

A prompt that runs in production thousands of times a day is not a casual instruction — it's a critical piece of software. It should be treated with the same discipline as code: versioned, tested, documented, and reviewed before it ships.

Version your prompts

Store prompts in version control (Git) just like code. Every change should be a commit with a message explaining what changed and why. When a production prompt breaks, you need to know what was different yesterday. "v1", "v2", "final_final" in a Google Doc is not versioning.

Test with a diverse eval set

Before shipping a prompt change, run it against 20–50 representative inputs — including edge cases, adversarial inputs, and examples where the old prompt was known to fail. If you don't have an eval set, you're shipping blind. Build one alongside your first prompt.
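A minimal eval harness is just a loop and a counter. In the sketch below, call_model stands in for your actual API call, and the substring check is a deliberately crude pass criterion — real evals usually grade against rubrics or use a second model as judge:

```python
# Minimal eval harness sketch. `call_model` is a placeholder for your
# API call; the substring check is a deliberately crude pass criterion.
eval_set = [
    {"input": "Client ACC-4821 asks why transactions are blocked...", "must_contain": "compliance"},
    {"input": "Client disputes the March 15 invoice amount...",       "must_contain": "invoice"},
    # ... grow to 20-50+ cases, including every past failure you've seen ...
]

def run_evals(call_model):
    passed = 0
    for case in eval_set:
        output = call_model(case["input"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
        else:
            print("FAIL:", case["input"][:60])
    return passed / len(eval_set)

# rate = run_evals(my_model_call)  # compare against the previous prompt version's rate
```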

Change one thing at a time

Prompts behave like complex systems — changing multiple elements simultaneously makes it impossible to attribute improvements or regressions. When a prompt isn't working, form a hypothesis about one element (e.g., "the persona is too vague"), change only that, and re-evaluate. This is A/B testing for prompts.

Separate prompts from application code

Never hardcode a prompt as a string inside your application logic. Store prompts in a separate config file, database, or prompt management system. This lets non-engineers iterate on prompt text without touching code — and lets you roll back a bad prompt without a deployment.

The iteration mindset

No prompt is correct on the first attempt. The best practitioners expect 5–15 iterations before a prompt is production-ready. Each iteration should be informed by a specific failure mode observed on a specific input. "It sometimes gives wrong answers" is not a debugging statement. "On inputs where the client account has missing documentation, it fabricates a document name" is.

🔐

What Never Goes Into a Prompt

5 min

Every prompt you send to an external API crosses a network boundary and is processed on someone else's infrastructure. Most enterprise teams don't think about this until after an incident. Know these rules before you build anything that touches real user data.

🪪
National identity documents and government-issued IDs
Government-issued identity numbers — whatever form they take in your jurisdiction — are almost always classified as sensitive personal data under applicable data protection law. Sending them to a third-party AI API without explicit consent, data processing agreements, and documented purpose is a compliance violation in most regulatory frameworks (GDPR, various national data protection acts). Mask or hash identifiers before they reach the prompt.
💳
Financial account details and transaction data
Bank account numbers, payment identifiers, and financial transaction records sent to an LLM API may be logged for model improvement unless you have a zero data retention agreement in place. Most consumer-tier API contracts do not include this. Always check the provider's data handling policy before routing financial data through a prompt — and understand which regulatory framework governs your organisation's data.
📱
Customer contact information
Phone numbers, email addresses, and physical addresses are personally identifiable under virtually every data protection framework globally. If your AI product processes queries that include this data, your pipeline should anonymise or pseudonymise where possible — and your system prompt should explicitly restrict the model from reproducing or storing contact details in its output.
🔑
API keys, passwords, and secrets
Shockingly common: developers paste API keys or credentials into a prompt while debugging. These get logged, potentially retained, and are occasionally visible in model outputs if the prompt is poorly designed. Never put secrets in prompts. Use environment variables and secure vaults.
⚖️
Legally privileged communications
Legal advice, HR investigation notes, and compliance findings may be protected by privilege. Sending them through a third-party AI API may waive that protection. Always consult your legal team before routing privileged documents through any external service.
The default rule

Before putting any piece of data in a prompt, ask: "Would I be comfortable if this data appeared in the model provider's logs indefinitely?" If the answer is no — mask it, hash it, or don't include it. Design your prompts to work with pseudonymised references ("Client ID: C-4821") rather than raw personal data wherever possible.
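A sketch of that masking step. The regex patterns are illustrative only — production systems should use a vetted PII-detection library, and the salt handling here is deliberately simplified:

```python
# Pre-prompt masking sketch: swap raw identifiers for stable pseudonyms
# before text crosses the API boundary. Patterns are illustrative; use
# a vetted PII-detection library in production.
import hashlib
import re

def pseudonymise(value, salt="replace-with-a-managed-secret"):
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"REF-{digest}"  # stable: the same input always maps to the same reference

def mask_for_prompt(text):
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+",   # email-shaped strings
                  lambda m: pseudonymise(m.group()), text)
    text = re.sub(r"\+?\d[\d\s-]{8,}\d",        # phone-shaped strings
                  lambda m: pseudonymise(m.group()), text)
    return text

print(mask_for_prompt("Contact priya@example.com or +91 98765 43210 re: invoice."))
# -> "Contact REF-... or REF-... re: invoice."
```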

🚧

What AI Genuinely Cannot Do

6 min

Beyond hallucination (which the session has already covered), there are structural limitations that every person in this room should know — because overestimating AI capability in any of these directions leads to failed projects, broken trust, and real business risk.

🌐
It cannot access the internet by default
Unless web search is explicitly connected as a tool, the model has no live internet access. It cannot check today's prices, look up a recent news story, verify a phone number, or fetch a webpage. Everything it says comes from training data alone.
🎬
It cannot take actions in the world on its own
An LLM can only generate text. It cannot send an email, click a button, fill a form, make a call, or update a database — unless it has been explicitly connected to tools that do those things (which is what AI agents are about, covered in Session 4). By default, it's a responder, not an actor.
🧠
It cannot learn from your conversation
Nothing you tell it today changes the model's weights. If you correct it, it accepts the correction within the session — but next session, it's back to the same baseline. It has no mechanism to absorb and retain new knowledge from user interactions. Training is a separate, expensive, offline process.
🔢
It is unreliable at precise arithmetic and counting
LLMs are token predictors, not calculators. For simple arithmetic they perform well, but exact computation (large numbers, multi-step calculations, counting specific items) is genuinely unreliable. Always route numerical work to a calculator or code interpreter, not a raw LLM response.
🔐
It has no guaranteed confidentiality without engineering controls
Whatever you put in a prompt — system prompt, business data, customer information — may be used for model improvement or accessible to the provider unless you have a data processing agreement in place. For regulated industries (finance, health, legal), this is a compliance issue, not just a preference.
⏱️
It cannot reliably handle tasks requiring real-time or sequential state
Tasks like "monitor this inbox and reply to anything that comes in" or "track progress across 50 ongoing cases" require persistent state, timing, and coordination — none of which a base LLM provides. These require agent frameworks with memory and tooling, not just an LLM in a chat window.
The trust principle

Being honest about what AI cannot do builds more credibility — with your team and with your clients — than overselling it. Better decisions, better systems, and faster failure detection all follow from having an accurate map of the terrain from the start. The real capability is impressive enough.

💬

Reflect & Discuss

10 min

Work through these questions yourself, or bring them to the group session. They're designed to bridge what you've just read with how it applies to your actual work.

  • Q Now that you know prompts have three layers — system prompt, history, and user message — where in your work could a well-crafted system prompt replace repetitive instructions you currently give manually every session?
  • Q Given that the model's knowledge is frozen at a cutoff date and has no access to your company's data, what would have to be injected into the context for an AI tool to be genuinely useful in your specific role?
  • Q If different models reflect different RLHF-baked values, what would a compliance-critical or client-facing AI product require from the model's built-in behaviour? Does any current model meet that bar?
  • Q What's one assumption about AI you held before reading this session that now looks different? Does it change how you'd approach a task you've already been using AI for?
🧪

This Week's Experiment

Before Session 2
Map one repetitive task to an AI opportunity
Pick any repetitive task you do — a message you send often, a document you review, a decision you make routinely. Write a one-paragraph description of what you're doing and what you wish AI could do for you. Note it down and bring it to the next session.

This exercise primes everything in Session 2 — your real example will be the raw material for understanding RAG, memory, and context injection. Generic examples are forgettable. Your actual workflow is not.

Coming next Sunday
Session 02 — Context, Memory & Why AI Forgets
May 4 · 7:00 PM · RAG, vector databases, stateless vs stateful systems