How LLMs Actually Work
Strip away the magic. Understand what a language model really does under the hood — tokens, embeddings, attention, and prediction — and why this changes how you think about AI.
What This Course Is
This is a practitioner's course on AI — not a prompting workshop. You may already know how to talk to AI tools. This course teaches you how to think with AI — to understand what's happening under the hood, and to reason from first principles about how to build with it.
A ground-up course on how AI actually works — mechanics, architecture, prompting strategy, context, cost, and enterprise design patterns. Every concept connects directly to real work.
This is not theory for theory's sake. Every session has a direct application — the kind of problems that come up in operations, compliance, vendor management, and client workflows.
By session 8, you should be able to design AI into a workflow, not just use it as a chat tool. That means understanding systems, not just prompts.
Read each section, try the exercises and experiments at the end, and bring your questions and observations to the next session. The content builds progressively across all 8 weeks.
Myths vs Reality
Before getting into mechanics, it helps to clear the slate. Everyone arrives with a head full of AI assumptions shaped by science fiction, news headlines, and marketing copy. If these aren't named and addressed directly, they quietly distort everything else you'll learn. Read each one and check your own assumptions against the reality.
Every single one of these myths leads to a bad product decision or a misplaced trust failure in practice. The faster you internalise the real picture, the faster you can build genuinely useful things with AI — instead of building on top of a misunderstanding.
Part 1 — The Intuition
Before reading further — pause and ask yourself: when you send a message to ChatGPT or Claude, what do you think is actually happening inside the model? Hold that intuition in mind, then read what follows.
Text → Tokens. AI doesn't read words the way humans do. It reads chunks called tokens. A single word might be one token, or it might be split across several. The model never sees raw text — only a sequence of numbers.
You can see this yourself by visiting tiktokenizer.vercel.app — paste any text and watch how it gets split into tokens. Try a company name, a technical term, or a sentence from a document you work with, and notice how it tokenizes. Then paste a long paragraph and see the token count. This builds direct intuition for why token limits exist and why phrasing matters at scale.
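If you want the same intuition in code, the open-source tiktoken library (one of OpenAI's tokenizers) shows the split directly. A minimal sketch; exact token counts differ between models and tokenizers, so treat the numbers as illustrative:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's encodings; Claude and other models use
# different tokenizers, so their counts will differ slightly.
enc = tiktoken.get_encoding("cl100k_base")

text = "Vendor onboarding requires a countersigned master services agreement."
token_ids = enc.encode(text)

print(len(text.split()), "words")                 # how a human would count
print(len(token_ids), "tokens")                   # how the model counts
print([enc.decode([t]) for t in token_ids])       # see where words were split
```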
Tokens → Numbers (Embeddings). Every token gets converted into a point in a very high-dimensional space — imagine a map with thousands of axes. Words that mean similar things cluster close together. The classic example: King − Man + Woman ≈ Queen. That result comes from early word-embedding models, and the same geometric idea underlies the embeddings inside modern LLMs. The model has learned meaning through geometry.
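Here is the King − Man + Woman idea as a toy calculation. The vectors below are invented three-dimensional stand-ins (real embeddings have hundreds or thousands of dimensions); the point is only that meaning becomes arithmetic on points in space:

```python
import numpy as np

# Invented 3-D "embeddings" for illustration only.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "invoice": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    """Similarity of direction: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = vec["king"] - vec["man"] + vec["woman"]
closest = max(vec, key=lambda word: cosine(vec[word], target))
print(closest)   # "queen" with these toy numbers
```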
The Prediction Machine. At its core, every LLM is doing one thing: predicting the next most likely token, over and over. Every word in a response is the output of this prediction, chained together. This is the single most important fact to internalise — because it explains both the power and the failure modes of AI.
If the model is "just predicting the next token," then it can be confidently wrong. It doesn't know the answer — it knows what a plausible answer looks like. This is why hallucinations happen. This is why you need to verify AI output in critical contexts.
Part 2 — Multimodality: Beyond Text
Everything we've described so far — tokens, embeddings, prediction — applies to text. But modern enterprise models are not text-only. They can receive and process multiple types of input. This is called multimodality, and it's what makes AI genuinely useful across functions that don't primarily work in text.
Someone in finance reviewing invoices, in HR processing resumes, or in compliance checking contracts works primarily in documents, tables, and images. The moment they realise the model can read those directly, the session becomes relevant to them.
Look through the six input types above. Which one maps most directly to something you do in your job today? That's your starting point for thinking about where AI could add real value to your work — not in the abstract, but in that specific task.
Part 3 — The Attention Mechanism
This is the "secret sauce" of the Transformer architecture. Keep it conceptual — no matrix math needed. The goal is just the mental model.
Key idea: not all words in a sentence matter equally to each other.
"The worker submitted his documents but he forgot to attach his photo." — When the model reads the second "his", how does it know it refers to "the worker" and not "the documents"? That's attention. The model has learned to look back over the entire input and weigh which earlier tokens are most relevant to understanding the current one.
Attention is why models can handle long, complex sentences without losing track of what refers to what. And it's why longer, richer prompts often perform better — you're giving the model more context to "attend to" when forming its response.
This is also the computational reason why longer contexts are more expensive. Every token is attending to every other token, so the work grows quadratically: double the tokens and the attention computation roughly quadruples.
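For readers who want to see the mechanism rather than just the metaphor, here is a stripped-down sketch of the core computation in numpy. Real Transformers learn separate query, key, and value projections and run many attention heads in parallel; this keeps only the shape of the idea. The `scores` matrix has one row and one column per token, which is exactly where the quadratic cost comes from:

```python
import numpy as np

def toy_self_attention(X):
    """X: (n_tokens, d_model). Single head, no learned projections."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                    # (n, n): every token vs every other token
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax: attention weights per token
    return weights @ X                               # each token becomes a weighted mix of all tokens

tokens = np.random.randn(6, 8)                       # 6 tokens, 8-dimensional embeddings
print(toy_self_attention(tokens).shape)              # (6, 8)
# Double the number of tokens and the scores matrix has four times as many entries.
```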
Part 4 — Training vs Inference
Most people conflate these two phases. Separating them is one of the most clarifying things you can do early in this course.
Training: the model reads a massive corpus of text, predicts the next token billions of times, and adjusts its internal weights every time it's wrong. This is done once, by the company that built the model (Anthropic, OpenAI, etc.). It takes months and costs millions.
Inference: when you send a message, the model uses those frozen weights to generate a response. No learning is happening. It is not updating based on your conversation. The model is the same whether it's your first message or your ten-thousandth.
"Every time you start a new chat, you're talking to the exact same frozen model. It has no idea who you are. What looks like memory is just the conversation history — text pasted into the context window. When you close the tab, it's gone."
This sets up Session 2 perfectly — because the next question is: how do companies build AI systems that actually do remember things? That's where RAG and vector databases come in. Plant that question tonight and let it sit.
Part 5 — RLHF: Why the Model Is Actually Helpful
Here's a question the session so far hasn't answered: if the base model is just predicting the next token from internet text, why does it behave helpfully instead of generating random noise, hate speech, or conspiracy theories — all of which exist in its training data?
The answer is Reinforcement Learning from Human Feedback (RLHF). It's the step that turns a raw text predictor into a useful assistant — and it's why Claude, ChatGPT, and Gemini feel like products and not just autocomplete engines.
Stage 1: pretraining. The base model trains on a massive corpus of internet text. At this point it can generate coherent text, but it has no sense of what's "good" or "helpful." It might complete your sentence with misinformation just as readily as a correct answer.
Stage 2: supervised fine-tuning. Human trainers write examples of ideal responses to thousands of prompts. The model is fine-tuned on these examples — it starts learning the shape of a "good" answer. But this alone isn't enough.
Stage 3: the reward model. Humans rank multiple model outputs from best to worst. A separate "reward model" is trained on these rankings — it learns to score how good a response is. This becomes the judge.
Stage 4: reinforcement learning. The main model generates responses, the reward model scores them, and the main model is updated to produce higher-scoring outputs over time. It learns helpfulness, harmlessness, and honesty through this feedback loop.
Different companies apply RLHF with different values and priorities. This is why Claude, ChatGPT, and Gemini behave differently even though they're all LLMs. When you choose a model for an enterprise product, you're also choosing the values and guardrails baked into that RLHF process. It's not just a performance decision — it's a policy decision.
Part 6 — Anatomy of a Prompt
When you send a message to an AI, you're seeing one small piece of what the model actually receives. In production systems — the kind you'll build at work — a "prompt" is three distinct things, and conflating them is one of the most common engineering mistakes.
The system prompt: instructions set by the developer, not the user. This is where you define who the model is, how it should behave, what it knows about your product, what it's allowed to say, and what it must never say. Users typically never see this. Think of it as the employee briefing before the customer walks in.
"You are a customer policy specialist for an enterprise SaaS platform. Your role is to help registered partners understand service agreements, billing schedules, and compliance requirements. Never speculate on legal matters — flag uncertainty explicitly. Adapt your language to the formality level of the person writing to you."
The conversation history: every previous message in the current session — both the user's messages and the model's responses — is included in full, prepended to the new request. The model has no memory mechanism; it simply re-reads the entire conversation each time. This is why long conversations get expensive and why context limits matter.
The user message: what the user typed right now. This is the only part most people think about — but in a well-engineered system, it's actually the smallest piece. The system prompt and injected context often contain far more tokens than the user's actual question.
In any enterprise AI product, your job as the builder is mostly to craft the system prompt and manage what gets injected into the conversation history. The user message is just the trigger. This framing completely changes how you think about AI product design — you're not just writing prompts, you're designing a context pipeline.
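As a concrete sketch, here is roughly what those three layers look like in an API call, using the Anthropic Python SDK. The model name, policy text, and history below are placeholders, not production values:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

# Layer 1: the system prompt -- written by the developer, never seen by the user.
system_prompt = (
    "You are a customer policy specialist for an enterprise SaaS platform. "
    "Never speculate on legal matters; flag uncertainty explicitly."
)

# Layer 2: the conversation history -- every earlier turn, resent in full each time.
history = [
    {"role": "user", "content": "What is the billing frequency for enterprise plans?"},
    {"role": "assistant", "content": "Enterprise plans are billed quarterly."},
]

# Layer 3: the new user message -- the only part the user actually typed just now.
new_message = {"role": "user", "content": "And what are the payment terms?"}

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder; use whichever model you have selected
    max_tokens=500,
    system=system_prompt,
    messages=history + [new_message],
)
print(response.content[0].text)
```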
Part 7 — How AI Products Are Actually Structured
There's a question every non-technical person in an organisation will eventually face: "should we use Claude.ai, buy a tool built on AI, or build something ourselves?" These are three completely different decisions — commercially, technically, and in terms of control — but they look identical from the outside. Here's the actual picture.
Always evaluate a layer by asking three questions: Who controls the system prompt? Where does our data go? Can we customise the behaviour? Consumer interfaces answer none of these. The API answers all of them — at the cost of engineering. Third-party products sit in between — evaluate them on data handling and configurability before adoption.
Thinking Like an Enterprise
Using AI as an individual and deploying AI inside an organisation are fundamentally different problems. Most AI content online is written for the individual user — the person experimenting alone, optimising their own workflow. Enterprise AI is a different discipline entirely. The technical challenge is often the easier half.
Here is how the enterprise context changes every question you'll face when building or adopting AI systems:
Individual AI use is about augmenting yourself. Enterprise AI is about augmenting a system — one made of people, processes, data, and existing technology. Every decision you make about an enterprise AI product is really a decision about how that system changes. The model is just one variable.
Part 8 — Temperature: The Real Explanation
You've heard "high temperature = creative, low temperature = consistent." That's true, but it's not the real explanation — and as engineers, you deserve the actual picture.
Every time the model predicts the next token, it doesn't just pick the single most likely one. It generates a probability distribution over every possible next token in its vocabulary. "The" might have a 40% probability, "A" might have 20%, "It" might have 15%, and thousands of other tokens share the remaining 25%.
Temperature controls how you sample from that distribution.
Low temperature (near 0): the distribution gets "sharpened" — high probability tokens become even more dominant, low probability tokens get suppressed. The model almost always picks the most likely next token. Output is predictable, repetitive, conservative. Use for: data extraction, classification, factual Q&A, summarization.
High temperature (near 1 and above): the distribution gets "flattened" — lower probability tokens get more of a chance. The model explores less obvious continuations. Output is varied, creative, sometimes surprising. Use for: brainstorming, copywriting, generating diverse options, creative tasks.
Very high temperatures (above 1.0) don't just make the model "more creative" — they start randomly surfacing low-probability tokens, which can produce incoherent, hallucinated, or nonsensical output. Creativity and reliability are genuinely in tension here. In enterprise systems, most production use cases sit between 0.0 and 0.7.
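Mathematically, temperature is a single division applied to the model's raw scores (logits) before they become probabilities. A minimal sketch with an invented four-token vocabulary:

```python
import numpy as np

def next_token_probs(logits, temperature):
    """Softmax with temperature: lower T sharpens the distribution, higher T flattens it."""
    scaled = np.array(logits, dtype=float) / max(temperature, 1e-6)
    exp = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    return exp / exp.sum()

tokens = ["The", "A", "It", "Perhaps"]
logits = [2.0, 1.3, 1.0, -1.0]               # invented raw scores for each candidate token

for t in (0.2, 0.7, 1.5):
    probs = next_token_probs(logits, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs)))
# At T=0.2 the top token takes almost all the probability mass;
# at T=1.5 the unlikely tokens start getting sampled noticeably often.
```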
Part 9 — Training Data & the Knowledge Cutoff
Every LLM has a knowledge cutoff date — the point at which its training data stopped being collected. Anything that happened after that date is invisible to the model. It doesn't know about last week's news, a law that changed last month, or your company's Q2 results.
What was it actually trained on? Mostly: publicly available web text (Common Crawl), books, Wikipedia, code repositories, scientific papers, and curated datasets. This means the model is extraordinarily good at general knowledge, reasoning, writing, and code — but it has zero awareness of your proprietary business data, internal documents, or anything non-public.
What the model cannot know: events after the training cutoff; your company's internal data; prices, regulations, or facts that have changed since training; anything from private or paywalled sources that wasn't in the training corpus.
You inject fresh context into the prompt at runtime — documents, database records, API results, current date. This is the foundation of RAG (Retrieval Augmented Generation), which Session 2 covers in full. The model's frozen knowledge becomes a reasoning engine, not a source of truth.
Any AI product you build that needs current, accurate, or proprietary information must inject that information into the context. Relying on the model's training data alone for factual claims in a business context is an engineering mistake — not a prompting mistake. Always design for this from the start.
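A minimal sketch of what that injection looks like. The retrieval step (how the right document is found) is Session 2's topic, so here the document is just a hard-coded string; the shape of the assembled prompt is the point:

```python
from datetime import date

def build_grounded_prompt(question: str, retrieved_document: str) -> str:
    """Assemble a prompt that treats the model as a reasoning engine over injected facts."""
    return (
        f"Today's date: {date.today().isoformat()}\n\n"
        "Answer using ONLY the document below. "
        "If the answer is not in the document, say so explicitly.\n\n"
        f"<document>\n{retrieved_document}\n</document>\n\n"
        f"Question: {question}"
    )

doc = "Enterprise plans are billed quarterly. Payment terms are net 30 from invoice date."
print(build_grounded_prompt("What are the payment terms for enterprise plans?", doc))
```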
Context Management — The Real Engineering Problem
The context window is the single most important constraint in practical AI engineering. Everything you need the model to "know" for a given request must fit inside it — and what doesn't fit, doesn't exist as far as the model is concerned. Understanding how it fills up, what happens when it overflows, and how to manage it deliberately separates working prototypes from production systems.
When input exceeds the window, most APIs silently truncate the oldest content — typically the beginning of the conversation or the earliest injected document. The model never warns you. It simply can't see what was cut.
Irrelevant, noisy, or contradictory content in the context window actively harms output quality — even if there's still space. The model attends to everything. Garbage in the context competes with relevant signals.
Models pay more attention to content near the start and end of the context ("primacy and recency bias"). Critical instructions should be in the system prompt or reiterated near the user message — not buried in the middle.
Four patterns for managing context in production:
Every token in your context should earn its place. Ask of each piece: does the model need this to answer correctly? If not — cut it. Context hygiene is not optimisation; it's correctness. A model given clean, relevant context consistently outperforms the same model drowning in noise.
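One common illustration of deliberate context management is a sliding window over the conversation history: keep the system prompt, keep the most recent turns verbatim, and replace everything older with a short summary instead of letting the API truncate silently. A minimal sketch, with the summarisation call left as a stub:

```python
def trim_history(messages, max_recent=6):
    """Keep the most recent messages; stand in for the rest with a placeholder summary.
    In production the dropped turns would be summarised by a cheap, fast model."""
    if len(messages) <= max_recent:
        return messages
    dropped = len(messages) - max_recent
    summary_stub = {
        "role": "user",
        "content": f"[Summary of the {dropped} earlier messages would be injected here]",
    }
    return [summary_stub] + messages[-max_recent:]

history = [{"role": "user", "content": f"message {i}"} for i in range(20)]
print(len(trim_history(history)))   # 7: the summary stub plus the 6 most recent messages
```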
Part 10 — Model Families & What "Bigger" Actually Means
You'll make model selection decisions when building AI products. "GPT-4 is better than GPT-3.5" — but what does that mean? What are you actually choosing between?
The most important dimension is parameter count. Parameters are the numerical weights that get updated during training — the millions or billions of numbers that collectively encode the model's knowledge and behavior. A 70B model has 70 billion of these weights. A 7B model has 7 billion.
What a larger model buys you: better reasoning on complex, multi-step problems; more nuanced instruction-following; better performance on tasks that require synthesizing multiple pieces of information; a more "coherent worldview" across a long conversation.
What it costs you: slower inference (higher latency per response); higher cost per API call; more compute required to run. For many tasks — classification, extraction, simple Q&A — a smaller model performs identically at a fraction of the cost.
The practical decision framework: Use the smallest model that reliably handles your task. Start with a mid-tier model, measure quality, then step up only if needed. In high-volume enterprise systems, the cost difference between a small and large model can be 10–50x. That's not a detail — it's the business case.
Anthropic — Claude family. Claude Opus (frontier, reasoning-heavy tasks), Claude Sonnet (balanced performance and cost), Claude Haiku (fast, lightweight, high-volume tasks). Strong on instruction-following and safety.
OpenAI — GPT family. GPT-4o (multimodal, strong general capability), o1/o3 (reasoning-optimized, slower but exceptional at logic and math). Largest ecosystem of integrations and tooling.
Google — Gemini family. Gemini Ultra, Pro, and Flash. Strong on multimodal tasks and Google Workspace integration. Flash tier is extremely cost-effective for high-volume inference.
Open-source — Llama, Mistral, Qwen, Falcon. Can be self-hosted. No API cost. Full control over data privacy. Trade-off: you own the infrastructure, the updates, and the guardrails. Critical for compliance-sensitive enterprise contexts.
Regional & domain-specific models. A growing category of models trained on specific languages, regions, or industries — Sarvam AI and Krutrim (South Asian languages), Mistral (strong French performance), Jais (Arabic), medical-specific models, legal-specific models. These frequently outperform frontier models on their target domain while being far cheaper to run. Don't assume GPT-4 is the right tool for every language and market.
Context Economy — The Real Cost of AI at Scale
Tokens are the atomic unit of both capability and cost. Every API call is billed on two dimensions: how many tokens went in (input) and how many came out (output). Understanding this changes every design decision you make — from how you write system prompts to how long you let conversations run.
The deeper issue is that token costs are not linear. They compound. A conversation that starts cheap becomes expensive fast — and most teams don't notice until the invoice arrives.
This is the context tax. Every time you send a new message, you pay for the entire conversation history again — not just what you typed. By turn 10, the majority of your token spend is on context you've already paid for in previous turns. This compounds across thousands of daily users in a production system.
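A back-of-the-envelope sketch of that compounding. Every number here is invented for illustration; plug in your own system prompt size and message lengths:

```python
# Illustrative assumptions: a 1,000-token system prompt, ~150 tokens of user input
# and ~300 tokens of model output per turn.
system_tokens, user_per_turn, reply_per_turn = 1_000, 150, 300

history_tokens = 0
total_input_tokens = 0
for turn in range(1, 11):
    input_this_turn = system_tokens + history_tokens + user_per_turn   # everything is resent
    total_input_tokens += input_this_turn
    history_tokens += user_per_turn + reply_per_turn                   # history grows each turn
    if turn in (1, 5, 10):
        print(f"turn {turn:2d}: {input_this_turn:,} input tokens")

print(f"total input tokens over 10 turns: {total_input_tokens:,}")
# With these assumptions turn 1 costs 1,150 input tokens and turn 10 costs 5,200 --
# and most of the later spend is history you already paid to send in earlier turns.
```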
The four levers of context economy:
API plans are tiered by monthly token volume and rate limits (requests per minute). When you're in development, low-volume usage is cheap. When you go to production with real users, costs scale with every conversation, every document, every retry. Model your production cost before you launch: estimate average tokens per interaction × daily active users × 30 days. This number is often a surprise.
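A worked version of that estimate. All prices and volumes below are invented; substitute your provider's current per-token rates and your own traffic assumptions:

```python
# Illustrative numbers only -- check your provider's current pricing.
input_price_per_1k  = 0.003    # dollars per 1,000 input tokens
output_price_per_1k = 0.015    # dollars per 1,000 output tokens

avg_input_tokens  = 2_500      # system prompt + injected context + history per interaction
avg_output_tokens = 400
interactions_per_user_per_day = 8
daily_active_users = 2_000

cost_per_interaction = (
    avg_input_tokens / 1_000 * input_price_per_1k
    + avg_output_tokens / 1_000 * output_price_per_1k
)
monthly_cost = cost_per_interaction * interactions_per_user_per_day * daily_active_users * 30

print(f"cost per interaction: ${cost_per_interaction:.4f}")   # $0.0135 with these assumptions
print(f"estimated monthly cost: ${monthly_cost:,.0f}")        # ~$6,480 with these assumptions
```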
Stop thinking in per-token costs and start thinking in per-task costs. What does it cost to classify one document? To generate one support reply? To summarise one meeting? Once you have a cost-per-task figure, you can compare it against the human cost of the same task, set a payback period, and make a defensible business case for the AI investment.
Most AI projects are cheap during evaluation and expensive in production — not because the price per token changes, but because evaluation uses a handful of carefully chosen inputs and production uses everything. Design your context strategy in week one, not after your first invoice. The teams that treat token economics as an afterthought almost always rebuild their context pipeline before month three.
Try It Yourself — Three Exercises
The fastest way to internalise these concepts is to experience them directly. Each exercise below takes 3–5 minutes and demonstrates something you cannot fully grasp from reading alone. Open a browser tab and work through them.
Exercise 1: Open tiktokenizer.vercel.app and paste a paragraph you've written — a vendor communication, an email, anything. Watch how it splits into tokens. Notice which words stay whole and which get split.
Now paste a long document and watch the token count climb. Consider: at 200,000 tokens (Claude's context window), roughly how many pages of text could you fit? How many of your typical documents would exceed that?
Try rewriting the same idea in fewer words. Notice how token count drops. At scale — thousands of API calls per day — this difference translates directly into cost. Cleaner prompts are cheaper prompts.
Exercise 2: Send the exact same prompt to Claude twice in separate conversations: "Give me three creative names for a B2B project management tool." Notice how the responses differ — same model, same prompt, different outputs. That variation is temperature at work.
Now try a factual extraction task twice: "What is the capital of Karnataka?" Notice the responses are nearly identical. Low-stakes factual queries converge because one answer dominates the probability distribution.
Reflect: for a classification task (flagging a vendor document as compliant or non-compliant), which behaviour do you want — the creative variation or the factual consistency? That's your temperature decision.
Exercise 3: Ask the model a specific factual question from your domain — something about a regulation, a technical standard, a market trend, or an industry requirement — where you already know the answer is complex or has changed recently.
Watch it answer with apparent confidence. Then probe it: "Are you certain about that? What's your source? Could this have changed recently?" Notice whether it backtracks, hedges, or doubles down.
This is the core insight to carry forward: the model generates plausible text, not verified truth. Confidence in the output is not evidence of correctness. Output validation is an engineering problem — not a prompting one.
Closing the Loop — Why Prompting Works
You may already know how to prompt — but likely by feel and intuition. Now that you understand the mechanics, you can reason from first principles about why the techniques that work, work. That shift — from intuition to reasoning — is what makes prompting a disciplined skill rather than guesswork.
When you say "think step by step," you're not coaching the model to be more careful. You're forcing it to generate intermediate tokens — and those intermediate tokens become part of the context the model attends to when producing the final answer. The reasoning steps literally appear in the probability distribution of what comes next. Writing the steps out loud helps the model get to the right final token.
When you include 2-3 examples of input/output in a prompt, you're conditioning the token probability distribution. The model has seen similar patterns in training. Seeing them in context shifts the distribution toward outputs that match the format and style you've demonstrated. Examples are essentially sample data for real-time distribution shaping.
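A small sketch of what that conditioning looks like in practice. Two worked examples sit in the prompt, and the model's continuation tends to follow the demonstrated format; the categories and emails below are invented:

```python
few_shot_prompt = """Classify each vendor email as BILLING, COMPLIANCE, or OTHER.

Email: "Please confirm the revised invoice schedule for Q3."
Category: BILLING

Email: "Attached is the updated data-processing agreement for your review."
Category: COMPLIANCE

Email: "Can we move Thursday's onboarding call to the afternoon?"
Category:"""

# Sent to a chat model as-is, the most likely continuation is " OTHER":
# the two examples have shifted the distribution toward the demonstrated pattern.
print(few_shot_prompt)
```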
Vague prompts produce high-entropy outputs — many possible next tokens are roughly equally likely. Specific, detailed prompts narrow the distribution dramatically. When you say "summarize this in 3 bullet points for a non-technical executive audience," you've constrained the space of plausible next tokens at every step. That's not style advice — it's probability engineering.
"You are a senior compliance officer..." primes the model with a cluster of associated tokens from training. The attention mechanism picks up on "compliance officer" and shifts the distribution toward formal, precise, risk-aware language — because that's what correlates with those tokens in training data. Persona prompting is activating a learned statistical cluster.
Every prompting technique is a mechanism for shaping token probability distributions. You're not convincing the model — you're conditioning what it statistically considers most likely next. Once you internalise this, you can reason from first principles about why a prompt isn't working, instead of guessing at rewrites.
AI Ethics & Bias — Where They Come From
AI bias is not a design choice or a political statement — it's a mathematical consequence of training. A model learns from data. If the data reflects historical inequities, the model encodes them. If the data over-represents certain demographics, languages, or viewpoints, the model performs better for those groups and worse for everyone else. This is not fixable by prompting. It's a property of the training process.
Most LLMs were trained primarily on English-language text scraped from the internet. This means they perform significantly better on English than on other languages, better on Western cultural contexts than others, and encode the biases — including gender, racial, and socioeconomic ones — present in that training corpus.
A model that performs excellently in English may produce noticeably weaker output in regional languages, mixed-script text, or vernacular dialects. This isn't a minor quality difference — it can mean factually incorrect outputs, loss of nuance, or culturally inappropriate responses that damage user trust.
When asked to generate content about professionals, leaders, or experts, models default toward representations that match patterns in their training data. This can result in systematically skewed outputs for resumes, role descriptions, or hiring-related tasks — and create liability for organisations that deploy these outputs in people decisions.
Models don't just reflect bias — they can amplify it. Because outputs are generated at scale and may be treated as authoritative, a biased pattern in a model's output can influence many downstream decisions before anyone notices. Scale amplifies both capability and error.
What this means for enterprise deployment: Before deploying any AI system that makes or influences decisions about people — hiring, credit, healthcare, access to services — you need to audit the model's performance across your specific demographic and language groups, not just its average benchmark performance. Average performance can look excellent while hiding severe underperformance for specific subgroups.
Test your AI system on the populations it will actually serve — not on benchmark datasets that may not represent them. If your product operates across multiple languages or regions, measure quality separately for each. A system that is 95% accurate on average but 60% accurate for a specific group is not a good system for that group. It's a system that excludes them.
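A minimal sketch of that per-group audit, assuming you have already logged each evaluation case with the user's language group and whether the model's answer was correct (the data below is invented):

```python
from collections import defaultdict

# Hypothetical evaluation log: (language group, model answered correctly)
results = [
    ("english", True), ("english", True), ("english", True), ("english", False),
    ("hindi", True), ("hindi", False), ("hindi", False),
    ("kannada", True), ("kannada", False), ("kannada", False),
]

totals, correct = defaultdict(int), defaultdict(int)
for group, ok in results:
    totals[group] += 1
    correct[group] += ok

print(f"overall accuracy: {sum(correct.values()) / len(results):.0%}")
for group in totals:
    print(f"  {group}: {correct[group] / totals[group]:.0%} (n={totals[group]})")
# A respectable-looking average can hide a group the system barely serves.
```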
Probabilistic vs Deterministic Thinking
This is the single biggest mindset shift for engineers and product managers coming from traditional software. It changes how you test, how you debug, how you measure quality, and how you design systems. Until you make this shift, AI systems will feel unreliable and unmanageable. After it, they become tractable.
Deterministic (traditional software): given the same input, you always get the same output. A bug either happens or it doesn't. You write unit tests that pass or fail. A release is either correct or broken. Debugging means finding the specific line of code that caused the failure.
Probabilistic (AI systems): given the same input, you get different outputs each time. Quality is a distribution, not a binary. You measure error rates across populations, not individual correctness. A "bug" might mean "this fails 12% of the time on this input type." There is no line of code to fix — you improve the distribution.
Before building any AI system, answer: what is the acceptable error rate for this specific use case, and how will you measure whether you've met it? If you can't answer both parts, you're not ready to deploy. The measurement infrastructure is as important as the AI itself.
Human-in-the-Loop — When AI Must Not Act Alone
There are tasks where AI can produce output independently and tasks where a human must review that output before anything happens. The design decision about where this boundary sits is one of the most important — and most frequently skipped — decisions in enterprise AI deployment.
Getting it wrong in one direction means under-using AI (humans reviewing everything, defeating the efficiency gain). Getting it wrong in the other direction means AI acting autonomously in situations where the cost of error — legal, financial, reputational, or human — is too high.
AI can act alone when: the cost of error is low and easily reversible; the output is informational, not decisional; the task has high volume and low stakes per instance; there is a feedback loop to catch systematic errors before they compound. Examples: draft generation, summarisation, classification for routing.
A human must review when: the decision affects a person's rights, access, employment, or financial position; the output will be communicated externally as your organisation's position; the regulatory environment imposes human accountability; the error cost is irreversible. Examples: hiring decisions, contract approvals, compliance flags, financial disbursements, healthcare recommendations.
The three review patterns in practice:
For each action your AI system will take: what is the consequence of this being wrong, and is that consequence reversible? If the answer to either part makes you uncomfortable, the design needs a human checkpoint. This is not a limitation of AI — it is responsible system design. The goal is not maximum automation; it is maximum value at acceptable risk.
Prompting Strategy — The PCTFE Framework
Most people prompt the way they'd text a colleague — casually, partially, assuming shared context. That works for quick personal tasks. It fails consistently when you're building something that needs to work reliably across hundreds of different inputs from different users. You need a structure.
The PCTFE framework is a five-element scaffold for writing prompts that are explicit, testable, and maintainable. Think of it the same way you'd think about writing a function: inputs, behaviour, outputs — specified completely.
For a quick personal task, two elements might be enough. For a production system prompt that will run thousands of times a day, all five are mandatory. The framework scales — use as much of it as the stakes require. But if a prompt is misbehaving, the fix is almost always in a missing or underspecified element.
Additional techniques that compound on the framework:
Negative constraints: explicitly tell the model what NOT to do. "Do not speculate", "Never mention competitor platforms", "Do not apologise more than once". Models follow negative constraints reliably — use them for compliance-critical outputs.
Structured output tags: ask the model to wrap specific output in XML-like tags: <summary>, <action>, <confidence>. This makes programmatic parsing trivial and prevents the model from mixing reasoning with output.
Grounding: for factual, document-based tasks, instruct the model to only use information from the provided context. Never invent. If the answer isn't in the document, say so. This is the single most effective anti-hallucination instruction.
Confidence signalling: instruct the model to rate its confidence (High/Medium/Low) and explain uncertainty. "If you are not certain, say 'I'm not confident about this' and explain why." This turns a binary correct/hallucinated output into a graduated, auditable one.
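Here is a small sketch that combines three of the techniques above (negative constraints, grounding, and tagged output), plus the parsing step that tags make trivial. The field names, contract text, and example response are all illustrative:

```python
import re

prompt = """Using ONLY the contract excerpt below, identify any compliance issues.
Do not speculate. If the excerpt does not cover a topic, do not mention it.

<contract>
Payment is due within 90 days of invoice. No data-retention clause is specified.
</contract>

Respond in exactly this format:
<summary>one sentence</summary>
<issues>semicolon-separated list, or "none found"</issues>
<confidence>High, Medium, or Low</confidence>"""

# An illustrative model response in the requested format:
model_output = """<summary>The excerpt sets 90-day payment terms and omits data retention.</summary>
<issues>missing data-retention clause; payment terms exceed standard net-30</issues>
<confidence>Medium</confidence>"""

fields = {
    tag: re.search(rf"<{tag}>(.*?)</{tag}>", model_output, re.S).group(1)
    for tag in ("summary", "issues", "confidence")
}
print(fields["issues"])
```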
Common Prompting Mistakes — Before & After
These are the eight patterns that consistently produce bad output — not because the model is broken, but because the prompt is underspecified. Each one has a simple fix once you know what to look for.
Before: "Write something about our service billing policy."
After: "Write a 100-word summary of our service billing policy for new clients who have never received an invoice. Use simple language. Focus on: billing frequency, payment terms, and what to do if there's a discrepancy."
Before: "Summarise this document, identify any compliance issues, suggest improvements, translate it into another language, and then write a follow-up email to the client."
After: break the work into sequential prompts:
1. "Summarise this document in 3 bullet points."
2. "Based on the summary, identify compliance issues."
3. "Draft a follow-up email based on issue [X]."
Each call gets full attention.
"Extract the key dates from this contract."
"Extract all dates from this contract. Return ONLY a JSON array:
[{"event": "...", "date": "DD-MM-YYYY"}]
If no date is found, return an empty array. Do not include any other text."
"Calculate the total billing amount for 47 clients given this rate table."
Use the model to write the formula or Python/spreadsheet code that does the calculation. Execute that code separately. The model writes logic reliably; it executes arithmetic unreliably.
"What are the compliance requirements for contract workers under local employment law?"
"Using ONLY the compliance document below, answer: what requirements apply to contract workers? If the document doesn't cover this, say 'Not covered in the provided document.'
[paste the actual document]"
Before: referencing "the vendor we discussed earlier" or "the policy from last week" in a new conversation without re-providing the data.
After: always re-inject necessary context at the start of each conversation: "Vendor: [name]. Issue from last session: [summary]. Today's query: [question]."
Treat every call as stateless by design.
Treat Prompts Like Code
A prompt that runs in production thousands of times a day is not a casual instruction — it's a critical piece of software. It should be treated with the same discipline as code: versioned, tested, documented, and reviewed before it ships.
Store prompts in version control (Git) just like code. Every change should be a commit with a message explaining what changed and why. When a production prompt breaks, you need to know what was different yesterday. "v1", "v2", "final_final" in a Google Doc is not versioning.
Before shipping a prompt change, run it against 20–50 representative inputs — including edge cases, adversarial inputs, and examples where the old prompt was known to fail. If you don't have an eval set, you're shipping blind. Build one alongside your first prompt.
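A minimal sketch of that pre-ship check, assuming a small eval set saved as JSON and a classify() helper that wraps your real model call (both are placeholders you would supply):

```python
import json

def classify(prompt_version: str, document: str) -> str:
    """Placeholder for your actual model call; returns e.g. COMPLIANT or NON-COMPLIANT."""
    raise NotImplementedError

def run_eval(prompt_version: str, eval_path: str = "eval_set.json") -> float:
    """Run one prompt version against every saved case and report accuracy."""
    with open(eval_path) as f:
        cases = json.load(f)   # e.g. [{"document": "...", "expected": "COMPLIANT"}, ...]
    passed = sum(classify(prompt_version, c["document"]) == c["expected"] for c in cases)
    return passed / len(cases)

# Compare the candidate prompt against the current production prompt before shipping:
# print(run_eval("v14-candidate"), run_eval("v13-production"))
```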
Prompts behave like complex systems — changing multiple elements simultaneously makes it impossible to attribute improvements or regressions. When a prompt isn't working, form a hypothesis about one element (e.g., "the persona is too vague"), change only that, and re-evaluate. This is A/B testing for prompts.
Never hardcode a prompt as a string inside your application logic. Store prompts in a separate config file, database, or prompt management system. This lets non-engineers iterate on prompt text without touching code — and lets you roll back a bad prompt without a deployment.
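A minimal sketch of that separation, using a plain JSON file. The file name, keys, and prompt text are illustrative; the point is that the application code never changes when the prompt does:

```python
import json

# prompts.json lives outside the application code and under version control, e.g.:
# {
#   "billing_assistant": {
#     "version": "v14",
#     "system_prompt": "You are a customer policy specialist for ..."
#   }
# }

def load_system_prompt(name: str, path: str = "prompts.json") -> str:
    with open(path) as f:
        return json.load(f)[name]["system_prompt"]

# system_prompt = load_system_prompt("billing_assistant")
```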
No prompt is correct on the first attempt. The best practitioners expect 5–15 iterations before a prompt is production-ready. Each iteration should be informed by a specific failure mode observed on a specific input. "It sometimes gives wrong answers" is not a debugging statement. "On inputs where the client account has missing documentation, it fabricates a document name" is.
What Never Goes Into a Prompt
Every prompt you send to an external API crosses a network boundary and is processed on someone else's infrastructure. Most enterprise teams don't think about this until after an incident. Know these rules before you build anything that touches real user data.
Before putting any piece of data in a prompt, ask: "Would I be comfortable if this data appeared in the model provider's logs indefinitely?" If the answer is no — mask it, hash it, or don't include it. Design your prompts to work with pseudonymised references ("Client ID: C-4821") rather than raw personal data wherever possible.
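A minimal sketch of that masking step: swap known client names for their internal IDs and redact email addresses before the text ever leaves your systems. The names, the lookup table, and the patterns are illustrative; real redaction usually needs a proper PII-detection pass:

```python
import re

# Internal lookup: real identities stay in your systems; only reference IDs cross the API boundary.
client_ids = {"Anjali Rao": "C-4821", "Acme Logistics Pvt Ltd": "C-1177"}

def pseudonymise(text: str) -> str:
    for name, client_id in client_ids.items():
        text = text.replace(name, f"Client {client_id}")
    # Redact anything shaped like an email address.
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "[EMAIL REDACTED]", text)

raw = "Anjali Rao (anjali.rao@example.com) disputed the March invoice for Acme Logistics Pvt Ltd."
print(pseudonymise(raw))
# -> Client C-4821 ([EMAIL REDACTED]) disputed the March invoice for Client C-1177.
```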
What AI Genuinely Cannot Do
Beyond hallucination (which the session has already covered), there are structural limitations that every person in this room should know — because overestimating AI capability in any of these directions leads to failed projects, broken trust, and real business risk.
Being honest about what AI cannot do builds more credibility — with your team and with your clients — than overselling it. Better decisions, better systems, and faster failure detection all follow from having an accurate map of the terrain from the start. The real capability is impressive enough.
Reflect & Discuss
Work through these questions yourself, or bring them to the group session. They're designed to bridge what you've just read with how it applies to your actual work.
- Now that you know prompts have three layers — system prompt, history, and user message — where in your work could a well-crafted system prompt replace repetitive instructions you currently give manually every session?
- Given that the model's knowledge is frozen at a cutoff date and has no access to your company's data, what would have to be injected into the context for an AI tool to be genuinely useful in your specific role?
- If different models reflect different RLHF-baked values, what would a compliance-critical or client-facing AI product require from the model's built-in behaviour? Does any current model meet that bar?
- What's one assumption about AI you held before reading this session that now looks different? Does it change how you'd approach a task you've already been using AI for?
This Week's Experiment
This exercise primes everything in Session 2 — your real example will be the raw material for understanding RAG, memory, and context injection. Generic examples are forgettable. Your actual workflow is not.