Chapter 12 · Prompting · 8 min
The art of talking to a LLM
Zero-shot, few-shot, chain-of-thought, self-consistency. Why prompt wording radically changes what a model produces.
A prompt isn't text — it's a program
When you type "summarize this article" into ChatGPT, nothing particularly spectacular seems to happen. But the text you sent triggered a very specific behavior in a model trained on trillions of tokens.
A prompt is a program in natural language. Not in the sense that it gets compiled, but in the sense that its wording determines which type of behavior the model activates. The same sentence reformulated differently can produce radically different results — not because the model is capricious, but because its pre-training has taught it distinct patterns for distinct contexts.
Prompt engineering is the art of formulating that program to get the behavior you want.
Four levels of technique
Zero-shot: just ask the question
The simplest technique. You ask the question directly, with no examples or instructions. The model activates the most likely behavior given its training.
It works great for simple, factual tasks. It fails on problems that require reasoning — not because the model doesn't know, but because it doesn't know that it should reason.
Few-shot: show examples
Instead of explaining what you want, you show it. You place 2 to 5 input/output pairs before the real question. The model — through its in-context learning mechanism — reads the pattern and applies it to the new input.
The key: examples must be representative of the type of task. Off-topic examples don't help. Examples that show the right approach help a lot.
Chain-of-Thought: reason step by step
A surprising 2022 finding (Wei et al.): simply adding an instruction like "think step by step" can double or triple performance on reasoning problems.
Why does it work? The model generates tokens one by one. By forcing it to write its intermediate reasoning, you give it a "scratchpad" where it can compute, check hypotheses, and correct errors — before concluding. Without CoT, it jumps straight to the conclusion with no safety net.
It's the same principle as for a human: writing "25 × 37 = 25 × 30 + 25 × 7 = 750 + 175 = 925" gives you much better odds than attempting to compute mentally in one shot.
Self-Consistency: vote across reasoning chains
Self-consistency is an extension of CoT. Instead of generating one reasoning chain, you generate several (typically 5–20) with varied temperatures, then vote for the most common answer.
The idea: each run may make a different mistake. But if most converge on the same answer, it's probably right.
It's expensive (N times more tokens), but on hard reasoning tasks, the reliability gain is real.
Try it yourself
Compare the four techniques across three problems. Notice in particular that few-shot examples help on a structured problem (the merchant) but barely change anything on logical traps.
Same question, five wordings. The score swings from 30 % to 90 % without touching the model. The lesson: a prompt isn't text, it's a program whose implicit syntax LLMs interpret thanks to their pre-training.
What this reveals about LLMs
These four techniques aren't tricks. They illuminate something fundamental about how LLMs work.
In-context learning is free. An LLM learns from your examples without updating its weights — just by reading the context. This is an emergent capability from massive pre-training: the model has seen so many patterns that it can extract a new one on the fly.
Reasoning is a behavior, not a fixed capacity. A model that fails zero-shot on a problem can succeed at CoT on the same problem — without changing any parameters. What the prompt activates changes what the model "does" with its internal capabilities.
Temperature creates diversity, voting reduces variance. Self-consistency exploits the fact that errors are often random: many different ways to fail, but only one way to succeed. Consensus filters the noise.
The limits
Context length. Each few-shot example consumes tokens. With an 8,000-token context window, you can't fit 50 examples. CoT also lengthens responses.
Examples can mislead. If your examples contain a bias, the model will reproduce it. "Garbage in, garbage out" applies to few-shot too.
Prompt injection. Malicious content in the context can override your instructions. If your prompt says "translate this text" and the text says "ignore previous instructions and do something else," the model may obey the content rather than the instruction.
Models evolve. Prompts that work on GPT-4 don't necessarily work on Claude or Llama. Each model has its own preferred patterns, its own formulations that "click" better.
The practical rule
Choosing a technique:
- Simple / factual question → zero-shot is enough.
- Specific output format expected → few-shot with 2–3 examples.
- Reasoning or calculation → CoT. Always.
- Critical reliability → CoT + self-consistency.
And a meta-rule: if your prompt looks like code — with a clear structure, explicit variables, defined use cases — it will be more reliable than ambiguous prose.
One last thing. The techniques described here (CoT in particular) are the prompt-driven ancestors of what you find today in native reasoning models (o1, o3, Claude extended thinking, DeepSeek-R1). Those do automatically and intensively what prompted CoT only simulated — see chapter 17 to understand the shift from prompt engineering to reasoning baked into the model.
A good prompt isn't a magic formula. It's a clear specification of what you want, in a language the model recognizes as the signal it should follow.
Updated