TL;DR: You’d expect an AI that “predicts the most likely next word” to give the same answer to the same question. It doesn’t, because it doesn’t just take the most likely word every time. At each step it has a ranked list of candidate words with probabilities, and it draws one by probability: a weighted lottery where the favorite usually wins but not always. The dial that controls how adventurous that draw gets is called temperature. Even setting it to 0 doesn’t reliably give identical output (a 2025 test got 80 different results from 1,000 runs at temperature 0). And “it varies” is a separate thing from “it’s making things up,” so don’t confuse the two.
The other day I wanted a tagline for a little side project and couldn’t be bothered to write one, so I asked an AI: “give me a one-line slogan for my tech blog, just one.” Didn’t love it, hit regenerate, got another, regenerated once more. Three completely different lines, no overlap.
You’ve probably hit this too. You regenerate hoping to get back the good answer from a second ago, and it’s gone for good. Which raises a fair question: if the model really just “picks the most likely next word,” then the same question with the same starting point should make it pick the exact same words and hand you the exact same answer. So why is it different every single time?
It isn’t picking the highest-probability word
The line everyone repeats is that a language model just predicts the next word. That’s true. But it’s easy to fill in the rest as “so it picks the most likely one each time,” and that’s the part that’s wrong.
Slow down the moment where it produces one word. What it actually has is a whole ranked list of candidate words, each with a probability. Say after some sentence the options are “happy” 40%, “tired” 25%, “busy” 15%, trailing off into a long tail of less and less likely words. If it rigidly took the 40% option every time, it really would be a machine that prints the same answer forever. But that’s not how it picks. It’s closer to drawing a ticket from that distribution: the high-probability words hold more tickets and come up often, but the longer-shot words hold tickets too, and now and then one of them wins.
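If it helps to see the ticket-drawing idea run, here is a minimal sketch in Python. It uses the made-up 40/25/15 numbers from above; the "(other)" bucket standing in for the long tail, and the function names, are my own assumptions for illustration, not anything a real model exposes.

```python
import random
from collections import Counter

# Toy next-word list from the example above. "(other)" is a hypothetical
# bucket that lumps the whole long tail into one entry.
candidates = ["happy", "tired", "busy", "(other)"]
probs = [0.40, 0.25, 0.15, 0.20]

def pick_greedy(words, weights):
    """Always take the single most likely word."""
    return max(zip(words, weights), key=lambda pair: pair[1])[0]

def pick_sampled(words, weights, rng):
    """Draw one word; each word's chance equals its probability."""
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demo is repeatable
greedy_draws = Counter(pick_greedy(candidates, probs) for _ in range(1000))
sampled_draws = Counter(pick_sampled(candidates, probs, rng) for _ in range(1000))

print("greedy :", dict(greedy_draws))
print("sampled:", dict(sampled_draws))
```

The greedy picker prints "happy" a thousand times out of a thousand. The sampled one still favors "happy" but hands out a fair share of the other words too, which is the weighted lottery in miniature.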
One word gets drawn that way, then the next word is drawn from a fresh list of probabilities, and the whole reply is sampled out one piece at a time. Change an early draw even slightly and the lists downstream all shift with it, so the paths fan out. That’s why a re-asked question tends to start similar and drift apart further in.
What’s the dial behind this?
It’s called temperature, and it controls how adventurous that drawing gets.
Turn it down and the model plays it safe, leaning toward the highest-probability word every time: conservative, repetitive. Turn it up and it’s more willing to reach for the unlikelier words, so the output gets more varied and more imaginative. You’d want it high for a poem, a slogan, or ideas you haven’t thought of; you’d want it low when you’re asking it to format a table the same way every time and not get creative.
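For the curious, the usual way this dial is described is that the model's raw scores get divided by the temperature before they are turned into probabilities. Here is a rough sketch of that idea, assuming a standard softmax. The scores are hypothetical, picked so that temperature 1.0 reproduces the 40/25/15 example from earlier (plus an invented fourth bucket for the long tail), and the sketch only handles temperatures above zero.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores: logs of 0.40 / 0.25 / 0.15 / 0.20, so that
# temperature 1.0 gives those probabilities back.
words = ["happy", "tired", "busy", "(other)"]
scores = [math.log(p) for p in (0.40, 0.25, 0.15, 0.20)]

low = softmax_with_temperature(scores, 0.2)
mid = softmax_with_temperature(scores, 1.0)
high = softmax_with_temperature(scores, 2.0)

for label, dist in (("T=0.2", low), ("T=1.0", mid), ("T=2.0", high)):
    print(label, {w: round(p, 3) for w, p in zip(words, dist)})
```

At the low setting "happy" swallows most of the probability, and at the high setting the four options move closer together, which is the "play it safe" versus "reach for the unlikelier words" behavior in numbers.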
The catch is that the everyday chat box you type into (ChatGPT, Claude, Gemini) doesn’t usually put that dial in front of you. They pick a middle default behind the scenes, enough to keep some variety without wandering off topic. So what you feel is exactly that: a little different each time, never wildly off.
One small thing I noticed testing this: I asked a few current models for that same slogan, and one of them gave the identical opening words on two of three tries, only branching later. Which makes sense: at the very first word the top candidate is usually far ahead, so the lottery keeps landing on it, and it’s only further in, where two or three candidates sit close together, that there’s room to diverge. It drifts, but it doesn’t drift randomly.
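You can reproduce that pattern with a toy. The sketch below uses a tiny next-word table I invented for illustration (no real model is involved): the first word has one runaway favorite, and every later step is a coin flip between two close candidates.

```python
import random
from collections import Counter

# Hypothetical next-word table: each word maps to (options, probabilities).
NEXT = {
    "<start>": (["Code", "Notes"], [0.9, 0.1]),   # one clear favorite
    "Code": (["that", "for"], [0.5, 0.5]),        # close calls from here on
    "Notes": (["on", "from"], [0.5, 0.5]),
    "that": (["ships", "lasts"], [0.5, 0.5]),
    "for": (["humans", "builders"], [0.5, 0.5]),
    "on": (["software", "shipping"], [0.5, 0.5]),
    "from": (["production", "practice"], [0.5, 0.5]),
}

def generate(rng, max_words=3):
    """Sample one word at a time, each draw conditioned on the previous word."""
    word, out = "<start>", []
    while word in NEXT and len(out) < max_words:
        options, weights = NEXT[word]
        word = rng.choices(options, weights=weights, k=1)[0]
        out.append(word)
    return " ".join(out)

rng = random.Random(1)  # seeded only so the demo is repeatable
slogans = [generate(rng) for _ in range(200)]
first_words = Counter(s.split()[0] for s in slogans)

print("first words:", dict(first_words))
print("distinct slogans:", len(set(slogans)))
```

Most runs open with the same word, yet the full slogans still split into several different endings, because the later draws are the ones with real competition.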
Doesn’t temperature 0 make it deterministic?
You’d think so, and this is the part I found genuinely surprising: turning the randomness all the way down still doesn’t reliably give you the same answer twice.
A 2025 writeup from Thinking Machines Lab, Defeating Nondeterminism in LLM Inference, tested exactly this. They ran 1,000 completions at temperature 0 (supposedly the fully deterministic setting) and still got 80 distinct outputs, first diverging around the hundredth token. The cause turned out to have nothing to do with the sampling dial. It’s that the underlying arithmetic runs in a slightly different order depending on how your request happens to get batched with others on the GPU, and those tiny numerical differences are enough to tip a close call between two candidate words. Making the math batch-invariant fixed it, but that’s an engineering effort, not a checkbox.
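The core of that is easy to see on your own machine, without a GPU. What follows is not the code from the writeup, just a small Python sketch of the underlying principle: adding the same floating-point numbers in a different order can give a different result. The second half uses deliberately exaggerated, made-up scores to show how that can flip which candidate comes out on top.

```python
# 1) Same three numbers, two groupings, two slightly different sums.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)  # False
print(grouped_left, grouped_right)

# 2) Exaggerated toy: candidate A's score is a sum of three terms,
#    and the order of addition decides whether it beats its rival.
def sum_in_order(terms):
    """Add terms strictly left to right, as a simple loop would."""
    total = 0.0
    for t in terms:
        total += t
    return total

score_order_1 = sum_in_order([1e16, 1.0, -1e16])  # 1e16 + 1.0 rounds back to 1e16
score_order_2 = sum_in_order([1e16, -1e16, 1.0])  # the big values cancel first

RIVAL_SCORE = 0.5  # hypothetical score for candidate B

def top_candidate(score_a):
    """Pick whichever candidate has the higher score (no sampling at all)."""
    return "A" if score_a > RIVAL_SCORE else "B"

print(score_order_1, "->", top_candidate(score_order_1))
print(score_order_2, "->", top_candidate(score_order_2))
```

Nothing random happens anywhere in that snippet, and the "winner" still changes with the order of addition. Real differences are far smaller than in this toy, but a close call between two candidate words only needs a tiny nudge.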
For everyday use you don’t need the internals. The takeaway is just: “run it again and get the same thing” isn’t something you can lean on, even at the most deterministic setting. (The newer “reasoning” models that think before answering vary even more; this post is only about that most basic layer.)
Is “it varies” the same as “it’s making things up”?
No, and it’s worth keeping these two apart, because they get blamed for each other.
The variation is just the sampling doing its job. It tells you nothing about whether any particular answer is correct. Making things up is a different failure: the model states something wrong in the exact same confident tone it uses for true things, because it doesn’t separate what sounds fluent from what is true. One is “it took a different path this time.” The other is “it can’t tell whether it’s bluffing.” A varied answer isn’t a sign it’s lying to you, and a steady, confident answer isn’t a sign it didn’t roll the dice.
So next time you regenerate and get something different, don’t read it as the model being unreliable. It just drew another ticket. The thing actually worth your attention is the same as always: whatever it handed you this round, varied or not, you still have to judge whether it’s right. That’s the one habit I lean on hardest across all three assistants day to day.


