Temperature on KbWen Blog

If You Set Temperature to 0, Will AI Give the Same Answer Every Time?

KbWen — Fri, 19 Jun 2026 10:00:00 +0800

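TL;DR: There’s a piece of advice you see all over the internet: to make an AI give the same answer every time, just set temperature to 0. It sounds airtight. Temperature 0 tells the model to pick the highest-probability token at every step, with no randomness, so the output should be identical on every run. In practice it isn’t. One test ran the same prompt 1,000 times at temperature 0 and still got 80 different outputs. The cause isn’t as simple as “floating-point error.” How many other requests yours gets batched with on the GPU quietly changes the order of the arithmetic. Most deterministic does not mean reproducible.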

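“Want an AI to give the same answer every time? Just set temperature to 0.” Online, this advice is practically the standard answer, and at first glance it’s entirely reasonable: temperature 0 switches that layer of randomness off completely, and with it off, shouldn’t every run be the same?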

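It’s also the natural next step from the previous post, Why Does AI Give a Different Answer Every Time You Ask? That post explained that by default the model draws by probability, which is why it drifts, so surely you can just turn that randomness off.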

That’s almost right, and the trouble is in the “almost.” Actually run it, and it breaks in an unexpected place.

Turn the temperature all the way down and it still varies

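In 2025, Thinking Machines Lab published Defeating Nondeterminism in LLM Inference, which tested exactly this. They took one prompt, set temperature to 0 (in theory the most “deterministic” setting), and ran it against the same model 1,000 times. They got 80 different outputs. And the outputs weren’t scrambled from the start: they were identical at first and only began to diverge at the 103rd token.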

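In other words, with the “randomness” dial turned all the way down, it still hands you 80 versions (mostly different wordings of the same meaning, not answers flipping between right and wrong). So the problem clearly isn’t temperature. The randomness was already off, so where does this variation come from?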

It’s not as simple as floating-point error

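The explanation you hear most often at this point is “oh, that’s just floating-point error.” That answer is half right and half lazy, and the lazy half happens to be the important one.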

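Floating-point numbers do have a quirk: their addition is not associative. In plain terms, (a+b)+c and a+(b+c) can come out very slightly different. Computers round decimals and leave a little error at the tail end, and when you change the order of the additions, the error lands somewhere else. That is the foundation, sure. But the foundation alone isn’t enough to make your results differ from run to run. If the order of the additions were fixed every time, the error would be the same every time too, and the result would still be reproducible.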
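You can see the non-associativity for yourself in a few lines of Python. This is a minimal sketch using ordinary double-precision floats; the values 0.1, 0.2, and 0.3 are just a convenient illustration and do not come from the Thinking Machines post.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left == right)      # False
print(abs(left - right))  # a tiny but nonzero difference

# With a fixed grouping, the result is identical on every run.
print(((a + b) + c) == left)  # True
```

The last line is the point made above: the rounding error by itself is perfectly repeatable as long as the order of operations never changes.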

The real trigger is that the order isn’t actually fixed.

How many others it gets packed in with is out of your hands

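The key is how the answer gets computed. You might assume that when you send a request, the server computes it on its own. It doesn’t. For efficiency, the server groups the requests arriving at the same moment into a batch and computes them together. The size of that batch changes constantly: how many people are using the service right now, and how many other requests yours happens to be grouped with, differ every time. When the batch size changes, the GPU splits up the work differently and sums long runs of numbers in different segments. In other words, the kernel that handles the matrix math adds the numbers in a different order. Once the order changes, the floating-point error described earlier lands in a different place.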
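As a rough illustration of how a different split changes the result, here is a toy sketch in plain Python. It is not how a real GPU kernel works: `chunked_sum` is a hypothetical helper, the chunk size merely stands in for whatever splitting strategy a kernel picks for a given batch, and the values are deliberately extreme so the rounding effect is easy to see.

```python
def chunked_sum(values, chunk_size):
    """Sum each chunk left to right, then add up the partial sums."""
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

# Same numbers, same mathematical sum; only the grouping differs.
values = [1e16, 1.0, -1e16, 1.0]

print(chunked_sum(values, 4))  # one chunk of four: 1.0
print(chunked_sum(values, 2))  # two chunks of two: 0.0
```

In a real model the discrepancy is far smaller than this, usually in the last few bits of a number, but the mechanism is the same: the split determines where the rounding happens.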

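Most of the time this bit of error makes no difference. But occasionally, when two candidate tokens have nearly equal probabilities, that sliver of error is just enough to knock the former first place down and put the runner-up in its spot. Once one token changes, the rest of the sequence branches off from there.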
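The tipping effect can be sketched with made-up numbers. Everything here is hypothetical (the candidate tokens, their scores, and the size of the drift); the only claim is that picking the maximum is sensitive to tiny changes when two scores are nearly tied and insensitive when they are far apart.

```python
def greedy_pick(scores):
    """Return the candidate with the highest score, as temperature 0 would."""
    return max(scores, key=scores.get)

drift = 2e-7  # a tiny numerical difference, far smaller than the scores

# Near tie: the drift is enough to swap first and second place.
close_a = {"fast": 2.0000001, "quick": 2.0}
close_b = {"fast": 2.0000001, "quick": 2.0 + drift}
print(greedy_pick(close_a))  # fast
print(greedy_pick(close_b))  # quick

# Clear favorite: the same drift changes nothing.
wide_a = {"fast": 3.0, "quick": 2.0}
wide_b = {"fast": 3.0, "quick": 2.0 + drift}
print(greedy_pick(wide_a))  # fast
print(greedy_pick(wide_b))  # fast
```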

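So it isn’t really being “random.” Something you have no say in is steering your answer: how many other people are sharing the same GPU with you in the data center at this moment. Temperature 0 turns off the deliberate-lottery layer, but it never touches this drift buried in the compute layer.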

Can it be fixed? Yes, but it takes effort

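It can. The second half of the Thinking Machines post does exactly that: they rewrote the key kernels so that the arithmetic order stays fixed no matter how requests are batched (they call this batch-invariant). After the change, the same 1,000 runs really did come out bitwise identical. The cost is some speed, but for cases that truly need reproducibility (such as the reinforcement learning training they care about), it’s worth it.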

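The point is this: reproducibility isn’t something you get for free by default. Someone has to deliberately nail it down. Setting temperature to 0 and ticking a box won’t hand it to you automatically. (One more note: this drift is mostly noticeable on online services shared by many users. If you run an open-source model on your own machine with the batching held fixed, you can often reproduce results. So this isn’t “LLMs can never be consistent by nature”; it’s “the defaults of online services don’t guarantee it for you.”)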

So what should you actually remember?

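For all this low-level detail, for everyday AI use it boils down to one practical sentence: don’t treat “I ran it once and it was right” as “it will do this every time.” Even the most deterministic setting doesn’t guarantee word-for-word identical output, never mind that in normal use you aren’t turning off the randomness at all.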

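If what you do with AI needs “same input, same output” (checking an answer, running an automated pipeline), you have to arrange that guarantee yourself, by pinning down the requirements or simply wrapping the model in code. You can’t assume it’s stable by nature. It’s the same idea as the nagging habit from my post When AI Says “Done,” How Do You Confirm It Really Finished?: right this time only counts as right this time.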

Why Does AI Give a Different Answer Every Time You Ask?

KbWen — Fri, 19 Jun 2026 09:20:00 +0800

TL;DR: You’d expect an AI that “predicts the most likely next word” to give the same answer to the same question. It doesn’t, because it never just takes the most likely word. At each step it has a ranked list of candidate words with probabilities, and it draws one by probability — a weighted lottery where the favorite usually wins but not always. That dial is called temperature. Even setting it to 0 doesn’t reliably give identical output (a 2025 test got 80 different results from 1,000 runs at temperature 0). And “it varies” is a separate thing from “it’s making things up” — don’t confuse the two.

The other day I wanted a tagline for a little side project and couldn’t be bothered to write one, so I asked an AI: “give me a one-line slogan for my tech blog, just one.” Didn’t love it, hit regenerate, got another, regenerated once more. Three completely different lines, no overlap.

You’ve probably hit this too. You regenerate hoping to get back the good answer from a second ago, and it’s gone for good. Which raises a fair question: if the model really just “picks the most likely next word,” then the same question with the same starting point should make it pick the exact same words and hand you the exact same answer. So why is it different every single time?

It isn’t picking the highest-probability word

The line everyone repeats is that a language model just predicts the next word. That’s true. But it’s easy to fill in the rest as “so it picks the most likely one each time,” and that’s the part that’s wrong.

Slow down the moment where it produces one word. What it actually has is a whole ranked list of candidate words, each with a probability. Say after some sentence the options are “happy” 40%, “tired” 25%, “busy” 15%, trailing off into a long tail of less and less likely words. If it rigidly took the 40% option every time, it really would be a machine that prints the same answer forever. But that’s not how it picks. It’s closer to drawing a ticket from that distribution: the high-probability words hold more tickets and come up often, but the longer-shot words hold tickets too, and now and then one of them wins.
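The difference between “always take the top word” and “draw a ticket” fits in a few lines of Python. This is only a sketch of the idea, not how any particular model is implemented: the three words and their percentages come from the example above, while the “(other)” entry and its 20% share are an assumption standing in for the long tail.

```python
import random
from collections import Counter

# "(other)" lumps the long tail together; its 20% share is an assumption
# so that the weights add up to 100%.
candidates = ["happy", "tired", "busy", "(other)"]
weights = [0.40, 0.25, 0.15, 0.20]

def greedy():
    """Always take the most likely candidate."""
    return candidates[weights.index(max(weights))]

def sample(rng):
    """Draw one candidate in proportion to its probability."""
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demo is repeatable
draws = Counter(sample(rng) for _ in range(1000))

print(greedy())              # happy, every single time
print(draws.most_common())   # happy leads, but the others win too
```

Over 1,000 draws the favorite comes up most often, yet the other candidates still win a meaningful share of the time, which is exactly the lottery described above.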

One word gets drawn that way, then the next word is drawn from a fresh list of probabilities, and the whole reply is sampled out one piece at a time. Change an early draw even slightly and the lists downstream all shift with it, so the paths fan out. That’s why a re-asked question tends to start similar and drift apart further in.

What’s the dial behind this?

It’s called temperature, and it controls how adventurous that drawing gets.

Turn it down and the model plays it safe, leaning toward the highest-probability word every time: conservative, repetitive. Turn it up and it’s more willing to reach for the unlikelier words, so the output gets more varied and more imaginative. You’d want it high for a poem, a slogan, or ideas you haven’t thought of; you’d want it low when you’re asking it to format a table the same way every time and not get creative.
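For the curious, temperature is commonly applied by dividing the model’s raw scores by it before turning them into probabilities. The sketch below assumes that standard formulation; the function name `softmax_with_temperature` and the three scores are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 1.0]  # hypothetical scores for three candidate words

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature piles almost all the probability onto the top candidate, while a high one spreads it out. Temperature 0 can’t be plugged into this formula because it would divide by zero; it is typically treated as “just take the top candidate.”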

The catch is that everyday chat boxes (ChatGPT, Claude, Gemini) don’t usually put that dial in front of you. They pick a middle default behind the scenes, enough to keep some variety without wandering off topic. So what you feel is exactly that: a little different each time, never wildly off.

One small thing I noticed testing this: I asked a few current models for that same slogan, and one of them gave the identical opening words on two of three tries, only branching later. Which makes sense: at the very first word the top candidate is usually far ahead, so the lottery keeps landing on it, and it’s only further in, where two or three candidates sit close together, that there’s room to diverge. It drifts, but it doesn’t drift randomly.
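A back-of-the-envelope way to see why openings repeat: the chance that two independent draws land on the same word is the sum of the squared probabilities. The two distributions below are invented for illustration, not measured from any model.

```python
def agreement_probability(probs):
    """Chance that two independent draws from the same distribution match."""
    return sum(p * p for p in probs)

opening = [0.90, 0.05, 0.05]  # hypothetical: one clear favorite
later = [0.40, 0.35, 0.25]    # hypothetical: candidates sitting close together

print(round(agreement_probability(opening), 3))  # 0.815
print(round(agreement_probability(later), 3))    # 0.345
```

With a runaway favorite, two tries agree about four times out of five; with three close candidates, they agree only about a third of the time.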

Doesn’t temperature 0 make it deterministic?

You’d think so, and this is the part I found genuinely surprising: turning the randomness all the way down still doesn’t reliably give you the same answer twice.

A 2025 writeup from Thinking Machines Lab, Defeating Nondeterminism in LLM Inference, tested exactly this. They ran 1,000 completions of the same prompt at temperature 0 (supposedly the fully deterministic setting) and still got 80 distinct outputs, which matched at first and only diverged at the 103rd token. The cause turned out to have nothing to do with the sampling dial. It’s that the underlying arithmetic runs in a slightly different order depending on how your request happens to get batched with others on the GPU, and those tiny numerical differences are enough to tip a close call between two candidate words. Making the math batch-invariant fixed it, but that’s an engineering effort, not a checkbox.

For everyday use you don’t need the internals. The takeaway is just: “run it again and get the same thing” isn’t something you can lean on, even at the most deterministic setting. (The newer “reasoning” models that think before answering vary even more; this post is only about that most basic layer.)

Is “it varies” the same as “it’s making things up”?

No, and it’s worth keeping these two apart, because they get blamed for each other.

The variation is just the sampling doing its job. It tells you nothing about whether any particular answer is correct. That’s a different failure from when the model states something wrong in the exact same confident tone it uses for true things. That one is about it not separating fluent from true. One is “it took a different path this time.” The other is “it can’t tell whether it’s bluffing.” A varied answer isn’t a sign it’s lying to you, and a steady, confident answer isn’t a sign it didn’t roll the dice.

So next time you regenerate and get something different, don’t read it as the model being unreliable. It just drew another ticket. The thing actually worth your attention is the same as always: whatever it handed you this round, varied or not, you still have to judge whether it’s right — which is the one habit I lean on hardest across all three assistants day to day.

Chinese version: 為什麼同一個問題問 AI，每次答案都不一樣？

Why Does AI Give a Different Answer Every Time You Ask?

KbWen — Fri, 19 Jun 2026 09:00:00 +0800

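TL;DR: You probably assume an AI picks “the most likely next word” every time, so the same question should get the same answer. It doesn’t. For each word it generates, it draws one by probability from a row of candidates, each with its own probability. It’s like a weighted lottery: high-probability words win easily, but not every time. This deliberately retained randomness is called sampling, and the dial behind it is called temperature. So drifting when you re-ask is normal; it’s designed that way. But “drifting” and “bluffing with a straight face” are two different things, so don’t mix them up.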

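The other day I wanted a tagline for a small side project and couldn’t be bothered to think one up, so I handed it to an AI: “Come up with a slogan for my tech blog, in Chinese, just one line.” It wasn’t good enough, so I hit regenerate, got another, and regenerated once more. Three tries, three completely different lines (translated from the Chinese): “Exploring the future with the power of technology,” “With technology as the pen, writing down every step of growth,” and “Explore the world with technology, record your thinking in words.”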

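You’ve probably run into this too. You re-ask hoping to get back the better answer from a moment ago, and you can never get back to it. The question is, if it really does “pick the highest-probability word” as everyone says, then for the same question and the same starting point, shouldn’t it pick exactly the same words every time and give you exactly the same answer? Why is it different every time?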

It isn’t picking the “top scorer” at all

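The key is in a line that gets said too glibly: an LLM does one thing from start to finish, which is predict the next word. That’s true, but many people automatically fill in the rest as “it picks the highest-probability word every time.” It doesn’t.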

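Slow down the instant it generates one word: what it holds is a whole row of candidate words, each with a probability. Say the sentence starts “Today I’m very”; the next word might be “happy” at 40%, “tired” at 25%, “busy” at 15%, followed by a long tail of less and less likely words. If it rigidly picked the 40% option every time, it really would be a machine that spits out the same answer every time. But that isn’t how it chooses. It’s more like drawing lots from that row of probabilities: high-probability words get more tickets and win easily, while low-probability words hold a few tickets too and win once in a while. (What it sees as a “word” is actually a small chunk called a token. I covered that in What Is a Token?, and you can follow this post without reading it.)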

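One word is drawn this way, the next word is drawn from a new row of probabilities, and the whole reply is drawn out one piece at a time. If an earlier draw is slightly different, the whole row of probabilities that follows changes with it, and the paths diverge further the longer they run. That’s why, when you re-ask the same question, the opening often looks similar and the rest drifts away entirely.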
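The word-by-word chain can be mimicked with a toy table. Everything in this sketch is invented (the words, the probabilities, and the `NEXT` table itself), and a real model computes its candidates from the whole preceding text instead of looking them up. The one thing it shows is that each draw decides which list the next draw comes from.

```python
import random

# A made-up next-word table: each word maps to (candidates, weights).
NEXT = {
    "<start>": (["tech", "code"], [0.9, 0.1]),
    "tech": (["shapes", "builds"], [0.5, 0.5]),
    "code": (["tells", "writes"], [0.5, 0.5]),
    "shapes": (["tomorrow", "ideas"], [0.5, 0.5]),
    "builds": (["habits", "things"], [0.5, 0.5]),
    "tells": (["stories", "truths"], [0.5, 0.5]),
    "writes": (["itself", "history"], [0.5, 0.5]),
}

def generate(rng, length=3):
    """Draw one word at a time; each draw selects the table for the next draw."""
    word, out = "<start>", []
    for _ in range(length):
        candidates, weights = NEXT[word]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        out.append(word)
    return " ".join(out)

rng = random.Random(1)  # seeded only so the demo is repeatable
for _ in range(5):
    print(generate(rng))
```

Because the first position has a 90% favorite and the later ones are coin flips, most lines open the same way and then fan out.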

How boldly it draws has a dial

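Whether it dutifully picks the top scorer or boldly draws the long shots is adjustable. The dial is called temperature (you’ll find this parameter in the API docs of the major providers).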

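Turn it down and the model behaves, leaning toward the highest-probability word every time, so the answers are conservative and repetitive. Turn it up and it loosens up, more willing to draw words that aren’t so likely, so the answers get more varied and more imaginative. For a poem, a slogan, or ideas you haven’t thought of, you’d want it higher; for sorting data into a fixed format, you’d want it lower so it doesn’t improvise.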

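But the input box you normally chat in (ChatGPT, Claude, Gemini and the like) usually doesn’t put this dial in front of you. It sets a default behind the scenes, somewhere in the middle: enough to leave some variety without wandering off. What you feel is roughly that: a little different every time, but never far off topic. (You might think setting it to 0 would make it stable. It’s not that simple. I wrote a separate post about that surprise: If You Set Temperature to 0, Will AI Give the Same Answer Every Time?)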

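A small detail I found by trying it myself: I gave the same slogan request to a few current models, and one of them opened with nearly the same few words on two of three tries, only branching later. That makes sense. At the opening position the top candidate is far ahead, so the lottery keeps landing on it. Only in the middle and later parts, where several candidates have close probabilities, is there room to draw a different path. It drifts, but it doesn’t drift at random.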

“Drifting” and “bluffing” are two different things

There’s one more thing that is easy to confuse with this.

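Its answers differ each time because of the lottery mechanism described above. It’s designed that way and has little to do with whether the answer is correct. That’s separate from the way it sometimes states wrong things with complete confidence and a straight face; the post on that is about how it can’t tell “fluent” from “correct.” One is “it takes a different path each time”; the other is “it doesn’t know whether it’s making things up.” Drifting doesn’t mean it’s bluffing you, and sounding confident doesn’t mean it isn’t drifting.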

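So next time you re-ask and get a different answer, don’t rush to call it unreliable. It just drew another ticket. What really deserves your attention is something else: whatever it gave you this time, drifting or not, you still have to judge for yourself whether it’s right.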

English version: Why Does AI Give a Different Answer Every Time You Ask?