Embeddings on KbWen Blog

What Is an Embedding? How AI Knows Two Sentences Mean the Same Thing

KbWen — Thu, 02 Jul 2026 11:30:00 +0800

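TL;DR: An embedding turns a piece of text into a long list of numbers, which is a point in space. Text with similar meaning lands in nearby positions, and when AI judges whether two sentences mean the same thing, it is measuring how much the directions of those two points differ (the angle between them, formally cosine similarity). That is why search can return results that share no words with your query yet match its meaning. The trick works well, but the classic “king − man + woman = queen” example is a bit padded.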

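Computers don't actually understand “meaning.” That's a bit of a letdown, but it's where we have to start.

Type “cat” (貓), “kitten” (小貓), or the internet slang “meow-planet folk” (喵星人), and the computer has no idea all three refer to the same furry creature. It can do exactly one thing: convert each piece of text into a list of numbers, then compare those numbers. The surprising part is that this one crude move holds up most of today's “semantic search” and “find similar” features. This post takes it apart to see how it works and, along the way, where it will fool you.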

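Turn every word into a list of coordinates

The core move is this: take a piece of text and convert it into a list of numbers. Not one number, a long list. OpenAI's current small model gives you 1,536 numbers per piece of text; the large one gives 3,072.

A list of numbers is, put plainly, coordinates. Two numbers are a point (x, y) on a plane; three numbers are a point in space; 1,536 numbers are a point you can't picture but the math can still compute. Every piece of text you feed in becomes a pin in this high-dimensional space.

The whole trick comes down to one sentence: the model places things with similar meaning near each other. So the pins for “cat” (貓), “kitten” (小貓), and “meow-planet folk” (喵星人) end up stuck together even though the strings barely overlap, because during training they kept appearing in similar contexts. (This “look at the context” idea is old; the early word2vec models ran on it.)

This is a separate matter from tokenization. Tokens are the small pieces the text is cut into; an embedding is the coordinate assigned to each piece (or to a whole sentence or document) after the cutting. As mentioned in the earlier post on how an LLM guesses its way forward one word at a time, the model is computing with numbers all the way down; here, “meaning” simply becomes numbers too.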

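Measuring “similar meaning” is measuring an angle

Two pins sit there. How do you judge whether their meanings are alike? Check whether they point in similar directions from the center.

Imagine drawing an arrow from the origin to each pin. If the directions are nearly the same, the meanings are nearly the same; if they are roughly perpendicular, the two are unrelated; if they point in opposite directions, they are opposed. The number that describes “how much the directions differ” is the cosine of the angle between the two arrows, which is cosine similarity. Same direction is 1, perpendicular is 0, and the range runs all the way down to −1.

Why look at the angle rather than the straight-line distance between the two points? Because direction is less affected by length. A short note and a long article should count as “similar” as long as they are about the same thing, and comparing direction is steadier than comparing distance. On top of that, the vectors OpenAI's embedding models return are already scaled to length 1, so computing the cosine of the angle amounts to multiplying the two lists position by position and summing the products (the dot product).

A simplified 2D version, to get a feel for it: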

import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat    = [0.9, 0.1]   # pretend this is "cat"
kitten = [0.85, 0.2]  # "kitten"
banana = [0.1, 0.95]  # "banana"

cosine(cat, kitten)  # ~0.99 -> almost the same direction
cosine(cat, banana)  # ~0.21 -> a wide angle

Real vectors have 1,536 dimensions, not 2, and nobody labels what each dimension means. But the operation is this one, only wider.
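To make the "direction versus distance" point concrete, here is a minimal sketch. The vectors are made up and deliberately not scaled to length 1, and the `euclidean` helper and the factor of 5 are assumptions for illustration only. Stretching a vector leaves its cosine with the original at 1 while the straight-line distance grows.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

note = np.array([0.9, 0.1])   # hypothetical vector for a short note
article = 5 * note            # same direction, five times as long

cos_same = cosine(note, article)       # 1.0 up to rounding: same direction
dist_same = euclidean(note, article)   # about 3.62: distance treats them as far apart
```

With vectors that are already length 1, the two measures rank neighbors identically, so this only matters when lengths vary.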

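(These lists of numbers have another fun property: you can often chop off the tail, keeping only the first few hundred of the 1,536, and the meaning is mostly still there. OpenAI's newer models were even trained for this; the technique is named Matryoshka, after the Russian nesting dolls.)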

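The classic “king − man + woman = queen” is a bit padded

You've probably seen the example that makes embeddings look magical: take the vector for “king,” subtract “man,” add “woman,” and you land on “queen.” Words can be added and subtracted! Meaning becomes arithmetic!

It's real, but it's wearing makeup. What the demo doesn't tell you: when you compute king − man + woman and look for the nearest word, the standard procedure first excludes the three words you just entered (king, man, woman). Without that exclusion, the nearest word to king − man + woman is usually... king itself.

Go back and check those classic word2vec examples and you'll find that quite a few of the crowd-pleasers only hold up thanks to this quiet “don't count the input words” step. So the more accurate statement is: vector arithmetic takes you to roughly the right neighbor's doorstep, and then a filter that hides the most obvious answer steps in and collects the pretty ending. The space really does contain this kind of regularity (there is even a clean mathematical reason why the offset exists), just not as cleanly as the “meaning = algebra” on the slides.

I bring this up not to spoil the fun but because it happens to be the cue for how to read this whole subject. Embeddings capture statistical similarity (what tends to appear alongside what), yet we can't resist saying they “understand.” Calling it understanding is wishful thinking on our part; the model only remembers what tends to appear with what.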

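Where this trick gets used, and where it will fool you

Most of the time, embeddings do their work where you can't see them:

Semantic search: the example from the opening. Search by meaning, not by keyword.
RAG: when a chatbot “looks up your documents,” it usually turns your question into a vector, finds the nearest few chunks on the map, and puts them into the context window before answering.
Dedup, clustering, recommendations: find more like this one.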
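As one illustration of the last item, a dedup pass can be sketched as "drop anything whose cosine similarity to an already kept item exceeds a threshold." The `dedupe` function, the 0.95 threshold, and the 2D vectors are all assumptions for this sketch, not something a particular library prescribes.

```python
import numpy as np

def normalize(rows):
    """Scale every row to length 1 so a dot product equals cosine similarity."""
    rows = np.asarray(rows, dtype=float)
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def dedupe(vectors, threshold=0.95):
    """Greedy pass: keep an item only if it is below the threshold against every kept item."""
    unit = normalize(vectors)
    kept = []
    for i, v in enumerate(unit):
        if all(float(v @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept

vectors = [
    [0.9, 0.1],    # pretend this is "cat"
    [0.85, 0.2],   # "kitten", about 0.99 similar to "cat"
    [0.1, 0.95],   # "banana"
]

kept = dedupe(vectors)   # [0, 2]: "kitten" is dropped as a near-duplicate of "cat"
```

The threshold is the whole design decision here: set it too low and distinct items merge, too high and obvious repeats survive.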

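The same wiring carries the same kind of risk. Because this map grew out of “what tends to appear with what in human writing,” it learns human habits along with it, including the biased ones. Two sentences may sit close together because they really are synonymous, or merely because they share the same stereotype. Geometry can't tell those two cases apart, and you can't fix that either; it comes in with the method itself, which from start to finish only computes what tends to appear with what.

So “does the AI actually understand that these two sentences mean the same thing?” turns out, once taken apart, to be fairly plain: it turned both sentences into arrows and measured the angle, and there is nothing more mystical to it. The semantic search, RAG, and similarity features you use every day are almost all this one move run at scale. Knowing that it is geometry rather than understanding at least lets you guess roughly where it will crash: when semantic search occasionally brings back something completely unrelated, this is usually where it got stuck.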

How Embeddings Work: How AI Knows Two Sentences Mean the Same Thing

KbWen — Thu, 02 Jul 2026 11:30:00 +0800

TL;DR: An embedding turns a piece of text into a long list of numbers, a point in space. Text that means similar things lands in nearby spots, and the AI decides “do these two mean the same thing” by checking whether the two points lie in the same direction from the origin: the angle between them, measured as cosine similarity. That’s how search finds a page sharing zero words with your query. It’s a genuinely useful trick, though the famous “king − man + woman = queen” demo is more of a magic show than it lets on.

Search your notes for “how to make my laptop quieter” and a decent search engine hands you a page titled “reducing fan noise on a notebook.” Not one word in common. No laptop, no quieter. Yet it’s exactly the page you wanted.

Keyword matching can’t do that. It needs the words to overlap. So what’s doing the matching?

The answer is embeddings. It’s simpler than it sounds, and the most famous demo of it is half a con. We’ll get to that.

Turn every sentence into an arrow

The move underneath is this: take a piece of text and turn it into a list of numbers. Not one number, but a long list. OpenAI’s text-embedding-3-small gives you 1,536 numbers per input; text-embedding-3-large, 3,072.

A list of numbers is just coordinates. Two numbers put a point on a page (x, y). Three put it in a room. 1,536 put it in a space you can’t picture, but the math doesn’t care that you can’t. Every sentence you embed becomes one pin stuck somewhere in that space.

Here’s the whole trick in one line: the model places the pins so that things that mean similar things land near each other. “Reduce fan noise on a notebook” gets a pin right next to “make my laptop quieter” even though they share no words, because during training the model saw them turn up in the same kinds of contexts. (That “same contexts” idea is old; it’s the engine behind the early word2vec models too.)

This is a different job from tokenizing. Tokens are how the text gets chopped into chunks to read. Embeddings are what you get after: each chunk (or whole sentence, or whole document) handed a location on the meaning-map.

Measuring meaning is measuring an angle

So you’ve got two pins. How do you ask “do these mean roughly the same thing”? You check whether they point the same way from the center.

Picture an arrow from the origin to each pin. Point nearly the same direction, the texts mean nearly the same thing. At right angles, unrelated. Opposite ways, opposed. The number that captures this is the cosine of the angle between them: cosine similarity. Same direction is 1, perpendicular is 0, and it bottoms out at −1.

Why the angle instead of the plain distance between the pins? Because direction survives length. A three-line note and a long article about the same thing should still count as similar, and comparing direction holds up where comparing distance wobbles. Conveniently, OpenAI’s embedding models hand back vectors already scaled to length 1, so the cosine is just the dot product: multiply the two lists pairwise, add them up, done.

A 2D stand-in, to get the feel:

import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

laptop = [0.9, 0.1]   # pretend this is "make my laptop quieter"
fan    = [0.8, 0.2]   # "reduce fan noise on a notebook"
banana = [0.1, 0.95]  # "banana bread recipe"

cosine(laptop, fan)     # ~0.99  -> basically the same direction
cosine(laptop, banana)  # ~0.21  -> off at a wide angle

Real vectors have 1,536 dimensions, not 2, and nobody labels what each one means. But the operation is exactly this, only wider.
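A quick check of the "cosine is just the dot product" shortcut at full width. This is a sketch with random stand-in vectors rather than real embeddings; the `unit` helper and the seed are assumptions for illustration.

```python
import numpy as np

DIM = 1536
rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

# Random stand-ins for two embeddings that are already unit length.
a = unit(rng.normal(size=DIM))
b = unit(rng.normal(size=DIM))

full_cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_only = float(a @ b)

# Both norms are 1, so the two values agree up to floating-point rounding.
gap = abs(full_cosine - dot_only)
```

Skipping the two norm computations is a small saving per pair, but it adds up when a query is compared against many stored vectors.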

One aside worth keeping: you can often chop the tail off these vectors (keep the first few hundred of the 1,536 numbers) and still get most of the meaning. OpenAI’s newer models are trained for that on purpose, using a technique called Matryoshka representation learning, named after the nesting dolls.
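Mechanically, chopping the tail is a slice followed by a rescale, because the slice of a length-1 vector is shorter than 1 and the dot-product shortcut would no longer equal the cosine. A minimal sketch, where the `truncate` helper and the cutoff of 256 are assumptions and the vector is random, so it shows the mechanics only and says nothing about how much meaning survives:

```python
import numpy as np

def truncate(v, k):
    """Keep the first k numbers, then rescale the result to length 1."""
    head = np.asarray(v, dtype=float)[:k]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.normal(size=1536)        # random stand-in for a real embedding
full = full / np.linalg.norm(full)  # scaled to length 1, like the model output

raw_slice_length = float(np.linalg.norm(full[:256]))   # less than 1
short = truncate(full, 256)
short_length = float(np.linalg.norm(short))            # 1.0 up to rounding
```

How well the shortened vector holds up depends on the model having been trained for truncation, which random data cannot demonstrate.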

The famous trick is mostly a magic show

You’ve probably seen the line that makes embeddings sound magical: take the vector for “king”, subtract “man”, add “woman”, land on “queen.” Word math. Meaning as arithmetic.

It’s real, but it’s dressed up. Here’s the part the demos skip. When you compute king − man + woman and ask for the nearest word, the standard code throws out the three words you put in. Leave them in, and the nearest vector to king − man + woman is usually king itself.

Go back through the classic word2vec examples and a lot of the crowd-pleasers only land with that quiet exclusion in place. So the honest version: the vector arithmetic nudges you into roughly the right neighborhood, and then a filter that hides the obvious answer takes credit for the punchline. There’s a real regularity in the space (there’s even a tidy mathematical reason the offset works at all), just not the clean “meaning = algebra” the slide implies.
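The exclusion step is easy to see in a toy. The 3D vectors below are hand-made so that the effect shows up; they come from no real model, and the `nearest` helper is hypothetical.

```python
import numpy as np

# Hand-made toy vectors, chosen to illustrate the effect. Not real embeddings.
vocab = {
    "king":   np.array([0.9, 0.4, 0.3]),
    "queen":  np.array([0.8, 0.5, -0.3]),
    "man":    np.array([0.1, 0.9, 0.2]),
    "woman":  np.array([0.1, 0.9, 0.0]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Word whose vector has the highest cosine similarity to target."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

target = vocab["king"] - vocab["man"] + vocab["woman"]

unfiltered = nearest(target)                                   # 'king'
filtered = nearest(target, exclude=("king", "man", "woman"))   # 'queen'
```

In this toy the arithmetic moves the target only slightly away from "king" (cosine about 0.98 against "king" versus about 0.91 against "queen"), so "queen" wins only once the inputs are removed from the candidates.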

I raise this not to be a killjoy but because it’s the tell for how to read all of this. Embeddings capture statistical similarity: what turns up near what. That’s a good deal less than the word “understands” implies, and it’s easy to forget when a system leans on the map and calls the result understanding.

Where you actually meet this

Most of the time embeddings work out of sight:

Semantic search: the laptop-and-fan case. You search by meaning, not by keyword.
RAG: when a chatbot “looks something up” in your documents, it usually embeds your question, finds the nearest chunks on the map, and pastes them into the context window before answering.
Dedup, clustering, recommendations: “find me more like this.”
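The retrieval step shared by semantic search and RAG can be sketched as "score every chunk against the query, keep the top few." The `top_k` helper, the third chunk, and all the 2D vectors are assumptions for illustration; a real system would get its vectors from an embedding model and would typically use a vector index rather than scoring every chunk.

```python
import numpy as np

def top_k(query, chunk_vectors, k=2):
    """Indices of the k chunk vectors closest in angle to the query, best first."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(chunk_vectors, dtype=float)
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q   # one cosine similarity per chunk
    return [int(i) for i in np.argsort(-scores)[:k]]

chunks = [
    "reduce fan noise on a notebook",
    "banana bread recipe",
    "clean dust out of laptop vents",   # hypothetical extra chunk
]
chunk_vectors = [[0.8, 0.2], [0.1, 0.95], [0.7, 0.4]]   # made-up 2D stand-ins

query_vector = [0.9, 0.1]   # pretend this is "make my laptop quieter"
hits = [chunks[i] for i in top_k(query_vector, chunk_vectors)]
# hits == ["reduce fan noise on a notebook", "clean dust out of laptop vents"]
```

In a RAG setup, the text in `hits` is what gets pasted into the context window before the model answers.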

And the same wiring carries the same warning. Because the map is built from how words co-occur in human writing, it inherits human patterns, including the ugly ones. Two sentences can sit close because they truly mean the same thing, or just because they lean on the same stereotype. The geometry can’t tell those two apart, and you can’t tune that out; it comes in with the method, which only ever knew what-sits-near-what.

So “does the AI understand that these two mean the same?” comes down to something almost embarrassingly literal: it turned both into arrows and checked the angle. That single move, run at a scale you can’t picture, is most of what “semantic” anything does today: search, retrieval, “more like this.” Geometry doing an impression of comprehension, good enough most of the time that it’s easy to forget which one it is. The place it trips is exactly where those two come apart.