The Language of Artificial Intelligence
What makes two words similar?
It’s obvious that a dog and a cat are more alike than a dog and a pencil, but why?
Intuitively, our minds search for similarities between the properties of both to determine whether they are alike or not.
A dog and a cat are both living beings; they are animals, more specifically domestic animals. They are mammals, they have four legs, and so on.
A dog and a cat are similar.
A dog and a pencil, on the other hand, share fewer similar properties. Therefore, they are less alike.
For us humans, identifying these similarities is easy, and in fact, it’s how we teach our children.
— Dad, what’s a subway?
— A subway is like a train that travels underground.
We reference a known concept and clarify it with a specific property. This is known as cognitive categorization.
However, artificial intelligence models, especially those used in language understanding, rely on a different mechanism.
In the world of AI, two words are considered similar if they often appear in similar contexts.
We don’t teach the model that cat and dog are similar. We don’t teach that a subway and a train are related.
Models learn this on their own during their training phase.
And unlike ours, their training doesn’t take years; it takes hours.
Learn from Mistakes
Although training an artificial intelligence to understand the relationship between words might sound complex, the idea behind it is surprisingly simple: completing sentences.
For example:
The ___ barks when it’s hungry.
The model must predict that the missing word is dog.
If it gets it right, great.
If not, its parameters are adjusted to do better next time.
This exercise is repeated millions of times, over enormous amounts of text, and in a fraction of the time it would take us.
With enough training, the model becomes really effective at guessing the missing word.
Once the model has finished training, it can complete sentences like:
My pet is a ___ and it’s lovely.
If the model’s word prediction ability is good, it will know that the sentence could be completed with the word dog, but there’s also a high probability that the missing word is cat.
In other words, if the model has learned correctly, the words dog and cat will be among its most likely predictions. Thanks to the training, the model has learned that there is a relationship between dog and cat.
And it has learned that relationship by seeing those two words frequently in similar contexts.
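This idea, that words appearing in similar contexts are similar, can be sketched with a crude toy: counting each word’s neighboring words in a tiny, made-up corpus. It is a drastic simplification of real training, but it shows the signal the model picks up on:

```python
from collections import Counter

# A tiny invented corpus. Words used in similar contexts
# (the same neighboring words) end up with similar counts.
corpus = [
    "the dog barks when hungry",
    "the cat meows when hungry",
    "the dog sleeps at home",
    "the cat sleeps at home",
    "the pencil lies on the desk",
]

def context_counts(word):
    """Count the words that appear immediately next to `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                if i > 0:
                    counts[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    counts[tokens[i + 1]] += 1
    return counts

def overlap(a, b):
    """Number of shared context words: a rough similarity score."""
    return len(set(context_counts(a)) & set(context_counts(b)))

print(overlap("dog", "cat"))     # 2 — dog and cat share contexts
print(overlap("dog", "pencil"))  # 1 — dog and pencil barely do
```

Even in this toy, dog and cat come out more alike than dog and pencil, purely from how the words are used.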
Interestingly, that’s not what caught my attention the most. What surprised me most was an indirect consequence of this learning process, called embedding.
What is an embedding?
An embedding is how a model represents a word.
It does this using a list of numbers, a vector.
But these aren’t random numbers: they are values that capture, in compressed form, aspects of the word’s meaning.
It’s as if we were translating human language into a language that math can understand.
Let’s take a very simplified example: imagine the embedding of the word train is [3, 10, 5]
And the embedding of the word subway is [3, 10, 6]
Both vectors are nearly identical, differing only in the last number. Why?
Because train and subway are very similar: both transport people, follow a fixed route, have cars, etc.
The difference is that one runs above ground and the other underground.
So we could imagine that the last number represents the property “where it runs”:
— The value 5 indicates above ground
— The value 6 indicates underground
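With vectors, similarity becomes something we can compute. A minimal sketch using cosine similarity and the toy vectors above (the pencil vector is invented here for contrast):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

train  = [3, 10, 5]   # toy embedding from the example
subway = [3, 10, 6]   # differs only in the "where it runs" dimension
pencil = [9, 1, 2]    # hypothetical vector for an unrelated word

print(cosine_similarity(train, subway))  # close to 1.0
print(cosine_similarity(train, pencil))  # noticeably lower
```

The near-identical train and subway vectors score close to 1.0, while the unrelated pencil vector scores far lower.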
This is a very simplified example with only 3 numbers (these 3 components of the vector are called dimensions); real embeddings usually have hundreds or even thousands of dimensions. Still, it helps illustrate an important point:
Each position in the embedding captures a property of the word.
One dimension might reflect whether it’s a living being. Another might indicate whether it’s large or small. Another, whether it’s abstract or concrete.
And so on.
The model doesn’t learn these properties because we tell it explicitly.
It discovers them on its own, by observing how words are used in millions of sentences and extracting usage patterns.
In reality, we don’t know exactly what each dimension represents.
But we do know this: the result works surprisingly well.
Whatever those dimensions are capturing, they do it so well that models can predict words with almost surgical precision.
They don’t understand the world like we do, but in their numeric language, they represent it with astonishing accuracy.
During training, these numbers are adjusted.
When the model fails to predict a word, a mathematical formula calculates the degree of error.
Based on that error, the numbers are modified so that next time, the prediction improves.
Each mistake sharpens its understanding of language.
Each correction brings the model closer to a more realistic representation of meaning.
And during training, it makes billions of mistakes.
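That adjust-on-error loop can be shown in miniature. This toy example nudges a single value toward a target using plain gradient descent on a squared error; real training updates millions of parameters with the same basic recipe (all numbers here are invented):

```python
# Toy version of training: one parameter, squared-error loss,
# corrected a little bit on every mistake.
target = 6.0         # the "right" value the model should learn
value = 0.0          # the parameter, starting from a bad guess
learning_rate = 0.1

for step in range(100):
    error = value - target             # how wrong the prediction is
    gradient = 2 * error               # slope of the squared error (error**2)
    value -= learning_rate * gradient  # nudge in the direction that shrinks the error

print(round(value, 3))  # very close to 6.0 after enough corrections
```

Each pass shrinks the error a little; after enough corrections the parameter settles almost exactly on the target.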
Beyond representing word meanings so faithfully, embeddings have tremendous potential.
Because they’re represented as numbers, we can perform mathematical operations with words, literally.
A very famous example is that of king and queen.
If we take the embedding for king and subtract the embedding for man, we’re left with something like the concept of royalty without gender.
And if we then add the embedding for woman, the result is surprisingly close to the embedding for queen:
embedding(“king”) − embedding(“man”) + embedding(“woman”) ≈ embedding(“queen”)

Embeddings aren’t limited to words.
There are embeddings for entire sentences, for images, for audio, and even video.
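Returning to the king/queen example, the arithmetic can be checked with toy 3-dimensional vectors. All numbers here are invented for illustration; real models discover such regularities across hundreds of dimensions:

```python
# Invented toy embeddings. Roughly: dimension 1 ~ gender,
# dimension 2 ~ royalty, dimension 3 ~ an unrelated property.
vectors = {
    "king":   [8.0, 9.0, 2.0],
    "man":    [8.0, 1.0, 2.0],
    "woman":  [2.0, 1.0, 2.0],
    "queen":  [2.0, 9.0, 2.0],
    "pencil": [5.0, 0.0, 9.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# king - man + woman
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# The word whose embedding lies closest to the result.
closest = min(vectors, key=lambda w: distance(vectors[w], result))
print(closest)  # "queen"
```

Subtracting man strips the gender component from king, and adding woman puts it back with the opposite value, landing on queen.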
Imagine how useful embeddings are for search.
You write a phrase about what you’re looking for, and its embedding will resemble the embedding of whatever it is you’re trying to find.
It doesn’t matter if you describe it in English or Spanish, in text or voice, the resulting embedding will be very similar.
Embeddings are very much like a universal language.
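The search idea can be sketched the same way: embed the query, embed the documents, and return whatever lies closest. The sentence embeddings below are entirely hypothetical placeholders for what a real model would produce:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical sentence embeddings for a tiny document collection.
documents = {
    "How to train your dog":      [0.9, 0.1, 0.2],
    "Subway map of the city":     [0.1, 0.9, 0.3],
    "Recipes for chocolate cake": [0.2, 0.2, 0.9],
}

# Hypothetical embedding of the query "underground train routes".
query = [0.2, 0.8, 0.3]

# Rank documents by similarity to the query and keep the best match.
best = max(documents, key=lambda d: cosine(documents[d], query))
print(best)  # "Subway map of the city"
```

The query never mentions the word subway, yet its embedding lands closest to the subway document — which is exactly why embedding-based search works across wordings, languages, and even modalities.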
