From Probabilistic Models to Word Embeddings 🧐
In the world of Natural Language Processing (NLP), word representation models are key to understanding how models interpret language.
So what is a representation model (RM)? Before answering this, we have to be aware of an important disclaimer: AI models run on math. That leads to an inconvenient truth: words aren’t very conducive to mathematical computation. So…
How do we get out of this jam?
We do what mathematicians have done for centuries: convert things that aren’t numbers into numbers. In the field of AI, this is known as vectorization. This conversion is the first of many steps toward a representation of natural language that’s computable. In essence, word representation in machine learning is the bridge between the meaning of words and the ability to compute with that meaning.
A Working Definition of Representation Models
Piggybacking on the previous paragraph: RMs give a computer program the ability to compute and achieve some “understanding” of natural language. I put “understanding” in quotes because whether or not a computer “understands” is a debate for philosophers, or as Noam Chomsky postulates:
“Do submarines swim?”
Nonetheless, the vectorization of words into real numbers is applied differently depending on the NLP task at hand. In the following section, we will review the first of two RMs, which we’ll call the probabilistic model; it leverages frequency and frequentist probability as the conduit of inference. The second, which we’ll call the word embedding model, leverages vector space as a proxy for semantics. (The second model is discussed in the Seconda Parte.)
The most basic kind of RM achieves representation by measuring the frequency of words in text. In this model, word order is ignored and a text (a paragraph or sentence) is split into its individual tokens, or words, a process called tokenization: 👇
John likes to watch movies. Mary likes movies too.
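To make this concrete, here is a minimal Python sketch of tokenizing the sentence above. The `tokenize` helper and its regular expression are my own assumptions for illustration (real tokenizers treat punctuation, contractions, and casing with far more care), and `Counter` simply tallies each token once word order is thrown away.

```python
import re
from collections import Counter

text = "John likes to watch movies. Mary likes movies too."

def tokenize(s):
    # Hypothetical helper: lowercase the text and keep runs of letters
    # (and apostrophes), which drops punctuation such as the periods.
    return re.findall(r"[a-z']+", s.lower())

tokens = tokenize(text)

# A "bag of words": token -> how many times it appears, order ignored.
bag = Counter(tokens)

print(tokens)
print(bag["likes"], bag["movies"], bag["john"])
```

Here "likes" and "movies" each get a count of 2, while every other token gets a count of 1.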
Upon tokenization, words are given a numerical value for representation based on some form of frequency. The numerical values can vary:
1. Frequency | e.g. the number of times the word “king” appears in Shakespeare’s plays. Useful for descriptive statistics, as used in rudimentary social media analytics.
2. Relative Frequency | e.g. the number of times the word “king” appears in Shakespeare’s plays divided by the total number of words in his plays (much like a batting average). Useful for estimating likelihoods, as in the Naive Bayes algorithm.
- The best-known method that leverages relative frequency is Maximum Likelihood Estimation (aka MLE). This method picks the parameter values that make the observed data most probable; for simple word counts, that estimate is exactly the relative frequency.
3. Conditional Probability | i.e. the probability of an event given that another event has occurred. This type of probability is useful when an engineer is interested in a sequence of words, for example when translating sentences.
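The three kinds of values above can be sketched in a few lines of Python, reusing the tokens from the John and Mary example. Treat this as an illustration under stated assumptions rather than a reference implementation: the function names `relative_frequency` and `conditional_probability` are hypothetical, the conditional probability is estimated from adjacent word pairs (bigrams), and for simplicity the pairs are allowed to cross the sentence boundary.

```python
from collections import Counter

# Tokens from "John likes to watch movies. Mary likes movies too."
tokens = ["john", "likes", "to", "watch", "movies",
          "mary", "likes", "movies", "too"]

# 1. Frequency: raw count of each token.
counts = Counter(tokens)
total = len(tokens)

# 2. Relative frequency: count divided by the total number of tokens.
#    For a simple count model this is also the maximum likelihood estimate.
def relative_frequency(word):
    return counts[word] / total

# 3. Conditional probability: P(next_word | prev_word), estimated from
#    counts of adjacent pairs (an assumption made for this sketch).
bigrams = Counter(zip(tokens, tokens[1:]))

def conditional_probability(next_word, prev_word):
    # How many times prev_word is followed by anything at all.
    prev_total = sum(c for (p, _), c in bigrams.items() if p == prev_word)
    if prev_total == 0:
        return 0.0  # prev_word never appears with a following word
    return bigrams[(prev_word, next_word)] / prev_total

print(counts["likes"])                             # 2
print(relative_frequency("likes"))                 # 2 / 9
print(conditional_probability("movies", "likes"))  # 0.5
```

In this tiny corpus "likes" is followed once by "to" and once by "movies", so the estimated probability of "movies" given "likes" is one half.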
Probabilistic models have a tendency to generalize better when inputs are ambiguous and sparse. When inputs are denser, the inter-dependency of words becomes crucial for inferring word meaning, and a medium other than probability is more rewarding. This different medium, which we’ll call vector space, is a consequence of having to deal with increased complexity.
👇 is the intuition.
In a band, a song’s interpretation is maximized when you recognize the musical complexity exhibited by all the instruments simultaneously. If you only listen to a single instrument, information is reduced and emergence collapses. Imagine listening to Van Halen’s “Runnin’ with the Devil” by only listening to the bass. 😥
In our next article, we’ll discuss the word embedding model, which pays less attention to frequency and more attention to how a word relates to other words. Stay tuned. ✌️✌️