Word Vectors and Beyond…🧐
In the first part of our introduction to word representations, we discussed the most basic type of representation: word-frequency probabilities. While this approach continues to be used in NLP projects requiring little semantic inference, mission-critical projects more often than not depend on word meaning.
Natural language, as a complex system, offers many detours, which makes it very difficult to hand-craft the rules of language. As a result, getting a computer model to infer meaning is an incredibly difficult task. Here are some of the difficulties:
- Word Sense Disambiguation: Words have multiple meanings. “Broke” can mean: a) separate into pieces b) having little money c) something that doesn’t work d) piercing through e) change direction f) interruption of continuity or sequence.
- Synonyms: No explanation necessary, couch -> sofa; remote control -> clicker.
- Contextual Word Usage: “Water” and “H2O” are synonymous; however, one would use them selectively according to the context of the conversation. For example, one wouldn’t typically say “I’m going to the pool to dive in the H2O.”
Word vectors address these difficulties by leveraging the distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings. This is an improvement over the frequency-based model because we are now looking at the inter-dependencies of words, which lets the model find relations between them.
For example, the words “king”, “queen” and “princess” are contextually related as royal titles, while a pair such as “hot” and “cold” implies a relation between opposite ends of the temperature scale. Ok, so there are many types of relations; below is a pithy list to keep in mind:
1. Relation of Similarity: a) Ferrari, Lamborghini, Maserati.
2. Relation of Specificity: a) Hyponym: A jet is a hyponym of aircraft (since it’s a specific type of aircraft). b) Hypernym: An aircraft is a hypernym of jet (since it’s more general than a jet).
3. Relation of Affect: a) Positive Connotation: Happiness and Love. b) Negative Connotation: Sadness and Hate.
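To make the distributional hypothesis a little more concrete, here is a minimal sketch that counts which words appear near each other in a tiny, made-up corpus. The function name `cooccurrence_counts`, the window size and the toy sentences are all assumptions chosen for illustration, not part of any particular library.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` tokens of another word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word itself
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus, invented for this example.
toy_corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the dog chases the ball",
]

counts = cooccurrence_counts(toy_corpus)
shared_royal = set(counts["king"]) & set(counts["queen"])
shared_dog = set(counts["king"]) & set(counts["dog"])
print(shared_royal, shared_dog)
```

In this toy setup “king” and “queen” share the context word “rules”, while “king” and “dog” only share the filler word “the”, which is the kind of signal the distributional hypothesis relies on.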
What Types of Word Vectors Are Used in NLP?
1. Co-Occurrence Matrix
We are starting to see the frequency (co-occurrence) approach mature into a model that focuses more on discriminating words relative to the documents they appear in! TF-IDF (term frequency–inverse document frequency) is a popular method that weights how often a word occurs in a document by how rare that word is across all documents. In essence, it highlights words that occur in fewer documents as signals of more information (shout-out to Claude Shannon). TF-IDF is used extensively in search engines and information retrieval tasks.
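As a rough illustration of the weighting described above, here is a minimal sketch of one common textbook form of TF-IDF: term frequency multiplied by the log of (number of documents / number of documents containing the term). The function name `tf_idf` and the three toy documents are assumptions for this example; real libraries often use smoothed or normalized variants that give different numbers.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores

# Toy documents, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In this sketch a word that appears in every document, such as “the”, gets a score of zero, while a word unique to one document, such as “mat”, gets the highest score in that document.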
2. One-Hot Encoding
Each unique word in a corpus is assigned a unique vector of 0s with a single 1 (“king” = 0–0–1, “queen” = 0–1–0, etc.). One disadvantage is the lack of similarity among synonyms. For example, “laptop” and “notebook” are synonymous but receive entirely unrelated encodings. Additionally, since one-hot vectors grow with vocabulary size, they are less ideal for handling large volumes of text, where each vector is very long and almost entirely zeros (and as a consequence one-hot encoding becomes computationally expensive).
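Here is a minimal sketch of one-hot encoding, assuming a tiny four-word vocabulary. The helper names `one_hot_vectors` and `dot` are made up for this example, and the position of the 1 simply follows the order of the vocabulary list, which is arbitrary.

```python
def one_hot_vectors(vocabulary):
    """Map each word to a vector with a single 1 at that word's index."""
    size = len(vocabulary)
    return {
        word: [1 if position == index else 0 for position in range(size)]
        for index, word in enumerate(vocabulary)
    }

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vocabulary, invented for this example.
vocab = ["king", "queen", "laptop", "notebook"]
vectors = one_hot_vectors(vocab)

print(vectors["laptop"])
print(dot(vectors["laptop"], vectors["notebook"]))
```

The dot product between any two different words is zero, so “laptop” looks exactly as unrelated to “notebook” as it does to “king”. Each vector is also as long as the vocabulary, which is where the cost comes from on large corpora.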
3. Word Embeddings
Words are embedded as dense vectors in a vector space. In this space, cosine similarity (the dot product of two vectors divided by the product of their lengths) is used to measure word similarity. The disadvantage of these word embeddings is that they do not handle word sense disambiguation: the word “broke” can have different meanings even though it only receives one vector.
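Cosine similarity can be sketched in a few lines. The three-dimensional vectors below are invented purely for illustration; real embeddings are learned from data and typically have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example (not learned from data).
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])
print(f"king vs queen: {royal:.3f}")
print(f"king vs banana: {fruit:.3f}")
```

With these made-up numbers, “king” and “queen” point in nearly the same direction and score close to 1, while “king” and “banana” score much lower.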
Word embedding models you have probably heard of:
Word2vec uses neural networks to obtain word embeddings.
GloVe uses co-occurrence probabilities to obtain word embeddings.
You may have heard of BERT, GPT-2, or ELMo. If not, the next time you write a text message on your smartphone and it suggests the next word in your sentence, you are seeing an application of a language model. However, language models are more than just apps on your smartphone; they are a benchmark for word representation.
Currently, these state-of-the-art language models are used for the representation of language by adhering to a simple principle:
If a system is able to correctly guess the next word in a sentence, then it is able to understand the representational structure of natural language.
Take this sentence for example: “I went to the store to buy a gallon of (blank).” There are many clues in the text that hint at what’s going on even though we don’t know what the last word is. We know it’s a “store” (which limits what there is to buy, i.e. not rocket fuel), and it refers to a “gallon” (implying it’s a liquid, i.e. not cake). The distribution of surrounding words helps language models accurately guess the missing word. In fact, masking is how models such as BERT are trained: words are hidden in text corpora and the model is forced to predict what they are. GPT-2 is trained in a similar spirit, by predicting the next word from the words that come before it.
In conclusion, the nature of language modelling shows how models are able to partially grasp word meaning from syntax and word co-occurrences. The emergence of word meaning in a computer model is in part related to the inter-dependence of surrounding words in the text!
This article completes our two-part introduction to word representation, where we discussed how word frequencies and word vectors impact the way natural language is represented in NLP. ✌️✌️
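To get a feel for guessing a missing word from its neighbours, here is a deliberately simple count-based sketch. It is not how BERT, GPT-2 or ELMo work (those are neural networks trained on very large corpora); it only counts which word followed each two-word context in a tiny, invented corpus. The names `train_context_model` and `predict_next` are assumptions made up for this example.

```python
from collections import Counter, defaultdict

def train_context_model(sentences, context_size=2):
    """Count which word follows each run of `context_size` words."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            following[context][tokens[i]] += 1
    return following

def predict_next(model, text, context_size=2):
    """Return the most frequent continuation of the last words, or None if unseen."""
    context = tuple(text.lower().split()[-context_size:])
    candidates = model.get(context)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Toy corpus, invented for this example.
toy_corpus = [
    "i bought a gallon of milk",
    "she drank a gallon of milk",
    "he spilled a gallon of water",
    "we ate a slice of cake",
]

model = train_context_model(toy_corpus)
print(predict_next(model, "I went to the store to buy a gallon of"))
print(predict_next(model, "I would like a slice of"))
```

Because the context includes “gallon”, the sketch only considers the liquids it has seen after “gallon of”, and it picks “cake” after “slice of”. A context it has never seen returns `None`, which hints at why real language models need far more data and a way to generalize.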
“The meaning of a word is its use in the language”