A complete guide to N-gram language models in Natural Language Processing
Blog Authors: Sachin Dedwal, Mohit Naik, Srushti Paigude, Saiesh Prabhu, Yesh Pareek(VIT, Pune NLP_G 9)
What is a Language Model?
A language model learns to predict the probability of a sequence of words. The use of statistical and probabilistic techniques to predict the probability of a given sequence of words appearing in a phrase is known as language modeling (LM). To establish a foundation for their word predictions, language models are estimated from large amounts of text data. They’re employed in natural language processing (NLP) applications, especially those that output text, such as machine translation and question answering.
Here our goal is to calculate the probability of a sentence, treated as a sequence of words:
P(W) = P(w1, w2, w3, w4, …, wn)
A related goal is the probability of an upcoming word given the words before it:
P(w5 | w1, w2, w3, w4)
A model that computes either of these,
P(W) or P(wn | w1, w2, w3, …, wn−1), is called a language model.
For calculating this probability, we will use the chain rule
P(A, B) = P(A)P(B|A)
A more generalized version would be
P(x1, x2, x3, …, xn) = P(x1)P(x2|x1)P(x3|x1, x2)…P(xn|x1, …, xn−1)
Or
P(w1, w2, w3, …, wn) = ∏i P(wi | w1, w2, …, wi−1)
The problem with this approach is that most long word histories occur only once in the data, or never at all, so their probabilities cannot be estimated reliably. We may never have enough data to estimate these values.
So, we make a simplifying assumption, the Markov assumption: each word depends only on the previous k words.
P(w1, w2, w3, …, wn) ≈ ∏i P(wi | wi−k, …, wi−1)
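To make this concrete, here is a small worked example (our own toy sentence, not from a particular corpus) showing the full chain rule and its bigram (k = 1) Markov approximation:

```latex
% Exact chain rule for the sentence "the cow jumps"
P(\text{the cow jumps}) = P(\text{the}) \cdot P(\text{cow} \mid \text{the}) \cdot P(\text{jumps} \mid \text{the, cow})

% Bigram Markov approximation (k = 1): condition only on the previous word
P(\text{the cow jumps}) \approx P(\text{the}) \cdot P(\text{cow} \mid \text{the}) \cdot P(\text{jumps} \mid \text{cow})
```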
N-gram model
What is an N-gram model?
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, letters, words, or base pairs according to the application. The n-grams are typically collected from a text or speech corpus.
N-grams are one way to help machines understand a word in its context and get a better sense of its meaning. For example, compare “We need to book our tickets soon” with “We need to read this book soon”. In the former, “book” is used as a verb and is therefore an action; in the latter, “book” is used as a noun. The n in n-grams is just the number of words you want to look at.
A model that simply relies on how often a word occurs, without looking at previous words, is called a unigram model. If a model considers only the previous word to predict the current word, it’s called a bigram model. If two previous words are considered, it’s a trigram model.
For example, take the sentence “The cow jumps over the grass”. If N=2 (known as bigrams), then the n-grams would be:
· the cow
· cow jumps
· jumps over
· over the
· the grass
So, you have 5 n-grams in this case. We moved from the->cow to cow->jumps to jumps->over, etc., essentially moving one word forward to generate the next bigram.
If N=3, the n-grams would be:
· the cow jumps
· cow jumps over
· jumps over the
· over the grass
So, you have 4 n-grams in this case. When N=1, the n-grams are referred to as unigrams, and these are essentially the individual words in a sentence. When N=2 they are called bigrams, and when N=3 they are called trigrams. When N>3 they are usually referred to as four-grams, five-grams, and so on.
So how many n-grams can we have in a sentence?
If X is the number of words in a given sentence K, then the number of n-grams for sentence K is:
N-grams(K) = X − (N − 1)
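Here is a minimal sketch of how such n-grams could be generated in plain Python (the sentence and the helper name generate_ngrams are just for illustration):

```python
def generate_ngrams(sentence, n):
    """Return the list of n-grams (as tuples of words) for a sentence."""
    words = sentence.lower().split()
    # Slide a window of size n across the word list: X - (N - 1) windows in total
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the grass"
print(generate_ngrams(sentence, 2))  # 5 bigrams: ('the', 'cow'), ('cow', 'jumps'), ...
print(generate_ngrams(sentence, 3))  # 4 trigrams: ('the', 'cow', 'jumps'), ...
```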
Applications of N-gram models
N-grams are used in speech recognition, language identification, textual representation, information filtering, and more. Likewise, N-gram models are employed in artificial intelligence to produce more natural sentences in the target language.
When correcting spelling errors, dictionary lookups alone sometimes won’t help. As an example, in the phrase “in about fifteen minuets” the word ‘minuets’ is a valid dictionary word, but it is incorrect in this context. N-gram models can catch and correct such errors.
In general, many NLP applications benefit from N-gram models, including part-of-speech tagging, natural language generation, word similarity, sentiment extraction, and predictive text input.
Evaluation of N-gram models
The best way to evaluate a model is to check how well it predicts in end-to-end application testing. This approach is known as extrinsic evaluation, but it is time-consuming and expensive. The alternative is to define a suitable metric and evaluate the model independently of the application, which is called intrinsic evaluation. This does not guarantee application performance, but it is a quick first step in verifying algorithmic performance.

Perplexity: the perplexity (sometimes abbreviated PP) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. Because of this inverse relationship, minimizing perplexity is equivalent to maximizing the probability of the test set. Perplexity can also be linked to the concept of entropy in information theory.

In any n-gram model, it is important to include markers at the beginning and end of sentences, so that the overall probability over all sentences of the language sums to one. However, all calculations must include the end markers but not the start markers in the word token count.
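In symbols, the standard definition the paragraph above describes verbally, for a test set W of N word tokens, is:

```latex
\mathrm{PP}(W) \;=\; P(w_1 w_2 \ldots w_N)^{-\tfrac{1}{N}} \;=\; \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
```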
Smoothing
What do we do with words that are in our vocabulary (so they aren’t unknown) but appear in an unseen context in the test set? To prevent the language model from assigning zero probability to these unseen events, we have to shave a bit of probability mass from the more frequent events and give it to the events we’ve never seen. This kind of adjustment is called smoothing or discounting.
Laplace Smoothing
The simplest way to smooth the bigram counts, before we normalize them into probabilities, is to add one to all of them. All counts that were previously zero will now have a count of one, counts of one become two, and so on. This algorithm is called Laplace smoothing.
Although Laplace smoothing does not perform well enough to be employed in recent n-gram models, it does provide an excellent introduction to many of the concepts found in other smoothing algorithms.
The unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:
P(wi) = ci / N
Because there are V words in the vocabulary and each one’s count has been incremented by one, we must also adjust the denominator to account for the extra V observations:
PLaplace(wi) = (ci + 1) / (N + V)
It is convenient to describe how a smoothing algorithm affects the numerator by defining an adjusted count c*. Since we add 1 to the numerator, we multiply by the normalization factor N/(N + V) so that the adjusted counts still sum to N:
c*i = (ci + 1) · N / (N + V)
The adjusted count can then be normalized by N, turning it back into a probability.
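A minimal sketch of these formulas in Python, assuming a tiny whitespace-tokenized corpus (the corpus, the vocabulary, and the function name unigram_probs are illustrative):

```python
from collections import Counter

def unigram_probs(tokens, vocab, smoothing=False):
    """Unsmoothed MLE vs. Laplace-smoothed unigram probabilities."""
    counts = Counter(tokens)
    N = len(tokens)          # total word tokens
    V = len(vocab)           # vocabulary size
    if smoothing:
        # Laplace: P(w) = (c + 1) / (N + V); unseen words get a nonzero probability
        return {w: (counts[w] + 1) / (N + V) for w in vocab}
    # MLE: P(w) = c / N; unseen words get probability 0
    return {w: counts[w] / N for w in vocab}

tokens = "the cow jumps over the grass".split()
vocab = set(tokens) | {"moon"}   # "moon" never occurs in the toy corpus
print(unigram_probs(tokens, vocab)["moon"])                  # 0.0
print(unigram_probs(tokens, vocab, smoothing=True)["moon"])  # small but nonzero
```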
Add-k smoothing
One alternative to add-one smoothing is to move a smaller portion of the probability mass from observed to unseen events. Instead of adding 1, we add a fractional count k (0.5? 0.05? 0.01?) to each count; this algorithm is therefore called add-k smoothing.
Add-k smoothing requires a method for choosing k, which can be done, for example, by optimizing on a devset. Although add-k is useful for some tasks (such as text classification), it turns out that it isn’t very good for language modeling.
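The Laplace sketch above generalizes directly. Here is a hedged sketch of the add-k variant for bigrams, using the standard form (count(w_prev, w) + k) / (count(w_prev) + k·V); the counts and helper name are again illustrative:

```python
from collections import Counter

def addk_bigram_prob(bigram_counts, unigram_counts, V, w_prev, w, k=0.05):
    """Add-k smoothed bigram probability P(w | w_prev)."""
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * V)

tokens = "the cow jumps over the grass".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))
print(addk_bigram_prob(bigram_counts, unigram_counts, V, "the", "cow"))    # seen bigram
print(addk_bigram_prob(bigram_counts, unigram_counts, V, "the", "jumps"))  # unseen bigram, still nonzero
```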
Limitations and Tools for N-gram Model
Like many statistical models, the N-gram model is heavily reliant on the training corpus: the probabilities frequently encode specific information about that corpus. When a model is trained on Shakespeare’s writings, it will not make accurate predictions when applied to another genre. As a result, we must ensure that the training corpus resembles the test corpus. Furthermore, the performance of the N-gram model fluctuates as the value of N changes.

A notable problem with the MLE approach is sparse data. Every N-gram that appears a significant number of times may get a credible probability estimate, but because any corpus is finite, some perfectly acceptable English word sequences will inevitably be absent. As a result, any training corpus’ N-gram matrix is certain to contain a significant number of putative “zero probability N-grams.”

In one study, a bigram model outperformed a unigram model in sentiment analysis, but the number of features increased. Good feature selection strategies are therefore required when scaling N-gram models to larger datasets or switching to a higher N. Longer-distance context is poorly captured by N-gram models; according to research, performance gains are limited beyond 6-grams.
Software Tools for N-gram
ngram, tm, tau, and RWeka are a few useful R packages, and N-gram analysis is also supported by the tidytext package. In Python, NLTK provides the function nltk.util.ngrams(), and nltk.lm is a more extensive package. Outside of NLTK, the ngram package can compute n-gram string similarity. SRILM is a handy open-source toolkit for building language models, written in C++. It includes the utility ngram-format, which can read or write N-gram models in the popular ARPA backoff format designed by Doug Paul at MIT Lincoln Labs. A demo of an N-gram predictive model implemented in R Shiny can be tried out online; it’s based on Google Books research. Given a sequence of words, it shows how the N-gram counts have changed over the years.
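For instance, a minimal NLTK usage sketch (assuming NLTK is installed; the sample sentence is our own):

```python
from nltk.util import ngrams

tokens = "the cow jumps over the grass".split()
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print(bigrams)   # [('the', 'cow'), ('cow', 'jumps'), ...]
print(trigrams)  # [('the', 'cow', 'jumps'), ...]
```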