N-Gram Language Modelling with NLTK - GeeksforGeeks

Language modeling is the task of determining the probability of any sequence of words. It is used in applications such as speech recognition and spam filtering, and it is the key aim behind many state-of-the-art Natural Language Processing models.

Methods of Language Modelling

Two methods of Language Modeling:

  1. Statistical Language Modelling: Statistical language modeling, or simply language modeling, is the development of probabilistic models that predict the next word in a sequence given the words that precede it. N-gram language modeling is one such example.
  2. Neural Language Modeling: Neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger models for challenging tasks like speech recognition and machine translation. One way of building a neural language model is through word embeddings.

N-gram

An N-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (a long text dataset).

For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or bigrams (“This article”, “article is”, “is on”, “on NLP”).
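As a quick illustration, NLTK's ngrams helper can produce such sequences directly. The following is a minimal sketch using the example tokens above; the variable names are illustrative.

Python

# A minimal sketch: extracting unigrams and bigrams with NLTK's ngrams helper.
from nltk import ngrams

tokens = ["This", "article", "is", "on", "NLP"]

unigrams = list(ngrams(tokens, 1))      # [('This',), ('article',), ('is',), ('on',), ('NLP',)]
bigrams_list = list(ngrams(tokens, 2))  # [('This', 'article'), ('article', 'is'), ...]

print(unigrams)
print(bigrams_list)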

N-gram Language Model

An N-gram language model predicts the probability of a given N-gram within any sequence of words in a language. A well-crafted N-gram model can effectively predict the next word in a sentence, which is essentially determining the value of p(w∣h), where h is the history or context and w is the word to predict.

Let’s explore how to predict the next word in a sentence. We need to calculate p(w|h), where w is the candidate for the next word. Consider the sentence ‘This article is on…’. If we want to calculate the probability of the next word being “NLP”, the probability can be expressed as:

[Tex]p(\text{“NLP”} | \text{“This”}, \text{“article”}, \text{“is”}, \text{“on”})[/Tex]

To generalize, the conditional probability of the fifth word given the first four can be written as:

[Tex]p(w_5 | w_1, w_2, w_3, w_4) \quad \text{or, in general,} \quad p(w_n | w_1, w_2, \ldots, w_{n-1})[/Tex]

This is calculated using the chain rule of probability, which builds on the definition of conditional probability:

[Tex]P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(A \cap B) = P(A|B)P(B)[/Tex]

Now generalize this to sequence probability:

[Tex]P(X_1, X_2, \ldots, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) \ldots P(X_n | X_1, X_2, \ldots, X_{n-1})[/Tex]

This yields:

[Tex]P(w_1, w_2, w_3, \ldots, w_n) = \prod_{i} P(w_i | w_1, w_2, \ldots, w_{i-1})[/Tex]

By applying the Markov assumption, which states that the probability of a word depends only on a limited window of preceding words rather than on the entire history, we simplify the formula:

[Tex]P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-k}, \ldots, w_{i-1})[/Tex]

For a unigram model (k=0), this simplifies further to:

[Tex]P(w_1, w_2, \ldots, w_n) \approx \prod_i P(w_i)[/Tex]

And for a bigram model (k=1):

[Tex]P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})[/Tex]
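To make the bigram case concrete, here is a toy sketch that estimates P(w_i | w_{i-1}) by maximum likelihood (bigram count divided by the count of the preceding word) on a small hypothetical corpus and then scores a sentence with the simplified chain rule. The corpus and the bigram_prob/sentence_prob helpers are illustrative assumptions, and the start-of-sentence term P(w_1) is omitted for brevity.

Python

# A toy sketch (hypothetical two-sentence corpus) of maximum-likelihood
# bigram estimation and scoring a sentence with the simplified chain rule:
# P(w_1 ... w_n) ~= product over i of P(w_i | w_{i-1}).
from collections import Counter

corpus = [
    ["this", "article", "is", "on", "nlp"],
    ["this", "article", "is", "short"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    # Product of bigram probabilities; the start term P(w_1) is ignored here
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_prob(prev, word)
    return prob

# "is" is followed by "on" in 1 of its 2 occurrences, so the result is 0.5
print(sentence_prob(["this", "article", "is", "on", "nlp"]))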

Implementing N-Gram Language Modelling in NLTK

Python

# Import necessary libraries
import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')

# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(words))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words
    using the trained trigram model.

    Args:
        w1 (str): The first word.
        w2 (str): The second word.

    Returns:
        str: The predicted next word.
    """
    next_word = model[w1, w2]
    if next_word:
        # Choose the most likely next word
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'stock'))

Output:

Next Word: of
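Once the trigram model is trained, the same lookup can be applied repeatedly to generate text. Below is a minimal sketch of greedy generation built on top of predict_next_word above; generate_text is a hypothetical helper, and the exact output depends on the Reuters corpus statistics.

Python

# A greedy text-generation sketch built on the trigram model above.
# generate_text is a hypothetical helper, not part of NLTK.
def generate_text(w1, w2, num_words=10):
    text = [w1, w2]
    for _ in range(num_words):
        next_word = predict_next_word(text[-2], text[-1])
        if next_word == "No prediction available":
            break  # stop when the context was never seen in training
        text.append(next_word)
    return ' '.join(text)

# Example usage (output depends on the corpus statistics)
print(generate_text('the', 'stock'))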

Metrics for Language Modelling

  • Entropy: Entropy, introduced by Claude Shannon, is a measure of the average amount of information conveyed by a random variable. Below is the formula for entropy:

[Tex]H(p) = -\sum_{x} p(x) \log_2 p(x)[/Tex]

H(p) is always greater than or equal to 0, as the short sketch below illustrates.
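Python

# A minimal sketch: entropy of a discrete distribution in bits
# (the distributions are hypothetical examples).
import math

def entropy(probs):
    # H(p) = -sum_x p(x) * log2(p(x)); terms with p(x) = 0 contribute nothing
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # 1.0 bit for a fair coin
print(entropy([1.0]))            # 0.0 bits when there is no uncertainty, so H(p) >= 0
print(entropy([0.7, 0.2, 0.1]))  # a skewed distribution has lower entropy than a uniform one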

  • Cross-Entropy: It measures the ability of the trained model to represent the test data ([Tex]W_{1}^{n}[/Tex]).

[Tex]H(W) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 p(w_i \mid w_{1}^{i-1})[/Tex]

The cross-entropy is always greater than or equal to the entropy, i.e., the model's uncertainty can be no less than the true uncertainty.

  • Perplexity: Perplexity measures how well a probability distribution (or model) predicts a sample; it can be understood as a measure of uncertainty. Perplexity is calculated as 2 raised to the power of the cross-entropy:

[Tex]PP(W) = 2^{H(W)}[/Tex]

Equivalently, perplexity can be written as the inverse probability of the test set assigned by the language model, normalized by the number of words (shown here for a bigram model); a short sketch after the formula shows that the two forms agree:

[Tex]PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_{i-1})}}[/Tex]
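Python

# A minimal sketch: perplexity computed two equivalent ways on hypothetical
# per-word conditional probabilities P(w_i | w_{i-1}).
import math

cond_probs = [0.25, 0.5, 0.1, 0.2]  # assumed values for illustration
n = len(cond_probs)

# 1) N-th root of the product of inverse probabilities
pp_direct = math.prod(1.0 / p for p in cond_probs) ** (1.0 / n)

# 2) 2 raised to the cross-entropy H(W) = -(1/n) * sum log2 P(w_i | w_{i-1})
cross_entropy = -sum(math.log2(p) for p in cond_probs) / n
pp_via_entropy = 2 ** cross_entropy

print(pp_direct, pp_via_entropy)  # both ~4.47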

For Example:

  • Let’s take the example sentence ‘Natural Language Processing’. For predicting the first word, suppose the model assigns the following probabilities:

word          P(word | <start>)
The           0.4
Processing    0.3
Natural       0.12
Language      0.18
  • The model thus assigns the first word ‘Natural’ a probability of 0.12. Next, what is the probability of the word ‘Language’ given that the previous word is ‘Natural’?

word          P(word | ‘Natural’)
The           0.05
Processing    0.3
Natural       0.15
Language      0.5
  • After generating ‘Natural Language’, what is the probability of the next word being ‘Processing’ given ‘Language’?

word          P(word | ‘Language’)
The           0.1
Processing    0.7
Natural       0.1
Language      0.1
  • Now, the perplexity can be calculated as:

[Tex]PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_{i-1})}} = \sqrt[3]{\frac{1}{0.12 \times 0.5 \times 0.7}} \approx 2.876[/Tex]

  • From the perplexity, we can also recover the cross-entropy:

[Tex]H(W) = \log_2(2.876) \approx 1.524[/Tex]
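The same calculation can be applied to the trigram model trained in the implementation section. The sketch below assumes the `model` dictionary from that code is still in scope and that the test sentence is hypothetical; because the raw model assigns zero probability to unseen trigrams, a real evaluation would need smoothing, which relates to the sparsity issue discussed next.

Python

# A hedged sketch: perplexity of a test sentence under the trigram model
# built earlier (assumes the `model` dict from the implementation section
# is in scope).
import math
from nltk import trigrams

def trigram_perplexity(tokens):
    log_prob_sum = 0.0
    count = 0
    for w1, w2, w3 in trigrams(tokens):
        p = model[(w1, w2)].get(w3, 0)
        if p == 0:
            # Unseen trigram: zero probability, hence infinite perplexity
            # without smoothing.
            return float('inf')
        log_prob_sum += math.log2(p)
        count += 1
    return 2 ** (-log_prob_sum / count)

print(trigram_perplexity(['the', 'stock', 'market', 'rose', 'sharply']))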

Shortcomings:

  • To capture longer context, we need higher values of n, but this also increases computational overhead.
  • Increasing n also leads to data sparsity, since most higher-order n-grams never appear in the training corpus.
