What is an n-gram and how are they used in AI/ML?


An n-gram is a contiguous sequence of n items (typically words or characters) from a text sample, used extensively in natural language processing and machine learning.

For example, in the phrase "The cat sat," the unigrams (n=1) are ["The", "cat", "sat"], the bigrams (n=2) are ["The cat", "cat sat"], and the only trigram (n=3) is ["The cat sat"]. Character-level n-grams are also common: the word "hello" as character trigrams would be ["hel", "ell", "llo"].
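Extracting n-grams is just a sliding window over a sequence. A minimal sketch (the `ngrams` helper name is my own, not a standard library function):

```python
def ngrams(items, n):
    """Return all contiguous n-grams (slices of length n) from a sequence."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]

# Word-level n-grams: split the phrase into tokens first.
words = "The cat sat".split()
unigrams = [" ".join(g) for g in ngrams(words, 1)]  # ["The", "cat", "sat"]
bigrams  = [" ".join(g) for g in ngrams(words, 2)]  # ["The cat", "cat sat"]
trigrams = [" ".join(g) for g in ngrams(words, 3)]  # ["The cat sat"]

# Character-level n-grams: strings slice directly, no tokenization needed.
char_trigrams = ngrams("hello", 3)  # ["hel", "ell", "llo"]
```

The same helper covers both levels because Python slicing works identically on lists of words and on strings.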

N-grams enable predictive language modeling through probability calculations based on the frequency of sequences in training data. For instance, after seeing "The cat," a trigram model estimates the probability of each candidate next word from how often that word follows this bigram in the training corpus. In English text, "The cat sat" is more probable than "The cat helicopter" simply because the first sequence appears far more frequently in typical usage.
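This frequency-based estimation can be sketched in a few lines. The toy corpus below is invented for illustration; a real model would be trained on a large text collection and would need smoothing for unseen sequences:

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (an assumption, not real training data).
corpus = "the cat sat on the mat . the cat slept . the cat sat down".split()

# Count how often each word follows each bigram (i.e., trigram counts).
following = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    following[(w1, w2)][w3] += 1

def next_word_probs(w1, w2):
    """P(next word | previous two words), estimated by relative frequency."""
    counts = following[(w1, w2)]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("the", "cat")
# "sat" follows "the cat" twice and "slept" once, so P(sat) = 2/3, P(slept) = 1/3.
```

Any word never seen after "the cat" gets probability zero here, which is exactly the sparsity problem that smoothing techniques (and, later, neural models) were developed to address.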

The predictive power of n-grams makes them valuable for many applications, including text generation, autocomplete systems, spelling correction, and speech recognition. While more advanced models like transformers have superseded traditional n-gram approaches for most tasks, n-grams remain important where computational efficiency or simple statistical analysis of text patterns is needed. Their applicability at both the word and character level also makes them useful for tasks ranging from language identification to document classification.
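As one example of such statistical analysis, character trigram frequencies act as a crude fingerprint of a language: texts in the same language share many trigrams, texts in different languages share few. A sketch using invented sample sentences (the helper names and the cosine-similarity comparison are illustrative choices, not a standard algorithm from any particular library):

```python
from collections import Counter

def char_trigram_profile(text):
    """Counter of character trigrams; a crude statistical fingerprint."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(a, b):
    """Cosine similarity between two trigram-count profiles."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Reference profiles built from toy sample sentences (assumptions).
english = char_trigram_profile("the quick brown fox jumps over the lazy dog")
spanish = char_trigram_profile("el rapido zorro marron salta sobre el perro perezoso")

query = char_trigram_profile("the dog sleeps over there")

# The query shares many trigrams ("the", "ove", " do", ...) with the English
# profile and few with the Spanish one, so it scores as more English.
more_english = cosine_similarity(query, english) > cosine_similarity(query, spanish)
```

Real language identifiers use the same idea with profiles built from large corpora and more robust comparison metrics, but the core mechanism is just counting and comparing character n-grams.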