While such grammar models are easy to understand, they cannot cope with large-vocabulary applications: it is simply too difficult to write grammars that provide sufficient coverage. However, one can attempt to capture linguistic knowledge, domain knowledge and any other pertinent information by taking a statistical approach over a large corpus compiled from material in the domain one wishes to understand. [Deshmukh, 1999] takes us through these aspects and, while pointing out that most systems use a tri-gram back-off language model (for reasons of practicality which will be explored later in this essay), explains that various other approaches have been explored, including higher-order N-grams, long-range dependencies, cache, link and trigger models, class grammars and decision-tree clustered language models. Pointers to further work on all of these are given in the penultimate paragraph on page 5 of that paper.
4 Tri-grams.
It is easy to see that, by looking at the probability of one word following another, you can start to capture the forces that determine why this occurs without having to understand those forces fully, and this is a process that can be automated. Of course, looking only at adjacent pairs of words does not capture longer-range relationships such as what sort of arguments are preferred by particular verbs, and ideally one would use 8-grams or 10-grams, predicting the probability of a word following an already known 7 or 9 words. However, while collections of text (corpora) are growing, one would require an enormous corpus to find enough examples of these longer runs of words to draw any valid statistical conclusions. This is called the sparse data problem and is explored extensively by [Peng, 2001].
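As a rough illustration of the counting involved (the toy corpus and function names here are my own, not taken from any of the cited papers), the probability of one word following another can be estimated simply by dividing the number of times the pair occurs by the number of times the first word occurs:

```python
# Minimal sketch: maximum-likelihood bigram probabilities from raw counts.
# The corpus below is an invented toy example purely for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))   # 2/4 = 0.5 in this toy corpus
print(bigram_prob("cat", "sat"))   # 1/2 = 0.5
print(bigram_prob("sat", "fish"))  # 0.0 -- a pair never seen: the sparse data problem
```

Even in this tiny example the last probability comes out as zero, not because "sat fish" is impossible but because it happens never to occur in the data; this is exactly the problem that grows worse as the N in N-gram grows.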
The sparse data problem has led to the practical compromise of using tri-grams, but even in a very large corpus like the Brown Corpus* some combinations of words are never seen at all.
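The sketch below shows, in simplified form, the back-off idea mentioned above: when a tri-gram was never observed, fall back to the bi-gram estimate, and then to the unigram estimate. This is my own illustration, not the essay's or [Deshmukh, 1999]'s formulation; real back-off models such as Katz back-off also discount the counts and weight the shorter-history estimates, which is omitted here.

```python
# Simplified back-off for P(w3 | w1, w2): use the trigram count if it exists,
# otherwise back off to the bigram, otherwise to the unigram frequency.
# (No discounting or back-off weights -- an illustrative assumption only.)
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def backoff_prob(w1, w2, w3):
    """Estimate P(w3 | w1, w2), backing off to shorter histories
    whenever the longer history was never observed in the corpus."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / total  # last resort: unigram frequency

print(backoff_prob("the", "cat", "sat"))  # trigram seen once: 1/2
print(backoff_prob("on", "the", "cat"))   # trigram unseen, backs off to bigram: 2/4
print(backoff_prob("ate", "the", "dog"))  # nothing seen at all: unigram gives 0.0
```

The last case shows why back-off alone is not enough: a word that never appears in the corpus still receives probability zero, which is why practical systems combine back-off with smoothing.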