ngram_build [input file0] [input file1] ... -o [output file] [-p ifile] [-order int] [-smooth int] [-input_format string] [-otype string] [-sparse ] [-dense ] [-backoff int] [-floor double] [-freqsmooth int] [-trace ] [-save_compressed ] [-oov_mode string] [-oov_marker string] [-prev_tag string] [-prev_prev_tag string] [-last_tag string] [-default_tags ]
ngram_build offers basic ngram language model estimation.
Input data format
Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:
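the cat sat on the mat
is treated as:
... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag
where prev_prev_tag is the value given to -prev_prev_tag, and so on. In ngram_per_line format, each line of the input file holds one complete ngram. For the same sentence and a trigram model, the input would be:
prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag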
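The relationship between the padded sentence and the trigram sequence shown above can be sketched in a few lines of Python. This is only an illustration of the sliding window, not the code used by ngram_build; the function names pad_sentence and ngrams_per_line are hypothetical, the tag strings are placeholders for whatever is passed to -prev_tag, -prev_prev_tag and -last_tag, and the sketch assumes an order of 2 or more.

```python
def pad_sentence(words, order, prev_tag, prev_prev_tag, last_tag):
    """Pad a sentence so that the first window ends on the first real word.

    Hypothetical helper: assumes order >= 2 and a single last_tag at the end.
    """
    if order < 2:
        raise ValueError("this sketch only covers order >= 2")
    # (order - 2) copies of prev_prev_tag, then one prev_tag, then the words.
    left = [prev_prev_tag] * (order - 2) + [prev_tag]
    return left + list(words) + [last_tag]


def ngrams_per_line(words, order,
                    prev_tag="prev_tag",
                    prev_prev_tag="prev_prev_tag",
                    last_tag="last_tag"):
    """Return one space-joined ngram per list entry (sliding window)."""
    padded = pad_sentence(words, order, prev_tag, prev_prev_tag, last_tag)
    return [" ".join(padded[i:i + order])
            for i in range(len(padded) - order + 1)]


if __name__ == "__main__":
    for line in ngrams_per_line("the cat sat on the mat".split(), 3):
        print(line)
```

Run on the example sentence with order 3, the sketch prints the same seven trigrams as the list above, one per line.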
Representation
The internal representation of the model becomes important for higher values of N because, if V is the vocabulary size, the number of possible ngrams \(V^N\) becomes very large. In such cases we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
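To make the size argument concrete, the following Python sketch compares the number of cells a dense table would need with the number of entries a sparse table actually stores for a tiny corpus. It is a toy illustration under stated assumptions: the names dense_size and sparse_counts are hypothetical, and a Counter keyed by word tuples merely stands in for whatever sparse structure ngram_build uses internally.

```python
from collections import Counter


def dense_size(vocab_size, order):
    """Number of cells in a dense table: one per possible ngram (V ** N)."""
    return vocab_size ** order


def sparse_counts(tokens, order):
    """Count only the ngrams that actually occur in the data."""
    return Counter(tuple(tokens[i:i + order])
                   for i in range(len(tokens) - order + 1))


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    vocab = set(tokens)                # 5 distinct words
    counts = sparse_counts(tokens, 3)  # trigrams seen in the sentence
    print(dense_size(len(vocab), 3))   # 125 possible trigrams
    print(len(counts))                 # 4 trigrams actually observed
```

Even for this five-word vocabulary the dense table has 125 cells against 4 stored entries; with a realistic vocabulary the gap grows as V to the power N.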
Getting more robust probability estimates
Testing an ngram model
-w ifile filename containing word list (required)
-p ifile filename containing predictee word list (default is to use wordlist given by -w)
-order int order, 1=unigram, 2=bigram etc. (default 2)
-smooth int Good-Turing smooth the grammar up to the given frequency
-input_format string format of input data, one of sentence_per_line (default), sentence_per_file or ngram_per_line
-otype string format of output file, one of cstr_ascii cstr_bin or htk_ascii
-sparse build ngram in sparse representation
-dense build ngram in dense representation (default)
-backoff int build backoff ngram (requires -smooth)
-floor double frequency floor value used with some ngrams
-freqsmooth int build frequency backed off smoothed ngram (requires -smooth)
-trace give verbose output about build process
-save_compressed save ngram in gzipped format
-oov_mode string what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file, or use_oov_marker
-oov_marker string special word for oov words (default !OOV) (use in conjunction with '-oov_mode use_oov_marker')
Pseudo-words:
-prev_tag string tag before sentence start
-prev_prev_tag string all words before 'prev_tag'
-last_tag string after sentence end
-default_tags use default tags of !ENTER,!EXIT and !EXIT respectively
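As an illustration of two of the -oov_mode settings listed above, the Python sketch below shows one plausible reading of skip_sentence and use_oov_marker applied to a single sentence. It is an assumption about the intended behaviour, not a description of the ngram_build implementation; the function name apply_oov_mode is hypothetical, and the skip_ngram and skip_file modes are deliberately left out.

```python
def apply_oov_mode(sentence, wordlist, oov_mode="skip_sentence",
                   oov_marker="!OOV"):
    """Return the sentence to train on, or None if it should be skipped.

    Hypothetical sketch covering only two of the documented modes.
    """
    if all(word in wordlist for word in sentence):
        return list(sentence)  # nothing out of vocabulary
    if oov_mode == "use_oov_marker":
        # Replace every out-of-vocabulary word with the marker.
        return [w if w in wordlist else oov_marker for w in sentence]
    if oov_mode == "skip_sentence":
        # Drop the whole sentence from the training data.
        return None
    raise ValueError("mode not covered by this sketch: " + oov_mode)


if __name__ == "__main__":
    wordlist = {"the", "cat", "sat", "on", "mat"}
    sentence = "the dog sat on the mat".split()
    print(apply_oov_mode(sentence, wordlist, "use_oov_marker"))
    print(apply_oov_mode(sentence, wordlist, "skip_sentence"))
```

In this reading, use_oov_marker keeps the sentence and trains on the marker word in place of "dog", whereas skip_sentence discards the sentence entirely.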