Edinburgh Speech Tools Library

ngram_build Train n-gram language model

Synopsis

ngram_build [input file0] [input file1] ... -o [output file] [-p ifile] [-order int] [-smooth int] [-input_format string] [-otype string] [-sparse ] [-dense ] [-backoff int] [-floor double] [-freqsmooth int] [-trace ] [-save_compressed ] [-oov_mode string] [-oov_marker string] [-prev_tag string] [-prev_prev_tag string] [-last_tag string] [-default_tags ]

ngram_build offers basic ngram language model estimation.


Input data format

Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:

the cat sat on the mat

would be treated as

... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag

where prev_prev_tag is the argument to -prev_prev_tag, and so on. A default set of tag names is also available (see -default_tags). This input format is only useful for sliding-window type applications (e.g. language modelling for speech recognition).

The second input format is ngram_per_line, which is useful either for non-sliding-window applications or where the user requires a different treatment of start/end of sentence from that provided above. In this format the input file simply contains one complete ngram per line. For the same example as above (to build a trigram model) this would be:

prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
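The relationship between the two formats can be sketched in a few lines of Python. This is only an illustration of the padding described above, not the code ngram_build itself uses; the function names pad_sentence and to_ngram_lines are hypothetical, and the literal strings prev_prev_tag, prev_tag and last_tag stand in for whatever tags are passed on the command line.

```python
def pad_sentence(words, order, prev_tag, prev_prev_tag, last_tag):
    """Pad a sentence so the first real word has order-1 context tokens.

    Assumption: the context is filled with prev_prev_tag, except for the
    position immediately before the sentence, which holds prev_tag.
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    context = order - 1
    left = [prev_prev_tag] * max(context - 1, 0) + ([prev_tag] if context else [])
    return left + list(words) + [last_tag]


def to_ngram_lines(words, order,
                   prev_tag="prev_tag",
                   prev_prev_tag="prev_prev_tag",
                   last_tag="last_tag"):
    """Return one space-separated ngram per list element (sliding window)."""
    padded = pad_sentence(words, order, prev_tag, prev_prev_tag, last_tag)
    return [" ".join(padded[i:i + order])
            for i in range(len(padded) - order + 1)]


if __name__ == "__main__":
    # Reproduces the trigram example for "the cat sat on the mat".
    for line in to_ngram_lines("the cat sat on the mat".split(), order=3):
        print(line)
```

Writing the returned lines to a file gives input in the ngram_per_line layout shown above, which is one way to apply a custom start/end-of-sentence treatment before building a model.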


Representation

The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, \(V^N\) becomes very large. In such cases, we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
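To give a feel for the numbers, the following Python sketch computes how many entries a dense table would need. The vocabulary size of 10,000 words is an assumed figure chosen purely for illustration, and the helper name dense_entries is hypothetical.

```python
def dense_entries(vocab_size, order):
    """Number of distinct ngrams a dense table must hold: V to the power N."""
    return vocab_size ** order


if __name__ == "__main__":
    vocab_size = 10_000  # assumed vocabulary size, for illustration only
    for order in (1, 2, 3, 4):
        print(f"N={order}: {dense_entries(vocab_size, order):,} entries")
```

With these assumed values a trigram model already needs 10^12 dense entries, whereas a training corpus of T tokens can contain at most T distinct ngrams, which is why storing only the observed ones is far cheaper.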


Getting more robust probability estimates
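The -smooth option listed under OPTIONS applies Good-Turing smoothing up to a given frequency. As a rough illustration of the general technique only (this is not taken from the ngram_build source, and the function name good_turing_adjust is hypothetical), the textbook Good-Turing estimate replaces a raw count r by (r + 1) * N(r+1) / N(r), where N(r) is the number of distinct ngrams seen exactly r times. The sketch below assumes counts above the threshold, or counts for which N(r+1) is zero, are left unchanged.

```python
from collections import Counter


def good_turing_adjust(counts, max_freq):
    """Return Good-Turing adjusted counts for raw counts <= max_freq.

    counts:   dict mapping ngram -> raw count r
    max_freq: counts above this value are left unsmoothed (assumption)
    """
    freq_of_freq = Counter(counts.values())  # N(r) for each observed r
    adjusted = {}
    for ngram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if r <= max_freq and n_r_plus_1 > 0:
            adjusted[ngram] = (r + 1) * n_r_plus_1 / n_r
        else:
            # No usable N(r+1), or above the threshold: keep the raw count.
            adjusted[ngram] = float(r)
    return adjusted


if __name__ == "__main__":
    raw = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
    for ngram, value in good_turing_adjust(raw, max_freq=2).items():
        print(ngram, round(value, 3))
```

Practical implementations usually also smooth the N(r) values themselves before applying the formula; that refinement is omitted here.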


Testing an ngram model

Use the ngram_test program.

OPTIONS

-w
ifile filename containing word list (required)
-p
ifile filename containing predictee word list (default is to use wordlist given by -w)
-order
int order, 1=unigram, 2=bigram etc. (default 2)
-smooth
int Good-Turing smooth the grammar up to the given frequency
-input_format
string format of input data (default sentence_per_line); may also be sentence_per_file or ngram_per_line
-otype
string format of output file, one of cstr_ascii cstr_bin or htk_ascii
-sparse
build ngram in sparse representation
-dense
build ngram in dense representation (default)
-backoff
int build backoff ngram (requires -smooth)
-floor
double frequency floor value used with some ngrams
-freqsmooth
int build frequency backed off smoothed ngram, this requires -smooth option
-trace
give verbose output about build process
-save_compressed
save ngram in gzipped format
-oov_mode
string what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file, or use_oov_marker
-oov_marker
string special word for oov words (default !OOV) (use in conjunction with '-oov_mode use_oov_marker')

Pseudo-words:
-prev_tag
string tag before sentence start
-prev_prev_tag
string all words before 'prev_tag'
-last_tag
string after sentence end
-default_tags
use default tags of !ENTER,!EXIT and !EXIT respectively
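As a usage illustration, the Python sketch below assembles an ngram_build command line from the options documented above. It only builds and prints the argument list; it does not run the tool. The helper name ngram_build_command and all file names and numeric values are hypothetical placeholders, and the check that -backoff needs -smooth simply encodes the requirement stated in the option list.

```python
import shlex


def ngram_build_command(inputs, wordlist, output, order=2,
                        input_format=None, smooth=None, backoff=None,
                        sparse=False):
    """Build an argument list for ngram_build using documented options only."""
    if backoff is not None and smooth is None:
        # The option list states that -backoff requires -smooth.
        raise ValueError("-backoff requires -smooth")
    cmd = ["ngram_build", *inputs, "-w", wordlist, "-o", output,
           "-order", str(order)]
    if input_format is not None:
        cmd += ["-input_format", input_format]
    if smooth is not None:
        cmd += ["-smooth", str(smooth)]
    if backoff is not None:
        cmd += ["-backoff", str(backoff)]
    if sparse:
        cmd.append("-sparse")
    return cmd


if __name__ == "__main__":
    # File names and numeric values are placeholders, not recommendations.
    cmd = ngram_build_command(["train.txt"], "wordlist.txt", "model.ngram",
                              order=3, input_format="sentence_per_line",
                              smooth=5, sparse=True)
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a system where ngram_build is installed.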
