Building Synthetic Voices
Chapter 3. A Practical Speech Synthesis System

3.2. Utterance structure

The basic building block for Festival is the utterance. An utterance consists of a set of relations over a set of items. Each item represents an object such as a word, segment or syllable, while relations relate these items to one another. An item may appear in multiple relations; for example, a segment will be in the Segment relation and also in the SylStructure relation. Relations define an ordered structure over the items within them. In general these structures may be arbitrary graphs, but in practice so far we have only used lists and trees. Items may contain a number of features.
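The idea of items shared between relations can be illustrated with a minimal Python sketch. This is not Festival's implementation or API: the class names `Item`, `Relation` and `Utterance`, their methods, and the phone labels are all assumptions made for illustration. It shows only the point made above, that the same item can be a member of a flat list in one relation and a node of a tree in another.

```python
class Item:
    """An object such as a word, syllable or segment, carrying named features."""

    def __init__(self, **features):
        self.features = dict(features)


class Relation:
    """An ordered structure (a list, or a list of trees) over shared items."""

    def __init__(self, name):
        self.name = name
        self.roots = []        # top-level items, in order
        self._daughters = {}   # id(item) -> ordered daughters in this relation
        self._parent = {}      # id(item) -> parent in this relation

    def append(self, item, parent=None):
        """Add an item as a new root, or as the last daughter of `parent`."""
        if parent is None:
            self.roots.append(item)
        else:
            self._daughters.setdefault(id(parent), []).append(item)
            self._parent[id(item)] = parent
        return item

    def parent(self, item):
        return self._parent.get(id(item))

    def daughters(self, item):
        return list(self._daughters.get(id(item), []))

    def items(self):
        """All items in this relation, in depth-first order."""
        out, stack = [], list(reversed(self.roots))
        while stack:
            node = stack.pop()
            out.append(node)
            stack.extend(reversed(self.daughters(node)))
        return out


class Utterance:
    """A set of named relations over a common set of items."""

    def __init__(self):
        self.relations = {}

    def relation(self, name):
        if name not in self.relations:
            self.relations[name] = Relation(name)
        return self.relations[name]


# Illustrative data: one word, one syllable, three segments.
utt = Utterance()
word = Item(name="cat")
syl = Item(stress=1)
segs = [Item(name=p) for p in ("k", "ae", "t")]

# Flat lists: each item is a root in its own relation.
utt.relation("Word").append(word)
utt.relation("Syllable").append(syl)
for s in segs:
    utt.relation("Segment").append(s)

# A tree over the very same item objects.
tree = utt.relation("SylStructure")
tree.append(word)
tree.append(syl, parent=word)
for s in segs:
    tree.append(s, parent=syl)

# Start from the Segment list, then climb the SylStructure tree.
first = utt.relation("Segment").items()[0]
print(first.features["name"], "is in the word",
      tree.parent(tree.parent(first)).features["name"])
```

Keeping the structure in the relation rather than in the item is what lets one segment have no parent in the Segment list while having a syllable as its parent in the tree.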

There are no built-in relations in Festival; their names and use are determined by the particular modules used to do synthesis. Language-, voice- and module-specific relations can easily be created and manipulated. However, within our basic voices we have followed a number of conventions that you should also follow if you wish to use some of the existing modules.

The relation names used will depend on the particular structure chosen for your voice. So far most of our released voices have the same basic structure, though some of our research voices contain quite a different set of relations. For our basic English voices the relations used are as follows:

Text: Contains a single item which contains a feature with the input character string that is being synthesized.
Token: A list of trees in which the root of each tree is a whitespace-separated token from the input character string. Punctuation and whitespace have been stripped and placed in features on these token items. The daughters of each root are the words that the token is associated with. In many cases this is a one-to-one relationship, but in general it is one to zero or more. For example, tokens consisting of digits will typically be associated with a number of words.
Word: The words in the utterance. By word we typically mean something that can be given a pronunciation from a lexicon (or letter-to-sound rules). However, in most of our voices pronunciation is distinguished by the word together with a part-of-speech feature. Words will also be leaves of the Token relation, leaves of the Phrase relation and roots of the SylStructure relation.
Phrase: A simple list of trees representing the prosodic phrasing of the utterance. In our voices we only have one level of prosodic phrase below the utterance (though you can easily add a deeper hierarchy if your models require it). The tree roots are labeled with the phrase type and the leaves of these trees are in the Word relation.
Syllable: A simple list of syllable items. These syllable items are intermediate nodes in the SylStructure relation, allowing access to the words these syllables are in and the segments that are in these syllables. In this format no further onset/coda distinction is made explicit, but it can be derived from this information.
Segment: A simple list of segment (phone) items. These form the leaves of the SylStructure relation, through which we can find where each segment is placed within its syllable and word. By convention silence phones do not appear in any syllable (or word) but will exist in the Segment relation.
SylStructure: A list of tree structures over the items in the Word, Syllable and Segment relations.
IntEvent: A simple list of intonation events (accents and boundaries). These are related to syllables through the Intonation relation.
Intonation: A list of trees whose roots are items in the Syllable relation and whose daughters are in the IntEvent relation. It is assumed that a syllable may have a number of intonation events associated with it (at least accents and boundaries), but an intonation event may only be associated with one syllable.
Wave: A relation consisting of a single item that has a feature with the synthesized waveform.
Target: A list of trees whose roots are segments and daughters are F0 target points. This is only used by some intonation modules.
Unit, SourceSegments, Frames, SourceCoef, TargetCoef: A number of relations used by the UniSyn module.
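To make these conventions concrete, the following Python sketch lays out the Text, Token, Word, Syllable, Segment and SylStructure relations for the input "42." using plain dictionaries. It is a hand-built illustration, not output from Festival: the helper names `item`, `node` and `leaves`, the feature names, the phone labels and the stress values are all assumptions. It shows a single token mapping to two words, and silence segments that exist in the Segment relation but are absent from SylStructure.

```python
def item(**features):
    """A hypothetical item: simply a dict of features."""
    return dict(features)


def node(it, *daughters):
    """A tree node within one relation: an item plus ordered daughter nodes."""
    return {"item": it, "daughters": list(daughters)}


def leaves(n):
    """Items at the leaves of a tree node, left to right."""
    if not n["daughters"]:
        return [n["item"]]
    return [leaf for d in n["daughters"] for leaf in leaves(d)]


def phones(*names):
    """One distinct segment item per phone label (labels are illustrative)."""
    return [item(name=p) for p in names]


text = item(text="42.")
tok = item(name="42", punc=".")          # punctuation kept as a token feature
w_forty, w_two = item(name="forty"), item(name="two")
syl_for, syl_ty, syl_two = item(stress=1), item(stress=0), item(stress=1)
s_for, s_ty, s_two = phones("f", "ao", "r"), phones("t", "iy"), phones("t", "uw")
pau_start, pau_end = phones("pau", "pau")

utt = {
    "Text": [node(text)],
    # One token, two words: a one-to-many Token/Word relationship.
    "Token": [node(tok, node(w_forty), node(w_two))],
    "Word": [node(w_forty), node(w_two)],
    "Syllable": [node(s) for s in (syl_for, syl_ty, syl_two)],
    # Silences appear here ...
    "Segment": [node(s) for s in [pau_start, *s_for, *s_ty, *s_two, pau_end]],
    # ... but not in any syllable or word.
    "SylStructure": [
        node(w_forty,
             node(syl_for, *map(node, s_for)),
             node(syl_ty, *map(node, s_ty))),
        node(w_two,
             node(syl_two, *map(node, s_two))),
    ],
}

# Words associated with the token.
token_words = [d["item"]["name"] for d in utt["Token"][0]["daughters"]]

# Segments reachable through SylStructure. Items are compared by identity
# ("is"), since two distinct "t" segments have equal feature dicts.
in_syl = [seg for root in utt["SylStructure"] for seg in leaves(root)]
silences = [n["item"] for n in utt["Segment"]
            if not any(n["item"] is seg for seg in in_syl)]

print(token_words)
print([seg["name"] for seg in in_syl])
print([seg["name"] for seg in silences])
```

The same item objects are reused across relations, so a module that sets a feature on a segment through the Segment list would see it when reaching that segment through SylStructure.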
