Chapter 3. A Practical Speech Synthesis System

The Festival Speech Synthesis System was developed at the Centre for Speech Technology Research at the University of Edinburgh in the late 1990s. It offers a free, portable, language-independent, run-time speech synthesis engine for various platforms under various APIs. This book is not about the Festival system itself; Festival is just the engine that we will use in the process of building voices, both as a run-time engine for the voices we build and as a tool in the building process itself. This chapter gives background on the philosophy of the system, its basic use, and some lower-level details of its internals that will make the whole synthesis task easier to understand.

The Festival Speech Synthesis System was designed to target three particular classes of speech synthesis user.

  1. Speech synthesis researchers: where they may use Festival as a vehicle for development and testing of new research in synthesis technology.

  2. Speech application developers: where synthesis is not the primary interest, but Festival will be a substantial sub-component which may require significant integration and hence the system must be open and easily configurable.

  3. End users: where the system simply takes text and generates speech, requiring little or no configuration from the user.

In the design of Festival it was important that all three classes of user were served, as there needs to be a clear route from research work to practical, usable systems. This not only encourages research to be focussed but also, as has been shown by the large uptake of the system, ensures there is a large user community interested in seeing improvements to the system.

The Festival Speech Synthesis System was built on the experience of previous synthesis engines. The design of the core architecture is important, as what may seem general to begin with can quickly become a limiting factor as new and more ambitious techniques are attempted within it. The basic architecture of Festival benefited mainly from previous synthesis engines developed at Edinburgh University, specifically Osprey [taylor91]. ATR's CHATR system [black94] was also a major influence on Festival; CHATR's original core architecture was developed by the same authors as Festival. In designing Festival, the intention was to avoid the previous limitations in the utterance representation and module specification, specifically by avoiding constraints on the types of modules and the dependencies between them. However, even with this intent, Festival went through a number of core changes before it settled.

The Festival system consists of a set of C++ objects and core methods suitable for doing synthesis tasks. These objects include synthesis-specific objects like waveforms, tracks and utterances, as well as more general objects like feature sets, n-grams, and decision trees.

In order to give parameters and specify flow of control, Festival offers a scripting language based on the Scheme programming language [Scheme96]. Having a scripting language is one of the key factors that makes Festival a useful system. Most of the techniques in this book for building new voices within Festival can be carried out without any changes to the core C++ objects. This not only makes development of new voices accessible to a larger population of users, as neither C++ knowledge nor a C++ compiler is necessary, it also makes the distribution of voices built by these techniques easy, as users do not need to recompile anything to use newly created voices.

Scheme offers a language with a very simple syntax that is nonetheless powerful for specifying parameters and simple functions. Scheme was chosen as its implementation is small and would not increase the size of the Festival system unnecessarily. Also, using an embedded Scheme component does not increase the requirements for installation, as would the use of, say, Java, Perl or Python as the scripting language. Scheme frightens some people, as Lisp-based languages have an unfair reputation for being slow. Festival's use of Scheme is (in general) limited to simple functions, and very little time is spent in the Scheme interpreter itself. Automatic garbage collection also has a reputation for slowing systems down. In Festival, garbage collection happens after each utterance is synthesized and again takes up only a small amount of time, but it allows the programmer not to have to worry about explicitly freeing memory.

For the most part the third type of user, defined above, will never need to change any part of the system (though they usually find something they want to change, like adding new entries to the lexicon). The second type of user typically does most of their customizing in Scheme, though this is usually just modifying existing pieces of Scheme in the way that people may add simple lines of Lisp to their .emacs file. It is primarily only the synthesis research community that has to deal with the C++ end of the system, though C/C++ interfaces to the system as a library are also provided (see Chapter 23 for more discussion of APIs).

This chapter covers the basic use of the system and is followed by more details of the internal structures, particularly the utterance structure, accessing methods and modules. These later sections probably contain more detail than one needs for building the standard voices described in the book, but the information is necessary when more ambitious voice building tasks are attempted.

3.1. Basic Use

The examples here are based on a standard installation on a Unix system as described in Chapter 28; however, the examples are likely to work on any platform Festival supports.

The simplest way to have Festival speak a file from the command line is the command

festival --tts example.txt

This will speak the text in example.txt using the default voice.

Festival can also read text from stdin using a command like

echo "Hello world" | festival --tts
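The two invocations above can also be driven from another program. The following is a minimal Python sketch, assuming the festival executable is on the PATH; the helper names tts_command and speak_text are hypothetical and not part of Festival, and only the --tts option shown above is used.

```python
import shutil
import subprocess


def tts_command(path=None):
    """Build the argument list for text mode; with no path, text is read from stdin."""
    cmd = ["festival", "--tts"]
    if path is not None:
        cmd.append(path)
    return cmd


def speak_text(text):
    """Pipe text to 'festival --tts'; return False if festival is not installed."""
    if shutil.which("festival") is None:
        return False
    subprocess.run(tts_command(), input=text.encode("utf-8"), check=True)
    return True
```

Calling speak_text("Hello world") mirrors the echo pipeline above, while tts_command("example.txt") corresponds to speaking a file.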

Festival actually offers two modes, a text mode and a command mode. In text mode everything given to Festival is treated as text to be spoken. In command mode everything is treated as Scheme commands and interpreted.

When festival is started with no arguments it goes into interactive command mode. There you may type Scheme commands and have Festival interpret them. For example

$ festival
....
festival>

One simple command is SayText, which takes a single string argument and says its contents.

festival> (SayText "Hello world.")
#<Utterance 0x402a9ce8>
festival>

You may select other voices for synthesis by calling the appropriate function. For example

festival> (voice_cmu_us_sls_diphone)
cmu_us_sls_diphone
festival> (SayText "Hello world.")
#<Utterance 0x402f0648>
festival> 

This will use a female US English voice (if installed).

The command line interface offers command line history through the up and down arrows (ctrl-P and ctrl-N) and editing through standard emacs-like commands. Importantly, the interface also does function and filename completion, using the TAB key.

Any Scheme command may be typed at the command line, for example

festival> (Parameter.set 'Duration_Stretch 1.5)
1.5
festival> 

This will make all durations longer for the current voice (making the voice speak slower).

festival> (SayText "a very slow example.")
#<Utterance 0x402f564376>
festival> 

Calling any specific voice will reset this value (or you may do it by hand).

festival> (voice_cmu_us_kal_diphone)
cmu_us_kal_diphone
festival> (SayText "a normal example.")
#<Utterance 0x402e3348>
festival> 
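As a rough illustration of the effect of Duration_Stretch, the following Python sketch treats the parameter as a single factor applied to every segment duration. This is an assumed simplification for illustration only, not Festival's own code; the function stretch_durations and the sample durations are hypothetical.

```python
def stretch_durations(durations_ms, stretch):
    """Scale every segment duration by the same stretch factor."""
    return [d * stretch for d in durations_ms]


# Hypothetical segment durations in milliseconds.
segments = [80, 120, 200]

# A factor of 1.5 lengthens every segment, so speech sounds slower.
slow = stretch_durations(segments, 1.5)

# A factor of 1.0 leaves the durations unchanged.
normal = stretch_durations(segments, 1.0)
```

Under this model a factor above 1.0 slows the voice down and a factor below 1.0 speeds it up.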

SayText is just a simple function that takes the given string, constructs an utterance object from it, synthesizes it and sends the resulting waveform to the audio device. This isn't really suitable for synthesizing anything but very short utterances. The TTS process involves the more complex task of splitting text streams into utterances, synthesizing them and sending them to the audio device so that they may play while Festival is working on the next utterance, so that the audio output is continuous. Festival does this through the tts function (which is what is actually called when Festival is given the --tts argument on the command line). In Scheme the tts function takes two arguments, a filename and a mode. Modes are described in more detail in Section 6.5, and can be used to allow special processing of text, such as respecting markup or particular styles of text like email. In the simple case the mode will be nil, which denotes the basic raw or fundamental mode.

festival> (tts "WarandPeace.txt" nil)
t
festival> 
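The idea of synthesizing the next utterance while the previous one is still playing can be sketched as a small producer and consumer pipeline. The Python code below is only an illustration of that idea under simplifying assumptions, not how Festival implements tts: the sentence splitter is deliberately naive, synthesize is a stand-in that returns a placeholder instead of a waveform, and all function names are hypothetical.

```python
import queue
import re
import threading


def split_into_utterances(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize(utterance):
    """Stand-in for synthesis: returns a placeholder 'waveform'."""
    return ("waveform", utterance)


def speak(text, play):
    """Synthesize utterance by utterance while a second thread plays them."""
    waves = queue.Queue(maxsize=2)  # small buffer between the two stages

    def player():
        while True:
            wave = waves.get()
            if wave is None:  # sentinel: no more audio to play
                break
            play(wave)

    thread = threading.Thread(target=player)
    thread.start()
    for utt in split_into_utterances(text):
        waves.put(synthesize(utt))  # blocks only when the buffer is full
    waves.put(None)
    thread.join()


# Usage: collect the "played" waveforms in a list instead of using audio.
played = []
speak("Hello world. This is a test! Is it working?", played.append)
```

The bounded queue lets synthesis run ahead of playback by a couple of utterances, which is what keeps the output continuous without synthesizing the whole text first.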

Commands can also be stored in files, which is normal when a number of function definitions and parameter settings are required. These Scheme files can be loaded by the function load, as in

festival> (load "commands.scm")
t
festival> 

Arguments to Festival at startup time will normally be treated as command files and loaded.

$ festival commands.scm
...
festival>

However, if the argument starts with a left parenthesis, (, the argument is interpreted directly as a Scheme command.

$ festival '(SayText "a short example.")'
...
festival>

If the -b (batch) option is specified, Festival does not go into interactive mode and exits after processing all of the given arguments.

$ festival -b mynewvoicedefs.scm '(SayText "a short example.")'

Thus we can use Festival interactively or simply as a batch scripting language. The batch format will be used often in the voice building process, though the interactive mode is useful for testing new voices.
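The startup rules described above (plain arguments are loaded as command files, arguments starting with a left parenthesis are evaluated as Scheme, and -b suppresses interactive mode) can be summarized in a short Python sketch. This models only the behaviour described in this section and is not Festival's actual argument parser; the function plan_startup and the action labels "load" and "eval" are hypothetical.

```python
def plan_startup(args):
    """Return (actions, interactive) for a list of command-line arguments.

    Only the rules described in this section are modelled; other
    options (such as --tts) are deliberately not handled.
    """
    interactive = True
    actions = []
    for arg in args:
        if arg == "-b":
            interactive = False            # batch: exit after processing
        elif arg.startswith("("):
            actions.append(("eval", arg))  # Scheme command given directly
        else:
            actions.append(("load", arg))  # treated as a command file
    return actions, interactive


# The batch example from the text.
actions, interactive = plan_startup(
    ["-b", "mynewvoicedefs.scm", '(SayText "a short example.")']
)
```

Here actions lists the file to load followed by the command to evaluate, and interactive is False because -b was given.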