Building Synthetic Voices
When building a new diphone-based voice for a supported language, such as English, the upper parts of the system can mostly be taken from existing voices, thus making the building task simpler. Of course, things can still go wrong, and it's worth checking everything at each stage. This section gives the basic walkthrough for building a new US English voice. Support for building UK (southern, RP dialect) voices is also provided this way. For building non-US/UK synthesizers see Chapter 18 for a similar walkthrough that also covers the full text, lexicon and prosody issues which we can subsume in this example.
Recording a whole diphone set usually takes a number of hours, if everything goes to plan. Construction of the voice after recording may take another couple of hours, though much of this is CPU bound. Then hand-correction may take at least another few hours (depending on the quality). Thus if all goes well it is possible to construct a new voice in a day's work though usually something goes wrong and it takes longer. The more time you spend making sure the data is correctly aligned and labeled, the better the results will be. While something can be made quickly, it can take much longer to do it very well.
For those of you who have ignored the rest of this document and are just hoping to get by with reading this, good luck. It may be possible to do that, but considering the time you'll need to invest to build a voice, being familiar with the comments, at least in the rest of this chapter, may be well worth the time invested.
The tasks you will need to do are:
construct basic template files
generate prompts
record nonsense words
autolabel nonsense words
generate diphone index
generate pitchmarks and LPC coefficients
test and hand-fix diphones
build diphone group files and distribution
As with all parts of festvox, you must set the following environment variables to point to where you have installed the Edinburgh Speech Tools and the festvox distribution
export ESTDIR=/home/awb/projects/1.4.1/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
The next stage is to select a directory in which to build the voice. You will need on the order of 500M of disk space to do this; it could be done in less, but it's better to have enough to start with. Make a new directory and cd into it
mkdir ~/data/cmu_us_awb_diphone
cd ~/data/cmu_us_awb_diphone
By convention, the directory is named for the institution, the language (here, US English) and the speaker (awb, who actually speaks with a Scottish accent). Although it can be fixed later, the directory name is used when festival searches for available voices, so it is good to follow this convention.
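The naming convention can be captured in a small shell helper. This is only a sketch: the function name voice_dir_name is hypothetical and not part of festvox, and the _diphone suffix is inferred from the example directory name used in this walkthrough.

```shell
# Hypothetical helper (not part of festvox): compose a voice directory name
# from institution, language and speaker, following the example
# cmu_us_awb_diphone. The "_diphone" suffix is an assumption taken from
# that example.
voice_dir_name() {
    if [ "$#" -ne 3 ]; then
        echo "usage: voice_dir_name INSTITUTION LANGUAGE SPEAKER" >&2
        return 1
    fi
    echo "${1}_${2}_${3}_diphone"
}
```

It could then be used as mkdir ~/data/"$(voice_dir_name cmu us awb)" so that the directory name and the arguments later given to setup_diphone stay consistent.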
Build the basic directory structure
$FESTVOXDIR/src/diphones/setup_diphone cmu us awb
The arguments to setup_diphone are the institution building the voice, the language, and the name of the speaker. If you don't have an institution we recommend you use net. There is an ISO standard for language names, though unfortunately it doesn't allow distinction between US and UK English, so in general we recommend you use the two-letter form, though for US English use us and for UK English use uk. The speaker name may or may not be their actual name.
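Since setup_diphone is located through FESTVOXDIR, a quick check of the environment before running it can save confusion. The following is a minimal sketch, not part of festvox: check_festvox_env is a hypothetical name, and it tests only that both variables name existing directories and that the setup script is at the path used in this walkthrough.

```shell
# Hypothetical pre-flight check (not part of festvox). It only tests that
# ESTDIR and FESTVOXDIR name existing directories and that the setup script
# is present at the path used in this walkthrough.
check_festvox_env() {
    status=0
    for name in ESTDIR FESTVOXDIR; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    # Only look for the script once both directories are known to exist.
    if [ "$status" -eq 0 ] && [ ! -f "$FESTVOXDIR/src/diphones/setup_diphone" ]; then
        echo "setup_diphone not found under $FESTVOXDIR" >&2
        status=1
    fi
    return "$status"
}
```

A typical use would be check_festvox_env && $FESTVOXDIR/src/diphones/setup_diphone cmu us awb.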
The setup script builds the basic directory structure and copies in various skeleton files. For languages us and uk it copies in files with many of the details filled in for those languages; for other languages the skeleton files are much more skeletal.
For constructing a US voice you must have the following installed in your version of festival
festvox_kallpc16k
festlex_POSLEX
festlex_CMU
And for a UK voice you need
festvox_rablpc16k
festlex_POSLEX
festlex_OALD
At run-time the two appropriate festlex packages (POSLEX + dialect-specific lexicon) will be required but not the existing kal/rab voices.
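For scripts that handle both dialects, the package lists can be kept in one place. This sketch simply restates the lists given here; required_packages is a hypothetical function name, and it does not check whether the packages are actually installed.

```shell
# Hypothetical helper: list the festival packages this walkthrough names
# for each dialect. The package names come from the text; the function
# itself is not part of festvox and does not inspect any installation.
required_packages() {
    case "$1" in
        us) echo "festvox_kallpc16k festlex_POSLEX festlex_CMU" ;;
        uk) echo "festvox_rablpc16k festlex_POSLEX festlex_OALD" ;;
        *)  echo "unknown dialect: $1 (expected us or uk)" >&2; return 1 ;;
    esac
}
```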
To generate the nonsense word list
festival -b festvox/diphlist.scm festvox/us_schema.scm \
'(diphone-gen-schema "us" "etc/usdiph.list")'
We use a synthesized voice to build waveforms of the prompts, both for actual prompting and for alignment. If you want to change the prompt voice (e.g. to a female voice) edit festvox/us_schema.scm. Near the end of the file is the function Diphone_Prompt_Setup. By default (for US English) the voice (voice_kal_diphone) is called. Change that, and the F0 value in the following line, if appropriate, to the voice you wish to follow.
Then to synthesize the prompts
festival -b festvox/diphlist.scm festvox/us_schema.scm \
'(diphone-gen-waves "prompt-wav" "prompt-lab" "etc/usdiph.list")'
Now record the prompts. Care should be taken to set up the recording environment as well as possible. Note all power levels so that if more than one session is required you can continue and still get the same recording quality. Given the length of the US English list, it's unlikely a person can say all of these in one sitting without at least taking breaks, so ensuring the environment can be duplicated is important, even if it's only after a good stretch and a drink of water.
bin/prompt_them etc/usdiph.list
Note that a further argument can be given to state which nonsense word to begin prompting from. Thus if you have already recorded the first 100 you can continue with
bin/prompt_them etc/usdiph.list 101
See Section 29.1 for notes on pronunciation (or Section 29.2 for the UK version).
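When resuming a recording session, the starting index passed to prompt_them can be derived from the recordings already made. The sketch below assumes that recordings are stored in wav/ and named us_0001.wav, us_0002.wav and so on (the us_NNNN naming appears later in this walkthrough, but the exact layout is an assumption here). The function next_prompt_index is hypothetical and not part of festvox.

```shell
# Hypothetical helper (not part of festvox): print the number of the next
# prompt to record. Assumes recordings are named us_NNNN.wav inside the
# directory given as the first argument (default: wav).
next_prompt_index() {
    dir=${1:-wav}
    highest=0
    for f in "$dir"/us_*.wav; do
        [ -e "$f" ] || continue           # glob matched nothing
        n=${f##*/us_}                     # strip directory and "us_" prefix
        n=${n%.wav}                       # strip the extension
        n=$(echo "$n" | sed 's/^0*//')    # drop leading zeros
        case "$n" in
            ''|*[!0-9]*) continue ;;      # skip names that are not numbered
        esac
        if [ "$n" -gt "$highest" ]; then
            highest=$n
        fi
    done
    echo $((highest + 1))
}
```

With the first 100 prompts recorded this prints 101, so the session could be resumed with bin/prompt_them etc/usdiph.list "$(next_prompt_index wav)". Note that it uses the highest number found, so gaps in earlier recordings are not detected.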
The recorded prompts can then be labeled by
bin/make_labs prompt-wav/*.wav
It is always worthwhile correcting the autolabeling. Use
emulabel etc/emu_lab
and select FILE OPEN from the top menu bar, then place the other dialog box, click inside it and hit return. A list of all label files will be given. Double-click on each of these to see the labels, spectrogram and waveform. (** reference to "How to correct labels" required **).
Once the diphone labels have been corrected, the diphone index may be built by
bin/make_diph_index etc/usdiph.list dic/awbdiph.est
If no EGG signal has been collected you can extract the pitchmarks by (though read Section 4.7 to ensure you are getting the best extraction)
bin/make_pm_wave wav/*.wav
If you do have an EGG signal then use the following instead
bin/make_pm lar/*.lar
A program to move the predicted pitchmarks to the nearest peak in the waveform is also provided. This is almost always a good idea, even for EGG-extracted pitchmarks
bin/make_pm_fix pm/*.pm
Getting good pitchmarks is important to the quality of the synthesis; see Section 4.7 for more discussion.
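The choice between the two extraction commands depends only on whether EGG recordings exist, so it can be sketched as a small helper. The function pitchmark_commands is hypothetical; it assumes EGG files, if any, are in lar/ with a .lar extension as in the commands given here, and it only prints the commands to run rather than running them.

```shell
# Hypothetical helper (not part of festvox): print the pitchmark commands
# appropriate for a voice directory. If any lar/*.lar files exist, the
# EGG-based extraction is suggested, otherwise extraction from the waveform.
pitchmark_commands() {
    voice_dir=${1:-.}
    if ls "$voice_dir"/lar/*.lar >/dev/null 2>&1; then
        echo 'bin/make_pm lar/*.lar'
    else
        echo 'bin/make_pm_wave wav/*.wav'
    fi
    # Moving pitchmarks to the nearest waveform peak applies in both cases.
    echo 'bin/make_pm_fix pm/*.pm'
}
```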
Because there is often a power mismatch through a set of diphones we provide a simple method for finding what general power differences exist between files. This finds the mean power for each vowel in each file and calculates a factor with respect to the overall mean vowel power. A table of power modifiers for each file can be calculated by
bin/find_powerfactors lab/*.lab
The factors calculated by this are saved in etc/powfacts.
Then build the pitch-synchronous LPC coefficients, which use the power factors if they've been calculated.
bin/make_lpc wav/*.wav
Now the database is ready for its initial tests.
festival festvox/cmu_us_awb_diphone.scm '(voice_cmu_us_awb_diphone)'
When there has been no hand correction of the labels this stage may fail with diphones not having proper start, mid and end values. This happens when the automatic labeler has positioned two labels at the same point. For each diphone that has a problem, find out which file it comes from (grep for it in dic/awbdiph.est) and use emulabel to correct the labeling. For example, suppose "ah-m" is wrong and you find it comes from us_0314. Then type
emulabel etc/emu_lab us_0314
After correcting labels you must re-run the make_diph_index command. You should also re-run the find_powerfactors and make_lpc stages as these too depend on the labels, but these take longer to run and perhaps need only be done when you've corrected many labels.
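Rather than grepping for each failing diphone in turn, the index can be scanned for suspicious entries in one pass. The sketch below rests on an assumption about the index format that is not stated in this section: that each entry line in dic/awbdiph.est has five whitespace-separated fields (diphone name, file name, start, mid, end). Check this against your own index file before relying on it. The function find_bad_diphones is hypothetical.

```shell
# Hypothetical helper (not part of festvox): list diphones whose start, mid
# and end times are not strictly increasing.
# ASSUMPTION: entry lines look like "ah-m us_0314 0.250 0.300 0.400";
# lines with any other number of fields (such as header lines) are ignored.
find_bad_diphones() {
    awk 'NF == 5 && ($3 + 0 >= $4 + 0 || $4 + 0 >= $5 + 0) { print $1, $2 }' "$1"
}
```

Each line of output names a diphone and the file it comes from, so an entry for ah-m in us_0314 would be followed up with emulabel etc/emu_lab us_0314.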
To test the voice's basic functionality, use
festival> (SayPhones '(pau hh ax l ow pau))
festival> (intro)
As the autolabeling is unlikely to work completely you should listen to a number of examples to find out which diphones have gone wrong.
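When checking many diphones by ear it can be convenient to generate the SayPhones expression for a given diphone name. This is a sketch only: say_diphone is a hypothetical helper, and surrounding the two phones with pau assumes that the corresponding pau diphones exist in your set.

```shell
# Hypothetical helper (not part of festvox or festival): turn a diphone
# name such as "ah-m" into a SayPhones expression that can be pasted at
# the festival prompt. Padding with pau on both sides is an assumption.
say_diphone() {
    case "$1" in
        ?*-?*) ;;
        *) echo "expected a diphone name such as ah-m" >&2; return 1 ;;
    esac
    left=${1%%-*}     # text before the first hyphen
    right=${1#*-}     # text after the first hyphen
    echo "(SayPhones '(pau $left $right pau))"
}
```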
Finally, once you have corrected the errors (did we mention you need to check and correct the errors?), you can build a final voice suitable for distribution. First you need to create a group file which contains only the subparts of spoken words which contain the diphones.
festival festvox/cmu_us_awb_diphone.scm '(voice_cmu_us_awb_diphone)'
...
festival> (us_make_group_file "group/awblpc.group" nil)
...
The us_ in this function name confusingly stands for UniSyn (the unit concatenation subsystem in Festival) and has nothing to do with US English.
To test this, edit festvox/cmu_us_awb_diphone.scm and change the choice of databases used from separate to grouped. This is done by commenting out the line (around line 81)
(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_sep))
and uncommenting the line (around line 84)
(set! cmu_us_awb_db_name (us_diphone_init cmu_us_awb_lpc_group))
The next stage is to integrate this new voice so that festival can find it automatically. To do this, you should add a symbolic link from the voice directory of Festival's English voices to the directory containing the new voice. First cd to festival's voice directory (this will vary depending on where you installed festival)
cd /home/awb/projects/1.4.1/festival/lib/voices/english/
Then add a symbolic link back to where your voice was built
ln -s /home/awb/data/cmu_us_awb_diphone
Now this new voice will be available for anyone running that version of festival (started from any directory)
festival
...
festival> (voice_cmu_us_awb_diphone)
...
festival> (intro)
...
The final stage is to generate a distribution file so the voice may be installed on others' festival installations. Before you do this you must add a file COPYING to the directory you built the diphone database in. This should state the terms and conditions under which people may use, distribute and modify the voice.
Generate the distribution tarfile in the directory above the festival installation (the one where the festival/ and speech_tools/ directories are).
cd /home/awb/projects/1.4.1/
tar zcvf festvox_cmu_us_awb_lpc.tar.gz \
festival/lib/voices/english/cmu_us_awb_diphone/festvox/*.scm \
festival/lib/voices/english/cmu_us_awb_diphone/COPYING \
festival/lib/voices/english/cmu_us_awb_diphone/group/awblpc.group
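Because tar will fail or silently omit a file if something listed is missing, it may be worth confirming that the files are in place first. The following is a minimal sketch: check_dist_files is a hypothetical helper, and the file names it looks for are the ones used in the tar command here.

```shell
# Hypothetical helper (not part of festvox): confirm that the files named
# in the tar command exist under the given voice directory.
check_dist_files() {
    voice_dir=$1
    status=0
    for f in "$voice_dir/COPYING" "$voice_dir/group/awblpc.group"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    # At least one Scheme file is expected in festvox/.
    if ! ls "$voice_dir"/festvox/*.scm >/dev/null 2>&1; then
        echo "missing: $voice_dir/festvox/*.scm" >&2
        status=1
    fi
    return "$status"
}
```

For example, run check_dist_files festival/lib/voices/english/cmu_us_awb_diphone from the directory above the festival installation, and only build the tarfile if it succeeds.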
The complete files from building an example US voice based on the KAL recordings are available at http://festvox.org/examples/cmu_us_kal_diphone/.