Building Synthetic Voices
Chapter 12. Unit selection databases
As touched on above, the choice of an inventory of units can be viewed as a line running from a small inventory of phones, through diphones and triphones, to arbitrary units, though the direction you come from influences how units are selected from the database. CHATR [campbell96] lies firmly at the "arbitrary units" end of the spectrum. Although it can exclude bad units from its inventory, it takes very much an "everything minus some" view of the world. Microsoft's Whistler [huang97], on the other hand, starts off with a general database but selects typical units from it. Thus its inventory is substantially smaller than the full general database from which the units are extracted. At the other end of the spectrum we have the fixed pre-specified inventory, as in diphone synthesis, which was described in the previous chapter.
In this section we'll give some examples of moving along the line from the fixed pre-specified inventory towards the more general inventories, though these techniques still have a strong component of prespecification.
First, let us assume you have a general database that is labeled with utterances as described above. We can extract a standard diphone database from this general database; however, unless the database was specifically designed for it, a general database is unlikely to have full diphone coverage. Even when phonetically rich databases such as TIMIT are used, there are likely to be very few vowel-vowel diphones, as they are comparatively rare. But as these diphones are rare, we may be able to do without them, and hence it is at least an interesting exercise to extract as complete a diphone index as possible from a general database.
The simplest method is to search linearly through all utterances for every phone-phone pair in the phone set, simply taking the first example of each. Some sample code is given in src/diphone/make_diphs_index.scm. The basic idea is to load in all the utterances in a database and index each segment by its phone name and the succeeding phone name. Then various selection techniques can be used to select from the multiple candidates for each diphone (or you can split the indexing further). After selection, a diphone index file can be saved.
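The linear search can be illustrated with a short Python sketch. This is not the code in src/diphone/make_diphs_index.scm; the utterance representation (a file id plus a list of (phone, end time) segments), the function name `build_diphone_index`, and the use of phone midpoints as diphone boundaries are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: an utterance is (fileid, [(phone, end_time), ...]).
Segment = Tuple[str, float]
Utterance = Tuple[str, List[Segment]]
Entry = Tuple[str, float, float, float]  # (fileid, start, mid, end)


def build_diphone_index(utts: List[Utterance]) -> Dict[str, Entry]:
    """Map "p1-p2" to the first example found, scanning utterances in order."""
    index: Dict[str, Entry] = {}
    for fileid, segs in utts:
        start = 0.0  # start time of the current left-hand phone
        for (p1, end1), (p2, end2) in zip(segs, segs[1:]):
            name = f"{p1}-{p2}"
            if name not in index:  # simplest selection: keep the first example
                # Assumed simplification: the diphone runs from the middle of p1
                # to the middle of p2, with the phone boundary as its midpoint.
                index[name] = (fileid, (start + end1) / 2, end1, (end1 + end2) / 2)
            start = end1
    return index


utts = [
    ("utt001", [("pau", 0.20), ("h", 0.28), ("e", 0.40), ("l", 0.47), ("ou", 0.70)]),
    ("utt002", [("pau", 0.10), ("l", 0.18), ("ou", 0.42), ("pau", 0.60)]),
]
index = build_diphone_index(utts)
```

Replacing the "first example wins" test with a comparison of candidates (for example by duration or pitch) is where the more elaborate selection techniques would plug in.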
The utterances to load are identified by a list of fileids. For example, if the list of fileids (without parentheses) is in the file etc/fileids, the following will build a diphone index.
festival .../make_diphs_utts.scm
...
festival> (set! fileids (load "etc/fileids" t))
...
festival> (make_diphone_index fileids "dic/f2bdiph.est")
Note that, as this diphone index will contain a number of holes, you will need either to augment it with "similar" diphones or to process your diphone selections through UniSyn_module_hooks, as described in the previous chapter.
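To make the holes concrete, the following Python sketch lists the phone pairs missing from an index and fills some of them from a table of "similar" phones. The phone set, the `similar` table, and both function names are hypothetical; which substitutions are acceptable depends on the language and the voice.

```python
from itertools import product
from typing import Dict, List, Optional, Set


def missing_diphones(phones: List[str], have: Set[str]) -> List[str]:
    """All phone-phone pairs over the phone set that the index lacks."""
    return [f"{a}-{b}" for a, b in product(phones, repeat=2)
            if f"{a}-{b}" not in have]


def back_off(name: str, have: Set[str], similar: Dict[str, List[str]]) -> Optional[str]:
    """Return an available diphone to stand in for `name`, or None."""
    left, right = name.split("-")
    # Try the original phone first on each side, then its listed substitutes.
    for l in [left] + similar.get(left, []):
        for r in [right] + similar.get(right, []):
            if f"{l}-{r}" in have:
                return f"{l}-{r}"
    return None


phones = ["a", "i", "t", "d"]
have = {"a-t", "t-a", "i-t", "t-i", "a-d", "d-a"}
similar = {"d": ["t"], "t": ["d"]}  # assumed substitution table
holes = missing_diphones(phones, have)
aliases = {h: back_off(h, have, similar) for h in holes}
```

Holes that map to None here (the vowel-vowel pairs, for instance) are the ones that would still need handling at synthesis time.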
As you complicate the selection and increase the number of diphones you use from the database, you will need to complicate the names used to identify the diphones themselves. The convention of using underscores for syllable-internal consonant clusters and dollars for syllable-initial consonants can be followed, but you will need to go further if you wish to start introducing new features such as phrase finality and stress. Eventually, going to a generalized naming scheme (type and number), as used by the cluster selection technique described above, will prove worthwhile. Also, using CART trees, though hand-written and fully deterministic (one candidate at the leaves), will be a reasonable way to select between hand-stipulated alternatives with reasonable backoff strategies.
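A hand-written, deterministic tree with backoff might look like the following Python sketch. The feature names (`stress`, `phrase_final`) and the `:s`, `:f`, `:sf` suffixes are invented for illustration and are not a naming convention used by Festival.

```python
from typing import Dict, Set


def choose_variant(base: str, feats: Dict[str, int], have: Set[str]) -> str:
    """Pick a diphone name for `base` from a fixed, hand-written decision tree.

    Each branch lists its preferred name first, followed by backoff names.
    """
    stressed = bool(feats.get("stress"))
    final = bool(feats.get("phrase_final"))
    if stressed and final:
        candidates = [f"{base}:sf", f"{base}:s", base]
    elif stressed:
        candidates = [f"{base}:s", base]
    elif final:
        candidates = [f"{base}:f", base]
    else:
        candidates = [base]
    for name in candidates:
        if name in have:
            return name
    raise KeyError(f"no diphone available for {base}")


have = {"t-a", "t-a:s"}  # assumed inventory: plain and stressed variants only
picked = choose_variant("t-a", {"stress": 1, "phrase_final": 1}, have)
```

Because the inventory above has no phrase-final variant, the stressed phrase-final request backs off to the stressed variant.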
Another potential direction is to use the acoustic costs used in the clustering methods described in the previous section. These can be used to identify which unit in a cluster is the most typical (the mean distances from all other units are given in the leaves). Pruning these trees until each cluster contains only a single example should help to improve synthesis, in that variation in the features in the "diphone" index will then be determined by the features specified to the cluster training algorithm. Of course, the more you limit the number of distinct unit types, the more prosodic modification will be required of your signal processing algorithm, which in turn requires that you have good pitch marks.
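Picking the most typical unit in a cluster amounts to choosing the member with the smallest mean distance to all the others. A minimal Python sketch, assuming the pairwise acoustic distances are already available as a square matrix (the numbers below are made up):

```python
from typing import Sequence


def most_typical(dist: Sequence[Sequence[float]]) -> int:
    """Index of the unit with the smallest mean distance to all other units.

    `dist` is a non-empty square matrix of pairwise distances.
    """
    n = len(dist)
    if n == 1:
        return 0
    means = [sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
             for i in range(n)]
    return min(range(n), key=means.__getitem__)


# Made-up pairwise acoustic distances between three units in one cluster.
dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
best = most_typical(dist)
```

Here the second unit (index 1) has the lowest mean distance to the others, so it would be kept as the single representative of the cluster.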
If you already have an existing database but don't wish to go to full unit selection, such techniques are probably quite feasible and worth further investigation.