Concatenation of natural speech segments is the standard approach in today's text-to-speech (TTS) synthesis systems. To obtain synthetic speech that sounds close to natural, two main requirements must be met: first, an appropriately prepared speech database is needed from which the segments to be concatenated can be extracted; second, during synthesis the segments that are concatenated into a particular utterance have to be selected in an optimal way.
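Optimal segment selection is commonly cast as a dynamic-programming (Viterbi) search that minimizes the sum of target costs (mismatch between a candidate segment and the desired unit) and concatenation costs (mismatch at the joins). The sketch below illustrates this general technique only; the function and cost names are hypothetical and not taken from this project.

```python
def select_units(candidates, target_cost, concat_cost):
    """Pick one candidate per target position minimizing total cost.

    candidates: list over target positions; each entry is a list of
    candidate units. target_cost(i, c) and concat_cost(prev, c) are
    caller-supplied cost functions (illustrative, not the project's).
    """
    n = len(candidates)
    # best[i][j]: minimal cost of a path ending in candidate j at position i
    best = [[target_cost(0, c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, brow = [], []
        for c in candidates[i]:
            costs = [best[i - 1][k] + concat_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(i, c))
            brow.append(k)
        best.append(row)
        back.append(brow)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

For example, with numeric stand-ins for segments, `select_units([[1, 5], [2, 6], [3, 7]], lambda i, c: abs(c - [5, 6, 7][i]), lambda a, b: abs(a - b))` returns `[5, 6, 7]`, the sequence with the lowest combined target and join cost.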
The preparation of such a speech database is a difficult and time-consuming task. A major problem arises from the fact that even professional speakers cannot maintain exactly the same speaking style over long recording sessions; there is always some degree of spontaneous variation as well as fatigue. Methods for the accurate detection of such changes are therefore necessary.
The aim of this project is, first of all, to investigate such methods. Furthermore, a set of tools will be developed that allows for online supervision of the speaking style during recordings. Additional tools are needed to precisely analyze and describe the recorded speech signals. This description will include the positions of all types of linguistic items such as phones, syllables, words and phrases, the prosodic properties of these items, the accuracy of the pronunciation of each phone, etc. This information will be used to select the optimal segments for concatenation.
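A per-phone annotation of the kind described above could be represented as a simple record. The field names below are purely illustrative assumptions, not the project's actual annotation format.

```python
from dataclasses import dataclass


@dataclass
class PhoneAnnotation:
    """Hypothetical per-phone record; fields are illustrative only."""
    label: str      # phone symbol, e.g. a SAMPA label
    start: float    # segment start time in seconds
    end: float      # segment end time in seconds
    syllable: int   # index of the containing syllable
    word: int       # index of the containing word
    f0_mean: float  # mean fundamental frequency in Hz
    accuracy: float # pronunciation accuracy score in [0, 1]

    @property
    def duration(self) -> float:
        # derived segment duration in seconds
        return self.end - self.start
```

Higher-level items (syllables, words, phrases) would then be groupings over such records, each carrying its own prosodic attributes.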
The developed tool set will make the preparation of a TTS speech database less time-consuming and will maximize the quality of the synthesized speech.
In the context of this project, several tools have been developed: a new fundamental frequency detection method, an accurate pitch marker, and an improved phonetic segmentation method (see [EHP09], [EP10], [HP10] and [EP11], respectively).
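For orientation, the classic baseline for fundamental frequency detection is the autocorrelation method: pick the lag at which a voiced frame best correlates with itself. The sketch below shows this textbook approach only; it is not the new detector cited above.

```python
import math


def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of one voiced frame by the classic autocorrelation
    method (a textbook baseline, not the project's new detector).

    frame: list of samples; fs: sampling rate in Hz;
    fmin/fmax: plausible F0 search range in Hz.
    """
    n = len(frame)
    # candidate lags corresponding to the plausible F0 range
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), n - 1)
    best_lag, best_r = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        # autocorrelation of the frame at this lag
        r = sum(frame[i] * frame[i - lag] for i in range(lag, n))
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag
```

For a 100 Hz sine wave sampled at 8 kHz, the autocorrelation peaks at a lag of 80 samples, so the estimate comes out at 100 Hz.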
Supported by: KTI (main funding source).