Monday, June 10, 2013

Concatenative Synthesis & Formant Synthesis

There are three types of concatenative synthesis. All of them store pre-recorded speech in memory, retrieve it, and concatenate it to produce the sentence to be spoken. To avoid storing the large number of words needed to synthesize unrestricted text, basic sounds (phonemes) can be stored instead, since they can be combined to form words and sentences.

  •  Unit selection synthesis
Unit selection synthesis works from large databases of recorded speech aligned into units such as diphones, morphemes, words, phrases and sentences. Units are selected and composed so that the output of a TTS system sounds natural.

  •  Diphone synthesis
Diphones are obtained by cutting a speech waveform into phone-sized units, with each cut placed in the middle of a phone so that every diphone preserves the transition between two adjacent phones. The pitch of the stored diphones is not heavily distorted, so the pitch varies from unit to unit. (E.g., to synthesize the word "straight", the six-diphone sequence /#s-st-tr-re-et-t#/ would be used, where # denotes silence.)

  •  Domain specific synthesis
Domain specific synthesis implements very simple voice-over patterns and sequences, and is often found in household electronics. In the domain of animated characters, for example, it has been observed that features occurring in human expression need to be exaggerated in synthetic expression in order to be believable.
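
The diphone example above can be sketched in a few lines of Python. This is only an illustration, not how any particular TTS system does it: the function names `to_diphones` and `concatenate` are made up for this post, the phone labels follow the notation of the "straight" example, and the toy `inventory` holds short lists of numbers standing in for recorded waveforms.

```python
def to_diphones(phones, silence="#"):
    """Return the diphone names for a phone sequence, padded with silence."""
    padded = [silence] + list(phones) + [silence]
    # Each diphone runs from the middle of one phone to the middle of the next,
    # so it is named after the pair of adjacent phones.
    return [left + right for left, right in zip(padded, padded[1:])]


def concatenate(diphones, inventory):
    """Join the stored unit for each diphone into one list of samples."""
    samples = []
    for name in diphones:
        samples.extend(inventory[name])  # KeyError if a diphone was never recorded
    return samples


# Phones for "straight", written the same way as in the example above.
straight = to_diphones(["s", "t", "r", "e", "t"])

# Toy inventory (assumption): two fake samples per diphone.
inventory = {name: [float(i), float(i)] for i, name in enumerate(straight)}
waveform = concatenate(straight, inventory)
```

A real system would also smooth the joins and adjust pitch and duration; plain concatenation as shown here is the simplest possible case.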

Formant Synthesis

Formant synthesis, also called rule-based synthesis, is a filter model based on the acoustic theory of speech production: a source signal passes through a filter that represents the vocal tract transfer function, and the filter is changed over time to create a waveform of artificial speech. The source is a sampling function for voiced speech, and in simpler models the transfer function of the linear filter that models the vocal tract has only poles. The source signals correspond to the sound produced by the vocal cords and to the noise made by pressure variations across a constriction formed in the vocal tract. This approach has been used before in Sega and Atari video games. The resulting speech sounds "inanimate" or "robot-like", and no human speech recordings are involved at run time. Several larger undertakings have used formant synthesizers because of the high degree of control they provide, not only for conveying questions and statements but also for a range of other functions. Formant synthesis is currently in use within the VAESS project.
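
To make the all-pole idea concrete, here is a minimal sketch, assuming a pulse train as the voiced source and a cascade of two-pole resonators (the common digital resonator form y[n] = a·x[n] + b·y[n-1] + c·y[n-2]) as the vocal tract filter. The function names, the 8000 Hz sample rate and the formant frequencies and bandwidths are illustrative assumptions, not values taken from any of the systems mentioned above.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one formant resonator (poles only, no zeros)."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Voiced source (assumption): one unit pulse per pitch period."""
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, f0_hz=100.0, duration_s=0.1, sample_rate=8000):
    """Pass the pulse train through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative formant values only (frequency in Hz, bandwidth in Hz).
vowel = synthesize_vowel([(700.0, 130.0), (1100.0, 70.0), (2500.0, 160.0)])
```

Changing the formant list over time, frame by frame, is what the "rules" in rule-based synthesis would control; this sketch keeps the formants fixed for a single steady sound.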

 • Source Filter Model

The source filter model is the most common of all synthesis techniques. The theory states that the vocal tract can be modeled as a linear filter. The vocal cords have to vibrate to excite this filter, and the resulting sound exits through the lips. All sounds can be filtered afterwards; the finer aspects of the theory are complicated and are best left to specialists.

In a model made by a journalist, it is explained that the source filter model is divided into three separate parts: the source, the filter and lip radiation.
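
The three parts can be sketched as three small functions chained together. This is a rough illustration under stated assumptions: the source is a pulse train (voiced) or random noise (unvoiced), the vocal tract is reduced to a single one-pole filter, and lip radiation is approximated by a first difference, which is a common simplification. All names and parameter values are made up for this post.

```python
import random


def glottal_source(n_samples, period=80, voiced=True, seed=0):
    """Source: one pulse per pitch period if voiced, otherwise white noise."""
    if voiced:
        return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def vocal_tract_filter(signal, pole=0.9):
    """Filter: a one-pole linear filter, y[n] = x[n] + pole * y[n-1]."""
    out = []
    prev = 0.0
    for x in signal:
        prev = x + pole * prev
        out.append(prev)
    return out


def lip_radiation(signal):
    """Lip radiation: first difference, y[n] = x[n] - x[n-1]."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out


def source_filter(n_samples=400, period=80, voiced=True):
    """Chain the three parts: source, then filter, then lip radiation."""
    source = glottal_source(n_samples, period, voiced)
    return lip_radiation(vocal_tract_filter(source))


voiced_sound = source_filter(voiced=True)
unvoiced_sound = source_filter(voiced=False)
```

Keeping the stages separate means the source can be swapped (pulses or noise) without touching the filter, which is the main appeal of the model.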