Friday, June 7, 2013

Speech Synthesis – An Introduction


(1962-1977): The first generation of speech synthesis used formant-based synthesis of phonemes: a phrase was broken down phonetically and mapped to formant frequency contours, which was the state of the art at the time. The synthesis had low precision and naturalness because of the limited resources of that generation, and it was soon replaced by its successors.

(1977-1992): In the second generation of speech synthesis, intelligibility improved with the use of LPC (linear predictive coding) parameters, although many would say that the naturalness of the output remained low. This approach relied on converting appropriate units derived from the text input into speech.

(1992-present): The third generation of speech synthesis can be customized to suit the given application, and uses “Unit Selection Synthesis”, which according to a web-based article was introduced by Sagisaka at ATR Labs in Kyoto. The latest version is available for American and British English, Danish, Finnish, French, German, Icelandic, Italian, Norwegian, Spanish, Swedish, and Dutch.

The Digital Equipment Corporation [3] (DEC) DECtalk system is descended from MITalk and Klattalk and offers nine different voice personalities: four male, four female and one child (depending on the equipment). The present DECtalk structure is based on digital formant synthesis.
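To make the idea of digital formant synthesis more concrete, here is a minimal sketch in Python. It is not DECtalk's implementation; it only illustrates the general source-filter idea: an impulse train at the pitch frequency is passed through a cascade of second-order resonators, one per formant. The function names, the sample rate, and the formant frequencies and bandwidths for an /a/-like vowel are all illustrative assumptions.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Apply one second-order digital resonator (one formant) to a signal.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, f0_hz=120.0, duration_s=0.5, sample_rate=16000):
    """Return normalized samples for a steady vowel (hypothetical helper).

    formants: list of (center frequency in Hz, bandwidth in Hz) pairs.
    """
    n = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    # Voiced source: one impulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Cascade the formant resonators.
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    # Scale so the largest sample has magnitude 1.0.
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]


# Assumed, illustrative formant values for an /a/-like vowel.
VOWEL_A = [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)]
samples = synthesize_vowel(VOWEL_A)
```

The resulting samples lie in the range -1.0 to 1.0 and could, for example, be scaled to 16-bit integers and written out with Python's standard `wave` module. A real formant synthesizer would also vary the formant frequencies over time (the formant contours) and add a noise source for unvoiced sounds.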

Speech synthesis has gained commercial acceptance in modern applications, mainly because of advances made by, and new requirements from, research organizations over the past decade, most of which aim to reduce the cost and time of standard procedures. The product consists of a text-to-speech (TTS) system that converts plain-text input into synthetic speech; additional phonological and acoustic details must be supplied for greater accuracy.

With the large increase in widely used speech databases, simpler applications have been developed that conform to the established standards. While these waveform-based techniques are in great demand, the original TTS systems still in use at many research-based companies have also been improved. Speech applications for both recognition and synthesis have grown steadily as computers have developed. The implemented software or hardware product can also render symbolic linguistic representations, such as phonetic transcriptions, as synthetic speech. Speech synthesizers have also been used to allow people with disabilities to communicate with others.

The Danish scientist Christian Kratzenstein is credited with building a mechanical prototype that produced five distinct sounds (commonly identified with five vowels of the International Phonetic Alphabet). This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Pressburg, Hungary, described in a 1791 paper. This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels.

Further developments include attempts to program emotion into synthesized speech; several small studies have been conducted in an attempt to make progress on emotional speech synthesis. Both engineers and linguists who work in the field of TTS are trying to assemble the large amount of data that TTS requires. A European author claims that the field is generally viewed as dominated by Americans, even though the pioneers of these systems were almost exclusively European.