A Comparison Between Allophone, Syllable, and Diphone Based TTS Systems for Kurdish Language

By Wafa Barkhoda1, Bahram ZahirAzami1, Anvar Bahrampour2, Om-Kolsoom Shahryari1

Abstract– Nowadays, the concatenative method is used in most modern TTS systems to produce artificial speech. The most important challenge in this method is choosing an appropriate unit for creating the database. The unit must guarantee smooth, high-quality speech, and building a database for it must require only reasonable resources and cost. Syllables, phonemes, allophones, and diphones are the units usually used in such systems. In this paper, we implement three synthesis systems for the Kurdish language, based on syllables, allophones, and diphones respectively, and compare their quality using subjective tests.

Keywords- Speech Synthesis; Concatenative Method; Kurdish TTS System; Allophone; Syllable; Diphone.

1. INTRODUCTION

High-quality speech synthesis from electronic text has been a focus of research during the last two decades, and it has led to a widening range of applications. Examples include commercial telephone response systems, natural language computer interfaces, reading machines for blind people and other aids for the handicapped, language learning systems, multimedia applications, and talking books and toys [1].

Most of the existing commercial speech synthesis systems can be classified as either formant synthesizers [2,3] or concatenation synthesizers [4,5]. Formant synthesizers, which are usually controlled by rules, have the advantage of having small footprints at the expense of the quality and naturalness of the synthesized speech [6]. On the other hand, concatenative speech synthesis, using large speech databases, has become popular due to its ability to produce high quality natural speech output [7]. The large footprints of these systems do not present a practical problem for applications where the synthesis engine runs on a server with enough computational power and sufficient storage [7].

Concatenative speech synthesis systems have grown in popularity in recent years. As memory costs have dropped, it has become possible to increase the size of the acoustic inventory that can be used in such a system. The first successful concatenative systems were diphone based [8], with only one diphone unit representing each combination of consecutive phones. An important issue for these systems was how to select, offline, the single best unit of each diphone for inclusion in the acoustic inventory [9,10]. More recently, there has been interest in automating the creation of databases and in allowing multiple instances of particular phones or groups of phones in the database, with the selection decided at run time. A new but related problem has emerged: dynamically choosing the most adequate unit for any particular synthesized utterance [11].

The development and application of text-to-speech synthesis technology for various languages are growing rapidly [12,13]. The design of a synthesizer depends largely on the structure of the target language. In addition, there can be variations (dialects) particular to geographic regions. Designing a synthesizer therefore requires significant investigation into the structure of the language and the linguistics of a given region.

Text-to-Speech systems have been researched extensively for many languages, and commercial systems are available for some of them. CHATR [14,15] and AT&T NEXT GEN [16] are two examples for English. Much effort has also gone into other languages, such as French [17,18], Arabic [4,19,20], Norwegian [21], Korean [22], Greek [23], and Persian [24-27].

The area of Kurdish Text-to-Speech (TTS) is still in its infancy, and compared to other languages, little research has been carried out on this language. To the best of our knowledge, no serious academic research has yet been performed on the various branches of Kurdish language processing (recognition, synthesis, etc.) [28,29].

Kurdish is one of the Iranian languages, which are a subcategory of the Indo-European family [30,31]. The Kurdish phoneme inventory consists of 24 consonants, 4 semivowels, and 6 vowels. In addition, /ع/, /ح/, and /غ/ have entered Kurdish from Arabic. The language has two scripts: the first is a modified Arabic alphabet and the second is a modified Latin alphabet [32,33]. For example, “trifa”, which means “moon light” in Kurdish, is written as /تريفه/ in the Arabic script and as “tirîfe” in the Latin script. Both scripts are in use, and both suffer from some problems. In the Arabic script, the phoneme /i/ is not written, and both /w/ and /u/ are written with the same sign /و/ [32,33]. The Latin script does not have the Arabic phoneme /ئ/, and it has no standard written sign for foreign phonemes [33].

In concatenative systems, one of the most important challenges is selecting an appropriate unit for concatenation. Each unit has its own advantages and disadvantages, and might be appropriate for a specific system. In this paper, we develop three concatenative TTS systems for the Kurdish language, based on syllables, allophones, and diphones, and compare these systems in terms of intelligibility, naturalness, and overall quality.
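To make the notion of a diphone unit concrete, the following minimal Python sketch splits a phone sequence into the diphones needed to cover it. It is an illustration only, not the implementation used in this paper. The function names, the silence symbol `_`, and the use of the Latin-script letters of "tirîfe" as phone symbols are assumptions made for the example. The count of 34 phones is simply the sum of the consonants, semivowels, and vowels listed above, and the resulting inventory size is a loose upper bound, since many phone pairs never occur in real speech.

```python
from typing import List

SILENCE = "_"  # hypothetical symbol for the pause at utterance boundaries


def to_diphones(phones: List[str]) -> List[str]:
    """Return the diphone units that cover a phone sequence.

    Each diphone spans one pair of consecutive phones; the sequence is
    padded with silence so the first and last phones are covered too.
    """
    padded = [SILENCE] + list(phones) + [SILENCE]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def diphone_inventory_bound(n_phones: int) -> int:
    """Loose upper bound on distinct diphones: all ordered pairs of
    symbols, counting silence as one extra symbol."""
    return (n_phones + 1) ** 2


# Illustrative phone sequence (assumed one phone per Latin-script letter).
phones = ["t", "i", "r", "î", "f", "e"]
print(to_diphones(phones))
print(diphone_inventory_bound(24 + 4 + 6))
```

A word of six phones thus needs seven diphone units, whereas an allophone based system would concatenate six units, and a syllable based system fewer but longer units drawn from a much larger inventory.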

The rest of the paper is organized as follows: Section 2 introduces the allophone based TTS system. Sections 3 and 4 present the syllable and diphone based systems, respectively. A comparison between these systems and the quality test results are presented in Section 5. Conclusions are drawn in Section 6.

 

Wafa Barkhoda1, Bahram ZahirAzami1, Anvar Bahrampour2, Om-Kolsoom Shahryari1

1Department of Computer, University of Kurdistan Sanandaj, Iran

2Department of Computer, Islamic Azad University Sanandaj, Iran

{w.barkhoda, zahir, shahryari.kolsoom}@ieee.org , bahrampour58@gmail.com

University of Kurdistan, Sanandaj, 200x

 
