The Syllable level coding scheme

Basic principles underlying the encoding scheme incorporated into the IITM Software are discussed in this document.

Introduction

  The multilingual software developed at IIT Madras employs a syllable level coding scheme for representing text in Indian languages. This choice is based on linguistic considerations as well as the fact that the writing systems are syllabic in nature, i.e., text written in any of the Indian Scripts follows the rules for writing syllables and not the basic letters constituting the consonants and vowels.

  The IITM scheme is uniform across all the languages/scripts of India. A superset of basic Aksharas has been identified and codes assigned for them. In simple terms, an Akshara is a syllable and may consist of just a vowel, a consonant, a consonant vowel combination or, in general, a combination of two or more consonants and a vowel. Approximately 12000 Aksharas, found to be in use across the languages, have been identified for this set. While it may be argued that this may be restrictive, the coding scheme does allow arbitrary syllables to be represented using the generic forms of consonants, consistent with linguistic requirements.

  Each syllable is represented through a 16 bit fixed length code. This scheme also provides for representations along the lines of variable length codes for syllables as in ISCII or Unicode. However, the fixed length code allows very easy text processing compared to the variable length schemes.
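
  To make the contrast with variable length schemes concrete, the short Python sketch below counts the storage Unicode needs for a single Devanagari syllable ("kshi"). The code point values are standard Unicode; the constant IITM_BYTES_PER_SYLLABLE is only an illustrative name for the 16 bit size stated above.

```python
# One Devanagari syllable, "kshi", as Unicode stores it:
# KA + VIRAMA + SSA + VOWEL SIGN I (four code points).
kshi = "\u0915\u094D\u0937\u093F"

code_points = len(kshi)                      # number of code points
utf8_bytes = len(kshi.encode("utf-8"))       # 3 bytes per code point here
utf16_bytes = len(kshi.encode("utf-16-le"))  # 2 bytes per code point here

# Under the fixed length scheme described in this document, a syllable
# in the coded set occupies exactly one 16 bit unit.
IITM_BYTES_PER_SYLLABLE = 2  # illustrative name, not from the IITM software

print(code_points, utf8_bytes, utf16_bytes, IITM_BYTES_PER_SYLLABLE)
```

  With a fixed length code, the n-th syllable of a text is simply the n-th 16 bit unit, whereas in a variable length scheme syllable boundaries have to be found by scanning.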

  The Syllable Level Coding Scheme is a natural choice for languages which employ a syllabic writing system. IIT Madras has proposed this coding scheme as a single unifying method for computing with all the Indian languages including Urdu. The scheme is open to question however and has been termed proprietary by groups developing applications on the basis of Unicode or ISCII. 

  The following sections present the list of consonants and vowels which can form syllables. This is a superset of the basic Aksharas from  all the languages. Devanagari script is used for illustration but where required, other scripts are also used.

  Each consonant and vowel is assigned an individual number corresponding to its lexical ordering. This numbering scheme is meant to identify consonants and vowels within the set of the basic Aksharas. The development team had arrived at this numbering after examining many issues relating to linguistic conformity.

  In the table below, each Akshara has also been assigned an English name that will be used to refer to the Akshara in a computer program (similar to a name assigned to a literal).




Vowels

Sixteen vowels are included. The last one in the list is the "null" vowel. This definition allows us to treat a generic consonant as a syllable with a null vowel for purposes of text processing. The null vowel plays the role of the "halanth".

Vowel 15, as mentioned above, should be viewed as a "null" vowel. A consonant combining with the "null" vowel is equivalent to its generic form. In the IITM scheme, a consonant is always viewed as one with the first vowel "ah" as part of it. This is not strictly correct from a linguistic point of view, since the generic consonant is defined as one without a vowel. However, it is not easy to pronounce a generic consonant by itself; hence the convention that the consonant is treated as a syllable with the first vowel "ah". The generic consonant also gets viewed as a syllable when the null vowel is part of it. This is a convenient representation. When writing a syllable, the rules always permit writing it as a series of generic consonants except for the last consonant in the syllable. This will become clear later.




Consonants

In Indian languages, consonants are grouped into sets based on the physiological basis for the production of the sound the consonant stands for. There are basically seven groups.

 
  Consonants 39-42 apply in the case of Southern languages, though "lla" is also seen in Sanskrit, Gujarati etc. "nas" is specific to Tamil.

  "ksha" (38) is actually a conjunct but due to its high frequency of use, it has been assigned a value. In the IITM system, this is treated as a consonant but handled by special rules when lexicographic ordering is effected.




Special consonants

  Apart from these 42, four additional consonants have also been defined. These are not consonants per se but are viewed as pseudo consonants which can form syllables that may correspond not to a sound but to a shape. In all the writing systems, special symbols are present which do not have linguistic value. They often represent notation used in accountancy, poetry, Vedic texts etc.

These four additional consonants are named

"visarg" (43),  "music" (44), "vedic" (45) and "null" (46)

  The actual use of these consonants will become apparent later.

  In addition to the vowels and consonants, the scheme provides codes for 16 special marks which are basically punctuation symbols. Two of the sixteen are used as "Anuswars" and one is reserved for the "Avagraha" symbol. These 16 are reckoned as special syllables formed by an imaginary consonant with the 16 vowels. This imaginary consonant has been assigned the value 63.

  Ten numerals have also been assigned special codes. These distinguish the local numerals from their ASCII equivalents. The ten numerals are viewed as special syllables using the imaginary consonant mentioned above.

  All the symbols seen in the 96 character displayable ASCII have also been assigned codes. Each Roman letter, punctuation mark or special character is viewed as a syllable involving an imaginary "Roman consonant". This imaginary Roman consonant has been assigned the value 62.

  Thus the basic set of aksharas supported in the IITM scheme consists of 16 vowels, 42 basic consonants, 4 special consonants and two imaginary consonants, one for Roman letters and the other for special symbols.




Representing Syllables
  The form of a syllable can be any one of the following.

V  - a pure vowel
C  - a pure consonant (generic consonant with "ah")
CV - a consonant vowel combination
CCV - two consonant conjuncts
CCCV - three consonant conjuncts

  The IITM scheme caters to all possible V, C and CV combinations and select combinations for the CCV and CCCV forms. About 800 of these have been defined after studying the syllables in use across all the Indian languages. 

  In the scheme, at most 31 conjuncts can be specified for each base consonant C, so the number of conjunct forms that can begin with any one of the 42 consonants above is limited to 31. This does restrict the number of syllables one can represent through a single code (2 bytes). In practice, however, this does not appear to be a problem.

  The IITM scheme does not provide a single code for four consonant conjuncts and above though many such conjuncts are in use. These have to be handled specially in ways that also provide linguistic conformity. 




The Syllable representation Scheme

   Each syllable is represented as a triple (c, cj, v), where c is the base consonant, cj is the conjunct part consisting of one or more consonants, and v is the vowel. The triple is accommodated in a 15 bit field divided into 6, 5 and 4 bit fields as shown.

MSB (1 bit) | Consonant (6 bits) | Conjunct (5 bits) | Vowel (4 bits)

  The most significant bit is not a part of the syllable. It indicates whether the remaining fifteen bits represent a syllable or an attribute/escape value. For valid displayable syllables, this bit is zero. When it is set to one, the remaining 15 bits carry additional information about the language/script to be used for the succeeding syllables.
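
  The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the IITM implementation: the function names are hypothetical, and it assumes that a conjunct value of 0 means "no conjunct", which the text does not state explicitly. The consonant number 38 ("ksha") and the null vowel 15 are taken from this document.

```python
NULL_VOWEL = 15      # vowel 15 is the "null" vowel
ESCAPE_BIT = 0x8000  # most significant bit of the 16 bit code

def pack_syllable(c, cj, v):
    """Pack (c, cj, v) into a 16 bit code with the MSB clear.

    c: 6 bit consonant, cj: 5 bit conjunct, v: 4 bit vowel.
    """
    if not (0 <= c < 64 and 0 <= cj < 32 and 0 <= v < 16):
        raise ValueError("field out of range")
    return (c << 9) | (cj << 4) | v

def is_syllable(code):
    """True when the MSB is zero, i.e. the code is a displayable syllable."""
    return (code & ESCAPE_BIT) == 0

def unpack_syllable(code):
    """Split a syllable code back into its (c, cj, v) triple."""
    if not is_syllable(code):
        raise ValueError("MSB set: attribute/escape value, not a syllable")
    return (code >> 9) & 0x3F, (code >> 4) & 0x1F, code & 0x0F

# Generic "ksha" (consonant 38) with the null vowel.
# Assumption: conjunct value 0 stands for "no conjunct".
code = pack_syllable(38, 0, NULL_VOWEL)
print(hex(code), unpack_syllable(code))
```

  Because the fields sit at fixed bit positions, the base consonant and the vowel can be read off with a shift and a mask, without scanning neighbouring codes.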

  The interpretation of the 6 bit consonant field as well as the four bit vowel field is fairly obvious. The intermediate 5 bit value needs some explanation.

Codes for  Samyuktakshars

  For each base consonant specified, one may list the set of syllables seen in normal use across the languages. Up to 31 of these are assigned values. For many base consonants, this set may be quite small, with as few as seven syllables. The specific set of two and three consonant syllables starting with a base consonant is lexicographically ordered and a number between 1 and 31 is assigned to each combination. This process is best illustrated through an example.

  Seen below is the list of conjuncts starting with "ga". It should be kept in mind that the syllables listed here are the ones for which codes have been assigned. It is certainly possible that the list is not exhaustive and that other syllables starting with "ga" have been omitted. The understanding here is that the ordering conforms to the lexical ordering of the samyuktakshars.


  We observe that the triple directly allows us to see the base consonant as well as the vowel. Inferring the consonant or the consonants in the conjunct part requires a look up through a table.
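
  The numbering of conjuncts and the table look up can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the consonant numbers in the sample data are placeholders rather than entries from generic.cnj, and sorting by consonant number is assumed to reproduce the lexical ordering (the numbering of consonants is described earlier as following that ordering).

```python
MAX_CONJUNCTS = 31  # values 1..31 of the 5 bit conjunct field

def assign_conjunct_codes(tails):
    """Number the conjunct tails of one base consonant from 1, in sorted order.

    Each tail is a tuple of the consonant numbers that follow the base
    consonant. Returns a look up table {conjunct code: tail}.
    """
    ordered = sorted(set(tails))
    if len(ordered) > MAX_CONJUNCTS:
        raise ValueError("at most 31 conjuncts per base consonant")
    return {code: tail for code, tail in enumerate(ordered, start=1)}

# Hypothetical tails for some base consonant; placeholder numbers only.
table = assign_conjunct_codes([(27,), (3, 27), (3,), (20,)])

# Inferring the consonants of a conjunct is then a table look up.
print(table[2])
```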

  When performing "Regular Expression Matching", we can gain a lot of flexibility by masking the conjunct part or the vowel part or both and identify strings that sound similar. In other words, the IITM scheme is very well suited for regular expression matching at the syllable level.
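
  A minimal sketch of such masking is given below. The function and constant names are hypothetical; the bit positions follow the 6, 5 and 4 bit layout described above, and the sample codes use consonant 38 with placeholder conjunct and vowel values.

```python
CONJUNCT_MASK = 0x1F << 4  # the 5 bit conjunct field
VOWEL_MASK = 0x0F          # the 4 bit vowel field

def matches(a, b, ignore_conjunct=False, ignore_vowel=False):
    """Compare two 16 bit codes, optionally ignoring some fields."""
    mask = 0xFFFF  # the MSB is kept, so escape values never match syllables
    if ignore_conjunct:
        mask &= ~CONJUNCT_MASK
    if ignore_vowel:
        mask &= ~VOWEL_MASK
    return (a & mask) == (b & mask)

# Same base consonant (38); placeholder conjunct and vowel values.
a = (38 << 9) | (0 << 4) | 1
b = (38 << 9) | (0 << 4) | 5
c = (38 << 9) | (3 << 4) | 1

print(matches(a, b))                        # differ in the vowel
print(matches(a, b, ignore_vowel=True))     # same consonant and conjunct
print(matches(a, c, ignore_conjunct=True))  # same consonant and vowel
```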

  The full set of conjuncts supported by the software is specified in a text file referred to as generic.cnj and the complete set of basic consonants, vowels and the special symbols are specified in independent files. These files are text files which are used in generating the syllable level codes.

generic.vow : The set of 16 vowels with their codes and the key stroke associated with each vowel. The key here refers to the ASCII value of the Roman letter that should be used for typing in the vowel.

generic.con : The set of 46 consonants (the 42 which are linguistically significant, the three meant for special cases, and the null consonant). The structure of this file is similar to that of generic.vow.

generic.spl : The set of 16 special characters which may be typed in by way of punctuation marks and special aksharas.

generic.cnj: The listing of conjuncts which have been assigned codes. The list is presented in the order of the base consonants. For any base consonant only 31 conjuncts are allowed.

  All four files above are plain ASCII text files and can therefore be modified to suit specific requirements without recompiling the applications that deal with syllables. The files are converted to the appropriate data structures, which the applications read in from external files.

  The software does not hard code the keyboard mapping for a vowel, consonant or a special symbol. It is therefore possible to reassign the keys to suit specific requirements.

  The recommended keyboard mapping is shown below. This is based on phonetic Roman equivalents, to the extent possible with 26 letters and about 16 special symbols and punctuation.

Use of the null consonant

  The null consonant is useful for generating syllables which conform to specific display shapes without disturbing the linguistic content. A syllable starting with a null consonant will have the following triplet. 

  (46, cj, v)  

with  cj and v taking their respective range of values. (46,cj,15) will correctly display the consonant in cj through its half form in Devanagari derived scripts and the smaller shape that appears below a consonant in the Southern scripts. The halanth should be specified for a pure half form since the linguistic equivalent for the half form is a generic consonant. In the Southern scripts, the equivalent of the half form is the consonant which appears above and the consonant appearing below will be the one that takes the vowel.

  Just as we have generic consonants, we can also have generic syllables, i.e., a combination of consonants only. Such a generic syllable can form part of a full syllable and the full syllable obtained by adding a vowel. Generic syllables may be typed in a sequence to form arbitrarily long syllables. With some care, such sequences can conform to linguistic requirements except when the writing system changes the order in which the consonants are displayed. This is seen mostly with "r" and the rules vary widely across the scripts. The writing system used for Tamil employs only generic consonant shapes for conjunct aksharas.

Null consonant with a vowel

  The null consonant can take a vowel by itself and this may be used to represent the Matras. One observes that while Matras by themselves do not have any linguistic value, the  standalone symbol for the Matra is required in practice, if only to teach the rules of the writing system.

  The roles played by the null consonant and the null vowel are now clear.

  The null vowel in a syllable represents a generic consonant or a generic conjunct, depending on the contents of the two fields (the 6 bit consonant part and the 5 bit conjunct part).

  The null consonant in a syllable is a representation of a generic consonant, specified by the conjunct part. This is a provision made in the IITM coding scheme for representing alternate display shapes for a generic consonant in a syllable. It may be viewed as a trick to generate suitable displayed forms for arbitrary syllables while maintaining linguistic content; in effect, it amounts to composing syllable shapes.

  It may be noted that the set of syllables defined in the IITM software (where each syllable is coded as exactly two bytes) is already comprehensive. Enough code space is available for adding some more but this will result in incorrect display of some aksharas in the text prepared with an earlier set of codes. Only the Samyuktakshars will be affected however.




Special Consonants Visarg, Music and Vedic

   The three special consonants Visarg (code 43), Music (code 44) and Vedic (code 45) permit the generation of special symbols such as Vedic accent marks, musical notation etc. Also, the Visarg consonant is required in practice as a standalone feature to handle syllables that already have a vowel.

  The three vowels not included in the main set of vowels (long vocalic r, vocalic l and its long form) can be typed in as special syllables using Visarg and the vowels ru, uh and ouh. It is true that the linguistic structure breaks when a syllable is composed like this, but since the use of these vowels is quite rare, applications can handle the situation as a special case. The special consonant Music was included to permit music notation to be handled by the software. Currently, this consonant is used to generate 16 special syllables for use with Braille.

  Details of these special consonants may be seen in the documents included with the applications.



Some general observations on the syllable level codes
 
  • Only 15 bits are used for each syllable. The sixteenth bit specifies how the next 15 bits should be interpreted. When the 16th bit is set to 1, the next 15 bits specify the script to be associated with the syllables which follow till another switch occurs.
  • By and large, the coding scheme maintains the correct lexical ordering of the syllables. In fact standard sorting algorithms may be used without problems.

  •   It should be remembered that lexicographic ordering is not precisely defined for any Indian language. There are several opinions on this. In practice it may be necessary to map each syllable to an appropriately ordered value before sorting. The algorithm for this is really quite simple since one is dealing with fixed length codes.

  • Though the 15 bits provide for as many as 32,768  syllables, only about 12000 are meaningful in practice.
  • The size of text is significantly reduced in terms of number of bytes stored compared to other schemes involving ASCII or Unicode. In the following examples, two or more representations are given, where different transliteration rules are applied.
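
  The remark above about mapping each syllable to an appropriately ordered value before sorting can be sketched as below. This is a minimal illustration, not the IITM algorithm: the function names are hypothetical, the reordering of consonant 38 is an arbitrary placeholder for whatever ordering convention is chosen, and words are assumed to contain syllable codes only (MSB clear).

```python
def build_rank(order):
    """Map each consonant number to its position in the desired ordering."""
    return {c: i for i, c in enumerate(order)}

def sort_key(code, rank):
    """Replace the 6 bit consonant field by its rank; keep conjunct and vowel."""
    c = (code >> 9) & 0x3F
    rest = code & 0x01FF
    return (rank[c] << 9) | rest

def word_key(word, rank):
    """Sort key for a word given as a list of 16 bit syllable codes."""
    return [sort_key(code, rank) for code in word]

# Hypothetical ordering: the default numbering, except that consonant 38
# is moved to an earlier position. The placement is a placeholder only.
order = list(range(64))
order.remove(38)
order.insert(2, 38)
rank = build_rank(order)

words = [[5 << 9], [38 << 9]]
print(sorted(words))                                    # raw code order
print(sorted(words, key=lambda w: word_key(w, rank)))   # remapped order
```

  Since every syllable is one fixed length code, the mapping is a single table look up per code, and any standard sorting routine can then be applied to the mapped values.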
  The Syllable level coding scheme has also been extended to handle Arabic, Urdu, Hebrew and Avestan. These scripts are written right to left but follow the syllabic writing method. Discussion on this is included in a separate page.
     
     

Last updated on 11/07/12