Existing standards for codes in respect of Indian Scripts
  Internal representation of text in Indian languages may be viewed as the problem of assigning codes to the aksharas of the languages. The complexities of the syllabic writing systems in use have made it difficult to standardize internal representations. TeX was an inspiration in the late 1980s, but TeX was better suited to typesetting than to text processing per se. In the absence of appropriate fonts, interactive applications could not be attempted. When fonts became available, applications simply used the glyph positions as the codes, and the number of glyphs was restricted because the fonts were eight bit fonts.

The following representations still apply as many applications have been written to use one or the other. It must be remembered that these representations primarily address the issue of internal representation for rendering text. 

Use of Roman letters with diacritic marks
ISCII codes
Unicode for Indian Scripts. 
ISFOC standard from CDAC

    Of the above, the first has been discussed in the section on Transliteration principles. The ISFOC standard applies more to the standardization of fonts for different scripts and cannot really be thought of as an encoding standard. We confine our discussion in this section to ISCII and Unicode. A brief note on ISFOC will be found in a separate page.



About ISCII

About Unicode for Indian Languages

Detailed Discussion of Unicode for Indian Languages


Report from CDAC on character encoding standards for Indian Scripts

Multilingualism and the Internet
(A good exposition of issues dealing with multilingual information on the web)

 

Indian Script Code for Information Interchange (ISCII)
    ISCII was proposed in the eighties and a suitable standard was evolved by 1991. Here are the salient aspects of the ISCII representation.
  • It is a single representation for all the Indian Scripts. 
  • Codes have been assigned in the upper ASCII region (160 - 255) for the aksharas of the languages.
  • The scheme also assigns codes for the Matras (vowel extensions). 
  • Special characters have been included to specify how a consonant in a syllable should be rendered. Rendering of Devanagari has been kept in mind.
  • A special Attribute character has been included to identify the script to be used in rendering specific sections of the text.
Shown below is the basic assignment in the form of a table. Six columns of 16 assignments each start at the hexadecimal value A0, which is equivalent to decimal 160. In the table, some code values have not been assigned. There is also a version of this table known as PC-ISCII, in which no characters are defined in the range 176-223. In PC-ISCII, the first three columns of the ISCII-91 table have been shifted to the starting location of 128. PC-ISCII has been used in many applications based on the GIST Card, a hardware adapter which supported Indian language applications on an IBM PC.
ISCII Code Assignments

The following observations are made.

    1. The ISCII code is reasonably well suited for representing the syllables of Indian languages, though one must remember that a multiple byte representation is inevitable, which could vary from one byte to as many as 10 bytes for a syllable. 

    2. The ISCII code has effected a compromise in grouping the consonants of the languages into a common set that does not preserve the true sorting order of the aksharas across the languages. Specifically, some aksharas of Tamil, Malayalam and Telugu are out of place in the assignment of codes. 

    3. The ISCII code provides for some tricks to be used in representing some aksharas, specifically the case of Devanagari aksharas representing Persian letters. ISCII uses a concept known as the Nukta Character to indicate the required akshara.

    4. When forming conjuncts, ISCII specifications require that the halanth character be used once or twice depending on whether the halanth form of the consonant or the half form of the consonant is present. This results in more than one internal representation for the same syllable. Also, ISCII provides for the concept of the soft halanth as well as an invisible consonant to handle representations of special letters. Parsing a text string made up of ISCII codes is a fairly complex problem requiring a state machine which is also language dependent. This is a consequence of the observation that languages like Tamil do not support conjuncts made up of three or more differing consonants. In fact it is stated that Tamil has no conjunct aksharas. What is probably implied here is that a syllable in Tamil is always split into its basic consonants and the Matra. Several decades ago, Tamil writing on palm leaves did show geminated consonants in a special form.

     Though representation at the level of a syllable is possible in ISCII, processing a syllable  can become quite complex, i.e., linguistic processing may pose specific difficulties due to the variable length codes for syllables. 

    5. The code assignments, though language independent, do not admit of clean and error free transliteration across languages especially into Tamil from Devanagari. 

    6. It is difficult to perform a check on an ISCII string to see if arbitrary syllables are present. Though theoretically many syllables are possible, in practice the set is limited to about 600 - 800 basic syllables which can also combine with all the vowels. The standard provides for arbitrary syllables to handle cases where new words may be introduced in the language or syllables from other languages are to be handled.

    It must be stated here that ISCII represents the very first attempt at syllable level coding of Indian Language aksharas. Unfortunately, outside of CDAC which promoted ISCII through their  GIST technology, very few seem to use ISCII.

     ISCII codes have nothing to do with fonts, and a given text in ISCII may be displayed using many different fonts for the same script. This requires specific rendering software which can map the ISCII codes to the glyphs in a matching font for the script. Multibyte syllables have to be mapped into multiple glyphs in a font dependent and language dependent manner. It is primarily this complexity that has rendered ISCII less popular. Details of ISCII are covered in the Bureau of Indian Standards document No. IS:13194-1991.

Shown below are some examples of strings in Devanagari and other scripts along with their ISCII representations.

ISCII Glyphs

Unicode for Indian Languages

  Unicode was the first attempt at producing a standard for multilingual documents. Unicode owes its origin to the concept of the ASCII code extended to accommodate International Languages and scripts. 

    Short character codes ( 7 bits or 8 bits) are adequate to represent the letters of the alphabets of many languages of the world. The fundamental idea behind Unicode is that a superset of characters from all the different languages/scripts of the world be formed so that a single coding scheme could effectively handle almost all the alphabets of all the languages. What this implies is that the different scripts used in the writing systems followed by different languages be accommodated in the coding scheme. In Unicode more than 65000 different characters can be referenced. This large set includes not only the letters of the alphabet from many different languages of the world but also punctuation, special shapes such as mathematical symbols, Currency symbols etc. The term Code Space is often used to refer to the full set of codes and in Unicode, the Code space is divided into consecutive regions spanning typically 128 code values. Essentially this assignment retains the ordering of the characters within the assigned group and is therefore very similar to the ASCII assignments which were in vogue earlier. 

  Unicode assignments may be viewed geometrically as a stack of planes, each plane having one or possibly several chunks of 128 consecutive code values. Logically related characters or symbols have been grouped together in Unicode to span one or more regions of 128 code values. We may view these regions as different planes in the Code Space as illustrated in the figure below. Data processing software using Unicode will be able to identify the script of the text for each character by identifying the plane the character is located in, and can then use an appropriate font to display it or invoke some meaningful linguistic processing.
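
  The idea of locating the script of a character from its position in the Code Space can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the table covers just the nine Indic regions of 128 code values each, with starting values taken from the Unicode code charts, and the names INDIC_REGIONS and script_of are ours and not part of any standard library.

```python
# Starting code value of each 128-value region (illustrative subset only).
INDIC_REGIONS = {
    0x0900: "Devanagari",
    0x0980: "Bengali",
    0x0A00: "Gurmukhi",
    0x0A80: "Gujarati",
    0x0B00: "Oriya",
    0x0B80: "Tamil",
    0x0C00: "Telugu",
    0x0C80: "Kannada",
    0x0D00: "Malayalam",
}

def script_of(ch):
    """Return the script name for one character, or "other" if not in the table."""
    # Clearing the low 7 bits gives the start of the 128-value region.
    region_start = ord(ch) & ~0x7F
    return INDIC_REGIONS.get(region_start, "other")

if __name__ == "__main__":
    for ch in ("\u0915", "\u0BA4", "A"):
        print(f"U+{ord(ch):04X}", script_of(ch))
```

  A lookup of this kind tells an application which font or processing module to invoke for a character. It says nothing about how the codes group into syllables.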
 

Unicode Assignments

    Technically, Unicode can handle many more languages than the supported scripts if these languages use the same script in their writing systems. By consolidating a complete set of symbols used in the writing systems across a family of languages, one can get a script that caters to all of them. The Latin script with its supplementary characters and extended symbols has about 550 different characters, and this is quite adequate to handle almost anything that has appeared in print in respect of the Latin script. Hence in the geometrical view above, some planes may be larger (wider) than others and more than one script could have characters from logically similar groups specified in a plane.

  The fact that several languages/scripts of the world require many more than 128 codes has necessitated assignments of more than one basic plane (i.e., multiples of 128 code values) for them. Languages such as Greek, Arabic or Chinese  have larger planes assigned to them. In particular, Unicode has allowed nearly 20000 characters of Chinese, Japanese and Korean scripts to be included in a contiguous region of the Code Space. Currently fewer than a hundred different groups of symbols or specific scripts are included in Unicode. 

   Even though Unicode was originally designed as a sixteen bit code and can therefore handle more than 65000 code values, it should not be viewed as a scheme which allows several thousand characters for each and every language. As a general observation, it provides fewer than 128 characters for many scripts, since many languages do not require more than 128 characters to display text.

  In respect of Indian languages which use syllabic writing systems, one might think that Unicode would have provided several thousands of codes for the syllables similar to the nearly 11000 Hangul syllables already included. On the contrary, Unicode has pretty much accepted the concept behind ISCII and has provided only for the most basic units of the writing systems which include the vowels, consonants and the vowel modifiers. 

  Unlike ISCII, which has a uniform coding scheme for all the languages, Unicode has provided individual planes for the nine major scripts of India. Within these planes of 128 code values each, assignments are language specific though the ISCII base has been more or less retained. Consequently, Unicode suffers from the same limitations that ISCII runs into. There are some questionable assignments in Unicode in respect of Matras. A Matra is not a character by itself. It is a representation of a combination of a vowel and consonant, in other words the representation of a medial vowel. A vowel and NOT its Matra is the basic linguistic unit. Consequently, linguistic processing of Indian languages will be difficult with Unicode, just as with ISCII.

   Here is the Unicode assignment for Sanskrit (Devanagari). The language code for Sanskrit (Devanagari) is 09 (hex) and so the codes span the range 0901 to 097F (hexadecimal values). In this chart, the characters of Devanagari with a dot beneath are grouped in the range 0958 to 095F. These are the characters used in Hindi which are derived from Persian and are seen in Urdu as well. Likewise, the letters in locations 0929, 0931 and 0934 are dotted. The codes are similar to ISCII in ordering, but Unicode includes characters not specified in ISCII. Also, the assignments for each language more or less adhere to the same relative locations for the basic vowels and consonants as in ISCII but include many language dependent codes. The code positions in Unicode will not exactly match the corresponding ISCII assignments.
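
   The relationship between the dotted letters and their base consonants can be inspected with the unicodedata module from the Python standard library, which exposes the canonical decompositions recorded in the Unicode Character Database. The following is a small sketch for illustration. The helper name base_and_nukta is hypothetical, and only four of the dotted code values mentioned above are sampled.

```python
import unicodedata

def base_and_nukta(ch):
    """Return the canonical decomposition of ch as a list of code values.

    The list is empty when the character has no canonical decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    # Compatibility decompositions begin with a tag such as "<compat>"; skip them.
    if fields and fields[0].startswith("<"):
        return []
    return [int(f, 16) for f in fields]

if __name__ == "__main__":
    for cp in (0x0929, 0x0931, 0x0934, 0x0958):
        parts = base_and_nukta(chr(cp))
        print(f"U+{cp:04X} ->", " ".join(f"U+{p:04X}" for p in parts))
```

   Each dotted letter decomposes into a base consonant followed by the nukta sign at 093C, so the same akshara can appear in a text either as one code or as two. Software comparing or searching strings has to allow for both forms.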

Unicode Assignments

  Shown below are the Unicode representations for some strings in different scripts. These are the same strings shown earlier under ISCII. 

Unicode Glyphs

  From the discussion above, it will be seen that ISCII and Unicode provide multibyte representations for syllables. This is not unlike the case for English and other European languages where syllables are shown only with the basic letters of the Alphabet. However, in all the writing systems used in India, each syllable is individually identifiable through a unique shape and one has to provide for thousands of shapes while rendering text. 
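
  The variable length of a syllable in Unicode can be made concrete with a short Python sketch. It is given only as an illustration: the function name code_lengths is ours, and the sample is the single Devanagari syllable "strii" written with escape sequences so that the individual codes are visible.

```python
def code_lengths(syllable):
    """Count the codes and encoded bytes needed for one syllable."""
    return {
        "code_points": len(syllable),
        "utf8_bytes": len(syllable.encode("utf-8")),
        "utf16_bytes": len(syllable.encode("utf-16-le")),
    }

# sa + halanth + ta + halanth + ra + Matra for ii: one syllable on the screen.
STRII = "\u0938\u094D\u0924\u094D\u0930\u0940"

if __name__ == "__main__":
    print(code_lengths(STRII))
    print(code_lengths("\u0915"))  # a lone consonant, for comparison
```

  A single displayed syllable here takes six codes, while a lone consonant takes one. Any processing at the syllable level therefore has to work with units of varying size.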

  While these thousands of shapes may be composed from a much smaller set of basic shapes for the vowels, consonants and vowel modifiers, one must admit that several hundreds of syllables have unique shapes which cannot be derived by putting together the basic shapes. It is estimated that in practice, more than 600 different glyphs would be required to adequately represent all the different syllables in most of the scripts. The main problem in dealing with Unicode for Indian languages/scripts has to do with the mapping between a multibyte code for a syllable and its displayed shape. This is a very complex issue requiring further understanding of rendering rules. A full discussion of it would require that the viewer understand the intricacies of the writing systems of India. We cover this in a separate page.



Unicode Related Information

Many questions relating to Unicode for Indic Scripts have been answered at the Unicode web site.

__________________

The site maintained by Alan Wood provides extensive coverage of Unicode including Unicode resources for Indian languages
www.alanwood.net/unicode/

__________________

UTF-8 is the method used for moving Unicode data across systems and for displaying Unicode encoded documents in web browsers. An excellent discussion is provided by Markus Kuhn.

__________________

While most people say exciting things about Unicode, there are a few who share our concern about its weak points. Here are some observations made by an expert.
__________________

Specific technical problems with ISCII and Unicode.

    It must be observed, in the light of the above discussion, that displaying a Unicode string in an Indian language requires a complex piece of processing software to identify the syllables and get the corresponding glyphs from an appropriate font for the script. The multibyte nature of Unicode (for a syllable) makes a table driven approach to this quite difficult. Even though it is possible to write modules which can go from Unicode to the display of text using some font, one faces a formidable problem in respect of data entry, where the formation of syllables from multiple key sequences is truly overwhelming. With the limited number of keys available on standard keyboards, it is often not possible to accommodate all the symbols one would require to produce meaningful printouts in each script consistent with quality typesetting systems.

  Unicode based applications employ the concept of "Locales" to permit data entry of multilingual text. Each Locale is associated with its own keyboard mapping and application software can switch Locales to permit data entry of multilingual text. It will be seen that for Indian scripts, the Locales themselves have limitations since they do not permit a full complement of  letters and special characters to be typed in, much less the standard punctuation that has become part of Indian scripts today.

    While it is possible to write special keyboard driver programs which implement a state machine to handle key sequences to produce syllables, the approach is not universal enough to be included in operating systems, certainly not when a single driver should cater to all the Indian scripts. There is no meaning in having a Hindi version of an OS with its own data entry convention which differs substantially from that of a Tamil or Telugu version.

  Here is a summary of the issues that confront us when dealing with Unicode for Indian scripts.

  • Rendering text in a manner that is uniform across applications is quite difficult. Windowing applications with cut, copy and paste features suffer due to problems in correctly identifying the width of each syllable on the screen. Also, applications have to worry about specific rendering issues when modifier codes are present. How applications run into difficulties in rendering even simple strings is illustrated with examples in a separate page.
  • Interpreting the syllabic content involves context dependent processing, that too with a variable number of codes for each syllable.
  • A complete set of symbols used in standard printed text has not been included in Unicode for almost all the Indian scripts.
  • Displaying text in scripts other than what Unicode supports is not possible. For instance, many of the scripts used in the past such as the Grantha Script, Modi, Sharada etc., cannot be used to display Sanskrit text. This will be a fairly serious limitation in practice when thousands of manuscripts written over the centuries have to be preserved and interpreted.
  • Transliteration across Indian scripts will not be easy to implement since appropriate symbols currently recommended for transliteration are not part of the Unicode set. In the Indian context, transliteration is very much a requirement.
  • The Unicode assignments bear little resemblance to the linguistic base on which the aksharas of Indian scripts are founded. While this is not a critical issue, it is desirable to have codes whose values are based on some linguistic properties assigned to the vowels and consonants, as has been the practice in India.
  In a separate web page, we discuss the problems associated with Unicode for linguistic processing of text in Indian languages.
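
  The context dependent grouping of codes into syllables referred to in the summary above can be sketched as follows for Devanagari. This is a deliberately simplified illustration in Python and not a complete segmenter: the code ranges are hand picked from the Devanagari chart, joiner characters and Vedic signs are ignored, and all names in it (split_syllables and the helper predicates) are ours.

```python
VIRAMA = 0x094D   # the halanth sign
NUKTA = 0x093C

def is_consonant(cp):
    return 0x0915 <= cp <= 0x0939 or 0x0958 <= cp <= 0x095F

def is_matra(cp):
    return 0x093E <= cp <= 0x094C

def is_modifier(cp):
    # chandrabindu, anusvara, visarga
    return cp in (0x0901, 0x0902, 0x0903)

def split_syllables(text):
    """Group the codes of a Devanagari string into syllable-sized pieces."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Signs that can never start a syllable attach to the previous piece.
        dependent = is_matra(cp) or is_modifier(cp) or cp in (VIRAMA, NUKTA)
        # A consonant right after a halanth continues the same conjunct.
        joins = is_consonant(cp) and bool(out) and ord(out[-1][-1]) == VIRAMA
        if out and (dependent or joins):
            out[-1] += ch
        else:
            out.append(ch)
    return out

if __name__ == "__main__":
    word = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924"  # the word "Sanskrit"
    for piece in split_syllables(word):
        print(" ".join(f"U+{ord(c):04X}" for c in piece))
```

  Even this reduced version has to look back at the preceding code before deciding where a syllable ends, and the pieces it returns vary in length. The ranges would have to be replaced for each script, which is the language dependence noted earlier.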

    Details of Unicode for Indian scripts have been published in the standard available from the Unicode consortium. The Unicode web site does have useful information but one will have to resort to the printed text to get the real details. These are also available in PDF format from the above web site.



Is Unicode for Indian Languages meaningless then ?

    The answer is certainly No. The main purpose of Unicode is to transport information across computer systems. As of today, Unicode is reasonably adequate to do this job since it does provide for representing text at the syllable level, though not in fixed size units (bytes).

    Applications dealing with Indian languages will have to include a special layer which transforms Unicode text into a representation more meaningful for linguistic or text processing purposes. The point to keep in mind is that the seven bit ASCII based representation used for most world languages serves both purposes well, i.e., not only are text strings transferable across systems, but linguistic processing is also consistent with the seven bit representation. It so happens that the phonetic nature of our Indian languages has necessitated a different representation for linguistic analysis.

For the majority of the languages of the world, which use a relatively small set of symbols to represent the letters of their alphabet, 8 bit (or even 7 bit) character codes are adequate to represent the letters.



Please refer to the FAQ provided at the Unicode web site which provides answers to some of the questions raised here. The real issue to understand is whether Unicode is adequate from the point of view of efficient text processing of Syllables so that one may attempt meaningful processing of text in Indian languages, consistent with the syllabic writing system.
Last updated on 10/26/12