Limitations of Unicode and ISCII

  This is a discussion of the suitability of ISCII and Unicode for linguistic text processing in Indian languages. We assume that the reader is familiar with the assignment of codes in ISCII as well as Unicode for the different Indian languages/scripts. The views expressed here should not be construed as opposing the very idea of Unicode for Indian scripts. It just happens that Unicode brings in a lot of difficulties in linguistic processing. Unicode could possibly work in a multilingual application for Indian languages involving just data entry and display. Yet, a coding scheme that exhibits a clear bias towards text rendering poses problems even for such simple applications.

A detailed presentation of Unicode and Basic Indian Language Computing has been made available in a separate section. The issues involved have been explained with several examples. 

  The following paragraphs highlight the major problems one encounters in dealing with these two schemes.



Should the Encoding emphasize language issues or the writing system (script)?

  This is a fundamental question which must be understood by anyone discussing encoding schemes for Indian languages. In the past, text processing on a computer was always understood in terms of the letters of the alphabet. The text is displayed in the script associated with the language, and the script includes all the shapes or symbols seen in the writing system, so that the information to be conveyed by the text is complete. In other words, the displayed information conveys the linguistic content properly.

  When it comes to Indian languages, the script used for conveying the information may have no direct relationship with the language used. It is often the case that any script which can convey syllabic content without ambiguity or error could be used for a language; Sanskrit, for instance, could be written in half a dozen different scripts. A computer application dealing with a specific script will certainly have to honour the writing conventions in vogue, but text processing cannot be based on the way the script displays a specific syllable. What is critical is the linguistic content of the syllable, because it represents the sounds the syllable is built with and not the shapes used in composing the display. The Latin script handles the problem by displaying the syllable only by composing it from the shapes of its consonants and vowel, and this rule is meticulously followed. With writing systems which are syllabic in nature, each syllable has its own unique shape, and identifying the syllable from its shape is much more complex; but it has the advantage that there will be no ambiguity in the sounds when the written text is read out.

  Text encoding schemes that help identify the syllable quickly and efficiently will work better for linguistic analysis, or for text processing in general. This applies particularly to writing systems which are syllabic in nature. When Unicode was proposed, it was anticipated that the encoding scheme would concentrate on linguistic requirements and not the rendering aspects, i.e., the writing system involved. Unfortunately, the bias towards rendering continues to plague Unicode, at least in so far as syllabic writing systems are concerned.

  Since Unicode emphasizes the script and not the language, one has to content oneself with the scripts provided for in Unicode. There is no question of using many of the scripts that we have seen in India. It will not be possible to handle electronically scripts such as Grantha, used in South India for writing Sanskrit, or the Modi script, used for Marathi at one time. While one may disagree with this view and argue that one may never use those scripts now or in the future, introducing new scripts, or adding new symbols to an existing script to cater to additional sounds, will continue to remain a problem.

  It must be stated that linguistic scholars use the International Phonetic Alphabet (IPA), a script that represents specific sounds covering almost all the languages of the world. Unicode provides support for IPA, but computer applications providing appropriate interfaces for IPA can hardly be cited.
 

Data preparation versus Linguistic text processing.

  There are basically two fundamental aspects to electronic processing of text. The first is to generate the text itself so that the information can be stored and displayed, preferably on different computer systems; displaying text could also include high quality printing or typesetting. The second, and more important, aspect relates to interpreting the information carried by the text string. For example, the text string could well be a line from a poem where one is trying to find out if the string is a palindrome. In respect of Indian languages, where prefixes and suffixes are added to root words to obtain declensions, it may be necessary to look at a string to arrive at the root word in order to grammatically analyze the sentence and break it into words representing different parts of speech. This is another example of linguistic text processing.

  When the assigned codes relate only to the writing system used for the language, the emphasis is primarily on displaying the text string. When text has to be displayed, the codes representing the text will have to be mapped to the shapes appropriate to the characters in the text. In the case of Indian languages, the position of a glyph in a font designed for the script had been used in the past as the code for the text. For many of the Indian languages, several different fonts have been designed, each with a different glyph arrangement (see the section on "A tutorial on fonts for Indian languages"). As an example, the string shown below in Devanagari will have many different internal representations depending on the font used. These codes do not relate to the linguistic information that the character referenced by the code stands for. However, text rendering requires only a simple one to one mapping between the codes and the glyphs, a feature built into almost every computer system to deal with ASCII text.

  The string has four syllables in it and requires 14 bytes of storage if the Xdvng font is used and 9 bytes if the Sanskrit98 font is used. The same string may also be represented in transliterated form using 10 bytes. Clearly, glyph codes cannot be utilized for extracting linguistic information from the internal representation. The transliteration based representation has some advantages, since one can possibly identify syllables based on vowel boundaries. However, the transliteration based approach does not help us order the aksharas in the desired lexical order, since the ordering for Indian languages is totally different and standard sorting algorithms will yield incorrect results. Also, it will not be easy to write a special computer program to do this, since identifying a syllable requires scanning through a variable number of bytes for each syllable.
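
  A minimal sketch of the ordering problem (the "ka, kha, ga" transliteration used here is an assumed ITRANS-style scheme, not anything prescribed above): a plain byte-wise sort disagrees with the lexical order of the script.

    # The Devanagari lexical order of the first three consonants is ka, kha, ga.
    letters = ["ka", "kha", "ga"]

    # A plain byte-wise (ASCII) sort re-orders them incorrectly:
    print(sorted(letters))   # ['ga', 'ka', 'kha'] -- ga jumps to the front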

  ISCII and Unicode representations for the above string do preserve linguistic information, but the codes cannot be directly rendered; the display will have to be composed by putting together one or more shapes consistent with the syllable being rendered. The real issue, however, is the mapping to be effected when going from ISCII or Unicode to the glyphs in the font. The mapping involves complex rules depending on the conventions used in the writing system, which can surprisingly vary even for a given script. The mapping rules are generally required to be built into the application, and this poses real difficulties for those who write the software. Two different applications following two different rendering conventions will produce text that is incompatible between the applications. Please visit the section containing a detailed discussion of Unicode for Indian languages for additional information.
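
  One well known instance of such a rendering rule is the Devanagari short i matra, which is stored after its consonant in the logical order but drawn to its left. The sketch below illustrates the kind of reordering every rendering path must perform; it deliberately ignores conjuncts, where the matra moves before the whole cluster, and is no substitute for a real shaping engine.

    I_MATRA = "\u093F"   # DEVANAGARI VOWEL SIGN I, stored after the consonant

    def reorder_for_display(text):
        """Move each short-i matra before its consonant, as the writing system requires."""
        chars = list(text)
        for i in range(1, len(chars)):
            if chars[i] == I_MATRA:
                chars[i - 1], chars[i] = chars[i], chars[i - 1]
        return "".join(chars)

    # "ki" is stored as KA + I-MATRA but displayed with the matra on the left:
    print(list(reorder_for_display("\u0915\u093F")))   # ['\u093F', '\u0915']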


Difficulties in using variable length (multibyte) representations

  The variable number of bytes used in representing a syllable also poses peculiar problems in text processing. Let us consider the problem of identifying a palindrome in Sanskrit. Given below are some palindromes familiar to our viewers from the Learn Sanskrit series of on-line lessons.

  A glance at the representations is enough to convince one of the futility of attempting standard algorithms for a solution. The palindrome is immediately recognized when seen as a series of syllables, but not when seen in terms of the codes.
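
  Since the palindromes above are available only as an image, the sketch below uses a simple word of our own choosing to make the point: it is a palindrome at the akshara level, but not at the code point level. The segmentation routine is a rough approximation and ignores many details (nukta, anusvara, visarga and so on).

    VIRAMA = "\u094D"
    MATRAS = set("\u093E\u093F\u0940\u0941\u0942\u0943\u0944\u0947\u0948\u094B\u094C")

    def aksharas(text):
        """Group Devanagari code points into syllable-level units (rough sketch)."""
        units = []
        for ch in text:
            if units and (ch in MATRAS or ch == VIRAMA or units[-1].endswith(VIRAMA)):
                units[-1] += ch      # matras, viramas and post-virama consonants join the cluster
            else:
                units.append(ch)     # anything else starts a new cluster
        return units

    word = "नवजीवन"                  # na-va-jii-va-na
    print(aksharas(word))            # ['न', 'व', 'जी', 'व', 'न']
    print(aksharas(word) == aksharas(word)[::-1])   # True : a palindrome by akshara
    print(word == word[::-1])                       # False: not one by code point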

  There are other problems too, which relate to ambiguities in interpreting a string where a series of eight bit codes carrying rendering information are present in the text. What we have tried to emphasize here is that meaningful text processing in any of the Indian languages can be achieved only if the internal representation allows direct identification of syllables (aksharas).

  As of March 2005, none of the existing font based coding schemes for Indian languages satisfies the linguistic processing requirement. ISCII and Unicode at least have some structure that might help identify the aksharas, but even these run into problems, as we will see below.


Problems faced due to assignment of codes to character shapes.

  Unicode and ISCII provide a few codes which do not carry any linguistic information but instead direct the rendering process to enforce certain rules while rendering text. Thus both schemes mix linguistic content with rendering information. Extracting linguistic content from a string with such a mix requires extensive context dependent processing, something one cannot easily handle at the application level. Contrary to the basic principle behind Unicode, that text representation should be clearly separated from the rendering process, Unicode does show a departure in respect of South Asian scripts.

  What is the significance of this observation?

  Very simply, applications will not be able to identify text strings by their linguistic content when string matching is involved (regular expression matching, if you will). Just try and figure out a suitable regular expression to match all the strings shown in the illustration below!

[Image: Unicode rendering of visually identical strings]

  If you are wondering how this text was created in the first place, the header in the window should provide the clue. You can download the associated file ( aditya.txt ) and try the string matching yourself in any application that supports Unicode text processing for Indian languages.
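
  A minimal sketch of the kind of mismatch involved, using nothing beyond the standard Unicode code points: the two strings below encode the same conjunct, the second with a ZERO WIDTH JOINER that merely requests a different written form (the exact shapes depend on the font and the shaping engine), yet a pattern built from one will not match the other.

    import re

    s1 = "\u0915\u094D\u0937"        # ka + virama + ssa : the conjunct क्ष
    s2 = "\u0915\u094D\u200D\u0937"  # the same letters with a ZERO WIDTH JOINER inserted

    print(s1 == s2)                  # False -- the strings differ internally
    print(re.search(s1, s2))         # None  -- a pattern built from one misses the other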

Valid Unicode Strings do not necessarily constitute valid Linguistic content.

  Both ISCII and Unicode include codes for the medial vowel representations. A medial vowel representation (matra) does not carry any linguistic information by itself; one has to make sure that a consonant or conjunct precedes the matra. It is quite easy to set up a Unicode string in Devanagari or other Indian scripts that displays a matra by itself, on the wrong side of a consonant, giving one the impression that a particular syllable is being shown. Internally, it would be a different story altogether. Here is an example.

[Image: incorrectly rendered text]

 The corresponding file mahodaya.txt may be downloaded and the rendering variations checked on your system.
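
  A simplified validity check along these lines can be sketched in a few lines; the consonant and matra sets below are restricted to the basic Devanagari ranges and ignore conjuncts, nukta forms and independent vowels, so this is an illustration rather than a complete validator.

    CONSONANTS = {chr(c) for c in range(0x0915, 0x093A)}   # DEVANAGARI KA .. HA
    MATRAS = set("\u093E\u093F\u0940\u0941\u0942\u0947\u0948\u094B\u094C")

    def matras_well_placed(text):
        """Check that every matra follows a consonant in the logical (stored) order."""
        return all(ch not in MATRAS or (i > 0 and text[i - 1] in CONSONANTS)
                   for i, ch in enumerate(text))

    print(matras_well_placed("\u092E\u093E"))   # True  : ma + aa-matra, the akshara "मा"
    print(matras_well_placed("\u093E\u092E"))   # False : aa-matra first, though every
                                                #         code point is valid Unicode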

  The provisions in Unicode and ISCII for composing a syllable can result in extremely difficult situations for the application handling the text. The real problem one faces in practice is that the application is required to handle part of the rendering, by querying the system to check whether the font used (typically an OpenType font) supports the display form sought. Such a requirement poses difficulties for software developers. It would be so much better if an application could just prepare the string and ask the system to render it, as with standard ASCII.

  In ISCII, "INV", the code representing an invisible consonant, and the "Nukta", a code set apart for composing dotted consonants and some other ligatures, are examples of rendering information built into a code. Unicode runs into additional problems as well, because it provides codes for ligatures that would not qualify as linguistic content by themselves. These, coupled with characters such as the zero width joiner and zero width non-joiner, can cause serious headaches for text processing applications if the displayed text was composed using these codes. This is what the example cited above illustrates, where identical displays do not have identical internal representations.
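
  Where strings differing only in such rendering hints must at least compare equal, an application can normalize the controls away before matching. A minimal sketch, on the assumption that ZWJ and ZWNJ carry no linguistic content in the texts being processed:

    ZWJ, ZWNJ = "\u200D", "\u200C"

    def strip_rendering_controls(text):
        """Drop zero width (non-)joiners so that strings differing only in
        rendering hints compare equal."""
        return text.replace(ZWJ, "").replace(ZWNJ, "")

    s1 = "\u0915\u094D\u0937"            # the conjunct क्ष
    s2 = "\u0915\u094D\u200D\u0937"      # the same conjunct with a half-form request
    print(strip_rendering_controls(s1) == strip_rendering_controls(s2))   # True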

Coming back to our problem, the ISCII INV code is special, as it represents a way of displaying a consonant. The INV cannot be viewed as part of a syllable, since it refers to a shape. As mentioned above, one must look at the context in which the INV code is used before dealing with it. Applications which interpret ISCII text often have problems rendering the strings so as to allow proper transliteration across scripts. ILEAP, the multilingual offering from CDAC, is one of the few known applications handling ISCII. This application does run into problems in transliterating strings which include the INV and Nukta codes. Viewers familiar with ILEAP may want to try this for themselves by downloading the associated .aci file ( iscii_ex.aci ), which can be inserted into an ILEAP document.

In Unicode, values are assigned only for the basic vowels, the consonants, and the vowel extensions or medial vowels (please refer to the chart of Unicode assignments for Devanagari). Though fundamentally Unicode aims at separating the text representation from the rendering of the display, discrepancies such as those illustrated above create difficulties in practice. In any electronic text processing, it is important to avoid context dependent identification of text. Where one letter of the alphabet maps into one font glyph, the context problem does not arise. For Indian scripts, where a conjunct character is often built from several glyphs, identifying a context will be nothing short of a nightmare!


Dealing with South Indian scripts (also, collation issues)

Let us now turn to a few other difficulties with the present assignment of Unicode for some of the South Indian languages, specifically Tamil. The string shown below may be examined; Unicode allows the string to be generated using six characters.
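
The original string is shown only as an image, so it is not reproduced here; but the flavour of the problem appears with any Tamil syllable using a two part vowel sign, where one stored matra is drawn as glyph pieces on both sides of the consonant.

    s = "\u0B95\u0BCA"        # TAMIL LETTER KA + TAMIL VOWEL SIGN O : "கொ"
    print(len(s))             # 2 code points ...
    # ... yet most fonts draw three pieces: the e-part, then KA, then the aa-part,
    # so the visual order of the shapes has little to do with the stored order of codes.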

This example alone is enough to establish the need for text representation in terms of linguistic units, or aksharas. One is not surprised that ancient wisdom in India emphasized the need to utter the sounds properly when looking at their representation as aksharas. Building up the composite shape of an akshara from the basic units (i.e., the shapes for vowels, consonants and vowel extensions) was a process learnt over a period of time, but once it was understood, a person had no difficulty in hearing the sound in a shape.

Our next observation about Unicode has to do with the sorting order of aksharas in the different languages. The casualty in this case is Tamil, though one runs into related problems even for Devanagari. The basic consonants in Tamil are eighteen, and the accepted lexical ordering is given below.

The Unicode assignment differs from the established convention. This was probably not the intention of those who assigned Unicode values to the Tamil letters, but resulted as a consequence of fitting the aksharas of other languages into the basic framework set up for Sanskrit.
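
A minimal sketch of the consequence, using a handful of letters and the traditional order as an explicit sort key; note in particular that ன precedes ப in code point order, although it is the last consonant in the accepted ordering.

    TRADITIONAL = "கஙசஞடணதநபமயரலவழளறன"   # the accepted order of the 18 Tamil consonants
    rank = {ch: i for i, ch in enumerate(TRADITIONAL)}

    letters = ["ப", "ன", "ற", "ம"]
    print(sorted(letters))                 # ['ன', 'ப', 'ம', 'ற'] by raw code point
    print(sorted(letters, key=rank.get))   # ['ப', 'ம', 'ற', 'ன'] in the accepted order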

The view expressed above has been contested by others, specifically the proponents of Unicode, who maintain that encoding schemes cannot be expected to provide compatibility with the lexical ordering sought. According to them, it is the responsibility of the application software to meaningfully handle the linguistic issues connected with the application. While one cannot deny the correctness of this view, the question of whether such applications can be written at all remains to be answered. Interested readers may visit the page where we have discussed this in detail.


Does Unicode representation allow direct transliteration across Indian languages?

The answer, sadly, is no. Transliteration would be correct only when syllables can be properly identified. Unicode values for the aksharas of the different languages do not always match in terms of their index within the assigned set of codes. Transliteration will have to be attempted based on the context, and should take into account the presence of modifier codes, a near impossible task if the transliterated display is to look right and convey the same linguistic content. Even granting that across the Indian scripts one could attempt to use large conversion tables, script specific Unicode assignments will cause difficulties. Worse still, it will not be easy to provide transliteration into Roman with diacritics, something scholars all over the world have used for representing text in the different Indian languages. This despite the fact that Unicode supports a full complement of IPA symbols!
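
The temptation is to exploit the parallel layout of the script blocks and transliterate by a fixed code point offset. The sketch below shows both why this sometimes works and where it immediately breaks; the function is a deliberately naive illustration, not a recommended method.

    OFFSET = 0x0B80 - 0x0900   # distance between the Tamil and Devanagari Unicode blocks

    def naive_transliterate(devanagari):
        """Shift every code point into the Tamil block -- a deliberately naive sketch."""
        return "".join(chr(ord(c) + OFFSET) for c in devanagari)

    print(naive_transliterate("\u0915\u093E"))      # का -> கா : KA and the aa-matra line up
    print(hex(ord(naive_transliterate("\u0916"))))  # ख (kha) -> 0xb96, an unassigned
                                                    # code point; Tamil has no kha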


Having said so much about the inadequacies of ISCII and Unicode, we should also examine the real feasibility of a syllable level representation. The akshara as the representation of a sound is a unique concept, though the same sound may be given different written shapes based on the script. So one has to identify the set of sounds in a language and assign codes to them in such a way that each sound may be distinguished uniquely and ordered properly according to the lexical order. For linguistic purposes, it may be necessary to break a sound into its basic component sounds (vowels and consonants). For efficient string processing, the assigned codes must all be of the same size: variable length representation of a syllable does not help in any way to write good algorithms which are also efficient when implemented.

Just how many syllables are required to be coded is an interesting question, for, as one might guess, there are countless possible combinations of consonants and vowels. Yet, over the period that our languages have seen good use, approximately eight hundred to a thousand basic syllables are seen; one has merely to look at a dictionary and count the different aksharas to arrive at a meaningful number. For Sanskrit and many other Indian languages, this number, as indicated above, is approximately eight hundred to a thousand basic syllables, i.e., aksharas consisting only of consonants. With the possibility of each conjunct combining with any one of the vowels, the total number will be many thousands. (The noted exception to this is Tamil, where a conjunct is always written by splitting it into its basic consonants.)

By carefully examining texts in different languages, the development team at IIT Madras has identified about 800 conjuncts which are individually used (along with a vowel, of course). The coding scheme recommended for use is a sixteen bit value for each of the 13000 or so individually identifiable syllables. The code has been designed in such a way as to quickly reveal the basic consonant and vowel forming the syllable, and also to identify the other consonants, should there be any. The way syllables are reckoned in Indian scripts is explained in a separate page.
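
The actual bit assignment is not reproduced here; the sketch below shows merely one hypothetical way of structuring such a sixteen bit code so that the base consonant (or conjunct) and the vowel can be read off directly. With 10 bits for the base and 4 for the vowel, roughly 835 x 16, or about 13000, syllables can be distinguished, which is consistent with the figure quoted above. This layout is our illustration, not the real IIT Madras scheme.

    def pack_akshara(base_id, vowel_id):
        """Pack one syllable into 16 bits: high 10 bits index the consonant or
        conjunct (about 35 consonants + 800 conjuncts fit easily), low 4 bits
        index the vowel; two bits are left spare for flags."""
        assert 0 <= base_id < 1024 and 0 <= vowel_id < 16
        return (base_id << 4) | vowel_id

    def unpack_akshara(code):
        return code >> 4, code & 0xF       # (base_id, vowel_id)

    code = pack_akshara(base_id=2, vowel_id=3)
    print(hex(code), unpack_akshara(code))  # 0x23 (2, 3)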

For those who think in binary, a sixteen bit code allows up to 65536 possibilities, and that many will never really be required. The IIT Madras coding scheme has structured the sixteen bits in such a way that only the specified syllables will be recognized by the processing utilities. As of now, most applications which have allowed Indian languages to be handled have used only a font based or a Unicode based representation for the aksharas. Such applications will not be able to interpret text prepared using the IIT Madras software. This is not a serious problem, since there are not many applications that have really enabled Indian language usage, and the IIT Madras software includes many different applications which can be used right away. Hence, the solution offered by IIT Madras should be viewed not merely as a feasible approach to the problem of coding Indian language characters, but as one which meets both requirements, viz., language enabling as well as localization.

A syllable level coding scheme has several other advantages. Applications may choose from a variety of fonts for display and printing, and may also freely transliterate across scripts, thus allowing multilingual preparation of documents with the same text shown in many languages. The common format will also come in handy for preparing material for the web, where interaction may also be provided on a web page, as may be seen from the on-line demos at this site. The syllable level representation is amazingly compact when one thinks of the storage occupied: the whole text of the Bhagavadgita, with seven hundred slokas in the "anushtup" chandas, will require only twenty four kilobytes of storage! It is a different matter altogether that these twenty four kilobytes may require several hundred kilobytes of commentaries for a person to understand the purpose and meaning of the seven hundred slokas. That Sanskrit has the ability to compact much information into its syllables is really true in practice.
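
The compactness claim is easy to check in miniature. Taking the illustrative word used earlier (our own example, not from the original text), UTF-8 needs three bytes per Devanagari code point, while a fixed sixteen bit syllable code needs two bytes per akshara:

    word = "नवजीवन"                      # five aksharas: na va jii va na
    print(len(word.encode("utf-8")))     # 18 bytes in UTF-8 (6 code points x 3 bytes)
    print(5 * 2)                         # 10 bytes at one fixed 16-bit code per akshara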
