Sorting order with Unicode
The debate on Unicode sorting order or collation

  One of the issues which has received much attention in respect of Indian languages and Unicode is the problem of sorting order (called collation by experts). Traditionally, the assignment of codes to the characters of a language took into consideration the order in which the letters of the alphabet would be arranged for the purpose of creating lists which could be viewed easily and scanned quickly by a person. Almost all the classical sorting algorithms (including the indexing of databases) arrange strings in the increasing or decreasing order of the assigned codes.
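The point can be seen in a short Python sketch (our own illustration, not from the original article): a plain sort compares code points position by position, so the result follows the code chart rather than any dictionary convention.

```python
# sorted() on strings compares Unicode code points one by one.
# "Apple" precedes "banana" only because 'A' (U+0041) has a smaller
# code than 'b' (U+0062) -- the code chart, not the dictionary, decides.
words = ["banana", "Apple", "cherry"]
print(sorted(words))   # ['Apple', 'banana', 'cherry']
```

A linguistically correct sort would require a collation table layered on top of the raw codes, which is precisely the burden discussed below.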

  It is clearly known that Unicode has not taken into account the required lexical ordering of the aksharas in any of the Indian scripts. This is understandable, for Unicode was essentially derived from ISCII, where the ordering was based on similar-sounding aksharas rather than the actual ordering conventions; this applied mainly to the southern languages. ISCII, however, gave a uniform set of codes for all the languages, and perhaps on account of this no one really raised the issue. Unicode made a departure by assigning language-specific (actually script-specific) codes to our aksharas but in essence retained the basic structure of ISCII.

  Specific instances of aksharas that were ordered differently are shown below.

  The two "ra"s of Tamil are placed together in Unicode though they are separated by four consonants in the conventional order. The two "na"s of Tamil are likewise placed together, whereas they are separated by nine consonants; the very soft "na" of Tamil actually comes at the end of the alphabet. The consonants in our languages are also grouped together linguistically, and it will be necessary to keep this in mind when attempting any sort of linguistic text processing.
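The displacement is easy to verify programmatically. The sketch below (our own illustration) lists the eighteen Tamil consonants in their conventional alphabetical order and compares that with a plain code-point sort.

```python
# The eighteen Tamil consonants in the conventional alphabetical order.
conventional = list("கஙசஞடணதநபமயரலவழளறன")
by_codepoint = sorted(conventional)

# In the code chart the two "ra"s (ர U+0BB0, ற U+0BB1) sit side by side...
print(by_codepoint.index("ற") - by_codepoint.index("ர"))   # 1
# ...while the conventional alphabet separates them by four consonants.
print(conventional.index("ற") - conventional.index("ர"))   # 5
# The very soft "na" (ன) conventionally closes the alphabet,
# but the code chart places it between ந (U+0BA8) and ப (U+0BAA).
print(conventional[-1] == "ன")   # True
```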

  Lexical ordering of text is desirable whenever we prepare information for manual view as in a dictionary or a list of names of students in a class.

  A recent paper by an expert at Microsoft, titled "Issues in Indic language collation", argues that in general the assignment of character codes for many of the world's languages has not taken lexical ordering into consideration, and that the Unicode assignment therefore cannot be faulted. The expert's assertion is that culturally and linguistically appropriate collation is determined by the language and not the script. The author goes on to state that the paper will show that Unicode, as an encoding, is more than sufficient to support Indic scripts and languages, since encoding is only one step among many in developing culturally and linguistically appropriate software for India.

One must read the statement carefully, for Microsoft has accepted that coding alone is not the issue but the application as well. It has also emphasized that an application (which is based on the code) must be culturally and linguistically appropriate. No one can deny the correctness of these observations. Yet in placing the script above the language, i.e., emphasizing the handling of the script in the computer rather than the linguistic content, a very peculiar situation has emerged in respect of computing with Indian languages.

The real issue is whether such applications can indeed be written with Unicode as the base. That is, in the context of linguistic processing can an application supporting Unicode truly incorporate the features called for in providing a culturally and linguistically appropriate solution to the problem at hand?

This question can be easily answered.

  A text processing application that places the script ahead of the language will necessarily have to examine the context in which a Unicode character occurs within a text string. A perfectly valid Unicode string is not necessarily valid in terms of its linguistic content, and so every application must build into itself a great deal of linguistic information to map a given Unicode string into the linguistic entity that the user will understand. Such applications are not only very difficult to write but will be heavily influenced by the script itself, making it virtually impossible to handle a truly multilingual interface.

  In the first place, it is a difficult proposition indeed to write any text processing application which has to work with multiple characters to arrive at a linguistic quantum, namely the syllable, which is central to all the Indian languages. If Unicode had concentrated on the linguistic content alone and had not prescribed rendering rules, the situation would be a little better. This is not the case, however, and linguistic processing with Unicode will require very complex algorithms to infer the context in which each character appears by examining the characters before it as well as those after it.
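What such context-dependent processing looks like can be sketched in Python. This is a deliberately simplified illustration of our own, handling only plain consonants, matras, independent vowels and the virama; real text with ZWJ/ZWNJ, anusvara, nukta and so on needs considerably more machinery.

```python
# A deliberately simplified akshara (syllable) splitter for Devanagari.
# Assumptions: input contains only consonants (U+0915..U+0939),
# independent vowels, matras (U+093E..U+094C) and the virama (U+094D).
VIRAMA = "\u094D"

def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"

def is_matra(ch):
    return "\u093E" <= ch <= "\u094C"

def aksharas(text):
    units, current = [], ""
    for ch in text:
        if is_consonant(ch):
            # A consonant extends the current unit only if a virama
            # joins them; otherwise it starts a new akshara.
            if current and not current.endswith(VIRAMA):
                units.append(current)
                current = ch
            else:
                current += ch
        elif ch == VIRAMA or is_matra(ch):
            current += ch          # attaches to the preceding consonant
        else:
            # Independent vowel (or anything else) forms its own unit.
            if current:
                units.append(current)
                current = ""
            units.append(ch)
    if current:
        units.append(current)
    return units

print(aksharas("किताब"))   # ['कि', 'ता', 'ब']
print(aksharas("प्रिय"))    # ['प्रि', 'य']
```

Even this toy splitter must look behind at the previous character before deciding what the current one means; a single code unit carries no syllable boundary on its own.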

  Consider the situation in respect of the matras. The matra itself is not a proper linguistic unit but a representation of a medial vowel, i.e., a vowel occurring in a syllable in the middle or at the end of a word. Matras have been assigned codes so that a computer program can quickly identify a syllable boundary in a text string. If we ask ourselves the question, "How many times does a given vowel occur in some text?", the program will have to match not only the occurrence of that vowel but its matra as well. That is two comparisons. Worse still, a vowel can occur in its basic form right in the middle of a word, as shown below.

 This means that to check for the presence of the vowel, one will have to perform two comparisons for each character, but even that can be accepted. However, the two comparisons will still not yield the correct results, since the matra can be accepted only if it is preceded by a valid consonant. Now we begin to appreciate the complexity involved. Imagine checking the occurrences of the vowel shown in the illustration below. One has second thoughts on whether Microsoft applications can indeed assert that linguistic content is preserved in a culturally appropriate manner!
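The chain of checks can be written out in Python (again our own illustration, using the vowel /i/: independent इ is U+0907, its matra ि is U+093F):

```python
VOWEL_I, MATRA_I = "\u0907", "\u093F"   # independent इ and matra ि

def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"

def count_i(text):
    """Count occurrences of the vowel /i/: one comparison for the
    independent vowel, a second for the matra -- and the matra counts
    only when a valid consonant precedes it."""
    count = 0
    for pos, ch in enumerate(text):
        if ch == VOWEL_I:
            count += 1
        elif ch == MATRA_I and pos > 0 and is_consonant(text[pos - 1]):
            count += 1
    return count

print(count_i("इस किताब"))   # 2: independent इ once, matra ि once
```

A simple character count has already turned into a context-sensitive scan; richer queries compound the problem.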


Observation

  A valid or legal Unicode string is not necessarily linguistically legal (whereas even nonsense words are linguistically legal). Extracting linguistic content from a Unicode string is a very difficult task on account of the multibyte nature of the syllable when expressed as a Unicode string. The presence of codes which carry no linguistic content but only rendering information further complicates the processing.

  As of this writing (March 2003), linguistic collation has not been properly incorporated into any of the Microsoft applications which are known to provide Unicode support for Indian languages. In the screen shot below, one can see the results of sorting a column of words in a table; both Devanagari and Tamil examples are illustrated. It is clearly seen that only the Unicode ordering is preserved and not the conventional, linguistically accepted ordering. The document was typed into WordPad under Windows 2000, pasted into Word, and the words placed inside a table using the Convert Text to Table feature of Word.
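The same code-point-only ordering is easy to reproduce outside Word. In this Python sketch (ours, with words of our choosing), a plain sort places பறவை (paravai, "bird") ahead of பழம் (pazham, "fruit"), although ழ precedes ற in the Tamil alphabet:

```python
# A plain sort follows the code chart: ற is U+0BB1 and ழ is U+0BB4,
# so பறவை sorts before பழம் even though the Tamil alphabet orders
# ழ before ற.
words = ["பழம்", "பறவை"]
print(sorted(words))   # ['பறவை', 'பழம்']
```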


  For those who would like to try this out for themselves, we have provided a downloadable version of the file containing the words in Devanagari and Tamil; it will open with WordPad, Notepad or Word under Windows 2000/XP.

sorttest.doc (open with WordPad or Word under Windows 2000)

  It is equally amusing to observe the differences in the displayed text in each of the three applications. The team at SDL was originally under the impression that Microsoft had problems in rendering zero-width glyphs in TrueType fonts, but Microsoft's own OpenType font is no exception. The culprit is not the font but the application. One can verify this by opening a sample HTML file we have provided (sorttest.html), which contains the same text in UTF-8, with Netscape 4.7 or later.

  Shown below is a screen shot of Netscape rendering the text referenced above.

  In the illustration below, the screen shot corresponds to the text copied from Netscape and pasted into Microsoft Word. Notice the problems arising from incorrect interpretation of the Unicode string: not only are the words misplaced, but the last Unicode character in each line seems to be rendered independently.


  The screen shot below shows the same text copied from Internet Explorer and pasted into WordPad. Notice how the last Unicode character has been missed during the rendering process!


  If the onus is on the application to render a Unicode text string in a linguistically appropriate form, one can immediately see the futility of attempting to write applications that deal with multilingual text, even assuming that we take support from Microsoft-provided modules such as Uniscribe. The current implementations of Unicode support seem to concentrate mainly on data entry and not really on text processing.

The wisdom of our Linguistic experts

  Linguistics has been an important subject of exposition and discussion in respect of Indian languages (Sanskrit and Tamil in particular) from early times. The great scholars and grammarians clearly stated that the sound is more important than the shape, and hence one must master the art of discerning sounds correctly from any utterance. The script was secondary, and we all know that the same sound can be represented in different scripts. Thus relying on written shapes alone to discern the sounds was not considered important and was in fact discouraged, since distortions could occur on account of variations in representation.

  In the stone inscriptions of Ashoka one finds occasional instances of conjuncts where the order of the consonants, written one below the other, is reversed. A reader familiar only with the script will no doubt read them incorrectly. Scholars known to the author have opined that this is a classic example of the distortion that occurs when the person doing the carving fails to hear the sounds carefully. The context, however, tells us what the akshara should really be.

  Correct linguistic handling of text in Indian languages requires that a written shape be uniquely traced to a proper linguistic quantum, which is usually a syllable but may well be a special symbol. Unicode will not be able to do this efficiently. That Unicode as an encoding is more than sufficient for supporting Indian scripts is not something one can accept. We must remember that the language comes first and only then a script for it. If one concentrates on the script and provides for dealing with it in a computer, one will be severely limited by what the computer program can actually display. Unicode is, on the other hand, sufficient for carrying information that can be displayed, leaving the viewer to extract the linguistic content from the display. Thus going from Unicode to display makes sense, since the viewer will interpret the text linguistically; but going back from the display while preserving the linguistic content requires extremely complex processing, and it is not clear whether multilingual applications can really benefit from the use of Unicode.
