Sorting order with Unicode
The debate on Unicode sorting order or collation
One of the issues
which has received much attention in respect of Indian languages and Unicode
is the problem of sorting order (called collation by some experts). Traditionally,
the assignment of codes to the characters of a language took into consideration
the order in which the letters of the alphabet would be arranged for purposes
of creating lists which could be viewed easily and scanned quickly by a
person. Almost all the classical sorting algorithms (including the indexing
of databases) arrange the letters in increasing or decreasing order of the
assigned codes.
It is well known
that Unicode has not taken into account the required lexical ordering of
the aksharas in any of the Indian scripts. This is understandable, for
Unicode was essentially derived from ISCII, where the ordering was based
on similar-sounding aksharas rather than the actual ordering conventions,
and this applied mainly to the southern languages. ISCII, however, gave a
uniform set of codes for all the languages, and perhaps on account of this
no one really raised the issue. Unicode made a departure by assigning
language-specific (actually script-specific) codes to our aksharas but in
essence retained the basic structure of ISCII.
Examples of aksharas that are ordered differently are given below.
The two "ra"s
of Tamil are placed together though they are separated by four
consonants in the conventional order. The two "na"s in Tamil are
placed together where as they are separated by nine consonants. The very
soft "na" in Tamil actually comes at the end. The consonants in our languages
are also grouped together linguistically and it will be necessary to keep
this in mind when attempting any sort of Linguistic Text processing.
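To make the contrast concrete, the short Python sketch below (added purely as an illustration; it assumes nothing beyond the conventional Tamil alphabet and the published code points) compares the position of each of these letters in the conventional order with its position when the same letters are sorted by code point.

    # Conventional order of the 18 Tamil consonants, as recited in the alphabet.
    conventional = list("கஙசஞடணதநபமயரலவழளறன")

    # A plain sort relies purely on the Unicode code-point values.
    by_code_point = sorted(conventional)

    for letter in ["ர", "ற", "ந", "ன"]:
        print(letter,
              "U+%04X" % ord(letter),
              "conventional position:", conventional.index(letter) + 1,
              "code-point position:", by_code_point.index(letter) + 1)

    # ர (U+0BB0) and ற (U+0BB1) are adjacent code points, yet four consonants
    # (ல, வ, ழ, ள) separate them in the conventional order; similarly ந (U+0BA8)
    # and ன (U+0BA9) are adjacent code points, yet ன is the very last consonant
    # of the conventional alphabet.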
Lexical ordering of
text is desirable whenever we prepare information for viewing by a person,
as in a dictionary or a list of the names of students in a class.
A recent paper written
by an expert at Microsoft, titled "Issues in Indic language collation",
argues that, in general, the assignment of character codes for several world
languages has not taken into consideration the lexical ordering and that
the Unicode assignment cannot be faulted. The expert's assertion
is that culturally and linguistically appropriate
collation is influenced by a language and not the script. The author
goes on to state that it will be shown in the paper that Unicode,
as an encoding, is more than sufficient to support Indic scripts and languages,
since it is only one step among many to develop culturally and linguistically
appropriate software for India.
One must read the statement
carefully, for Microsoft has accepted that coding alone is not the issue;
the application matters as well. It has also emphasized that an application
(which is based on the code) must be culturally and linguistically appropriate.
No one can deny the correctness of these observations. Yet in
placing the script above the language, i.e., emphasizing the handling of
the script in the computer rather than the linguistic content, a very peculiar
situation has emerged in respect of computing with Indian languages.
The real issue is whether
such applications can indeed be written with Unicode as the base. That
is, in the context of linguistic processing, can an application supporting
Unicode truly incorporate the features called for in providing a culturally
and linguistically appropriate solution to the problem at hand?
This question is best answered by examining what such an application must actually do.
A text processing application that places the script ahead of the language
will necessarily have to examine the context in which a Unicode character
is seen within a text string. A perfectly valid Unicode string is not necessarily
valid in terms of its linguistic content and so every application must
build into itself a great deal of linguistic information to map a given
Unicode string into the linguistic entity that the user will understand.
Such applications are not only very difficult to write but will also be heavily
influenced by the script itself, making it virtually impossible to handle a
truly multilingual document.
In the first place,
it is a difficult proposition indeed to write any text processing application
which has to work with multiple characters to arrive at a linguistic quantum,
namely the syllable, which is central to all the Indian languages. If Unicode
had concentrated on the linguistic content alone and had not prescribed
rendering rules, the situation would be a little better. This is not the
case, however, and linguistic processing with Unicode will require very complex
algorithms to infer the context in which each character appears
by examining the characters appearing before it as well as those appearing after it.
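As a rough indication of what such processing involves, the sketch below (a deliberately simplified Python illustration, not a complete algorithm) groups a Devanagari string into approximate aksharas by looking for viramas and matras. A real application would also have to handle anusvara, candrabindu, visarga, nukta, the zero-width control characters and much else.

    VIRAMA = "\u094D"                                         # halant
    CONSONANTS = set(chr(c) for c in range(0x0915, 0x093A))   # क .. ह
    MATRAS = set(chr(c) for c in range(0x093E, 0x094D))       # ा .. ौ

    def syllables(text):
        """Group a Devanagari string into approximate syllables (aksharas)."""
        result = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ch in CONSONANTS:
                current = ch
                # a virama glues the following consonant into the same akshara
                while i + 1 < len(text) and text[i + 1] == VIRAMA:
                    current += text[i + 1]
                    i += 1
                    if i + 1 < len(text) and text[i + 1] in CONSONANTS:
                        current += text[i + 1]
                        i += 1
                # an optional matra closes the akshara
                if i + 1 < len(text) and text[i + 1] in MATRAS:
                    current += text[i + 1]
                    i += 1
                result.append(current)
            else:
                # independent vowels and anything else pass through on their own
                result.append(ch)
            i += 1
        return result

    print(syllables("क्रिकेट"))   # ['क्रि', 'के', 'ट']

Even this toy version has to look ahead past every character before it can decide where one akshara ends and the next begins.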
Consider the situation
in respect of the Matras. The matra itself is not a proper linguistic unit
but a representation of a medial vowel, i.e., a vowel occurring in a syllable
in the middle or end of a word. Matras have been assigned codes so that
a computer program can quickly identify a syllable boundary in a text string.
If we ask ourselves the question, "How many times does a given vowel occur
in some text?", the program will have to match not only the occurrences of
that vowel but of its matra as well. This is two comparisons. Worse still,
a vowel can occur in its basic form right in the middle of a word, as shown
in the illustration below, so that to check for the presence of the vowel
one will have to perform two comparisons for each character; but even that
can be accepted. However, the two comparisons will still not yield the correct
results, since the matra can be accepted only if it is preceded by a valid
consonant. Now we begin to appreciate the complexity involved. Imagine checking
the occurrences of the vowel shown in the illustration below. One begins to
have second thoughts about the assertion that Microsoft applications can
indeed preserve linguistic content in a culturally appropriate manner!
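The sketch below (again a simplified Python illustration, using the vowel इ and its matra ि; the word examples are our own) shows what this apparently trivial counting task already involves.

    VOWEL_I = "\u0907"       # इ, the vowel in its basic (independent) form
    MATRA_I = "\u093F"       # ि, the corresponding matra (dependent form)
    CONSONANTS = set(chr(c) for c in range(0x0915, 0x093A))   # क .. ह

    def count_vowel_i(text):
        count = 0
        for i, ch in enumerate(text):
            if ch == VOWEL_I:
                count += 1                    # the vowel in its basic form
            elif ch == MATRA_I:
                # the matra stands for the vowel only when it follows a valid
                # consonant; otherwise the string is simply malformed
                if i > 0 and text[i - 1] in CONSONANTS:
                    count += 1
        return count

    print(count_vowel_i("इमली"))    # 1: the vowel in its basic form
    print(count_vowel_i("किताब"))   # 1: the same vowel, hidden inside a matra

Two comparisons per character, plus a check on the preceding character, merely to count one vowel.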
A valid or legal Unicode string is not necessarily linguistically legal
(nonsense words, after all, are still linguistically legal). Getting linguistic
content out of any Unicode string is a very difficult task on account of
the multibyte nature of the syllable when expressed as a Unicode string.
Also, the presence of codes which have no linguistic content but only provide
rendering information further complicates the processing.
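The point is easily demonstrated. Both strings in the sketch below are perfectly legal sequences of Unicode characters, but only the first could occur in real text; the second begins with a dependent matra and attaches two matras to one consonant. Nothing in the encoding itself flags the difference; the application must carry that knowledge.

    plausible   = "\u0915\u093F\u0924\u093E\u092C"   # क ि त ा ब  -- an ordinary word
    implausible = "\u093F\u0915\u093E\u093F"         # ि क ा ि    -- legal Unicode, impossible text

    # Python (and any Unicode-conformant system) accepts both without complaint.
    for s in (plausible, implausible):
        print(s, "->", len(s), "valid code points")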
As of this writing
(March 2003), linguistic collation has not been properly incorporated into
any of the Microsoft applications which are known to provide Unicode support
for Indian Languages. In the screen shot below, one can see the results
of sorting a column of words in a table. Both Devanagari and Tamil
examples are illustrated. It is clearly seen that only the Unicode ordering
is preserved and not the conventional linguistically accepted ordering.
The document was typed into Wordpad under Windows 2000, pasted into Word,
and the words placed inside a table using the Convert Text to Table feature.
For those who would like to try this out for themselves, we have provided a
downloadable version of the file containing the words in Devanagari and Tamil,
which will open with Wordpad, Notepad or Word under Win2000/XP.
(Open with Wordpad or Word under Win2000)
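Readers with a Python interpreter at hand can reproduce the essence of the experiment without any of the above applications, since a plain sort compares code points, which is effectively what those applications do. In the sketch below the three Tamil words (tiger, tamarind and worm) differ only in their third letter, and the conventional order of ல, ழ and ள is not the code-point order; the ICU suggestion in the final comment is an assumption about available tooling, not something used in the original experiment.

    words = ["புழு", "புளி", "புலி"]

    print(sorted(words))
    # -> ['புலி', 'புளி', 'புழு']     (code-point order: ல, ள, ழ)
    # conventional dictionary order:  புலி, புழு, புளி  (ல, ழ, ள)

    # A linguistically correct sort needs a language-aware collator, for example
    # ICU through the PyICU package:
    #
    #   import icu
    #   collator = icu.Collator.createInstance(icu.Locale("ta"))
    #   print(sorted(words, key=collator.getSortKey))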
It is equally amusing
to observe the differences in the displayed text in each of the three applications.
The team at SDL was originally under the impression that Microsoft had
problems in rendering zero-width glyphs in TrueType fonts, but Microsoft's
own OpenType font is no exception. The culprit is not the font but the
application. One can verify this by opening the sample HTML file (sorttest.html)
we have provided, which contains the same text in UTF-8, with Netscape 4.7
or later.
Shown below is a screen
shot of Netscape rendering the text referenced above.
In the illustration
below, the screen shot corresponds to the text copied from Netscape and
pasted into Microsoft Word. Notice the problems arising out of incorrect
interpretation of the Unicode string. Not only do we see problems with
the placement of the words but the last Unicode character in each line
seems to be rendered independently.
The screen shot below shows the same text copied from Internet Explorer and
pasted into Wordpad. Notice how the last Unicode character has been missed
during the rendering process!
If the onus
is on the application to render a Unicode text string to conform to a linguistically
appropriate form, one can immediately see the futility of attempting to
write applications that deal with multilingual text, even assuming that
we take support from Microsoft-provided modules such as Uniscribe. The
current implementations of Unicode support seem to concentrate mainly on
data entry and not really on text processing.
The wisdom of our linguistic scholars
Linguistics has been
an important subject of exposition and discussion in respect of Indian
languages (Sanskrit and Tamil in particular) from early times. The great
scholars and grammarians had clearly stated that the sound is more important
than the shape, and hence one must master the art of discerning sounds
correctly from any utterance. The script was secondary, and we all know
that the same sound can be represented in different scripts. Thus discerning
the sounds from written shapes was not considered important and was in fact
discouraged, since distortions could occur on account of the variations in
representation.
In the stone inscriptions
of Ashoka one finds occasional instances of conjuncts where the order of
the consonants in writing one below the other is reversed. A reader familiar
only with the script will no doubt read it incorrectly. Scholars known
to the author of this paper have however opined that this is a classic
example of a distortion when the person who does the carving fails to hear
the sounds carefully. The context, however, tells us what the akshara should
be. Proper handling of text in Indian languages requires that a written shape
be uniquely traced to a proper linguistic quantum, which is usually a syllable
but can well be a special symbol. Unicode will not be able to do this efficiently.
That Unicode as an encoding is more than sufficient for supporting Indian
scripts is not something one can accept. We must remember that the language
comes first and only then a script for it. If you concentrate on the script
and provide for dealing with it in a computer, you will be severely limited
by what the computer program can actually display. On the other hand, Unicode
is sufficient for carrying information that can be displayed, leaving the
viewer to extract the linguistic content from the display. Thus going from
Unicode to display makes sense, since the viewer will interpret the text
linguistically; but going back from the display so as to preserve the linguistic
content requires extremely complex processing, and it is not clear
whether multilingual applications can really benefit from the use of Unicode.