Suitability of Unicode and ISCII
This is a discussion
relating to the suitability of ISCII or Unicode for linguistic text
processing in Indian languages. We have assumed that the reader is familiar
with the assignment of codes in ISCII as well as Unicode for different
Indian languages/scripts. The views expressed here
should not be construed as opposing the very idea of Unicode for Indian
scripts. It just happens that Unicode brings in a lot of difficulties
in linguistic processing. Unicode could possibly work in a multilingual
application for Indian languages involving just data entry and display.
Yet, a coding scheme that exhibits a clear bias towards text rendering
does pose problems even for such simple applications.
A detailed presentation of Unicode and Basic Indian Language Computing has
been made available in a separate section. The issues involved have been
explained with several examples.
The following paragraphs
highlight the major problems one encounters in dealing with these two schemes.
Should the encoding emphasize language issues or the writing system (script)?
This is a fundamental
question which must be understood by anyone discussing encoding schemes
for Indian Languages. In the past, text processing on a computer was always
understood in terms of the letters of the alphabet. The text concerned
is displayed in the script associated with the language. The script includes
all the shapes or symbols seen in the writing system so that the information
to be conveyed by the text is complete. In other words, the displayed information
conveys the linguistic content properly.
When it comes to Indian
languages, the script used for conveying the information may have no direct
relationship with the language used. It often is the case that any script
which can convey syllabic content without ambiguity or error could be used
for a language. Sanskrit for instance could be written in half a dozen
different scripts. A computer application dealing with a specific script
will certainly have to honour the writing conventions in vogue but text
processing cannot be based on the way the script displays a specific syllable.
What is critical is the linguistic content in the syllable because it represents
the sounds the syllable is built with and not the shapes used in composing
the display. The Latin script handles the problem by displaying the syllable
only by composing it from the shapes of the consonants and vowel and this
rule is meticulously followed. With writing systems which are syllabic
in nature, each syllable has its own unique shape and identifying the syllable
from its shape is much more complex but it has the advantage that there
will be no ambiguity in the sounds when the written text is read out.
Text encoding schemes
that help identify the syllable quickly and efficiently will work better
for linguistic analysis or text processing in general. This will particularly
apply to writing systems which are syllabic in nature. When Unicode was
proposed, it was anticipated that the encoding scheme would concentrate
on linguistic requirements and not the rendering aspects, i.e., the writing
system involved. Unfortunately the bias towards rendering continues to
plague Unicode at least in so far as syllabic writing systems are concerned.
Since Unicode emphasizes
the script and not the language, one has to content oneself with the scripts
provided for in Unicode. There is no question of using many of the scripts
that we have seen in India. It will not be possible to handle electronically
scripts such as the Grantha, used in South India for writing Sanskrit, or
the Modi script, used for Marathi at one time. While one may disagree with
this view and argue that one may never use those scripts now or in the
future, introducing new scripts or adding new symbols to an existing script
to cater to additional sounds will continue to remain a problem.
It must be stated
that linguistic scholars use the International Phonetic Alphabet
(IPA), a script that represents specific sounds covering almost all the
languages of the world. Unicode provides support for IPA, but computer
applications providing appropriate interfaces for IPA can hardly be found.
Text preparation versus linguistic text processing
There are basically
two fundamental aspects to Electronic processing of text. The first is
to generate the text itself so that the information can be stored and displayed,
preferably on different computer systems. Displaying text could also include
high quality printing or typesetting. The second is a more important point
which relates to interpreting the information carried by the text string.
For example, the text string could well be a line from a poem where one
is trying to find out if the string is a palindrome. In respect of Indian
languages where prefixes and suffixes are added to root words to obtain
declensions, it may be necessary to look at a string to arrive at the root
word for grammatically analyzing the sentence and breaking it into words representing
different parts of speech. This is another example of linguistic text processing.
When the assigned codes relate only to the writing system used for the
language, the emphasis is primarily on displaying the text string.
When text has to be displayed, the codes representing the text will have
to be mapped to the shapes appropriate to the character in the text. In
the case of Indian languages, the position of a glyph in a font designed
for the script had been used in the past as the code for the text. For
many of the Indian languages, several different fonts have been designed
each with a different glyph arrangement (see the section on "A
tutorial on fonts for Indian languages"). As an example, the string
shown below in Devanagari, will have many different internal representations
depending on the font used. These codes do not relate to linguistic information
that the character referenced by the code stands for. However, text rendering
requires only a simple one-to-one mapping between the codes and the glyphs,
a feature built into almost every computer system to deal with ASCII text.
The string has four syllables
in it and requires 14 bytes of storage if the Xdvng font is used and 9
bytes if the Sanskrit98 font is used. The same string may also be represented
in transliterated form using 10 bytes. Clearly, glyph codes cannot be utilized
for extracting linguistic information from the internal representation.
To some extent, the transliteration based representation has some advantages
since one can possibly identify syllables based on vowel boundaries. However,
transliteration based approach does not help us order the aksharas in the
desired lexical order since the ordering for Indian languages is totally
different and the standard sorting algorithms will yield incorrect results.
Also, it will not be easy to write a special computer program to do this
since identifying a syllable requires scanning through a variable number
of bytes for each syllable.
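The point about variable-length scanning can be sketched in a few lines of Python. The function, the vowel set, and the sample string are our own illustrations of a much-simplified transliteration scheme, not part of any standard:

```python
# A sketch of syllable splitting on a transliterated string. The vowel
# set and the segmentation rule are deliberately simplified; real
# transliteration schemes (Harvard-Kyoto, ITRANS, ...) need fuller rules.
VOWELS = set("aAiIuUeEoO")

def split_syllables(text):
    """Cut after each vowel run; trailing consonants join the last syllable."""
    syllables, current = [], ""
    for i, ch in enumerate(text):
        current += ch
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in VOWELS and nxt not in VOWELS:   # end of a vowel run
            syllables.append(current)
            current = ""
    if current:                                  # e.g. a word-final consonant
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables

print(split_syllables("bhagavad"))   # ['bha', 'ga', 'vad']
```

Note that the resulting syllables have different lengths, so any algorithm working on the raw character stream must scan a variable number of units per syllable.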
ISCII and Unicode
representations for the above string do preserve linguistic information
but the codes cannot be directly rendered and the display will have to
be composed by putting together one or more shapes consistent with the
syllable being rendered. The real issue however is the mapping to be effected
when going from ISCII or Unicode to the glyphs in the font. The mapping
will involve complex rules depending on the conventions used in the writing
system which can surprisingly vary even for a given script. The mapping
rules are generally required to be built into the application. This poses
real difficulties for those who write the software. Two different applications
following two different rendering conventions will produce text that will
be incompatible between the applications. Please visit the section on the detailed
presentation of Unicode for Indian languages for additional information.
Problems in using variable-length (multibyte) representations
The variable number
of bytes used in representing a syllable also poses peculiar problems in
text processing. Let us consider the problem of identifying a palindrome
in Sanskrit. Given below are some palindromes familiar to our viewers from
the Learn Sanskrit series of on-line lessons.
A glance at the representations
is enough to convince one of the futility of attempting standard algorithms
for a solution. The palindrome is immediately recognized when seen as a
series of syllables but not when seen in terms of the codes.
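The contrast can be sketched directly; the transliterated syllable list below is an illustrative stand-in, not one of the palindromes from the lessons referred to above:

```python
# A sketch contrasting the syllable-level and character-level views of a
# palindrome. The same reversal test succeeds on the list of syllables
# but fails on the flat character sequence underlying it.
def is_palindrome(units):
    return units == units[::-1]

syllables = ["ya", "ma", "ra", "ma", "ya"]
print(is_palindrome(syllables))        # True at the syllable level

text = "".join(syllables)              # "yamaramaya"
print(is_palindrome(list(text)))       # False at the character level
```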
There are other problems
too which relate to ambiguities in interpreting a string where a series
of eight-bit codes carrying rendering information are present in the text.
What we have tried to emphasize here is that meaningful text processing
in any of the Indian languages can be achieved only if the internal representation
allows direct identification of syllables (Aksharas).
As of March 2005,
none of the existing font based coding schemes for Indian languages satisfies
the linguistic processing requirement. ISCII and Unicode at least have
some structure that might help identify the aksharas but even these run
into problems as we will see below.
Problems faced due to assignment of codes to character shapes
Unicode and ISCII
have provided a few codes which do not carry any linguistic information
but direct the rendering process to force some rules while rendering text.
Thus both the schemes mix linguistic content with rendering information.
Extracting linguistic content from a string with such a mix requires
extensive context dependent processing, something one cannot easily handle
at the application level. Contrary to the basic principle behind Unicode
where text representation should be clearly separated from the rendering
process, Unicode does show a departure in respect of South Asian scripts.
What is the significance
of this observation?
Very simply, applications
will not be able to identify text strings for their linguistic content
when string matching is involved (regular expression matching, if you
will). Just try and figure out a suitable regular expression to match
all the strings shown in the illustration below!
If you are wondering
as to how this text was created in the first place, the header in the window
should provide the clue. You can download the associated file (aditya.txt)
) and try the string matching yourself in any application that supports
Unicode text processing with Indian languages.
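To give a feel for the difficulty, here is a much-simplified regular expression for a single Devanagari akshara. The consonant and matra ranges are taken from the Unicode Devanagari chart, but the real orthographic rules (independent vowels, vowel modifiers, vocalic matras) need considerably more than this sketch:

```python
import re

# A much-simplified pattern for one Devanagari akshara: any number of
# "dead" consonants (consonant + optional nukta + virama, optionally
# followed by ZWJ/ZWNJ), then a live consonant and an optional matra.
CONS = "[\u0915-\u0939]"
AKSHARA = re.compile(
    f"(?:{CONS}\u093C?\u094D[\u200C\u200D]?)*"   # leading dead consonants
    f"{CONS}\u093C?"                             # live consonant
    "[\u093E-\u094C]?"                           # optional matra
)

for s in ("\u0915\u093E",               # KA + AA matra
          "\u0915\u094D\u0937",         # conjunct KA + virama + SSA
          "\u0915\u094D\u200D\u0937"):  # the same with a ZWJ inserted
    print(bool(AKSHARA.fullmatch(s)))   # True in all three cases
```

Even this toy pattern must anticipate the zero-width characters; matching every visually equivalent encoding of a real text is far harder.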
Unicode Strings do not necessarily constitute valid Linguistic content.
Both ISCII and Unicode
include codes for the medial vowel representations. A medial vowel representation
(Matra) does not carry any linguistic information by itself. That is, one
has to make sure that a consonant or conjunct precedes the Matra. It is
quite easy to setup a Unicode string in Devanagari or other Indian scripts
to display a Matra by itself on the wrong side of a consonant giving one
an impression that a particular syllable is being shown. Internally, it
would be a different story altogether. Here is an example.
The corresponding file
mahodaya.txt may be downloaded and the rendering
variations checked on your system.
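A sketch of the kind of validity check an application is forced to perform is shown below. The code ranges come from the Unicode Devanagari chart, but the rule (a matra must follow a consonant) is deliberately simplified and ignores conjunct and nukta cases:

```python
# A minimal validity check: a Devanagari matra (dependent vowel sign,
# U+093E..U+094C) must be preceded by a consonant (U+0915..U+0939).
# Real rules are richer, so this is only an illustration of the kind
# of context-dependent checking that Unicode pushes onto applications.
def matras_valid(s):
    for i, ch in enumerate(s):
        if 0x093E <= ord(ch) <= 0x094C:                      # a matra
            if i == 0 or not (0x0915 <= ord(s[i - 1]) <= 0x0939):
                return False                                 # dangling matra
    return True

print(matras_valid("\u0915\u093E"))   # KA + AA matra   -> True
print(matras_valid("\u093E\u0915"))   # matra before KA -> False
```

Nothing in Unicode itself prevents the second string from being created, stored, and rendered.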
The provisions in
Unicode and ISCII to compose a syllable can result in extremely difficult
situations for the application handling the text. The real problem one
faces in practice is that the application is required to handle part of
the rendering by querying the system to check if the font used (typically
an open type font) supports the display form sought. Such a requirement
poses difficulties for software developers. It would be so much better
if an application can just prepare the string and ask the system to render
it, as in standard ASCII.
In ISCII, "INV", the code representing an invisible consonant, and the
"Nukta", a code set apart for composing dotted consonants and some other
ligatures, are examples of
rendering information built into a code. Unicode runs into additional problems
as well because it provides codes for ligatures that would not qualify
for linguistic content by themselves. These coupled with characters such
as the zero width joiner, zero width non-joiner etc., can cause serious
headaches to the text processing applications if the displayed text was
composed using these codes. This is what the example cited above illustrates,
where identical displays do not have identical internal representations.
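The point that similar-looking text can have distinct internal representations is easy to verify. Depending on the font, the two sequences below may render as the same conjunct or as a half-form; either way they are distinct strings, and Unicode normalisation does not make them equal:

```python
import unicodedata

# Two code sequences for KA + virama + SSA. The second carries a
# ZERO WIDTH JOINER, a pure rendering hint with no linguistic content,
# yet it survives NFC normalisation and defeats plain string comparison.
s1 = "\u0915\u094D\u0937"          # KA, VIRAMA, SSA
s2 = "\u0915\u094D\u200D\u0937"    # KA, VIRAMA, ZWJ, SSA

print(s1 == s2)                                  # False
print(unicodedata.normalize("NFC", s1) == s1)    # True: already composed
print(unicodedata.normalize("NFC", s2) == s2)    # True: the ZWJ survives
```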
Coming back to our problem,
the ISCII INV code is special as it represents a way of displaying a consonant.
The INV cannot be viewed as part of a syllable since it refers to a shape
in this case. As mentioned above, one must look at the context in which
the INV code is used before dealing with it. Applications which interpret
ISCII text often have problems rendering the strings so as to allow proper
transliteration across the scripts. ILEAP, the multilingual offering from
CDAC is one of the few known applications handling ISCII. This application
does run into problems in transliteration of strings which include the
INV and the Nukta codes. Viewers familiar with ILEAP may want to try this
for themselves by downloading the associated .aci file which can be inserted
into an ILEAP document iscii_ex.aci .
In Unicode, values are assigned
only for the basic vowels, consonants and the vowel extensions or medial
vowels. (Please refer to the chart for Unicode assignments for Devanagari.)
Though fundamentally Unicode aims at separating the text representation
from the rendering of the display, discrepancies such as illustrated above
create difficulties in practice. In any electronic text processing, it
is important to avoid context dependent identification of text. Where
one letter of the alphabet maps into one font glyph, the context problem
does not arise. For Indian scripts, where a conjunct character is often
built from several glyphs, identifying a context will be nothing short
of a nightmare!
Problems with South Indian scripts (also, collation issues)
Let us now go over to a few
other difficulties with the present assignment of Unicode for some of the
South Indian languages, specifically Tamil. The string shown below may
be examined. Unicode allows the string to be generated using six characters.
This example alone is enough
to establish the need for text representation in terms of linguistic units
or aksharas. One is not surprised that ancient wisdom in India emphasized
the need to utter the sounds properly after looking at its representation
as an akshara. Thus, building up a composite shape for an akshara from
the basic units (i.e., shapes for vowels, consonants and vowel extensions)
was a process that was learnt over a period of time but once understood,
a person had no difficulty in hearing the sound in a shape.
Our next observation about
Unicode has to do with the sorting order of aksharas in different
languages. The casualty in this case is Tamil, though one runs into related
problems even for Devanagari. The basic consonants in Tamil are eighteen
and the accepted lexical ordering is given below.
The Unicode assignment differs
from the established convention. This probably was not the intention of
those who assigned Unicode values to the Tamil letters but resulted as
a consequence of fitting the aksharas of other languages into the basic
framework setup for Sanskrit.
The view expressed above has been contested by others, specifically the proponents
of Unicode, who maintain that encoding schemes cannot be expected to provide
compatibility with the lexical ordering sought. According to them,
it is the responsibility of the application software to meaningfully handle
the linguistic issues connected with the application. While one cannot deny
the correctness of this view, the question of whether such applications
can be written at all remains to be answered. Interested readers may visit
the page where we have discussed this in detail.
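A sketch of what such an application-level fix looks like: sorting Tamil consonants with an explicit rank table rather than by raw code point. The order string is our rendering of the traditional eighteen-consonant sequence; treat the sketch as illustrative rather than authoritative:

```python
# Sorting Tamil consonants by a stated lexical order instead of raw
# code-point order. Note that NA (U+0BA9) precedes PA (U+0BAA) in code
# points but comes last in the traditional ordering used here.
TAMIL_ORDER = "கஙசஞடணதநபமயரலவழளறன"
RANK = {ch: i for i, ch in enumerate(TAMIL_ORDER)}

def tamil_key(s):
    # Characters outside the table sort after the known consonants.
    return [RANK.get(ch, len(RANK)) for ch in s]

letters = ["ன", "க", "ழ", "ச"]
print(sorted(letters))                 # code-point order: ['க', 'ச', 'ன', 'ழ']
print(sorted(letters, key=tamil_key))  # lexical order:    ['க', 'ச', 'ழ', 'ன']
```

This works for a toy list of single letters; extending it to full words with matras, conjuncts, and zero-width characters is where the real difficulty lies.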
Does the Unicode representation allow direct transliteration across Indian languages?
The answer, sadly, is a NO.
Transliteration would be correct only when syllables can be properly identified.
Unicode values for the aksharas of different languages do not always match
in terms of their index within the assigned set of codes. Transliteration
will have to be attempted based on the context and should take into account
the presence of modifier codes, a near impossible task if the transliterated
display should look right and convey the same linguistic content. Even
granting that across the Indian scripts one could attempt to use large
conversion tables, script specific Unicode assignments will cause difficulties.
Worse still, it will not be easy to provide transliteration into Roman
with diacritics, something scholars all over the world have used for representing
text in different Indian languages. This despite the fact that Unicode
supports a full complement of IPA symbols!
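To make the difficulty concrete, here is a naive transliteration sketch that exploits the common ISCII-derived layout of the Unicode Indic blocks by shifting each code point by the block offset. It works for letters both scripts share and fails exactly where the scripts differ, which is the problem described above:

```python
# Naive Devanagari-to-Tamil transliteration by code-point shifting.
# The Devanagari block starts at U+0900 and the Tamil block at U+0B80,
# so shared letters line up under a constant offset of 0x280. Letters
# Tamil does not encode, e.g. KHA (U+0916), land on unassigned slots.
OFFSET = 0x0B80 - 0x0900   # = 0x280

def deva_to_tamil(s):
    return "".join(chr(ord(ch) + OFFSET) if 0x0900 <= ord(ch) <= 0x097F else ch
                   for ch in s)

print(deva_to_tamil("\u0915\u092E\u0932"))    # KA MA LA -> கமல
print(hex(ord(deva_to_tamil("\u0916"))))      # KHA -> 0xb96, unassigned in Tamil
```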
Having said so much about
the inadequacies of ISCII and Unicode, we should also examine the real
feasibility of a syllable level representation. The akshara as a
representation of a sound is a unique concept, though the same sound may
be given different written shapes based on the script. So one has to identify
the set of sounds in a language and assign codes to them in such a way
that each sound may be distinguished uniquely and ordered properly according
to the lexical order. For linguistic purposes, it may be necessary to break
a sound into its basic component sounds (vowels and consonants). For efficient
string processing, the assigned codes must all be of the same size. Variable
length representation for a syllable does not help in any way to write
good algorithms which can also be efficient when implemented.
Just how many syllables are
required to be coded is an interesting question, for as one might guess,
there are countless possibilities of combinations of consonants and vowels.
Yet, over the period that our languages have seen good use, approximately
eight hundred to a thousand syllables are seen. One has to merely
look at a dictionary and count the different aksharas to arrive at a meaningful
number. For Sanskrit and many other Indian languages, this number as indicated
above is approximately eight hundred to a thousand basic syllables, i.e.,
aksharas consisting of only consonants. With the possibility of each conjunct
combining with any one of the vowels, the total number will be many thousands.
(The noted exception to this is Tamil where a conjunct is always
written by splitting it into its basic consonants).
By carefully examining texts
in different languages, the development team at IIT Madras has identified
about 800 conjuncts which are individually used (along with a vowel of
course). The coding scheme recommended for use is a sixteen bit value for
each of the 13000 or so individually identifiable syllables. The code has
been designed in such a way as to quickly reveal the basic consonant and
vowel forming the syllable and also identify the other consonants, should
there be any. The way syllables are reckoned in Indian scripts is explained
in a separate page.
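The idea of a structured fixed-width code can be sketched with a hypothetical bit layout. The field widths below are our own illustration, NOT the actual IIT Madras assignment:

```python
# A hypothetical 16-bit akshara code with readable bit fields: 6 bits
# for a conjunct id, 6 bits for the base consonant, 4 bits for the vowel.
# The point is that every syllable occupies exactly two bytes and its
# constituents can be read off without scanning any context.
def pack(conjunct_id, consonant, vowel):
    return (conjunct_id << 10) | (consonant << 4) | vowel

def unpack(code):
    return (code >> 10) & 0x3F, (code >> 4) & 0x3F, code & 0x0F

code = pack(3, 17, 5)
print(f"{code:#06x}")        # one fixed-width 16-bit value per syllable
print(unpack(code))          # (3, 17, 5): fields recovered directly
```

With such a layout, string algorithms work on uniform two-byte units, and sorting or palindrome tests reduce to comparisons on the codes themselves.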
For those who think in binary,
a sixteen-bit code allows up to 65536 possibilities, and that many will
never really be required. The IIT Madras coding scheme has structured the
sixteen bits in such a way that only the specified syllables will be recognized
by the processing utilities. As of now, most applications which have allowed
Indian languages to be handled, have used only a font based or a Unicode
based representation for the aksharas. Such applications will not be able
to interpret the text prepared using the IIT Madras
software. This is not a serious problem since there are not many applications
that have really enabled Indian language usage. The IIT Madras software
includes many different applications which can be used right away. Hence,
the solution offered by IIT Madras should be viewed as not merely a feasible
approach to dealing with the problem of coding Indian language characters
but one which meets both requirements, viz., language enabling as well as linguistic text processing.
A syllable level coding scheme
will have several other advantages. Applications may choose from a variety
of fonts for display and printing and also freely transliterate across
scripts, thus allowing multilingual preparation of documents with the same
text shown in many languages. The common format will also come in handy
for preparing material for the web, where interaction may also be provided
on a web page, as may be seen from the on-line demos at this site. The
syllable level representation is amazingly compact when one thinks of storage
occupied. The whole text of the Bhagavadgita, with seven hundred slokas
in "anushtup" chandas, will require only twenty four kilobytes of storage!
It is a different matter altogether that these twenty-four kilobytes may require
several hundreds of Kilobytes of commentaries for a person to understand
the purpose and meaning of the seven hundred slokas. That Sanskrit has
the ability to compact much information in its syllables is really true.