Unicode: A Viewpoint
The Multilingual Systems
project at IIT Madras was started around the time ISCII had evolved into
a standard. It was clear to the development team that although ISCII was
conceived as the basis for syllabic representation of text in Indian languages,
one had to reckon with the need to process a variable number of bytes to identify
a proper syllable. A variable-length code makes text processing very
complex, especially in the presence of codes that have no linguistic
significance but are required for rendering the syllable correctly.
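The variable-length problem can be illustrated with a short sketch. The
Python code below is not the project's method; it is a deliberately naive
syllable splitter for Devanagari, written against Unicode code points rather
than ISCII bytes, and the function names and the simplified consonant range
are assumptions made for illustration only. It shows that one syllable may
occupy anywhere from one to many codes, and that joiner codes with no
linguistic content must still be carried along.

```python
import unicodedata

VIRAMA = "\u094D"               # DEVANAGARI SIGN VIRAMA (halant)
JOINERS = {"\u200C", "\u200D"}  # ZWNJ and ZWJ: rendering hints only


def is_consonant(ch):
    # Simplifying assumption: only the basic range KA..HA is treated as consonants.
    return "\u0915" <= ch <= "\u0939"


def is_dependent(ch):
    # Vowel signs, virama, nukta, anusvara and visarga are combining marks.
    return unicodedata.category(ch) in ("Mn", "Mc")


def split_syllables(text):
    """Naively group code points into syllable-like clusters."""
    syllables = []
    current = ""
    for ch in text:
        if not current:
            current = ch
        elif ch in JOINERS or is_dependent(ch):
            current += ch
        elif is_consonant(ch) and current.rstrip("\u200C\u200D").endswith(VIRAMA):
            # A consonant after a virama continues the same conjunct.
            current += ch
        else:
            syllables.append(current)
            current = ch
    if current:
        syllables.append(current)
    return syllables


# "ka" is a single code; "strii" is one syllable spread over six codes.
for word in ("\u0915", "\u0938\u094D\u0924\u094D\u0930\u0940"):
    for syl in split_syllables(word):
        print(len(syl), "code points,", len(syl.encode("utf-8")), "UTF-8 bytes")
```

Even this toy version has to look backwards past joiner codes before it can
decide whether a consonant starts a new syllable, which is the kind of
bookkeeping a fixed-width syllable code avoids.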
In recent years, software
developers have indeed given serious thought to supporting Unicode for
Indian languages. Unicode for Indian languages has essentially evolved from
ISCII and has retained the essence of its eight-bit coding scheme, though
script-specific codes have been assigned for the different scripts. The world over,
there has been a continuing debate about the real suitability of Unicode
for applications in Indian languages, but the open commitment given by Microsoft
has led many developers to fall in line with Unicode.
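The ISCII heritage is visible in the layout of the Unicode code charts. As
a small sketch, the Python code below looks up the character at the same
offset in each of nine Indic script blocks, using the standard unicodedata
module. The table name BLOCK_STARTS and the helper letter_ka_names are
hypothetical names chosen for this illustration; the block start values are
those of the published Unicode charts.

```python
import unicodedata

# Start of each 128-code block in the Unicode charts (one block per script).
BLOCK_STARTS = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}

KA_OFFSET = 0x15  # position of the letter KA within each block


def letter_ka_names():
    """Return the Unicode name found at the KA offset of every block."""
    return {
        script: unicodedata.name(chr(start + KA_OFFSET))
        for script, start in BLOCK_STARTS.items()
    }


for script, name in letter_ka_names().items():
    print(f"{script:<11} U+{BLOCK_STARTS[script] + KA_OFFSET:04X}  {name}")
```

Each block repeats the same arrangement of letters, so the script is selected
by the block and the letter by the offset within it, much as ISCII used one
eight-bit table with a script attribute.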
From the very beginning,
the Multilingual Systems project at IIT Madras had seen the futility of
attempting to do text and linguistic processing with variable length codes
for syllables and had therefore evolved a uniform two-byte scheme to simplify such processing.
The question of adhering
to a meaningful standard where developers see distinct advantages is an
important issue, but a standard becomes meaningful only if most of what
we have successfully attempted earlier can be accommodated. In this respect,
Unicode for Indian languages does pose fairly serious challenges, and to
date (March 2005) no satisfactory implementation of a useful application
can be cited as an example.
The purpose of this article is not to present an argument against using Unicode
but to bring out the real difficulties in coping with its implementation
for Indian languages.
Many of the complexities
involved in rendering Unicode text through Uniscribe (Microsoft's shaping
engine) or equivalent interfaces will be taken up one by one and the difficulties
faced in linguistic processing will be explained. Where required, test
files have been included for viewers to download and verify the points made.
The information provided
here will probably convince the reader that it is quite difficult to work
with Unicode for Indian languages. Hence one should seriously consider
alternatives for text processing. On the use of Unicode for transporting
information across systems, however, there is sufficient consensus.