In the context of
internationalization and of uniform handling of text-based
information across the languages of the world, Unicode has gained considerable
importance. The fundamental concept behind Unicode is that the representation
of text retains the linguistic content that must be
conveyed while at the same time allowing this content to be displayed
in human-readable form. By catering to both these requirements, Unicode
has emerged as the best choice for representing text in a computer application,
specifically one that deals with multilingual content. Developers across
the world are committing themselves to providing Unicode support in all
their applications. Multilingual text
processing is one of the essential requirements when it comes to computerization
in India. Here, the development of applications requires that interactive
user interfaces in different regional languages be part of each application.
A specific regional language may be written in one or more scripts,
and a given script may be used for more than one language. A
very important issue, from a conceptual angle at least, is whether support
for a script is equivalent to support for a language. During
the initial phases of development of applications in Indian languages,
one was concerned more with the rendering aspects of text, a formidable
problem in itself on account of the syllabic writing system followed for
all the Indian languages. No one really felt compelled to take text
processing issues into consideration. The majority of the early applications required text
entry and display, with computation effected on numbers rather than on text
per se. It is not surprising, therefore, that whatever standardization was
attempted emphasized mostly the aspects of the writing system without
really catering to the linguistic requirements.
In essence, the standardization
mentioned above (ISCII and Unicode) requires context-dependent processing
of each character, as opposed to simple handling of a character by itself.
In Western scripts, the writing system employs a relatively small set of
shapes and symbols as this is sufficient to satisfy the requirement that
linguistic content as well as rendering information be exactly specified
through the same set of codes. Consequently, text processing could be comfortably
achieved using a small set of codes.
In respect of our
languages, the complexities of the writing systems demand that a large
number of written shapes (typically in the thousands) be used, though the linguistic
content may still be specified using a small set of codes for the vowels
and consonants (typically fewer than a hundred). Hence it is not possible
to use the same set of codes to satisfy both the requirements. In their
wisdom, the designers of ISCII, and subsequently of Unicode, essentially struck
a compromise in which the smaller set of codes was recommended. Yet
they yielded to the temptation of incorporating codes that carry rendering
information as well. These codes conveying rendering information take care
of Devanagari-derived writing systems but do not adequately address the
writing systems of the South.
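To make the distinction between codes and written shapes concrete, the
following Python sketch lists the Unicode code points stored for a single
written Devanagari syllable. The choice of syllable and the helper name
`describe` are illustrative assumptions; only Python's standard
`unicodedata` module is used.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One written syllable ("kshi"), stored as four separate codes:
# consonant KA, VIRAMA, consonant SSA, vowel sign I.
syllable = "\u0915\u094D\u0937\u093F"

for code, name in describe(syllable):
    print(code, name)
```

The reader sees one shape, yet the text holds four codes. The vowel sign is
stored last but is drawn to the left of the consonant cluster, which is one
reason why each code has to be processed in the context of its neighbours.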
The problem that we
face today, in respect of efficient representation of text in our languages,
is precisely one of not being able to do either effective linguistic processing
or meet the real requirements of the writing systems.
The Multilingual Systems Development Project at IIT Madras had taken the view
that efficient text processing is absolutely essential and is perhaps more
important than precise rendering of text, so long as ambiguities are avoided.
The consequence of this decision was that the coding structure should preserve
linguistic content as well as provide complete rendering information within
the flexibilities offered by the writing system. Such a coding scheme would
require syllables to be coded since the linguistic content is expressed
through syllables and the writing system displays syllables. The multilingual
software applications developed at IIT Madras have successfully demonstrated
that linguistic text processing at the syllable level is not only possible
but can also be accomplished by using conventional algorithms which work
with fixed size codes. In contrast with this, application development
with Unicode support has raised a number of issues which must be thoroughly
discussed and understood before one accepts Unicode as a viable standard
for computing with Indian languages.
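As a rough illustration of what processing at the syllable level with
fixed-size codes can look like, the Python sketch below splits a Devanagari
word into written syllables and maps each one to a number held in a
fixed-width array. This is not the coding scheme used at IIT Madras, whose
details are not given here. The simplified syllable pattern, the
`SyllableCoder` class and the code values it assigns are all hypothetical,
and the pattern covers only basic Devanagari.

```python
import re
from array import array

# Simplified, assumed pattern for a written Devanagari syllable:
#   (consonant + virama)* consonant [virama | vowel sign] [modifier]
#   or an independent vowel with an optional modifier.
CONS = "[\u0915-\u0939]\u093C?"  # consonant with optional nukta
SYLLABLE = re.compile(
    f"(?:{CONS}\u094D)*{CONS}(?:\u094D|[\u093E-\u094C])?[\u0901-\u0903]?"
    f"|[\u0905-\u0914][\u0901-\u0903]?"
)


def split_syllables(word):
    """Split a word into written syllables; unmatched characters are skipped."""
    return SYLLABLE.findall(word)


class SyllableCoder:
    """Hypothetical coder: assigns each distinct syllable one fixed-size code."""

    def __init__(self):
        self.codes = {}

    def encode(self, word):
        out = array("H")  # unsigned short, 16 bits on common platforms
        for syl in split_syllables(word):
            # Codes are handed out in order of first appearance (arbitrary).
            out.append(self.codes.setdefault(syl, len(self.codes)))
        return out


coder = SyllableCoder()
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # "kshatriya"
codes = coder.encode(word)
print(len(word), "code points ->", len(codes), "syllable codes")
```

Once a word is held as a sequence of equal-sized codes, one per syllable,
operations such as indexing, comparison and counting apply to syllables
directly, in the same way that they apply to characters in a Western script.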
In the light of the
above, the Systems Development Laboratory, IIT Madras, is pleased to share
with viewers the Lab's experiences in dealing with linguistic and
rendering issues of text in all the important scripts of India.