The system development effort at SDL emphasizes the need to work at the syllable
level when processing text in Indian languages. Unicode is
one approach to representing syllables. Other approaches, such as ISCII,
have been in use for many years. Transliteration-based
representations of text in our scripts likewise use
the letters of the Roman script to represent the syllables, again
in a multibyte manner.
As can be inferred from the
discussions in earlier sections of this monograph, there is little difference
between Unicode and a transliteration scheme. Schemes such as ITRANS,
RIT and many others have successfully dealt with
representing text in Indian languages. While it is true that many of these
programs do not provide interactive interfaces, adding the support required
is relatively straightforward. If this were really so, why has
no one implemented it before for Unicode? There are some good answers.
The Unicode standard provides guidelines for implementing the shaping engine, though not in
explicit terms. People believed (and still believe) that these guidelines
are sacred and that developers should strictly adhere to them. There
is no reason why we have to follow the guidelines if whatever we do in
practice satisfies the essential requirement of rendering the syllables,
the special symbols and punctuation.
The Unicode standard assumes that there can be no restrictions on syllable formation, so
any syllable should be permitted no matter what consonants are present.
Arbitrary syllables make little sense, and experience has shown that in
practice one encounters only a limited set, albeit numbering a few hundred.
The writing system does indeed provide for arbitrary syllables, by merely decomposing
them into generic consonants except for the last. Hence, if we can handle
these, we would be able to take care of most text processing.
The clue to doing this is to stop arbitrary syllable formation at the input
stage itself, i.e., during data entry.
Tools such as Lex and Yacc could be used to great advantage to parse the input
string (i.e., keystrokes) to generate tokens that map directly to the specified
set of syllables. The ITRANS package already has a complete definition
file for many scripts and can identify most of the syllables correctly.
Once the syllable is identified, it can be rendered by merely looking up
a table. As we can see, the table will have at most a thousand entries
(typically about six hundred), each corresponding to a base syllable. Syllables
that involve different vowels with the same base can be rendered using
matras, and only the exceptions need be remembered.
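The parsing step described above can be sketched in a few lines. The consonant and vowel inventories here are a tiny illustrative fragment, not the actual ITRANS definition files, and a real implementation would use a generated lexer rather than this hand-rolled loop:

```python
# Sketch of syllable-level tokenization of a Roman-transliterated input
# string, in the spirit of the Lex/Yacc approach described above.
# The inventories below are illustrative fragments only.

CONSONANTS = {"k", "kh", "g", "gh", "t", "th", "d", "n", "m", "r"}
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o"}

def tokenize(text):
    """Greedy longest-match split of a transliterated string into
    (consonant-cluster + vowel) syllable tokens."""
    syllables, i = [], 0
    while i < len(text):
        cluster = ""
        # absorb a run of consonants (the conjunct part of the syllable)
        while True:
            for length in (2, 1):          # longest match first
                if text[i:i+length] in CONSONANTS:
                    cluster += text[i:i+length]
                    i += length
                    break
            else:
                break                      # no consonant matched
        vowel = ""
        for length in (2, 1):
            if text[i:i+length] in VOWELS:
                vowel = text[i:i+length]
                i += length
                break
        if not cluster and not vowel:
            raise ValueError(f"cannot parse at position {i}: {text[i:]!r}")
        syllables.append(cluster + vowel)
    return syllables

print(tokenize("raama"))   # ['raa', 'ma']
```

Once a token such as 'raa' is identified, rendering reduces to looking it up in the syllable table; unknown clusters are rejected at entry time, which is precisely how arbitrary syllable formation is stopped during data entry.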
The main advantage to be gained in this case is that our syllables will not
contain the modifier characters in the input string, thus paving the way
for much better linguistic text processing. Should one so decide, each
syllable may be mapped to a unique integer based on the scheme
suggested by IIT Madras. This would give us two different internal
representations, one in terms of Unicode and the other in terms of fixed-length
syllable codes. A syllable can be rendered in any of many different
scripts (Unicode die-hards will never buy this) simply through table lookup.
Virtually any font that has the minimal set of
glyphs required to render text in the specified script can be used. Here we are deviating
from a convention that the font used should conform to the encoding used
in the text. This is justified on the grounds that for Indian scripts,
it is well nigh impossible to force an encoding standard for text where
a one-code-one-glyph mapping applies.
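The two internal representations can be related through a simple bidirectional table. The syllable inventory and numbering below are purely illustrative; they are not the actual IIT Madras scheme:

```python
# Sketch of a fixed-length internal representation: each permitted
# syllable gets a unique integer code. The inventory and numbering
# here are invented for illustration.

PERMITTED_SYLLABLES = ["ka", "kaa", "ki", "kii", "ma", "maa", "ra", "raa"]

SYLLABLE_TO_CODE = {s: i for i, s in enumerate(PERMITTED_SYLLABLES, start=1)}
CODE_TO_SYLLABLE = {i: s for s, i in SYLLABLE_TO_CODE.items()}

def encode(syllables):
    """Multibyte syllable strings -> fixed-width integer codes."""
    return [SYLLABLE_TO_CODE[s] for s in syllables]

def decode(codes):
    """Fixed-width integer codes -> syllable strings."""
    return [CODE_TO_SYLLABLE[c] for c in codes]

# round trip between the two internal representations
assert decode(encode(["raa", "ma"])) == ["raa", "ma"]
```

Because every code is a single integer, string operations such as searching, sorting and counting syllables become fixed-width operations, which is the attraction of this second representation.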
What we are doing here is
essentially implementing the rules of the writing system: first identifying
the syllables at the input stage itself, then completing the rendering process
by simple table lookup. If we change the font, we simply change the table.
For the same font, we can use different tables at different times to get
different representations for the same syllable. In effect, we will have
our own Uniscribe which can dynamically be configured to work with a script
and any appropriate font. There will be very few restrictions on the font
itself, except that zero-width glyphs will have to be correctly rendered.
Fortunately, X11 under Linux does a good job of this. Introducing
a new script for a language simply involves the use of an appropriate font
and a table mapping the syllables to the glyphs.
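A minimal sketch of the "change the font, change the table" idea follows. The glyph names are invented placeholders, not codes from any real font:

```python
# Sketch: the same stored syllable string is rendered through whichever
# syllable-to-glyph table is currently configured. Swapping script (or
# font) means swapping the table; the stored text never changes.
# Glyph names are placeholders invented for this sketch.

DEVANAGARI_TABLE = {"raa": "<dev_ra><dev_aa_matra>", "ma": "<dev_ma>"}
TAMIL_TABLE      = {"raa": "<tam_ra><tam_aa_matra>", "ma": "<tam_ma>"}

def render(syllables, table):
    """Map each syllable to its glyph sequence via the active table."""
    return "".join(table[s] for s in syllables)

word = ["raa", "ma"]           # stored representation, script-neutral
print(render(word, DEVANAGARI_TABLE))
print(render(word, TAMIL_TABLE))
```

The same `word` renders in either script without being re-encoded, which is the dynamic reconfiguration claimed for our Uniscribe-like layer.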
Most of our multilingual
requirements, such as transliteration across scripts, uniform data entry
for all the languages and, most importantly, a uniform approach to linguistic
processing in all the scripts, can be comfortably met if we take this
approach. Cut and paste across applications
will require that we maintain a backing string and map the blocked text
to portions in this string. GTK allows us to do this effectively. The multilingual
editor for Linux from IIT Madras allows you to change the script on
screen and allows effortless cut/copy and paste without disturbing the
stored representation of the syllables.
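The backing-string idea can be sketched as follows. The data structures are illustrative only and do not use the GTK API:

```python
# Sketch of a backing string for cut/copy/paste: each displayed glyph
# run is mapped back to the syllable it came from, so a selection on
# screen copies whole syllables from the stored representation, never
# raw glyph fragments. Structures are invented for this sketch.

syllables = ["raa", "ma", "na"]               # stored representation
glyphs    = ["<g_raa>", "<g_ma>", "<g_na>"]   # what is drawn on screen

# spans[i] = (start, end) offsets of syllable i inside the display string
spans, offset = [], 0
for g in glyphs:
    spans.append((offset, offset + len(g)))
    offset += len(g)

def copy_selection(sel_start, sel_end):
    """Return the syllables whose display spans intersect the selection."""
    return [s for s, (a, b) in zip(syllables, spans)
            if a < sel_end and sel_start < b]

# selecting the middle glyph run on screen yields the whole syllable
assert copy_selection(spans[1][0], spans[1][1]) == ["ma"]
```

Because the clipboard payload is taken from `syllables`, pasting into another view (possibly rendered in a different script) leaves the stored representation undisturbed.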
An input module may be provided
to developers, essentially a character input facility along
the lines of getchr(). This module will be called by an application to
input text. The module will return syllables in multibyte form or if necessary
in a fixed width form. The syllables will be easy to work with from a linguistic
angle, since no modifier codes will be present. A reasonable amount of
equivalence in terms of code values across scripts may be possible now
and transliteration may be more easily accomplished. Applications need
not switch keyboards since the syllables will be common to all the languages.
Only the font and the associated table will differ.
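A getchr()-style syllable input module might look like the following sketch, with the keystroke parsing reduced to a toy longest-match lookup against a permitted-syllable set (a real module would run the full Lex/Yacc parser over live keystrokes):

```python
# Sketch of an input module that returns one syllable per call instead
# of one character. The syllable set and input string are illustrative.

class SyllableInput:
    def __init__(self, keystrokes, syllable_set):
        self.buf = keystrokes          # buffered keystrokes
        self.syllables = syllable_set  # permitted syllables
        self.pos = 0

    def get_syllable(self):
        """Longest-match the next syllable; None at end of input."""
        if self.pos >= len(self.buf):
            return None
        for length in range(min(4, len(self.buf) - self.pos), 0, -1):
            cand = self.buf[self.pos:self.pos + length]
            if cand in self.syllables:
                self.pos += length
                return cand
        raise ValueError(f"no syllable at position {self.pos}")

src = SyllableInput("raama", {"raa", "ma", "ra", "a"})
out = []
while (s := src.get_syllable()) is not None:
    out.append(s)
print(out)   # ['raa', 'ma']
```

An application calling `get_syllable()` never sees modifier codes, and the same call works regardless of which language is being entered; only the rendering table downstream differs.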
OpenType fonts can be avoided.
Unicode fonts using the private use range E000-E9FF may still be used for rendering text.
We will need at most about 240 glyphs for each script to get a very satisfactory
(and reasonably complete) set of ligatures displayed. So we can actually
have one single Unicode font catering to all the nine scripts in this range.
In fact, one can go back from the glyph codes to the syllables by parsing
the glyph string with a parser that may easily be written using Lex and
Yacc, much the same way as recommended earlier.
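Going back from glyph codes to syllables is essentially the inverse table walk. The code point assignments in the E000 range below are invented for this sketch:

```python
# Sketch of recovering syllables from a private-use glyph string by
# inverting the rendering table and longest-matching glyph runs.
# The E000-range code points are invented for illustration.

RENDER = {"raa": "\ue021\ue045", "ma": "\ue030"}   # syllable -> glyphs
REVERSE = {v: k for k, v in RENDER.items()}        # glyphs -> syllable

def glyphs_to_syllables(glyph_string):
    out, i = [], 0
    while i < len(glyph_string):
        for length in range(2, 0, -1):             # longest glyph run first
            run = glyph_string[i:i+length]
            if run in REVERSE:
                out.append(REVERSE[run])
                i += length
                break
        else:
            raise ValueError(f"unparsable glyph at index {i}")
    return out

assert glyphs_to_syllables("\ue021\ue045\ue030") == ["raa", "ma"]
```

A production version would be a generated parser over the glyph alphabet, but the principle is the same: since each syllable maps to a known glyph run, the mapping can be run in reverse.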