Application Development under Linux
(Unicode Support)
The Multilingual System Development effort at SDL emphasizes the need to work at the syllable level when processing text in Indian languages. Unicode is one approach to representing syllables. Other approaches, such as ISCII, have been in use for many years. In a transliteration-based representation of text in our scripts, the letters of the Roman script are used to represent the syllables, again in a multibyte manner.

As can be inferred from the discussions in earlier sections of this monograph, there is little difference between Unicode and a transliteration scheme. Programs such as ITRANS and RIT, along with many other transliteration schemes, have successfully dealt with representing text in Indian languages. While it is true that many of these programs do not provide interactive interfaces, adding the required support is relatively straightforward. If this is really so, why has no one implemented it for Unicode before? We have some good answers here.

Unicode provides guidelines for implementing the shaping engine, though not in explicit terms. People believed (and still believe) that these guidelines are sacred and developers should strictly adhere to them. There is no reason why we have to follow the guidelines if whatever we do in practice satisfies the essential requirement of rendering the syllables, the special symbols and punctuation.

Unicode assumes that there can be no restrictions on syllable formation, and so any syllable should be permitted no matter what consonants are present. Arbitrary syllables make little sense, and experience has shown that in practice one encounters only a limited set, albeit numbering a few hundred. The writing system does indeed provide for arbitrary ones by merely decomposing them into generic consonants except for the last. Hence, if we can handle this limited set, we would be able to take care of most text processing. The clue to doing this is to stop arbitrary syllable formation at the input stage itself, i.e., during data entry.

Tools such as Lex and Yacc could be used to great advantage to parse the input string (i.e., keystrokes) and generate tokens that map directly to the specified set of syllables. The ITRANS package already has a complete definition file for many scripts and can identify most of the syllables correctly. Once a syllable is identified, it can be rendered by merely looking up a table. The table will have at most a thousand entries (typically about six hundred), each corresponding to a base syllable. Syllables which combine the same base with different vowels can be rendered by using matras, and only the exceptions need be remembered.
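The parsing step described above can be sketched in a few lines of Python in place of Lex and Yacc. This is only an illustration under stated assumptions: the consonant and vowel lists and the set of permitted bases are hypothetical and much reduced, and they are not the actual ITRANS definitions. The sketch shows longest-match tokenizing of keystrokes into syllables and the rejection of consonant clusters that are not in the permitted set.

```python
# Hypothetical, much-reduced transliteration alphabet (NOT the real ITRANS tables).
CONSONANTS = ["kh", "k", "g", "t", "n", "m", "r", "s"]
VOWELS = ["aa", "ii", "uu", "a", "i", "u", "e", "o"]

# Hypothetical set of permitted bases (consonant clusters); () is a bare vowel.
KNOWN_BASES = {(), ("k",), ("kh",), ("t",), ("r",), ("t", "r"), ("s", "t")}


def _match(text, pos, units):
    """Return the longest unit from `units` that occurs at text[pos:], or None."""
    best = None
    for unit in units:
        if text.startswith(unit, pos) and (best is None or len(unit) > len(best)):
            best = unit
    return best


def tokenize(keys):
    """Split a keystroke string into (consonant_cluster, vowel) syllables.

    Raises ValueError for unknown keys or for clusters outside KNOWN_BASES,
    so arbitrary syllables are stopped at the input stage.
    """
    pos, syllables = 0, []
    while pos < len(keys):
        cluster = []
        while (c := _match(keys, pos, CONSONANTS)) is not None:
            cluster.append(c)
            pos += len(c)
        vowel = _match(keys, pos, VOWELS)
        if vowel is not None:
            pos += len(vowel)
        elif not cluster:
            raise ValueError(f"unexpected key {keys[pos]!r} at position {pos}")
        base = tuple(cluster)
        if base not in KNOWN_BASES:
            raise ValueError(f"syllable base {base} is not in the permitted set")
        syllables.append((base, vowel))  # vowel is None for a trailing consonant
    return syllables
```

With these toy tables, tokenize("kaatri") yields two syllables, the base "k" with the vowel "aa" and the base "t"+"r" with the vowel "i", while an input such as "ktri" is rejected at entry.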

The main advantage to be gained in this case is that our syllables will not contain modifier characters in the input string, thus paving the way for much better linguistic text processing. Should one so decide, each syllable may be mapped to a unique integer based on the scheme suggested by IIT Madras. This would give us two different internal representations, one in terms of Unicode and the other in terms of fixed-length syllable codes. A syllable can be rendered in one of many different scripts (Unicode die-hards won't ever buy this) simply through table lookup. Virtually any font can be used so long as it has the minimal set of glyphs required to render text in the specified script. Here we are deviating from the convention that the font used should conform to the encoding used in the text. This is justified on the grounds that, for Indian scripts, it is well-nigh impossible to enforce an encoding standard for text in which a one-code-one-glyph mapping applies.
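As an illustration of what a fixed-length syllable code could look like, the sketch below packs a base index and a vowel index into a single integer. The bit layout (12 bits for the base, 4 bits for the vowel) is purely an assumption made for this example; it is not the scheme suggested by IIT Madras, whose details are not reproduced here.

```python
# Assumed layout for illustration only: 12 bits of base index, 4 bits of vowel index.
BASE_BITS = 12   # room for up to 4096 bases (the text expects about 600 to 1000)
VOWEL_BITS = 4   # room for up to 16 vowel slots


def pack(base_index, vowel_index):
    """Combine a base index and a vowel index into one 16-bit syllable code."""
    if not 0 <= base_index < (1 << BASE_BITS):
        raise ValueError("base index out of range")
    if not 0 <= vowel_index < (1 << VOWEL_BITS):
        raise ValueError("vowel index out of range")
    return (base_index << VOWEL_BITS) | vowel_index


def unpack(code):
    """Recover (base_index, vowel_index) from a packed syllable code."""
    return code >> VOWEL_BITS, code & ((1 << VOWEL_BITS) - 1)
```

Because every syllable occupies the same width, a string of such codes can be indexed and compared syllable by syllable without scanning for modifier characters.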

What we are doing here is essentially implementing the rules of the writing systems by first identifying the syllables at the input stage itself and completing the rendering process by simple table look up. If we change the font, we simply change the table. For the same font, we can use different tables at different times to get different representations for the same syllable. In effect, we will have our own Uniscribe which can dynamically be configured to work with a script and any appropriate font. There will be very few restrictions on the font itself except that zero width glyphs will have to be correctly rendered. Fortunately, X11 under Linux does a good job of this. Introducing a new script for a language simply involves the use of an appropriate font and a table mapping the syllables to the glyphs.
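The table-driven rendering described above might be sketched as follows. The two tables and all glyph numbers are invented for the example and do not correspond to any real font. The sketch also ignores glyph reordering (some matras are drawn before the base), which a real table would have to encode.

```python
# Hypothetical glyph tables for two fonts; all glyph numbers are made up.
FONT_A = {
    "bases": {("k",): [10], ("t", "r"): [27]},
    "matras": {"a": [], "aa": [60], "i": [61]},
}
FONT_B = {
    "bases": {("k",): [110], ("t", "r"): [127]},
    "matras": {"a": [], "aa": [160], "i": [161]},
}


def render(syllables, table):
    """Map (base, vowel) syllables to a flat list of glyph numbers via lookup."""
    glyphs = []
    for base, vowel in syllables:
        glyphs.extend(table["bases"][base])    # glyph(s) for the consonant base
        glyphs.extend(table["matras"][vowel])  # glyph(s) for the vowel sign, if any
    return glyphs


# The stored syllables stay the same; only the table changes with the font.
TEXT = [(("k",), "aa"), (("t", "r"), "i")]
glyphs_a = render(TEXT, FONT_A)
glyphs_b = render(TEXT, FONT_B)
```

Changing the font or the script amounts to passing a different table to render; the syllable list itself is never touched.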

Most of our multilingual requirements, such as transliteration across scripts, uniform data entry for all the languages and, most importantly, a uniform approach to linguistic processing in all the scripts, can be comfortably met if we take this approach.

Cut and paste across applications will require that we maintain a backing string and map the selected (blocked) text to portions of this string. GTK allows us to do this effectively. The multilingual editor for Linux from IIT Madras allows you to change the script on screen and permits effortless cut, copy and paste without disturbing the stored representation of the syllables.
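The idea of a backing string can be illustrated without any toolkit code. The sketch below uses no GTK calls; the class and method names are hypothetical. It assumes that every syllable is displayed with at least one glyph, and it maps a selection expressed in glyph positions back to the stored syllables.

```python
from bisect import bisect_right


class BackedLine:
    """A displayed line plus the backing list of syllables it was rendered from."""

    def __init__(self, syllables, glyph_counts):
        # glyph_counts[i] is the number of glyphs used to draw syllables[i] (>= 1).
        self.syllables = list(syllables)
        self.starts = []  # starts[i] = display index of the first glyph of syllable i
        total = 0
        for count in glyph_counts:
            self.starts.append(total)
            total += count
        self.glyph_total = total

    def syllable_at(self, glyph_index):
        """Index of the syllable that owns the glyph at `glyph_index`."""
        return bisect_right(self.starts, glyph_index) - 1

    def copy(self, sel_start, sel_end):
        """Syllables touched by the glyph selection [sel_start, sel_end)."""
        if sel_start >= sel_end:
            return []
        first = self.syllable_at(sel_start)
        last = self.syllable_at(sel_end - 1)
        return self.syllables[first:last + 1]


line = BackedLine(["kaa", "tri", "ma"], [2, 2, 1])
picked = line.copy(1, 3)  # selection starts inside "kaa" and ends inside "tri"
```

A selection that cuts through the middle of a syllable is widened to whole syllables, so what reaches the clipboard is always the stored representation and never a fragment of glyphs.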

An input module may be provided to developers which is essentially a character input facility along the lines of getchar(). An application will call this module to input text. The module will return syllables in multibyte form or, if necessary, in a fixed-width form. The syllables will be easy to work with from a linguistic angle, since no modifier codes will be present. A reasonable amount of equivalence in code values across scripts may now be possible, and transliteration may be more easily accomplished. Applications need not switch keyboards, since the syllables will be common to all the languages. Only the font and the associated table will differ.

OpenType fonts can be avoided. Unicode fonts with glyphs placed in the Private Use range E000-E9FF may still be used for rendering text. We will need at most about 240 glyphs for each script to display a very satisfactory (and reasonably complete) set of ligatures. So we can actually have one single Unicode font catering to all nine scripts in this range. In fact, one can go back to the syllables from the glyph codes by parsing the glyph string with a parser, which may be easily written using Lex and Yacc in much the same way as recommended earlier.
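The arithmetic behind the single-font claim can be checked with a short sketch. The range E000-E9FF holds 2560 code points, while nine scripts at 240 glyphs each need 2160. The layout below, which gives each script its own block of 256 code points, is an assumption made for the example and is not a layout prescribed by the text.

```python
PUA_START, PUA_END = 0xE000, 0xE9FF  # the range named in the text
BLOCK_SIZE = 0x100                   # assumed: one 256-code-point block per script
GLYPHS_PER_SCRIPT = 240              # upper bound quoted in the text
SCRIPT_COUNT = 9


def glyph_codepoint(script_index, glyph_index):
    """Code point of glyph `glyph_index` of script `script_index` (both 0-based)."""
    if not 0 <= script_index < SCRIPT_COUNT:
        raise ValueError("script index out of range")
    if not 0 <= glyph_index < GLYPHS_PER_SCRIPT:
        raise ValueError("glyph index out of range")
    return PUA_START + script_index * BLOCK_SIZE + glyph_index


def locate(codepoint):
    """Inverse mapping: return (script_index, glyph_index) for a code point."""
    if not PUA_START <= codepoint <= PUA_END:
        raise ValueError("code point outside the E000-E9FF range")
    return divmod(codepoint - PUA_START, BLOCK_SIZE)


capacity = PUA_END - PUA_START + 1          # code points available
needed = SCRIPT_COUNT * GLYPHS_PER_SCRIPT   # code points required
```

Under this assumed layout the script can be read directly from the high byte of the code point, which keeps the reverse step, from glyph codes back to syllables, a matter of simple lookup before parsing.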

Multilingual Computing- A view from SDL
