About Uniscribe
Uniscribe: Rendering Unicode text in Windows Applications

  The main function of Uniscribe is to take an arbitrarily long Unicode string and map it into a sequence of syllables for display. It is assumed that the input string correctly represents the Unicode characters entered from an application through the keyboard or has been generated electronically. The Unicode characters come from the set of values assigned to the script in use.

  Those with access to Windows XP/2000 can actually generate the keystrokes and see how Wordpad, Microsoft Word or even Notepad handles the input. In the illustration, zwj and zwnj refer to specific Unicode characters which convey rendering information. They cannot be typed in as the letters "zwj"; instead one enters the decimal equivalents of their Unicode values. The zero width joiner (zwj) is typed in by holding the ALT key down while entering the decimal value 08205; for the zero width non joiner (zwnj), the value is 08204. This seems to work in Word and Wordpad.
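The Alt-key decimal values above are simply the code points of the two joiner characters written in base 10. A minimal sketch, in Python, confirming the correspondence:

```python
# ZWJ and ZWNJ are ordinary Unicode characters; the Alt-key decimal
# values are just their code points expressed in base 10.
ZWNJ = "\u200C"  # zero width non joiner, decimal 8204
ZWJ = "\u200D"   # zero width joiner, decimal 8205

print(ord(ZWNJ))  # 8204
print(ord(ZWJ))   # 8205
```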

  Built into the Uniscribe shaping engine are the rules for going from the Unicode string to the displayed shape, consistent with the rendering recommendations of the Unicode consortium. Uniscribe is thus essentially a set of hard-coded rules for rendering syllables. These rules are rigid (as implemented by Microsoft), so a user has no flexibility to obtain alternative representations except by coding the string differently, typically with the zero width joiner and non joiner. In the examples shown above, the same syllable appears in different displayed forms, each generated from a different Unicode string.
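The point about alternative representations can be made concrete with a standard Devanagari case. A sketch, using the consonants ka and ssha (the exact rendering of each string depends on the shaping engine and font, but the three strings are distinct at the code level):

```python
# Three Unicode strings for the same linguistic syllable "kSha";
# a Uniscribe-style shaping engine renders each differently.
KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"
ZWNJ, ZWJ = "\u200C", "\u200D"

conjunct  = KA + VIRAMA + SSA         # full conjunct ligature
half_form = KA + VIRAMA + ZWJ + SSA   # half form of ka, then ssha
explicit  = KA + VIRAMA + ZWNJ + SSA  # ka with visible virama, then ssha

# All three differ as strings even though they read the same.
print(len({conjunct, half_form, explicit}))  # 3
```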

  The implementation of Uniscribe is such that part of the shaping information is derived from the font used for the script, and this font must be an OpenType font. OpenType fonts for Indian languages require the designer to be thoroughly familiar with the writing system, which can be a rather exacting requirement. Because the OpenType format allows a single glyph to be selected from a sequence of character codes, the font tends to become unwieldy: the Mangal OpenType font currently shipped with Windows XP/2000 has nearly 650 glyphs, many of which are derived from a much smaller set of basic glyphs. It would not be incorrect to state that the motivation for OpenType fonts came more from languages with syllabic writing systems, with their many ligatures and combined shapes, than from other typesetting considerations. In fact, text in Indian languages can be comfortably typeset with existing TrueType fonts for the different scripts. The issue of concern is data entry.

  The names of Unicode characters (along with their code values) are rigidly specified, and there is absolutely no way new characters can be introduced without going through the consortium. Even when one succeeds in that, every application based on Unicode will have to be revised to accommodate the change.

 Unicode, though a meaningful concept for representing text from the different languages (more appropriately, scripts) of the world, emphasizes the script first and the language only second. This is quite the opposite of our approach to languages: it is the language (defined by its sounds) that comes first, and only then the script. Any of the Indian languages can use any writing system so long as the sounds are preserved, and there is no confusion in the process. We know well that Sanskrit can be written in Devanagari, in Sharada (from Kashmir), or in Grantha from the south; all of these retain the phonetic information through properly formed rules for mapping a syllable into a shape. Marathi used to be written in a script known as Modi, though one uses Devanagari these days.

 Unicode has an undeniable bias towards the rules of the writing system. There are valid code values that refer not to a linguistic element but to a shape; the zero width joiner and zero width non joiner are examples of this provision. Hence, when such characters are present, deriving the linguistic content from a string of Unicode values is not as easy as simple string matching. Even a simple application such as a text editor requires linguistic processing to support a find or search-and-replace operation.
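One simple way a text editor might cope with this is to fold away the invisible joiners before comparing strings. A minimal sketch (the helper name `fold_joiners` is our own; real search would also need to handle other equivalences):

```python
# A find operation that ignores the invisible joiners: delete
# ZWNJ (U+200C) and ZWJ (U+200D) from both haystack and needle
# before comparing.
JOINERS = dict.fromkeys((0x200C, 0x200D))  # map each to None = delete

def fold_joiners(s: str) -> str:
    return s.translate(JOINERS)

text   = "\u0915\u094D\u200D\u0937"  # ka + virama + ZWJ + ssha
needle = "\u0915\u094D\u0937"        # ka + virama + ssha

print(needle in text)                              # False: plain match fails
print(fold_joiners(needle) in fold_joiners(text))  # True
```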

  For those willing to experiment with the idiosyncrasies of Microsoft's implementation of Unicode support for Indian languages, the following is worth an attempt.

  In the screen shot below try and figure out the expression to be typed in to get a match for the strings shown.

  A copy of the file is available for download. Open the file with Wordpad and see if you can type in expressions to match all the strings. Even though some strings look identical, their Unicode representations are not. When the file is opened under Wordpad, the window which pops up when you select the Find option does not seem to permit entry of the zero width joiner or non joiner characters.

  In respect of data entry today, most Indian languages require the use of punctuation marks and a few important mathematical signs such as plus and minus. Since these are not explicitly included in the Unicode assignments for Indian languages, data entry requires frequent switching of the keyboard. Many keyboards for Indian language data entry (including the Microsoft keyboard, which is based on the Inscript layout) pack so many shapes into the keys that even standard symbols cannot be accommodated. (See if you can type the parentheses on the Microsoft Tamil keyboard!)

  Though Uniscribe is meant to provide the required representation of a syllable for display and printing, the onus is on the application to correctly handle the spacing of the text. This means an application is intricately tied to Uniscribe and the associated OpenType font, and the developer must know the actual capabilities of Uniscribe's shaping behaviour. This is rather unfair, for developers should concentrate on processing the information and not be burdened with formatting details. Elsewhere in this analysis we have provided examples of three different Microsoft applications that compute the width of the same text string quite differently. It also turns out that when you copy and paste a Unicode text string into Word, cursor movement no longer applies at the syllable level, as required, but at the individual Unicode character level. Cursor positioning to edit the copied text cannot be ascertained by moving the cursor to the required syllable; amusing results will be seen if you try. Much of this can be inferred from the illustration above.

The case against arbitrarily long syllables.

 The basic assignment of Unicode allows arbitrarily long syllables to be constructed even though they make no sense. Uniscribe attempts to process long text strings to identify syllables, and this can lead to absurdities. From what is known in India, there are only about a thousand meaningful syllables, most of which have only two consonants and rarely three or four. There is virtually no need to allow new shapes for a syllable built with three or four consonants, because the writing system permits the syllable to be written in split form.
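The kind of syllable identification discussed here can be approximated with a pattern over consonant-virama runs. A rough sketch for Devanagari only; real orthographic syllable rules (independent vowels, nuktas, modifiers) are considerably richer than this regular expression:

```python
import re

# A rough syllable splitter for Devanagari: any run of
# (consonant + virama) pairs followed by a consonant and an
# optional vowel sign; anything else passes through unchanged.
CONS = "[\u0915-\u0939]"
SYLLABLE = re.compile(f"(?:{CONS}\u094D)*{CONS}[\u093E-\u094C]?|.", re.S)

def syllables(text):
    return SYLLABLE.findall(text)

# "namaste": na, ma, s + virama + ta + e-sign
print(syllables("\u0928\u092E\u0938\u094D\u0924\u0947"))
# ['न', 'म', 'स्ते']
```

Note that the pattern, like Uniscribe, happily accepts arbitrarily long consonant clusters; limiting the `*` to a bounded repetition such as `{0,3}` would implement the three- or four-consonant cap argued for above.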

  While one may feel pleased that there is no limit to the syllables that can be formed by Uniscribe, one can readily see that a perfectly valid Unicode string can cause enough confusion to the shaping engine; we have already seen an example of this. Uniscribe could well stop at three- or four-consonant syllables to make the text preparation process simpler. Editing at the syllable level is not without its problems in Microsoft applications.

Keeping track of two representations.

  The need to correctly identify syllables, along with the need to maintain correct spacing of text on the screen, requires very complex processing. The problem arises because the display is managed in terms of codes referring to glyphs while the text itself is handled using the assigned character codes (Unicode) for the script. The irony is that the OpenType font is also a Unicode font with valid glyph codes, but without a one-to-one relationship between the stored characters and the glyphs. Errors are bound to occur in any computation that must struggle to keep track of two different representations at the same time.
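The bookkeeping that links the two representations is typically a cluster map: each glyph carries the index of the first character of the syllable it came from. A minimal sketch of the structure (the glyph ids here are invented; only the mapping back to character indices matters):

```python
# A shaping result pairs a glyph id with a "cluster" index pointing
# back at the first character of the syllable the glyph belongs to.
text = "\u0915\u094D\u0937\u093E"   # ka + virama + ssha + aa-sign

shaped = [
    (501, 0),  # hypothetical kSha ligature glyph -> characters 0..2
    (117, 3),  # hypothetical aa matra glyph      -> character 3
]

def chars_for_glyph(glyph_index):
    # The character span of a glyph runs from its cluster index to
    # the next glyph's cluster index (or the end of the text).
    start = shaped[glyph_index][1]
    end = shaped[glyph_index + 1][1] if glyph_index + 1 < len(shaped) else len(text)
    return text[start:end]

print(chars_for_glyph(0) == text[0:3])  # True: one glyph covers three characters
```

The many-to-one (and sometimes many-to-many) nature of this map is exactly why copy/paste and cursor positioning are hard to get right.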

Copy/paste features in an application rely heavily on its ability to trace back from the displayed text to the internally stored text. For most western scripts this is straightforward, but for any writing system that follows a syllabic representation, this requirement is not easy to fulfill.
