Shaping Engine for Rendering Text
The complexity of rendering Indian scripts

 Uniscribe (or its equivalent) is the programming interface which allows Unicode text strings to be interpreted for display in a Microsoft Windows environment. As we know, the Unicode text strings for Indian scripts will consist of only the basic vowels, consonants, matras and a few additional symbols. The purpose of Uniscribe is to generate the information for display in terms of Glyph codes, consistent with the conventions of the writing system.

  A computer program which is given a Unicode text string in any of the Indian languages will have to identify how the string should be broken up for generating the display. This is the basic process of identifying the syllables which make up the text string. Suppose the string in question is

  In other words, we need to force an intermediate code to tell the shaping engine to do something different. This is in fact what Unicode recommends through the use of zero width joiners and non joiners.
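The effect of the joiners can be sketched with a single consonant pair. The snippet below is only an illustration: it uses the Devanagari letters KA and SSA as an example and prints the code point sequences an application would have to store to request each form. The code point values and the roles of ZWJ and ZWNJ follow the Unicode standard; the dictionary keys and the helper name code_points are our own.

```python
# Devanagari code points from the Unicode standard.
KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Three encodings of the same two consonants. Only the first carries
# purely linguistic content; the other two add a rendering hint.
variants = {
    "default": KA + VIRAMA + SSA,        # engine chooses the conjunct
    "zwj": KA + VIRAMA + ZWJ + SSA,      # requests the half form of KA
    "zwnj": KA + VIRAMA + ZWNJ + SSA,    # requests a visible virama
}

def code_points(text):
    """Return the code points of text as a readable string."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for label, text in variants.items():
    print(f"{label}: {code_points(text)}")
```

All three strings name the same consonants, yet they differ as stored text, which is the source of the processing difficulties discussed later.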

  One might argue that this is a pathological example which is unlikely to be encountered in practice. The truth is that when we teach a writing system to children we tell them that there are equivalent ways of writing the same syllable. That is, the same linguistic content may be shown differently using different scripts or even in the same script through permitted variations.

  You will find that it is pretty much impossible to get the Microsoft shaping engine to render the same text string differently, though such a provision would be very helpful in practice for handling the variations in writing systems practiced in different regions. If one decides to change the rendering to a different standard, the shaping engine itself has to be modified to change the rendering rule. This requires not only rewriting the module but also recompiling and redistributing it. Such flexibility is not easily provided in Microsoft applications, where an upgrade is recommended rather than a patch or file substitution.

It cannot be assumed that the mapping from Unicode text to the rendered shape is unique and frozen forever, so that a shaping engine can be written once and for all. When we have to reproduce the thousands of manuscripts preserved in India (written as well as printed), we will necessarily have to accommodate variations.

  The problem can be handled somewhat if we allow the rendering rules in the shaping engine to be read in dynamically rather than remain hard coded. This provision will not be an easy one, since the shaping engine will have to map a multibyte string into a final shape that may depend on a supplied parameter. If Unicode were devoid of context specifying codes such as the ZWJ and ZWNJ, this would be much easier. Unfortunately, the presence of these codes can really complicate string processing.
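What reading the rules in dynamically might look like can be sketched as follows. This is a minimal illustration under our own assumptions and not a description of any existing engine: the JSON layout, the convention names "default" and "regional", the cluster key "ka+ssa" and the shape labels are all hypothetical.

```python
import json

# Hypothetical rule file: one table of cluster -> shape per convention.
RULES_JSON = """
{
  "default":  {"ka+ssa": "conjunct"},
  "regional": {"ka+ssa": "half-form"}
}
"""

FALLBACK_SHAPE = "explicit-virama"  # assumed shape when no rule is listed

def load_rules(source):
    """Parse the rule tables from a JSON string read at run time."""
    return json.loads(source)

def shape_for(rules, convention, cluster):
    """Pick the shape for a cluster under the supplied convention.

    Falls back to the default convention, then to FALLBACK_SHAPE.
    """
    table = rules.get(convention, {})
    if cluster in table:
        return table[cluster]
    return rules["default"].get(cluster, FALLBACK_SHAPE)

rules = load_rules(RULES_JSON)
print(shape_for(rules, "default", "ka+ssa"))
print(shape_for(rules, "regional", "ka+ssa"))
```

With such a table, changing the convention means replacing a data file rather than recompiling the module. The stored text needs no ZWJ or ZWNJ, since the convention is supplied as a parameter.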

  Philosophically, Unicode would remain a meaningful scheme for our scripts if only it confined itself to specifying the linguistic content and nothing more. As observed by other experts, Unicode's bias towards rendering is an issue one has to reckon with in implementing the shaping engine. What this implies is that certain Unicode values have no linguistic content but are used only to guide the rendering process so that the displayed shape is forced to conform to a specific pattern. Such codes are seldom required in European scripts since each Unicode character maps directly to one and only one shape.

  If we are required to perform linguistic processing on a Unicode text string, the presence of special characters will certainly pose problems. Let us consider an example.

  We now see that the conventional fixed width codes certainly aid in string processing if each code carries only linguistic information. Unfortunately we are not able to provide for this if we take the Unicode route.
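A small sketch shows the kind of difficulty involved. Two strings with the same consonants fail an equality test and a substring search as soon as one of them contains a joiner. The helper linguistic_key is hypothetical: it simply discards ZWJ and ZWNJ before comparison, one possible workaround that also throws away the rendering intent.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# Translation table that deletes both joiners.
_JOINERS = {ord(ZWJ): None, ord(ZWNJ): None}

def linguistic_key(text):
    """Drop ZWJ/ZWNJ so that only the letter sequence remains."""
    return text.translate(_JOINERS)

plain = "\u0915\u094D\u0937"         # KA, VIRAMA, SSA
forced = "\u0915\u094D\u200D\u0937"  # same letters with a ZWJ inserted

print(plain == forced)   # direct comparison of the stored strings
print(plain in forced)   # substring search
print(linguistic_key(plain) == linguistic_key(forced))
```

Every routine that searches, sorts or compares text would need such a step, which is the burden placed on the application.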

The pertinent question is, can one have fixed width codes for the syllables? That is, can we have each syllable coded into a fixed number of bytes? The answer is certainly yes, though one must admit that there are at least 5000 syllables (as a bare minimum) in regular use, and across the different languages one might even see the need for more than ten thousand. The Multilingual software developed at the Systems Development Lab., IIT Madras, is indeed an example of a system that is based on fixed width syllable level codes. The software uses a sixteen bit code for each syllable where the linguistic content is very clearly identified in terms of the consonants and the vowel present in it.
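To make the idea of a fixed width syllable code concrete, here is a purely hypothetical packing of a syllable into sixteen bits. It is not the layout used by the IIT Madras software, which is not specified here. We simply assume 4 bits for a vowel index and 12 bits for an index into a table of consonants and conjuncts.

```python
# Hypothetical layout: [ 12 bits cluster index | 4 bits vowel index ]
VOWEL_BITS = 4
VOWEL_MASK = (1 << VOWEL_BITS) - 1           # 0x000F
MAX_CLUSTER = (1 << (16 - VOWEL_BITS)) - 1   # 4095

def pack_syllable(cluster_index, vowel_index):
    """Combine a consonant-cluster index and a vowel index into 16 bits."""
    if not 0 <= cluster_index <= MAX_CLUSTER:
        raise ValueError("cluster index out of range")
    if not 0 <= vowel_index <= VOWEL_MASK:
        raise ValueError("vowel index out of range")
    return (cluster_index << VOWEL_BITS) | vowel_index

def unpack_syllable(code):
    """Recover (cluster_index, vowel_index) from a 16-bit code."""
    return code >> VOWEL_BITS, code & VOWEL_MASK

code = pack_syllable(37, 5)
print(code, unpack_syllable(code))
```

Under such a scheme the n-th syllable of a text is simply the n-th code, and the consonants and the vowel can be read off with a shift and a mask, with no rendering hints mixed in.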

The conceptual basis for the shaping engine.

  The Uniscribe script engine is faithful to the specification of Unicode in rendering syllables. Unfortunately, the rendering rules are hard coded into the modules of the engine though these rules conform to some default conventions in the writing system. Consequently variations in the displayed syllable shapes cannot be honoured. Nor can we introduce a new script for the language without rewriting the shaping engine. Unicode character names are bound to the name of the script and it is quite unlikely that one will be able to introduce new scripts for Indian languages based on Unicode. Many Indian languages used different scripts at different times without any loss of linguistic content e.g., Grantha for Sanskrit, Modi for Marathi.

The essential steps involved in rendering Unicode text through the shaping engine go as follows.

1. Identify syllable boundaries or special characters.

2. Apply the rendering rules for each syllable by examining the consonants and identifying the specific rendered form applicable to each consonant. For example, if "ra" is present in the syllable, see if it is the first consonant, a middle one or the last one. When "ra" follows another consonant, the form chosen for display depends on the nature of that preceding consonant. If that consonant has a vertical line in its shape, then "ra" would be formed with a short diagonal stroke joining the vertical line in the lower half of the consonant. If the previous consonant were one without a vertical stroke, then the form of "ra" chosen may be that resembling the caret sign placed below the consonant.

3. The shaping engine may also apply some rules that call for reordering of the consonants and associating suitable shapes with the reordered consonants. This happens when "ra" comes in as the first consonant of a syllable and the displayed shape involves the "reph" form.
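The first and third steps above can be sketched for Devanagari with a regular expression. This is a deliberately simplified illustration and not the algorithm used by Uniscribe: it treats a syllable as a run of consonants joined by viramas (each optionally followed by a joiner), with an optional matra and modifier signs, and it ignores many real cases. The names split_syllables and has_reph are our own.

```python
import re

# A consonant, optionally followed by a nukta (U+093C).
C = "[\u0915-\u0939\u0958-\u095F]\u093C?"

SYLLABLE = re.compile(
    # (consonant + virama [+ ZWNJ/ZWJ])* consonant [virama | matra] modifiers*
    f"(?:{C}\u094D[\u200C\u200D]?)*{C}(?:\u094D|[\u093E-\u094C])?[\u0901-\u0903]*"
    # or an independent vowel with modifiers
    f"|[\u0904-\u0914][\u0901-\u0903]*"
)

def split_syllables(text):
    """Step 1: return the syllables found in text, skipping other characters."""
    return SYLLABLE.findall(text)

def has_reph(syllable):
    """Step 3: "ra" + virama directly before a consonant starts the syllable."""
    return re.match(f"\u0930\u094D{C}", syllable) is not None

word = "\u0927\u0930\u094D\u092E"  # DHA, RA, VIRAMA, MA ("dharma")
for syl in split_syllables(word):
    print([f"U+{ord(ch):04X}" for ch in syl], "reph" if has_reph(syl) else "")
```

For this word the second syllable begins with "ra" followed by a virama, so the engine would have to reorder it and show the reph above the following consonant.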

  The Uniscribe engine has enough complexity to identify the rules for a large number of syllables of arbitrary length running into many consonants. It will now be clear to the reader that not only are the rendering rules hard coded, but they also assume that the associated shapes are available in the font used for display. This can cause problems for applications that prefer high quality typesetting fonts, which may lack the features the shaping engine expects and yet be otherwise adequate for high quality printouts. Uniscribe requires that an Open Type font be used with it, and not just any True type font, even one that may be entirely adequate for the purpose. As of this writing (Mar. 2003), the Devanagari font supported under WinXP/2000 cannot cater to many requirements of normal writing, in spite of being rated as an effective Open Type font for the script.

It is quite unlikely that one single but adequate font for Devanagari text rendering will be developed since special software tools are required for creating meaningful Open Type fonts. Designing fonts for Indian Scripts requires the designer to understand the writing system thoroughly so that all the ligatures of importance are included in the font. In the Open Type font, a syllable can be mapped into the required shape by graphically positioning the component shapes (glyphs) which are related to the consonants and the vowel in the syllable. The Uniscribe engine would differentiate the shapes to be used for consonants based on the syllable. That is, the choice of the shapes building up the final form for a syllable will be context dependent based on the actual consonants. The same consonant may get rendered using different shapes in different syllables.

  No doubt the whole process is complex and quite involved, since the font designer and the Uniscribe developer have to work together to arrive at a good solution. One finds top font designers who may not know the intricacies of the writing system. Likewise, a linguistic expert may not really concern himself/herself with the nuances of the font file. This is perhaps the reason why we have basically one Open Type font available for Devanagari.

  Open Type fonts for Indian languages generally require a large number of glyphs running into several hundreds.

  The essential idea of the Open Type font is to map a syllable into a shape. Since there are thousands of syllables, it is not meaningful to design a font which has an individual glyph for each syllable. The general idea is that a default shape formation rule be applied to a syllable but handle exceptions where appropriate.
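The combination of a default rule with exceptions can be sketched as a table lookup. Everything named here is hypothetical: the glyph names, the ".half" and ".matra" suffixes and the two entries in the exception table are assumptions made for illustration, and the sketch leaves out reordering, such as a matra that is drawn before its consonant.

```python
# Hypothetical exception table: consonant sequences drawn as one ligature glyph.
LIGATURES = {
    ("ka", "ssa"): "ka_ssa",
    ("ja", "nya"): "ja_nya",
}

def glyphs_for(consonants, matra=None):
    """Return the glyph names for one syllable.

    Exceptions come from LIGATURES. Otherwise the default rule applies:
    half forms for all but the last consonant, then the matra if any.
    """
    key = tuple(consonants)
    if key in LIGATURES:
        glyphs = [LIGATURES[key]]
    else:
        glyphs = [c + ".half" for c in consonants[:-1]] + [consonants[-1]]
    if matra is not None:
        glyphs.append(matra + ".matra")
    return glyphs

print(glyphs_for(["ka", "ssa"]))       # handled as an exception
print(glyphs_for(["sa", "ta"], "aa"))  # handled by the default rule
```

Only the exceptions need glyphs of their own. The default rule reuses the component glyphs, which keeps the font far smaller than one glyph per syllable would.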

  The default rule will probably work for about 70% of the syllables where the required matra is added to the consonant's shape. The graphic positioning of the matra may be important from the typesetting point of view since the matra cannot be put in a fixed place around the glyph. See the illustration below.

  Designers of True type fonts knew this requirement and had simply included two or more glyphs for the same matra to handle variations in its placement with different consonants. Typically the matra is overlaid on the glyph of the consonant with an appropriate displacement with respect to the coordinates of the graphical shape of the consonant.

  In the Open Type font, since typography was also an important consideration, the font specification provides for precise positioning of a glyph with respect to another when a new glyph is required to be shaped from two or three component glyphs. Thus it will be possible for us to have just one glyph designed for the matra but use it with any consonant by positioning it at an appropriate location with respect to each consonant.
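The arithmetic behind such positioning is simple and can be sketched as follows. We assume that each consonant glyph carries an attachment point (an anchor) and that the matra glyph carries one of its own, which is the general idea behind mark attachment in Open Type. The glyph names and all coordinates below are invented for illustration.

```python
# Hypothetical anchors in font units: where a below-base matra attaches.
BASE_ANCHORS = {
    "ka": (310, 0),
    "tta": (260, -20),
}

# Hypothetical anchor on the single matra glyph itself.
MATRA_ANCHOR = (-30, 10)

def mark_offset(base_anchor, mark_anchor):
    """Displacement that brings the mark's anchor onto the base's anchor."""
    return (base_anchor[0] - mark_anchor[0], base_anchor[1] - mark_anchor[1])

for name, anchor in BASE_ANCHORS.items():
    print(name, mark_offset(anchor, MATRA_ANCHOR))
```

One matra glyph thus serves every consonant: only the anchor recorded for each consonant differs, where a True type font would have carried several copies of the matra.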

  In the Open Type font, the designers have made a provision for handling this through the concept of a composite glyph, which is a new glyph obtained from two or more basic glyphs in the font. This specific feature is exploited by Uniscribe to quickly identify the composite glyphs which can be rendered for a specific Unicode string for a syllable. However, a large number of composite glyphs will be required in this case. One will remember that composite glyphs were permitted even in True type fonts, but precisely locating one glyph with respect to the other was not handled; only simple overlays were supported. In fact, Microsoft experts recommend that a good way to design Open Type fonts for Indian scripts is to use as many composite glyphs as possible, since the Uniscribe engine could easily map the Unicode strings to the component glyphs. The Open Type font can lead you to just one glyph from multiple character codes, and it is now clear why this type of font is being promoted for use with Indian languages, where multiple character codes map to a shape. The Mangal font for Devanagari supplied with Win2000 has nearly all its glyphs specified as composite glyphs.

  Designing an Open Type font is however not a simple proposition. Special tools are required. Worse, the Open Type font will have to carry a digital signature if it has to be allowed for use in Microsoft applications. Getting the font digitally signed is some task indeed!

Summarizing the discussions

1. The Open Type font provides for multiple character codes to be mapped into a single shape. This is an important feature which distinguishes the Open Type from True type where one is invariably tied to one code one glyph mapping.

2. For Indian scripts, an Open Type font is inevitable if an application goes through Uniscribe (or its equivalent) in rendering Unicode text. It must be emphasized here that language dependent calls will have to be made to Uniscribe to handle the required rendering. This simply means that an application cannot be written in a language independent manner. This in our view is a fairly serious limitation of the Unicode based approach to computing with Indian languages. The common linguistic base across the languages can actually help the development of multilingual applications which can work transparently with any language.

3. Open Type fonts will invariably include a very large number of glyphs, most of which may be composite in nature. Yet, the same can be provided through a True type font which includes only the component glyphs and hence can be much smaller in size.

4. The Uniscribe shaping engine cannot permit multiple representations for the same Unicode string by specifying a parameter for each representation. This is the responsibility of the application.

  A computer program cannot easily generate the display for a syllable electronically, unless it knows that it can provide the display consistent with the user's requirements. Put simply, an application will necessarily have to know which syllable will have to be constructed with ZWJ or ZWNJ codes, if the shape desired is different from what Uniscribe defaults to.

   The subtle message carried by the above statement is that localization of an application will not be easy since every application handling a script must know how to code the syllables using Unicode characters, to have conformity with conventions of the writing systems that are not coded into Uniscribe. Linguistic text processing will be quite difficult under the circumstances.
