Shaping Engine for
of rendering Indian scripts
Uniscribe (or its equivalent)
is the programming interface which allows Unicode text strings to be interpreted
for display in a Microsoft Windows environment. As we know, the Unicode
text strings for Indian scripts will consist of only the basic vowels,
consonants, matras and a few additional symbols. The purpose of Uniscribe
is to generate the information for display in terms of Glyph codes, consistent
with the conventions of the writing system.
A computer program
which is given a Unicode text string in any of the Indian languages will
have to identify how the string should be broken up for generating the
display. This is the basic process of identifying the syllables which make
up the text string. Suppose the string in question is
words, we need to force an intermediate code to tell the shaping engine
to do something different. This is in fact what Unicode recommends through
the use of zero width joiners and non joiners.
One might argue that
this is a pathological example which is unlikely to be encountered in practice.
The truth is that when we teach a writing system to children we tell them
that there are equivalent ways of writing the same syllable. That is, the
same linguistic content may be shown differently using different scripts
or even in the same script through permitted variations.
You will find that
it is pretty much impossible to get the Microsoft shaping engine to render
the same text string differently though such a provision will be very helpful
in practice to handle the variations in the writing systems practiced in
different regions. Assuming that one decides to change the rendering to
a different standard, we will have to modify the shaping engine to change
the rendering rule. This will not only require rewriting the module but
also require recompilation and distribution of the new module. Such flexibility
is not easily provided in Microsoft applications where one recommends an
upgrade rather than a patch or file substitution.
It cannot be assumed that
the mapping from the Unicode text to the rendered shape is unique and will
remain frozen forever, so that a shaping engine can be written once and for all. We will find that
when we have to reproduce thousands of manuscripts preserved in India (written
as well as printed) we will necessarily have to accommodate variations.
The problem can be
handled somewhat if we allow the rendering rules in the shaping engine
to be read in dynamically rather than remain hard coded. This provision
will not be an easy one since the shaping engine will have to map a multibyte
string into a final shape that may depend on a supplied parameter. If Unicode
were devoid of context specifying codes such as the ZWJ and ZWNJ, this
would be much easier. Unfortunately, the presence of these codes can really
complicate string processing.
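
The idea of reading rendering rules in dynamically can be sketched as follows. This is only an illustration under stated assumptions: the rule format (JSON keyed by a convention name), the convention names and the glyph names are all hypothetical and do not correspond to Uniscribe or to any real font.

```python
import json

# Hypothetical rule data, as it might be read from a file at run time.
# Each convention maps a Unicode sequence to a list of (made-up) glyph names.
RULES_JSON = """
{
  "conjunct":        {"\\u0915\\u094d\\u0937": ["k_ssa"]},
  "explicit_virama": {"\\u0915\\u094d\\u0937": ["ka", "virama", "ssa"]}
}
"""

def load_rules(text):
    """Parse the rule data; nothing about the rules is compiled in."""
    return json.loads(text)

def render(syllable, rules, convention):
    """Return the glyph names for a syllable under the chosen convention,
    or None when the rule table has no entry for it."""
    return rules[convention].get(syllable)

rules = load_rules(RULES_JSON)
kssa = "\u0915\u094d\u0937"  # KA, VIRAMA, SSA
print(render(kssa, rules, "conjunct"))
print(render(kssa, rules, "explicit_virama"))
```

The same Unicode string yields two different glyph sequences depending only on the supplied parameter, and changing a convention means editing data rather than recompiling a module.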
Unicode would remain a meaningful scheme for our scripts if only it confined itself
to specifying the linguistic content and nothing more. As observed by other
experts, Unicode's bias towards rendering is an issue one has to reckon
with in implementing the shaping engine. What this implies is that certain
Unicode values have no linguistic content but are used only to guide the
rendering process so that the displayed shape is forced to conform to a
specific pattern. Such codes are seldom required in European scripts since
each Unicode character maps directly to one and only one shape.
If we are required
to perform linguistic processing on a Unicode text string, the presence
of special characters will certainly pose problems. Let us consider an
We now see
that the conventional fixed width codes certainly aid in string processing
if each code carries only linguistic information. Unfortunately we are
not able to provide for this if we take the Unicode route.
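
A small sketch may make the difficulty concrete. It assumes Devanagari code points and uses a hypothetical helper, linguistic_key, which simply drops the two joiner codes before comparison. An application may need something more careful than this.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

def linguistic_key(text):
    """Drop the rendering-only joiner codes so that only the
    content-bearing code points remain (hypothetical helper)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

plain = "\u0915\u094d\u0937"         # KA, VIRAMA, SSA
forced = "\u0915\u094d\u200c\u0937"  # same letters, ZWNJ after the virama

# The two strings carry the same linguistic content but differ as code strings.
print(plain == forced)                                  # False
print(linguistic_key(plain) == linguistic_key(forced))  # True

# A naive substring search for VIRAMA + SSA misses the second form.
print(forced.find("\u094d\u0937"))                      # -1
```

Every comparison, search or sort on such text would have to go through a step of this kind first.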
The pertinent question is,
can one have fixed width codes for the syllables? That is, can we have
each syllable coded into a fixed number of bytes? The answer is certainly
yes, though one must admit that there are at least 5000 syllables (bare
minimum) which are in regular use and across the different languages, one
might even see the need for more than ten thousand. The Multilingual software
developed at the Systems Development Lab., IIT Madras, is indeed an example
of a system that is based on fixed width syllable level codes. The software
uses a sixteen bit code for each syllable where the linguistic content
is very clearly identified in terms of the consonants and the vowel present
The conceptual basis for
the shaping engine.
The Uniscribe script
engine is faithful to the specification of Unicode in rendering syllables.
Unfortunately, the rendering rules are hard coded into the modules of the
engine though these rules conform to some default conventions in the writing
system. Consequently variations in the displayed syllable shapes cannot
be honoured. Nor can we introduce a new script for the language without
rewriting the shaping engine. Unicode character names are bound to the
name of the script and it is quite unlikely that one will be able to introduce
new scripts for Indian languages based on Unicode. Many Indian languages
used different scripts at different times without any loss of linguistic
content e.g., Grantha for Sanskrit, Modi for Marathi.
The essential steps involved
in rendering Unicode text through the shaping engine go as follows.
1. Identify syllable boundaries
or special characters.
2. Apply the rendering rules
for each syllable by examining the consonants and identifying the specific
rendered form applicable to each consonant. For example, if "ra"
is present in the syllable, see if it is the first consonant or the middle
one or even the last one. The form chosen for display will now be based
on the nature of the consonant occurring before "ra". If that consonant
has a vertical line in its shape, then "ra" would be formed with a short
diagonal stroke joining the vertical line in the lower half of the consonant.
If the previous consonant were one without a vertical stroke, then the
form of "ra" chosen may be that resembling the caret sign placed below
3. The shaping engine may
also apply some rules that call for reordering of the consonants and associating
suitable shapes with the reordered consonants. This happens when "ra" comes
in as the first consonant of a syllable and the displayed shape involves
the "reph" form.
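
Steps 1 and 3 above can be sketched for Devanagari as follows. This is a simplified illustration, not the Uniscribe algorithm: the regular expression covers only the basic consonants, the virama, matras, a few modifiers and the independent vowels, and it ignores the nukta and many other cases. The function names are hypothetical.

```python
import re

CONS = "[\u0915-\u0939]"    # basic consonants KA..HA
VIRAMA = "\u094d"
JOINER = "[\u200c\u200d]"   # ZWNJ, ZWJ
MATRA = "[\u093e-\u094c]"   # dependent vowel signs
MOD = "[\u0901-\u0903]"     # candrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914]"   # independent vowels

# A syllable is either a run of (consonant + virama) followed by a final
# consonant with an optional matra or virama, or an independent vowel.
SYLLABLE = re.compile(
    f"(?:{CONS}{VIRAMA}{JOINER}?)*{CONS}(?:{VIRAMA}|{MATRA})?{MOD}*"
    f"|{VOWEL}{MOD}*"
)

def segment(text):
    """Step 1: split a string into syllables (simplified)."""
    return SYLLABLE.findall(text)

def has_reph(syllable):
    """Step 3: 'ra' + virama directly before another consonant at the
    start of a syllable calls for the reph form and reordering."""
    return re.match(f"\u0930{VIRAMA}{CONS}", syllable) is not None

word = "\u0915\u0930\u094d\u092e"  # KA, RA, VIRAMA, MA
for s in segment(word):
    print([f"U+{ord(c):04X}" for c in s], "reph" if has_reph(s) else "")
```

For this word the sketch yields two syllables, the second of which begins with "ra" and is flagged for the reph form.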
The Uniscribe engine
has enough complexity to identify the rules for a large number of syllables
of arbitrary length running into many consonants. It will now be clear
to the reader that not only are the rendering rules hard coded but they
assume the availability of the associated shapes in the font used for display.
This can cause problems in applications which may prefer to use high quality
fonts for typesetting. Such fonts may lack the features the shaping engine
expects but otherwise be adequate for high quality
printouts. Uniscribe requires that an Open Type font be used along with
it and not any True type font which may be entirely adequate for the purpose.
As of this writing (Mar. 2003) the Devanagari font supported under WinXP/2000
cannot cater to many requirements called for in normal writing in spite
of being rated as an effective Open Type font for the script.
It is quite unlikely that
one single but adequate font for Devanagari text rendering will be developed
since special software tools are required for creating meaningful Open
Type fonts. Designing fonts for Indian Scripts requires the designer to
understand the writing system thoroughly so that all the ligatures of importance
are included in the font. In the Open Type font, a syllable can be mapped
into the required shape by graphically positioning the component shapes
(glyphs) which are related to the consonants and the vowel in the syllable.
The Uniscribe engine would differentiate the shapes to be used for consonants
based on the syllable. That is, the choice of the shapes building up the
final form for a syllable will be context dependent based on the actual
consonants. The same consonant may get rendered using different shapes
in different syllables.
The whole process is complex and quite involved since the font designer
and the Uniscribe developer have to work together to arrive at a good solution.
One finds top font designers who may not know the intricacies of the writing
system. Likewise, a linguistic expert may not really concern himself/herself
with the nuances of the font file. This is perhaps the reason why
we have basically one Open Type font available for Devanagari.
Open Type fonts for
Indian languages generally require a large number of glyphs running into
The essential idea
of the Open Type font is to map a syllable into a shape. Since there are
thousands of syllables, it is not meaningful to design a font which has
an individual glyph for each syllable. The general idea is that a default
shape formation rule be applied to a syllable but handle exceptions where
The default rule will
probably work for about 70% of the syllables where the required matra is
added to the consonant's shape. The graphic positioning of the matra may
be important from the typesetting point of view since the matra cannot
be put in a fixed place around the glyph. See the illustration below.
Designers of True type fonts knew this requirement and had simply included two or
more glyphs for the same matra to handle variations in its placement with
different consonants. Typically the matra is overlaid with the glyph of
the consonant with an appropriate displacement with respect to the coordinates
of the graphical shape of the consonant.
In the Open Type font,
since typography was also an important consideration, the font specification
provides for precise positioning of a glyph with respect to another when
a new glyph is required to be shaped from two or three component glyphs.
Thus it will be possible for us to have just one glyph designed for the
matra but use it with any consonant by positioning it at an appropriate
location with respect to each consonant.
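
The positioning idea can be sketched with attachment points. In the sketch below each consonant glyph carries an anchor, the single matra glyph carries its own anchor, and the matra is shifted so that the two coincide. The glyph names and all coordinates are invented for illustration and are not taken from any real font.

```python
# Hypothetical anchors in font units: where a mark should attach on each base.
BASE_ANCHORS = {
    "ka": (520, 700),
    "ra": (310, 700),
    "da": (400, 700),
}

# Hypothetical attachment point on the one matra glyph itself.
MARK_ANCHORS = {
    "e_matra": (60, 700),
}

def place_mark(base, mark):
    """Return the (dx, dy) shift that makes the mark's anchor land
    exactly on the base glyph's anchor."""
    bx, by = BASE_ANCHORS[base]
    mx, my = MARK_ANCHORS[mark]
    return (bx - mx, by - my)

for base in BASE_ANCHORS:
    print(base, place_mark(base, "e_matra"))
```

One matra glyph thus serves every consonant; only the per-consonant anchor data differs.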
In the Open Type font,
the designers have made a provision for handling this through the concept
of a composite glyph which is a new glyph obtained from two or more basic
glyphs in the font. This specific feature is exploited by Uniscribe to
quickly identify the composite glyphs which can be rendered for a specific
Unicode string for a syllable. However, a large number of composite glyphs
will be required in this case. One will remember that composite glyphs
were permitted even in True type fonts but precisely locating one glyph
with respect to the other was not handled, only simple overlays. In fact,
Microsoft experts recommend that a good way to design Open Type fonts for
Indian scripts is to use as many composite glyphs as possible since the
Uniscribe engine could easily map the Unicode strings to the component
glyphs. The Open Type font can lead you to just one glyph from multiple
character codes and it is now clear why this type of a font is being promoted
for use with Indian languages where multiple character codes map to a shape.
The Mangal font for Devanagari supplied with Win2000 has
nearly all its glyphs specified as composite glyphs.
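
The mapping of multiple character codes to one glyph can be sketched as a longest-match substitution. The tables below are hypothetical: the glyph names are invented and a real font would carry far more entries.

```python
# Hypothetical many-to-one table: a code sequence maps to a single glyph.
LIGATURES = {
    "\u0915\u094d\u0937": "k_ssa",  # KA, VIRAMA, SSA
    "\u0924\u094d\u0930": "t_ra",   # TA, VIRAMA, RA
}

# Hypothetical one-to-one table used when no ligature applies.
NOMINAL = {
    "\u0915": "ka", "\u0937": "ssa", "\u0924": "ta", "\u0930": "ra",
    "\u094d": "virama", "\u093e": "aa_matra",
}

def to_glyphs(text):
    """Map code points to glyph names, preferring the longest ligature."""
    max_len = max(len(k) for k in LIGATURES)
    out, i = [], 0
    while i < len(text):
        for n in range(max_len, 1, -1):
            if text[i:i + n] in LIGATURES:
                out.append(LIGATURES[text[i:i + n]])
                i += n
                break
        else:
            # No ligature starts here: fall back to one code, one glyph.
            out.append(NOMINAL.get(text[i], ".notdef"))
            i += 1
    return out

print(to_glyphs("\u0915\u094d\u0937\u093e"))  # ['k_ssa', 'aa_matra']
print(to_glyphs("\u0915\u0937"))              # ['ka', 'ssa']
```

Three character codes collapse into one glyph in the first call, while the second call falls back to the one-code-one-glyph mapping.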
Designing an Open
Type font is however not a simple proposition. Special tools are required.
Worse, the Open Type font will have to carry a digital signature if it
has to be allowed for use in Microsoft applications. Getting the font digitally
signed is some task indeed!
Summarizing the discussions
1. The Open Type font provides
for multiple character codes to be mapped into a single shape. This is
an important feature which distinguishes the Open Type from True type where
one is invariably tied to a one-code, one-glyph mapping.
2. For Indian scripts, an
Open Type font is inevitable if an application goes through Uniscribe (or
its equivalent) in rendering Unicode text. It must be emphasized here that
language dependent calls will have to be made to Uniscribe to handle the
required rendering. This simply means that an application cannot be written
in a language independent manner. This in our view is a fairly serious
limitation of the Unicode based approach to computing with Indian languages.
The common linguistic base across the languages can actually help the development
of multilingual applications which can work transparently with any language.
3. Open Type fonts will invariably
include a very large number of glyphs, most of which may be composite in
nature. Yet, the same can be provided through a True type font which includes
only the component glyphs and hence can be much smaller in size.
4. The Uniscribe shaping
engine cannot permit multiple representations for the same Unicode string
by specifying a parameter for each representation. This is the responsibility
of the application.
A computer program
cannot easily generate the display for a syllable electronically, unless
it knows that it can provide the display consistent with the user's requirements.
Put simply, an application will necessarily have to know which syllable
will have to be constructed with ZWJ or ZWNJ codes, if the shape desired
is different from what Uniscribe defaults to.
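
What this demands of the application can be sketched as follows. The helper below is hypothetical; it follows the Unicode convention for Devanagari in which a ZWJ after the virama requests the half form of the first consonant and a ZWNJ after the virama requests a visible virama. How a given engine and font actually honour these requests is not guaranteed by the sketch.

```python
VIRAMA = "\u094d"
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

def conjunct(first, second, form="default"):
    """Build the code sequence for a two-consonant cluster.

    form = "default"          leave the choice of shape to the shaping engine
    form = "half"             request the half form of the first consonant
    form = "explicit_virama"  request a visible virama on the first consonant
    """
    if form == "default":
        return first + VIRAMA + second
    if form == "half":
        return first + VIRAMA + ZWJ + second
    if form == "explicit_virama":
        return first + VIRAMA + ZWNJ + second
    raise ValueError(f"unknown form: {form}")

KA, SSA = "\u0915", "\u0937"
for form in ("default", "half", "explicit_virama"):
    seq = conjunct(KA, SSA, form)
    print(form, [f"U+{ord(c):04X}" for c in seq])
```

The application, not the engine, ends up deciding which of the three code sequences to store for the same syllable.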
The subtle message
carried by the above statement is that localization
of an application will not be easy since every application handling a script
must know how to code the syllables using Unicode characters, to have conformity
with conventions of the writing systems that are not coded into Uniscribe.
Linguistic text processing will be quite difficult under the circumstances.