glyphs have helped build the required complex shapes through a simple process
of concatenating the glyphs. In other words, zero width glyphs help create
shapes which are formed by overlapping many shapes (usually two or three).
Typically the matras in Devanagari are overlapped with the consonant shapes.
The main advantage of this approach is that text can be rendered on most
systems, which need only take a string of eight bit codes. Most font rendering
schemes (Win9x, Linux, Mac, PostScript) have recognized the need to correctly
handle zero width glyphs. In these cases, the rendering engine is quite
simple and just concatenates the shapes together.
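As a minimal sketch of this concatenation scheme (with hypothetical glyph
names and metrics, not those of any real font), the rendering logic amounts
to little more than advancing a pen position:

    # Minimal sketch of a concatenating renderer. A zero width glyph
    # has zero advance; its outline extends to the left of the pen
    # (negative side bearing), so it overlays the consonant drawn
    # just before it.
    advance_width = {
        "KA": 540,       # full consonant, normal advance
        "MATRA_I": 0,    # zero width matra: overlays the preceding shape
    }

    def layout(glyph_names):
        """Return (glyph, pen_x) pairs for a string of glyph codes."""
        pen_x, placed = 0, []
        for name in glyph_names:
            placed.append((name, pen_x))   # draw at the current pen position
            pen_x += advance_width[name]   # zero width leaves pen_x unchanged
        return placed

    print(layout(["KA", "MATRA_I", "KA"]))
    # [('KA', 0), ('MATRA_I', 540), ('KA', 540)]
    # the matra's outline, drawn leftwards from x=540, overlaps the first KA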
The question that has always been asked is: "Can every display requirement
(in respect of most scripts in India) be handled through the use of zero
width glyphs?" While the answer to this question is certainly "yes",
a large number of such glyphs will be required in practice to handle all
the shapes which can be generated only by overlapping more basic shapes.
It is quite difficult to accommodate a large number of these glyphs in
an eight bit font. It may be noted here that TeX has indeed shown
that an eight bit font may be all that we need for our scripts but the
approach cannot be used in interactive applications.
Developers who desire
to use Unicode for Indian languages face the problem of building up the
required shape for each syllable using only a Unicode font. For the majority
of the world's languages, a Unicode font need have only one glyph
for every Unicode character defined for the language. In respect of Indian
languages, the situation is very different, since the Unicode font will
have to accommodate literally thousands of glyphs. Certainly one could
think of a Unicode font with several thousand glyphs where each glyph is
directly a representation of a syllable. Unfortunately, when Unicode assignments
were made, the experts felt that a scheme similar to ISCII would be sufficient.
So, each Indian language got an assignment of a limited set of 128 code
values from which, it was assumed, all syllables would be derived (represented)
using a variable number of Unicode characters. It was felt that since the
one to one mapping between a Unicode character and a glyph does not apply,
a rendering engine would have to be used which maps the Unicode characters
to the glyphs of SOME font, without specifying the range of Unicode values
for the font glyphs.
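As a worked example of this variable length representation, the Devanagari
conjunct "ksha" is rendered as one shape but takes three code points from
the 128-value block:

    # One syllable, three Unicode characters: the conjunct "ksha" is
    # KA + VIRAMA + SSA from the Devanagari block (U+0900-U+097F).
    ksha = "\u0915\u094D\u0937"
    print(ksha)                           # displays as the single conjunct
    print([hex(ord(c)) for c in ksha])    # ['0x915', '0x94d', '0x937']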
The way out of this situation was to suggest a new font concept called
the Open type font, which would incorporate features to map one or more
Unicode characters to one or more glyphs in an appropriate Unicode range.
This Open type font would permit a large number of glyphs, perhaps several
hundred, enough to generate all the required ligatures by positioning glyphs
with respect to one another. With this, the required ligatures would be
obtained by selecting the glyphs appropriate to a syllable and shaping the
display by positioning the glyphs at precisely defined locations. The need for
zero width glyphs does not arise, for the font rendering program gets
positioning information from the glyph to be displayed, which now identifies
the component glyphs to be pieced together. The Open type font
allows a string of Unicode characters to be mapped into a single glyph,
thus permitting the generation of the shape of the syllable from a variable
length string. By precisely locating the glyphs in relation to one another
graphically, the need for multiple zero width glyphs for the same ligature
(as in True type fonts) is eliminated. It is said that such precise positioning
allows superior quality typography as well. It is a different matter, however,
if the basic glyphs themselves are not aesthetically pleasing, as is the
case with the Microsoft Mangal font!
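The many-to-one mapping can be pictured as a substitution table consulted
while scanning the text, which is conceptually what an Open type ligature
lookup does. A minimal sketch, with a hypothetical glyph name:

    # Sketch of many-to-one substitution: several Unicode characters
    # map to one glyph. The glyph name is illustrative only.
    ligature_table = {
        ("\u0936", "\u094D", "\u0930"): "SHRA_LIGATURE",  # SHA+VIRAMA+RA
    }

    def substitute(chars):
        """Greedily replace known character sequences with single glyphs."""
        out, i = [], 0
        while i < len(chars):
            for seq, glyph in ligature_table.items():
                if tuple(chars[i:i + len(seq)]) == seq:
                    out.append(glyph)     # a whole sequence becomes one glyph
                    i += len(seq)
                    break
            else:
                out.append(chars[i])      # no ligature applies here
                i += 1
        return out

    print(substitute(list("\u0936\u094D\u0930")))   # ['SHRA_LIGATURE']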
An Open type Unicode font not only allows more than 256 glyphs but also
builds in the positioning information used when multiple glyphs are overlaid.
Essentially this is the
same concept as that of a composite glyph in a conventional True type font.
The composite glyph also has the advantage that we can specify it with
just one code. However, when mapping characters in the text, a True type
font will permit only one glyph to be mapped to one character. This is
the distinct advantage of the Open type font where a string of Unicode
values can map to a single glyph. When a font rendering program is called
to display a composite glyph, it would dynamically build the glyph from
the component glyphs by positioning them properly. If one uses zero width
glyphs in a font, the same final result can be obtained but only by specifying
a code for each glyph. If we examine the syllable shown earlier, an Open
type font could indeed include a glyph that is a combination of the two
components ("sht" and "ra") and have the syllable "sh, t, ra" mapped to it.
In reality, many glyphs in the Microsoft Mangal font are composite glyphs
(almost 500 of them), and the recommendation from Microsoft experts emphasizes
using composite glyphs for as many glyphs as possible that directly
relate to a syllable.
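A composite glyph is, in effect, one record naming its component glyphs and
the offsets at which they are drawn, so the renderer can assemble the shape
on the fly. A sketch with purely illustrative names and numbers:

    # Sketch of a composite glyph: one code identifies the composite,
    # and the components are positioned from stored offsets. The names
    # and offsets are illustrative assumptions.
    composite_glyphs = {
        "SHTA_CONJUNCT": [
            ("SHA_HALF", 0, 0),      # (component, x offset, y offset)
            ("TTA_FULL", 420, 0),
        ],
    }

    def draw_glyph(name, x, y):          # stand-in for the rasterizer
        print(f"draw {name} at ({x}, {y})")

    def render_composite(name):
        for component, dx, dy in composite_glyphs[name]:
            draw_glyph(component, dx, dy)

    render_composite("SHTA_CONJUNCT")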
The Uniscribe module,
which constitutes the shaping engine for Unicode in Microsoft applications,
will identify that "sh", "t" and "ra" would come out as a single shape
by applying the rule that when the consonant "ra" comes as the last consonant
in a syllable, it is written using a ligature which occurs either as an
attachment to the vertical stroke of the preceding consonant (as in "p,
ra") or as an individual ligature below it, if the preceding consonant
does not have a vertical stroke. It turns out that Microsoft displays
the syllable in the illustration above not as a single ligature for "sh"
and "t" but through a half form for "sh" and a ligature for "ra" under
the consonant "t".
It is now reasonably
clear to us that a lot of rules are hard coded into Uniscribe. Some
of the rules will depend on the availability of specific shapes (glyphs)
in the font under use. Since the form of the syllable is hard coded into
Uniscribe, the user or the developer cannot provide alternate forms for
a syllable even if this form can be pieced together from other available
glyphs in the font. People often prefer a form in which a conjunct is shown
without a halanth on any of the consonants. This is certainly
not possible with Uniscribe as of today (March 2003). Tomorrow, if we do
agree to build a new glyph into the Mangal font, Uniscribe will have to
be rewritten! Of course Microsoft does not insist on the developer using
Uniscribe. The onus is then on the designer to shape the syllables in the
application itself, something that can lead to a lot of additional work.
Uniscribe also works
on the principle of internally defined rules which specify which form of
a consonant applies in a given context. Thus "ra" occurring as the first
consonant of a syllable is treated differently from a "ra" that occurs
in the middle or at the end. To handle this, Uniscribe also reorders the
input string for cases where the first consonant is graphically positioned
at the end, as happens when the "reph" form applies. In Marathi, it
is not always the case that the reph form is used each time "ra" occurs
as the first consonant. So these rules, which are language dependent, can
be handled by Uniscribe only when the language associated with the script
is also specified as a parameter. It is not possible to dynamically introduce
a language that uses Devanagari but has rules different from Sanskrit or
Hindi, so long as the codes are required to be Unicode values.
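The reordering in question is a move from logical order to visual order: a
syllable-initial "ra + virama" ends up displayed as a reph mark over the
final consonant. A simplified sketch, not Uniscribe's actual logic, with the
language dependence reduced to a flag:

    # Simplified sketch of reph reordering (logical -> visual order).
    # "REPH" is a placeholder glyph name; use_reph crudely stands in
    # for the language dependent rules mentioned above.
    RA, VIRAMA = "\u0930", "\u094D"

    def reorder_reph(chars, use_reph=True):
        """Move a syllable-initial 'ra + virama' to the end for display."""
        if use_reph and chars[:2] == [RA, VIRAMA]:
            return chars[2:] + ["REPH"]   # drawn over the last consonant
        return chars

    # logical order RA + VIRAMA + MA becomes MA followed by the reph mark
    print(reorder_reph(["\u0930", "\u094D", "\u092E"]))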
Applications which transfer information between themselves through copy/paste
benefit greatly from scripts which map one Unicode character
into one font glyph. In this case the code of the displayed character is
identical with that of the character in storage. One can readily identify
the internally stored text merely by looking at the displayed string.
We have seen that this cannot be the case with respect to Indian languages,
for several Unicode characters in sequence constitute a syllable and hence
a shape. The computer system (basically the OS) must use only a Unicode
font to render the text since everything is Unicode based. The large set
of Unicode values required in a font for an Indian language (Tamil may
do with a small set) cannot be accommodated in any other Unicode range
unless that range has no specific Unicode assignments. Taking note of this,
developers have struck a compromise by designing Unicode (Open type too)
fonts having glyph codes in a region designated as the "Private Use Area" by
the Unicode consortium, where one has the freedom to locate the characters
of one's own scripts. This in essence allows the characters of any new
language to be assigned Unicode values in a totally free manner without
prejudice to or interfering with the codes otherwise legally assigned to
several other languages in the Unicode standard.
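In code form, the compromise is a one-way table from the standard assignments
to font-internal Private Use Area codes; the particular glyph values below
are hypothetical:

    # Sketch of the compromise: text is stored with standard Unicode
    # assignments, while the font's glyphs live in the Private Use
    # Area (U+E000-U+F8FF). The PUA values are hypothetical.
    glyph_map = {
        "\u0915": "\uE015",   # stored KA -> this font's PUA glyph code
        "\u0916": "\uE016",   # stored KHA
    }

    def to_display_codes(stored_text):
        """Map stored Unicode characters to the font's glyph codes."""
        return "".join(glyph_map.get(c, c) for c in stored_text)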
Thus, Unicode text in Indian languages will be represented through the
standard Unicode assignments for the different Indian languages but all
corresponding fonts will locate their glyphs in the Private Use area. One
can readily see that this involves no loss of flexibility in processing a
syllable, for what is needed is the identification of a glyph that has
a valid Unicode value assigned to it. In a document
displayed using such a font, going from the displayed code to the internal
code is still a reality so long as we retain the stored text internally
in some buffer and back track from the displayed codes simply by repeatedly
generating temporary display codes and matching them against the actually
displayed ones. So copy/paste operations will
be possible. In the one code, one glyph case, the need for this internally
stored text does not arise because the internally stored text from which
the display was generated will be identical to the displayed codes themselves.
When we use the Private Use area, we may have no way of finding out what
language text is being displayed unless we access the Unicode values of
the internally stored text. Multilingual applications will have quite some
work to do in relating the display to a language if the text displayed
uses fonts in the Private use area but the actual code values are different.
Thus all applications dealing with Unicode in Indian languages MUST always
retain a buffer in which the Unicode string that has given rise to the
current display is kept. Worse still, as editing operations are performed
on displayed text, pointers linking graphical positioning of the glyphs
with the internally stored text string must be maintained. This is a very
complex issue and we know that Microsoft applications themselves have not
handled this with care, as will be seen below.
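The backtracking scheme can be sketched as follows; shape() stands for the
shaping engine (Uniscribe or equivalent) and is an assumed callable, not a
real API:

    # Sketch of backtracking from displayed codes to the stored Unicode
    # buffer: regenerate display codes for prefixes of the stored text
    # and match them against what is actually on screen.
    def display_to_storage_offset(stored_text, displayed_prefix, shape):
        for i in range(len(stored_text) + 1):
            if shape(stored_text[:i]) == displayed_prefix:
                return i        # stored_text[:i] produced this display
        return None             # no match: display and buffer disagree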
It is now apparent that the application has a lot of responsibility in
actually positioning the syllables on the screen when Unicode strings have
to be displayed. Errors in computing the widths
of displayed glyphs can lead to a lot of confusion during the backtracking
process. Errors of this type can cause unpleasant gaps in the displayed
text and we know that this situation does
exist even with Microsoft software!
Seen below is a screen shot of three Microsoft applications handling the
same text. These are Wordpad, Word and Excel, all running under Windows XP.
The text was generated by typing into Wordpad and copied and pasted into
the other two. The identical looking strings in the Wordpad display are
not really identical in their internally stored form but differ due to
the incorporation of zero width joiners. It is however clear that all the
strings refer to the same syllable. Whether the applications actually perform
syllable level processing is also apparent from the illustration.
Examine how Word displays the strings. The wavy red line put in by Word
(a spelling error being pointed out) tells us what Word thinks is actually
the width of the displayed string! The situation with Excel is no less
amusing: it does not seem to use Uniscribe at all but goes by the one
Unicode, one glyph maxim, ignoring the zero width joiners altogether. More
interesting to observe is what happens when you try a string match for
the word. Wordpad would match only one string while Word matches five and
misses out the one where gaps are seen in the word.
You can verify all this for yourself if you have WindowsXP running on your
computer. Just download the Unicode text file
corresponding to the displayed text which we have made available for
you. You can open the file in Wordpad or Word directly but must do a copy
and paste into Excel.
At this point one might point to the inconsistencies in text processing
with Unicode. Text processing at the syllable level cannot be solved by
providing modules which identify syllable boundaries alone and display
them. The need to check the linguistic
validity of a text string that is perfectly valid as a Unicode string is
really the crux of the problem. The multibyte
nature of the syllable, coupled with the need to identify codes which carry
no linguistic information but merely help in rendering
the syllable, will require a lot of comparisons with each Unicode character
and severely affect performance, besides complicating the algorithms themselves.
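For instance, even a plain string match must first filter out codes such as
the zero width joiner, which affect rendering but not meaning. A minimal
sketch:

    # ZWNJ (U+200C) and ZWJ (U+200D) shape the display but carry no
    # linguistic content, so a meaningful comparison strips them first.
    RENDERING_ONLY = {"\u200C", "\u200D"}

    def _strip(s):
        return "".join(c for c in s if c not in RENDERING_ONLY)

    def linguistic_equal(a, b):
        return _strip(a) == _strip(b)

    # the same syllable, differing only by a joiner, should match:
    print(linguistic_equal("\u0915\u094D\u0937",
                           "\u0915\u094D\u200D\u0937"))   # True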
All this goes to show that it is very difficult to write applications based
on Unicode rendering. Applications which go only one way, i.e., from Unicode
text to display, are perhaps the only ones which may work, but this would
restrict the applications to mere data entry and display. Even here an
application must know how the shaping engine (Uniscribe or equivalent)
renders the text to present the display appropriate to the user's needs.
For instance, the onus is on the application to format the text graphically
by ascertaining the character widths. Worse still, an application may actually
be required to know when rendering information has to be inserted into
a string through zero width or non zero width joiners and such. A
major constraint which most applications will face is in permitting multilingual
data entry. It will be very difficult to build applications that
allow data entry in different scripts within the same interface unless
they handle the keyboard themselves. The moment you rely on the support given
by the OS, you will invariably be forced to use alternate keyboards. As
indicated elsewhere in this essay, it is not possible to type in punctuation
marks in Tamil using the Microsoft Tamil Keyboard and one will have to
switch keyboards. While one can certainly argue that this is consistent
with the basic concept of Unicode where punctuation marks are assigned
codes in a different region, the need to switch keyboards can be frustrating.
It is never a good policy to require applications to handle text formatting
by themselves. At least a meaningful API should be available which can
take a Unicode string and render it on the display in a predictable fashion.
This is very difficult to manage unless we have a one code one glyph situation.
Perhaps a one code many glyphs situation is also not difficult to deal
with, since the one code can really be that of a syllable. Unfortunately,
Unicode has not taken this route.
In Microsoft's implementation
of Unicode support for Indian languages, it appears that the calculation
of widths of displayed glyphs has some error. This is particularly so with
zero width glyphs. It is clear that the responsibility for the correct
display rests with the application and not the shaping engine. Shown below
are screen shots of the same text in different applications: Word, Wordpad
and Netscape. One wonders how this has come about! Zero width glyphs
from standard fonts (in this case a True type font from IIT Madras) are
rendered correctly under Word, but gaps are seen in Wordpad. Wordpad correctly
interprets the widths of characters in the Latha font, which is Microsoft's
own font, but Word seems to suffer, especially with zero width space characters.
If you are intrigued by the clear text typed in Windows 2000 (Devanagari
and Tamil text), just look at the simple multilingual
text editor developed at IIT Madras.
Unicode rendering through the use of True type fonts
Building applications supporting user interfaces in Indian languages is
entirely feasible with Unicode and True type fonts. It will be necessary
to place many glyphs
side by side to display a syllable but this can be managed with appropriate
zero width glyphs. The application must now parse the input string to identify
syllables. A significant amount of simplification
can be effected if we agree to restrict syllable formation to a limited
set of, say, about six hundred syllables (which, by the way, will cover most
of our requirements in respect of our languages). The mapping from
a syllable to its glyphs may be accomplished through simple table lookup
as opposed to complex rules built into Uniscribe. The multilingual software
from IIT Madras has established that this approach is not only viable but
very simple to implement. Syllable formation is effected at the input stage
itself during data entry, and each syllable is stored internally as a fixed
size code (two bytes).
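A sketch of this approach follows; the cluster rule is deliberately
simplified (consonant runs joined by the virama, with trailing matras), and
the two-byte codes are hypothetical stand-ins for the roughly six hundred
entries a real table would hold:

    # Sketch of syllable parsing plus table lookup, as described above.
    VIRAMA = "\u094D"
    STARTERS = (set(chr(c) for c in range(0x0915, 0x093A))    # consonants
                | set(chr(c) for c in range(0x0905, 0x0915))) # indep. vowels

    def split_syllables(text):
        syllables = []
        for ch in text:
            if not syllables or (ch in STARTERS
                                 and not syllables[-1].endswith(VIRAMA)):
                syllables.append(ch)    # a new syllable begins here
            else:
                syllables[-1] += ch     # virama/matra joins the cluster
        return syllables

    syllable_code = {"\u0915\u094D\u0937": 0x0101}   # hypothetical code

    def encode(text):
        return [syllable_code.get(s) for s in split_syllables(text)]

    print(split_syllables("\u0915\u094D\u0937\u092E\u093E"))  # ['क्ष', 'मा']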
It is relatively easy
to write parsing applications which can handle dynamically entered strings.
The Acharya web site hosts a demo page where
the viewer can verify that a sequence of consonants and vowels can be input
to generate syllables dynamically and display them in any script.
Syllables may also be standardized by collectively taking all the basic
sounds from each language and working with a superset of vowels and consonants.
The text rendering
process can be simplified considerably if we agree to deal with a finite
set of syllables as opposed to allowing arbitrarily long ones. Over the
years one has seen that almost all the text ever prepared in India includes
just about 500-800 syllables depending on the language, which have to be
shown with special ligatures. It is therefore sufficient if this set is
catered to. Restricting the set of syllables gives us the flexibility to
use tables to map the syllables to glyphs. Table lookup can also be effected
dynamically giving us the additional flexibility to use alternate forms
of display for syllables.
If we carefully design
our True type fonts, we can create a multilingual font supporting all the
important scripts (nine of them) and place the glyphs in the E000-E9FF
region, where each script will have close to 250 glyphs. We can also include
in this font many common glyphs, such as punctuation marks and special
symbols, which we could not manage in a regular True type font for want
of glyph slots. A comparable Open type font would require at least 650 glyphs
per script, and we can see that it would be difficult to manage such a huge
font, let alone design one.
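The arithmetic of such an allocation is easy to check; the ordering of
scripts below is an assumption, but the block sizes follow from the
E000-E9FF range:

    # Worked allocation: nine scripts at 256 slots each span
    # U+E000-U+E8FF, leaving U+E900-U+E9FF for shared punctuation
    # and symbols. The script order is an assumption.
    SCRIPTS = ["Devanagari", "Bengali", "Gurmukhi", "Gujarati", "Oriya",
               "Tamil", "Telugu", "Kannada", "Malayalam"]

    for i, script in enumerate(SCRIPTS):
        base = 0xE000 + i * 0x100
        print(f"{script:<12} U+{base:04X}-U+{base + 0xFF:04X}")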
True type fonts also have other advantages. The rendering process is not
tied to the availability of a specific font so long as the glyphs are
present at the expected locations.
We can prepare text and get it rendered in any font of our choice where
the glyphs occupy the specified locations. With Open type fonts, the
characters will not be rendered right unless the input conforms to the
assigned Unicode values rather than the glyph codes.
If we create text in a Microsoft application that allows us to type in
Unicode values in the Private Use Area (E000-F8FF), we will not be able
to view the text with the Mangal font even though it has glyphs in this
range. There will be
greater flexibility if an application can correctly identify the glyph
codes and use any True type font that can render the glyphs right. This
is how we currently display text in many Win9x applications where we generate
ASCII text but view the same with a Devanagari or Tamil font. While
it is true that a shaping engine is always required to render Unicode in
Indian languages, the shaping engine should permit flexibility for us to
use any compatible font. It does not appear that this is possible as of
now since there is only one Open type font available for Devanagari and
Uniscribe is tied to this.
One can summarize the
observations as follows.
What Microsoft (perhaps
other developers as well) has done is to demonstrate that text in Indian
languages can be typed into any application. While it may appear that this
is all one would require to run the application with Indian language support,
the truth is that none of the applications can correctly interpret the
entered text to effect further processing. In other words, localization,
the ability to support a truly interactive user interface where user commands
are correctly and consistently interpreted across all applications, is
something that has not been viewed seriously. When this does happen, we
would not be surprised if the application is just monolingual and script
specific.
The use of Unicode
(in respect of Indian languages) to truly bring in localization does not
seem to be offering much promise. While one cannot deny that someone
can actually accomplish this in spite of the problems of multibyte codes,
it is becoming clear to many that developers will find it easier
to provide script and problem specific solutions by handling the script
related issues themselves, for there is no doubt that they can handle the
linguistic aspects with confidence.