Unicode text in Windows Applications
The main function of Uniscribe is to take an arbitrarily long Unicode
string and map it into a sequence of syllables for display. It is assumed
that the input string correctly represents the Unicode characters entered
into an application through the keyboard, or that it has been generated
electronically. The Unicode characters come from the set of assigned
Unicode values for the script in use.
Those having access to Windows XP/2000 can actually generate the
keystrokes and see how Wordpad or Microsoft Word (or even Notepad) handles
the input. In the illustration, zwj and zwnj refer to specific Unicode
values which convey rendering information. You cannot type them in as the
letters "zwj"; they have to be entered through the decimal equivalents of
their Unicode values. The zero width joiner (zwj, U+200D) is typed in by
holding the ALT key down while entering the decimal value 08205. For the
zero width non joiner (zwnj, U+200C), the value is 08204. This seems to
work in Word and Wordpad.
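
For readers who would rather inspect the code values than type them, here
is a minimal Python sketch. It only shows that the decimal values 8205 and
8204 are the two joiners, and that one conjunct can be stored as three
different strings. The choice of KA + VIRAMA + SSA as the example conjunct
is our own assumption for illustration; it is not taken from the
illustration referred to above. How each string is displayed still depends
on the shaping engine and the font.

```python
import unicodedata

ZWJ = chr(8205)    # decimal 8205 = U+200D
ZWNJ = chr(8204)   # decimal 8204 = U+200C

# Example conjunct chosen for illustration only (an assumption).
KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
SSA = "\u0937"     # DEVANAGARI LETTER SSA


def code_points(text):
    """Return the code values of a string in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


# Three different stored strings for the same consonant pair.
variants = {
    "plain": KA + VIRAMA + SSA,
    "with zwj": KA + VIRAMA + ZWJ + SSA,
    "with zwnj": KA + VIRAMA + ZWNJ + SSA,
}

if __name__ == "__main__":
    print(unicodedata.name(ZWJ), unicodedata.name(ZWNJ))
    for label, text in variants.items():
        print(label, code_points(text))
```

The three strings compare as unequal even where the displayed shapes look
alike, which is the root of the matching problems discussed later.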
Built into the Uniscribe shaping engine are the rules for going from the
Unicode string to the shape, consistent with the rendering recommendations
of the Unicode Consortium. Thus Uniscribe is nothing but a set of hard
coded rules to render syllables. These rules are rigid (as implemented by
Microsoft), and hence a user does not have the flexibility to get
alternative representations except by coding them differently, possibly
using the zero width joiner and non joiner. In the examples shown above,
the same syllable appears in different displayed forms, each generated
from a different Unicode string.
The design of Uniscribe is such that part of the shaping information is
derived from the font used for the script, and this font must be an
OpenType font. OpenType fonts for Indian languages require the designer to
be thoroughly familiar with the writing system, and this can be a rather
exacting requirement. Because the OpenType font allows a single glyph to
be selected from a sequence of character codes, the font tends to become
unwieldy. The currently available Mangal OpenType font for Windows XP/2000
has nearly 650 glyphs, many of which are derived from a much smaller set
of basic glyphs. It would not be incorrect to state that the motivation
for OpenType fonts came more from languages with a syllabic writing system,
with many ligatures and combined shapes, than from other typesetting
considerations.
In fact, text in Indian languages can be comfortably typeset with existing
TrueType fonts for the different scripts. The issue of concern is
The names of Unicode characters (along with their code values) are rigidly
specified, and there is absolutely no way new characters can be introduced
without going through the Consortium. Even when you do succeed in that,
every application that is based on Unicode will have to be rewritten to
accommodate the change.
Unicode, though a meaningful concept for representing text from the
different languages of the world (more appropriately, scripts), emphasizes
the script first and only then the language. This is quite the opposite of
our approach to languages. It is the language (defined by its sounds) that
comes first and only then the script. We all know that any of the Indian
languages can use any writing system so long as the sounds are preserved.
There is no confusion in the process: Sanskrit can be written in
Devanagari, in Sharada (from Kashmir) or in Grantha (from the south). All
of these retain the phonetic information in the script through properly
formed rules for mapping the syllable into a shape. Marathi used to be
written in a script known as Modi, though one uses Devanagari these days.
Unicode has an undeniable bias towards the rules of the writing system.
There are valid code values that refer not to a linguistic element but to
a shape. The zero width joiner and the zero width non joiner are examples
of this provision. Hence, when such characters are present, deriving the
linguistic content from a string of Unicode values is not as easy as
simple string matching. Even a simple application such as a text editor
requires linguistic processing when a find or a search and replace
operation is to be supported.
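
The following Python sketch shows one way a find operation might cope with
such characters: it ignores the two joiners on both sides of the
comparison and reports the match position in the original string. Treating
the joiners as irrelevant to a search is our assumption; an application
that wants to distinguish the displayed forms would need a different rule.
The function names are hypothetical and this is not how Wordpad or Word
implement their search.

```python
JOINERS = {"\u200c", "\u200d"}  # zero width non joiner, zero width joiner


def strip_joiners(text):
    """Remove the shape-only joiner characters from a string."""
    return "".join(ch for ch in text if ch not in JOINERS)


def find_ignoring_joiners(haystack, needle):
    """Return the offset in `haystack` of the first match, or -1.

    Joiners are ignored in both strings; the offset returned refers to
    the original, unstripped haystack.
    """
    target = strip_joiners(needle)
    if not target:
        return -1
    # Remember where each surviving character sat in the original text.
    kept = [(i, ch) for i, ch in enumerate(haystack) if ch not in JOINERS]
    stripped = "".join(ch for _, ch in kept)
    pos = stripped.find(target)
    return kept[pos][0] if pos >= 0 else -1


if __name__ == "__main__":
    stored = "ab\u0915\u094d\u200d\u0937"   # KA VIRAMA ZWJ SSA after "ab"
    typed = "\u0915\u094d\u0937"            # KA VIRAMA SSA, no joiner
    print(stored.find(typed))                     # plain matching fails
    print(find_ignoring_joiners(stored, typed))   # joiner-blind matching
```

Plain string matching misses the stored text because of the single
embedded joiner, while the joiner-blind comparison finds it.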
For those willing to experiment with the idiosyncrasies of Microsoft's
implementation of Unicode support for Indian languages, the following is
worth an attempt. In the screen shot below, try to figure out the
expression to be typed in to get a match for the strings shown.
A copy of the file is available for download. Open the file with Wordpad
and see if you can type in expressions to match all the strings. Even
though some strings look identical, their Unicode representations are not.
When the file is opened under Wordpad, the window which pops up when you
select the find option does not seem to permit the entry of the zero width
joiner or non joiner characters.
In the context of data entry today, most Indian languages require the use
of punctuation marks and the few but important mathematical signs such as
plus and minus. Since these are not explicitly included in the Unicode
assignments for Indian languages, data entry would require frequent
switching of the keyboard. Many keyboards for Indian language data entry
(including the Microsoft keyboard, which is based on the Inscript layout)
pack so many shapes into the keys that even standard symbols cannot be
accommodated. (See if you can type in the parentheses on the Microsoft
Tamil keyboard!)
Though Uniscribe is meant to provide the required representation of a
syllable for display and printing, the onus is on the application to
handle the spacing of the text correctly. What this means is that an
application is intricately tied to Uniscribe and the associated OpenType
font, and the developer must know the actual capabilities of Uniscribe's
shaping behaviour. This is rather unfair, for developers should
concentrate on the processing of the information and not be burdened with
formatting details. Elsewhere in this analysis, we have provided examples
of three different Microsoft applications that compute the width of the
same text string in totally different ways. It turns out that when you
copy and paste a Unicode text string into Word, cursor movement no longer
works at the syllable level as required, but rather at the level of the
individual Unicode character. The cursor cannot be reliably positioned for
editing the copied text by moving it to the required syllable. Amusing
results will be seen if you try to do this. Much of this can be inferred
from the illustration above.
The case against arbitrarily long syllables.
The basic assignment
of Unicode allows arbitrarily long syllables to be constructed even though
they will make no sense. Uniscribe attempts to process long text strings
to identify syllables and this can lead to absurdities. From what is known
in India, there are only about a thousand meaningful syllables, most of
which have only two consonants and rarely three or four consonants. There
is virtually no need to provide a new shape for a new syllable, even
if it is built with three or four consonants, because the writing system
permits the syllable to be written in split form.
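
To make the point about consonant counts concrete, here is a deliberately
simplified Python sketch that splits Devanagari text into syllables and
counts the consonants in each. The pattern (consonants joined by the
virama, followed by an optional vowel sign and an optional modifier) is our
own simplification and not Uniscribe's actual rule set; it ignores many
special cases. The limit of four consonants is the figure suggested in the
text, used here as an assumed constant.

```python
import re

CONSONANT = "[\u0915-\u0939]\u093C?"    # consonant, optional nukta
VIRAMA = "\u094D"
JOINER = "[\u200C\u200D]?"              # optional zwnj / zwj
VOWEL_SIGN = "[\u093E-\u094C]"
INDEPENDENT_VOWEL = "[\u0904-\u0914]"
MODIFIER = "[\u0900-\u0903]?"           # candrabindu, anusvara, visarga

# Simplified syllable: C (virama C)* (vowel sign | virama)?  or a vowel.
SYLLABLE = re.compile(
    "(?:" + CONSONANT
    + "(?:" + VIRAMA + JOINER + CONSONANT + ")*"
    + "(?:" + VOWEL_SIGN + "|" + VIRAMA + ")?"
    + "|" + INDEPENDENT_VOWEL + ")"
    + MODIFIER
)

MAX_CONSONANTS = 4  # assumed limit, following the suggestion in the text


def syllables(text):
    """Split text into syllables according to the simplified pattern."""
    return [m.group(0) for m in SYLLABLE.finditer(text)]


def consonant_count(syllable):
    """Count the base consonants in one syllable."""
    return sum(1 for ch in syllable if "\u0915" <= ch <= "\u0939")


def too_long(syllable):
    """True if the syllable has more consonants than the assumed limit."""
    return consonant_count(syllable) > MAX_CONSONANTS


if __name__ == "__main__":
    word = "\u0938\u094D\u0924\u094D\u0930\u0940"  # SA VIRAMA TA VIRAMA RA II
    for s in syllables(word):
        print(len(s), "code values,", consonant_count(s), "consonants")
```

A shaping engine that applied such a limit could simply fall back to the
split form for anything longer, instead of attempting a combined shape.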
While one may feel
pleased that there is no limit to the syllables that can be formed by Uniscribe,
one can readily see that a perfectly valid Unicode string can cause enough
confusion to the shaping engine. We have already seen an example of this.
Uniscribe could well stop with three or four consonant syllables to make
the text preparation process simpler. Editing at the syllable level is
not without its problems in Microsoft applications.
Keeping track of two representations.
The need to identify syllables correctly, along with the need to maintain
correct spacing of text on the screen, requires very complex processing.
The problem arises because the display is managed in terms of codes
referring to glyphs, while the text itself is handled using the assigned
character codes (Unicode) for the script. The irony is that the OpenType
font is also a Unicode font with valid glyph codes, yet there is no one to
one relationship between the characters of the stored text and the glyphs.
Errors are bound to occur in any computation that has to struggle hard to
keep track of two different representations at the same time.
Copy/paste features in an
application heavily rely on the ability of the application to trace back
to the internally stored text from the displayed text. For most western
scripts this is straightforward but for any writing system that follows
a syllabic representation, this requirement is not easy to fulfill.
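
The bookkeeping involved can be sketched in a few lines of Python. Given
text that has already been divided into displayed syllables, the sketch
records which range of stored character codes each syllable covers, so
that a cursor position or a selection made on the display can be traced
back to the stored text. The function names are hypothetical, and the
syllable division is taken as given; this illustrates the idea only and is
not Uniscribe's interface.

```python
from itertools import accumulate


def cluster_offsets(clusters):
    """Return (start, end) character offsets for each displayed syllable."""
    ends = list(accumulate(len(c) for c in clusters))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def cluster_at(offsets, char_index):
    """Return the index of the syllable containing a stored character."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return i
    raise IndexError("character index outside the text")


def selection_to_text(text, offsets, first, last):
    """Return the stored text behind displayed syllables first..last."""
    return text[offsets[first][0]:offsets[last][1]]


if __name__ == "__main__":
    # Two syllables, assumed to be already identified by a segmenter.
    clusters = ["\u0938\u094D\u0924\u094D\u0930\u0940", "\u092D\u093E"]
    text = "".join(clusters)
    offsets = cluster_offsets(clusters)
    print(offsets)
    print(cluster_at(offsets, 3))
    print(selection_to_text(text, offsets, 1, 1) == clusters[1])
```

A cursor that moves by entries of this table moves syllable by syllable;
one that moves by stored character lands inside a syllable, which is the
behaviour observed with pasted text in Word.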