Review of Microsoft
applications supporting Unicode for Indian scripts/languages.
Unicode support for
Indian languages/scripts is in principle available under Windows 2000/XP.
Currently Notepad, WordPad and Word 2000 appear to provide application-level
support, allowing data entry and word processing in Devanagari and
Tamil. To this end Microsoft includes two OpenType fonts, Mangal and
Latha, for Devanagari and Tamil respectively.
Data entry is based
on the INSCRIPT keyboard layout standardized for ISCII. This keyboard mapping
is uniform across the languages in respect of keystrokes for the basic
vowels and consonants. With the INSCRIPT method it may not be possible
to type the full complement of aksharas consistent with the conventions
followed in the writing systems. The layout also lacks keys for
some punctuation marks. There are no dedicated keys for
the zero-width modifier characters; these can be entered only
by typing the decimal equivalent of the Unicode value while holding
down the ALT key.
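As an illustration of the decimal equivalents involved, the short Python sketch below computes them for the two zero-width modifier characters, ZWNJ (U+200C) and ZWJ (U+200D). The helper name alt_decimal is made up for this example; the exact ALT key sequence an application accepts is not shown here and may differ between applications.

```python
# The two zero-width modifier characters defined by Unicode.
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def alt_decimal(ch: str) -> int:
    """Return the decimal equivalent of a single character's Unicode value.

    Hypothetical helper for illustration: this is the number one would
    type on the keypad while holding ALT, as described in the text.
    """
    return ord(ch)


decimal_codes = {"ZWNJ": alt_decimal(ZWNJ), "ZWJ": alt_decimal(ZWJ)}
print(decimal_codes)  # {'ZWNJ': 8204, 'ZWJ': 8205}
```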
Among the applications
in the Office 2000 suite, Word 2000 seems to implement text rendering using
Uniscribe. Excel does not seem to use the shaping engine.
The extent to which
data entry is supported consistent with the requirements of Unicode seems
to vary across the applications. Find and replace boxes do not seem to
support the entry of Unicode characters based on their decimal equivalents.
Text rendering across
applications is not consistent and is quite arbitrary. Word 2000 runs into
problems in estimating the length of words and this causes unacceptable
gaps between words. Editing behaves differently depending on whether you
backspace or delete: Delete removes the whole syllable to the right of the
cursor, while Backspace removes only the last part of the syllable before it.
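The difference between the two editing operations can be modelled in a few lines of Python. This is only a sketch of the observed behaviour, not Word's actual implementation: the clusters function uses a deliberately simplified syllable rule (a character joins the previous cluster if it is a combining mark or joiner, or if it follows a virama or joiner), and all function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"               # DEVANAGARI SIGN VIRAMA (halant)
JOINERS = {"\u200C", "\u200D"}  # ZWNJ, ZWJ


def clusters(text):
    """Split text into syllable-like clusters (simplified assumption)."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc") or ch in JOINERS
        joins_prev = bool(out) and (out[-1][-1] == VIRAMA or out[-1][-1] in JOINERS)
        if out and (is_mark or joins_prev):
            out[-1] += ch       # extend the current cluster
        else:
            out.append(ch)      # start a new cluster
    return out


def delete_forward(text, cursor):
    """Model of Delete: remove the whole cluster starting at the cursor."""
    pos = 0
    for c in clusters(text):
        if pos == cursor:
            return text[:pos] + text[pos + len(c):]
        pos += len(c)
    return text                 # cursor at the end or not on a cluster boundary


def backspace(text, cursor):
    """Model of Backspace: remove only the code point before the cursor."""
    return text[:cursor - 1] + text[cursor:] if cursor > 0 else text


# KA VIRAMA SSA | TA VIRAMA RA VOWEL-SIGN-I | YA
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
after_delete = delete_forward(word, 0)   # first syllable removed entirely
after_backspace = backspace(word, 3)     # only SSA removed, KA + VIRAMA remain
```

In this model the two operations leave the text in different states for what the user perceives as the same syllable, which is the inconsistency noted above.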
Cutting and pasting
across applications results in many inconsistencies.
There is very little
support by way of linguistic processing. String matching in Word 2000 seems
to match syllables but fails in the presence of some zero width modifiers.
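A minimal Python sketch of why a match can fail when a zero-width modifier is present, and of one possible workaround: stripping ZWNJ and ZWJ from both strings before comparing. The function names are hypothetical, and treating the joiners as ignorable for search purposes is an assumption of this example, not something Word 2000 is known to do.

```python
ZERO_WIDTH = {"\u200C", "\u200D"}  # ZWNJ, ZWJ


def strip_zero_width(s):
    """Drop zero-width modifiers, keeping all other code points."""
    return "".join(ch for ch in s if ch not in ZERO_WIDTH)


def contains(haystack, needle):
    """Substring search that ignores zero-width modifiers on both sides."""
    return strip_zero_width(needle) in strip_zero_width(haystack)


plain = "\u0915\u094D\u0937"           # KA + VIRAMA + SSA
with_zwj = "\u0915\u094D\u200D\u0937"  # same letters, ZWJ after the virama

naive_hit = plain in with_zwj              # False: the ZWJ breaks the match
tolerant_hit = contains(with_zwj, plain)   # True: joiners are ignored
```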
Text rendered in Devanagari
departs from convention for many syllables that are normally written with one
consonant below the other. This is not a serious problem for Hindi, but the alternate
shapes indicated are the ones that follow normal convention. We have used the IITM software
to generate these forms and pasted them into the document.
The implementation of Uniscribe conforms to the recommendations in the Unicode
book. However, a valid Unicode string in any Indian language need not contain
linguistically meaningful information. Quite likely, algorithms which look
for linguistic content in a Unicode string will get confused!
The availability of Uniscribe to shape Unicode text does not guarantee
anything in respect of linguistic processing of text. This is the responsibility
of the application and each application must code into itself enough linguistic
knowledge to effect any meaningful text processing. The multibyte
representation of a syllable, coupled with the need to filter out characters
that carry only rendering information, can make applications
really messy. In the illustration below, the same linguistic content is
displayed in twelve different ways, all legal in terms of Unicode representation.
For an application to actually figure out that the strings convey the same
linguistic information, very complex text processing will be required.
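The kind of processing involved can be sketched in Python. The example below is not the twelve-form illustration referred to above; it uses three encodings of one conjunct and two encodings of one nukta letter, chosen for this sketch. The function linguistic_key is hypothetical, and it assumes that ZWNJ and ZWJ are the only rendering-related characters to filter, which a real application could not take for granted.

```python
import unicodedata

RENDERING_ONLY = {"\u200C", "\u200D"}  # assumption: only ZWNJ and ZWJ are filtered


def linguistic_key(s):
    """Reduce a string to a comparison key: NFC, then drop rendering hints."""
    s = unicodedata.normalize("NFC", s)
    return "".join(ch for ch in s if ch not in RENDERING_ONLY)


# Three legal encodings of the same conjunct (KA + VIRAMA + SSA).
variants = [
    "\u0915\u094D\u0937",         # plain
    "\u0915\u094D\u200D\u0937",   # with ZWJ after the virama
    "\u0915\u094D\u200C\u0937",   # with ZWNJ after the virama
]

distinct_raw = len(set(variants))                           # 3 different strings
distinct_keys = len({linguistic_key(v) for v in variants})  # 1 shared key

# Multibyte cost of a single syllable in UTF-8.
utf8_lengths = [len(v.encode("utf-8")) for v in variants]   # [9, 12, 12]

# A precomposed nukta letter and its two-code-point spelling also unify.
qa_precomposed = "\u0958"        # DEVANAGARI LETTER QA
qa_sequence = "\u0915\u093C"     # KA + NUKTA
same_qa = linguistic_key(qa_precomposed) == linguistic_key(qa_sequence)
```

Even this small key function has to combine normalization with character filtering, and it still knows nothing about syllable structure, which gives a sense of how much linguistic knowledge an application must carry itself.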