Data entry issues with Unicode
When text is prepared through data entry, the user should be provided with a natural
interface to generate the desired text from the keystrokes. In the past,
several schemes have been proposed, depending on the coding chosen or the
font used to display the aksharas (Ref. section on Data entry in Indian
languages). The Inscript layout has traditionally been recommended for
use with ISCII, and data entry in Microsoft applications supporting Indian
languages is based on this layout. In this scheme, the keystrokes correspond
to pure vowels, consonants, matras and special characters, all of which
have been assigned specific Unicode values. Since the matras have been
assigned codes of their own, it is possible to type them in standalone
form, though a standalone matra may be displayed with a dotted circle
to indicate where it will be placed with respect to a consonant.
In the implementation
of Unicode based data entry, the basic understanding is that each keystroke
will register internally as a Unicode character and it will be the responsibility
of the application to form the desired syllables from the codes for
the consonants, vowels and matras. The normal rule is that a syllable is
formed when a series of consonants is terminated with a vowel or a matra,
conforming to the form CCCCV, where "C" is a generic consonant without
a vowel. In Unicode, however, the code assigned to a consonant actually
represents a syllable with one consonant and the inherent vowel "a". One
therefore forms a syllable by typing a halanth character after every
consonant except the last, giving ChCh...CM, where "h" refers to the
halanth character and "M" to a matra. The generic consonant thus
has to be distinguished during data entry through the use of the halanth
character, as well as some context defining characters such as the zero
width joiner and zero width non-joiner. The general form of
the syllable in Unicode will therefore be ChCh...CM, with
no specific restrictions on the number of consonants.
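The syllable rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it covers Devanagari alone, the code point ranges are taken from the Unicode Devanagari block, the zero width characters and other signs are ignored, and the names `SYLLABLE` and `split_syllables` are ours, not part of any application discussed here. The pattern accepts any number of consonants joined by the halanth, with the last consonant followed by an optional matra or a closing halanth, or else a single independent vowel.

```python
import re

# Assumed Devanagari ranges (Unicode block U+0900-U+097F).
CONSONANT = "[\u0915-\u0939]"   # ka .. ha
VOWEL = "[\u0905-\u0914]"       # independent vowels a .. au
MATRA = "[\u093E-\u094C]"       # dependent vowel signs
HALANT = "\u094D"               # halanth (virama)

# (C h)* C followed by an optional matra or closing halanth,
# or one independent vowel on its own.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{HALANT})*{CONSONANT}(?:{MATRA}|{HALANT})?|{VOWEL}"
)

def split_syllables(text):
    """Return the syllables found in text, in order of appearance."""
    return SYLLABLE.findall(text)

# ka + halanth + ssa, ta + halanth + ra + matra i, ya
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
print(len(split_syllables(word)))  # 3
```

Note that the eight stored codes of the sample word collapse into three syllables, which is the grouping an application has to recover before it can display or edit the text.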
One may perform a small experiment to see the vagaries of text editing under
Windows 2000/XP. We have prepared three different files containing the
same linguistic information, i.e., the same text string in different scripts.
The text files can be opened under Notepad, Wordpad or Word. The RTF files
may be opened under Wordpad or Word. Notice the differences in the
actual display when seen in the three applications and also check how
the applications behave differently while editing. Editing backwards from
the end of a string under Word is quite an experience!
Text processing algorithms
lose their simplicity and elegance when they have to examine multiple byte
strings that are arbitrarily long, to extract the linguistic information
contained in the strings. When the same linguistic quantum is given two
or more different representations (all perfectly acceptable as equivalents),
processing becomes involved, often leading to unpredictable results. One
simply cannot predict which syllables will occur in
a string. In the screen shot below, one sees what happens in Word when
a series of keystrokes is input. The key corresponds to a matra. Word merely
displays the matra with a dotted circle up to a point, beyond which a single
additional keystroke is enough to confuse the application.
Worse still, try typing four or five lines of the same matra in Wordpad,
then select the text and copy it. The application runs into an error situation
and outputs a message. Often it just crashes!
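The problem of multiple representations can be seen with a short Python sketch. The example is our own and is not taken from the applications discussed above: the Devanagari letter QA has a code of its own (U+0958) but may equally be stored as KA followed by the nukta (U+0915 U+093C). A plain comparison treats the two strings as different. One common remedy, assumed here for illustration, is to normalize both strings with the standard `unicodedata` module before comparing them; the helper name `same_text` is hypothetical.

```python
import unicodedata

QA_PRECOMPOSED = "\u0958"       # DEVANAGARI LETTER QA, a single code
QA_SEQUENCE = "\u0915\u093C"    # KA followed by the nukta sign

def same_text(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(QA_PRECOMPOSED == QA_SEQUENCE)           # False
print(same_text(QA_PRECOMPOSED, QA_SEQUENCE))  # True
```

Normalization only reconciles representations that Unicode itself defines as canonically equivalent. Equivalents that differ in other ways, for example through context defining characters, would still need handling by the application.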
Unicode strings but with no linguistic content!
An application supporting
Unicode based data entry in Indian languages is also expected to allow
data entry of all legal Unicode values. It is therefore possible to type
in perfectly legal Unicode strings that have no linguistic content, as
in the illustration below. While there is no harm in permitting data
entry of all legal Unicode values, it is difficult to determine
whether a string has valid linguistic content. Many applications
suffer from bugs in the implementation of this feature, which basically
boils down to identifying the quanta that can be handled by the shaping
engine displaying the syllables.
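A minimal sketch of such a check is given below, again for Devanagari only and under our own assumptions. It reports the positions of matras and halanth characters that do not directly follow a consonant (or a consonant with a nukta), which is the situation in which a display shows the dotted circle. The function name `orphan_signs` and the code point ranges are illustrative; a real shaping engine applies far more detailed rules.

```python
# Assumed Devanagari code points (Unicode block U+0900-U+097F).
CONSONANTS = range(0x0915, 0x093A)   # ka .. ha
MATRAS = range(0x093E, 0x094D)       # dependent vowel signs
NUKTA = 0x093C
HALANT = 0x094D

def orphan_signs(text):
    """Return the indices of matras or halanths lacking a base consonant."""
    orphans = []
    prev = -1  # code point of the previous character, -1 at the start
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp == NUKTA and prev in CONSONANTS:
            continue  # consonant + nukta still counts as a consonant
        if (cp in MATRAS or cp == HALANT) and prev not in CONSONANTS:
            orphans.append(index)
        prev = cp
    return orphans

print(orphan_signs("\u0915\u093F"))        # []  (ka + matra i)
print(orphan_signs("\u093F\u093F\u093F"))  # [0, 1, 2]  (three bare matras)
```

The second call corresponds to the experiment of typing the same matra repeatedly: every code is legal Unicode, yet none of them belongs to a syllable.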
Unicode and punctuation marks
It is true that traditional
manuscripts written in India do not include punctuation. In line with the
western tradition, punctuation is now standard with most scripts. In assigning
Unicode values, it was assumed that punctuation symbols from the western
scripts need not be assigned again in other scripts, so a single assignment
would suffice. Typically, the keyboard would provide for all important
punctuation marks to be keyed in directly.
In respect of Indian
scripts, the keyboard layout used for data entry utilizes most of the keys
to type in one letter or another of the script and thus does not directly
provide for all punctuation marks to be entered. In the Inscript layout
seen in Microsoft applications, one sees this problem. It may not be possible
to type in a punctuation mark unless the keyboard is switched. In Tamil,
for instance, at least four important symbols (the question mark, the exclamation
mark, the parentheses etc.) cannot be typed in, as the keys corresponding
to these have been assigned Tamil letters. With Devanagari, the parentheses
can be typed in but not the question mark and the exclamation mark.
Switching keyboards is not
an issue that we can ignore since it requires additional effort on the
part of the operator.