Data entry issues with Unicode
  When text is prepared through data entry, the user should be given a natural interface for generating the desired text from keystrokes. In the past, several schemes were proposed, depending on the coding chosen or the font used to display the aksharas (see the section on Data entry in Indian languages). The Inscript layout has traditionally been recommended for use with ISCII, and data entry in Microsoft applications supporting Indian languages is based on this layout. In this scheme, the keystrokes correspond to pure vowels, consonants, matras and special characters, all of which have been assigned specific Unicode values. Since the matras have been assigned their own codes, they can be typed in standalone form, though a standalone matra may be displayed with a dotted circle to show where it would be located with respect to a consonant.

  In the implementation of Unicode based data entry, the basic understanding is that each keystroke registers internally as a Unicode character, and it is the responsibility of the application to form the desired syllables from the codes for the consonants, vowels and matras. The normal rule is that a syllable is formed when a series of consonants is terminated with a vowel or a matra, conforming to the form CCCCV. Here "C" refers to a generic consonant without a vowel. In Unicode, however, the code assigned to a consonant actually denotes a syllable with one consonant and the built in vowel "a", so one forms syllables by typing in a halanth character in between, ChCh..ChM, where "h" refers to the halanth character and "M" to a matra. The generic consonant therefore has to be distinguished during data entry through the use of the halanth character, as well as some context defining characters such as the zero width modifiers (the zero width joiner and non-joiner).

  The general form of the syllable in Unicode will therefore be ChCh...ChM, with no specific restriction on the number of consonants.
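
  The syllable form can be illustrated with a minimal sketch for Devanagari. The pattern below reads the form as zero or more consonant + halanth pairs, followed by a final consonant that carries the matra (or the built in "a" when no matra is typed). That reading, the code point ranges chosen, and the names make_syllable and split_syllables are our assumptions for illustration; a real shaping engine handles many more cases.

```python
import re

# Assumed Devanagari code points (a simplification that ignores nukta,
# anusvara, visarga and the zero width modifiers).
HALANTH = "\u094D"               # virama, the "h" in the syllable form
CONSONANT = "[\u0915-\u0939]"    # KA .. HA, each carrying the built in "a"
MATRA = "[\u093E-\u094C]"        # dependent vowel signs AA .. AU
VOWEL = "[\u0904-\u0914]"        # independent (pure) vowels

# Zero or more consonant+halanth pairs, a final consonant, an optional matra;
# a pure vowel forms a syllable on its own.
SYLLABLE = re.compile(f"(?:{CONSONANT}{HALANTH})*{CONSONANT}{MATRA}?|{VOWEL}")


def make_syllable(consonants, matra=""):
    """Join consonants with halanth and append the optional matra."""
    return HALANTH.join(consonants) + matra


def split_syllables(text):
    """Return the syllables found in text, skipping anything unmatched."""
    return SYLLABLE.findall(text)


# Three consonants (SA, TA, RA) and the matra II give a single syllable.
stri = make_syllable(["\u0938", "\u0924", "\u0930"], "\u0940")
print([f"U+{ord(c):04X}" for c in stri])

# BHA + AA, RA, TA: three syllables from four code points.
print(split_syllables("\u092D\u093E\u0930\u0924"))
```

  Note that one displayed syllable here spans six code points, which is why the number of code points in a string says little about the number of syllables it contains.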

 
  Interested viewers may perform a small experiment to see the vagaries of text editing under Windows 2000/XP. We have prepared three different files containing the same linguistic information, i.e., the same text string in different scripts. The text files can be opened under Notepad, Wordpad or Word. The RTF files may be opened under Wordpad or Word. Notice the differences in the actual display across the three applications, and also check how differently the applications behave while editing. Editing backwards from the end of a string under Word is quite an experience!

  Text processing algorithms lose their simplicity and elegance when they have to examine arbitrarily long multiple byte strings to extract the linguistic information contained in them. When the same linguistic quantum is given two or more different representations (all perfectly acceptable as equivalents), processing becomes involved and often leads to unpredictable results, since one cannot really predict what syllable will come next in a string. The screen shot below shows what happens in Word when a series of keystrokes is input using a key that corresponds to a matra. Word merely displays each matra with a dotted circle up to a point, beyond which one additional input can throw the application into confusion. Worse still, try typing four or five lines of the same matra in Wordpad, then select the text and copy it. The application runs into an error situation and outputs a message. Often it just crashes!

Legal Unicode strings but with no Linguistic content!

  An application supporting Unicode based data entry in Indian languages is also expected to allow data entry of all legal Unicode values. It is therefore possible to type in perfectly legal Unicode strings that have no linguistic content, as in the illustration below. While there is no harm in permitting data entry of all legal Unicode values, identifying whether a string has valid linguistic content is a complex issue. Many applications suffer from bugs in the implementation of this feature, which basically boils down to identifying the quanta that can be handled by the shaping engine displaying the syllables.
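
  As a rough illustration of the kind of check involved, the sketch below flags matras and halanth characters that are not attached to a consonant, which is the case a renderer typically shows with a dotted circle. The rule used (a sign counts as attached only when the character immediately before it is a consonant), the Devanagari ranges, and the name orphan_signs are assumptions made for this example; it is far from a complete test of linguistic validity.

```python
HALANTH = "\u094D"  # Devanagari virama


def is_consonant(ch):
    """True for Devanagari KA .. HA (assumed range, nukta forms ignored)."""
    return "\u0915" <= ch <= "\u0939"


def is_matra(ch):
    """True for the dependent vowel signs AA .. AU (assumed range)."""
    return "\u093E" <= ch <= "\u094C"


def orphan_signs(text):
    """Return the indices of matras or halanths not preceded by a consonant."""
    orphans = []
    prev = ""
    for i, ch in enumerate(text):
        if (is_matra(ch) or ch == HALANTH) and not is_consonant(prev):
            orphans.append(i)
        prev = ch
    return orphans


# Three copies of the matra I: legal Unicode, but no syllable is formed.
print(orphan_signs("\u093F" * 3))

# BHA + AA, RA, TA: every sign is attached to a consonant.
print(orphan_signs("\u092D\u093E\u0930\u0924"))
```

  An editor could use such a check to warn the operator, or to decide which code points to hand to the shaping engine as a unit, without rejecting the input itself.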


Special symbols and punctuation marks

  It is true that traditional manuscripts written in India do not include punctuation. In line with the western tradition, however, punctuation is now standard with most scripts. In assigning Unicode values, it was assumed that punctuation symbols from the western scripts would not be assigned again in other scripts, so that a single assignment would suffice. Typically, the keyboard would provide for all important punctuation marks to be keyed in directly.
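
  The single assignment can be seen by listing the punctuation code points in a string of Indian language text. The short sketch below uses Python's standard unicodedata module; the function name punctuation_in and the sample string are ours, chosen only for illustration.

```python
import unicodedata


def punctuation_in(text):
    """List (character, code point, Unicode name) for each punctuation mark."""
    found = []
    for ch in text:
        # General categories starting with "P" cover all punctuation classes.
        if unicodedata.category(ch).startswith("P"):
            found.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return found


# A Devanagari word (KA, halanth, YA, AA) followed by a question mark.
print(punctuation_in("\u0915\u094D\u092F\u093E?"))
```

  The question mark reported here is the same U+003F used with the western scripts; the Devanagari letters and signs in the string are not classified as punctuation and are skipped.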

  In respect of Indian scripts, the keyboard layout used for data entry assigns most of the keys to one letter or another of the script and thus does not directly provide for all punctuation marks to be entered. One sees this problem in the Inscript layout used in Microsoft applications: it may not be possible to type in a punctuation mark unless the keyboard is switched. In Tamil, for instance, at least four important symbols (the question mark, the exclamation mark and the parentheses, among others) cannot be typed in, as the keys corresponding to these have been assigned Tamil letters. With Devanagari, the parentheses can be typed in but not the question mark or the exclamation mark.

Switching keyboards is not an issue that we can ignore since it requires additional effort on the part of the operator.


 