Data Entry methods

Data Entry Methods suited for Indian languages

  This section deals with the subject of preparing texts and documents in various Indian languages and scripts, by using the standard QWERTY keyboard seen with most computers.

    The answer to the question "can we do it as simply as one does it for English?" is an obvious NO but a qualified YES.  The "no" part of the answer has to do with the fact that the limited number of keys on the keyboard will certainly not be able to cater to the thousands of aksharas which occur in our texts.  The qualified "yes" is based on the observation that the keys may be used to represent only the vowels and consonants and thus provide for inputting a series of consonants and vowels from which the required aksharas may be formed using suitable computer programs.

    The question of data entry in Indian scripts has attracted the attention of scholars and computer experts for many years, and today one sees several different computer programs which permit document preparation in different languages and scripts. These programs, some of them very good in many respects, tend to differ significantly in their approaches to data entry.  The variety seen in their approaches merits discussion so that we may better understand the problems involved.

    The programs permitting data entry in many Indian languages/scripts may be classified based on the specific approach taken to forming the aksharas from the keystrokes.  These are listed below.

  • Language/script specific data entry which relies on a specific font.
  • Transliteration based data entry.
  • Data entry conforming to Manual Typewriter keyboards, specific to each language.
  • Data entry based on the INSCRIPT layout.
  • Data entry methods specific to generating HTML pages supporting our scripts (web page creation).
  • Data entry based on uniform mapping of the keys for all the languages/scripts.

Data entry methods which are based on fonts.

     The font based data entry methods utilize the feature supported by conventional word processors where the font to be used for displaying the text may be dynamically selected or changed during data entry. Today, the font rendering capabilities built into the operating systems are quite sophisticated, in that the required shape of a character may be built from several primitive shapes which are called glyphs. Each font may consist of about 200 different glyphs, where each glyph may directly represent a letter of the alphabet, a special character or a symbol.

  In fonts for Indian languages, the glyphs will invariably include shapes for the matras, special samyuktakshars and special ligatures besides the basic shapes for the vowels and the consonants themselves.  When a font is selected, the word processor will display the glyphs corresponding to the keys entered. For the English language (Roman alphabet) each letter corresponds to only one glyph in the font and data entry is smooth. In the case of Indian scripts, we have to know which keys must be entered to display the sequence of glyphs that makes up one character. In the case of Roman, the set of displayable glyphs corresponds to the set of ASCII codes that are generated when keys are pressed on the keyboard.  This is a set of 96 characters and anything more than this number will require special data entry, as the keyboard has only a limited set of keys.

  Conventional word processors are designed for languages where a letter (or a character) to be displayed is associated with one glyph only. Also, for most of the western languages, the character set itself is limited and so the set of displayable characters is well within the 96 mentioned above. Even though a font for western languages may need to accommodate only the displayable set, many glyphs in the font may be present that are not displayed when the keys are pressed during regular data entry. That is, there may be glyphs in a font which are displayable but not necessarily shown when keys are entered. These glyphs usually correspond to characters with accent marks, specialized symbols, diacritic marks etc., and may be required mostly in printed text and special applications.  Some word processors do support data entry for these glyphs which are typically located in the upper ASCII range (160-255) by allowing the numeric value of the glyph location to be input with the ALT key kept down as the numeric value is typed in.

 Fonts for Indian languages (except Tamil) are required to have many more than 96 glyphs, and so data entry based on this method of inputting the numeric glyph code values and displaying the character necessarily becomes cumbersome. Worse still, the input sequences are font specific and will vary from font to font even for a given script. Fonts for Indian languages have evolved arbitrarily and do not follow any standards since none exist.  Consequently one sees wide variations in the glyphs themselves as well as in the encoding for the font, which places the glyphs at specified locations in the table of 256 locations. As of today (March 1999), there are no standards for glyph locations for Indian scripts and it is likely that such standards may not be possible at all.

   For a basic discussion of fonts and the issues to be considered in designing Indian language fonts, the viewer may look at the  relevant section within these pages.

    A point to keep in mind is that the internal representation of the text prepared according to this method is in the form of eight bit glyph codes. This has serious consequences if one were to attempt any sort of string processing of the text, because the glyph codes bear no relationship whatsoever to the linguistic nature of the aksharas in terms of lexical ordering, sorting, indexing etc. Yet, this font based data entry method is popular with DTP packages, where one is interested more in printing text than in linguistic analysis.

     There is however a bright side to this approach. Though keyboard entry is cumbersome, one might effectively use the cut and paste facilities supported in the word processors to perform  some editing of the entered text. In some word processors, one also sees an image of the keyboard with aksharas and matras assigned to the keys and the user may simply click on the keys to select the glyph to be displayed. Also if the user were to keep a standard file containing the glyphs, then individual glyphs may be cut and pasted even for entering short sentences of text. Some Urdu word processors have this feature.

     It must be emphasized that data entry on the basis of fonts and glyph codes cannot really provide a natural interface, even if supported through sophisticated macro facilities found in some word processors. You may want to try inputting the following multilingual text using your favourite word processor or DTP program.

Well there ought to be an easy way of doing this!




Transliteration based data entry methods.

   Transliteration has been a popular approach to preparing printed documents in different Indian scripts.  The idea behind the method is to use Roman letters to represent the aksharas of the languages and process the resulting string (ASCII text) using special computer programs, to produce printed output. The output is obtained using appropriate fonts.
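The idea can be illustrated with a minimal sketch of such a processing program. The Roman letter assignments below are a small, simplified subset chosen for illustration and should not be read as the complete scheme of any package named on this page. The output is written as Unicode Devanagari purely for convenience; the packages discussed here produced font specific output instead. The function names are hypothetical.

```python
VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant

# Illustrative subset only: Roman letter -> Devanagari consonant.
CONSONANTS = {
    "k": "\u0915", "g": "\u0917", "t": "\u0924", "d": "\u0926",
    "n": "\u0928", "p": "\u092a", "m": "\u092e", "r": "\u0930",
    "s": "\u0938",
}

# Roman letters -> (independent vowel, matra used after a consonant).
# The vowel "a" has an empty matra because it is inherent in the consonant.
VOWELS = {
    "a": ("\u0905", ""), "aa": ("\u0906", "\u093e"),
    "i": ("\u0907", "\u093f"), "u": ("\u0909", "\u0941"),
    "e": ("\u090f", "\u0947"),
}


def tokenize(text):
    """Split the Roman string into symbols, preferring the longest match."""
    symbols = sorted(list(CONSONANTS) + list(VOWELS), key=len, reverse=True)
    tokens, pos = [], 0
    while pos < len(text):
        for sym in symbols:
            if text.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"no mapping for {text[pos]!r}")
    return tokens


def transliterate(text):
    """Form aksharas from a sequence of Roman consonants and vowels."""
    out = []
    after_consonant = False
    for tok in tokenize(text):
        if tok in CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)  # consonant + consonant forms a conjunct
            out.append(CONSONANTS[tok])
            after_consonant = True
        else:
            independent, matra = VOWELS[tok]
            out.append(matra if after_consonant else independent)
            after_consonant = False
    if after_consonant:
        out.append(VIRAMA)  # a trailing consonant carries no vowel
    return "".join(out)


print(transliterate("namaste"))
print(transliterate("raama"))
```

Note that the whole string is processed after it has been typed, which is how the batch oriented packages described in this section work.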

    One of the early computer programs to successfully implement this idea is the Dvng processor for Devanagari using TeX. This program produced a TeX file which could be typeset using the TeX program. Frans Velthuis, who devised this package, also included a special Devanagari font for use with the package. The Dvng package ran on Unix systems, and TeX fonts have the advantage that nearly every glyph in the font (which may have as many as 250 glyphs) may be used in printing. In contrast, fonts for other systems such as X-windows, MSWindows, PostScript etc., are restricted to just about 200 glyphs.  This is not a design limitation of the font but a problem arising out of the inability of application programs and font rendering routines to look at specific glyph locations.  As a consequence of the rich set of glyphs, the Dvng package could print a rich set of conjunct characters in Devanagari.

     After Dvng, Charles Wikner enhanced the fonts to accommodate Vedic symbols and also provided a new processing package.  As of today, the Devanagari output obtained using this package is of remarkably high quality, and Wikner's choice (or design) of the glyphs has allowed nearly a thousand different conjunct formations to be derived from the basic set of about 250 glyphs.

    Both the packages mentioned here arrived at some guidelines for standardization in the selection of the Roman letters for the aksharas of Sanskrit.  In many instances, special symbols from the ASCII set had to be used to distinguish similar sounding aksharas. Printouts using these packages were restricted to Devanagari, but Roman could be part of the text as well, permitting bilingual outputs. Subsequently TeX based systems were introduced for Tamil, Telugu, Malayalam, Gurmukhi, Gujarati and Bengali.

     Following the success of the TeX based packages, Avinash Chopde  developed a special transliteration package that allowed other scripts to be handled as well, via language specific fonts. His ITRANS package is well known on the web. Subsequently he enhanced the package to work with normal fonts under Windows-95 and X-windows and was able to generate html documents for display on the web.  The most recent version of ITRANS supports quite a few languages. 

     It must be remembered that all transliteration based data entry methods require a computer program to generate (as well as format) the output, and hence they cannot be used for interactive data preparation, where the display in Indian scripts immediately follows the keystrokes.

     The ITRANS package was followed by JTRANS, a Javascript based program by Sandip Sibal which allowed quick generation of html documents from transliterated inputs. This package introduced Xdvng, a quality font for Devanagari which could be used for viewing web pages with Devanagari text both under MSWindows and XWindows. Sibal's package is, however, restricted to Devanagari.

    The Itranslator package from Omkarananda Ashram in Rishikesh allows data entry in ASCII using the ITRANS scheme, and also allows the string to be converted to Devanagari and displayed on the screen itself. The font used by this package is probably the finest of the freely available fonts for Devanagari and is known as Sanskrit_1.2. Unfortunately, this font is suited for the Windows platform alone and has glyphs in locations that create problems on other platforms. The more recently announced version of this font (Sanskrit-98) seems to avoid the above problem. Please see the web site referenced above for recent additions to the Itranslator package, including new fonts.

Transliteration schemes for Tamil and Telugu.

    There have been a few popular packages for Tamil and Telugu which use the transliteration based data entry method. The Adhami package was written for use under DOS and subsequently enhanced to work under MSWindows, producing displays and printouts in Tamil. Other transliteration schemes such as Mylai and Cologne were also popular with Tamil. For Telugu, the RIT package developed by Rama Rao Kanneganti used TeX for typesetting the output. Details of some of the transliteration schemes may be found in our pages on transliteration principles.

Universal transliteration scheme for Indic scripts.

Recently, Dr. Anthony P. Stone has recommended a special transliteration scheme to handle all the Indian scripts. This interesting proposal uses eight bit character codes to represent the vowels and consonants and hence maps a fairly large superset of vowels and consonants from all the scripts of interest. This is a meaningful proposal but has one likely limitation. Existing data entry facilities do not permit easy typing of characters in the upper ASCII range (160-255), and so data entry using this scheme may not be feasible as of now. However, it is quite easy to display all our aksharas using this scheme. Therefore printouts of our texts in transliterated form may be easily generated. Standardization of transliteration will help considerably in dealing with Indian languages in a uniform manner.

Summary of Transliteration based data entry.

   1. This method allows text in Indian languages to be input using Roman letters. A special computer program is used to process this text in Roman to produce printouts or displays using appropriate fonts for the scripts. There are several transliteration schemes in use. Most of the processing programs run under Unix.

   2. Transliteration schemes are often specific to one Indian language/script. There is no single scheme yet that correctly handles all the Indian languages.

   3. Phonetically close Roman letters may not be found for all our aksharas. So some compromise is required in selecting the Roman letters. Also multiple representations for the same akshara seem to be allowed, making the processing somewhat complex.

   4. It is possible to confuse most of the processing programs by inputting arbitrary formations of conjunct aksharas.

   Transliteration based data entry is a workable solution for Indian scripts, since in principle, it allows for a uniform data entry mechanism for all the languages. The transliteration scheme should be comprehensive enough to handle all the aksharas across all the languages/scripts.

   Will it be meaningful to have  a system where, as one types in the transliterated text, the actual characters of the Indian script appear on the screen? This is what is being attempted by some of the recent applications which work under Microsoft Windows systems. While this is an interesting development, the transliteration schemes used are often language specific and may not always permit the formation of many complex conjuncts (Samyuktakshars).




Manual Typewriter Keyboard based data entry.

  Manual typewriters for different Indian languages have been available for quite some time and their use in Educational institutions and Government offices is substantial. Manual typewriters provide for a minimal set of aksharas consisting of the basic vowels and consonants together with the matras so that text can be prepared conforming to the writing system for the language. The location of the keys for the vowels and consonants on a regional language typewriter is specific to the language.  Many are adept at using such typewriters and when they have to move over to using word processors, they would rather see the same keyboard mappings.  Some word processors do indeed provide for data entry based on the typewriter based key mappings.  The resulting text may not include a number of conjuncts but will be entirely adequate for normal modern day correspondence.

Data entry based on the INSCRIPT keyboard.

  The INSCRIPT keyboard allows more or less uniform data entry of text across the different scripts. The mapping provides for the data entry of vowels, consonants and matras consistent with the specifications in ISCII. The INSCRIPT layout utilizes only the keys provided on a standard QWERTY keyboard and is hence implemented easily on personal computers. It may be observed that a number of keys normally used for punctuation or special symbols are also mapped to the ISCII characters. It will therefore be difficult to perform data entry of text along with the full complement of punctuation marks which have come into use with almost all the scripts. Microsoft applications also use the INSCRIPT layout for Unicode data entry and hence suffer from this problem. The Microsoft Hindi keyboard has apparently provided for many punctuation marks, but one has to effect multiple keystrokes to enter them. Shown below is the INSCRIPT layout on a QWERTY keyboard. Keys corresponding to the ISCII characters are common across all the scripts.

INSCRIPT Layout
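The uniformity of the layout across scripts may be sketched as follows. The sketch relies on the fact that the Unicode blocks for the Indian scripts follow the ISCII arrangement, so that a given character sits at the same offset in each block. The three key assignments shown are believed to agree with the INSCRIPT layout but are included only as an illustration and should be checked against the layout itself; the function name is hypothetical.

```python
# Starting code point of the Unicode block for a few scripts.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
}

# Key -> offset of the character within a script block.
# Illustrative assignments only: "k" = ka, "e" = aa matra, "d" = virama.
KEY_TO_OFFSET = {"k": 0x15, "e": 0x3E, "d": 0x4D}


def type_keys(keys, script):
    """Map the same keystrokes to the selected script's characters."""
    base = SCRIPT_BASE[script]
    return "".join(chr(base + KEY_TO_OFFSET[key]) for key in keys)


# The same two keystrokes give "kaa" in whichever script is selected.
for script in SCRIPT_BASE:
    print(script, type_keys("ke", script))
```

Not every offset is occupied in every script (Tamil, for instance, has fewer consonants), so a complete implementation would also have to reject keys that have no character in the selected script.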


Special programs for Web page creation.

  During the past several years, display of Indian language text on the Internet (Newspapers and Magazines) has become popular. Web pages in Indian scripts are feasible on account of the fact that web browsers may be asked to display a given text in a specified font. We have included some useful information on this in our section on setting up web pages supporting Indian scripts.

  The html standard provides for an interesting way of specifying the glyphs to be displayed, either through the numeric code assigned to the glyph or through the universal name assigned to that glyph location, consistent with the font encoding that has now become standard. This way, the html language also functions as a macro language, where a text string describing the glyphs to be shown may be just typed in using standard ASCII.  While one may not need to worry about this for glyphs in the displayable ASCII range, the approach is very useful for glyphs in the upper ASCII range.  In a lighter vein, some people on the net refer to this as the method for the "ASCII impaired"!   The advantage of this approach need not be emphasized, for virtually any text editor capable of data entry for the upper ASCII characters can be used to produce web pages in Indian languages, provided one has patience!

  As an example, the html document shown below will produce the display given in the image that follows. &lt; and &laquo; represent two glyphs (at the locations of < and «) that are specified through their name entities.

<html>
<center> View the source of this document to see how name entities have been used in preparing the Devanagari string seen below <br>
<font face="sanskrit 1.2"> s&lt;Sk&laquo;tm! </font>
</center>
</html>

  The user preparing the html document must necessarily know the location of the glyphs. This, as we know is font specific, even if the font is meant for a specific script.
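Preparing such strings by hand is tedious, and the substitution itself is mechanical. A minimal sketch using the entity table in Python's standard library is given below; the function name to_entities is hypothetical. It replaces the markup characters and everything above the seven bit range with a name entity where one exists and with a numeric code otherwise. It does nothing about the font dependence noted above: the glyph string itself must still be known.

```python
from html.entities import codepoint2name


def to_entities(glyph_string):
    """Rewrite a glyph string so that it can be typed in plain ASCII."""
    parts = []
    for ch in glyph_string:
        code = ord(ch)
        if ch in "<>&" or code > 127:
            name = codepoint2name.get(code)
            # Use the name entity when one is defined, else the numeric code.
            parts.append(f"&{name};" if name else f"&#{code};")
        else:
            parts.append(ch)
    return "".join(parts)


# The glyph string from the html example above.
print(to_entities("s<Sk\u00abtm!"))  # s&lt;Sk&laquo;tm!
```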

  In a sense, generating display through html is similar to the macro based approach taken by TeX, the typesetting program developed by Dr. Knuth. While TeX has the advantage of using most of the 256 glyphs in a font, html displays are constrained to using only about 200, thus losing the ability to display some conjunct letters.

Web Pages supporting display of Unicode Text.

  Unicode has been accepted as a meaningful standard for handling multilingual text. Most browsers introduced after 2002 include support for this. With Unicode text, the method indicated above does not apply, for the encoding standard automatically identifies the font to be used. Unfortunately, rendering Unicode text in Indian languages is beset with a multitude of problems and it is unlikely that correct rendering of text will be realized. Unicode text in Indian scripts will have to be created using appropriate programs such as Microsoft Word and related applications. As of this writing (April 2005), several browsers cannot correctly display Indian language text represented through Unicode. The interested reader may visit the pages at this site where the difficulties encountered in dealing with Unicode for Indian languages are explained in greater detail.




Phonetic mapping of the vowels and consonants.

  One way of looking at data entry in Indian languages is to view the text as consisting of aksharas that can always be decomposed into vowels and consonants and perhaps a few symbols. In this phonetic approach to data entry, just one key stroke is associated with each vowel and consonant and a computer program (typically an input module in an application) keeps track of the keystrokes and forms the aksharas.  In many ways, this approach is similar to the transliteration based data entry except that we are not constrained to mapping the vowels and consonants to any specific keys. Also, in the transliterated input case, more than one keystroke may be required to form a vowel or a consonant (e.g., an aspirated consonant or a diphthong).

  The Inscript keyboard layout (the recommended standard for ISCII based systems) follows this approach, though it includes keystrokes for the matras as well. Since the addition of a matra to form a consonant vowel combination is not uniformly applicable to all cases (in Tamil and Malayalam, the combination with the vowel "u" changes the shape of the consonant), the Inscript keyboard does not correctly indicate or reflect what would happen when a combination is input. However, it may be assumed that the key for a matra does not always result in a matra but may change the shape of the consonant. The Inscript keyboard basically confirms that a phonetic approach to data entry is feasible. True, the basic requirement here is that the input module must process each keystroke taking into consideration the previously entered keys, and also check whether the conjunct is valid or meaningful. But this is a module that can be written once and incorporated into an application program, to work uniformly across all the Indian languages.

 The data entry scheme recommended for the IIT Madras software essentially follows this approach with one additional facility.  The CTRL key (or an equivalent) is used to indicate that a combination is to be effected between the previously formed akshara and the current input. Thus the user explicitly indicates that a conjunct will have to be formed.  This feature is helpful in situations where the user attempts to input consonants and vowels not present in a language. The system will not accept such inputs, thus providing a safeguard that only valid combinations may be input.

 In the phonetic approach, the key mappings do not relate to the generic consonants, i.e., consonants without any vowel. The mapping relates to the form of the consonant where the first vowel "ah" is assumed to be present.  This is often the way the consonants are taught to children. This way, only one keystroke will be required to enter the most frequently required form of each consonant, as opposed to the case with transliteration based data entry where two keystrokes will be needed.
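An input module of this kind may be sketched as below. This is only an illustration of the principle under stated assumptions and not the actual IIT Madras implementation: the key assignments are invented, the CTRL key is represented by the token "CTRL" placed before the consonant to be joined, and the output is written as Unicode Devanagari purely for convenience of display. The check made here is structural only; a real module would also consult a table of the conjuncts valid for the language.

```python
VIRAMA = "\u094d"
JOIN = "CTRL"  # stands in for the CTRL key (or an equivalent)

# Hypothetical key assignments. Each consonant key yields the consonant
# with the first vowel already present, so one keystroke gives "ka".
CONSONANT_KEYS = {
    "k": "\u0915", "t": "\u0924", "n": "\u0928",
    "m": "\u092e", "s": "\u0938",
}

# Vowel key -> (independent vowel, matra used after a consonant).
VOWEL_KEYS = {"i": ("\u0907", "\u093f"), "e": ("\u090f", "\u0947")}


def form_aksharas(keys):
    """Process keystrokes one at a time, keeping track of the previous one."""
    out = []
    prev_is_consonant = False
    join_pending = False
    for key in keys:
        if key == JOIN:
            if not prev_is_consonant or join_pending:
                raise ValueError("nothing to form a conjunct with")
            join_pending = True
        elif key in CONSONANT_KEYS:
            if join_pending:
                out.append(VIRAMA)  # join with the previous consonant
            out.append(CONSONANT_KEYS[key])
            prev_is_consonant = True
            join_pending = False
        elif key in VOWEL_KEYS:
            if join_pending:
                raise ValueError("a conjunct needs a consonant")
            independent, matra = VOWEL_KEYS[key]
            out.append(matra if prev_is_consonant else independent)
            prev_is_consonant = False
        else:
            raise ValueError(f"key {key!r} is not mapped")
    if join_pending:
        raise ValueError("incomplete conjunct")
    return "".join(out)


# "namaste": na, ma, sa, join, ta, then the vowel e replaces the "a" of ta.
print(form_aksharas(["n", "m", "s", JOIN, "t", "e"]))
```

Five letter keys and one CTRL suffice here, where a transliterated input of the same word needs seven keystrokes, and an invalid sequence is rejected as soon as it is typed.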


Last updated on 10/26/12