Software Design Issues

Multilingual Computing with Indian languages

Basic issues

   The term "Multilingual Computing" refers to the use of computer applications in Indian languages. Traditionally, computer applications have been based on English as the medium of interaction with the system. In India, when one attempts to use computers for education and literacy, one faces the problem of language: the majority of the population that should benefit from Information Technology does not speak English.

   Elsewhere in the world, computer applications have been developed in the different languages appropriate to user communities in different countries. Application development relies on user interfaces which display information in the script relevant to the user. The scripts used for text display have generally been simple scripts based on the letters of an alphabet, typically running to just about a hundred different symbols or shapes. Coding such information has been relatively straightforward.

   The writing systems in use in the South Asian region are based on syllable representation, and for this reason it has generally been difficult to develop user interfaces supporting them. Multilingual text representation has, however, been made possible through Unicode, the scheme that supports representation of the scripts of the world so that computer applications can be truly multilingual in terms of user interaction.

   It turns out that it has not been easy to adapt the methodologies suggested by Unicode to text in Indian languages, on account of the complexities of the writing systems. While in principle methods have been suggested for handling syllable-level information through multibyte Unicode strings, practical difficulties arise in developing applications based on Unicode (or, for that matter, ISCII) where text or linguistic processing is involved. Basically, the problem is one of internal text representation, which must necessarily differ from the representation used for rendering the text. Technical solutions have been recommended to handle this, but one cannot assert that the results have been satisfactory.

   At the Systems Development Laboratory, IIT Madras, the view is that for computer applications to be really meaningful, text processing with Indian languages must be attempted at the syllable level, consistent with the rules of the writing system. The project undertaken in the lab has emphasized the need to develop solutions that are universal in their applicability across the languages of India. There are several technical issues to be considered, as well as the viability of solutions for implementations that will gain acceptance from users. The discussions in the pages of this site highlight the issues involved in computing with Indian languages.

Rendering Text

  An interactive computer application should permit data entry and display in the script of interest to the user. With writing systems that are syllabic in nature, data entry has to ultimately lead to the formation of syllables, if needed through multiple keystrokes. The problem of text rendering deals with approaches that can be taken to represent syllables in such a way that they can be displayed correctly.

  In most computer applications today, the internal representation involves codes for each keystroke, and hence a syllable is specified through a variable number of codes. From a variable-length code (the internal representation) one has to generate the display using appropriate fonts. Dealing with variable-length codes is a difficult problem. It is for this reason that IITM has recommended the use of fixed-length syllable-level codes for text. The problem of text rendering is discussed in detail in the section on Electronic Representation of Text.
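
  The contrast can be sketched in Python. Devanagari is used for illustration; the fixed syllable code value below is invented, not an actual IITM code:

```python
# One written syllable, "kSa" (क्ष), is stored in Unicode as three
# codepoints: KA + VIRAMA + SSA — a variable-length representation.
syllable = "\u0915\u094D\u0937"
print(len(syllable))   # 3 codepoints for a single displayed shape

# A fixed-length scheme assigns one code per syllable instead, so
# every syllable occupies the same storage unit.  The value below
# is purely illustrative.
syllable_code = {"\u0915\u094D\u0937": 0x1234}
```

  With one code per syllable, the length of a string in storage units equals its length in syllables, which is what makes standard string algorithms directly applicable.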


Encoding Standards

  Over the years, different approaches have been taken to represent syllables for internal storage. Many of these relied on the availability of a specific font designed to accommodate the shapes from which syllables could be composed. In 1991, the ISCII standard was adopted for general use, and subsequently Unicode for the different Indian scripts became a standard. Salient aspects of these encoding standards are discussed in these sections.

  ISCII and Unicode standards are discussed in a separate section.

Font Standards

  In simple terms, a font provides for displaying a set of symbols through well-defined shapes for each symbol. The symbol is a generic concept, and the font is an instance of a specific representation of a set of symbols. Traditionally, the symbols mentioned here have been the letters of the alphabet in a particular language, along with punctuation marks and special characters. Fonts used to be created by craftsmen and artists in the days of printing machines that used movable typefaces. Today, fonts are created by artists and designers who work with computer-based tools.

   Inside a font, the specific shape for a symbol is described either in terms of a digital image through bitmaps or in terms of a filled outline. The former is called a bitmapped font and the latter an outline font. Outline fonts are increasingly used on account of their scalability. The descriptions result in a pictorial representation or shape for each symbol, which is referred to as a glyph. Most fonts have provision for describing up to 256 different glyphs, though in practice only about 190-240 may be present. Text-mode displays on computers (DOS or Unix command shells) use bitmapped fonts, while outline fonts are used with graphical user interfaces.

  Font standards have evolved over the years and apply to the scripts of different languages. Fonts restricted to eight-bit codes for selecting glyphs cannot support multiple scripts at the same time. Traditionally, to allow a given 8-bit code to refer to a shape, the concept of font encoding was employed. Font encoding permits a shape to be identified with a name (possibly even several names). Thus the name for a shape is mapped to a geometrical description of the shape. Such mapping implies that the code for a given name is located through a table mapping names to codes, and the resulting code is used as an index into the set of glyphs to select the shape appropriate to the name.
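
  The two-step lookup described above can be sketched as follows; all names and index values here are invented for illustration:

```python
# Step 1: the font encoding maps an 8-bit code to a glyph name.
# Step 2: the font maps that name to a glyph (shape) index.
# All names and numbers below are invented for illustration.
encoding = {0x41: "A", 0x61: "a"}    # code -> glyph name
font_glyphs = {"A": 36, "a": 68}     # name -> glyph index

def glyph_index(code):
    """Resolve an 8-bit code to a glyph index via the name table."""
    return font_glyphs[encoding[code]]

print(glyph_index(0x41))   # 36
```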

  Fonts are typically designed to support specific encoding schemes. While rendering text generated from a computer application, the code for the text is first mapped to a name, and the shape corresponding to the named character is located through the internal name-to-glyph mapping in the font. Web browsers allow the choice of specific fonts for specific text encoding schemes, which typically differ across platforms. This permits text in a web page to be displayed properly using an appropriate font for which support is provided on the system on which the browser is running.


Fonts for Indian languages

   Fonts are inherently proprietary in nature and tend to be incompatible across computer systems as well as applications. It is true that the internet is a repository where many fonts are given away freely by their designers. However, the design of a font, commercial or free, must deal with issues of rendering complex conjuncts and syllables when one thinks of Indian languages and scripts. Today, it appears that very few non-Roman fonts available for practical use are supported under all the important platforms. This has imposed fairly severe restrictions on web display of Indian-language text, on account of the largely arbitrary approaches taken to designing the fonts. True, these fonts were designed not with text processing in mind but for getting good printouts; when we use the same fonts for web pages, however, we run into many incompatibilities.

  In text editors and word processors, the internal representation of text is usually the ASCII code (possibly Unicode today) of the letter displayed, and these codes happen to be the same as the numeric codes assigned to the glyph locations containing the letters of the alphabet in standard fonts for the western languages. When it comes to fonts for Indian languages, the display has to be built up with more than one glyph for many aksharas, and hence the internal representation of the aksharas is purely a function of where the glyphs for the aksharas are located within the font. Thus one faces the problem that the stored text is not in a format that can be viewed on different computer systems, because the encoding may not be supported on each system. Also, glyph codes are the choice of the font designer and bear no relationship to the ordering of the aksharas in our scripts. Linguistic processing of the stored text is therefore a formidable task, being font dependent even for the same script.

  Thousands of fonts have been designed for Indian scripts, but each design has its own specific purpose, often compatibility with a typing scheme. Applications dealing with Indian scripts have generally relied on the availability of specific fonts for a script. Consequently, applications could not move data transparently across platforms, since encoding issues come into play. The problem continues even with Unicode, where some standardization of text representation has been effected, because text rendering is supposed to be separated from the internal representation (strictly speaking, this is not true of Unicode).

   Unlike a web page in English, which could be viewed through a substitute font if the specified font were not available, text in Indian languages requires the specified font to be present. This is the reason why web pages often make available for download the font associated with the displayed information. Unfortunately, a single font cannot be used across platforms, and even a given font is not guaranteed to display text properly if the internal encoding differs, which is usually the case with many Indian-language fonts.

Fonts for Indian languages are discussed in a separate section.


Language enabling

  Language enabling is a concept whereby a computer application allows data entry and display in the required language, through dynamic selection of the language during data entry. In an application that is enabled for a particular language, what is seen on the screen or printed will have text in that language. Data entry may not always be straightforward if the letters of the language bear no resemblance to the Roman alphabet. Many applications therefore project a keyboard on the screen and allow data entry through mouse clicks. In all these cases, the current practice is largely one keystroke, one glyph, where each glyph shown on the screen corresponds to an individual letter of the alphabet. This approach does pose difficulties for languages in which the representation of characters involves combinations of two or more glyphs to display a single conjunct character.

  Language-enabling methods rely on switching the keyboard input for the entry of text in different scripts. The current approach is to effect this switch through services in the operating system (typically called locales). It often happens that one is required to switch locales to permit input of punctuation or special symbols that are often required in a script but are not part of the traditional writing system using that script. This happens quite frequently with Indian scripts, which today employ standard punctuation. Often the keyboard assignments are tight, and one may not be able to accommodate the newer symbols unless the locale is switched or multiple keystrokes are employed even for simple punctuation. In other words, keyboard switching becomes inevitable when multilingual text is to be entered, and the required locales have to be included in the OS for this to work properly.


Localization

  Language localization is a totally different concept, in which the entire interaction with the application, including all the commands, is in the specific script for the language. This calls for major enhancements to the system software to allow interpretation of text strings in different languages. In an application that is localized for a particular language, one may never see Roman text on the screen, and all computing, including the naming of files, may be done in the specified language. In other words, an application supporting localization for a language can provide an effective user interface in that language; a person need not know English to run the application.

   Localization is difficult to achieve for languages which have a large number of letters, such as the Indian languages. This is a consequence of the fact that localization of applications still relies on the assumption that a small set of letters (128) is all that will be encountered in text processing. It turns out that while one sees improvements in rendering multilingual text, the interpretation of the text string continues to pose problems. The real problem is that of having to work with syllables for the purpose of interpretation, while the rendering of text has to do with the shapes of the written characters.


Unicode

Unicode for Indian scripts
 The generic concept of Unicode works well for the western languages, where only one shape is associated with one and only one code value. That is, each code value can directly refer to a glyph index, and when the glyphs are placed side by side, the required display is achieved. In this case, a text string is rendered simply by horizontally concatenating the shapes (glyphs) of the letters. Thus a Unicode font for a western script needs only one glyph for each character code, and the glyph index and the code value can be exactly the same. When the glyph indices are given, the original text is also known exactly, due to the one-to-one mapping. Most languages whose writing system is based on the Latin alphabet come under this category.

  This simplistic view does not help when the displayed shape does not correspond to a single letter but relates to a group of consonants and a vowel which constitute a linguistic quantum. In the South and South-East Asian regions, writing systems are based on rendering syllables, not individual consonants and vowels. Accented characters in European scripts may also be viewed in this light, being made up of two or more shapes derived from two or more codes.

  The problem at hand in respect of Indian languages is one of finding a way to display thousands of such combinations of basic letters, where each combination is recognized as a proper syllable. This corresponds to a situation where a string of character codes maps to a single shape. In the context of Indian scripts, the code for a consonant followed by a code for a vowel will usually imply a simple syllable, often rendered by adding a matra (vowel ligature) to the consonant, though there are enough exceptions to this rule.

  Those responsible for assigning Unicode values to the Indian languages were aware of the complexity of rendering syllables. But they felt that the assigned codes correctly reflected the linguistic information in the syllable, and so suggested that there was no need to assign codes to each syllable: it would be (and should be) possible to identify syllables from a string of consonant and vowel codes, just as syllables are identified in English. What was specifically recommended was that an appropriate rendering or shaping engine be used to actually generate the display from the multibyte representation of a syllable.
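
  The recommendation amounts to segmenting a codepoint string into syllables before shaping. A much simplified sketch in Python, assuming Devanagari and ignoring many special cases (no nukta, no vowel modifiers, no final halant forms):

```python
import re

CONS = "[\u0915-\u0939]"    # Devanagari consonants KA..HA
MATRA = "[\u093E-\u094C]"   # dependent vowel signs
VIRAMA = "\u094D"
VOWEL = "[\u0904-\u0914]"   # independent vowels

# Zero or more consonant+virama pairs, a consonant, an optional matra;
# or a standalone independent vowel.
syllable_re = re.compile(f"(?:{CONS}{VIRAMA})*{CONS}{MATRA}?|{VOWEL}")

def syllables(text):
    """Split a Devanagari codepoint string into syllable substrings."""
    return syllable_re.findall(text)

print(syllables("\u0915\u094D\u0937\u093F"))   # ['क्षि'] — one syllable
```

  A real shaping engine then maps each such syllable to the glyph sequence the font provides for it; this sketch covers only the segmentation step.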

  Since Unicode evolved from ISCII, there was also the special provision of Unicode values to specify the context in which a consonant or vowel was being rendered as part of a syllable. In other words, Unicode also provided for explicit representations, achieved by forcing the rendering utility to build up a shape for a syllable different from the default. Thus Unicode for Indian scripts does not strictly separate rendering from internal representation, and provides codes which specify the context for rendering. This bias exhibited by Unicode can cause enough headaches for developers when Unicode text has to be processed.
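
  One such mechanism in modern Unicode is the pair of join controls, which force a particular rendering of a consonant cluster; a small illustration:

```python
# U+200C and U+200D are the standard Unicode join controls.
ZWNJ = "\u200C"   # zero-width non-joiner: suppress the conjunct form
ZWJ  = "\u200D"   # zero-width joiner: request a joined (half) form

default  = "\u0915\u094D\u0937"               # normally rendered क्ष
explicit = "\u0915\u094D" + ZWNJ + "\u0937"   # same letters, conjunct suppressed
print(len(default), len(explicit))            # 3 4
```

  The two strings carry the same linguistic content but different codepoint sequences, which is exactly the kind of ambiguity a text-processing application must normalize away.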

Limitations

  Limitations of Unicode are seen more in respect of text processing than text display. The nature of the writing systems followed in India requires multiple display forms for a given syllable, and this cannot be provided easily, for the onus is on the programmer to check whether the required display form can be generated using the given font. Hence the application is influenced by what the font can offer. A direct consequence is that applications across platforms will not be in a position to utilize a standard rendering approach, resulting in incompatibilities across applications and platforms.

  A detailed discussion of Unicode for Indian Languages and the incompatibilities observed in standard applications has been included in an independent section of this site.


Data Entry methods

  The specific problem discussed here is the use of the standard QWERTY keyboard to key in data in Indian scripts. This is a fairly well understood problem, and several solutions are available. Data entry rules should be easy to follow while at the same time permitting the formation of complex conjunct aksharas consistent with the rules of the writing system. In respect of Indian scripts, one sees additional requirements brought about by the use of punctuation marks from the Roman script. Please read the discussion in the section on data entry methods suited for Indian languages.
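
  Multi-keystroke data entry can be sketched as longest-prefix matching over a key table. The table below is invented for illustration and is not the IITM keyboard layout:

```python
# Successive Roman keystrokes accumulate until they resolve to one
# akshara.  The key table is a made-up example, not a real scheme.
KEYS = {
    "ka": "\u0915",               # क
    "kSa": "\u0915\u094D\u0937",  # क्ष — three codepoints, one akshara
    "i": "\u0907",                # इ
}

def compose(keystrokes):
    """Greedily convert a keystroke string to script text."""
    out, i = "", 0
    while i < len(keystrokes):
        # try the longest key sequence first
        for j in range(len(keystrokes), i, -1):
            if keystrokes[i:j] in KEYS:
                out += KEYS[keystrokes[i:j]]
                i = j
                break
        else:
            i += 1   # skip an unrecognized keystroke
    return out

print(compose("kSa"))   # क्ष
```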


Transliteration

   Transliteration has been an important approach to displaying text in Sanskrit, Tamil and other languages using equivalent Roman letters with suitable diacritical marks. Transliteration simplifies text processing when only Roman letters are involved. In fact, TeX has taken this further by permitting a description of the syllables, so that a preprocessor can identify the manner in which each syllable is to be composed.

  Transliteration can be seen in books on Indology printed during the early days of printing in India, when typefaces for Indian scripts had not yet come into use. Unfortunately, there have been no standards in respect of the choice of Roman letters. Today, several schemes are in use, each having specific merit for specific scripts.

  Transliterated text is more amenable to linguistic processing using conventional ASCII-based text processing algorithms.
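
  For instance, ordinary regular expressions work directly on transliterated text; the ASCII convention below ("aa" for the long vowel) is one common choice among the several schemes in use:

```python
import re

# Transliterated Sanskrit searched with a plain ASCII regex.
verse = "raamaaya raamabhadraaya raamacandraaya"

# find every word beginning with "raama"
words = re.findall(r"\braama\w*", verse)
print(words)
```

  The same search on native-script text encoded with glyph codes would depend on the font's internal layout, which is precisely the problem transliteration sidesteps.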

  The Acharya site offers detailed information on transliteration, and the IITM software includes utilities for converting transliterated text to a format suitable for further processing with its tools.


Linguistic considerations

 Linguistic processing refers to the analysis of syllables or text strings in a language. This requirement comes up whenever one is trying to study the grammatical information in a text string, or just to interpret the string as a command. Very often, linguistic processing will require the use of a corpus. The creation of the corpus should also take into account linguistic aspects of the language, such as the verb forms associated with different tenses.

  In respect of Indian languages, linguistic processing is to be attempted at the level of the syllable, since the writing systems are syllabic in nature. Almost all the Indian languages have a structure based on root syllables from which actual words are derived. The scriptures exhibit remarkable consistency in the number of syllables used in a verse, and this consistency is also an indirect means of checking the correctness of text.
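
  Such a consistency check can be sketched on transliterated text: each syllable carries exactly one vowel, so counting vowel groups approximates the syllable count. The vowel list and sample line are illustrative (the line is the opening quarter of the Bhagavad Gita, which carries 8 syllables):

```python
import re

# Long vowels and diphthongs first so they count as one syllable.
VOWEL = re.compile(r"aa|ii|uu|ai|au|[aeiou]")

def syllable_count(line):
    """Approximate syllable count of a transliterated line."""
    return len(VOWEL.findall(line.replace(" ", "")))

print(syllable_count("dharmakShetre kurukShetre"))   # 8
```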

  The pages on linguistics and computation contain useful information on this subject. You will also see amazing word structures and long palindromes from noted poetic works in Sanskrit and Tamil.

Writing Systems

  Writing systems employed for Indian languages work on the principle of syllables, where individual syllables are identified through individual shapes. In essence, a syllable is composed from basic shapes following the rules of the writing system, which employ special shapes for the medial vowels as well as ligatures for specific combinations of consonants and vowels. Though the languages of India are phonetic in nature, each language is free to use a script which can display the syllables correctly so as to map to the correct sounds. Hence the basic sounds, and not the script, are fundamental to the writing system for any Indian language. When text processing is to be attempted, one faces the problem of identifying the syllables corresponding to the displayed text. Systems which emphasize the script (and hence code the syllables consistent with the requirements of the script) often run into problems if text in the language is to be displayed in another script. This happens to be the case with Unicode.

The writing systems of India are explained in detail in a separate section.

Text analysis

 Text analysis refers to the process of interpreting text for a specific purpose, say to find out whether a specific combination of sounds is present in the text. Text analysis involves string processing of syllables, and the coding scheme employed for the text often decides the complexity of the algorithms required for string manipulation. Fixed-length codes are much more effective for this purpose (based on the principles of regular expression matching) than variable-length codes.
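
  The advantage of fixed-length codes is that every syllable occupies one slot, so string algorithms index syllables directly. A minimal sketch with invented 16-bit values:

```python
import array

# A "text" of four syllables, one 16-bit code each (values invented).
text = array.array("H", [0x0101, 0x0202, 0x0101, 0x0303])

# locate every occurrence of the syllable whose code is 0x0101 —
# a single pass with no variable-length decoding needed
hits = [i for i, c in enumerate(text) if c == 0x0101]
print(hits)   # [0, 2]
```

  With variable-length codes the same search would first have to find syllable boundaries, which is where most of the algorithmic complexity arises.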

  Text analysis is an invaluable tool for linguistic studies. A separate page has been devoted to the discussion of this topic.
 


Application development

Interactive applications

  The IITM software was primarily written to support effective user interfaces in Indian languages for different applications. The software is very well suited for developing web based interfaces for Indian scripts. The multilingual editor is the base for many other interactive applications to support Indian language text entry into a computer. Please look at the page describing the applications developed as part of the IITM project.

Syllable level Codes

  Syllable-level codes simplify text processing, since existing algorithms for fixed-length codes can be utilized. While it is true that applications based on the Unicode representation of text in Indian languages have been implemented with some success, the basic problem of uniformity across applications (and platforms) continues. The real issue here is that applications which process Unicode are required to handle the specific rendering of text; with syllable-level codes one does not see this problem. Though one can legitimately ask whether syllable-level codes can be standardized for arbitrary syllables, it turns out that virtually every syllable encountered in practice can be handled using a superset of syllables across the languages. A detailed discussion is provided in the section relating to the IITM Syllable level Coding scheme.

Local Language Library

  The Local Language Library is a set of functions which may be called from an application program to perform input, output and string processing of text in Indian languages. These functions are similar to the ones in the standard C library but work with syllable-level codes. The functions also have some features like those of the curses library used under Unix for text rendering. The Local Language Library is universal in the sense that its calls are language/script independent. Hence applications built with the library will work transparently across the languages, and the required script can be selected through parameters when the application is invoked. Since the functions work on the basis of fixed-length syllable-level codes, one can use standard algorithms for string processing, regular expression matching, etc.
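
  To give a feel for script-independent string calls over syllable codes, here is a hedged sketch; the function names and code values are invented and are not the actual Local Language Library API:

```python
# C-library-style calls operating on syllable codes, not bytes.
# Names and values are illustrative only.
def ll_strlen(codes):
    """Length in syllables, not bytes or glyphs."""
    return len(codes)

def ll_strcmp(a, b):
    """Compare syllable-code sequences, like C's strcmp."""
    return (a > b) - (a < b)

a = [0x0101, 0x0202]
b = [0x0101, 0x0203]
print(ll_strlen(a), ll_strcmp(a, b))   # 2 -1
```

  Because the calls never inspect glyphs, the same application logic works for any script; only the rendering layer changes.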

  The Library is documented in a separate page.

 Text to Speech ( Software for the disabled )

   The availability of the computer opens up several possibilities for helping the disabled gain basic educational and professional skills. Many countries have developed very useful applications suited to specific disabilities, and these have won appreciation from experts all over the world. Unfortunately, most of these applications require a knowledge of English. The IITM software development team had already envisaged the need for many of these applications to support Indian-language user interfaces. This is a technically challenging proposition which at the same time can provide many opportunities for the rural masses of the country to get a basic education and, through that, employment. The technical issue ultimately serves a social cause, and one needs no further incentive for taking up the project. As of April 2002, the Systems Development Laboratory has successfully developed speech-enhanced applications for use by the visually handicapped. Details are available in the corresponding pages.


PERL Modules

 The fixed-size two-byte encoding used in the IITM software lends itself to direct manipulation using PERL, which is a remarkably good choice for writing applications that interpret scripts written in Indian languages. Very little is required by way of enhancements to standard PERL, which handles regular expressions with great ease and simplicity.

  The enhancement required in PERL is a simple module which can present "llf" characters (an llf character corresponds to one syllable) as equivalent ASCII strings. Such a module has been developed in the lab and is known as "llperl". This module provides support for processing text prepared with the IITM multilingual editor or any application which can generate syllable-level codes consistent with the IITM coding standard. The idea behind this approach is to permit PERL programs to be written using the IITM editor, with text strings in Indian languages present in the program.
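
  The underlying idea, sketched here in Python rather than PERL for brevity: expand each fixed-width syllable code into an equivalent ASCII string so that ordinary regular expressions apply. The code table is invented, not the llperl mapping:

```python
import re

# Expand syllable codes to ASCII equivalents, then use plain regexes.
# The code-to-ASCII table below is illustrative only.
ASCII_OF = {0x0101: "ka", 0x0102: "ma", 0x0103: "la"}

def to_ascii(codes):
    """Render a sequence of syllable codes as an ASCII string."""
    return "".join(ASCII_OF[c] for c in codes)

word = [0x0101, 0x0102, 0x0103]           # one code per syllable
print(re.findall("ma", to_ascii(word)))   # ['ma']
```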

  Details of the PERL Modules are available in the linked page.




Last updated on April 21, 2005