 
Unicode for Indian languages: Summary of problems
  • Script or Language?
  • Adequate and correct representation of Linguistic content.
  • Multibyte representations are difficult to handle.
  • Modifier codes and assignment of codes for shapes.
  • Programmers have to be familiar with the writing systems.


Script or Language?

  Unicode assigns codes on the basis of script, not language. Unless Unicode provides for a script, one cannot choose that script to render a syllable. For example, there is no way to use Unicode to render Sanskrit in the Grantha script or Marathi in the Modi script.

  For years, it has been common practice to use Roman diacritics to show Indian language text. This will not be easy with Unicode, since there the concept of diacritics is associated with letters of the alphabet and not with syllables. Bilingual printing in local scripts as well as Roman diacritics is highly desirable for disseminating information relating to the scriptures and ancient literature of India.

  Transliteration across scripts, another desirable feature in computing with Indian languages, becomes unnecessarily complex with Unicode, not to speak of processing documents in Bharati Braille, which cannot be handled by Unicode at all.
 
Adequate and correct representation of Linguistic content.

  While syllables can be represented properly through a sequence of codes, the mapping from the codes to the displayed shape is beset with problems. This is due to the provision in Unicode to force a specific form for the displayed shape through the use of modifier codes. In other words, Unicode provides for specifying not only the linguistic content but, to some extent, the displayed form as well. The presence of modifier codes will cause serious problems during text processing, specifically pattern matching using regular expressions.
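
  The effect on pattern matching can be illustrated with a small Python sketch. It assumes that the modifier codes in question are the zero width joiner and non-joiner (U+200D and U+200C), and the helper name strip_joiners is hypothetical. Removing the joiners before matching is only one possible workaround, and it discards whatever display preference the author of the text intended.

```python
import re

ZWJ = "\u200d"   # ZERO WIDTH JOINER (assumed to be one of the modifier codes)
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def strip_joiners(text):
    """Remove ZWJ/ZWNJ so that only the linguistic content remains."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

# The Devanagari syllable "ksha": KA + VIRAMA + SSA
plain = "\u0915\u094d\u0937"
# The same syllable with a ZWNJ after the virama to force a display form
forced = "\u0915\u094d" + ZWNJ + "\u0937"

pattern = re.compile(plain)

# A direct search misses the forced form although the syllable is the same.
naive_hit = pattern.search(forced) is not None
# Searching the stripped text finds it.
stripped_hit = pattern.search(strip_joiners(forced)) is not None

print(naive_hit, stripped_hit)
```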

  Unicode also includes codes for medial vowels, separate from the codes for the same vowels written independently. This implies that two different codes can be specified for the same linguistic quantum. This will create problems during text processing, since identifying a vowel will involve more complex processing.
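
  As an illustration of the extra processing, the following Python sketch maps Devanagari medial vowel codes to the corresponding independent vowel codes so that both are reported as the same vowel. The table covers only a subset of Devanagari, the function name explicit_vowels is hypothetical, and the inherent "a" of a bare consonant is not handled at all.

```python
# Medial vowel sign -> independent vowel (a partial Devanagari table)
MATRA_TO_VOWEL = {
    "\u093e": "\u0906",  # AA
    "\u093f": "\u0907",  # I
    "\u0940": "\u0908",  # II
    "\u0941": "\u0909",  # U
    "\u0942": "\u090a",  # UU
    "\u0943": "\u090b",  # VOCALIC R
    "\u0947": "\u090f",  # E
    "\u0948": "\u0910",  # AI
    "\u094b": "\u0913",  # O
    "\u094c": "\u0914",  # AU
}
INDEPENDENT_VOWELS = set(MATRA_TO_VOWEL.values()) | {"\u0905"}  # plus A

def explicit_vowels(text):
    """List the explicitly written vowels, always in independent form."""
    found = []
    for ch in text:
        if ch in MATRA_TO_VOWEL:
            found.append(MATRA_TO_VOWEL[ch])
        elif ch in INDEPENDENT_VOWELS:
            found.append(ch)
    return found

# "ki" (KA + medial I) and the independent vowel "i" use different codes
# for the vowel, yet both should report the same vowel.
print(explicit_vowels("\u0915\u093f") == explicit_vowels("\u0907"))
```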

  The set of aksharas which have been assigned codes is not sufficiently complete in respect of symbols which have come into use in different writing systems. The set of symbols varies across scripts as well. For instance, the "arasunna" symbol in Telugu, or even the currency symbol, is omitted from the set, though a separate code value has been assigned for the Rupee symbol outside the range of Unicode values for Indian scripts.

Multibyte representations are difficult to handle.

  The code for a syllable is a string of Unicode characters, a multibyte representation. Multibyte representations are difficult to handle in any computer application since one has to identify syllable boundaries. In particular, Unicode allows syllables of arbitrary length to be coded, and many computer applications run into problems when the number of consonants in a syllable is arbitrarily large. WordPad under Windows used to crash when very long Unicode strings were encountered.
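
  To give a sense of the sizes involved, this short Python sketch counts the codes and bytes needed for one Devanagari syllable, "strii" (SA + VIRAMA + TA + VIRAMA + RA + II). The choice of syllable and of the two encodings is only for illustration.

```python
# One syllable: SA, VIRAMA, TA, VIRAMA, RA, vowel sign II
syllable = "\u0938\u094d\u0924\u094d\u0930\u0940"

code_points = len(syllable)                      # Unicode characters
utf8_bytes = len(syllable.encode("utf-8"))       # bytes when stored as UTF-8
utf16_bytes = len(syllable.encode("utf-16-le"))  # bytes when stored as UTF-16

print(code_points, utf8_bytes, utf16_bytes)
```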

  The problem is compounded by the fact that syllable identification is the responsibility of the application whenever text processing is required. Try writing an algorithm for detecting a palindrome to understand the complexities of the situation! Most computer applications are written to process character level information. It is only the languages of India and Southeast Asia which depart from this by requiring syllable level processing.
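
  The palindrome exercise can be sketched in Python as follows. The segmentation rule is a deliberately simplified assumption for Devanagari only (consonant clusters joined by virama, an optional vowel sign, optional anusvara, visarga or candrabindu). It ignores joiners, Vedic marks and other scripts, and the names aksharas and is_syllable_palindrome are hypothetical.

```python
import re

CONS = "[\u0915-\u0939\u0958-\u095f]"   # consonants
NUKTA = "\u093c"
VIRAMA = "\u094d"
MATRA = "[\u093e-\u094c]"               # medial vowel signs
MOD = "[\u0900-\u0903]"                 # candrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914]"               # independent vowels

# A simplified akshara: a consonant cluster with optional vowel sign,
# or an independent vowel, or any other single character.
AKSHARA = re.compile(
    f"(?:{CONS}{NUKTA}?{VIRAMA})*{CONS}{NUKTA}?(?:{VIRAMA}|{MATRA})?{MOD}*"
    f"|{VOWEL}{MOD}*"
    f"|.",
    re.DOTALL,
)

def aksharas(text):
    """Split text into syllable-like units under the simplified rule."""
    return AKSHARA.findall(text)

def is_syllable_palindrome(text):
    units = aksharas(text)
    return units == units[::-1]

# "navajiivana": NA VA JA+II VA NA
word = "\u0928\u0935\u091c\u0940\u0935\u0928"
print(word == word[::-1])            # character level comparison
print(is_syllable_palindrome(word))  # syllable level comparison
```

  Reversing the raw code sequence moves the vowel sign onto the wrong consonant, so the character level test rejects a word that reads the same in both directions syllable by syllable.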

Modifier codes and assignment of codes for shapes.

  Unicode unwittingly provides code values for shapes that relate to specific representations of syllables (e.g., the Addak sign in Gurmukhi, which is used to indicate consonant doubling). Ambiguities can arise when such codes are used in place of standard representations for syllables. A nukta character can always be placed after any consonant but will not always make sense (say, a dot under "ma" or the nasal "nga"). It is well nigh impossible to program into an application the cases where such representations have to be either ignored or interpreted properly.
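
  One narrow heuristic for the nukta case is sketched below in Python. It assumes that the only consonants which meaningfully take a nukta in Devanagari are those for which Unicode itself defines a combined letter, and derives that list from the standard character database. This is an assumption for illustration, not a linguistic rule: legitimate text may use a nukta elsewhere, and the function names are hypothetical.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA

def nukta_bases(start=0x0900, end=0x097F):
    """Collect consonants that have a Unicode-defined nukta form."""
    bases = set()
    for cp in range(start, end + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        # e.g. U+095B decomposes to "091C 093C" (JA + NUKTA)
        if len(parts) == 2 and parts[1] == "093C":
            bases.add(chr(int(parts[0], 16)))
    return bases

DEVANAGARI_NUKTA_BASES = nukta_bases()

def suspicious_nuktas(text):
    """Return positions of nukta codes that do not follow a known base."""
    bad = []
    for i, ch in enumerate(text):
        if ch == NUKTA and (i == 0 or text[i - 1] not in DEVANAGARI_NUKTA_BASES):
            bad.append(i)
    return bad

print(suspicious_nuktas("\u091c\u093c"))  # JA + NUKTA
print(suspicious_nuktas("\u092e\u093c"))  # MA + NUKTA
```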

Programmers have to be familiar with the writing systems.

  It is the responsibility of the application to check if a given string is meaningful for display. There is no equivalent of getstr( ) or putstr( ) which applications can use. The problem faced is a consequence of the fact that rendering is also the responsibility of the application, since a universal rendering engine (such as Uniscribe) can never be trusted to do the job right. When OpenType fonts are used (as recommended by the experts), the application has to determine if the desired representation is possible by querying the font, and perform glyph substitutions otherwise.

  In a variable length code situation, writing applications can be a nightmare for the programmer, since he/she has to anticipate code sequences which will not make linguistic sense but will nevertheless be valid as Unicode strings. The developer of the application is required to be familiar with the intricacies of the writing system as well. It is not often that one finds programmers who also understand the complexities of the writing systems.

  The writing systems in use permit multiple representations for a syllable. As a consequence, string processing can become hopelessly complex. This will be a serious problem for search applications which work with indexed text.
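
  A small part of this problem, the case where Unicode itself declares two code sequences equivalent, can be handled by normalizing text before it is indexed. The Python sketch below uses the standard unicodedata module, and the function name index_key is hypothetical. It does not address alternative spellings that Unicode treats as distinct, such as writing a nasal with an anusvara instead of a conjunct.

```python
import unicodedata

def index_key(text):
    """Normalize text so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)

precomposed = "\u0929"       # DEVANAGARI LETTER NNNA as a single code
sequence = "\u0928\u093c"    # NA followed by NUKTA, the same letter

# The raw strings differ, so a naive index would store two entries.
print(precomposed == sequence)
# After normalization both map to the same key.
print(index_key(precomposed) == index_key(sequence))
```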

  
