Unicode for Indian Languages
A Perspective
A brief introduction
  The essential concept underlying Unicode is the assignment of codes to a superset of the world's languages (essentially the scripts used in different writing systems) so that a single coding scheme can adequately handle multilingual text in any document. In Unicode, it is generally possible to identify the language/script and the letter of the alphabet or a language-specific symbol from a unique sixteen-bit code. It is important to remember that the need to handle the different languages of the world had been felt long before Unicode was thought of. The earlier solution was a simple one: collect the set of letters to be displayed and give the set a name or an identification. A computer application could then be told to interpret a character code with respect to a character set. The idea of the character set was simply that a set of values, typically 128 and in some cases going up to 255, would relate to a set of displayed shapes or symbols for the specific language associated with the character set. The character set name would be given as a parameter to the application, which would then choose an appropriate font to display the text specified by the eight-bit code values in a text string.

  The one issue that had to be taken care of with this approach was that the application always had to work in the context of some language to interpret a code correctly. Since the same eight-bit codes were shared by all the character sets, an application could not interpret a given code unless the associated character set was also known. This is a real constraint when handling multilingual text.
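
As a small illustration, consider the following Python sketch (the choice of character sets here is ours): the same eight-bit value decodes to entirely different letters depending on which character set the application assumes.

    # One eight-bit value, three different letters, depending on the
    # character set the application assumes.
    byte = b'\xe4'

    print(byte.decode('iso-8859-1'))   # 'ä' (Latin-1, Western European)
    print(byte.decode('iso-8859-7'))   # 'δ' (Greek)
    print(byte.decode('iso-8859-5'))   # 'ф' (Cyrillic)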

 For most western scripts, the number of distinct shapes needed to convey information through displayed text is small, typically of the order of 70, and perhaps about 100 if the new symbols which have become meaningful in electronic data processing are included. Some western scripts contain accented characters which have to be treated as independent linguistic entities; alternatively, an accented letter may be viewed as a composite of a base letter and an accent mark. Viewed in this light, the normal ISO-Latin character set has about 94 displayable characters without accents and perhaps another 90 covering the accented letters, the accents themselves and other special symbols. An eight-bit code is entirely adequate to meet all linguistic requirements here.

  Computer applications render text using the rendering support provided by the operating system. Given that a code value is associated with a character set, the application chooses an appropriate font containing the letters and symbols for the script associated with that character set. Traditionally, most fonts were eight-bit fonts providing a maximum of about 190-200 glyphs for each character set.

Multilingual documents

  An application rendering multilingual text must know which portion of the document is to be rendered in which script. Typically, the formats of multilingual documents include a means of marking portions of the text with attributes such as the font, colour and size of the text. The Rich Text Format standardized by Microsoft and the HTML specification both allow a document to describe itself using descriptors made up of symbols from the same set of letters used in the script. Readers familiar with word processors will readily appreciate that a document contains a great deal of formatting detail, all of it described using only the characters from the set. These descriptors are generally known as tags. HTML documents contain many tags which tell the browser application how to present the text in a window.

  Formats which allow a document to describe itself in this way are usually known as markup languages; RTF, HTML and XML all belong to this category. While the approach is sound, a practical difficulty arises when the tag characters themselves appear as ordinary text in the document. The specifications usually handle such situations through the concept of entities, where an entity uniquely identifies a specific character through a unique name assigned to it. Multilingual text is usually tagged in ASCII, but the tags can confuse web browsers if not handled properly.
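
The idea of entities can be sketched with Python's standard html module (the example strings are ours): markup characters occurring in the text are written as named entities, and a numeric character reference can name any character by its code value.

    import html

    # The markup characters <, > and & cannot appear literally in the
    # text, so they are written as named entities.
    print(html.escape('x < y & y > z'))   # x &lt; y &amp; y &gt; z

    # A numeric character reference identifies a character by its
    # Unicode value; &#x0915; is DEVANAGARI LETTER KA.
    print(html.unescape('&#x0915;'))      # क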

  Unicode was introduced as the solution to the problem of handling multilingual text, where any character in the text can be individually and uniquely identified as belonging to a script/language. In Unicode for Indian languages, each character is identified through a field within the code which specifies the language and a field which specifies an individual letter within that language. Though sixteen bits are used to specify each code, the number of codes assigned to any language is small, often just about 128, with very few exceptions.

Unicode experts may actually describe Unicode as one single scheme for dealing with all the scripts and languages of the world, in which the code space of 65536 values has been apportioned to the different languages one after another, so the idea of splitting the code into two fields does not really apply in general. However, when only 128 code values have been assigned to a language, the two fields are easy to discern. Among the Indian languages, Unicode assignments have been effected for all the basic scripts: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. For these, the language descriptor part of the code occupies nine bits and the remaining seven identify the consonants, vowels and matras along with special symbols.

Devanagari - 128 code values from U+0900
Bengali    - 128 code values from U+0980
Gurmukhi   - 128 code values from U+0A00
Gujarati   - 128 code values from U+0A80
Oriya      - 128 code values from U+0B00
Tamil      - 128 code values from U+0B80
Telugu     - 128 code values from U+0C00
Kannada    - 128 code values from U+0C80
Malayalam  - 128 code values from U+0D00
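
The two-field view can be made concrete with a short Python sketch (the function name split_indic is ours): the high bits of a code point select one of the 128-value blocks listed above, and the low seven bits select the letter within the block. Because the Indic blocks follow a common ISCII-derived layout, the same offset generally denotes the same letter in every script.

    BLOCKS = {
        0x0900: 'Devanagari', 0x0980: 'Bengali',  0x0A00: 'Gurmukhi',
        0x0A80: 'Gujarati',   0x0B00: 'Oriya',    0x0B80: 'Tamil',
        0x0C00: 'Telugu',     0x0C80: 'Kannada',  0x0D00: 'Malayalam',
    }

    def split_indic(ch):
        """Split a code point into its block base and 7-bit letter index."""
        cp = ord(ch)
        base, offset = cp & ~0x7F, cp & 0x7F
        return BLOCKS.get(base, 'other'), offset

    print(split_indic('\u0915'))   # ('Devanagari', 21) -- letter KA
    print(split_indic('\u0C15'))   # ('Telugu', 21)     -- KA again, same offset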

The Unicode book specifies a unique English name for each code, typically a combination of the language name and an individual name for each of the 128 characters in the range. For most of the Indian scripts, several code values in the set of 128 are reserved. The actual code assignments may be seen in the charts on the Unicode Consortium web site.

Unicode and conformity to linguistic requirements

  The Unicode book is specific about implementing schemes which render text in a manner consistent with the linguistic requirements of the language. The original intent of Unicode was to represent only the basic linguistic elements forming the alphabet, not specific rendered forms. For example, an accented character used in German or French is identified as a single letter though composed of a base letter and an independent accent mark. Since such accented characters belong to the set of letters used in the writing system, they are assigned individual codes. An accented character could well be described by two codes, one for the letter and one for the accent, but in the wisdom of the designers of Unicode almost all accented characters have been assigned individual codes to make text processing simpler.

In normal Roman text (standard English) one does not see such characters, so the basic Latin set excludes them. However, they are linguistically important, and so they are included as an extension to the normal Latin character set, called the Latin supplement, where each accented character is assigned a unique code (refer to the chart at the Unicode web site). The Unicode Consortium did not, however, specify how these characters would be typed in along with English; that was left as the responsibility of the application. Even today, very few applications permit direct data entry of accented characters from the standard keyboard without resorting to a keyboard switch.
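
The two views of an accented character, one code or two, are easy to compare in a Python sketch (using the standard unicodedata module): the strings compare unequal, yet normalization confirms that they denote the same letter.

    import unicodedata

    precomposed = '\u00E9'    # é, a single code from the Latin supplement
    combining   = 'e\u0301'   # e followed by COMBINING ACUTE ACCENT

    print(precomposed == combining)                                # False
    print(unicodedata.normalize('NFC', combining) == precomposed)  # True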

  The generic concept of Unicode works well for the western languages, where exactly one shape is associated with each code value. Each code value can directly refer to a glyph index, and when the glyphs are placed side by side the required display is achieved: a text string is rendered simply by horizontally concatenating the shapes (glyphs) of the letters. A Unicode font for a western script therefore needs only one glyph for each character code, and the glyph index and the code value can be exactly the same. When the glyph indices are given, the original text is also known exactly, thanks to the one-to-one mapping. Most languages whose writing systems are based on the Latin alphabet come under this category.

  This simplistic view does not help when the displayed shape corresponds not to a single letter but to a group of consonants and a vowel which together constitute a linguistic quantum. In South and South East Asia, writing systems are based on rendering syllables rather than individual consonants and vowels. The accented characters mentioned earlier may also be viewed in this light, as composites built from two or more shapes derived from two or more codes.

  The problem at hand in respect of Indian languages is one of finding a way to display thousands of such combinations of basic letters, where each combination is recognized as a proper syllable. This corresponds to a situation where a string of character codes maps to a single shape. In the context of Indian scripts, the code for a consonant followed by the code for a vowel usually implies a simple syllable, often rendered by adding a matra (vowel ligature) to the consonant, though there are enough exceptions to this rule.

  Those responsible for assigning Unicode values to Indian languages knew about the complexity of rendering syllables. But they felt that the assigned codes correctly reflected the linguistic information in a syllable and so suggested that there was no need to assign codes to each syllable: it would be (and should be) possible to identify a syllable from a string of consonant and vowel codes, just as syllables are identified in English. What was specifically recommended was that an appropriate rendering engine or shaping engine be used to generate the display from the multibyte representation of a syllable.
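
A short Python sketch illustrates what the shaping engine is given (the example syllable is ours): the single displayed shape क्या (kyaa) is stored as four separate codes.

    import unicodedata

    syllable = '\u0915\u094D\u092F\u093E'   # rendered as one shape: क्या
    for ch in syllable:
        print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')

    # U+0915  DEVANAGARI LETTER KA
    # U+094D  DEVANAGARI SIGN VIRAMA
    # U+092F  DEVANAGARI LETTER YA
    # U+093E  DEVANAGARI VOWEL SIGN AA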

  Since the Unicode assignments evolved from ISCII, there was also special provision of Unicode values to specify the context in which a consonant or vowel is rendered as part of a syllable. In other words, Unicode also provides for explicit representations, achieved by forcing the shaping engine to build up a shape for a syllable different from the default. The zero-width joiner and non-joiner characters accomplish this, along with the Nukta character used when dotted characters (the Persian or Urdu sounds written in Hindi) have to be handled. These do not belong directly to the basic set of vowels and consonants but are, in effect, derived shapes.
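
These explicit representations can be sketched as follows (the variable names are ours); the first three strings carry the same consonants, but the joiner characters request a specific presentation from the shaping engine.

    conjunct  = '\u0915\u094D\u0937'         # क्ष: KA + VIRAMA + SSA, full conjunct
    half_form = '\u0915\u094D\u200D\u0937'   # ZWJ after the virama: half-form KA
    explicit  = '\u0915\u094D\u200C\u0937'   # ZWNJ after the virama: visible virama

    # The nukta derives a dotted (Persian/Urdu) letter from a basic consonant:
    qa = '\u0915\u093C'                      # क़: KA + NUKTA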
 

 
  The idea of assigning codes to displayed shapes may appear to contradict the original intent of Unicode where codes would be assigned only to the linguistic elements. This is usually justified on the following grounds.

  A font containing the basic letter shapes and ligatures is always required to render text as per the rules of the writing system. It does not hurt to add a few characters to the input string which influence the selection of specific glyphs for a given context, so long as the application does not interpret the string linguistically and performs only string matching. This is perfectly acceptable in situations where serious text processing is not attempted (e.g., parsing the input string to identify prefixes or suffixes in a verb). However, in the context of Indian languages, a word has to be interpreted properly to extract linguistic information, and this requires analyzing its syllable structure. It is here that the multibyte representation can cause serious headaches for a programmer, for the algorithms working with multibyte structures are usually quite complex. The presence of characters which carry no linguistic information only compounds the problem, and there is also the possibility that the algorithm will fail when ambiguities arise.
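
The kind of analysis involved can be sketched in a deliberately simplified Python fragment (ours; it ignores the nukta, the joiners and much else). Even this toy version must track the virama to decide whether a consonant starts a new syllable.

    VIRAMA = '\u094D'
    CONSONANTS = range(0x0915, 0x093A)   # Devanagari KA..HA

    def syllables(word):
        """Naive split: a consonant starts a new syllable unless the
        previous code was a virama, which binds consonants together."""
        out, cur = [], ''
        for ch in word:
            if ord(ch) in CONSONANTS and cur and not cur.endswith(VIRAMA):
                out.append(cur)
                cur = ch
            else:
                cur += ch
        if cur:
            out.append(cur)
        return out

    # आदित्य (aaditya): six codes, three syllables
    print(syllables('\u0906\u0926\u093F\u0924\u094D\u092F'))
    # ['आ', 'दि', 'त्य']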

  In the context of text processing in Indian languages, an interactive application supporting a find and replace feature may fail to locate the string in question when the user cannot identify the actual codes used in the text, even though the display looks familiar. This is in fact what happens with Unicode when different text strings get rendered identically. That all these different strings convey the same linguistic information cannot be discerned by an application unless all possible representations (i.e., Unicode text strings) for a syllable are examined, and this is not easy at all. Given below is an example of a word with three syllables represented in twelve different ways, all linguistically identical but very different in terms of Unicode representation. The associated file containing the Unicode characters is downloadable as aditya.txt.

[Image: the three-syllable word rendered from twelve different Unicode strings; the corresponding text may be downloaded as aditya.txt.]

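
One pair from such a family of representations can be checked mechanically with Python's standard unicodedata module: the precomposed letter QA and the sequence KA + NUKTA render identically and compare unequal as raw strings, and are reconciled only by normalization.

    import unicodedata

    qa_letter   = '\u0958'          # DEVANAGARI LETTER QA, a single code
    qa_sequence = '\u0915\u093C'    # KA + NUKTA, rendered identically

    print(qa_letter == qa_sequence)                    # False
    print(unicodedata.normalize('NFC', qa_letter) ==
          unicodedata.normalize('NFC', qa_sequence))   # True

Normalization settles only canonically equivalent pairs such as this one; variants built with the zero-width joiner characters remain distinct even after normalization, which is precisely the difficulty described above.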