Electronic Representation of Text

  Electronic processing of text in any language requires that characters (letters of the alphabet along with special symbols) be represented through unique codes. Usually, this code will also correspond to the written shape of the letter. A code is basically a number associated with each letter, so that computers can distinguish between different letters through their codes. The ASCII code is the standard by which the Roman alphabet is handled.
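
  As a small illustration, the sketch below (in Python, whose built-in ord and chr functions convert between a character and its code) shows the number behind each letter:

    # Every character is stored as a number; ord() and chr() convert between them.
    for ch in "Text":
        print(ch, ord(ch))      # 'T' -> 84, 'e' -> 101, 'x' -> 120, 't' -> 116

    print(chr(65), chr(97))     # 'A' and 'a': ASCII codes 65 and 97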

  The code serves the important purpose of standardizing the handling of text across different computer systems. As of today, ASCII is probably the only code correctly interpreted on virtually all computer systems, so information consisting of pure ASCII-coded text is viewable on almost all computers. Email is an example of an application that works on almost all computers, since it uses pure ASCII-coded text in its messages.

  In the early days of information processing, the Roman character set served as the basis for interacting with computer systems. European languages, which require a slightly different set of characters, manage quite well with the ASCII approach by replacing a few of the special characters in ASCII with symbols specific to each language.

  ASCII itself is a seven-bit code covering a range of 128 characters, of which 96 (codes from 32 to 127) are reckoned as standard displayable ASCII. In practice, characters are stored in eight bits, and the codes between 128 and 255 are often used for displaying symbols useful for tabular information, graphics, etc. Some languages written in non-Roman scripts, such as Greek and Russian, also assign codes for their alphabets in the range 128 to 255, allowing bilingual information (English as well as the specific language) to be displayed easily. The International Organization for Standardization has come up with recommendations known as the Latin character sets, which encode the alphabets of the different European languages. For an excellent review of these, the viewer is encouraged to look at Internationalization.

  The fundamental idea in standardizing character codes is to allow data entry to proceed using the standard QWERTY keyboard. Most word processors and text editors associate the keys on the keyboard with specific ASCII codes and hence can support data entry in any language that assigns the displayable ASCII codes to its alphabet. Display is effected through fonts, where each ASCII code is associated with the shape of the letter to be displayed. A font is typically built for a specific assignment of codes to characters (known as the font encoding). The displayed shape or form of a character will also differ from font to font even when the fonts are encoded identically, which allows a given string of text to be displayed using shapes suited to a given requirement.

  Since there are many languages and the ASCII encoding supports only 96 regular letters for data entry and display, some special mechanism is needed to associate codes with the letters of different languages if multilingual information is to be displayed. This mechanism has been provided through Unicode, which is essentially derived from ASCII but provides a means of identifying the script associated with the characters. Unicode caters to a very large set of characters representing several scripts of the world.
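
  To make this concrete, here is a minimal sketch in Python (whose strings are Unicode): the code point of a character identifies its script, with Devanagari occupying the range U+0900 to U+097F:

    import unicodedata

    # Each script owns a range of Unicode code points.
    for ch in ("A", "\u0905", "\u0915"):   # Latin A, Devanagari A, Devanagari KA
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0041 LATIN CAPITAL LETTER A
    # U+0905 DEVANAGARI LETTER A
    # U+0915 DEVANAGARI LETTER KA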

  Unicode is yet to be recognized by much of the word processing and data entry software running on different computers of the world (as of today, Jan. 2005). Microsoft Windows (Windows 2000/XP), Linux and Java support Unicode, but the bulk of systems continue with the plain ASCII approach. Unicode is an international standard that should necessarily be understood by persons developing multilingual software. Web browsers provide support for displaying Unicode text on different computer systems.

   Before we take up the issue of coding Indian language characters, we make some important observations.
 

  • Eight-bit character codes are entirely adequate for languages whose alphabet is a small set and whose written text consists of only the individual letters themselves and possibly some punctuation.
  • Data entry in any language can be effected with ease by encoding the letters of the alphabet along the lines of the ASCII code and using appropriate fonts with the word processing software. If the codes are instead assigned in the range 128-255, data entry is not straightforward and will require special input mechanisms.
  • A document prepared on the basis of Unicode encodings will be truly multilingual. However, data entry on word processors supporting Unicode (e.g., Microsoft Word) still remains cumbersome. More about this in the section on Unicode for Indian Languages.



Is there a character set for Indian languages?

   Any attempt at encoding text in Indian languages has to address this important question. While it is true that all Indian languages have a phonetic base built on a fixed number of vowels and consonants, the writing systems permit many different shapes to be generated depending on the syllables in the text. In a way this may be likened to the addition of ligatures: a ligature is a special shape added to the basic shape of a consonant when syllables are formed through combinations with other consonants and, finally, a vowel. The writing systems for Indian languages provide for representing thousands of combinations of the basic consonants and vowels.

  The Samyuktaksharas, or conjunct characters, which the writing systems use represent combinations of sounds. Linguistically, the Akshara is the basic quantum or measure used in reckoning the number of sound combinations within a word, and poetic Metre is specified according to the number of such aksharas in each line of verse. We illustrate this through an example. The verse shown below is the opening verse of the famous Bhagavadgita.

[Image: the opening verse of the Bhagavadgita, with each akshara individually marked]

  The aksharas in the verse are individually identified in the representation and, in this specific Metre, each line of verse contains two groups of eight aksharas.

   In Sanskrit and other Indian languages, one observes strict adherence to the rules of the Metre. If one has to work with text from any linguistic angle, one sees the need to identify and work with aksharas, which include combinations of consonants and a vowel. It is known that there are several thousand combinations, each having an individual representation, even though all of them are derived from a basic set of about thirty-five consonants and sixteen vowels. On the basis of this observation, the question that comes to mind is: what will constitute the character set for Indian languages? We may approach the question from two or three different viewpoints on character coding.




Internal Representation: Approach-1

Treat the basic set of consonants and vowels as the character set and recommend a code for each consonant and vowel.

  It will be easy to accommodate this set within the range of displayable ASCII. However, this alone will not work in practice, for in Indian languages a consonant-vowel combination is a single akshara and cannot be represented by simply writing the vowel after the consonant. It is possible to assign additional codes for the ligatures (vowel extensions, called matras), so that consonant-vowel combinations are also handled through the codes. This approach has the advantage that conventional word processors or editors may be used to prepare text if appropriate fonts are available.
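
  A minimal sketch of this idea in Python follows; the code values and names are purely hypothetical, chosen for illustration and not taken from any standard:

    # Hypothetical Approach-1 code table: one code each for consonants and matras.
    CODES = {
        "ka": 97,         # a consonant
        "ga": 98,         # another consonant
        "aa-matra": 65,   # vowel extension (matra) for the vowel 'aa'
        "i-matra": 66,    # matra for the vowel 'i'
    }

    def encode_akshara(consonant, matra=None):
        """A consonant-vowel akshara is a consonant code plus an optional matra code."""
        codes = [CODES[consonant]]
        if matra is not None:
            codes.append(CODES[matra])
        return codes

    print(encode_akshara("ka"))              # [97]     -> plain 'ka'
    print(encode_akshara("ka", "aa-matra"))  # [97, 65] -> 'kaa'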

  This is basically the approach taken in the ISCII scheme, a standard that was proposed in the eighties and revised during the early nineties. The ISCII values were assigned in the range 160-255, so one could work with Roman and Indian scripts simultaneously. However, data entry is not straightforward from any standard word processor and special software is required. The Centre for Development of Advanced Computing (C-DAC) had pioneered the development of systems for multilingual text, but they approached the problem partly through hardware solutions for the PC platforms. In recent times they have brought out Windows based software (ILeap), but the earlier problems remain in respect of error-free transliteration across languages as well as the lack of preservation of the sorting order in the southern languages.

  The primary objection to the ISCII approach, which is an eight-bit representation of the consonants and vowels, is that text processing becomes cumbersome on account of the variable number of bytes for each akshara. Before an akshara can be displayed, one has to identify the terminating vowel associated with the consonant (or conjunct) and generate the shape to be displayed. This is really the most difficult aspect of the approach, as it requires a complex algorithm to associate a variable number of bytes with a shape that is either obtained through a single glyph or built from multiple glyphs, depending on the set of glyphs supported in the font. In the case of the Roman letters, such a complex situation does not arise, for each byte is associated with one glyph only.

It should now be clear that any system based on ISCII (or any other eight-bit representation) has to keep track of a variable number of bytes for each akshara and, what is more, combine appropriate glyphs from the font to display it. In a typical case, one syllable may be obtained by combining two glyphs, overlaying the second on the first, while another has to be built from three glyphs placed side by side. Thus the process of displaying an akshara from the internal representation is quite complex. Worse still, it is language or script dependent, since the writing systems vary across the languages.
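
  The bookkeeping this entails can be sketched as follows; the byte values and their classes below are hypothetical stand-ins, not actual ISCII assignments:

    # Hypothetical byte classes for an ISCII-like 8 bit stream (values illustrative).
    CONSONANTS = {0xB3, 0xB8}   # say, 'ka' and 'ta'
    MATRAS     = {0xDA, 0xDB}   # vowel signs
    VIRAMA     = 0xE8           # joins consonants into a conjunct

    def split_aksharas(stream):
        """Group a variable number of bytes into one akshara each:
        consonant (virama consonant)* matra?"""
        aksharas, current = [], []
        for i, b in enumerate(stream):
            if b in CONSONANTS:
                # A consonant not preceded by a virama starts a new akshara.
                if current and stream[i - 1] != VIRAMA:
                    aksharas.append(current)
                    current = []
                current.append(b)
            elif b == VIRAMA or b in MATRAS:
                current.append(b)
                if b in MATRAS:              # a matra terminates the akshara
                    aksharas.append(current)
                    current = []
        if current:
            aksharas.append(current)
        return aksharas

    # 'ka' + virama + 'ta' + matra is ONE akshara occupying four bytes ...
    print(split_aksharas([0xB3, 0xE8, 0xB8, 0xDA]))  # [[179, 232, 184, 218]]
    # ... while two bare consonants are two aksharas of one byte each.
    print(split_aksharas([0xB3, 0xB8]))              # [[179], [184]]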

   In spite of this complexity, a system based on ISCII is implementable, though with some difficulty. For a discussion of the problems faced with the ISCII scheme of coding, visit the corresponding pages.

  It turns out that Unicode for Indian languages is similar in concept to ISCII (Unicode for Indian languages was basically derived from ISCII) with minor changes. Consequently, our discussion of ISCII also applies to Unicode.

Internal Representation: Approach-2

In this approach, we use a Roman letter or a short Roman string to represent each vowel and consonant of the language. Each string will, in some discernible manner, indicate the akshara it stands for. This is a variable-length string representation, but it consists of only Roman letters. Given below are some examples of the representation for some vowels, consonants and conjuncts.

[Table: transliteration examples for some vowels, consonants and conjuncts]
 
  We notice that this representation helps more with data entry than with display. In this specific case, transliteration involves only lower-case letters, so typing in the data is relatively easy. The problem of figuring out the glyphs from the strings remains. However, for some fonts the Roman letters themselves correspond to the glyphs required, though this works mostly for the basic consonants and vowels; Samyuktaksharas are still difficult to enter. In any transliteration scheme such as the one above, there may be instances of ambiguity: typing the vowel "i" after an "a" will be construed as the single vowel "ai", though in some cases two separate vowels may be intended (Gujarati has some words where "i" follows an "a"). Also, a transliteration scheme will have to use more than the 26 letters of the Roman alphabet to accommodate the full set of vowels and consonants of the Indian languages, and this reduces the number of punctuation marks and special symbols that may be used in the text.
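
  A small Python sketch of a greedy, longest-match tokenizer for a toy scheme (the unit strings are illustrative, not any published standard) shows the "ai" ambiguity directly:

    # Units of a toy transliteration scheme, longest strings listed first.
    UNITS = ["aa", "ai", "a", "i", "k", "g"]

    def tokenize(text):
        """Split a transliterated string greedily into scheme units."""
        tokens, i = [], 0
        while i < len(text):
            for u in UNITS:
                if text.startswith(u, i):
                    tokens.append(u)
                    i += len(u)
                    break
            else:
                raise ValueError(f"no unit matches at {i}: {text[i:]!r}")
        return tokens

    print(tokenize("kaai"))  # ['k', 'aa', 'i']
    print(tokenize("kai"))   # ['k', 'ai'] - read as the diphthong 'ai',
                             # even where 'a' followed by 'i' was intended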
 

  This approach is the transliteration-based representation for Indian language aksharas. Several software packages have relied on this representation for preparing documents. Starting with "DVNG", a package based on TeX for producing quality printed documents, there have been several transliteration schemes proposed and used on the web. Most of these schemes have chosen the strings arbitrarily, and hence there is no common choice across the languages. The section on Transliteration principles explains the idea behind these schemes.

  It should be mentioned that this approach continues to suffer from the problems of variable-length representations. Also, dictionary sorting order cannot be maintained, for sorting proceeds on the basis of the arrangement of the ASCII values and not the order in which Indian language aksharas are placed.
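
  A two-line Python example makes the sorting problem concrete: in the Indian alphabets the consonants are ordered ka, kha, ga, ..., yet ASCII comparison of the transliterated words yields a different order:

    # Dictionary order in the script would be: kavi, khaga, gati (ka < kha < ga).
    words = ["gati", "kavi", "khaga"]
    print(sorted(words))   # ['gati', 'kavi', 'khaga'] - plain ASCII order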



Internal Representation: Approach-3

 In this approach to defining the character set and assigning codes, we identify the set of all aksharas that have been in use across all the languages of the country and assign a unique code to each akshara, including the Samyuktaksharas. As seen in earlier sections, this set is of the order of thousands of individual combinations, so the normal eight-bit encoding will not be adequate. A sixteen-bit encoding is recommended.

  The sixteen-bit code will work well, for it identifies the aksharas from a linguistic angle as well. However, the most unacceptable aspect of this, at least for computer people, lies in the fact that no existing software will recognize it, and so the advantages of using conventional software such as word processors will be lost. It is true that Unicode is technically a sixteen-bit code and many computer programs may (or will) recognize it. However, Unicode, as seen earlier, assigns only about seven bits' worth of information to the characters of each Indian script, and so the methods of handling Unicode will not apply to our 16 bit scheme. In other words, the number of characters assigned to a script in Unicode is of the order of 128 for most scripts, and hence the associated fonts are expected to support only that many glyphs.

   Though this proposed 16 bit code catering to a large number of aksharas would be the right choice for processing text in Indian languages, the need to write new programs for every meaningful application cannot be ignored.

    It is this problem that the development team at IIT Madras pursued in 1991. Fortunately, a good solution has been found: one that combines the advantages of both the eight-bit and the sixteen-bit representations. The solution is explained below.

    During data entry, a special processing module converts the entered data into the sixteen-bit representation. The entered text may be displayed using conventional display methods based on fonts, or using the special rendering module developed for this purpose. Since display is based on 8 bit glyph codes, virtually all the methods available to us for displaying text may be used. Also, the 16 bit representation may be converted into formats consistent with other word processors; one such format, the Rich Text Format (RTF), has already become popular. Besides, the HTML format itself is universal enough to display information through 8 bit font glyph codes, and so the approach naturally allows Indian language web pages to be created with ease.

    Thus, the complex problem of data entry and internal representation for the large set of aksharas can be managed through special input routines added to many conventional software packages as input modules. All text processing may be attempted with fixed-size 16 bit codes, and all results displayed by transforming the 16 bit representations into glyph codes appropriate to the font used. This process is easily accomplished through a table lookup.
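
  The lookup can be sketched as below; every numeric value is hypothetical, chosen only to show the shape of the table:

    # Hypothetical rendering table: each fixed-size 16 bit akshara code maps to
    # the list of 8 bit glyph codes a particular font needs to draw it.
    GLYPH_TABLE = {
        0x1001: [0x61],               # simple akshara: a single glyph
        0x1F42: [0x61, 0xC4],         # consonant + matra: two glyphs side by side
        0x2A07: [0x73, 0x9E, 0xC4],   # conjunct: built from three glyphs
    }

    def render(codes_16bit):
        """Flatten fixed-width 16 bit codes into the font's 8 bit glyph stream."""
        glyphs = []
        for code in codes_16bit:
            glyphs.extend(GLYPH_TABLE[code])
        return bytes(glyphs)

    text = [0x1001, 0x2A07]       # two aksharas: two fixed-width codes
    print(render(text).hex())     # '61739ec4' - the variable-length glyph stream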

    In essence, the approach does not disturb the existing methods which deal with Indian scripts but merely enhances their function by allowing uniform data entry across all the languages. The linguistic requirements are also met. The required conversion routines may be written very easily using mapping tables, and so dealing with ISCII, Unicode or even transliterated text becomes quite simple. The figure below illustrates the approach.

[Figure: IITM approach to generating 16 bit codes]
 
   During data entry, the sequence shown above is input. It is transformed into two aksharas and stored. This unique representation may be converted to other standard formats such as HTML, RTF, ISCII, etc., using conversion modules running as applications. The module shown as IIT's module is a special library, available on different computer systems, to display the characters without fonts.

   It is thus apparent that the 16 bit encoding of characters is meaningful in practice, as the interfaces required to work with 8 bit systems are all provided as part of a package that works with the 16 bit code. Besides text processing, the 16 bit codes may also serve useful purposes in generating sounds corresponding to the aksharas, since each code directly relates to one sound, the sound of a syllable. This has potential applications in text-to-speech systems for Indian languages.
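
  A sketch of that reduction, with hypothetical codes and sound-file names: since one 16 bit code is one syllable, the text-to-speech front end becomes a direct lookup:

    # One code, one syllable, one recorded sound (all names hypothetical).
    SYLLABLE_SOUNDS = {
        0x1001: "ka.wav",
        0x1F42: "kaa.wav",
        0x2A07: "ktra.wav",
    }

    def sounds_for(codes_16bit):
        """No further analysis of the text is needed to find the sounds."""
        return [SYLLABLE_SOUNDS[c] for c in codes_16bit]

    print(sounds_for([0x1001, 0x1F42]))   # ['ka.wav', 'kaa.wav']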

   In summary, the 16 bit encoding for the aksharas will greatly alleviate the problem of data entry and at the same time provide compatibility with existing eight-bit standards for displaying the scripts. Many different data entry schemes may be supported, based on the convenience of the user, since the coding has no connection with the keyboard mapping assigned for the vowels and consonants. Text processing is also made simple, since a uniform 16 bit code is used for all the aksharas, permitting existing applications to do the processing after slightly modifying the definition and handling of the character. Even standard applications may be used by allowing them to work with ASCII strings obtained through suitable transformations of the 16 bit representation. Sorting, searching, indexing and similar applications work well, and virtually any client-server application may be modified to handle the longer characters without much difficulty. The method allows perfect transliteration across the languages and is thus well suited for preparing common information for the different regions of the country.

  To come back to the character set for Indian languages, the answer to the question raised above is now clear. There can be no character set for Indian languages consistent with the concept of character sets for western languages. The set of characters used for displaying text in an Indian script could well form a character set, or for that matter the basic set of vowels and consonants along with the symbols for the matras. Thus, a character set may be defined purely for display purposes, but from a linguistic angle we draw a blank unless we consider the set of aksharas as a whole. This set cannot be precisely defined, since new aksharas can always be formed. However, if we consider the manuscripts of the past few centuries, it may be possible to accept a set of about 12000 aksharas as adequate for linguistic work. If these are coded properly, text processing will be relatively easy, and existing tools for text processing may be used with very little modification to handle sixteen-bit codes.



Important Points to remember

 Text representation using codes which refer to the glyphs of a font is adequate for rendering text in Indian languages. Eight-bit fonts have been used with much success for publishing and printing high-quality documents in all the Indian scripts.


Linguistic processing is different in that one has to interpret the text string and effect some processing. Searching for specific strings in an archive is an example.

To make linguistic processing effective, one has to work at the level of the syllable. Getting a meaningful standard at the level of the syllable is difficult, for the set of syllables can become arbitrarily large. Yet it is unlikely that one would require more than the 7000-10000 syllables that have typically been written in the past.


It is preferable to have codes of the same size for all the syllables to make linguistic processing more effective. String matching (regular expression matching) works much better with fixed-width codes.


Representation through transliteration in Roman has been used with some success, but such schemes use diacritical marks to distinguish between many of the aksharas. Plain Roman can be used, as ITRANS has shown, but linguistic processing will continue to be difficult.
 

 
