General Introduction to Indian Languages

A great many languages are spoken in India. Most of them are related to one of the officially recognized languages, of which about eighteen are identified for regular use in the country. All these languages have a phonetic base, though their writing systems vary. Some of the languages share a common script and some have scripts of their own. There are nine basic scripts, besides the scripts for Urdu and Sindhi. The eighteen languages mentioned above have been given the status of official languages by the Government. Though the use of a language may appear to be confined to a region within the country, the mother tongue of many people living in that region may be quite different, traceable to early migrations of families.

All the recognized languages (often referred to as the regional languages) have a phonetic base. A substantial set of words is common to many of these languages, and the roots of these words may be traced to specific languages such as Sanskrit or Tamil, both of which are considered very ancient.

Linguistic aspects of Indian languages have always attracted scholars from different parts of the world, on account of the great antiquity of the languages as well as their unique phonetic base. Another interesting aspect is that language was a means not merely of communication in daily life but also of expressing religious, philosophical, scientific and professional concepts in remarkably compact ways.

Computers and Indian Languages

It is not surprising, therefore, to find renewed interest today in studying and understanding many of India's ancient literary works. With the possibility of using computers for linguistic studies, and with the increasing demand to disseminate information in the vernacular, computing in Indian languages has gained significance. Though applications such as word processing and desktop publishing have been successfully implemented for Indian languages, the solutions remain substantially language specific. Many of these applications are really useful in practice, in spite of the effort needed for data entry. Yet very little seems to have been done towards processing the information electronically. Viewed nationally, there is an urgent need for a uniform and meaningful software solution to computing in Indian languages.

The phonetic nature of the languages leads to a writing system which represents sounds through unique symbols. Each language has its own representation for the sounds and thus its own script, though, as mentioned earlier, some languages use a common script. In practice, even a shared script shows small variations from language to language, and these may matter when linguistic aspects are considered.

Writing Systems

The writing systems for most Indian languages employ symbols for about sixteen vowels and as many as thirty-five consonants. Syllables formed from these basic sounds are also given unique representations. The term conjunct is used to refer to a syllable formed from two or more consonants and a vowel. Though one can theoretically think of thousands of consonant combinations, only about 800 of them are known to be in regular use. Each of these can combine with any of the vowels (800 x 16 = 12800), giving nearly 13000 individual sounds, each with its own unique representation in the script.
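The counts quoted above can be checked with a few lines of arithmetic. The Python sketch below only restates that multiplication; the constant names are chosen here for illustration, and the counts are the approximate ones given in the text, not exact figures for any particular language.

```python
# Approximate counts quoted in the text (not exact for any one language).
VOWELS = 16
CONSONANTS = 35
CONJUNCTS_IN_USE = 800

# Upper bounds on consonant clusters one could form in theory.
two_consonant_clusters = CONSONANTS ** 2    # 1225
three_consonant_clusters = CONSONANTS ** 3  # 42875

# Each cluster in regular use can take any of the vowels.
conjunct_syllables = CONJUNCTS_IN_USE * VOWELS

print(two_consonant_clusters, three_consonant_clusters)
print(conjunct_syllables)  # 12800, i.e. "nearly 13000"
```

The theoretical bounds show why "thousands" of conjuncts are conceivable, while the set in regular use is far smaller.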

Interestingly, the writing systems employ only about 200 symbols to form the unique shapes representing the conjuncts, combining shapes somewhat in the manner of ligatures. For each language, well-defined rules exist for writing most of the conjuncts and their combinations with the vowels. The term Akshara is normally used to refer to a consonant, a vowel, or a simple combination of a consonant and a vowel. The term Samyuktakshara is used to refer to conjuncts.

Handling Indian languages on the computer is complicated by the requirement that each and every one of these aksharas or samyuktaksharas be individually recognized. Though only a few hundred primitive shapes may be employed in practice to form the combinations, the large number of aksharas must still be identified individually for linguistic or text processing purposes. Children in India are taught to identify thousands of aksharas, and once they have mastered reading the script, they find learning other languages, including European languages, relatively easy.
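To make "individually recognized" concrete, here is a minimal sketch of splitting a word into aksharas. It assumes the text is held as Unicode Devanagari code points (an assumption made here because Python strings are Unicode; it is not the coding discussed in this document), and the pattern is deliberately simplified: it ignores the nukta, joiner characters and other refinements. The function name split_aksharas is hypothetical.

```python
import re

# Unicode Devanagari ranges (assumed representation, simplified).
CONSONANT = "[\u0915-\u0939]"   # ka .. ha
VOWEL = "[\u0904-\u0914]"       # independent vowels
MATRA = "[\u093E-\u094C]"       # dependent vowel signs
VIRAMA = "\u094D"               # halant, joins consonants
MODIFIER = "[\u0901-\u0903]"    # candrabindu, anusvara, visarga

# An akshara is either a chain of consonants joined by halants, with an
# optional vowel sign (or a final halant) and an optional modifier, or
# an independent vowel with an optional modifier.
AKSHARA = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?{MODIFIER}?"
    f"|{VOWEL}{MODIFIER}?"
)

def split_aksharas(text):
    """Return the list of aksharas found in text, in order."""
    return AKSHARA.findall(text)

# The word "Sanskrit" in Devanagari: seven code points, three aksharas.
word = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924"
print(len(word), split_aksharas(word))
```

The middle unit is a samyuktakshara: two consonants joined by a halant, followed by a vowel sign. A text-processing program has to treat those four code points as one item.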

The methods which work well for the limited set of twenty-six letters in the Roman alphabet fail or become inadequate when applied to Indian languages, not only because there are thousands of aksharas but also because there is more than one accepted way of writing many of the combinations. Though there are clear rules for writing the combinations, existing practices permit multiple representations for the same conjunct even within a language, not to speak of variations across languages.

Codes for the Aksharas

Thus there is a need to look at the problem of representing (coding) the large set of aksharas so as to arrive at a standard that can apply uniformly across all the Indian languages. Electronic text processing can then be attempted using these codes.

The pioneering work which resulted in the GIST technology at the Centre for Development of Advanced Computing (C-DAC) must be regarded as the earliest of the attempts towards some standardization. This development permitted DOS-based applications to handle Indian language text. The text was represented electronically using the ISCII code and was largely language independent, thus permitting a uniform approach to dealing with the languages. Over the years, this hardware-dependent approach has been replaced by quality word processing and data preparation software, but the essential eight-bit coding of the characters has been retained. As will be explained in the section on Character encoding for Indian languages, eight-bit codes are not really suitable for efficient string processing.
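One reason eight-bit codes are awkward for string processing can be sketched as follows. With one byte per consonant, vowel sign or halant, an akshara occupies a variable number of bytes, so the n-th akshara cannot be reached by simple indexing. The byte values below are toy values invented for this illustration (they are NOT actual ISCII codes), the helper akshara_spans is hypothetical, and independent vowels are left out for brevity.

```python
# Toy 8-bit codes for illustration only; these are NOT ISCII values.
HALANT, SIGN_RI, ANUSVARA = 0x01, 0x02, 0x03   # dependent codes
SA, KA, TA = 0x10, 0x11, 0x12                  # consonants
CONSONANTS = {SA, KA, TA}

def akshara_spans(data):
    """Return (start, end) byte offsets of each akshara in data."""
    spans = []
    start = 0
    for i in range(1, len(data)):
        # A consonant opens a new akshara unless a halant joins it
        # to the consonant before it.
        if data[i] in CONSONANTS and data[i - 1] != HALANT:
            spans.append((start, i))
            start = i
    if data:
        spans.append((start, len(data)))
    return spans

# sa + anusvara | sa + halant + ka + vowel sign | ta
word = bytes([SA, ANUSVARA, SA, HALANT, KA, SIGN_RI, TA])
spans = akshara_spans(word)
print(len(word), len(spans))   # 7 bytes, 3 aksharas
print(spans)                   # [(0, 2), (2, 6), (6, 7)]

# With one fixed-width code per akshara, indexing is direct.
fixed_width = [word[s:e] for s, e in spans]
second_akshara = fixed_width[1]
```

Finding, counting or comparing aksharas in the byte string requires a scan of this kind every time, whereas a code that assigns one unit per akshara allows ordinary array operations.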

The official languages of the country are written using the scripts associated with each language. A script denotes the writing system employed by a language to represent the sounds which form its phonetic base. Currently, the following language-specific scripts are considered essential.

Devanagari, Gurmukhi, Bengali, Gujarati, Oriya, Telugu, Tamil, Kannada and Malayalam. The scripts for Urdu and Sindhi should be added to this list, though Devanagari is often used for writing Sindhi.

Languages of India

The number of different languages spoken in India runs into the hundreds. Not all of these languages have associated writing systems. By and large, each language of India uses a script identified with it, though some languages are known to have used two or more scripts.

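The terms language and script are often used interchangeably. The official languages of the country use one of the scripts listed earlier: Devanagari, Gurmukhi, Bengali, Gujarati, Oriya, Telugu, Tamil, Kannada and Malayalam, together with the scripts for Urdu and Sindhi.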


The following scripts were in use in earlier times.

Sharada (Kashmir)
Modi (Marathi)
Grantha (Sanskrit-Tamil)

Current belief is that all the above scripts evolved from Brahmi, the script used by Emperor Ashoka to engrave text on rocks (rock edicts) and pillars.

Interestingly, a proper understanding of the Brahmi script was achieved only in the first half of the 19th century, in the 1830s, by James Prinsep. The clue to decipherment came from a bilingual coin with Brahmi and Greek letters.

Many Western scholars hold that Brahmi was based on early Semitic writing systems.

