Computing with Tamil

  This section discusses "computing with Tamil", using the IITM Software to illustrate some important aspects of electronic processing of text in different scripts, specifically Tamil. Computing with Tamil relates to the design and development of useful applications which permit interaction with the system in Tamil. These may be standalone applications running on different platforms or web-based applications such as search engines, instant messaging or chat.

    Among the Indian languages, Tamil perhaps has the simplest set of aksharas, consisting of twelve vowels and eighteen consonants. However, six aksharas from Sanskrit have also become part of the set. Strictly speaking, the term "phonetic language" may not be applicable to Tamil, for the sound associated with a consonant varies with its position in a word. Grammar rules are quite specific about this but, for most people in India whose mother tongue is not Tamil, reading out text in Tamil will pose some initial difficulties. More information on this will follow in related sections.

About existing standards for Tamil Computing

   For many years now, text in Tamil has been displayed on the web, thanks to the magazines which have gone on-line. Each publication has standardized its own approach to displaying the text through designated fonts, and there are quite a few of them. Besides the magazines, independent groups have proposed data entry schemes to be used with specific fonts, which in effect amount to some sort of character coding. The multiplicity of fonts has posed real problems in arriving at some uniformity in text display. This was the main theme of discussions during the Tamilnet99 conference held in Chennai during February 1999.

   At this conference, it was proposed that the placement of glyphs within a Tamil font would follow a recommended scheme. Both bilingual and monolingual schemes were standardized. In the bilingual scheme, the initial 96 character positions are retained for Roman letters while the top 128 are allocated for Tamil. In the monolingual scheme, all the glyphs are used for Tamil. Details of the standards and related documents should be available at the Tamilnet99 web site link provided on the right.
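
   The bilingual split described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: it takes "the initial 96 character positions" to mean the printable slots 32 to 127 of an 8-bit font and "the top 128" to mean slots 128 to 255, and the names slot_kind and split_runs are hypothetical. It says nothing about which Tamil glyph sits in which slot; for that, the Tamilnet99 documents remain the reference.

```python
from itertools import groupby


def slot_kind(code):
    """Classify one 8-bit font slot under the assumed bilingual layout."""
    if not 0 <= code <= 255:
        raise ValueError("an 8-bit font has slots 0..255 only")
    if code >= 128:
        return "tamil"    # assumption: the "top 128" slots, 128..255
    if code >= 32:
        return "roman"    # assumption: the "initial 96" slots, 32..127
    return "control"      # slots 0..31 carry no glyphs


def split_runs(data):
    """Split font-encoded bytes into consecutive Roman and Tamil runs."""
    # Iterating over a bytes object yields integers, which slot_kind accepts.
    return [(kind, bytes(group)) for kind, group in groupby(data, key=slot_kind)]
```

   A display routine could use such runs to decide, for each stretch of a mixed line, whether the bytes are to be read as Roman letters or as Tamil glyph indices.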

    The conference also arrived at a standard for data entry in Tamil. Three different keyboard layouts were defined for use by different sections of users. The first is what is termed the phonetic keyboard, where data entry is effected through lower case keys alone for the basic text. The second scheme, referred to as the Romanized keyboard, specifies data entry based on the Roman letter that comes close to the sound of the vowel or consonant. The third is the layout seen in standard Tamil typewriters. Details of the keyboard layouts and some software that supports the schemes are included in the pages at the Tamilnet99 web site.

   The conferences held in 2000 and 2001 (links available on the right) do not seem to have led to significant additional recommendations. The proliferation of fonts seems to continue, as does that of specific encoding recommendations. Most of the efforts seem to relate to the Win9X platforms. The new TISCII recommendation seems to be gaining some ground, as seen from the increasing reference to it on the web.

Unicode for Tamil

   Unicode has become a world standard and many computer applications have provided Unicode support so that multilingual text can be handled. The Unicode standard proposed for Tamil has not taken into consideration some of the important linguistic issues. Also, even among the professionals, there seems to be considerable difference of opinion in respect of the adequacy of Unicode. 

   At the Systems Development Laboratory, the view held is that Unicode is not really suited for text processing in Indian languages though the data entry and display requirements could be handled with the current Unicode assignments for Indian languages. There are differing views about the suitability of Unicode, even in respect of Tamil. The specific issues have been addressed in a separate section on Unicode with specific reference to Tamil.

The relevance of the IITM Software

    The IITM software, on account of its flexible approach to computing with Indian languages, was able to support the requirements specified in the Tamilnet99 standards. The multilingual editor conforming to the Tamilnet99 standards was easily developed at the lab and has been made available for general use. Please follow the appropriate links on the right.

    The real power of the IITM software becomes apparent in applications that require linguistic analysis of text in Tamil. The links below refer to many applications that have been developed at IIT Madras for document preparation and linguistic processing with Tamil.

Multilingual text editor

   A simple but very effective text editor conforming to the syllabic requirements for linguistic processing. This editor supports display with a variety of fonts, including Tamilnet99, Anjal, Murasu, Mylai, TISCII and more. It also provides variations in internal storage for letters which have the same shape but differing sounds. The data entry scheme for the editor is flexible, and the Tamilnet99 standard for data entry is fully supported. Look up the features in the section on Multilingual Editor (linked above).

  A text-to-speech enhanced version of the editor has been provided for the benefit of the Visually Handicapped. Text prepared with the editor may be pasted into Word, Outlook Express and other Windows applications. The editor is distributed free of charge, and executable binaries for Microsoft Windows as well as Linux may be downloaded from this web site. The link above provides additional details.

Letter and word Frequency Count programs

  The set of utilities developed for linguistic processing in different Indian languages includes a program for computing the frequencies of occurrences of vowels, consonants and their combinations in any given text. Essentially the program does a count of the syllables and tabulates the results in a useful manner. The link above will also take you to the results of frequency counts of the aksharas in Tirukkural and Sambhandar Tevaram. The results have much to reveal. 
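
  The kind of count described above can be sketched in a few lines of Python. This is not the IITM program; it is a minimal illustration which assumes that the text is available as Unicode, and the names split_aksharas, akshara_class and frequency_table are hypothetical. An akshara is taken here to be a vowel, the aytham, or a consonant together with the vowel sign or pulli that follows it. Conjunct forms borrowed from Sanskrit are simply split into their parts, which a real utility would probably treat differently.

```python
import unicodedata
from collections import Counter

VOWELS = set("அஆஇஈஉஊஎஏஐஒஓஔ")
# The eighteen consonants, followed by four commonly used Grantha letters.
CONSONANTS = set("கஙசஞடணதநபமயரலவழளறன" "ஜஷஸஹ")
AYTHAM = "\u0b83"
PULLI = "\u0bcd"   # the dot that marks a pure consonant
VOWEL_SIGNS = set("\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc")


def split_aksharas(text):
    """Split Unicode Tamil text into aksharas; all other characters are skipped."""
    # NFC joins two-part vowel signs that were typed as separate code points.
    text = unicodedata.normalize("NFC", text)
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS and (nxt in VOWEL_SIGNS or nxt == PULLI):
            units.append(ch + nxt)   # consonant plus its vowel sign or pulli
            i += 2
        else:
            if ch in CONSONANTS or ch in VOWELS or ch == AYTHAM:
                units.append(ch)     # vowel, aytham, or consonant with inherent "a"
            i += 1
    return units


def akshara_class(unit):
    """Name the class an akshara belongs to."""
    if unit[0] in VOWELS:
        return "vowel"
    if unit == AYTHAM:
        return "aytham"
    if unit.endswith(PULLI):
        return "consonant"
    return "consonant+vowel"


def frequency_table(text):
    """Return two counters: one per akshara and one per akshara class."""
    units = split_aksharas(text)
    return Counter(units), Counter(akshara_class(u) for u in units)
```

  Calling frequency_table on the text of a work returns the count of each akshara and the totals for vowels, pure consonants and consonant-vowel combinations, which can then be tabulated as required.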

Sorting utilities

  Lexical ordering according to the specified order of the letters of Tamil has been a major issue and this specific problem has not been given sufficient attention by those developing standards for Tamil. The Unicode assignment for Tamil is a hopelessly mangled set of the letters but the claim is that Unicode does not purport to preserve lexical ordering!

   The IITM Software preserves the lexical ordering, and utilities are available for sorting and indexing text prepared using the Multilingual editor. It is also possible to write utilities in PERL to effect text processing and regular expression matching with text in Indian languages. The section on PERL modules for Indian languages has additional information.

   There is also the fundamental question about what constitutes a proper sorting order for Tamil. This question can be answered only after the full set of aksharas and special symbols required for regular use are correctly identified and codes assigned for the set. We have a separate section discussing the set of Tamil characters that would adequately represent and meet linguistic processing requirements of Tamil.
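
   The difference between code order and lexical order can be seen with a short Python sketch. It is an illustration only and not the method used in the IITM Software. It assumes Unicode text and takes the traditional sequence of the twelve vowels, the aytham and the eighteen consonants as the desired order; the placement of the Grantha letters, the vowel signs and the pulli after them is an assumption made for the sketch, which is exactly the kind of choice the question above leaves open. The names TRADITIONAL_ORDER and tamil_sort_key are hypothetical.

```python
import unicodedata

TRADITIONAL_ORDER = (
    "அஆஇஈஉஊஎஏஐஒஓஔ"          # twelve vowels
    "\u0b83"                    # aytham
    "கஙசஞடணதநபமயரலவழளறன"    # eighteen consonants in the traditional sequence
    "ஜஷஸஹ"                     # Grantha letters (placement is an assumption)
    "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"  # vowel signs
    "\u0bcd"                    # pulli (placement is an assumption)
)
RANK = {ch: i for i, ch in enumerate(TRADITIONAL_ORDER)}


def tamil_sort_key(word):
    """Map a word to a list of ranks so that sorted() follows the table above."""
    word = unicodedata.normalize("NFC", word)
    # Characters outside the table keep their code order, after all listed ones.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


letters = ["ன", "ப", "ற", "ல", "ள", "ழ", "வ"]
by_code = sorted(letters)                      # plain Unicode code order
by_table = sorted(letters, key=tamil_sort_key) # order given by the table
```

   For these seven letters, plain code order leaves the list as it is, with ன first, while the table-driven key places ப first and ன last, as in the traditional sequence. A complete solution would work on whole aksharas rather than on individual codes.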

Email and chat

  Sending and receiving email with text in Tamil or handling chat has been greatly simplified. All that is required is a cut and paste into the application, Outlook Express, Instant messenger or similar ones, from text entered into the Multilingual Editor. Using the Multilingual Editor, email and chat are just one simple paste operation. The link above discusses the principles and also explains how you can send the required text as an attachment as well so that email in Tamil can be sent and received on Linux systems too.

Tirukkural - On-line reference

  A comprehensive on-line reference for Tirukkural permitting the text of Kural to be viewed from virtually any graphics enabled browser has been included at this site. No fonts of any kind will be required. The pages also offer the provision to search for words in the text of Kural, where the search word may be directly entered into the web page in Tamil. A wordlist consisting of all the words in the text of Kural is also presented with reference information on the couplet containing the word. This presentation is unique on the web. We do not know of any other site that offers a service close to what is provided here (as on May 2006). 

Search engines which can accept query strings in Tamil

  An example of a web based application searching for words in the text of Tirukkural, Tolkappiam and other works in Tamil. This is a Java based web interface and allows data entry of text strings in Tamil. This is a unique presentation. 

Text to Speech generation

  A look at the approach taken by IIT Madras to synthesize speech in Tamil as well as other Indian Languages. The results are extremely satisfying, despite a robotic flavour to the synthesized output. The speech enhanced Multilingual Editor and other applications have found acceptance by the Visually Handicapped community in Tamilnadu as well as other states in India. We believe this to be the VERY FIRST demonstration of continuous text to speech generation on the web in any Indian language. If you are intrigued, we invite you to see and hear the output in the on-line demo.

Data Base applications

Using PERL to work with Tamil

Tamil script through the ages


Linguistic aspects of  Tamil


Concept Paper on Tamil Computing
Tamil Internet 2001
Tamil Internet 2002
Fonts for Tamil
Tamil on the Web
Desirable set of Tamil letters
  Inscription in Early Brahmi script inside a cave situated in Tamilnadu, South India. The text includes the word "satiaputo" which stands for Emperor Ashoka whose emissaries spread Buddhism in the South and Sri Lanka. The letter "sa" is not seen in Tamil and so the inscription must have been effected by persons who knew Sanskrit as well as Tamil.

Image graciously offered for reproduction in this page by Sri. Iravatham Mahadevan.

