Computing with Tamil
A tutorial on the use of the vernacular in designing user interfaces.
Foreword
Effective dissemination of knowledge and information is the key to the well-being of any society. The assimilation of knowledge, be it for leading a healthy life or for furthering the cause of education, starts at school and continues through other means, including self-learning. The computer has come to be accepted as an effective tool for collecting, processing and disseminating information in all walks of human life. There is an emerging need to utilize this tool on a large scale to bring the benefits of Information Technology to the people, as it can play a vital role in basic education and public administration. The only deterrent to implementing this has been the requirement that learning to use the computer calls for a knowledge of English. Over the years, many successful attempts have been made to provide computers with user interfaces in Tamil and other Indian languages. Most of the solutions provided have established the feasibility of interaction with a computer in local languages and have led to the availability of quality word processing and publishing software. Yet a number of different applications that should form an essential part of the computing environment in Tamil have not been pursued seriously, on account of the technical challenges faced in dealing with local languages on computers.

This paper presents, in the form of a short tutorial, the important technical issues which require consideration before any IT solutions or applications supporting a user interface in Tamil can be implemented. Some recommendations are also made in respect of the standardization envisaged for the use of computers in Tamil.

Computing with Tamil: An overview.

The essential idea here is that people will be able to use computers to deal with information given in Tamil. This means that input and display are both supported in Tamil and will not require the use of English. Thus the problem, if approached in a manner which does not require the knowledge of English on the part of the user, assumes significance in our country.

Any system of computing has to meet the basic requirement that it be able to accept, process and display text based information. Ultimately, it is text that relates directly to human communication, and hence a computer supporting Tamil based interfaces should, at the least, be able to handle data entry and display in Tamil, if not text processing. Displaying and printing text makes few demands on electronic information processing, since very little is done to the entered data except formatting. The techniques available for doing this with English may be easily adapted to handle Tamil. True computing, however, involves the collection of varied items of information which are pooled together and processed to produce results based on the user requirements, such as a search for information in a database. Applications involving records in public institutions are examples of text processing on a large scale.

A minimal computing environment should therefore support the concept of internal representation of information in Tamil in a meaningful way as to permit a computer program to interpret the information and process the same. Thus there are three essential components to a computing system in Tamil.

1. Data entry of text in Tamil.
2. Display or printing of text in Tamil.
3. Processing of the text to suit user requirements.
Approaches to dealing with all three have been standardized for the western languages through the concept of character codes and fonts. This standardization has allowed English to be handled in a uniform manner on all computers which use the standard ASCII code for text processing.

It turns out that data entry, display or printing of information in Tamil can be effected using the standardization accepted for English, by designing suitable fonts for the display of Tamil letters and using conventional word processing software for data entry. This way, word processing and Desk Top Publishing programs are fooled into believing that they are handling Roman text (English). Thus, with a suitable font and the use of the macro facilities of word processors, which allow a sequence of keystrokes to be interpreted together, data entry and printing in Tamil may be comfortably handled. Many commercial packages take this approach to providing DTP facilities in Tamil. Unfortunately, such solutions cannot provide meaningful support for the real text processing requirements in Tamil, which have to be handled from a linguistic or language related point of view.

To perform any meaningful text processing in Tamil (or for that matter any Indian language) one has to work with the syllabic structure of the language and not merely with the vowels and consonants. This is a fairly complex issue as it involves individual identification of hundreds of characters which result from the combinations of the basic vowels and consonants. The phonetic aspect of Tamil makes it necessary for one to relate the written form to the linguistic units, and therein lies the basic difficulty in deriving a suitable internal representation for the text.

A number of available approaches to dealing with Tamil on the computer have not looked at the text processing requirement with much seriousness, and have concentrated primarily on data entry and printing only. As a consequence, one sees a multitude of fonts for Tamil, each with its own accompanying data entry software. One need not be alarmed at this as most of these offer working solutions capable of producing quality printouts or web displays on the internet. The need to look at standardization arises only when one considers the issue of text processing. While such processing is in principle provided for with the internal representations used with fonts, it is nevertheless a complicated task as it bears no relationship to the linguistic aspects of Tamil.

The efforts at the Centre for Development of Advanced Computing (CDAC) have resulted in some standardization in respect of coding and the internal representation of text in Indian languages. This approach, which became popular through the GIST technology, has fairly severe limitations in practical use, especially for Tamil. The CDAC approach does, however, allow a number of DOS based applications to work with Indian scripts. There are many difficulties in using the system because of the restriction on the platform and the hardware approach of using the GIST card. Yet one must accept that the ISCII coding scheme used by the software does allow some computing in Indian languages.

To summarize, computing with Tamil should address the problem of text processing. In the western world, the writing systems are based on a small set of symbols and letter shapes. The software caters to the standardized representations of the text through the ASCII code. The letters of Tamil do not lend themselves to such interpretations, for they represent sounds at the level of a syllable derived from a basic set of vowels and consonants. Each syllable is unique from a linguistic point of view and any effort at standardization or electronic representation of text should necessarily conform to linguistic requirements.

Some Details on the characters used in Tamil.

There are twelve basic vowels and eighteen consonants in Tamil. The consonants are divided into three groups. The "Ayda" letter is often viewed as the thirteenth vowel. Each consonant in its intrinsic form is given without any vowel. Tamil primers often teach the consonants with the vowel "a", since it is easier to learn the sound associated with a consonant this way. Hence the "Ayda" letter is also imagined to be a vowel which, when combined with a consonant, results in its intrinsic form. Thus there are 13+18+(18*13) different letters or aksharas which are significant linguistically. Tamil has its own representation for the numerals, and there are other special symbols to represent the date, month and the year. Given below is the set of symbols which should be considered basic to the script.

Added to the eighteen consonants are six consonants from Sanskrit which are used fairly regularly these days, though one does not see them in ancient Tamil texts. The thirteen combinations of each of these with the basic vowels are also to be represented. In writing, a consonant-vowel combination is not always formed by the same rule of adding a vowel extension: the rules do not apply uniformly for all the vowels, and special ligatures are used to represent some consonant-vowel combinations.
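
The count 13+18+(18*13) given above can be checked by enumeration. The following is only a minimal sketch in Python. It assumes the letters are spelled with code points from the Unicode Tamil block, which this paper does not prescribe, and the names `VOWELS`, `CONSONANTS`, `VOWEL_SIGNS`, `PULLI` and `aksharas` are introduced here purely for illustration. Following the text, the Ayda combination of a consonant is taken to be its intrinsic form.

```python
# Illustrative tables built from Unicode code points (an assumption of this sketch).
VOWELS = [chr(c) for c in (0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,
                           0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94)]
AYDA = "\u0b83"
CONSONANTS = [chr(c) for c in (0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
                               0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
                               0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9)]
# Vowel extensions in the same order as VOWELS; the vowel "a" needs none.
VOWEL_SIGNS = ["", "\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2",
               "\u0bc6", "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]
PULLI = "\u0bcd"  # dot that marks the intrinsic (vowel-less) form


def aksharas():
    """Enumerate the letters in the order of the formula 13 + 18 + (18 * 13)."""
    letters = VOWELS + [AYDA]                             # 13
    letters += [c + PULLI for c in CONSONANTS]            # 18 intrinsic forms
    for c in CONSONANTS:                                  # 18 * 13 combinations
        letters += [c + sign for sign in VOWEL_SIGNS]     # with the 12 vowels
        letters.append(c + PULLI)                         # with the Ayda letter
    return letters


letters = aksharas()
print(len(letters), len(set(letters)))  # 265 247
```

Under this reading the eighteen intrinsic forms are counted twice, once as consonants and once as Ayda combinations, so the formula yields 265 entries but 247 distinct written forms, which is the figure traditionally quoted for the Tamil alphabet.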

Though punctuation marks are rarely seen in older texts, there is a need to use some of the special symbols and punctuation marks if we have to use the language for preparing educational material and scientific text.

Though Tamil comes under the general category of phonetic languages, the pronunciation of some intermediate letters in words is context dependent and from a linguistic angle quite important. Hence some mechanism to distinguish the different sounds produced by the same letter will be helpful in the preparation of educational material for teaching Tamil.

Lastly, the writing system for Tamil has evolved over a number of centuries and it is necessary to display ancient Tamil text in the same scripts used in old manuscripts and stone inscriptions.

A brief overview of the method of using Tamil with Fonts.

Many of the solutions provided in the past for dealing with Tamil on computers have relied on the use of special fonts. A font is essentially a set of basic shapes which may be shown individually or in combinations to display the letters of a script. Eight bit fonts are the most commonly used fonts with Indian scripts. Seen below are the basic shapes (called glyphs) which permit Tamil text to be displayed on a computer screen, as present in the font known as Indo-Web-Kambar.

 To understand how fonts are used, consider the string shown below, which is built up from individual glyphs.

There are 10 shapes here (each is known as a glyph) but only seven of them are distinct. The glyphs within a font are arranged in some order and inside the computer, the shapes to be displayed are given in terms of the positions occupied by the glyphs in the fonts. Typically a font will support up to about 200 glyphs. In the Kambar font used in displaying the Tamil string above, the glyphs used in the string are (234, 235, 168, 200, 208, 226, 235, 232, 200, 168) in the order in which they will be displayed. It may be noted that the string has only 5 Tamil letters though 10 glyphs are used, 7 being distinct. Thus going by the number of Glyphs alone, one will not be able to figure out the number of letters in the string. In other words, linguistic analysis of text will be cumbersome if font based representation is chosen. 
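
The figures quoted above can be verified directly from the glyph positions. The snippet below is a small illustrative check in Python; the variable names are chosen here and are not part of any font software.

```python
from collections import Counter

# Glyph positions quoted in the text for the string shown in the Kambar font.
glyph_codes = [234, 235, 168, 200, 208, 226, 235, 232, 200, 168]

counts = Counter(glyph_codes)
total_glyphs = len(glyph_codes)       # shapes drawn on the screen
distinct_glyphs = len(counts)         # different shapes among them
repeated = sorted(code for code, n in counts.items() if n > 1)

print(total_glyphs, distinct_glyphs, repeated)  # 10 7 [168, 200, 235]
```

Neither number equals 5, the number of Tamil letters in the string. Recovering the letter count from such a list would need a table, specific to this one font, saying which glyphs belong together.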

The method most often used for representing information in Tamil is based on glyph codes for the letters. This works all right for a given font, and to some extent string processing may also be attempted. However, the dependence of the codes on the font is a major deterrent to using this approach. The glyphs required to display a string are generated when the keystrokes are effected. For this, the keys on the ASCII keyboard are mapped to the Tamil letter whose glyph is specified through the ASCII code for the letter. Some combinations will require two keystrokes, but this is all right since the second keystroke will correspond to a vowel extension.

In spite of its simplicity, this method is quite painful in practice, for it is not always easy to remember the mapping between the Tamil letter and the key. Worse still, some combinations require three keystrokes whereas other combinations may be handled through just one keystroke. An example of this is the difference between keyboard entries for "ti" and "tu". Depending on the font design, the number of glyphs, and therefore the number of keystrokes, will vary. This is not a useful approach for general acceptance.

It is thus apparent that Tamil cannot be typed in just as English is. Somehow the combinations have to be handled in a font independent manner, with a uniform number of keystrokes for all combinations. Some word processors provide special support by tracking the keystrokes and combining them appropriately. Packages from CDAC or word processors such as SRILIPI use this method. Here too, the sequence is decided by the glyph mappings. The CDAC software uses a special mapping known as INSCRIPT, but this is not intuitive for those familiar with English. With most software, one has very little choice of fonts since the data entry method is in some way tied to the fonts used. No two font designers agree on the glyphs or their placement within the font. Worse still, fonts are not always compatible across computer systems.

Looking at the problem of keyboard mapping itself, the key to be pressed for a specific letter is fixed by the glyph code. This key may have no phonetic equivalence with the Tamil letter. In fact for many fonts where the designer had intended bilingual use (Roman and Tamil), the glyphs for Tamil are located in the 128-255 range making data entry even more difficult, unless the MACRO features are invoked.

One solution recommended by some designers has been to place the glyphs of the Tamil letters at positions corresponding to the ASCII code of the phonetically equivalent letter in Roman. This makes data entry a bit more intuitive but here too variations occur when dealing with consonant vowel combinations which change the basic shape of the consonant.

In the scheme followed at CDAC, internal storage is not in terms of glyph codes but corresponds to the ISCII scheme. The special word processor transforms the keystrokes into appropriate letters and an output module converts the ISCII based internal representation into glyph codes. This method has the advantage that it applies to all the Indian languages. But the scheme itself suffers from some language specific representations.
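
The separation between an internal representation and an output module that produces glyph codes can be sketched as below. This is not the CDAC software or the ISCII code: the romanized syllable labels and every glyph number are hypothetical, chosen only to show a table-driven conversion in which a vowel extension may be drawn before the consonant, after it, or merged with it into a single shape.

```python
# All glyph numbers below are hypothetical slots in an imaginary 8-bit font.
BASE_GLYPH = {"k": 200, "t": 210}     # consonant shapes (also used for "ka", "ta")
PRE_SIGN = {"e": 180}                 # extension drawn before the consonant
POST_SIGN = {"aa": 181, "i": 182}     # extension drawn after the consonant
LIGATURE = {("t", "u"): 230}          # combination with a shape of its own


def to_glyphs(syllables):
    """Convert (consonant, vowel) pairs into a flat list of glyph codes."""
    glyphs = []
    for consonant, vowel in syllables:
        if (consonant, vowel) in LIGATURE:
            glyphs.append(LIGATURE[(consonant, vowel)])
        elif vowel in PRE_SIGN:
            glyphs += [PRE_SIGN[vowel], BASE_GLYPH[consonant]]
        elif vowel in POST_SIGN:
            glyphs += [BASE_GLYPH[consonant], POST_SIGN[vowel]]
        elif vowel == "a":
            glyphs.append(BASE_GLYPH[consonant])
        else:
            raise ValueError(f"no glyph rule for {consonant}+{vowel}")
    return glyphs


text = [("k", "e"), ("t", "u"), ("t", "i"), ("k", "a")]
print(len(text), to_glyphs(text))  # 4 [180, 200, 230, 210, 182, 200]
```

Only the tables depend on the font. A second font with different glyph positions would need different tables but the same stored text.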

The ISCII code is an eight bit code that codes only the basic consonants and vowels of the language. Consequently it requires more than one byte to represent a combination though a consonant or a vowel by itself requires only one byte. Thus ISCII also amounts to a multibyte variable length code making the font rendering mechanism quite complex. Nevertheless ISCII is a code that represents basic sounds and hence is quite useful in practice. In implementation however, ISCII has run into some problems for South Indian scripts, especially Tamil.
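
The effect of a variable length representation on string processing can be seen with any code that stores a combination as a consonant followed by a vowel extension. The sketch below uses the Unicode Tamil block as a stand-in, since its layout follows the same consonant-plus-sign principle; it is not an ISCII implementation, and the function name `syllables` is chosen here for illustration.

```python
import unicodedata


def syllables(text):
    """Group each base letter with the combining signs that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch   # vowel sign or pulli: extend the current unit
        else:
            units.append(ch)  # base vowel or consonant: start a new unit
    return units


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil": ta, mi, zh
print(len(word), len(syllables(word)))   # 5 3
```

The stored length (5 codes) differs from the number of syllables (3), so even counting letters requires a pass over the string. With a fixed length code per syllable the two numbers would coincide.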

Basic Requirements for coding schemes.

1. Codes must necessarily correspond to syllables which form the linguistic base for the language. Also the coding must use fixed length codes, even if multibyte. This is the best way to handle complex string processing issues consistent with the phonetic nature of our languages.

2. It will be helpful if glyph positioning within a font for Tamil has some relationship with the internally assigned codes. Such a provision will help in string processing.

3. Codes assigned must conform to the dictionary sorting order for the letters of the language.

4. Codes must also be assigned for the special symbols used in the language, numerals included.

5. Codes should identify the consonant and vowel forming the syllable. This will help in linguistic analysis as well as conversion to other formats. This will also help in speech synthesis.

6. Coding should take into account syllables found in other languages of India so that transliteration into Tamil is easily effected. This will help teach other languages through Tamil.

7. The assignment of codes may also take into consideration the numeric values traditionally assigned to the letters of Indian languages.

8. Codes should have no relationship to the glyphs used to display the letter. This is essential to make sure that the internal representation is independent of the font rendering process. Only then do we have the possibility of using the software on different platforms. This recommendation does not really run contrary to the observation made earlier that glyph positioning may be influenced by the internal representation.
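
Requirements 1, 3 and 5 above can be met together by a code that packs a consonant index and a vowel index into one fixed length value. What follows is a hypothetical layout sketched in Python, not a scheme proposed in this paper: the 16 bit split, the romanized labels and the function names are all assumptions made for illustration. The sort order of the codes simply follows the order in which the tables are written.

```python
# Hypothetical 16-bit syllable code: high byte = consonant, low byte = vowel.
# Index 0 means "absent"; the romanized labels are for illustration only.
VOWELS = ["a", "aa", "i", "ii", "u", "uu", "e", "ee", "ai", "o", "oo", "au"]
CONSONANTS = ["k", "ng", "c", "nj", "T", "N", "t", "n", "p",
              "m", "y", "r", "l", "v", "zh", "L", "R", "n2"]


def encode(consonant=None, vowel=None):
    """Return one fixed-length code for a vowel, a consonant, or a combination."""
    c = CONSONANTS.index(consonant) + 1 if consonant else 0
    v = VOWELS.index(vowel) + 1 if vowel else 0
    if c == 0 and v == 0:
        raise ValueError("a syllable needs a consonant, a vowel, or both")
    return (c << 8) | v


def decode(code):
    """Recover the (consonant, vowel) pair; None marks an absent part."""
    c, v = code >> 8, code & 0xFF
    return (CONSONANTS[c - 1] if c else None, VOWELS[v - 1] if v else None)


word = [encode("t", "a"), encode("m", "i"), encode("zh")]
print(len(word))                    # 3
print([decode(x) for x in word])    # [('t', 'a'), ('m', 'i'), ('zh', None)]
```

Here the length of the stored word equals its number of syllables, and the consonant and vowel of each syllable are recovered by a shift and a mask, with no reference to any font.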

Recommendations for the design of Fonts.

1. It will be useful to fix the glyphs in the displayable range for ASCII since data entry can then be effected from word processors. This is feasible for Tamil. This way editing can be done on the text string using conventional word processors. Even if placed in the 128-255 range, the glyphs may be located at positions corresponding to the equivalent ASCII in the lower range.

2. It is a good idea to offer a minimum set of glyphs for each script and fix their locations as well. This will make it easier to view the same text in different fonts so long as the text does not incorporate special symbols.

3. Font rendering methods vary across systems and it is a good idea to build fonts with standard encodings so that they may be rendered on different platforms using the same glyph codes. This specification is important for web based applications. The default encoding preferred for Tamil is the ISO 8859-1 (Latin-1) encoding, which is well supported on all the three platforms. Also, the minimal set of glyphs may be used on all the fonts with some advantage.

4. It is a good idea to include punctuation and special symbols in the glyphs for the language. This way it will be possible to treat special symbols as part of the language itself. Punctuation should include the period, the comma, exclamation mark and the question mark.

5. It will be helpful to provide some special symbols used in ancient manuscripts of Tamil (e.g., shapes for some double consonants).

6. It is inevitable that variable width characters be designed. In the glyphs, it is recommended that the width be made a multiple of some basic unit (say 2 pixels). This will help retain vertical alignment in the text without having to resort to special formatting. One glyph may be retained as a special space whose width is an odd number of units. This glyph will be useful in retaining the alignment.

Recommendations on the use of the keyboard.

It turns out that keymappings may be assigned arbitrarily and the processing software can do the required mapping to the internal codes. Though in earlier sections we hinted at mappings that bore some relationship to the font glyphs, it is useful to look at keyboard mappings from the user point of view rather than programming convenience. The following are some of the recommendations.

1. Data entry should be natural and must relate to the letters of Tamil. It should be easy enough to train persons in the use of the standard QWERTY keyboard which is what one will see on all computers.

2. As far as possible, use only the common keys to map the letters. Not all the keyboards will have all the keys seen in the PC keyboards. 

3. Standardize the input method for entering a consonant vowel combination.

4. The manual typewriter keyboard (for Tamil) is a choice that should not be ignored. It is an existing standard and many have been trained in it. This keyboard is entirely adequate for modern Tamil writing. The processing software can always use a keyboard filter to transform the sequence of keystrokes into appropriate letters. This may be accomplished through the use of Macros in most word processors.

5. Do not fix or relate keymapping to any fonts though this is an approach that may allow virtually any wordprocessor to handle data entry in Tamil. What is desirable is that the internal representation be exportable to some word processors so that the powerful formatting facilities seen in them may be utilized.

6. Keyboard mappings should also not relate to the codes assigned for the characters. This is to ensure that the internal representation is independent of the system in which we are processing the text. It is always easier to work with standard codes that do not directly relate to any hardware specific aspect of a computer. Tables can be used to relate the internal representation to the glyphs or keyboard mappings, thus allowing great flexibility in dealing with the input and the display.

7. Keyboard mappings arrived at on the basis of studies of the observed frequencies of the letters in normal Tamil writing are no doubt helpful. It turns out, however, that writing styles vary so much that the frequencies seen in ancient texts are quite different from the frequencies in modern texts. It is therefore preferable to look at the thinking process as one types and assign the keymappings based on user recommendations.
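
The keyboard filter mentioned in the recommendations above can be sketched as a small table-driven routine. The key assignments below are hypothetical and cover only a few letters, the output uses Unicode code points as an assumed internal form, and the name `keyboard_filter` is introduced here for illustration. The point of the sketch is that every consonant-vowel combination takes the same two keystrokes, whatever shapes a font later uses to draw it.

```python
# Hypothetical key tables; changing them does not affect the filter logic.
CONSONANT_KEYS = {"k": "\u0b95", "t": "\u0ba4", "p": "\u0baa", "m": "\u0bae"}
# Each vowel key gives (independent vowel, extension used after a consonant).
VOWEL_KEYS = {"a": ("\u0b85", ""), "i": ("\u0b87", "\u0bbf"), "u": ("\u0b89", "\u0bc1")}
PULLI = "\u0bcd"  # marks a consonant that carries no vowel


def keyboard_filter(keys):
    """Turn a sequence of keystrokes into a list of syllables."""
    out = []
    pending = None  # consonant still waiting for its vowel
    for key in keys:
        if key in CONSONANT_KEYS:
            if pending is not None:
                out.append(pending + PULLI)   # previous consonant had no vowel
            pending = CONSONANT_KEYS[key]
        elif key in VOWEL_KEYS:
            independent, extension = VOWEL_KEYS[key]
            if pending is not None:
                out.append(pending + extension)
                pending = None
            else:
                out.append(independent)
        else:
            raise ValueError(f"unmapped key: {key!r}")
    if pending is not None:
        out.append(pending + PULLI)
    return out


print(len(keyboard_filter("ti")), len(keyboard_filter("tu")))  # 1 1
print(len(keyboard_filter("amma")))                            # 3
```

Both "ti" and "tu" are entered with two keystrokes and each yields one syllable. A typewriter style layout could be supported by replacing the two key tables while leaving the routine unchanged.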

A brief on the applications to be supported.

Computing in Tamil, or with Tamil has to necessarily provide support for some basic applications that will help bring information technology closer to the people. While one might wish to have virtually every popular application run in Tamil, we need to view some applications as being important to begin with. Given below is a list of applications that should be made available in the vernacular at the earliest. Whatever standardization is contemplated in respect of representing Tamil inside the computer, the recommendations should keep these applications in mind and offer viable technical solutions.

1. Data preparation applications.

 Data entry and printing applications which may include DTP, both for general use and commercial publishing. Large scale newspaper and magazine publishing as well as archiving.

 Display and dissemination of information through the web. The applications include tools for preparing HTML documents, search engines, archiving software, and applications for preserving ancient manuscripts in their original scripts. Generation of on-line references such as dictionaries which may be queried.

2. Text processing applications.

 Applications involving sorting, indexing and searching of large volumes of data.

 Database systems supporting interaction in Tamil. Report generation should be supported in Tamil as well. This will help maintain records in many Government, public and private institutions. The system should be supported on standard database packages such as Oracle, Access, Informix, etc.

3. Educational applications.

 Applications involving the teaching of Tamil to the people of other states through their mother tongue. Likewise, learning other languages through Tamil.

 Preparation of computer based lessons in Tamil to enhance classroom instruction. Lessons should be interactive as well as web based.

4. Linguistic applications.

 Analysis and study of the structure of languages.

 Study of ancient texts from a linguistic angle.

 Concordance generation.

 Applications catering to the analysis of sentences, morphology, word frequency computations, parsing, natural language processing, etc.

5. Email communications.

 Software to handle email in Tamil so that the benefits of this wonderful facility are fully utilized. 

6. Applications specific to the Government and Public Institutions.

 Maintaining records of text data bases, Police records, Historical data, Minutes of meetings etc., for quick and effective access.

7. Other consumer applications.

 Accounting and small database packages supporting user interfaces in Tamil.

 Client server applications supporting user interfaces in Tamil, to work with standard data bases.

Summary

Information Technology, in respect of the usage of Tamil on computers on a State-wide scale, should necessarily aim at reaching all the people so as to give them the benefits of this new technological wonder. This is not a simple matter that can be handled by designing fonts or standardizing keyboard mappings. No approach to user interfaces in Tamil can be based on software solutions that merely cater to data entry and printing, no matter how good the results are to look at. Such solutions will not provide a simple, easy and uniform way of communication that can be used by all people. It is necessary to look at the problem from the angle of information processing, for the long term growth of the Tamil language in electronic processing and for the socio-economic progress of the state. In a wider context, Tamilnadu should be able to share its expertise to provide IT solutions for the rest of the country in their respective languages. Software and business opportunities throughout the country will then come within the reach of the Tamilnadu software industry. We should not lose the opportunity to propose a viable approach to Information Technology in Indian languages.

This is a concept paper read at the Tamilnet99 conference held at Chennai in February 1999.


Inscription in Early Brahmi script inside a cave situated in Tamilnadu, South India. The text includes the word "satiaputo" which stands for Emperor Ashoka whose emissaries spread Buddhism in the South and Sri Lanka. The letter "sa" is not seen in Tamil and so the inscription must have been made by persons who knew Sanskrit as well as Tamil.

Image graciously offered for reproduction in this page by Sri. Iravatham Mahadevan.

Last updated on 10/30/12