There are 10 shapes
here (each is known as a glyph) but only seven of them are distinct. The
glyphs within a font are arranged in some order and inside the computer,
the shapes to be displayed are given in terms of the positions occupied
by the glyphs in the fonts. Typically a font will support up to about 200
glyphs. In the Kambar font used in displaying the Tamil string above, the
glyphs used in the string are (234, 235, 168, 200, 208, 226, 235, 232,
200, 168) in the order in which they will be displayed. It may be noted
that the string has only 5 Tamil letters though 10 glyphs are used, 7 being
distinct. Thus, going by the number of glyphs alone, one will not be able
to figure out the number of letters in the string. In other words, linguistic
analysis of text will be cumbersome if font based representation is chosen.
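The counts quoted above can be checked directly from the glyph codes given for the Kambar font. The following is a minimal sketch in Python; the variable names are illustrative, and the letter count is copied from the text because it cannot be derived from the glyph codes alone.

```python
# Glyph codes for the sample string, as quoted above for the Kambar font.
glyph_codes = [234, 235, 168, 200, 208, 226, 235, 232, 200, 168]

total_glyphs = len(glyph_codes)          # glyphs drawn on screen
distinct_glyphs = len(set(glyph_codes))  # different shapes actually used

# The number of Tamil letters is stated in the text, not computed here:
# the glyph codes carry no information about letter boundaries.
letters_in_string = 5

print(total_glyphs, distinct_glyphs, letters_in_string)  # prints: 10 7 5
```

The glyph list says nothing about where one letter ends and the next begins, which is why the letter count has to be supplied separately.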
The method most often used
for representing information in Tamil is based on glyph codes for the letters.
This works well enough for a given font, and to some extent string processing
may also be attempted. However, the dependence of the codes on the font
is a major deterrent to using this approach. The glyphs required to display
a string are generated as the keystrokes are made. For this, the
keys on the ASCII keyboard are mapped to the Tamil letter whose glyph
is specified through the ASCII code for the letter. Some combinations will
require two keystrokes, but this is acceptable since the second keystroke will
correspond to a vowel extension.
In spite of its simplicity,
this method is quite painful in practice, for it is not always easy to
remember the mapping between the Tamil letter and the key. Worse still,
some combinations require three keystrokes whereas other
combinations may be handled through just one keystroke. An example of this
is the difference between keyboard entries for "ti" and "tu". Depending
on the font design, the number of glyphs and therefore the number of keystrokes
will vary. Such an approach is unlikely to gain general acceptance.
It is thus apparent that
Tamil cannot be typed in just as English is. Somehow the combinations have
to be handled in a font independent manner, with the same number
of keystrokes for every combination. Some word processors provide special support
by tracking the keystrokes and combining them appropriately. Packages from
CDAC or word processors such as SRILIPI use this method. Here too, the
sequence is decided by the glyph mappings. The CDAC software uses
a special mapping known as INSCRIPT, but this is not intuitive
for those familiar with English. With most software, one has very little
choice of fonts since the data entry method is in some way tied to
the use of fonts. No two font designers agree on the glyphs or their placement
within the font. Worse still, fonts are not always compatible across computer
Looking at the problem of
keyboard mapping itself, the key to be pressed for a specific letter is
fixed by the glyph code. This key may have no phonetic equivalence with
the Tamil letter. In fact for many fonts where the designer had intended
bilingual use (Roman and Tamil), the glyphs for Tamil are located in the
128-255 range making data entry even more difficult, unless the MACRO features
One solution recommended
by some designers has been to place the glyphs of the Tamil letters at
positions corresponding to the ASCII code of the phonetically equivalent
letter in Roman. This makes data entry a bit more intuitive but here too
variations occur when dealing with consonant vowel combinations which change
the basic shape of the consonant.
In the scheme followed at
CDAC, internal storage is not in terms of glyph codes but corresponds to
the ISCII scheme. The special word processor transforms the keystrokes
into appropriate letters and an output module converts the ISCII based
internal representation into glyph codes. This method has the advantage
that it applies to all the Indian languages. But the scheme itself suffers
from some language specific representations.
The ISCII code is an eight
bit code that codes only the basic consonants and vowels of the language.
Consequently it requires more than one byte to represent a combination
though a consonant or a vowel by itself requires only one byte. Thus ISCII
also amounts to a multibyte variable length code making the font rendering
mechanism quite complex. Nevertheless ISCII is a code that represents basic
sounds and hence is quite useful in practice. In implementation however,
ISCII has run into some problems for South Indian scripts, especially Tamil.
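The variable length nature of such a code can be illustrated with a small sketch in Python. The byte ranges below are hypothetical and chosen only for illustration; they are not copied from the ISCII tables. The sketch assumes only what the text states: a consonant or vowel occupies one byte, and a combination needs an extra byte for the vowel extension.

```python
# Hypothetical byte ranges (NOT the real ISCII assignments).
BASE_LETTERS = range(0xA0, 0xD0)  # vowels and consonants: each starts a letter
VOWEL_SIGNS = range(0xD0, 0xE0)   # vowel extensions: attach to the previous byte

def split_letters(data):
    """Group a byte string into letters of one or more bytes each."""
    letters = []
    for b in data:
        if b in VOWEL_SIGNS and letters:
            letters[-1] += bytes([b])   # extend the preceding consonant
        else:
            letters.append(bytes([b]))  # start a new letter
    return letters

# Six bytes: consonant+sign, consonant, vowel, consonant+sign.
sample = bytes([0xB0, 0xD1, 0xB5, 0xA2, 0xB7, 0xD3])
letters = split_letters(sample)
print(len(sample), len(letters))  # prints: 6 4
```

Because the byte count and the letter count differ, any operation that works letter by letter has to scan the string first, which is the complexity the paragraph above refers to.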
Requirements for coding schemes.
1. Codes must necessarily
correspond to syllables which form the linguistic base for the language.
Also the coding must use fixed length codes, even if multibyte. This is
the best way to handle complex string processing issues consistent with
the phonetic nature of our languages.
2. It will be helpful if
glyph positioning within a font for Tamil has some relationship with the
internally assigned codes. Such a provision will help in string processing.
3. Codes assigned must conform
to the dictionary sorting order for the letters of the language.
4. Codes must also be assigned
for the special symbols used in the language, numerals included.
5. Codes should identify
the consonant and vowel forming the syllable. This will help in linguistic
analysis as well as conversion to other formats. This will also help in
6. Coding should take into
account syllables found in other languages of India so that transliteration
into Tamil is easily effected. This will help teach other languages through
7. The assignment of codes
may also take into consideration the numeric values traditionally assigned
to the letters of Indian languages.
8. Codes should have no relationship
to the glyphs used to display the letter. This is essential to make sure
that the internal representation is independent of the font rendering process.
Only then will we have the possibility of using the software on different platforms.
This recommendation does not really run contrary to the observation made
earlier that glyph positioning may be influenced by the internal representation.
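Several of the requirements above (fixed length codes, identifiable consonant and vowel, codes that follow sorting order) can be sketched together in Python. The layout below is an assumption made for illustration, not a proposed standard: each syllable gets one 16-bit code with the consonant slot in the high byte and the vowel slot in the low byte, slot 0 meaning "absent". Letters are written in a hypothetical romanization, listed in the usual order of the Tamil vowels and consonants.

```python
# Romanized stand-ins for the 12 vowels and 18 consonants (hypothetical spelling).
VOWELS = ["a", "aa", "i", "ii", "u", "uu", "e", "ee", "ai", "o", "oo", "au"]
CONSONANTS = ["k", "ng", "c", "nj", "T", "N", "t", "n", "p", "m",
              "y", "r", "l", "v", "zh", "L", "R", "n2"]

def encode(consonant, vowel):
    """Pack one syllable into a single fixed-length 16-bit code."""
    c = CONSONANTS.index(consonant) + 1 if consonant else 0
    v = VOWELS.index(vowel) + 1 if vowel else 0
    return (c << 8) | v

def decode(code):
    """Recover the consonant and vowel that form the syllable."""
    c, v = code >> 8, code & 0xFF
    return (CONSONANTS[c - 1] if c else None,
            VOWELS[v - 1] if v else None)

ti, tu = encode("t", "i"), encode("t", "u")
print(decode(ti), decode(tu))  # prints: ('t', 'i') ('t', 'u')

# Numeric order of the codes follows the order of the tables above.
ordered = sorted([tu, encode("k", "aa"), encode(None, "a"), ti])
print([decode(c) for c in ordered])
```

In this sketch "ti" and "tu" each occupy exactly one code regardless of how many glyphs a font needs to draw them, and nothing in the code depends on glyph positions.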
for the design of Fonts.
1. It will be useful to fix
the glyphs in the displayable range for ASCII since data entry can be effected
from wordprocessors. This is feasible for Tamil. This way editing can be
done on the text string using conventional wordprocessors. Even if placed
in the 128-255 range, the glyphs may be located at positions corresponding
to the equivalent ASCII in the lower range.
2. It is a good idea to offer
a minimum set of glyphs for each script and fix their locations as well.
This will make it easier to view the same text in different fonts so long
as the text does not incorporate special symbols.
3. Font rendering methods
vary across systems and it is a good idea to build fonts with standard
encodings so that they may be rendered on different platforms using the
same glyph codes. This specification is important for web based applications.
The default encoding preferred for Tamil is the ISO 8859-1 (Latin-1) encoding which
is well supported on all the three platforms. Also, the minimal set of
glyphs may be used on all the fonts with some advantage.
4. It is a good idea to include
punctuation and special symbols in the glyphs for the language. This way
it will be possible to treat special symbols as part of the language itself.
Punctuation should include the period, the comma, exclamation mark and
the question mark.
5. It will be helpful to
provide some special symbols used in ancient manuscripts of Tamil (e.g.,
shapes for some double consonants).
6. It is inevitable that
variable width characters be designed. In the glyphs, it is recommended
that the width be made a multiple of some basic unit (say 2 pixels). This
will help retain vertical alignment in the text without having to resort
to special formatting. One glyph may be retained as a special space whose
width is an odd number of units. This glyph will be useful in retaining
on the use of the keyboard.
It turns out that keymappings
may be assigned arbitrarily and the processing software can do the required
mapping to the internal codes. Though in earlier sections we hinted at
mappings that bore some relationship to the font glyphs, it is useful to
look at keyboard mappings from the user point of view rather than programming
convenience. The following are some of the recommendations.
1. Data entry should be natural
and must relate to the letters of Tamil. It should be easy enough to train
persons in the use of the standard QWERTY keyboard which is what one will
see on all computers.
2. As far as possible, use
only the common keys to map the letters. Not all the keyboards will have
all the keys seen in the PC keyboards.
3. Standardize the input
method for entering a consonant vowel combination.
4. The manual typewriter
keyboard (for Tamil) is a choice that should not be ignored. It is an existing
standard and many have been trained in it. This keyboard is entirely adequate
for modern Tamil writing. The processing software can always use a keyboard
filter to transform the sequence of keystrokes into appropriate letters.
This may be accomplished through the use of Macros in most word processors.
5. Do not tie the keymapping
to any font, even though doing so may allow virtually any word processor
to handle data entry in Tamil. What is desirable is that the internal representation
be exportable to some word processors so that the powerful formatting facilities
seen in them may be utilized.
6. Keyboard mappings should
also not relate to the codes assigned for the characters. This is to ensure
that the internal representation is independent of the system in which
we are processing the text. It is always easier to work with standard codes
that do not directly relate to any hardware specific aspect of a computer.
Tables can be used to relate the internal representation to the glyphs
or keyboard mappings thus allowing great flexibility in dealing with the
input and the display.
7. Keyboard mappings arrived
at on the basis of some studies on the observed frequencies of the letters
in normal Tamil writing are no doubt helpful. It turns out that writing
styles vary so much that the frequencies seen in ancient texts are quite
different from the frequencies in modern texts. It is therefore preferable
to look at the thinking process as one types and assign the keymappings
based on user recommendations.
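The table-driven arrangement suggested in recommendation 6 can be sketched as follows in Python. Everything here is hypothetical: the internal codes, the two-key phonetic keymap, and the glyph positions of the two made-up fonts are placeholders chosen only to show how tables decouple the keyboard, the internal representation, and the font.

```python
# Hypothetical internal codes for three syllables.
CODE_TI, CODE_TU, CODE_KA = 0x0703, 0x0705, 0x0101

# Keyboard table: keystroke sequence -> internal code. Every
# consonant-vowel combination takes the same number of keystrokes.
KEYMAP_PHONETIC = {"ti": CODE_TI, "tu": CODE_TU, "ka": CODE_KA}

# Font tables: internal code -> glyph positions (made-up numbers).
# The two fonts need different numbers of glyphs for the same syllable.
FONT_A = {CODE_TI: [200, 168], CODE_TU: [210], CODE_KA: [180]}
FONT_B = {CODE_TI: [65, 66], CODE_TU: [67, 68, 69], CODE_KA: [70]}

def keys_to_codes(keys, keymap, width=2):
    """Translate fixed-width keystroke groups into internal codes."""
    return [keymap[keys[i:i + width]] for i in range(0, len(keys), width)]

def codes_to_glyphs(codes, font):
    """Expand internal codes into the glyph sequence for one font."""
    glyphs = []
    for code in codes:
        glyphs.extend(font[code])
    return glyphs

codes = keys_to_codes("tituka", KEYMAP_PHONETIC)
print(codes_to_glyphs(codes, FONT_A))  # prints: [200, 168, 210, 180]
print(codes_to_glyphs(codes, FONT_B))  # prints: [65, 66, 67, 68, 69, 70]
```

Changing the keyboard layout or the font means swapping one table; the stored text, held as internal codes, is unaffected.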
brief on the applications to be supported.
Computing in Tamil, or with
Tamil, must necessarily provide support for some basic applications that
will help bring information technology closer to the people. While one
might wish to have virtually every popular application run in Tamil, we
need to view some applications as being important to begin with. Given
below is a list of applications that should be made available in the vernacular
at the earliest. Whatever standardization is contemplated in respect of
representing Tamil inside the computer, the recommendations should keep
these applications in mind and offer viable technical solutions.
1. Data preparation applications.
Data entry and printing
applications which may include DTP, both for general use and commercial
publishing. Large scale newspaper and magazine publishing as well as archiving.
Display and dissemination
of information through the web. The applications include tools for preparing
HTML documents, Search engines, Archiving software, Applications for preserving
ancient manuscripts in their original scripts. Generation of on-line references
such as dictionaries which may be queried
2. Text processing applications.
sorting, indexing and searching of large volumes of data.
Data base systems supporting
interaction in Tamil. Report generation should be supported in Tamil as
well. This will help maintain records in many Government, public and private
institutions. The system should be supported on standard data base packages
such as Oracle, Access, Informix etc.
3. Educational applications.
the teaching of Tamil to the people of other states through their mother
tongue. Likewise, learning other languages through Tamil.
Preparation of computer
based lessons in Tamil to enhance classroom instruction. Lessons should
be interactive as well as web based.
4. Linguistic applications.
Analysis and study
of the structure of languages.
Study of ancient texts
from a linguistic angle.
to the analysis of sentences, Morphology, word frequency computations,
parsing, natural language processing etc.
5. Email communications.
Software to handle
email in Tamil so that the benefits of this wonderful facility are fully
6. Applications specific
to the Government and Public Institutions.
of text data bases, Police records, Historical data, Minutes of meetings
etc., for quick and effective access.
7. Other consumer applications.
Accounting and small
data base packages supporting user interfaces in Tamil.
Client server applications
supporting user interfaces in Tamil, to work with standard data bases.
Information Technology, in
respect of usage of Tamil on computers on a State-wide scale, should necessarily
aim at reaching all the people so as to give them the benefits of this
new technological wonder. This is not a simple matter that can be handled
by designing fonts or standardizing keyboard mappings. Any approach to
dealing with user interfaces in Tamil cannot be based on software solutions
that merely cater to data entry and printing, no matter how good the results
are to look at. These solutions will not provide a simple, easy and uniform
way of communication that can be used by all people. It is necessary to
look at the problem from the angle of information processing for the long
term growth of the Tamil language both for electronic processing and the
socio-economic progress of the state. In a wider context, Tamilnadu should
be able to share its expertise to provide IT solutions for the rest of
the country in their respective languages. Software and business opportunities
throughout the country will then come within the reach of the Tamilnadu
software industry. We should not lose the opportunity to propose a viable
approach to Information Technology in Indian languages.