all basic sounds
All the basic vowels
and consonants should find a place in the code space. All the symbols that
convey related information about the text (Vedic symbols, Accounting symbols
etc.) should also be coded. Punctuation marks consistent with the
present-day use of the scripts, together with the ten numerals, should also
be accommodated in the code space, irrespective of whether they have already
been accommodated with other scripts.
2. Lexical ordering
A meaningful ordering
of the vowels and consonants will help in text processing. Over the years,
online dictionaries have become very useful, and the arrangement of
words within a dictionary should conform to some known lexical ordering.
Lexical ordering of the aksharas may not really conform to any known
arrangement for the different languages, since no standards have been
recommended or proposed. The ordering currently in vogue is somewhat
arbitrary and differs across scripts.
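The kind of known lexical ordering argued for above can be sketched with an explicit collation table. The ordering fragment below is hypothetical, chosen only for illustration; it is not a recommended standard.

```python
# A minimal sketch of dictionary-style ordering driven by an explicit
# collation table. ORDER is a hypothetical fragment of a varnamala-
# based ordering, not a proposed standard.
ORDER = list("अआइईउऊकखगघङचछजझञ")
RANK = {ch: i for i, ch in enumerate(ORDER)}

def collation_key(word):
    # characters outside the table sort after all listed ones
    return [RANK.get(ch, len(ORDER)) for ch in word]

words = ["खग", "अआ", "कख"]
print(sorted(words, key=collation_key))
```

Any agreed ordering could be dropped into the `ORDER` table; the point is that the sort is then driven by linguistic convention rather than by raw code-point values.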
3. Coding structure to
reflect linguistic information
When codes are
assigned to the basic vowels and consonants, it would be of immense help
to relate the code value to some linguistic information. For instance,
the consonants in our languages are grouped into classes based on the manner
in which the sound is generated such as the cerebrals, palatals etc.
It would certainly help if, looking at a code, one could immediately recognize
the class. In fact, the system of using aksharas to refer to numerals is
a well-known approach to specifying numbers; this system, familiar to
many as the "katapayadi" system, has been followed
in India for ages.
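Some of this class information can in fact already be read off the existing Devanagari block, because the varga consonants are laid out contiguously in groups of five (with one interruption). The sketch below assumes only that block layout.

```python
# Hedged sketch: recover the articulation class (varga) of a
# Devanagari varga consonant from its Unicode code point alone.
# The first four vargas occupy U+0915..U+0928 in groups of five;
# U+0929 (NNNA, a nukta form) interrupts the grid, so the labials
# (U+092A..U+092E) are handled separately.
VARGAS = ["velar", "palatal", "retroflex", "dental"]

def varga(ch):
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0928:
        return VARGAS[(cp - 0x0915) // 5]
    if 0x092A <= cp <= 0x092E:   # pa-varga: PA..MA
        return "labial"
    return None                  # not a varga consonant
```

A coding scheme designed around this principle would make such lookups uniform, instead of requiring special cases like the U+0929 interruption above.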
4. Ease of data entry
The scheme proposed
for data entry must provide for typing in all the symbols without having
to install additional software or use multiple keyboard schemes. It is
also important that data entry modules restrict data entry to only those
strings that carry meaningful linguistic content. In the context of Unicode,
data entry schemes may permit typing in any valid Unicode character even
though it may convey nothing linguistically. It would therefore help if the schemes
allowed only linguistically valid text strings.
5. Transliteration across scripts
It is important that
the coding structure allow codes corresponding to one script to be easily
displayed using other scripts as well. In a country such as India, where
a lot of common information has to be disseminated to the public, one should
not be burdened with the task of generating the text independently
for each script. The Unicode assignments for linguistically equivalent
aksharas across languages are not sufficiently uniform to permit quick and
effective transliteration; one requires independent tables for each pair
of scripts. ISCII assignments were uniform across the scripts and made
transliteration easier. Transliteration is quite complex with Unicode.
The problem of finding equivalents requires that characters assigned in
one script but not in the other will have to be mapped based on some phonetic
content. This may not always be possible with current Unicode assignments.
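Because the Unicode Indic blocks were laid out on the ISCII grid, a first-cut transliteration can be sketched as a fixed code-point shift. The block offsets below come from the Unicode chart layout; slots unassigned in the target block are left unchanged, and those are exactly the cases where a phonetic fallback table (not shown) would be needed.

```python
import unicodedata

# Hedged sketch of ISCII-style transliteration by code-point shift.
# Each listed Indic block mirrors the Devanagari layout at a fixed
# offset; characters whose shifted slot is unassigned in the target
# block are passed through untouched.
DEV_BASE = 0x0900
TARGET_BASE = {"bengali": 0x0980, "gujarati": 0x0A80, "tamil": 0x0B80}

def shift_transliterate(text, target):
    base = TARGET_BASE[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if DEV_BASE <= cp <= 0x097F:
            cand = chr(cp - DEV_BASE + base)
            try:
                unicodedata.name(cand)   # raises ValueError if unassigned
                out.append(cand)
                continue
            except ValueError:
                pass
        out.append(ch)
    return "".join(out)
```

Devanagari KA shifts cleanly to Bengali or Gujarati KA, but KHA shifted into the Tamil block lands on an unassigned slot and falls through, illustrating why uniform assignments alone do not suffice for Tamil.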
A typical example illustrates the problem: three consonants
in Tamil have their Unicode equivalents specified only in Devanagari, but
not for the other scripts. This means that proper transliteration of Tamil
text into, say, Bengali or Gujarati may not be feasible with the existing
Unicode assignments, and only the nearest equivalents may be shown. Transliteration
based on nearest phonetic equivalents may not be appropriate from a linguistic
standpoint. The example brings up another important issue as well. In the Unicode assignment for Devanagari,
equivalent codes for aksharas from Tamil have been specifically provided
for. But the Unicode standard also allows the same aksharas
to be rendered using two Unicode characters: the first corresponding to
the basic phonetic equivalent and the second the Nukta character, which
identifies the dot in the preceding character. This creates problems in practice
when two different Unicode strings result in identical text displays, for
tracing back to the correct internal representation will be difficult.
This shows the bias exhibited by Unicode towards a coding structure which
also specifies rendering information, as opposed to one that rigidly
specifies a unique representation for each akshara.
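The two-representation problem can be observed directly through Unicode normalization: the precomposed nukta letters carry canonical decompositions, so comparing NFD forms is one hedged way to detect that two distinct strings denote the same akshara. The example uses U+0929, the dotted NA provided for Tamil na.

```python
import unicodedata

# Hedged sketch: two different Unicode strings for one akshara.
# U+0929 is the precomposed dotted NA; NA (U+0928) followed by
# NUKTA (U+093C) renders identically. Comparing canonical (NFD)
# forms exposes the equality that raw comparison misses.
precomposed = "\u0929"        # DEVANAGARI LETTER NNNA
decomposed = "\u0928\u093C"   # NA + NUKTA

def same_akshara(a, b):
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

Normalizing at the point of storage or comparison mitigates the tracing-back problem, but only for characters that have canonical decompositions; it does not remove the underlying ambiguity in the coding structure.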
6. String matching issues
Archives of text in
Indian languages may have to be indexed and stored for purposes of retrieval
against specific queries. The query string may pertain to text in a given
language but the result may actually be text in another language. Here
is a situation which illustrates this.
A journalist might have filed a report in one language for publication in
a magazine. At a later time, a similar event may have to be reported in
another region, and information from the earlier report might prove useful.
The journalist covering the latter event may query a database for keywords
from the earlier report, submitting the query in a different script but
with the same linguistic information.
The question of correctly forming a query string is also something that
one must think about, for it is quite easy to make spelling errors while
typing in the query string. How would one find a match? This is a typical
scenario in India where centralized information sources cater to dissemination
of the information in different regional languages.
7. Handling spelling errors
One of the major difficulties
in preparing a query string is getting the spelling right. With syllabic
writing systems, it is entirely possible that conjuncts (i.e., syllables
with multiple consonants) are typed in with some error. Often the string
is derived on the basis of its pronunciation. With errors in spelling,
string matching on the basis of syllables can be very difficult. The problem
indicated here assumes significance when central databases are queried
in regional scripts. A person in Tamilnadu may wish to look up information
about places in the Himalayas and submit a query in Tamil for a match
against the name.
The characters in
the Tamil string will have to be transliterated into appropriate codes
for Devanagari text in which the information may be kept. The syllables
in Tamil are always written in decomposed form and this will result in
differences between the Tamil and Devanagari strings causing the string
matching program to report either a spelling error or the absence of a
match. In the case of Indian scripts, it would be too much to expect users
to know the correct spelling. Thus string matching on the basis of close
sounds will be required rather than on the internal representation. This
argument will also apply to applications that might attempt to check spelling
in a data entry program.
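Matching on close sounds rather than on exact internal codes can be sketched with a crude phonetic key that folds each varga consonant onto the first member of its varga and drops vowel-length distinctions. The folding rules below are illustrative only, not a proposed standard.

```python
# Hedged sketch of sound-based matching for Devanagari text. Each of
# the 25 varga consonants folds onto the first letter of its varga
# (so KA, KHA, GA, GHA, NGA all map to KA), and long vowel signs
# fold onto their short counterparts.
FOLD = {}
for start in (0x0915, 0x091A, 0x091F, 0x0924, 0x092A):  # five vargas
    for i in range(5):
        FOLD[chr(start + i)] = chr(start)

# long matra -> short matra; AA matra -> inherent a (dropped)
VOWEL_FOLD = {"\u093E": "", "\u0940": "\u093F", "\u0942": "\u0941"}

def phonetic_key(word):
    out = []
    for ch in word:
        ch = FOLD.get(ch, ch)
        ch = VOWEL_FOLD.get(ch, ch)
        out.append(ch)
    return "".join(out)
```

Two strings match if their keys are equal, so a query misspelled with the wrong aspiration or vowel length still finds the intended entry; a production system would use a much richer set of folding rules, tuned per language pair.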