Coding schemes: Linguistic requirements 
1. Accommodate all basic sounds

  All the basic vowels and consonants should find a place in the code space. All symbols that convey related information about the text (Vedic symbols, accounting symbols, etc.) should also be coded. Punctuation marks consistent with present-day use of the scripts, and the ten numerals, should also be accommodated in the code space, irrespective of whether they have already been accommodated with other scripts.

2. Lexical ordering

  A meaningful ordering of the vowels and consonants helps in text processing. Online dictionaries have become increasingly useful over the years, and the arrangement of words within a dictionary should conform to some known lexical ordering. The lexical ordering of the aksharas may not conform to any single known arrangement across languages, since no standards have been recommended or proposed. The ordering currently in vogue is somewhat arbitrary and differs from language to language.
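  As a minimal sketch of why an explicit ordering table matters, the Python fragment below sorts a few Tamil consonants first by raw Unicode code point and then by the order in which the consonants are conventionally taught. The table TAMIL_CONSONANT_ORDER and the helper sort_key are illustrative names introduced here, and the conventional order is an assumption of the sketch rather than a published standard. A real collation would also have to cover vowels, vowel signs and the Grantha letters.

```python
# Conventional order of the eighteen Tamil consonants (an assumption of this
# sketch: the order usually taught, not one mandated by any standard).
TAMIL_CONSONANT_ORDER = (
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3"  # ka nga ca nya tta nna
    "\u0ba4\u0ba8\u0baa\u0bae\u0baf\u0bb0"  # ta na pa ma ya ra
    "\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"  # la va llla lla rra nnna
)
RANK = {ch: i for i, ch in enumerate(TAMIL_CONSONANT_ORDER)}


def sort_key(word):
    # Characters missing from the table sort after all listed consonants,
    # in code point order.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


letters = ["\u0ba9", "\u0bb5", "\u0baa", "\u0bb4"]  # nnna, va, pa, llla
by_code_point = sorted(letters)                 # order implied by the code values
by_convention = sorted(letters, key=sort_key)   # order implied by the table
```

  The two results differ: the code values place NNNA before PA, whereas the conventional order places it last. The lexical order therefore has to be supplied separately from the code assignments.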

3. Coding structure to reflect linguistic information

   When codes are assigned to the basic vowels and consonants, it would be of immense help to relate the code value to some linguistic information. For instance, the consonants in our languages are grouped into classes based on the manner in which the sound is generated, such as the cerebrals, palatals, etc. It would certainly help if, looking at a code, one could immediately recognize the class. In fact, using aksharas to refer to numerals is a well-known way of specifying numbers; this system, familiar to many as the "katapayadi" system, has been followed in India for ages.
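   The idea of a code that reveals the consonant class can be sketched as follows. The bit layout is purely hypothetical (it is neither ISCII nor Unicode), and the names CLASSES, encode and class_of are introduced only for illustration: the upper three bits carry the class and the lower five bits carry the position within the class.

```python
# Hypothetical 8-bit layout (not ISCII, not Unicode): the upper three bits
# hold the articulation class, the lower five bits the position in the class.
CLASSES = ["guttural", "palatal", "cerebral", "dental", "labial"]


def encode(class_name, position):
    if not 0 <= position < 32:
        raise ValueError("position must fit in five bits")
    return (CLASSES.index(class_name) << 5) | position


def class_of(code):
    # The class is read directly from the code value; no per-symbol table
    # lookup is needed.
    return CLASSES[code >> 5]


code = encode("cerebral", 2)
recovered = class_of(code)
```

   With such a layout, a program can test class membership with a shift or a mask instead of consulting a table for every akshara.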

4. Ease of data entry

  The scheme proposed for data entry must provide for typing in all the symbols without having to install additional software or use multiple keyboard schemes. It is also important that data entry modules restrict data entry to only those strings that carry meaningful linguistic content. In the context of Unicode, data entry schemes may permit typing in any valid Unicode character even though the resulting string may convey nothing linguistically. It would therefore help if the schemes allowed only linguistically valid text strings.
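  A data entry module could apply a check of this kind before accepting a string. The sketch below is deliberately simplified and covers Devanagari only: it rejects a nukta, vowel sign or virama that has no consonant to attach to. The function names are illustrative, and the rules are an assumption of the sketch, far short of a full description of valid syllables.

```python
# Simplified Devanagari ranges taken from the Unicode code chart.
NUKTA = "\u093c"


def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"


def is_dependent_sign(ch):
    # Vowel signs U+093E..U+094C and the virama U+094D.
    return "\u093e" <= ch <= "\u094d"


def is_plausible(text):
    """Reject strings where a nukta, vowel sign or virama lacks a base consonant."""
    prev = ""
    for ch in text:
        if ch == NUKTA and not is_consonant(prev):
            return False
        if is_dependent_sign(ch) and not (is_consonant(prev) or prev == NUKTA):
            return False
        prev = ch
    return True


accepted = is_plausible("\u0915\u093f")   # KA followed by vowel sign I
rejected = is_plausible("\u093f\u0915")   # vowel sign I with nothing before it
```

  Both strings consist of valid Unicode characters, but only the first is a meaningful syllable.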

5. Transliteration across scripts

  It is important that the coding structure allows codes corresponding to one script to be easily displayed using other scripts as well. In a country such as India, where a lot of common information has to be disseminated to the public, one should not be burdened with the task of generating the text independently for each script. The Unicode assignments for linguistically equivalent aksharas across languages are not sufficiently uniform to permit quick and effective transliteration, so one requires independent tables for each pair of scripts. ISCII assignments were uniform across the scripts and made transliteration easier. Transliteration is quite complex with Unicode. Characters assigned in one script but not in the other have to be mapped to equivalents on the basis of some phonetic content, and this may not always be possible with current Unicode assignments. The illustration below is typical of what one may prefer. Three consonants in Tamil have their Unicode equivalents specified only in Devanagari but not for other scripts. This means that proper transliteration of Tamil text into, say, Bengali or Gujarati may not be feasible with the existing Unicode assignments, and only the nearest equivalents can be shown. Transliteration based on nearest phonetic equivalents may not be appropriate from a linguistic angle.
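  The gap can be observed directly. The Unicode blocks for these scripts each span 128 code points, so a naive transliteration can shift a character to the same offset in the target block and then check whether that position is assigned. The sketch below does this for the three Tamil consonants NNNA, RRA and LLLA. The names BLOCK_START and transliterate are illustrative, and the offset method is an assumption of the sketch, not a complete transliteration scheme.

```python
import unicodedata

# Start of each script's 128-code-point block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
}


def transliterate(text, source, target):
    """Shift each character to the same offset in the target block.

    Returns the shifted string and the list of source characters whose
    counterpart is unassigned in the target script.
    """
    lo = BLOCK_START[source]
    shift = BLOCK_START[target] - lo
    out, missing = [], []
    for ch in text:
        if lo <= ord(ch) < lo + 0x80:
            candidate = chr(ord(ch) + shift)
            if unicodedata.name(candidate, None) is None:
                missing.append(ch)  # no character assigned at this position
            out.append(candidate)
        else:
            out.append(ch)  # leave spaces, Latin text, etc. untouched
    return "".join(out), missing


tamil_special = "\u0ba9\u0bb1\u0bb4"  # NNNA, RRA, LLLA
to_devanagari, missing_dev = transliterate(tamil_special, "tamil", "devanagari")
to_bengali, missing_ben = transliterate(tamil_special, "tamil", "bengali")
```

  For Devanagari all three positions are assigned, so missing_dev is empty. For Bengali all three are reported as missing, and a nearest phonetic equivalent would have to be chosen by hand.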

  This brings up another important issue as well. In the Unicode assignment for Devanagari, equivalent codes for aksharas from Tamil have been specifically provided. But the Unicode book also allows the same aksharas to be rendered using two Unicode characters: the first corresponding to the basic phonetic equivalent, and the second, the Nukta character, which identifies the dot in the preceding character. This creates problems in practice, because when two different Unicode strings result in identical text displays, tracing back to the correct internal representation is difficult. This shows the bias exhibited by Unicode towards a coding structure which also specifies rendering information, as opposed to rigidly specifying syllables alone.
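  The two representations can be compared in a few lines of Python. One way to reconcile them, not discussed in the text above, is Unicode normalization as provided by the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u0929"        # DEVANAGARI LETTER NNNA, a single code point
with_nukta = "\u0928\u093c"   # DEVANAGARI LETTER NA followed by SIGN NUKTA

# A plain comparison sees two different strings although both display alike.
same_code_points = precomposed == with_nukta

# After canonical decomposition (NFD) both strings have the same code points.
same_after_nfd = (unicodedata.normalize("NFD", precomposed)
                  == unicodedata.normalize("NFD", with_nukta))
```

  Here same_code_points is False while same_after_nfd is True. An application that compares or indexes raw code points without normalizing first will treat the two spellings as different words.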

6. String matching issues

  Archives of text in Indian languages may have to be indexed and stored for purposes of retrieval against specific queries. The query string may pertain to text in a given language but the result may actually be text in another language. Here is a situation which illustrates this.

  A journalist might have filed a report in one language for publication in a magazine. At a later time, a similar event may have to be reported in another region, and information from the earlier report might prove useful. The journalist covering the later event may then query a database for keywords in the language of the earlier report, but submit the query in a different script that carries the same linguistic information. The question of correctly forming a query string is also something that one must think about, for it is quite easy to make spelling errors while typing in the query string. How would one find a match? This is a typical scenario in India, where centralized information sources cater to the dissemination of information in different regional languages.

7. Handling spelling errors

  One of the major difficulties in preparing a query string is getting the spelling right. With syllabic writing systems, it is entirely possible that conjuncts (i.e., syllables with multiple consonants) are typed in with some error. Often the string is derived on the basis of its pronunciation. With errors in spelling, string matching on the basis of syllables can be very difficult. The problem indicated here assumes significance when central databases are queried in regional scripts. A person in Tamil Nadu may wish to look up information about places in the Himalayas and submit a query in Tamil for a match against the name.

  The characters in the Tamil string will have to be transliterated into appropriate codes for the Devanagari text in which the information may be kept. The syllables in Tamil are always written in decomposed form, and this will result in differences between the Tamil and Devanagari strings, causing the string matching program to report either a spelling error or the absence of a match. In respect of Indian scripts, it is too much to expect users to know the correct spelling. Thus string matching on the basis of close sounds will be required, rather than on the internal representation. This argument also applies to applications that attempt to check spelling in a data entry program.
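  Matching on close sounds can be sketched as follows. The Tamil query is shifted into the Devanagari block, and both strings are then reduced to a "sound key" in which the four stops of each consonant class are folded onto the first, since Tamil writes them with a single letter. The folding rule and the names to_devanagari and sound_key are assumptions made for this illustration, not an established algorithm.

```python
TAMIL_TO_DEVANAGARI = 0x0900 - 0x0B80  # block offset, valid for shared letters only

# First consonant of each five-letter class in the Devanagari block
# (guttural, palatal, cerebral, dental, labial).
CLASS_STARTS = [0x0915, 0x091A, 0x091F, 0x0924, 0x092A]


def to_devanagari(tamil_text):
    return "".join(
        chr(ord(ch) + TAMIL_TO_DEVANAGARI) if "\u0b80" <= ch <= "\u0bff" else ch
        for ch in tamil_text
    )


def sound_key(devanagari_text):
    """Fold the four stops of each class onto the first one; keep the nasal."""
    out = []
    for ch in devanagari_text:
        cp = ord(ch)
        for start in CLASS_STARTS:
            if start <= cp <= start + 3:
                cp = start
                break
        out.append(chr(cp))
    return "".join(out)


query = "\u0b95\u0b99\u0bcd\u0b95\u0bbe"    # Tamil spelling of "Ganga"
stored = "\u0917\u0919\u094d\u0917\u093e"   # Devanagari spelling with the nga conjunct
exact_match = to_devanagari(query) == stored
sound_match = sound_key(to_devanagari(query)) == sound_key(stored)
```

  The exact comparison fails because Tamil writes KA where Devanagari has GA, while the comparison of sound keys succeeds. A Devanagari spelling that uses the anusvara instead of the nga conjunct would still need a further rule, which is why matching in practice is likely to need a similarity measure rather than strict equality.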


 
 
