Conceptual Basis for Unicode
A coding scheme provides a means of representing text in a computer. The text presumably comes from some language, and the writing system employed for that language uses shapes associated with the linguistic elements fundamental to it. These are usually the vowels and consonants present in the language, and this set is normally known as the alphabet. Assigning codes to the letters of the alphabet has been the standard practice in processing information with computers.

Codes are generally assigned on the basis of linguistic requirements. A code is essentially a number associated with a letter of the alphabet. Working with numbers is easier, and a good deal of text processing can be effected just by manipulating the numbers. For example, an upper case letter in English can be changed to its lower case form by applying a simple formula. The number of codes required in practice for any particular language is decided by the totality of shapes associated with the writing system: upper and lower case letters, punctuation symbols, numerals and so on. Typically this is a set of fewer than a hundred codes for most Western languages.
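As an illustration of such a formula, here is a minimal sketch in Python: in the ASCII layout, the lower case letters sit exactly 32 positions after the upper case ones, so the conversion is a single addition. The function name to_lower is our own choice for illustration.

    def to_lower(ch: str) -> str:
        # In ASCII, 'a' (code 97) lies exactly 32 positions after 'A' (code 65),
        # so adding 32 to an upper case code yields the lower case code.
        if 'A' <= ch <= 'Z':
            return chr(ord(ch) + 32)
        return ch  # leave everything else untouched

    print(to_lower('G'))  # prints: g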

Traditionally, computer applications dealt with text in only one language. Subsequently the need to work with multilingual text was felt, and this brought in additional requirements in respect of codes. Letters from different languages cannot normally be distinguished on the basis of their codes, for across different languages the numerical values assigned as codes fall in the same range. Thus one might find that the code assigned to the letter "a" in English is the same as the code assigned to the Greek letter "alpha" or to an equivalent letter in the Cyrillic alphabet. A document with text from different languages cannot really be handled as multilingual unless a mechanism is available to specifically mark sections of the text as belonging to a particular language/script.

The traditional way of solving this was to embed descriptors in the text, written in a default language/script, and allow these descriptors to specify the multilingual content. Typically one would use different fonts to identify different languages, and the application would use the specified font to display the portions of the text in a particular language/script. This way, at least the display of multilingual information was possible, though it was still difficult to associate a code, i.e., a character in the text, with its language unless the application kept track of the context. Keeping track of the context requires that an application examine the text in the document from the beginning up to the current letter, for only then can the language associated with the letter be ascertained without doubt.

In eight-bit coding schemes, the codes are typically in the range 32-127, though values from 128 to 255 are also used. Since different characters from different languages are assigned codes in the same range, identification of the language for a given code is rather difficult unless the context is also specified. The concept of the "character set" was introduced precisely for this purpose, so that each language/script could be identified through the name given to the character set. The character set name would figure in the document (in a default language) and thus the context could be established. This is predominantly the method used in most word processor documents as well as in web pages displayed through web browsers.
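The role played by the character set name can be demonstrated directly: the very same eight-bit value decodes to entirely different letters depending on which character set is assumed. The byte value 0xE1 and the three ISO 8859 character sets below are illustrative choices, not taken from the text above.

    # One and the same eight-bit code...
    code = bytes([0xE1])

    # ...stands for a different letter under each character set.
    print(code.decode('latin-1'))    # á  (Western European, ISO 8859-1)
    print(code.decode('iso8859-7'))  # α  (Greek)
    print(code.decode('iso8859-5'))  # с  (Cyrillic)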

Linguistic processing with codes can proceed only when the language associated with the codes is known, and keeping track of the language context is cumbersome though not impossible. The idea behind Unicode is to build the language information into each character code, so that an application can readily associate the character with the particular language. Clearly, the need to identify the set of languages/scripts which would qualify for processing comes up first; Unicode therefore examined the different scripts used in the writing systems of the world and provided a comprehensive set of codes covering most of the languages of importance. The rationale for this is the following: writing systems typically employ shapes or symbols directly related to the alphabet, and so by providing for the script, one also provides for the language or languages which use that script (though with minor variations). The majority of the languages of the world could be handled this way, including Japanese, Chinese and Korean, where literally twenty thousand or more shapes are required. Unicode indeed set aside a very large range of numbers to cater to these.
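The manner in which a Unicode code carries its script identity with it can be seen with Python's standard unicodedata module; the four sample letters below are illustrative.

    import unicodedata

    # Each code point names its own script; no external context is needed.
    for letter in ('a', 'α', 'а', 'अ'):
        print(f'U+{ord(letter):04X}  {unicodedata.name(letter)}')

    # U+0061  LATIN SMALL LETTER A
    # U+03B1  GREEK SMALL LETTER ALPHA
    # U+0430  CYRILLIC SMALL LETTER A
    # U+0905  DEVANAGARI LETTER A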

The basic idea in Unicode was to assign codes over a much larger range of numbers, from 0 to 65,535 (the range of a sixteen-bit number). This large range would be apportioned to different languages/scripts by assigning chunks of 128 consecutive numbers to each script, a chunk which may also include a group of special symbols. The size of the alphabet in many languages is much less than 50, and so this minimal range of 128 is quite adequate even to cover additional symbols, punctuation, etc.
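For instance, the Devanagari script occupies exactly such a chunk, the 128 consecutive code points U+0900 through U+097F. A minimal sketch of testing membership in that block (the function name is our own):

    def in_devanagari_block(ch: str) -> bool:
        # The Devanagari block spans the 128 codes U+0900..U+097F.
        return 0x0900 <= ord(ch) <= 0x097F

    print(in_devanagari_block('अ'))  # True  (U+0905)
    print(in_devanagari_block('a'))  # False (U+0061)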

The list of languages supported in the current version of Unicode (Version 3.2) is given on the Unicode web site.

An important concept in Unicode is that codes are assigned to a language on the basis of linguistic requirements. Thus, for most languages of the world which use the letters of their alphabet directly in the writing system, the linguistic requirement is basically satisfied once all the letters are covered along with the special symbols. Display of text proceeds by identifying the letters through their assigned Unicode values, both in the input string and in the displayed string, which for most languages/scripts are identical. Thus a Unicode font for such a language need incorporate only the glyphs corresponding to the letters of the alphabet, and the glyphs in the font are identified with the same codes as the letters they represent.
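One consequence of assigning codes on a linguistic basis is that a simple operation such as case conversion is no longer tied to the ASCII formula shown earlier; it works uniformly across scripts. The sketch below relies on Python's built-in Unicode case mappings, and the sample words are illustrative.

    # Unicode case mapping works across scripts, not just for ASCII:
    for word in ('Unicode', 'Ελληνικά', 'Кириллица'):
        print(word.upper(), '/', word.lower())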

As a concept, Unicode provides a very effective way of dealing with multilingual information, both in respect of text display and linguistic processing. Unfortunately, special problems arise with languages which use syllabic writing systems, where the shapes of the displayed text may not bear a one-to-one relationship with the letters of the alphabet. The languages of the South Asian region, as well as Hebrew, Arabic, Persian and other languages written in Semitic-derived scripts, typically employ such writing systems. The Unicode assignments for these languages do meet the basic linguistic requirements; however, the issue of display or text rendering has to be addressed separately for them.
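The mismatch between stored codes and displayed shapes can be seen in a Devanagari conjunct: the syllable क्ष is stored as three code points yet rendered as a single shape. A minimal sketch:

    import unicodedata

    syllable = '\u0915\u094D\u0937'  # ka + virama + ssa, displayed as क्ष
    print(syllable, len(syllable))   # three stored codes, one displayed syllable
    for ch in syllable:
        print(f'U+{ord(ch):04X}  {unicodedata.name(ch)}')

    # U+0915  DEVANAGARI LETTER KA
    # U+094D  DEVANAGARI SIGN VIRAMA
    # U+0937  DEVANAGARI LETTER SSA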


 