A coding scheme
provides for representing text in a computer. The text presumably comes
from some language. The writing system employed for a language uses shapes associated with the linguistic elements fundamental to that language. These are usually the vowels and consonants of the language, and this set is normally known as the alphabet. Assigning codes to the letters of the alphabet has been the standard practice for processing information with computers.
Codes are generally assigned
on the basis of linguistic requirements. A code is essentially a number
that is associated with a letter of the alphabet. Working with numbers
is easier and a good deal of text processing can be effected by just manipulating
the numbers. For example, an upper case letter in English can be changed
to its lower case by applying a simple formula. The number of codes required in practice for any particular language is decided by the totality of shapes associated with the writing system, such as upper case letters, punctuation symbols, numerals, etc. Typically this is a set of fewer than a hundred codes for most Western languages.
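As an illustration, the Python sketch below applies the usual formula for codes in the ASCII range: upper case letters occupy codes 65-90 and lower case letters 97-122, so the conversion is a fixed offset of 32. (The function name and the choice of Python are purely illustrative.)

```python
# Change an upper-case English letter to lower case by manipulating its code.
def to_lower(ch: str) -> str:
    code = ord(ch)                # numeric code of the letter
    if 65 <= code <= 90:          # 'A'..'Z' in ASCII
        return chr(code + 32)     # shift into 'a'..'z'
    return ch                     # leave non-letters unchanged

print(to_lower("G"))   # g
print(to_lower("7"))   # 7 (not an upper-case letter, unchanged)
```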
Traditionally, computer applications
dealt with text corresponding to only one language. Subsequently the need to work with multilingual text arose, and this brought additional requirements for codes. Letters from different languages cannot normally be distinguished on the basis of their codes, because across different languages the numerical values assigned as codes fall in the same range. Thus one might find that the code assigned to the letter "a" in English is really the same as the code assigned to the Greek letter "alpha" or an equivalent letter in the Cyrillic alphabet. A multilingual document with text from different languages cannot really be identified as such unless a mechanism is available to specifically mark sections of the text as belonging to a particular language/script.
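The overlap can be seen with any pair of eight bit character sets. The short Python sketch below (ISO 8859-1 and ISO 8859-7 are chosen purely as examples) decodes the same byte value under two different character sets and obtains two different letters.

```python
# The same single-byte code stands for different letters under different
# eight-bit character sets.
raw = bytes([0xE1])

print(raw.decode("iso-8859-1"))   # á  (a Latin letter)
print(raw.decode("iso-8859-7"))   # α  (the Greek letter alpha)
```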
The traditional way of solving
this was to embed descriptors in the text in a default language/script
and allow these descriptors to specify multilingual content. Typically
one would use different fonts to identify different languages and the application
would use the specified font to display portions of the text in a particular
language/script. This way, at least the display of multilingual information
was possible, though it was still difficult to associate a code (i.e., a character in the text) with its language unless the application kept track of the context. Keeping track of the context requires that an application examine the text in the document from the beginning up to the current letter, for only then can the language associated with the letter be ascertained without doubt.
In eight bit coding schemes, the codes are typically in the range 32-127, though values in the range 128-255 are also used. Since characters from different languages are assigned codes in the same range, identification of the language for a given code is rather difficult unless the context is also specified. The concept of the "character set" was introduced precisely for this purpose, so that each language/script could be identified through the name given to the character set. The character set name would appear in the document
(in a default language) and thus the context could be established. This
is predominantly the method used in most word processor documents as well
as web pages displayed through web browsers.
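A hypothetical sketch of this mechanism, assuming an HTML-style declaration: the character set name travels in the document as plain text, and the application uses it to choose the right interpretation of the raw bytes.

```python
import re

# The document names its character set in a default language (here, an
# HTML-style meta tag); the raw bytes that follow are interpreted
# according to that declared name.
header = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">'
body = bytes([0xE1, 0xE2])                           # raw bytes from the document

match = re.search(r'charset=([\w-]+)', header)
charset = match.group(1) if match else "iso-8859-1"  # fall back to some default
print(body.decode(charset))                          # αβ under the Greek character set
```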
Linguistic processing with
codes can proceed only when the language associated with the codes is known.
Keeping track of the context of the language is cumbersome, though not impossible. The idea behind Unicode is to present the language information associated with each character code in a manner that an application can readily associate the character with the particular language. Clearly, the need to identify the set of languages/scripts which would qualify for processing comes up first; Unicode therefore examined the different scripts used in the writing systems of the world and provided a comprehensive set of codes to cover most of the languages of importance. The rationale for this is
the following. Typically, the writing systems employ shapes or symbols
which are directly related to the alphabet and so by providing for the
script, one would also provide for the language or languages which use
the same script (though with minor variations). The majority of the languages of the world could be handled this way, including Japanese, Chinese and Korean, where literally twenty thousand or more shapes are required. Unicode
indeed set aside a very large range of numbers to cater to these.
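The effect can be seen by inspecting the code values directly. In the Python sketch below, three letters that look alike, Latin "a", Greek "alpha" and Cyrillic "a", carry three distinct Unicode values, so no external context is needed to tell them apart.

```python
import unicodedata

# Visually similar letters from different scripts have distinct Unicode values.
for ch in ("a", "\u03b1", "\u0430"):      # Latin a, Greek alpha, Cyrillic a
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+0061  LATIN SMALL LETTER A
# U+03B1  GREEK SMALL LETTER ALPHA
# U+0430  CYRILLIC SMALL LETTER A
```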
The basic idea in Unicode
was to assign codes over a much larger range of numbers, from 0 to nearly 65000. This large range would be apportioned to different languages/scripts by assigning chunks of 128 consecutive numbers to each script, which may also include a group of special symbols. The size of the alphabet in many languages is much less than 50, and so this minimal range of 128 is quite adequate even to cover additional symbols, punctuation, etc. The list of languages supported in the current version of Unicode (Version 3.2) is given at the Unicode web site.
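Because each script sits in its own contiguous range, the script of a character can be recovered from its code value alone. The sketch below hard-codes a handful of ranges from the published Unicode code charts purely for illustration; the full list is maintained at the Unicode web site.

```python
# A few illustrative script ranges from the Unicode code charts.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek":       (0x0370, 0x03FF),
    "Devanagari":  (0x0900, 0x097F),
    "Tamil":       (0x0B80, 0x0BFF),
}

def block_of(ch: str) -> str:
    code = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= code <= hi:
            return name
    return "other"

print(block_of("a"))        # Basic Latin
print(block_of("\u0905"))   # Devanagari (the letter A)
```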
An important concept in Unicode
is that codes are assigned to a language on the basis of linguistic requirements.
Thus, for most languages of the world which use the letters of their alphabet
in the writing system, the linguistic requirement is basically satisfied
if all the letters are covered along with special symbols. Display of text
would proceed by identifying the letters through their assigned Unicode
values both in the input string and the displayed string, which for most
languages/scripts would be identical. Thus a Unicode font for a language
need incorporate only the glyphs corresponding to the letters of the alphabet
and the glyphs in the font would be identified with the same codes used
for the letters they represent.
As a concept, Unicode provides
for a very effective way of dealing with multilingual information both
in respect of text display and linguistic processing. Unfortunately, we
encounter special problems with languages which use syllabic writing systems, where the shapes of the displayed text may not bear a one-to-one relationship with the letters of the alphabet. In other words, for those languages of the world where the writing system displays syllables, the one-to-one relationship between the letters of the alphabet and the displayed shape does not apply. The languages of the South Asian region, as well as languages written in Semitic scripts such as Hebrew, Arabic, Persian, etc., typically employ syllabic writing systems. Unicode assignment for these languages does
meet the basic linguistic requirements. However, the issue of display or
text rendering has to be addressed separately for these languages.
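The difference is easy to observe. In the Python sketch below, a single Devanagari syllable (the conjunct "ksha") is stored as three Unicode values but rendered as one shape, so the stored codes and the displayed glyphs no longer correspond one to one.

```python
import unicodedata

# One displayed syllable, three stored codes: KA + VIRAMA + SSA.
syllable = "\u0915\u094D\u0937"           # क + ् + ष  rendered as क्ष
print(syllable, "stored as", len(syllable), "codes")
for ch in syllable:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```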