A tutorial on Fonts for
Indian Languages: Section-1
Concept of Font Encoding
In the previous
section we looked at a simple font arrangement where the glyph location for
a letter of the alphabet is fixed through the ASCII code for the letter. In
practice, it is not necessary to adhere to this arrangement. The glyph for
a particular letter of the alphabet may be kept in any glyph location but
some mechanism is required to relate the code to its shape. The encoding scheme
provides this relationship where the name of the symbol associated with the
code is specified in a table. Inside the font file, the name of the
symbol and the location of its shape (called the glyph index) are also specified.
Thus from the character code one obtains the name of the letter and this
name is used to arrive at the glyph for display. The process is shown
in the figure below.
We can make some important observations from the figure above. The final shape to be
displayed is decided by the character code together with the character set
as well as the font encoding. A computer system displaying text using its default
character set encoding is likely to display shapes different from those intended.
The displayed shape is likely to be correct if there is a way of telling
the system that it should display the shape for a character name. This works
properly for English, but when the shapes relate to a different script (Sanskrit
in the example above), the display will bear no relationship to the named
character; it will instead be a shape chosen by the designer of the font to meet
the requirements of the script (writing system).
This idea, while appearing to be unnecessary if the code can be directly related
to the glyph, actually gives some freedom for the font designer to place
the glyphs in any order that gives some convenience in respect of the design.
This way, it will always be possible to relate a character to its glyph so
long as the name of the character is correctly specified to the font.
Over the years, standard locations for the glyphs corresponding to standard
letters of the alphabet for a language have emerged, giving rise to the notion
of standard font encoding.
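The two-step lookup described above (code to character name, then name to glyph index) can be sketched in a few lines of Python. This is only an illustration: the table contents, the glyph index values and the names CHARACTER_SET, FONT_ENCODING and glyph_for_code are hypothetical and are not taken from any real font or system.

```python
# Hypothetical tables for illustration; real systems and fonts define their own.

# Character set: code value -> character name (as understood by the system).
CHARACTER_SET = {
    0x41: "A",
    0x2D: "hyphen",
    0xB1: "plusminus",
}

# Font encoding: character name -> glyph index (as stored inside the font).
# The designer is free to choose any index for any name.
FONT_ENCODING = {
    "A": 36,
    "hyphen": 16,
    "minus": 16,       # a second name that shares the glyph used for the hyphen
    "plusminus": 112,
}

# Assumed index of a "missing glyph" shape used when a lookup fails.
MISSING_GLYPH = 0


def glyph_for_code(code, character_set=CHARACTER_SET, font_encoding=FONT_ENCODING):
    """Resolve a code to a glyph index in two steps: code -> name -> glyph."""
    name = character_set.get(code)
    if name is None:
        return MISSING_GLYPH
    return font_encoding.get(name, MISSING_GLYPH)
```

With these assumed tables, glyph_for_code(0x2D) yields the same index that the name "minus" maps to, which mirrors the point that two differently named characters may share one glyph.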
specifies what character names are correctly recognized in a computer system.
The encoding concept also permits different characters with different names
from different languages to be displayed with the same glyph if the characters
have the same shape. For example, the minus sign and the hyphen could both
be displayed with the same glyph though a hyphen may mean something different
from the mathematical minus sign. In a font designed according to an encoding
scheme, a desired symbol can be selected for display by specifying the name
for the symbol. This name is related to the code used to refer to the character
and the relationship is specified through a Table called the Encoding Table.
The term "character set" is used to relate an eight bit code value to the
name of the character represented by the code.
The concept of font encoding allows us to generate displays of text strings in many different
languages by using fonts which contain the glyphs corresponding to their
alphabet. The text to be displayed is represented as a series of eight
bit characters and for all practical purposes, these may be reckoned as a
string of ASCII codes. The computer system takes each code and displays the
glyph associated with it. The glyphs may be viewed as the building blocks
for the letter to be displayed where, by placing the glyphs one after another,
the required display is generated. Fonts also incorporate a feature whereby
some of the glyphs may be defined to have zero width even though they extend
over a horizontal range. Thus when the system places a zero width glyph
next to another, the two are superimposed and thus permit more complex shapes
to be generated, such as accented letters. Zero width glyphs are very important
for Indian language fonts.
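The effect of a zero width glyph can be sketched with a simple pen-position model. The glyph names and width values below are hypothetical, and the sketch assumes the zero width glyph is placed just before the glyph it combines with; a real rendering system is considerably more elaborate.

```python
def place_glyphs(glyphs, start_x=0):
    """Lay out glyphs left to right.

    `glyphs` is a list of (name, advance_width) pairs. Each glyph is drawn
    at the current pen position, and the pen then moves right by the
    glyph's advance width. Returns the placed glyphs and the final pen x.
    """
    placed = []
    pen_x = start_x
    for name, advance in glyphs:
        placed.append((name, pen_x))
        pen_x += advance  # a zero width glyph leaves the pen where it was
    return placed, pen_x


# Hypothetical metrics: the accent has zero width, so the letter that
# follows it is drawn at the same position and the two shapes overlap.
EXAMPLE = [("acute", 0), ("e", 500), ("t", 300)]
```

In this sketch place_glyphs(EXAMPLE) puts "acute" and "e" at the same x position, so the two outlines are superimposed, while "t" starts only after the width of "e".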
The location of a glyph within a font that caters to non-Roman letters is pretty
much arbitrary though designers tend to follow one or more encoding
standards. As indicated earlier, different encoding methods are in use today
where each encoding has standardized the locations where glyphs may be placed.
Unfortunately, these locations are not uniform across different encodings.
This is a consequence of different vendors or independent groups choosing
to support their own encoding standard in their applications (or the Operating
System). Over the years, the following encoding standards have become popular.
1. The standard Latin character set as per ISO-8859-1
This encoding is recognized under most operating systems. The standard provides
for about 190 glyphs and 96 out of these represent the letters of the English
language, punctuation and other symbols seen in print.
2. Windows specific Latin character set known as Latin-1252
This encoding supports a dozen or more glyphs beyond the number supported
under 8859-1. This encoding is the most common choice under Microsoft Windows.
3. The Mac encoding
The Mac encoding is quite unique in many respects. It is one of the encodings
supporting more than 235 glyphs. This encoding has been used for many non
Roman fonts from the very beginning when the Macintosh was introduced. Mac
encoding is not compatible with ISO-Latin-8859-1. It has been observed that
on a Mac, Web browsers do not correctly display many of the characters of
the standard 8859-1 set (13 in the upper ASCII range).
4. Encoding supported under PostScript
PostScript is an independent approach to generating printed documents based
on a Graphic description language developed at Adobe Systems, USA. There are
two or three different encodings supported under PostScript but it is possible
to have user specified fonts with arbitrary encodings.
5. User specified encoding
In a font that has user specified encoding, the mapping of the name of the character
to its glyph index will be different from one of the standard encodings.
There is no specific advantage to having a custom encoded font if the text
to be displayed in the system has been generated in the system itself or
it conforms to the default character set understood by different applications
running in the system. When text is moved across systems, there is a possibility
that the character names, though correctly understood, get mapped to glyphs
other than the right ones because the coding of the text in one system differs
from the character set understood in the second. It is well known that
a text file created in Microsoft Windows will not always display correctly
on a Unix machine or a Macintosh, since the Windows character set differs
from the character sets understood by the two.
One may ask whether it is possible to have all systems correctly understand one character
set for a specified language. Historically, Unix machines dealt with a simpler
character set, typically ISO-8859-1 (about 190 characters) while the Mac,
way back in 1983 or so, supported a much richer character set catering to nearly
235 characters. This made the Mac an ideal choice for DTP in languages other than English
which required many more symbols to display the text. Microsoft Windows followed
but had its own definition for the character set. The standard Windows character
set called Western-1252, has about 210 characters defined. The common denominator
across the three encodings is only 94 characters!
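The mismatch between these character sets can be explored with Python's built-in codecs. The sketch below assumes that the codec names "latin-1", "cp1252" and "mac_roman" correspond to ISO-8859-1, the Windows set and the Mac set discussed here; the helper names decode_byte and shared_codes are introduced only for this illustration.

```python
ENCODINGS = ["latin-1", "cp1252", "mac_roman"]


def decode_byte(code, encoding):
    """Return the character that `encoding` assigns to one eight bit code,
    or None if the encoding leaves that code undefined."""
    try:
        return bytes([code]).decode(encoding)
    except UnicodeDecodeError:
        return None


def shared_codes(encodings, codes=range(0x20, 0x100)):
    """Return the codes that every listed encoding maps to the same character."""
    shared = []
    for code in codes:
        chars = {decode_byte(code, enc) for enc in encodings}
        if len(chars) == 1 and None not in chars:
            shared.append(code)
    return shared
```

Calling shared_codes(ENCODINGS) lists the code values on which all three codecs agree, and decode_byte(0xE9, enc) for each encoding shows one upper-range code that is read differently on the Mac. The exact count obtained depends on the range examined and on how each codec treats control and undefined codes, so it need not match the figure quoted above.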
The Unicode standard is supposed to be a universal character set to be interpreted and
displayed correctly on all systems. It will be shown subsequently that even
this standard poses problems for Indian languages. Unicode is a good choice
for European languages which have a much simpler writing system compared
to the languages of India.
Unicode encoding combines the character sets from several different
languages/scripts of the world into one single scheme. The codes assigned to
characters from each script are identified with unique names given to each
character in the set. Characters which combine diacritics have individual
assignments in Unicode and therefore map to a single glyph. Unicode was designed
as a sixteen bit code (its code space has since been extended beyond sixteen
bits) and requires specific support from the system for displaying the
characters. In respect of Indian languages and others which employ a syllabic
writing system, Unicode assignments are confined to the basic set of vowels,
consonants and medial vowel representations. With this, a syllable has to
be made up from a string of Unicode values and for display, the string should
map into a unique glyph. The recommended method for this is to use OpenType fonts.
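The way a single syllable is made up from a string of Unicode values can be inspected with Python's standard unicodedata module. The sample syllable below (Devanagari "kshi") is chosen only for illustration and does not come from the tutorial; describe is a hypothetical helper name.

```python
import unicodedata


def describe(text):
    """List the code point and the Unicode character name of each element."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One written syllable stored as four code values:
# consonant KA + VIRAMA + consonant SSA + vowel sign I.
SYLLABLE = "\u0915\u094D\u0937\u093F"
```

Here describe(SYLLABLE) returns four entries, one per code value, even though a reader sees a single syllable. Turning that string into one displayed shape is the job of the font and the rendering system, not of the character codes themselves.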
Number of glyphs required in practice
More glyphs in a font means that the font will permit display of
a more comprehensive set of letters and characters. Some of the encodings
allow as many as 240 or more glyphs to be handled. However, due to variations
in text processing across different computers, only the first of the above
mentioned standards is really usable across different computers. In this
standard approximately 190 glyphs are available for display.
Character sets versus Font Encoding
The concept of the character set relates to how a computer system should interpret the
codes in the text (i.e., figure out what letter of the alphabet or symbol
the code actually represents). Clearly, one should know which character set
has been used in creating text in a document to be able to display the text
correctly. One does this normally by specifying the font to be used for generating
the display and the specified font will correctly show the text if its encoding
matches the mapping used in the character set.
Web browsers normally associate specific fonts with specific character sets. The character
set specified in a web page will usually let the browser select a font for
display consistent with the encoding of the character set. This may not
always be possible, since a page is not required to indicate its character
set and may simply rely on the default for the system.
Such pages will display differently on different browsers. Worse still,
if an application has its own approach to interpreting the document, the
resulting display will be different from what was intended. This is illustrated
in the screen shots taken from a Macintosh. The image below shows the contents
of testchar3.html as served from a web server and displayed by Netscape. Three
of the characters in the upper ASCII range do not get shown properly.
In the second image, the browser displays the same text from a copy of the same file saved
in the system. This time the character set interpretation corresponds to that
of the native encoding for the Mac.
The third image displays the rendering of the text in a text editor, SimpleText. The locally
saved file is opened in the editor. Also, a copy of the text from the browser
window is pasted into the application.
An HTML file has been opened by the SimpleText application, which interprets the HTML contents
and shows text as per standard conventions. Two of the characters from the
upper ASCII region in the text are correctly displayed whether coded as
direct ASCII or in the form of an HTML entity. Netscape on the Mac, however,
interprets the file differently when the page is treated as the contents
of a local file. Now the two characters get interpreted differently. Also,
when one copies and pastes text into SimpleText from the window of the browser,
the problem continues since SimpleText treats the contents as encoded in
MacRoman! The problem persists on Windows systems as well, where glyphs
such as the soft hyphen will not get rendered on Windows 2000/XP systems but get displayed
properly under Windows 98.
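The difference between a character coded directly as an eight bit value and the same character written as an HTML entity can be sketched as follows. This does not reproduce the behaviour of the browsers in the screen shots; it only shows, under the assumption that Python's "latin-1" and "mac_roman" codecs stand in for the two character sets, why a raw code depends on the assumed character set while a named entity does not. The function names are hypothetical.

```python
import html


def render_raw(byte_value, assumed_charset):
    """Interpret one raw eight bit code the way a program that assumes
    `assumed_charset` would."""
    return bytes([byte_value]).decode(assumed_charset)


def render_entity(entity):
    """Resolve an HTML entity such as '&eacute;'. The entity names the
    character itself, so no byte-level character set is involved."""
    return html.unescape(entity)
```

For example, render_raw(0xE9, "latin-1") and render_entity("&eacute;") give the same character, whereas render_raw(0xE9, "mac_roman") gives a different one. An application that ignores the declared character set will therefore show the raw code differently even when the entity form is resolved to the intended character.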
Display of the text on a Windows system with Opera gives the expected result.
Why should we be concerned about Font Encoding?
When viewing text in
Indian Scripts on a computer or printing a document on paper, we need to use
appropriate fonts. The font rendering application is usually given a series
of codes (eight bits) and shapes associated with the codes are taken from
the font and drawn on the screen or printed on paper.
The critical factor
is that the association between an eight bit code value and a shape depends
not only on the encoding chosen for the font but also on the system that interprets
the codes.
Web Browsers are not
really able to uniformly handle this problem and the consequence is that text
displayed may not be uniform across web pages seen in different Browsers.
However, the characters from the standard Roman alphabet, specified through
ASCII codes are correctly interpreted and displayed on all systems, for
these correspond to the English language.
It is possible to design
fonts in such a way that this uniformity is achieved at least in the case
of the browsers used under Windows 9x, Unix and the Macintosh.
Font Encodings compared