  Font Tutorial- Encoding concepts  
A tutorial on Fonts for Indian Languages:  Section-1

Concept of Font Encoding

  In the previous section we looked at a simple font arrangement where the glyph location for a letter of the alphabet is fixed by the ASCII code for the letter. In practice, it is not necessary to adhere to this arrangement. The glyph for a particular letter may be kept in any glyph location, but some mechanism is required to relate the code to its shape. The encoding scheme provides this relationship: a table specifies the name of the symbol associated with each code. Inside the font file, the name of each symbol and the location of its shape (called the glyph index) are also specified. Thus from the character code one obtains the name of the letter, and this name is used to arrive at the glyph for display. The process is shown in the figure below.
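
The two-step lookup described above (code to name, then name to glyph index) can be sketched in a few lines of Python. This is only an illustration: the table names, the function glyph_for_code and the glyph index values are hypothetical, and only the three code assignments shown (0x41, 0xE9, 0x8E/0xC8) are taken from the real ISO-8859-1 and Mac character sets.

```python
# Hypothetical tables for illustration; the glyph indices are made up.

# Character set: eight-bit code -> character name.
LATIN1_CHARSET = {0x41: "A", 0xE9: "eacute", 0xC8: "Egrave"}
MACROMAN_CHARSET = {0x41: "A", 0x8E: "eacute", 0xE9: "Egrave"}

# Inside the font: character name -> glyph index.
# The order of the glyphs is the designer's choice.
FONT_GLYPH_INDEX = {"Egrave": 3, "A": 17, "eacute": 42}


def glyph_for_code(code, charset, font):
    """Return (name, glyph_index) for a code, or None if it is unmapped."""
    name = charset.get(code)
    if name is None or name not in font:
        return None
    return name, font[name]


if __name__ == "__main__":
    for label, charset in (("ISO-8859-1", LATIN1_CHARSET),
                           ("Mac", MACROMAN_CHARSET)):
        result = glyph_for_code(0xE9, charset, FONT_GLYPH_INDEX)
        print(label, hex(0xE9), "->", result)
```

The same code value 0xE9 reaches a different glyph depending on which character set is used to interpret it, even though the font itself is unchanged.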

  We make some important observations from the figure above. The final shape to be displayed is decided by the character code together with the character set as well as the font encoding. A computer system displaying text using its default character set encoding is likely to display shapes different from those intended. The displayed shape is likely to be correct if there is a way of telling the system that it should display the shape for a given character name. This works properly for English, but when the shapes relate to a different script (Sanskrit in the example above), the displayed shape will have no relationship with the named character; it is a shape chosen by the designer of the font to meet the requirements of the script (writing system).

  The encoding idea may appear unnecessary when the code can be related directly to the glyph, but it gives the font designer the freedom to place the glyphs in any order that is convenient for the design. It will always be possible to relate a character to its glyph so long as the name of the character is correctly specified to the font. Over the years, standard locations for the glyphs corresponding to the standard letters of the alphabet for a language have emerged, giving rise to the notion of standard font encoding.

  Font encoding specifies which character names are correctly recognized in a computer system. The encoding concept also permits different characters with different names from different languages to be displayed with the same glyph if the characters have the same shape. For example, the minus sign and the hyphen could both be displayed with the same glyph, though a hyphen may mean something different from the mathematical minus sign. In a font designed according to an encoding scheme, a desired symbol can be selected for display by specifying the name of the symbol. This name is related to the code used to refer to the character, and the relationship is specified through a table called the Encoding Table. The term "character set" is used for the mapping that relates an eight-bit code value to the name of the character represented by the code.

  The concept of font encoding allows us to generate displays of text strings in many different languages by using fonts which contain the glyphs corresponding to their alphabets. The text to be displayed is represented as a series of eight-bit characters and, for all practical purposes, these may be reckoned as a string of ASCII codes. The computer system takes each code and displays the glyph associated with it. The glyphs may be viewed as the building blocks for the letter to be displayed: by placing the glyphs one after another, the required display is generated. Fonts also incorporate a feature whereby some of the glyphs may be defined to have zero width even though they extend over a horizontal range. Thus when the system places a zero-width glyph next to another, the two are superimposed, which permits more complex shapes to be generated, such as accented letters. Zero-width glyphs are very important for Indian language fonts.
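
The effect of a zero-width glyph on placement can be sketched as follows. This is a simplified model under stated assumptions, not how any particular rendering system works: the Glyph class, the layout function and all metric values are hypothetical, and the zero-width glyph is assumed to be drawn to the left of the pen position so that it lands over the preceding glyph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    advance: int    # how far the pen moves after drawing (the glyph width)
    ink_left: int   # left edge of the drawn shape, relative to the pen
    ink_right: int  # right edge of the drawn shape, relative to the pen


# Hypothetical metrics in font units.
FONT = {
    "a": Glyph(advance=500, ink_left=50, ink_right=450),
    "b": Glyph(advance=520, ink_left=60, ink_right=480),
    # Zero width: the shape extends to the LEFT of the pen,
    # so it is drawn over the glyph placed just before it.
    "acute": Glyph(advance=0, ink_left=-350, ink_right=-150),
}


def layout(names, font):
    """Place glyphs one after another.

    Returns a list of (name, ink_start, ink_end) and the total width.
    """
    pen = 0
    placed = []
    for name in names:
        glyph = font[name]
        placed.append((name, pen + glyph.ink_left, pen + glyph.ink_right))
        pen += glyph.advance  # a zero-width glyph leaves the pen in place
    return placed, pen


placed, total = layout(["a", "acute", "b"], FONT)

if __name__ == "__main__":
    for entry in placed:
        print(entry)
    print("total width:", total)
```

In this sketch the accent occupies a horizontal range inside that of the letter "a" and adds nothing to the total width, which is the superimposition the text describes.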

  The location of a glyph within a font that caters to non-Roman letters is largely arbitrary, though designers tend to follow one or more encoding standards. As indicated earlier, different encoding methods are in use today, and each encoding has standardized the locations where glyphs may be placed. Unfortunately, these locations are not uniform across different encodings. This is a consequence of different vendors or independent groups choosing to support their own encoding standard in their applications (or the Operating System). Over the years, the following encoding standards have become popular.

 1. The standard Latin character set as per ISO-8859-1

  This encoding is recognized under most Operating Systems. The standard provides for about 190 glyphs and 96 out of these represent the letters of the English language, punctuation and other symbols seen in print.
 2. Windows specific Latin character set known as Latin-1252
   This encoding supports a dozen or more glyphs beyond the number supported under 8859-1. This encoding is the most common choice under Microsoft  Windows. 
 3.  Macintosh Encoding. 
   The Mac encoding is unique in many respects. It is one of the encodings supporting more than 235 glyphs. This encoding has been used for many non-Roman fonts from the very beginning, when the Macintosh was introduced. Mac encoding is not compatible with ISO-8859-1. It has been observed that on a Mac, Web browsers do not correctly display many of the characters of the standard 8859-1 set (13 in the upper ASCII range).
 4. Encoding supported under PostScript.
   PostScript is an independent approach to generating printed documents, based on a page description language developed at Adobe Systems, USA. There are two or three different encodings supported under PostScript, but it is possible to have user-specified fonts with arbitrary encodings.
 5. User specified encoding (Custom Encoding).
   In a font that has a user-specified encoding, the mapping of the name of the character to its glyph index differs from that of the standard encodings. There is no specific advantage to having a custom encoded font if the text to be displayed has been generated in the same system or conforms to the default character set understood by the applications running in the system. When text is moved across systems, there is a possibility that the character names, though correctly understood, get mapped to glyphs other than the right ones because the coding of the text in one system differs from the character set understood in the second. It is well known that a text file created in Microsoft Windows will not display correctly on a Unix machine or a Macintosh, since the Windows character set differs from the character sets understood by the two.

  One might ask whether it is possible to have all systems correctly understand one character set for a specified language. Historically, Unix machines dealt with a simpler character set, typically ISO-8859-1 (about 190 characters), while the Mac, as early as 1984, supported a much richer character set of nearly 235 characters. This made the Mac an ideal choice for DTP in languages other than English, which required many more symbols to display the text. Microsoft Windows followed but had its own definition for the character set. The standard Windows character set, called Western-1252, has about 210 characters defined. The common denominator across the three encodings is only 94 characters!

  The Unicode standard is supposed to be a universal character set to be interpreted and displayed correctly on all systems. It will be shown subsequently that even this standard poses problems for Indian languages. Unicode is a good choice for European languages which have a much simpler writing system compared to the languages of India.

6. Unicode 
Unicode encoding combines the character sets from several different languages/scripts of the world into one single scheme. The codes assigned to characters from each script are identified with unique names given to each character in the set. Characters which combine diacritics have individual assignments in Unicode and therefore map to a single glyph. Unicode was originally defined as a sixteen-bit code (it has since been extended beyond sixteen bits) and requires specific support from the system for displaying the characters. In respect of Indian languages and others which employ a syllabic writing system, Unicode assignments are confined to the basic set of vowels, consonants and medial vowel representations. With this, a syllable has to be made up from a string of Unicode values and, for display, the string should map into a unique glyph. The recommended method for this is to use OpenType fonts.
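
As a small illustration of a syllable being made up from a string of Unicode values, the Python sketch below lists the code points of one Devanagari syllable using the standard unicodedata module. The choice of syllable ("kshi") and the helper name describe are ours, chosen only for the example.

```python
import unicodedata

# The Devanagari syllable "kshi": KA + VIRAMA + SSA + VOWEL SIGN I.
SYLLABLE = "\u0915\u094D\u0937\u093F"


def describe(text):
    """Return a list of (code point, Unicode character name) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    for code_point, name in describe(SYLLABLE):
        print(code_point, name)
    # By contrast, an accented Latin letter has its own single code point.
    print(describe("\u00E9"))
```

Four code values are needed for what is written as a single unit on the page, whereas the accented Latin letter is one code value. Combining the four into one shape is left to the font and the rendering system.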

Number of Glyphs required in practice

More glyphs in a font means that the font will permit display of a more comprehensive set of letters and characters. Some of the encodings allow as many as 240 or more glyphs to be handled. However, due to variations in text processing across different computers, only the first of the above-mentioned standards (ISO-8859-1) is really usable across different computers. In this standard approximately 190 glyphs are available for display.
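
The variation across computers mentioned above can be observed by interpreting the same eight-bit code under each of the three character sets. The sketch below uses Python's built-in codecs latin-1, cp1252 and mac_roman as stand-ins for ISO-8859-1, the Windows character set and the Mac encoding. The helper name interpret and the sample codes are our own choices.

```python
# Python codec names for the three character sets discussed above.
CHARSETS = {
    "ISO-8859-1": "latin-1",
    "Windows-1252": "cp1252",
    "Mac": "mac_roman",
}


def interpret(byte_value):
    """Return {character set label: character} for one eight-bit code."""
    raw = bytes([byte_value])
    return {
        label: raw.decode(codec, errors="replace")
        for label, codec in CHARSETS.items()
    }


if __name__ == "__main__":
    # 0x41 is plain ASCII; 0xE9 and 0x93 lie in the upper range.
    for code in (0x41, 0xE9, 0x93):
        print(hex(code), interpret(code))
```

Codes in the ASCII range come out the same everywhere, while codes in the upper range can yield a different character under each character set.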



Character sets versus Font Encoding

  The concept of the character set relates to how a computer system should interpret the codes in the text (i.e., figure out what letter of the alphabet or symbol the code actually represents). Clearly, one should know which character set has been used in creating text in a document to be able to display the text correctly. One does this normally by specifying the font to be used for generating the display and the specified font will correctly show the text if its encoding matches the mapping used in the character set. 

  Web browsers normally associate specific fonts with specific character sets. The character set specified in a web page will usually let the browser select a font whose encoding is consistent with that character set. This is not always possible, because a page is not required to indicate its character set; when none is indicated, the system default is assumed. Such pages will display differently on different browsers. Worse still, if an application has its own approach to interpreting the document, the resulting display will differ from what was intended. This is illustrated in the screen shots taken from a Macintosh. The image below shows the contents of testchar3.html as served from a web server and displayed by Netscape. Three of the characters in the upper ASCII range are not shown properly.

  

Netscape Rendering on a Mac

  In the second image, the Browser displays the same text from a copy of the same file saved in the system. This time the character set interpretation corresponds to that of the native encoding for the Mac.

Netscape displaying local file

  The third image displays the rendering of the text in a text editor, SimpleText. The locally saved file is opened in the editor. Also, a copy of the text from the Browser window is pasted into the application.

SimpleText displaying the contents

  An HTML file has been opened by the SimpleText application, which interprets the HTML contents and shows the text as per standard conventions. Two of the characters from the upper ASCII region in the text are correctly displayed whether coded as direct ASCII or in the form of an HTML entity. Netscape on the Mac, however, interprets the file differently when the page is treated as the contents of a local file, and the two characters now get interpreted differently. Also, when one copies and pastes text into SimpleText from the window of the browser, the problem continues, since SimpleText treats the contents as encoded in MacRoman! Similar problems arise on Windows systems as well: glyphs such as the soft hyphen are not rendered on Windows 2000/XP systems but are displayed properly under Windows 98.
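
The difference between a character coded directly as an eight-bit value and the same character coded as an HTML entity can be sketched in Python. The sample text here is our own (the contents of testchar3.html are not reproduced on this page), and the codec names latin-1 and mac_roman are Python's names for ISO-8859-1 and the Mac encoding.

```python
import html

# The same word coded two ways: with an HTML entity, and with a raw
# eight-bit code (0xE9) as it would be stored under ISO-8859-1.
ENTITY_FORM = "caf&eacute;"
RAW_FORM = b"caf\xe9"

# An entity names the character, so it does not depend on the character set.
from_entity = html.unescape(ENTITY_FORM)

# A raw code depends on the character set used to interpret it.
as_latin1 = RAW_FORM.decode("latin-1")
as_macroman = RAW_FORM.decode("mac_roman")

if __name__ == "__main__":
    print("entity:         ", from_entity)
    print("raw, ISO-8859-1:", as_latin1)
    print("raw, MacRoman:  ", as_macroman)
```

The entity form and the raw form agree only when the raw bytes are interpreted with the character set in which they were written, which is consistent with the differing renderings described above.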

  The rendering of the text on a Windows System with Opera results in the expected display.

Rendering of testchar3.html under Windows


Why should we be concerned about Font Encoding?

When viewing text in Indian scripts on a computer, or printing a document on paper, we need to use appropriate fonts. The font rendering application is usually given a series of codes (eight bits each), and the shapes associated with the codes are taken from the font and drawn on the screen or printed on paper.

The critical factor is that the association between an eight bit code value and a shape is dependent on not only the encoding chosen for the font but also the system that interprets the codes. 

Web Browsers are not really able to uniformly handle this problem and the consequence is that text displayed may not be uniform across web pages seen in different Browsers. However, the characters from the standard Roman alphabet, specified through ASCII codes are correctly interpreted and displayed on all systems, for these correspond to the English language.
 

It is possible to design fonts in such a way that this uniformity is achieved, at least in the case of the browsers used under Windows 9x, Unix and the Macintosh.




Last updated on 10/24/12