Tamil Standard Code for Information Interchange (TSCII) -part II

This proposal outlines a Tamil font encoding standard that evolved through mailing-list discussions on the Internet over a two-year period (1995-1997). The draft was put up for comments and evaluation by the Internet audience in late '97 and was available as a web page for a period of six months. Field tests were carried out in several areas of Tamil computing using sample font faces. The enclosed draft is the final, revised version that came out of this exercise. The present proposal, drafted by the Internet Working Group for the Tamil Standard Code (IWG-TSC), is being submitted to the Tamilnadu Government for possible adoption.

A Proposal for
A Tamil Standard Code For Information Interchange
(TSCII) - part II


Description of the proposed Standard Code for Tamil TSCII

Glyph Choices

The proposed Tamil standard is an 8-bit bilingual scheme with the following selection of glyphs:

Slot positions 0-127 /rows 0-7:

Roman characters and punctuation marks - glyph choices identical to those in standard lower ASCII code / 8859-1 (Latin-1) scheme.

Slot positions 128-255 /rows 8-15:

i) entire set of vowels (uyir) - 13 and consonants (mei with puLLi) - 18:
a, aa/A, i, ii/I, u, uu/U, e, ee/E, ai, o, oo/O, au and ak/aytham (13);
k, ng, c, nj, t/d, N, th/dh, n^/n-, p, m, y, r, l, v, L, zh, R, n (18);

ii) entire set of akaram-eRRiya meis (18), ukara varisai (18) and uukara varisai (18);

iii) consonant-modifiers for Tamil characters: aakara, ikara, iikara, ekara, Ekara, aikara and au-kara varisai modifiers (7);

iv) Tamil alphabets ti and tii/tI (2)

v) grantha characters (13): ja, sha, sa, ha, ksha (5) in the vowel form, the corresponding akara varisai (5), consonant-modifiers for the ukara and uukara varisai of the grantha characters (2) and sri (1);

vi) Tamil numerals for 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100 and 1000 (13) and

vii) special characters: curly single and double quotes (#145 - #148, 4) and copyright sign (#169) at their respective ANSI slot positions.
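As a quick cross-check of the list above, the glyph counts can be added up and compared with the 128 slots of the upper half. The following is only a sketch: it assumes that every listed glyph occupies exactly one slot, and the dictionary keys are informal labels chosen for illustration, not names taken from the standard.

```python
# Glyph counts taken from items i) to vii) above.
# Assumption: each glyph occupies exactly one slot in positions 128-255.
glyph_counts = {
    "vowels (uyir, incl. aytham)": 13,
    "consonants (mei with puLLi)": 18,
    "akaram-eRRiya meis": 18,
    "ukara varisai": 18,
    "uukara varisai": 18,
    "consonant-modifiers": 7,
    "ti and tii": 2,
    "grantha characters": 13,
    "Tamil numerals": 13,
    "curly quotes": 4,
    "copyright sign": 1,
}

UPPER_SLOTS = 256 - 128  # slot positions 128-255

used = sum(glyph_counts.values())
vacant = UPPER_SLOTS - used

print(f"slots used: {used}, vacant: {vacant}")
```

Under that assumption the tally comes to 125 used slots and 3 unassigned ones, which agrees with the three vacant slots (#160, #254 and #255) that the proposal mentions.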

Figure 1 presents the proposed encoding scheme, showing the glyph choices with their slot allocations. The following paragraphs explain the rationale behind the above selection of glyphs and the specific slot assignments indicated in the figure. Before this, it is useful to comment on the presence of some of the special characters (listed under vii above) and the three vacant slots (#160, #254 and #255) in the encoding scheme. The copyright sign (at its ANSI slot #169) is increasingly used in Internet-based information exchange, particularly in web pages. Its presence avoids unnecessary switching to other roman font faces.

One of the goals of the present encoding scheme is to allow ready use of shrink-wrap software (for word-processing, graphics etc.) that is already available for English and European languages. In most of these packages, for a more pleasing display of text, the straight forms of single (') and double (") quotes are replaced by the corresponding curly quote forms. The Windows ANSI scheme has these curly quotes at slots #145-148 (Latin-1 proper has no printable characters in rows 8 and 9), and most software performs this "glyph substitution" by default. Hence it was felt useful to have these four curly quotes at their "usual" slots so that TSCII-based Tamil fonts can be used with existing commercial software without any problem.

The presence of two vacant slots (#254 and #255) as a "private use area" is another notable feature of the proposed encoding scheme. What are the possible uses of such a private use area? These slots can serve as "escape slots" through which software developers can bring in many more characters for special applications, for example to invoke glyph-substitution routines such as those required for recording and processing old classical Tamil texts that still exist only as palm-leaf manuscripts. It should be noted that none of the standard software written specifically for Tamil will use the characters placed in these slots in its search and sort routines. Hence use of this private area to place old-style Tamil characters such as lai/Nai/nai or Raa/Naa/naa in electronic archives, electronic texts and other digital libraries is not encouraged, so as to prevent misinterpretation of their "values" or "meanings".

Rationale for the glyph choices and their slot assignments in TSCII

The Tamil encoding scheme must be able to handle (i.e. display and print) the entire Tamil alphabet (247 characters), the grantha characters (13) and the Tamil numerals (13). The number of unique Tamil glyphs to consider is about 170. Since the number of slots available in the upper-ASCII segment (#128-255) is much smaller than the number needed to allocate one slot to each glyph, choices have to be made on which of the Tamil alphabets are to be included in native form, which are to be generated using modifiers (several keystrokes in sequence), and which slot each of the chosen ones should occupy. The choice of glyphs will determine the quality of the output, and the slot allocations will determine problem-free performance on different computer platforms. We discuss the rationale for the glyph choices first and then the specific slot assignments.

Glyph Choices

Two factors guide the selection of glyphs. One is the frequency of occurrence of the alphabets in a typical Tamil text; the other is the structural complexity of the Tamil alphabets, which determines whether they can be generated well in on-screen display and in print, with or without the aid of kerning and other basic font-handling techniques that have been available for over a decade on all computer platforms. It is not a good idea to go for an encoding scheme where 80% of the chosen glyphs account for less than 30% of the actual text. Admittedly, the quality of any font face (the outline definition of the glyphs) will largely determine the quality of the output. Assuming that the glyphs in the font face are of exceptionally good quality, the more alphabets there are in native form, the higher the quality of the Tamil text will be. A good balance has to be struck between frequency of occurrence and the structural complexity of some of the alphabets.

There have been several analyses of the frequency of occurrence of Tamil alphabets, and they have been used earlier in determining keyboard layouts. The following is one such result, with frequencies given in percentages [15]: basic vowels (uyir) - 7.00%; basic consonants (mei, with puLLi: ik, il, ...) - 28.85%; akara varisai (ka, nga, ...) - 23.50%; aakara varisai - 6.39%; ikara varisai - 11.11%; iikara varisai - 0.70%; ukara varisai - 11.88%; uukara varisai - 0.62%; ekara varisai - 1.44%; eekara varisai - 1.88%; okara varisai - 1.19%; ookara varisai - 1.11%; aikara varisai - 4.41%; aukara varisai - 0.04%.

How can we ensure "high quality" production in a glyph-encoding based standard?

Fortunately, several of the Tamil alphabets are written as a composite of two or three basic components (referred to as "modifiers"), e.g. the aakara, ekara, Ekara, okara, Okara, aikara and aukara varisai alphabets. Along with the basic consonants (mei), it suffices to have a select collection of modifiers (aakara, ekara, Ekara, aikara, aukara) in the encoding scheme to generate all these compound (uyirmei) characters. There is no need to have them as unique glyphs in the encoding scheme. The same logic applies to the grantha series: it suffices to include the special ukara and uukara modifiers and to use the Tamil modifier glyphs for the rest.
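The composition idea can be sketched as a small lookup table. This is an illustration only: the series names and the angle-bracket placeholders are informal labels, not TSCII slot values, and the sketch assumes that the glyphs are stored in visual (left-to-right) order, with "C" standing for the akara-form consonant glyph.

```python
# Visual-order glyph sequences for the composite uyirmei series.
# "C" marks the position of the consonant glyph (akara form, e.g. "ka").
# Assumption: labels and ordering are illustrative, not taken from the standard.
COMPOSITION = {
    "aakara": ["C", "aa"],
    "ekara": ["e", "C"],
    "Ekara": ["E", "C"],
    "aikara": ["ai", "C"],
    "okara": ["e", "C", "aa"],
    "Okara": ["E", "C", "aa"],
    "aukara": ["e", "C", "au"],
}


def compose(consonant: str, series: str) -> list[str]:
    """Return the glyph sequence for one uyirmei, e.g. ("ka", "okara")."""
    return [
        consonant if part == "C" else f"<{part}-modifier>"
        for part in COMPOSITION[series]
    ]


# The distinct modifier glyphs that these seven series require.
modifiers_needed = {
    part for parts in COMPOSITION.values() for part in parts if part != "C"
}

print(compose("ka", "okara"))
print(sorted(modifiers_needed))
```

In this sketch seven series are covered by five modifier glyphs (aa, e, E, ai, au), because the okara, Okara and aukara forms reuse the e/E glyph on the left and add a second component on the right.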

The basic consonants (mei with puLLi) are the most frequently occurring of all the series (>28%). The classical Tamil typewriter has a single dot (puLLi) that is added to the akara varisai uyirmeis, and many of the 7-bit Tamil fonts have adopted this option, placing the puLLi through kerning. The widths of the meis are not all the same, however, so with a single puLLi it is not possible to make it sit right at the middle of every character. Hence, for aesthetic quality, it is preferable to keep the basic consonants (18) as native glyphs. The akaram-eRRiya meis (ka, nga etc.) come next in frequency, and they must be included in any case because they are used to generate the ekara, Ekara, okara and Okara varisais. Hence these akara varisai alphabets (18) are also kept in native form.

The two related groups, ikara and iikara varisai on the one hand and ukara and uukara varisai on the other, have nearly the same total frequency of occurrence (ca. 12% each). So a priori either group could be kept as native glyphs and the other generated by kerning. But when one looks at the glyph forms, the ukara and uukara varisai uyirmeis are notorious for unique forms that cannot be generated through simple kerning techniques. In view of this, it is preferable to keep the ukara and uukara varisai intact as native glyphs (2 x 18). Even in the ikara and iikara varisai, complex structures such as ti and tii, which are not easily obtained via kerning (and which also occur frequently!), are preferably kept as native glyphs. This selection leaves the rest of the ikara and iikara varisai to be generated using the corresponding modifiers (kokki / kombu) and kerning techniques.

Slot allocation

Before getting into the details of specific allocations, it is useful to recognize that in the real world (except for the very recent generation of 16- and 32-bit computers) the 128 slots in the upper ASCII segment are not all used the same way. The widely used ISO 8859-x encoding schemes for European languages (e.g. Latin-1) have characters only in rows 10-15. Many Internet protocols and applications assume Latin-1 as the "default standard", and information exchange is guaranteed for this standard. For example, until the recent release of the Unicode-based HTML 4.x standard and associated web browsers, all web pages on the Internet were based on HTML version 3.x with Latin-1 as the default standard.

Operating system software for Windows (ANSI) and Macintosh employs 8-bit encoding schemes that use the entire 128 slots available in the upper ASCII segment (though the slot allocations for individual glyphs differ). Unix uses a scheme similar to Latin-1. Traditionally, a number of slots in the upper ASCII segment, particularly those in rows 8 and 9, were used for control operations in Windows-based software. Fontographer Technical notes (#3700) identify the following slots as not usable ("non-printable") for a font with the standard ANSI character set: decimal 127, 128, 129, 141, 142, 143, 144, 157, 158 and 160. As a safety measure, much Windows software therefore rarely uses these character slots of rows 8 and 9. In the early days, another reason for not using rows 8 and 9 was the fear of byte-stripping.

In view of the above two real-world situations (the working of the Web and the encoding schemes used in the popular Windows OS), it is desirable to split the upper ASCII segment into three parts:
i) a reliable Latin-1 segment (slots 160-255);
ii) a "hot spot" zone consisting of the eight slots cited above that fall in rows 8 and 9 (128, 129, 141-144, 157 and 158); and
iii) "cold/safe" slots, i.e. everything else in rows 8 and 9.
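The three-way split can be expressed as a small classifier. The sketch below is a minimal illustration; the function name and the zone labels are hypothetical, and the "hot" set is taken to be the eight slots from the Fontographer list that lie in rows 8 and 9.

```python
# The eight "hot spot" slots in rows 8 and 9 (from the Fontographer list;
# 127 and 160 in that list lie outside rows 8 and 9).
HOT_SLOTS = frozenset({128, 129, 141, 142, 143, 144, 157, 158})


def classify_slot(slot: int) -> str:
    """Return the zone of an 8-bit slot (labels are illustrative)."""
    if not 0 <= slot <= 255:
        raise ValueError("slot must be in the range 0-255")
    if slot < 128:
        return "ascii"    # rows 0-7, roman characters
    if slot >= 160:
        return "latin1"   # rows 10-15, the reliable segment
    return "hot" if slot in HOT_SLOTS else "cold"  # rows 8 and 9


def row_of(slot: int) -> int:
    """Row number in the 16 x 16 code chart."""
    return slot // 16


cold_slots = [s for s in range(128, 160) if classify_slot(s) == "cold"]
print(len(cold_slots), classify_slot(145), row_of(145))
```

With this split, rows 8 and 9 contribute 24 "cold" slots, and the curly-quote slots 145-148 fall among them.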

All the important glyphs corresponding to the Tamil alphabet are collected in the main Latin-1 segment. Smart quotes (curly single and double, left and right) and the copyright sign are left at their Windows/ANSI slots. The smart quotes stay at slots 145-148 because most recent software (word-processing, graphics, database, ...) replaces straight quotes by smart quotes as a default option. This way current shrink-wrap software for English can be used for Tamil computing without the need for special or dedicated software. All other glyphs, namely the Tamil numerals, the grantha characters and the rarely used Tamil uyirmeis nju, ngu, njU and ngU, are moved up to rows 8 and 9. Here again, the grantha characters and the Tamil glyphs nju, ngu, njU and ngU are placed in the "cold" slots. In view of their rare usage, the Tamil numerals are spread amongst the "hot spot" slots cited above.

It should be emphasized that this arbitrary division of the chosen glyphs into three groups, with the granthas placed in rows 8 and 9, does not mean that the latter are given secondary importance. It does make sorting a bit more complex. But we chose this arrangement to achieve reliable rendering of "pure Tamil text" (clean and very readable) even under poor implementations of TSCII and on relatively old computer systems that cannot run recent versions of software and web browsers. The TSCII standard is designed to provide some "backward compatibility" with early-generation computers (at least those bought within the last decade).

It should also be emphasized that present-generation computers are 16- or 32-bit machines capable of handling 8-bit encoded information cleanly with appropriate software, and this includes "printable" characters placed in rows 8 and 9. KOI8-R, for example, is a well-established de facto standard 8-bit character set for Russian/Cyrillic that has the whole of rows 8 and 9 filled with "printable" characters. We have already successfully tested the display and printing of the entire TSCII glyph set in word processing using model font faces. Hence there is no need for any anxiety in this regard.

TSCII encoding scheme is capable of meeting special requirements of select application scenarios.

What about the special needs of publishing houses that require high-quality output? Can a standard based on glyph encoding guarantee this?

With the choice of glyphs discussed above, about 87.07% of the Tamil characters are accommodated in native form in the encoding scheme: meis with puLLi 28.85%; basic meis (akaram-eRRiya meis) 23.50%; ukara varisai 11.88%; entire uyirs 7.00%; aakara varisai (with stand-alone "aa" modifier) 6.39%; aikara varisai (with stand-alone "ai" modifier) 4.41%; eekara varisai (with stand-alone "ee" modifier) 1.88%; ekara varisai (with stand-alone "e" modifier) 1.44%; ti and tii 1.06%; uukara varisai 0.62%; and aukara varisai (with "e" and "au" modifiers) 0.04%.
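The 87.07% figure can be reproduced by adding up the percentages quoted above. The snippet below is a simple arithmetic check, not part of the proposal; the dictionary keys are shorthand labels, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Frequencies (percent) of the classes counted as "native" in the text.
native_share = {
    "mei with puLLi": Decimal("28.85"),
    "akaram-eRRiya mei": Decimal("23.50"),
    "ukara varisai": Decimal("11.88"),
    "uyir": Decimal("7.00"),
    "aakara varisai": Decimal("6.39"),
    "aikara varisai": Decimal("4.41"),
    "eekara varisai": Decimal("1.88"),
    "ekara varisai": Decimal("1.44"),
    "ti and tii": Decimal("1.06"),
    "uukara varisai": Decimal("0.62"),
    "aukara varisai": Decimal("0.04"),
}

native_total = sum(native_share.values())

# ikara + iikara varisai, minus the natively encoded ti and tii,
# is the part that relies on a kerned modifier.
kerned_ikara_iikara = Decimal("11.11") + Decimal("0.70") - Decimal("1.06")

print(f"native: {native_total}%  kerned ikara/iikara: {kerned_ikara_iikara}%")
```

The sum comes to 87.07%, and the ikara and iikara varisai characters that still depend on kerning (after ti and tii are taken out) account for 10.75% of a typical text under these figures.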

What do these statistics mean?

It means that nearly 87% of the Tamil characters that you see on screen or in print will appear as native ones without any kerning. Their quality will depend purely on the quality of the font face design. Even among the ca. 13% generated via kerning (mainly the ikara and iikara varisai), the majority can be generated quite satisfactorily using kerning procedures. Kerning is a routine font-handling technique now available on all of the common computer platforms and operating systems. Because their modifier sits at the right end, the ikara and iikara varisai uyirmeis can be rendered fairly precisely on all platforms. So it is likely that, using the proposed glyph encoding scheme, over 98% of the Tamil characters can be rendered easily on screen and in print without any loss of quality. Techniques such as pair-wise kerning can handle even the residue adequately.

Professional publishing houses with more stringent requirements on glyph display invariably use more sophisticated printing equipment and high-end computer systems. Advanced font-handling techniques such as glyph substitution (GSUB), with or without Open TrueType fonts, are already implemented at the OS level. Hence it should not be a problem in these cases to use dedicated software in which single-glyph forms of these alphabets are stored elsewhere and brought in wherever they are needed.

It may be pointed out here that many 7-bit Tamil fonts are already sold commercially, and many of these are used by commercial publishing houses. Invoking kerning does not mean that good-quality output cannot be produced. Closer examination of the widely used classical Tamil typewriter-type encoding, for example, will show that more than 50% of the characters in it involve kerning. In this respect the proposed scheme is far better than any of the existing ones.

In any glyph-encoding scheme, it is preferable to keep as many of the uyirmei series (varisais) complete as possible. This greatly facilitates sorting and related operations, helps beginners who learn the language directly from computers, and is more logical for those who use different types of keyboard input implementations. It is not a good idea to push the statistics logic too far and go for a scheme that simply picks the most frequently occurring alphabets so as to maximize output quality. The result would be a jumbled collection without any logic to help end-users.

Some participants in the discussions raised reservations about generating the ikara and iikara varisai using a single modifier for each. The argument stated earlier about adding a single puLLi to the akara varisai uyirmeis to get the basic meis (ik, il etc.) holds good in these cases as well. What about using a series of kokkis and kombus with different widths? When we choose to have a modifier, to avoid ambiguity it is preferable not to have many different kokkis doing the same job, some to go with ka/ca, some to go with Na, na etc. The presence of multiple "modifiers" for the same series of uyirmeis can render the glyph choices non-unique and pose problems in search and sorting routines.
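The search problem caused by multiple modifiers can be illustrated with a toy example. Everything here is hypothetical: the token names such as "kokki-narrow" and "kokki-wide" are made up for the illustration and do not correspond to any TSCII slot; text is modelled simply as a list of glyph tokens.

```python
def contains(text: list[str], pattern: list[str]) -> bool:
    """Naive glyph-by-glyph search for pattern inside text."""
    n = len(pattern)
    return any(text[i:i + n] == pattern for i in range(len(text) - n + 1))


# One kokki for the whole series: the syllable "Ni" has exactly one spelling.
text_single = ["ka", "kokki", "Na", "kokki"]
query_single = ["Na", "kokki"]

# Two width variants (hypothetical): the same syllable now has two spellings,
# and a query typed with the other variant no longer matches.
text_multi = ["ka", "kokki-narrow", "Na", "kokki-wide"]
query_multi = ["Na", "kokki-narrow"]

found_single = contains(text_single, query_single)
found_multi = contains(text_multi, query_multi)

print(found_single, found_multi)
```

With a single modifier the naive search succeeds; with two variants it fails unless every search and sort routine first normalises the variants, which is the extra burden the paragraph above warns about.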

What about the display in Point-of-Sale (POS) terminals?

Excessive use of kerning can pose problems for character display on POS terminals. As stated above, kerning is invoked in only two series (ikara and iikara varisai). Even if the kokki falls apart from its consonant in these cases on primitive POS systems, the Tamil text should still be readable. Since the screen is constantly rewritten, there should not be any problem in displaying all the characters. Even so, on character-only terminals certain characters cannot be rendered legibly (Na, ha, ksha and sa may not fit in the usual 8 x 12 cell).

Concluding Remarks

The proposed glyph encoding scheme as a standard for Tamil computing is the outcome of nearly two years of intense deliberation by many experts in a public forum, accompanied by extensive field testing using model font faces, text-conversion software and sorting software. It has the support of a broad spectrum of the Internet Tamil community (developers of popular freeware and commercial font faces and word-processing software, webmasters of comprehensive Tamil websites, distinguished Tamil scholars committed to building a vast electronic library of Tamil literary classics, etc.). In the short span of two months since the present encoding scheme was adopted in its final form by the Internet Working Group, several commercial Tamil software developers have already started distributing free TSCII-based tools: Tamil font faces and keyboard editors for Windows and Mac platforms, text converters that go between TSCII and popular Tamil font faces, and email software that allows exchanges directly in Tamil. A dedicated web site for TSCII has also been set up to provide all the necessary technical assistance for quick implementation of the standard and to serve as "the site" where anyone can download TSCII-based tools of this kind. Hence we strongly believe that the proposed standard is a very viable one that will deliver what it promises. We sincerely hope that the Tamilnadu Government will give a fair hearing to this proposal and possibly adopt it as the Standard for Tamil Computing as soon as possible.

References

1. V.S. Rajam, A Reference Grammar of Classical Tamil Poetry, American Philosophical Society, Philadelphia, p. 1 (1992); Encyclopaedia Britannica, vol. 21, p. 647-648, 1972 ed.
2. A pot-pourri of web-sites of interest to Tamils:

http://www.oocities.org/Athens/5180/tlinks.html
3. TamilNet'97: International Symposium for Information Processing and Internet Resources in Tamil, National Univ. of Singapore, Aug. 1997:

http://www.irdu.nus.sg/tamilweb/tamilnet97/paper/html/content.html
4. An overview of different approaches to word-processing in Tamil and a proposal towards standardisation of tamil computing, K. Kalyanasundaram, Talk delivered at TamilNet'97:

http://www.oocities.org/Athens/5180/sintalk1.html
5. Recommended Keyboard Layouts for Tamil Computing of the Tamilnadu Tamil Computing Standardisation Advisory Committee:

http://irdu.nus.sg/tamilweb/tamilnet97/doc-draft/keyboard-layout.html
6. Internet-based Tamil Email Discussion List Tamilnet :

http://www.tamil.net/
7. Internet-based Tamil Email Discussion List TamilWeb :

http://irdu.nus.sg/tamilweb/mailinglist.html
8. Indian Standard Code ISCII and Inscript Keyboard layout:

http://www.cdac.org.in/html/gist/articles.htm
9. Unicode Standard for Multilingual Computing:

http://www.unicode.org/
10. Unicode and Tamil: Issues with Implementation, Muthulilan M. Nedumaran

http://www.irdu.nus.sg/tamilweb/tamilnet97/paper/html/muthu.htm
11. Multi-lingual Support on Windows

http://www.microsoft.com/typography/multilang/default.htm
12. Indian Language Kit for Macintosh OS from Apple:

http://www.apple.com/macos/multilingual/indian.html
13. Open Truetype Font Technology

http://www.adobe.com/supportservice/devrelations/opentype/otover.htm

http://www.microsoft.com/globaldev/gbl-gen/codesets/truetype.htm

http://www.microsoft.com/opentype/tt/tt.htm
14. Truetype GX /QuickDraw GX Technology of Apple

http://developer.apple.com/techpubs/mac/GXTypography/GXTypography-2.html
15. Encyclopaedia of Tamil Literature, Institute of Asian Studies, Chennai, 1983, p. 109

A draft version of the proposal was first put up in the Internet on Dec. 2, 1997 and this file was last revised on October 27, 1998.
Please send your comments to Dr. K. Kalyanasundaram
