HKUST End-User Defined Characters


Abstract

At HKUST, in addition to the standard BIG-5 Chinese encoding scheme, different sets of End-User Defined Characters (EUDC) are implemented in the various Chinese systems on PC platforms. This paper summarizes the current implementation on each Chinese system, the migration problems that may arise, and the future development that we propose to pursue.


Why we need EUDC

Unlike Western languages, CJK languages are written with ideographic characters rather than a phonetic alphabet. Nobody expects a 27th letter to be added to the English alphabet, but the need for extra ideographic characters arises for several reasons.
Strictly speaking, a character carries the meaning, while a glyph is a visual representation of a character. A character may have more than one glyph. Examples are: "" vs. "", "" vs. "", "" vs. "", etc. Each of these pairs is actually the same character with different glyphs. However, in current system implementations, one character code can refer to only one glyph in presentation; multiple glyphs cannot be stored under a single character code. The only solution is to define a new dummy character with the required glyph.
Hong Kong, with its unique cultural history, has "invented" a rich set of Chinese characters that are included in neither the China GB-2312 nor the Taiwan BIG-5 encoding scheme. Examples are "", "", "" etc. All these characters need to be defined in addition to the established encoding schemes.
More and more new characters will arise as the culture evolves. Ancient Chinese characters will no longer appear in contemporary documents, while new characters may arise from processes such as character simplification and advances in professional dictionaries. All these characters need to be defined before they are incorporated into encoding schemes.

BIG-5 Encoding Scheme

The BIG-5 encoding scheme is defined in Technical Report C-26 of Taiwan's Institute for Information Industry (III). The encoding range can hold a total of 19,782 characters (126 lead bytes × 157 trail bytes). The specification of BIG-5 is as follows:
Byte                     Range           Count
High Byte (Lead Byte)    0xA1 to 0xFE
                         0x8E to 0xA0
                         0x81 to 0x8D    126 in total
Low Byte (Trail Byte)    0x40 to 0x7E
                         0xA1 to 0xFE    157 in total
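
The byte ranges above can be checked with a few lines of Python. This is only a minimal sketch: the names `LEAD_RANGES`, `TRAIL_RANGES`, `is_big5_code` and `span` are illustrative and do not come from the BIG-5 specification.

```python
# Lead (high) and trail (low) byte ranges as listed in the table above.
LEAD_RANGES = [(0x81, 0x8D), (0x8E, 0xA0), (0xA1, 0xFE)]
TRAIL_RANGES = [(0x40, 0x7E), (0xA1, 0xFE)]


def in_ranges(value, ranges):
    """Return True if value falls inside any inclusive (low, high) range."""
    return any(low <= value <= high for low, high in ranges)


def is_big5_code(code):
    """Check that a 16-bit code has a valid lead byte and a valid trail byte."""
    lead, trail = code >> 8, code & 0xFF
    return in_ranges(lead, LEAD_RANGES) and in_ranges(trail, TRAIL_RANGES)


def span(ranges):
    """Count the byte values covered by a list of inclusive ranges."""
    return sum(high - low + 1 for low, high in ranges)


lead_count = span(LEAD_RANGES)        # 126
trail_count = span(TRAIL_RANGES)      # 157
capacity = lead_count * trail_count   # 19782
```

The product of the two counts reproduces the 19,782 code points quoted for the whole encoding range.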
The whole BIG-5 encoding scheme comprises three coding regions.
Region  Description                      Character Type         Code Range       Characters
1       Standard Chinese Character Area  Very often used        0xA440 - 0xC67E  5401
                                         Often used             0xC940 - 0xF9D5  7652
                                         System Reserved        0xF9D6 - 0xF9FE  41
                                         Special Characters     0xC6A1 - 0xC8FE  408
2       Special Chinese Character Area   Standard Symbols       0xA140 - 0xA3BF  408
                                         Control Code           0xA3C0 - 0xA3E0  33
                                         Reserved Control Code  0xA3E1 - 0xA3FE  30
3       User Defined Area                1st segment            0xFA40 - 0xFEFE  785
                                         2nd segment            0x8E40 - 0xA0FE  2983
                                         3rd segment            0x8140 - 0x8DFE  2041
All EUDC should be created in the 3rd region as defined by BIG-5.
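
As an illustration, the region table can be turned into a small lookup that classifies a code and re-derives the character counts. This is a sketch under the assumption that a code is valid whenever its trail byte lies in 0x40-0x7E or 0xA1-0xFE; the names `AREAS`, `classify` and `count_codes` are hypothetical.

```python
# (area, character type, first code, last code, count), copied from the table above.
AREAS = [
    ("Standard Chinese Character Area", "Very often used", 0xA440, 0xC67E, 5401),
    ("Standard Chinese Character Area", "Often used", 0xC940, 0xF9D5, 7652),
    ("Standard Chinese Character Area", "System Reserved", 0xF9D6, 0xF9FE, 41),
    ("Standard Chinese Character Area", "Special Characters", 0xC6A1, 0xC8FE, 408),
    ("Special Chinese Character Area", "Standard Symbols", 0xA140, 0xA3BF, 408),
    ("Special Chinese Character Area", "Control Code", 0xA3C0, 0xA3E0, 33),
    ("Special Chinese Character Area", "Reserved Control Code", 0xA3E1, 0xA3FE, 30),
    ("User Defined Area", "1st segment", 0xFA40, 0xFEFE, 785),
    ("User Defined Area", "2nd segment", 0x8E40, 0xA0FE, 2983),
    ("User Defined Area", "3rd segment", 0x8140, 0x8DFE, 2041),
]


def is_valid_trail(byte):
    """A BIG-5 trail byte is 0x40-0x7E or 0xA1-0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def classify(code):
    """Return (area, character type) for a BIG-5 code, or None if unassigned."""
    if not is_valid_trail(code & 0xFF):
        return None
    for area, char_type, first, last, _count in AREAS:
        if first <= code <= last:
            return area, char_type
    return None


def count_codes(first, last):
    """Count the valid codes in an inclusive range by enumeration."""
    return sum(1 for code in range(first, last + 1) if is_valid_trail(code & 0xFF))


# Every row count in the table agrees with the enumerated range.
table_is_consistent = all(
    count_codes(first, last) == count for _a, _t, first, last, count in AREAS
)
```

For example, `classify(0x8D45)` returns `("User Defined Area", "3rd segment")`.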


What are HKUST EUDC

Since the establishment of HKUST in the early 1990s, ITSC has faced the problem that the characters defined in standard BIG-5 cannot meet our users' needs. Quite a number of Chinese characters are missing from the standard ETen system, the first Chinese system that we implemented. These characters come from the Chinese names of staff and students, as well as from addresses, and are primarily local to the culture of Hong Kong.

When we rolled out the ETen Chinese system on our network, no EUDC standard had been established in the region. As a result, ITSC selected the 3rd segment of the User Defined Area of the BIG-5 encoding scheme and created our own set of Chinese characters. The following table lists the 152 HKUST EUDC and their encodings as at 11-5-1999.


Current Situation

The encoding specifications of Region 1 and Region 2 are well defined in the BIG-5 standard. The most troublesome region is the User Defined Area. Our current implementation is as follows:

ETen DOS System (倚天中文系統)

Segment  Description                                                      Code Range       Characters
1        Simplified Chinese characters copied from KuoChiau (國喬中文系統)  0xFA40 - 0xFEFE  785
2        Simplified Chinese characters copied from KuoChiau (國喬中文系統)  0x8E40 - 0x98C5  1670
3        HKUST End User Defined Characters                                0x8D45 - 0x8DFE  152

Chinese Windows 3.1

Segment  Description                        Code Range       Characters
1        R&B EUDC Standard                  0xFA40 - 0xFEFE  784
2        Unused
3        HKUST End User Defined Characters  0x8D51 - 0x8DFE  140

RichWin 4.2+/RichWin 97

Segment  Description                  Code Range       Characters
1        GCCS Standard                0xFA40 - 0xFEFE  3049 (segments 1 and 2 combined)
2        GCCS Standard                0x8E40 - 0xA0FE
3        Unused (Cannot Build EUDC)

Chinese Windows 95

Segment  Description                        Code Range       Characters
1        GCCS Standard                      0xFA40 - 0xFEFE  3049 (segments 1 and 2 combined)
2        GCCS Standard                      0x8E40 - 0xA0FE
3        HKUST End User Defined Characters  0x8D45 - 0x8DFE  152
EUDC Migration Issues

From ETen to Chinese Windows 3.1

No conversion is available from ETen EUDC to Chinese Windows 3.1 EUDC for characters that fall in EUDC segments 1 and 2. If a document was created in the ETen environment with such EUDC, there may be no corresponding character in Chinese Windows 3.1.
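
One way to find out whether a document is affected is to scan it for codes in the User Defined Area. The sketch below assumes the input is plain BIG-5 text in which every byte below 0x81 is a single-byte character; the function name `find_eudc` is hypothetical and this is not a tool shipped with any of the systems discussed.

```python
# User Defined Area segments, from the BIG-5 region table.
USER_DEFINED_SEGMENTS = {
    1: (0xFA40, 0xFEFE),
    2: (0x8E40, 0xA0FE),
    3: (0x8140, 0x8DFE),
}


def is_trail(byte):
    """A BIG-5 trail byte is 0x40-0x7E or 0xA1-0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def find_eudc(data):
    """Yield (offset, code, segment) for each user-defined code in BIG-5 bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        # A lead byte followed by a valid trail byte forms one double-byte code.
        if 0x81 <= lead <= 0xFE and i + 1 < len(data) and is_trail(data[i + 1]):
            code = (lead << 8) | data[i + 1]
            for segment, (first, last) in USER_DEFINED_SEGMENTS.items():
                if first <= code <= last:
                    yield i, code, segment
            i += 2
        else:
            # Single-byte character (or a stray byte): skip it.
            i += 1


sample = b"AB" + bytes([0xA4, 0x40, 0x8D, 0x45]) + b"C" + bytes([0xFA, 0x40])
hits = list(find_eudc(sample))
# hits == [(4, 0x8D45, 3), (7, 0xFA40, 1)]
```

Hits in segments 1 and 2 are the ones with no direct counterpart on the target system.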

One workaround is to use the Traditional/Simplified Chinese conversion program (繁簡體轉換程式) available in the ETen Chinese system. This program converts a text file containing Chinese characters in EUDC segments 1 and 2 to standard BIG-5. The converted document can then be opened and read in Chinese Windows 3.1.

Note that the Chinese Windows 3.1 environment has 12 fewer HKUST EUDC, because its EUDC set reflects the state as at March 1996.
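
The difference of 12 characters follows from the two HKUST EUDC code ranges listed under Current Situation (0x8D45 - 0x8DFE on ETen, 0x8D51 - 0x8DFE on Chinese Windows 3.1). The sketch below assumes that every slot in each range is occupied, which is consistent with the counts of 152 and 140; the helper name `eudc_codes` is hypothetical.

```python
def eudc_codes(first, last):
    """List the valid BIG-5 codes in an inclusive range."""
    return [
        code
        for code in range(first, last + 1)
        if 0x40 <= (code & 0xFF) <= 0x7E or 0xA1 <= (code & 0xFF) <= 0xFE
    ]


eten = eudc_codes(0x8D45, 0x8DFE)    # 152 codes on ETen
win31 = eudc_codes(0x8D51, 0x8DFE)   # 140 codes on Chinese Windows 3.1

# Codes present on ETen but absent on Chinese Windows 3.1.
missing = sorted(set(eten) - set(win31))
# 12 codes, running from 0x8D45 to 0x8D50
```

Under that assumption, documents that use codes 0x8D45 to 0x8D50 are the ones at risk when opened in Chinese Windows 3.1.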

From ETen to RichWin 4.2+/RichWin 97

The EUDC creation tool that comes with RichWin 4.2+/RichWin 97 only supports creating characters in the GB locale. Hence HKUST EUDC cannot be ported to the RichWin system.

There is no tool at all for converting a document created in ETen with EUDC to RichWin.

Remark: To make RichWin properly support GCCS EUDC, one should install the additional font that is available in the RichWin installation directory.

From ETen to Chinese Windows 95

As with Chinese Windows 3.1, no conversion is available from ETen EUDC to Chinese Windows 95 EUDC for characters that fall in EUDC segments 1 and 2. However, the same workaround of converting the characters within the ETen system also applies.

From Chinese Windows 3.1 to Chinese Windows 95

A special EUDC conversion utility is available on the Chinese Windows 95 platform. It converts a text file from R&B EUDC encoding to GCCS encoding and leaves characters in other regions (including HKUST EUDC) intact. However, the conversion is not a complete mapping, since some characters defined in R&B EUDC are still missing from GCCS EUDC.

One major drawback of this conversion tool is that the formatting of the original document is lost, since the tool only converts text files.
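
The behaviour described above (remap R&B codes, leave everything else untouched, report what cannot be mapped) can be sketched as follows. This is not the actual utility: the function `convert` is hypothetical, and the two entries in `RB_TO_GCCS` are invented placeholders, because the real R&B-to-GCCS mapping table is not given in this paper.

```python
# HYPOTHETICAL mapping entries for illustration only; the real table is not
# reproduced in this paper.
RB_TO_GCCS = {0xFA40: 0x8E40, 0xFA41: 0x8E41}

# R&B EUDC occupies segment 1 on Chinese Windows 3.1.
RB_FIRST, RB_LAST = 0xFA40, 0xFEFE


def is_trail(byte):
    """A BIG-5 trail byte is 0x40-0x7E or 0xA1-0xFE."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def convert(data, mapping=RB_TO_GCCS):
    """Remap R&B EUDC codes; return (converted bytes, unmapped (offset, code) list)."""
    out = bytearray()
    unmapped = []
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE and i + 1 < len(data) and is_trail(data[i + 1]):
            code = (lead << 8) | data[i + 1]
            if RB_FIRST <= code <= RB_LAST:
                if code in mapping:
                    code = mapping[code]
                else:
                    # No GCCS counterpart known: keep the code and report it.
                    unmapped.append((i, code))
            out += code.to_bytes(2, "big")
            i += 2
        else:
            out.append(lead)
            i += 1
    return bytes(out), unmapped


source = bytes([0x41, 0xFA, 0x40, 0x8D, 0x45, 0xFA, 0x42])
converted, unmapped = convert(source)
# converted == bytes([0x41, 0x8E, 0x40, 0x8D, 0x45, 0xFA, 0x42])
# unmapped == [(5, 0xFA42)]
```

In the example, the HKUST EUDC code 0x8D45 passes through unchanged, while 0xFA42 is reported because the placeholder table has no entry for it.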

From RichWin 4.2+/RichWin 97 to Chinese Windows 95

No conversion is necessary at all, since both systems support GCCS EUDC. Chinese Windows 95 has the richer EUDC set because it also incorporates HKUST EUDC.
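
The migration cases above can be summarized by comparing what each system stores in the three user-defined segments. The sketch below encodes the tables from Current Situation; the names `SYSTEMS` and `segments_at_risk` are hypothetical, and the comparison is deliberately coarse (for example, it ignores the 12 HKUST EUDC missing from Chinese Windows 3.1).

```python
# Content of each user-defined segment per system; None means unused.
SYSTEMS = {
    "ETen DOS": {1: "KuoChiau simplified", 2: "KuoChiau simplified", 3: "HKUST EUDC"},
    "Chinese Windows 3.1": {1: "R&B EUDC", 2: None, 3: "HKUST EUDC"},
    "RichWin 4.2+/RichWin 97": {1: "GCCS", 2: "GCCS", 3: None},
    "Chinese Windows 95": {1: "GCCS", 2: "GCCS", 3: "HKUST EUDC"},
}


def segments_at_risk(source, target):
    """List segments whose source content the target does not hold as-is."""
    src, dst = SYSTEMS[source], SYSTEMS[target]
    return [seg for seg in (1, 2, 3) if src[seg] is not None and src[seg] != dst[seg]]


eten_to_win31 = segments_at_risk("ETen DOS", "Chinese Windows 3.1")                # [1, 2]
eten_to_richwin = segments_at_risk("ETen DOS", "RichWin 4.2+/RichWin 97")          # [1, 2, 3]
win31_to_win95 = segments_at_risk("Chinese Windows 3.1", "Chinese Windows 95")     # [1]
richwin_to_win95 = segments_at_risk("RichWin 4.2+/RichWin 97", "Chinese Windows 95")  # []
```

An empty list corresponds to the case where no conversion is needed; a non-empty list names the segments that need a conversion tool or a workaround.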

Future EUDC Adoption

Strictly speaking, GCCS EUDC is not an industry standard at present. However, it is the richest character set available and includes most of the Chinese characters that are local to Hong Kong's culture. Chinese system vendors appear willing to support this character set. For example, Microsoft incorporated it into its product, Pan Chinese Windows NT 4.0, in early 1997.

Currently, GCCS EUDC has 3,049 Chinese characters defined and is growing at a rate of around 300 characters per year. Although Chinese system vendors may not keep pace with the updates, GCCS EUDC is expected to become a semi-de-facto encoding standard in Hong Kong.

A detailed study of HKUST EUDC reveals that only 13 out of the 152 characters are missing from GCCS EUDC. Most of these 13 characters are simply different glyphs of other defined characters.

To avoid diverging further from GCCS EUDC, migration to the Chinese Windows 95 platform offers a solution for effective Chinese information exchange.

To conclude, adhering to the GCCS EUDC standard will provide the richest set of traditional Chinese characters, which in turn will meet the regional need for Chinese characters. Adopting this standard may also facilitate document exchange not only within the campus but throughout the SAR.