HKUST End-User Defined Characters
Abstract
At HKUST, in addition to the standard BIG-5 Chinese encoding scheme, different
sets of End-User Defined Characters (EUDC) are implemented in the different
Chinese systems on PC platforms. This paper summarizes the current
implementation on each Chinese system, the migration problems that may arise,
and the future development that we propose to pursue.
Why we need EUDC
Unlike western languages, CJK languages are written with ideographic characters
rather than a phonetic alphabet. Nobody expects a 27th letter to be added to
the English alphabet, but the need for extra ideographic characters arises
for several reasons.
-
Strictly speaking, a character carries the meaning, while a glyph is a visual
representation of a character. A character may have more than one glyph.
(The example glyph pairs were shown as images in the original and are not
reproduced here.) Each such pair is actually the same character drawn with
different glyphs. However, in current system implementations, one character
code can refer to only one glyph in presentation. There is no way to store
multiple glyphs under a single character code. The only solution is to define
a new dummy character with the required glyph.
-
Regional cultural difference
-
Hong Kong, with its unique cultural history, has "invented" a rich set
of Chinese characters that are included in neither the China GB-2312 nor
the Taiwan BIG-5 encoding scheme. (The example characters were shown as
images in the original and are not reproduced here.) All these characters
need to be defined in addition to the established encoding schemes.
-
New characters will continue to arise as the culture evolves, while
ancient Chinese characters disappear from contemporary documents.
New characters may come from processes such as character simplification
and from advances in professional dictionaries. All these characters need to
be defined by users before they are incorporated into encoding schemes.
BIG-5 Encoding Scheme
The BIG-5 encoding scheme is defined in Technical Report C-26 of Taiwan's
Institute for Information Industry (III). The encoding range can contain
a total of 19,782 characters (126 lead bytes x 157 trail bytes). The
specification of BIG-5 is as follows:
| High Byte (Lead Byte) | 0xA1 to 0xFE (1); 0x8E to 0xA0; 0x81 to 0x8D | 126 values |
| Low Byte (Trail Byte) | 0x40 to 0x7E; 0xA1 to 0xFE | 157 values |
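The byte ranges above can be checked mechanically. The following Python sketch is an illustration only and is not part of any of the Chinese systems discussed in this paper. It counts the valid lead and trail bytes and tests whether a byte pair lies inside the BIG-5 encoding space. The names `LEAD_RANGES`, `TRAIL_RANGES`, `span`, `in_ranges` and `is_big5_pair` are assumptions made for this example.

```python
# Illustrative sketch: the byte ranges are copied from the specification above.
LEAD_RANGES = [(0x81, 0x8D), (0x8E, 0xA0), (0xA1, 0xFE)]
TRAIL_RANGES = [(0x40, 0x7E), (0xA1, 0xFE)]


def span(ranges):
    """Total number of byte values covered by a list of inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


def in_ranges(value, ranges):
    """True if value lies inside any of the inclusive ranges."""
    return any(lo <= value <= hi for lo, hi in ranges)


def is_big5_pair(lead, trail):
    """True if (lead, trail) is a position inside the BIG-5 encoding space."""
    return in_ranges(lead, LEAD_RANGES) and in_ranges(trail, TRAIL_RANGES)


lead_count = span(LEAD_RANGES)        # number of valid lead bytes
trail_count = span(TRAIL_RANGES)      # number of valid trail bytes
capacity = lead_count * trail_count   # total positions in the encoding space

print(lead_count, trail_count, capacity)
```

The product of the two counts reproduces the total capacity quoted in the text, and a trail byte such as 0x7F, which falls in the gap between the two trail ranges, is rejected by `is_big5_pair`.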
The whole BIG-5 encoding scheme consists of three coding regions.
| Region | Description | Character Type | Code Range | Characters |
| 1 | Standard Chinese Character Area | Very often used | 0xA440 - 0xC67E (2) | 5401 |
| | | Often used | 0xC940 - 0xF9D5 | 7652 |
| | | System Reserved | 0xF9D6 - 0xF9FE | 41 |
| | | Special Characters | 0xC6A1 - 0xC8FE | 408 |
| 2 | Special Chinese Character Area | Standard Symbols | 0xA140 - 0xA3BF | 408 |
| | | Control Code | 0xA3C0 - 0xA3E0 | 33 |
| | | Reserved Control Code | 0xA3E1 - 0xA3FE | 30 |
| 3 | User Defined Area | 1st segment | 0xFA40 - 0xFEFE | 785 |
| | | 2nd segment | 0x8E40 - 0xA0FE | 2983 |
| | | 3rd segment | 0x8140 - 0x8DFE | 2041 |
All EUDC should be created in Region 3, the User Defined Area defined by BIG-5.
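The character counts in the region table follow directly from the code ranges, because only 157 of the 256 possible trail bytes are valid. As a rough check, the Python sketch below splits a two-byte code into its lead and trail bytes, counts the valid codes in each range, and reports which area a code belongs to. The names `AREAS`, `is_valid`, `count_codes` and `classify` are hypothetical helpers written for this illustration.

```python
# Valid trail bytes: 0x40-0x7E and 0xA1-0xFE (157 values).
TRAIL_BYTES = frozenset(range(0x40, 0x7F)) | frozenset(range(0xA1, 0xFF))

# (label, first code, last code, expected count), copied from the region table.
AREAS = [
    ("Region 1: very often used", 0xA440, 0xC67E, 5401),
    ("Region 1: often used", 0xC940, 0xF9D5, 7652),
    ("Region 1: system reserved", 0xF9D6, 0xF9FE, 41),
    ("Region 1: special characters", 0xC6A1, 0xC8FE, 408),
    ("Region 2: standard symbols", 0xA140, 0xA3BF, 408),
    ("Region 2: control code", 0xA3C0, 0xA3E0, 33),
    ("Region 2: reserved control code", 0xA3E1, 0xA3FE, 30),
    ("Region 3: 1st segment", 0xFA40, 0xFEFE, 785),
    ("Region 3: 2nd segment", 0x8E40, 0xA0FE, 2983),
    ("Region 3: 3rd segment", 0x8140, 0x8DFE, 2041),
]


def is_valid(code):
    """True if the two-byte code has a valid lead byte and trail byte."""
    lead, trail = divmod(code, 0x100)   # e.g. 0xA440 -> (0xA4, 0x40)
    return 0x81 <= lead <= 0xFE and trail in TRAIL_BYTES


def count_codes(first, last):
    """Number of valid BIG-5 codes in the inclusive range first..last."""
    return sum(1 for code in range(first, last + 1) if is_valid(code))


def classify(code):
    """Return the label of the area containing code, or None if invalid."""
    if not is_valid(code):
        return None
    for label, first, last, _expected in AREAS:
        if first <= code <= last:
            return label
    return None


for label, first, last, expected in AREAS:
    print(label, count_codes(first, last), expected)
```

Each computed count matches the figure in the table, and the ten ranges together account for all 19,782 positions of the encoding space. For example, `classify(0x8D45)` reports the 3rd segment of Region 3.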
Note:
(1) 0xA1 means a single byte with hexadecimal value "A1".
(2) 0xA440 means two bytes: the first byte has hexadecimal value "A4" and
the second byte has hexadecimal value "40".
What are HKUST EUDC
Since the establishment of HKUST in the early 1990s, ITSC has faced the
problem that the characters defined in standard BIG-5 cannot meet our
users' needs. Quite a lot of Chinese characters are missing from the
standard ETen system, the first Chinese system that we implemented. The
missing characters came mainly from the Chinese names of staff and students
and from addresses, and they are primarily local to the culture of Hong Kong.
At the time we rolled out the ETen Chinese system on our network, no EUDC
standard had been established in the region. As a result, ITSC selected the
3rd segment of the User Defined Area of the BIG-5 encoding scheme and created
our own set of Chinese characters. The following table lists the 152 HKUST
EUDC and their encodings as at 11-5-1999.
Current Situation
The encoding specifications of Region 1 and Region 2 are well defined in
the BIG-5 standard. The most troublesome region is the User Defined Area.
Our current implementation is as follows:
ETen DOS System (倚天中文系統)
| Segment | Description | Code Range | Characters |
| 1 | Simplified Chinese characters copied from KuoChiau (國喬中文系統) | 0xFA40 - 0xFEFE | 785 |
| 2 | Simplified Chinese characters copied from KuoChiau (國喬中文系統) | 0x8E40 - 0x98C5 | 1670 |
| 3 | HKUST End User Defined Characters | 0x8D45 - 0x8DFE | 152 |
Chinese Windows 3.1
| Segment | Description | Code Range | Characters |
| 1 | R&B EUDC Standard (1) | 0xFA40 - 0xFEFE | 784 |
| 2 | Unused | | |
| 3 | HKUST End User Defined Characters (2) | 0x8D51 - 0x8DFE | 140 |
RichWin 4.2+/RichWin 97
| Segment | Description | Code Range | Characters |
| 1 | GCCS Standard (3) | 0xFA40 - 0xFEFE | 3049 (segments 1 and 2 combined) |
| 2 | GCCS Standard | 0x8E40 - 0xA0FE | |
| 3 | Unused (Cannot Build EUDC) | | 0 |
Chinese Windows 95
| Segment | Description | Code Range | Characters |
| 1 | GCCS Standard | 0xFA40 - 0xFEFE | 3049 (segments 1 and 2 combined) |
| 2 | GCCS Standard | 0x8E40 - 0xA0FE | |
| 3 | HKUST End User Defined Characters | 0x8D45 - 0x8DFE | 152 |
Note:
(1) The R&B EUDC Standard is the EUDC character set developed by the R&B
company. It was the richest BIG-5 EUDC character set available on the
market at the time we implemented Chinese Windows 95.
(2) This set of HKUST EUDC was created as at March 96 and has not been
enhanced since then.
(3) GCCS (Government Chinese Character Set) was published by ITSD in 1995
and serves as the standard for information exchange within the government.
Its adoption in Hong Kong suggests that it will become the de facto
standard in the region.
EUDC Migration Issues
From ETen to Chinese Windows 3.1
-
There is no conversion available from ETen EUDC to Chinese Windows 3.1
EUDC for characters that fall in EUDC segments 1 & 2. If a document
was created in the ETen environment with such EUDC, there might not be a
corresponding character in Chinese Windows 3.1.
One workaround is to use the conversion program (繁簡體轉換程式) available
in the ETen Chinese system. This program converts a text file with Chinese
characters in EUDC segments 1 & 2 to standard BIG-5. The converted file can
then be opened and read in Chinese Windows 3.1.
Note that the Chinese Windows 3.1 environment has 12 fewer HKUST EUDC
(140 instead of 152), since its EUDC set was created as at March 96.
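The difference of 12 characters can be located from the code ranges in the tables of the Current Situation section: the ETen set occupies 0x8D45 - 0x8DFE, while the Chinese Windows 3.1 set occupies 0x8D51 - 0x8DFE. The Python sketch below lists the code positions that exist in the first range but not in the second. It assumes that both ranges are filled without gaps, which is consistent with the character counts in those tables. The helper name `codes_in_range` is hypothetical.

```python
# Valid trail bytes: 0x40-0x7E and 0xA1-0xFE.
TRAIL_BYTES = frozenset(range(0x40, 0x7F)) | frozenset(range(0xA1, 0xFF))


def codes_in_range(first, last):
    """List the two-byte codes in first..last that have a valid trail byte."""
    return [c for c in range(first, last + 1) if (c & 0xFF) in TRAIL_BYTES]


# Ranges taken from the ETen and Chinese Windows 3.1 tables (segment 3).
eten = set(codes_in_range(0x8D45, 0x8DFE))
win31 = set(codes_in_range(0x8D51, 0x8DFE))

# Code positions defined in ETen but absent from Chinese Windows 3.1.
missing = sorted(eten - win31)
print(len(eten), len(win31), [hex(c) for c in missing])
```

Under that assumption, the positions without a counterpart in Chinese Windows 3.1 are the 12 codes from 0x8D45 to 0x8D50.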
From ETen to RichWin 4.2+/RichWin 97
-
The EUDC creation tool that comes with RichWin 4.2+/RichWin 97 only supports
creating characters in the GB locale. Hence HKUST EUDC cannot be ported to the
RichWin system. There are no tools at all for converting a document created
in ETen with EUDC to RichWin.
Remark: To make RichWin properly support GCCS EUDC, one should install the
additional font, which is available in the RichWin installation directory.
From ETen to Chinese Windows 95
-
As with Chinese Windows 3.1, there is no conversion available from ETen EUDC
to Chinese Windows 95 EUDC for characters that fall in EUDC segments 1 & 2.
However, the same workaround of converting the characters in the ETen system
also applies.
From Chinese Windows 3.1 to Chinese Windows 95
-
A special EUDC conversion utility is available on the Chinese Windows 95
platform. It converts a text file from R&B EUDC encoding to GCCS encoding
and leaves characters in other regions (including HKUST EUDC) intact.
However, the conversion is not a complete mapping, since some characters
defined in R&B EUDC are still missing from GCCS EUDC.
One major drawback of this conversion tool is that the formatting of the
original document is lost, since the tool only converts text files.
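The general idea behind such a text-file conversion can be sketched in Python as follows. This is not the actual utility, whose internals are not described here. The function `convert_eudc` and the one-entry `SAMPLE_MAPPING` are hypothetical, and the mapped code values are placeholders rather than real R&B or GCCS assignments. The sketch walks through the bytes, treats any byte in 0x81 - 0xFE as the lead byte of a two-byte code, remaps codes that appear in the mapping, and copies everything else unchanged.

```python
def convert_eudc(data, mapping):
    """Remap two-byte BIG-5 codes found in `data` according to `mapping`.

    `mapping` maps a source code (int, e.g. 0xFA40) to a target code.
    Codes that are not in the mapping are copied unchanged.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x81 <= lead <= 0xFE and i + 1 < len(data):
            # Two-byte character: combine lead and trail into one code.
            code = (lead << 8) | data[i + 1]
            out += mapping.get(code, code).to_bytes(2, "big")
            i += 2
        else:
            # ASCII byte, or a stray lead byte at the end of the input.
            out.append(lead)
            i += 1
    return bytes(out)


# Hypothetical one-entry mapping with placeholder values, for illustration only.
SAMPLE_MAPPING = {0xFA40: 0x8E40}

text = b"A" + bytes([0xFA, 0x40, 0x8D, 0x45, 0xA4, 0x40])
converted = convert_eudc(text, SAMPLE_MAPPING)
print(converted.hex())
```

In this example only the code 0xFA40 is remapped. The ASCII byte, the HKUST EUDC code 0x8D45 and the standard code 0xA440 pass through untouched, which mirrors the behaviour described above for characters in other regions. A source code with no entry in the mapping is also left as it is, so a real tool would need to report such characters to the user.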
From RichWin 4.2+/RichWin 97 to Chinese Windows 95
-
No conversion is necessary at all, as both systems support GCCS EUDC.
Chinese Windows 95 has a richer EUDC set because it also incorporates the
HKUST EUDC.
Future EUDC Adoption
Strictly speaking, GCCS EUDC is not an industry standard at present. However,
it is the richest character set available and includes most of the Chinese
characters that are local to Hong Kong's culture. Chinese system vendors
appear willing to support this character set. For example, Microsoft
incorporated it into its product Pan Chinese Windows NT 4.0 in early 1997.
Currently, GCCS EUDC has 3,049 Chinese characters defined and is growing
at a rate of around 300 characters per year. Although Chinese system vendors
may not keep pace with the updates, it is expected to become a semi
de facto encoding standard in Hong Kong.
A detailed study of the HKUST EUDC reveals that only 13 of the 152 characters
are missing from GCCS EUDC, and most of these 13 characters are simply
different glyphs of other defined characters.
To avoid further divergence from GCCS EUDC, migration to the Chinese
Windows 95 platform will provide a solution for effective Chinese
information exchange.
To conclude, adhering to the GCCS EUDC standard will help us obtain the richest
set of traditional Chinese characters, which in turn will meet the regional
need for Chinese characters. Adopting this standard should also facilitate
document exchange not only within the campus but throughout the SAR.