Character Set Names
===================

A typical entry from the IANA character set listing at
http://www.iana.org/assignments/character-sets is as follows:

Name: ANSI_X3.4-1968 [RFC1345,KXS2]
MIBenum: 3
Source: ECMA registry
Alias: iso-ir-6
Alias: ANSI_X3.4-1986
Alias: ISO_646.irv:1991
Alias: ASCII
Alias: ISO646-US
Alias: US-ASCII (preferred MIME name)
Alias: us
Alias: IBM367
Alias: cp367
Alias: csASCII

Based on this, three things are needed:
- A set of valid C++ identifiers that can be used in the charset_t enumeration.
- An ASCII string that should be output when the character set's name must be printed.
- A set of ASCII strings that should be recognised when the character set's name is
  input.

The following rules are applied:


charset_t enumeration members
-----------------------------

The Name: line and all Alias: lines are used. Each punctuation character is replaced by
an underscore, runs of adjacent underscores are compressed to one, and any trailing
punctuation is trimmed (?). Letters are mapped to lower case. For the example above,
this gives:

ansi_x3_4_1968
iso_ir_6
ansi_x3_4_1986
iso_646_irv_1991
ascii
iso646_us
us_ascii
us
ibm367
cp367
csascii
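
The rules above can be sketched in C++ roughly as follows. This is only an illustration,
not the code used to build charset_t. The function name to_enum_identifier is
hypothetical, every character that is not an ASCII letter or digit is treated as
punctuation, and leading punctuation is dropped as well as trailing punctuation (an
assumption; the rules above do not mention it). Annotations such as
"(preferred MIME name)" are assumed to have been stripped from the line already.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: map an IANA name or alias to an enumerator-style identifier.
std::string to_enum_identifier(const std::string& iana_name)
{
    std::string id;
    for (unsigned char c : iana_name) {
        if (std::isalnum(c)) {
            // Letters are mapped to lower case; digits pass through unchanged.
            id += static_cast<char>(std::tolower(c));
        } else if (!id.empty() && id.back() != '_') {
            // A run of punctuation becomes a single underscore.
            // Assumption: punctuation before the first letter or digit is dropped.
            id += '_';
        }
    }
    // Trim the underscore left behind by trailing punctuation, if any.
    if (!id.empty() && id.back() == '_')
        id.pop_back();
    return id;
}
```

Under these assumptions, to_enum_identifier("ISO_646.irv:1991") yields iso_646_irv_1991.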

Variants are then produced by omitting underscores, except those that have a digit on
both sides:

ansi_x3_4_1968    ansix3_4_1968
iso_ir_6          iso_ir6           isoir_6           isoir6
ansi_x3_4_1986    ansix3_4_1986
iso_646_irv_1991  iso_646_irv1991   iso_646irv_1991   iso_646irv1991
                  iso646_irv_1991   iso646_irv1991    iso646irv_1991    iso646irv1991
ascii
iso646_us         iso646us
us_ascii          usascii
us
ibm367
cp367
csascii


ASCII Output
------------

Two functions convert to ASCII: charset_name returns the string from the Name: line of
the entry (ANSI_X3.4-1968 in this case), and charset_mime_name returns the "preferred
MIME name", if one is specified (US-ASCII in this case).


ASCII Input
-----------

The name is looked up in a table similar to the one used for the enumeration members
above; the lookup is insensitive to case and to punctuation.
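
One possible way to get such a lookup is to fold the input to a canonical key before
searching, so that the table needs only one key per name. The sketch below is an
assumption about how this could be done, not a description of the actual table: the
names fold_charset_name, lookup_charset and unknown_charset are hypothetical, and the
folding keeps a separator only between two digits, mirroring the variant rule for
enumeration members.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical, cut-down enumeration with an extra "not found" value.
enum charset_t { unknown_charset, ansi_x3_4_1968 };

// Fold a name: lower-case it and drop punctuation unless it separates two digits.
std::string fold_charset_name(const std::string& name)
{
    std::string key;
    bool pending_separator = false;  // punctuation seen since the last letter/digit
    for (unsigned char c : name) {
        if (!std::isalnum(c)) {
            pending_separator = true;
            continue;
        }
        if (pending_separator && !key.empty() &&
            std::isdigit(static_cast<unsigned char>(key.back())) && std::isdigit(c)) {
            key += '_';  // keep one separator between digits, e.g. "3.4" -> "3_4"
        }
        pending_separator = false;
        key += static_cast<char>(std::tolower(c));
    }
    return key;
}

// Look up a name; the keys are the folded forms of the Name: and Alias: strings.
charset_t lookup_charset(const std::string& name)
{
    static const std::map<std::string, charset_t> table = {
        { "ansix3_4_1968", ansi_x3_4_1968 }, { "isoir6", ansi_x3_4_1968 },
        { "ansix3_4_1986", ansi_x3_4_1968 }, { "iso646irv1991", ansi_x3_4_1968 },
        { "ascii", ansi_x3_4_1968 },         { "iso646us", ansi_x3_4_1968 },
        { "usascii", ansi_x3_4_1968 },       { "us", ansi_x3_4_1968 },
        { "ibm367", ansi_x3_4_1968 },        { "cp367", ansi_x3_4_1968 },
        { "csascii", ansi_x3_4_1968 },
    };
    const auto it = table.find(fold_charset_name(name));
    return it == table.end() ? unknown_charset : it->second;
}
```

With this folding, "us-ascii", "US_ASCII" and "usascii" all reach the same entry, while
"ansix341968" does not match because the separators between digits are significant.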