pandorafms/extras/anytermd/libpbe/doc/charsets/intro.txt

Character Sets - Introduction
=============================

This is a bottom-up file-by-file introduction to the character-set functionality in the
include/charsets/ directory.

char_t.hh
---------

This defines defined-size character types char8_t, char16_t and char32_t.

Something similar is proposed for C++0x (see N1823).  That doesn't include char8_t; does
that mean that char is certain to be 8 bits?  Even so, I think that char8_t is a useful
alias.


charset_t.hh
------------

This defines an enum, charset_t, to name character sets.

It is populated with a large number of character set names and aliases automatically
generated from tha IANA character sets registrations.  See the file names.txt for more
information about this automatic processing.


charset_names.hh
----------------

This file provides facilities to convert between the charset_t enum and textual
character set names (in ASCII).  This is also based on automatically-generated tables
from the IANA data.


charset_traits.hh
-----------------

This declares a class template, charset_traits<charset_t>, which can be specialised for
each character set.  Each specialisation indicates:

- The unit and character types, which differ for character sets variable-length
  encodings.
- A state type, for those character sets that have a "shift state".
- Booleans to indicate characteristics such as whether the character set is an ASCII
  superset, and so on.
- For variable-length encodings, various functions for encoding and decoding.  (More
  about this below.)

More will probably be added, for example constants to indicate the maximum or
typical number of units per character.


utf8.hh
-------

This provides a specialisation of charset_traits for utf8, including the encoding and
decoding functions.

Similar files will be needed for UTF-16, iso-2022, GB18030, Big5, Shift-JIS etc.


const_character_iterator.hh
---------------------------

This defines an iterator adaptor, using boost::iterator_adapter, that takes an iterator
over a sequence of units and provides a const iterator over its characters.  It does
this using the encoding and decoding functions from the charset_traits, which have
trivial implementations for fixed-length character sets.

The iterator is bidirectional, not random access.


character_output_iterator.hh
----------------------------

This defines a second iterator adaptor that provides a mutable character iterator from a
mutable unit iterator.  It is an output iterator, and would normally be used with a
back-insertion iterator that appends to the sequence of units.


charset_char_traits.hh
----------------------

This provides per-character-set classes compatible with std::char_traits.


error_policy.hh
---------------

When an error occurs during conversion, an error_policy indicates what should happen.
This file supplies two error policies, error_policy_throw and
error_policy_return_sentinel.


char_conv.hh
------------

Class char_conv converts a single character from one character set to another.  The
input and output character sets, and an error policy, are indicated by template
parameters, and (partial) specialisation will be used to provide conversions for
particular character set pairs.

The default class attempts to convert in two steps via UCS4.

Static booleans could be added to indicate whether the conversion is
losslessly-reversible, and so on - "character set pair traits", in effect.


conv/ascii.hh
-------------

Provides trivial conversion between ASCII and Unicode.


conv/unicode.hh
---------------

Provides trivial conversion between the different Unicode character sets.


conv/iso8859.hh
---------------

Provides conversion to and from the ISO 8859-n character sets.  Conversion for 8859-1 is
trivial.  For the other character sets, automatically generated tables from the Unicode
mapping data are used.  Tables for 8859-to-Unicode are built in; the sparser tables
for the reverse conversion are generated from them when first needed.


char_converter.hh
-----------------

This is a simple wrapper around char_conv that uses a stateful functor to track the
shift state from one invocation to the next.


sequence_conv.hh
----------------

This converts a sequence of characters using a similar interface to char_conv.  Its default
implementation invokes char_conv repeatedly to do the work.

Specialisations of this will be provided for fast conversion of contiguous-in-memory data.


sequence_converter.hh
---------------------

This is a simple wrpapper around sequence_conv that uses a stateful functor to track the
shift state from one invocation to the next.


string_adaptor.hh
-----------------

This provides an adaptor that provides character-at-a-time access to a string of units.  It
stores a reference to the underlying string.

There is still a lot of work to do here.  In particular, in each instance where a std::string
member takes or returns a position we have the choice to use
1. A unit poisition.
2. A character position.
3. An iterator.

There are pros and cons to each.

There's also the question of over-writing the middle of a string.


cs_string.hh
------------

A string tagged with its character set.  This is implemented using string_adaptor plus an
owned unit-string.  Again, lots still to be decided.


STILL TO DO
===========

I still plan to add:

- Fallback character conversion using iconv.

- Optimisation for conversion of contiguous-in-memory data.

- Conversion from one cs_string to another.

- Typedefs for common string and character types, e.g. utf8_string.

- Worry about all those places where I do comparisons on char values that may or
  may not be signed....