Re: SIGIA-L: Fonts and non-western character sets (long)
From: Andrew McNaughton (andrew_at_tki.org.nz)
Date: Tue Jul 24 2001 - 12:35:28 EDT
On Tue, 24 Jul 2001, Angeles, Michael (Michael) wrote:
> where Lynda Weinman says:
> The problem with other languages (Russian, Czech, Croat, German,
> Polish, Romanian, Slovak, Esperanto, Serbian, Ukrainian, Arabic, Greek,
> Hebrew to name a few) is that they use different character sets from
> that of Western European languages (see Figure 2); in fact there are 10
> other sets! If an end user from one of these other countries accesses a
> Web page that specifies <FONT FACE="arial, helvetica, verdana">, it can
> easily result in the use of the wrong character set, and the text will
> not be visible in the required language.
This is a fairly extended post. I've had my head deep inside this area
for a while now, and have learned a lot that is generally not easy to
gather from the web. This is probably more detail than is appropriate
for this list, so follow-ups might be better off-list.
I've tried to structure this post so that those with only a passing
interest can just read the first bit. Unfortunately, this whole area is
not well documented on the web, so system implementers are likely to
come across a lot of unexpected complications.
The simple answer
The character set provided by a font is not at all the same thing as the
set of characters in a given character encoding. You set the character
encoding as part of the MIME type, either in the HTTP header or in a
meta tag.
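For example, here is a minimal sketch (Python, CGI-style; the script and
page text are hypothetical, not from any particular system) of declaring
the encoding in both places:

    def send_page():
        # The charset parameter in the Content-Type header is the
        # authoritative declaration of the document's encoding.
        print("Content-Type: text/html; charset=utf-8")
        print()
        # The <meta http-equiv> tag repeats the declaration inside
        # the document itself, for copies saved without headers.
        print('<html><head><meta http-equiv="Content-Type"')
        print('  content="text/html; charset=utf-8"></head>')
        print('<body>M\u0101ori</body></html>')

    send_page()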
The font used for a given character should fall back through the
nominated font list and the browser's default font(s), independently of
surrounding characters, to something that has the required character. If
no font has the required character, it's normal for the browser to
display a placeholder character, typically an empty box, a dot or a
question mark.
In practice, current standards are not compatible with older browsers
and are often patchily supported by current ones, and it's difficult to
get non-ISO-8859-1 characters to display reliably.
Technically correct but perhaps optimistic details
If a browser follows modern standards (ie HTML 4 with Unicode
compliance), the character *set* of an HTML document (as opposed to the
character encoding) is defined by Unicode (as opposed to UTF-8, which is
an encoding). This means that entities (eg &#263; for the character ć)
correctly refer to positions in the Unicode character set, NOT positions
in the character set implied by the document's character encoding (it's
the encoding, not the character set, that you can set in the mime-type
header). This is theoretically compatible with ISO-8859-1 for all
characters from 0 to 255.
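As a quick illustration (a Python sketch; the character is an arbitrary
example), a numeric entity is derived from the Unicode code point, not
from the byte value in any particular encoding:

    # A numeric character reference names a Unicode code point.
    ch = "\u0101"                # a with macron, used in Maori
    print("&#%d;" % ord(ch))     # -> &#257; in any document encoding
    # The byte representation, by contrast, varies with the encoding:
    print(ch.encode("utf-8"))    # b'\xc4\x81' (two bytes in UTF-8)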
HTML 4 provides various named entities which are supposed to be provided
by all compliant web browsers, and which are commonly rendered from
different fonts without this necessarily being obvious to the user.
Increasingly, however, TrueType fonts cover an extended range of
characters. Windows 2000, for example, includes a core set of TrueType
fonts sufficient to cover all of the ISO-8859 character sets plus a few
more, and fonts like Arial Unicode MS exist which cover most Asian
languages.
A more realistic (and more confusing) view of what actually happens
Most named character entities are not supported by all browsers. MSIE
does fairly well, Netscape less so. I hate to think what character set
confusion does to auditory browsers. Browsers commonly render numeric
entities 128-255 according to their positions in the character set
implied by the current character encoding, or in the OS's extended
character set (eg CP1252), and a huge number of web pages rely on this
fact.
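A small Python sketch of why this is a trap (the byte value is just an
example): the same byte means different things under different
encodings, and pages that write &#147; are leaning on the CP1252
reading.

    # Byte 0x93 is a curly left quote in Windows CP1252, but an
    # invisible C1 control character in ISO-8859-1.
    b = b"\x93"
    print(b.decode("cp1252"))             # left double quotation mark
    print(repr(b.decode("iso-8859-1")))   # '\x93', a control code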
Even if you can get characters to display properly on older operating
systems and browsers, it's not certain that these pages will also print
correctly.
It is generally not possible for a web server to automatically detect,
with any reliability, a user's ability to display a given encoding or
font.
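The nearest thing to a signal is the HTTP Accept-Charset request header,
but many browsers omit it or send a wildcard, so it is a hint at best. A
simplified Python sketch (the parsing is deliberately crude and ignores
quality values beyond stripping them):

    import os

    # CGI convention exposes the Accept-Charset request header as
    # the HTTP_ACCEPT_CHARSET environment variable.
    def accepted_charsets(environ=os.environ):
        raw = environ.get("HTTP_ACCEPT_CHARSET", "")
        if not raw or raw.strip() == "*":
            return []       # no usable information from the browser
        # Drop ";q=" quality parameters and normalise the names.
        return [item.split(";")[0].strip().lower()
                for item in raw.split(",")]

    example = {"HTTP_ACCEPT_CHARSET": "ISO-8859-1,utf-8;q=0.7,*;q=0.7"}
    print(accepted_charsets(example))   # ['iso-8859-1', 'utf-8', '*']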
Web character set issues are further complicated by the need to accept
incoming text through forms with varying encodings. For many smaller
languages (eg Māori, which I deal with) appropriate keyboard drivers are
not widely available.
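Browsers generally submit form text in the encoding of the page that
contained the form, so the server has to decode accordingly. A Python
sketch (the function and encoding names are examples, not any particular
system):

    # Decode raw form bytes using the page's encoding, then store
    # everything internally as UTF-8.
    def normalise_field(raw_bytes, page_encoding="iso-8859-1"):
        text = raw_bytes.decode(page_encoding, errors="replace")
        return text.encode("utf-8")

    print(normalise_field(b"caf\xe9"))   # -> b'caf\xc3\xa9'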
Strategies for coping with this mess
If you want to offer non-ISO-8859-1 character display reliably, you have
a few options:
* Ask the user which text encoding displays properly for them.
* Attempt automatic detection of browser capabilities and hope for
the best (unreliable).
* Publish multiple versions under different URLs, and provide
navigational aids for finding the best one for a given user.
* Display your text using graphic images.
* Publish with whatever font and character set you think will reach
the highest proportion of your users, and ignore the complexities
(the cheapest and most common strategy).
I've opted for querying the user, and have (mostly) built a system which
stores source text in UTF-8 and translates on the fly to one of several
character encodings according to a user-selected preference. This system
is still in testing, but mostly works well. I also translate form data
from the user's selected character encoding, and use dynamic stylesheets
to set fonts that work with the selected encoding.
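A sketch of that transcoding step (Python; the entity fallback and the
encoding names are my assumptions, not necessarily how the actual system
works):

    # Store source text as UTF-8; transcode per request to the
    # user's selected encoding. Characters the target encoding
    # cannot represent fall back to numeric character references
    # (&#NNN;), so nothing is silently lost.
    def transcode_for_user(utf8_bytes, target_encoding):
        text = utf8_bytes.decode("utf-8")
        return text.encode(target_encoding, errors="xmlcharrefreplace")

    source = "M\u0101ori".encode("utf-8")
    print(transcode_for_user(source, "iso-8859-1"))   # b'M&#257;ori'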
Also available (and more mature) is a system called fairy (see
www.borware.com), which renders each word as an inline graphic on the
server, meaning there is no dependence on the user's available fonts.
This approach is somewhat slow to deliver pages, and does not provide
for text received via forms. That said, there's little else available
for displaying many of the world's languages reliably.
Te Kete Ipurangi: The Online Learning Centre
Ph: 64 4 382 6500
Fax: 64 4 382 6509
Mobile: 021 323 076
PO Box 19-098