
Character Set Restrictions Imposed by Common Internet Media

Last updated: 10-May-2003

Introduction

There are three main Internet media in which almost any Internet user can author content: E-mail, News and the World Wide Web. Each of these media has restrictions (largely arising from historical circumstance) on the characters that may be used. These restrictions are arbitrary, difficult to remember and sometimes appear petty, but ignoring them will cause problems for others and possibly yourself.

The situation is made worse by the fact that many modern applications dealing with these media are written by people who are unaware of, do not understand or (worse) deliberately ignore Internet standards and conventions. One of the worst offenders in this regard ironically claims that the Internet was designed to be compatible with its software (even though users of its products are regularly lambasted for using "broken" software that ignores and breaks most Internet standards and conventions).

History

The Internet was developed in the US back in the days when only 7-bit character sets (a 7-bit character set can contain up to 128 characters) were in use and 8-bit character sets (which can contain up to 256 characters) were unheard of.

Although the Internet transmits information in 8-bit chunks, since there were no 8-bit character sets around at the time, the designers of the Internet had to choose one of the many 7-bit character sets in use around the world as the standard character set for many protocols. The natural decision (since they were in the US) was to use US ASCII (ISO 646-US) as the standard character set for these protocols (such as e-mail and later news).

Why You Should Follow the Rules

The Internet adopts change slowly. Although it is possible to extend protocol definitions in ways which are backwards-compatible (so people using old software can continue to use that software to send data), this can cause problems for older software when receiving data which makes use of the extensions. As long as anyone on the Internet is using older software, it is unwise to make use of protocol extensions. On the Internet, old software doesn't die, it simply fades away (but very slowly).

The rules presented here define what it is possible to do and still be certain that you will not cause problems for others. In some cases you can break the rules without any apparent problem (it may cause problems for some but if they don't complain you won't know about it) in the same way that you can often break the speed limit and not get caught. As with breaking the speed limit, having got away with it several times does not make it a good idea.

Some people don't particularly care that they may cause problems for others. In news posts a common attitude is that if 90% (or 75%, or whatever) of people can read their posts then that is good enough. This is a very selfish attitude if you have something to say that others would like to read and a very foolish attitude if you are requesting help (it is possible that the only person with the answer is one of those you have excluded by posting something he can't read). In practice you'll find that if you break the rules a large proportion of those with long experience of the Internet will ignore your posts whether they can read them or not, on the principle that if you're so selfish or foolish as to exclude those with older software your posts have nothing worthwhile to offer anyway.

It takes a little more effort to follow the rules, but if you have something to say you surely wish as many people as possible to read it. If not, why bother in the first place?

Valid ASCII Characters

ASCII (ISO 646-US) is the character set used by e-mail and news. Valid (printable) characters are:
 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~

Also valid are the space character and the tab character.

You should not explicitly use the carriage-return or line-feed control characters: when you start a new line in e-mail or news your TCP/IP software translates whatever your mailer/newsreader uses to start a new line into the standard end-of-line convention used in transmitting e-mail and news (carriage-return followed by line-feed). The effect of inserting an isolated carriage-return or line-feed (whichever of them happens not to be used to indicate new lines on your computer) is unpredictable.
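
As a rough illustration of the rule above, the following Python sketch reports any character in a message body that is not printable ASCII, a space or a tab. The function name find_unsafe_chars is made up for this example, and the sketch assumes the text uses "\n" for line ends (which the mail or news software would translate on sending).

```python
import string

# Printable ASCII (codes 33-126) plus the space and tab characters.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t")

def find_unsafe_chars(text):
    """Return (line_number, column, character) for each disallowed character."""
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch not in ALLOWED:
                problems.append((line_no, col, ch))
    return problems

print(find_unsafe_chars("Plain text is fine.\nSchloß is not."))
# → [(2, 6, 'ß')]
```

An empty list means the text should be safe to send as plain e-mail or news. An isolated carriage-return would be reported as unsafe, which matches the advice above.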

Valid ISO 8859/1 Characters

ISO 8859/1 is the default character set used by HTML (web pages). The first 128 characters are identical with ASCII. It is not valid to use the non-ASCII characters of ISO 8859/1 in e-mail or news. Valid (printable) ISO 8859/1 characters are:
 !"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Also valid are the space character, the tab character, the hard space and the soft hyphen. The hard space character (character code 160 decimal) is identical in appearance to the ordinary space but applications which automatically format text should not break a line at a hard space. The soft hyphen (character code 173 decimal) should be discarded by all applications unless they automatically format text, in which case it indicates a suitable point to break a line and display a hyphen (many applications either fail to discard the soft hyphen when they should or fail to treat it as a potential break point).

You should not explicitly use the carriage-return or line-feed control characters: when you start a new line in e-mail or news your TCP/IP software translates whatever your mailer/newsreader uses to start a new line into the standard end-of-line convention used in transmitting e-mail and news (carriage-return followed by line-feed). The effect of inserting an isolated carriage-return or line-feed (whichever of them happens not to be used to indicate new lines on your computer) is unpredictable.

E-mail

Despite various extensions, you should still regard e-mail as being 7-bit, US-ASCII unless you are certain that the recipient can handle MIME and has access to the same character set that you are using.

Although the e-mail transfer protocol has been extended to allow the transfer of 8-bit characters, a significant fraction of mail transfer agents (mail sending software, mail receiving software and mail gateways) are still restricted to 7 bits. So even if your software is capable of sending 8-bit characters, they may not arrive at the destination intact: 8-bit characters may be discarded entirely or may mutate into 7-bit characters by having the top bit set to zero. This means that if you write:

German Schloß for sale in Köln, only £250,000.
it may turn into (if 8-bit characters are discarded):
German Schlo for sale in Kln, only 250,000.
or (if 8-bit characters have their top bit set to zero):
German Schlo_ for sale in Kvln, only #250,000.
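
The two kinds of damage shown above can be reproduced with a short Python sketch. It assumes the message was sent as ISO 8859/1 (one byte per character); the function names discard_8bit and clear_top_bit are invented for this illustration and do not correspond to any real mail software.

```python
def discard_8bit(data: bytes) -> bytes:
    """Drop every byte that has its top bit set."""
    return bytes(b for b in data if b < 0x80)

def clear_top_bit(data: bytes) -> bytes:
    """Set the top bit of every byte to zero."""
    return bytes(b & 0x7F for b in data)

message = "German Schloß for sale in Köln, only £250,000."
raw = message.encode("latin-1")  # ISO 8859/1 bytes as they would be sent

print(discard_8bit(raw).decode("ascii"))
# → German Schlo for sale in Kln, only 250,000.
print(clear_top_bit(raw).decode("ascii"))
# → German Schlo_ for sale in Kvln, only #250,000.
```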

Even if you are lucky enough that your message gets through intact, your recipient may not be using the same 8-bit character set as yourself. Although ISO 8859/1 is very common (being designed to serve the needs of several European languages), there are many other ISO 8859 character sets and even more 8-bit character sets that are not part of ISO 8859. The standard Mac Roman encoding for character sets has many (but not all) of the ISO 8859/1 characters but few of the characters are in the same position as in ISO 8859/1. People in different countries will use the character set appropriate to their country, which may or may not match yours. As far as English speaking countries go, any character set which is compatible with US ASCII will serve, and incompatibilities may only surface when you need to send a currency symbol or a non-English character.
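
To see what a character set mismatch looks like, this Python sketch takes text sent as ISO 8859/1 and interprets the same bytes as Mac Roman. It relies on Python's built-in "latin-1" and "mac_roman" codecs; the sample text is arbitrary.

```python
text = "Schloß in Köln"

sent = text.encode("latin-1")    # bytes produced by the sender's software
seen = sent.decode("mac_roman")  # the same bytes read as Mac Roman

print(seen)          # the ASCII letters survive, the others do not
print(seen == text)  # → False
```

No bytes were lost in transit here; the only problem is that sender and recipient disagree about what the 8-bit codes mean.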

Actually, even 7-bit US ASCII is not guaranteed to reach all recipients unchanged. Some older software running on some mail gateways (notably those that use the EBCDIC character set) has problems with the following characters:

[\]^`{|}~

Generally, the Internet community regards software that incorrectly handles these characters as "broken" but that is of no help if you're mailing C source code to somebody. Fortunately, this software is extremely rare these days.

MIME

A way of getting around the problems of 7-bit e-mail transfer is to use MIME (Multipurpose Internet Mail Extensions). MIME has many modes of operation, but normally if you include an 8-bit character in an item of e-mail a MIME-aware mailer will convert the message into "quoted-printable" form and use special sequences of characters starting with an equals sign (=) to encode the 8-bit characters.

The first problem with using MIME in e-mail is that not everyone has a MIME-capable mailer. You must always ensure that your recipient can handle MIME before sending MIME-encoded e-mail. If the recipient does not have a MIME-capable mailer and you send:

German Schloß for sale in Köln, only £250000 (=$375000).
he will see:
German Schlo=DF for sale in K=F6ln, only =A3250000 (=3D$375000).

Raw MIME is extremely confusing. Your recipient might figure out that "Schlo=DF" means "Schloß" and that "K=F6ln" means "Köln." He may well think that "=A" is an obscure way of representing "£" and that you're asking £3250000 for your "Schlo=DF" (this sort of mistake is very common in these circumstances). The fact that your message already contained an equals sign is unfortunate because it gets turned into =3D and just adds to the confusion. And just to make things worse, if your message had trailing blanks at the end of a line then MIME adds equals signs to the ends of those lines.
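
The quoted-printable form can be reproduced with the quopri module from Python's standard library. This is only a sketch: it assumes the message is in ISO 8859/1 and encodes the body alone, without the MIME headers a real mailer would add.

```python
import quopri

message = "German Schloß for sale in Köln, only £250000 (=$375000)."

# Encode the ISO 8859/1 bytes as quoted-printable, as a MIME mailer might.
encoded = quopri.encodestring(message.encode("latin-1"))
print(encoded.decode("ascii"))

# A MIME-capable mailer reverses the process before displaying the text.
decoded = quopri.decodestring(encoded).decode("latin-1")
print(decoded == message)
```

The first line printed should match what a recipient without a MIME-capable mailer sees.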

The second problem with using MIME in e-mail is that not everyone is using the same character set as yourself. MIME will automatically include details of your character set so your recipient's software can automatically select the appropriate character set - if that character set happens to be available or the software knows how to synthesize it from available character sets. Even so, you must ensure that your recipient can deal with your character set before using MIME.

News

News articles are essentially a superset of e-mail messages, although the transmission method differs. This means that they have the same restrictions as e-mail messages, namely that the protocol specifies the use of the printable characters of US ASCII (ISO 646-US).

Modern news server software is "8-bit clean" which means that it passes 8-bit characters correctly. However, a lot of news servers are still running older software which is not 8-bit clean (which means it drops or mutates 8-bit characters just as with e-mail). Not only that, but news is global in scope so many people will not be using the same 8-bit character set as yourself and even if 8-bit characters reach them intact they will see something entirely different to what you wrote.

MIME is not a feasible solution to the problem of sending 8-bit characters. MIME should only be used when you can guarantee that your recipient can handle it. Since you have no knowledge or control over who receives your news article, you cannot guarantee that they can cope with MIME.

Note that some countries operate a closed group of news servers guaranteed to be 8-bit clean and with the convention that articles posted to those servers use a single, agreed character set. If, and only if, your article is being posted to a closed community of news servers then you may use 8-bit characters (provided you post using the correct character set). Newsgroups in the major hierarchies (alt, biz, comp, humanities, misc, news, rec, sci, soc, talk, etc.) as well as most English-speaking hierarchies (such as uk) are not in a closed community of this kind and you should restrict your posts in those hierarchies to US ASCII.

World Wide Web

The base character set for web pages, used by default, is ISO 8859/1. Other character sets and encoding schemes may be used (such as the Japanese JUNET encoding of ISO 2022-JP or Unicode UCS-2 format) but these are outside the scope of this document. This section assumes that the default ISO 8859/1 character set is being used.

HTML allows you to enter all valid ISO 8859/1 characters directly. However, the 8-bit characters which are not in the ASCII (ISO 646-US) character set cause problems for those who are limited to retrieving web pages by e-mail because e-mail cannot be guaranteed to transmit other than ASCII characters correctly. They also cause problems for people using Apple Macs because most Mac browsers have problems with 8-bit characters entered directly. It is therefore considered good practice to use character entities (such as &aelig; for "æ") or numeric character references (such as &#163; for £) unless your document contains so many 8-bit characters as to make this impractical.
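
One way to produce numeric character references mechanically is Python's "xmlcharrefreplace" error handler, which replaces every character that cannot be encoded as ASCII with its &#number; form. This is a minimal sketch with an arbitrary sample string; it does not escape the &, < and > characters, which need separate treatment.

```python
text = "Schloß for £250,000 with æsthetic charm"

# Characters outside ASCII become numeric character references.
ascii_html = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(ascii_html)
# → Schlo&#223; for &#163;250,000 with &#230;sthetic charm
```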

The characters:

[\]^`{|}~
may also cause problems when web pages are retrieved by e-mail if the e-mail passes through (now rare) older mail gateways, but it is not considered necessary to use numeric character references to deal with those in ordinary text (it is considered necessary to URL-encode them when they appear in URLs).

One extension to HTML, which is an adjunct to most versions of HTML but incorporated into HTML version 4, is the ability to refer to characters in the Unicode character set. The first 256 characters of the Unicode character set are identical to the ISO 8859/1 character set (in the same way that the first 128 characters of the ISO 8859/1 character set are identical to the ASCII character set). Unicode characters can be entered by means of numeric character references greater than 255 (such as &#8364; to give the Euro currency symbol). This depends on browsers having Unicode support and the user having Unicode fonts which contain the relevant symbols. Older browsers may display nothing at all, an error marker, or a totally incorrect character (usually a character from ISO 8859/1). Some older browsers may even crash when confronted with a numeric character reference greater than 255. Until Unicode support is much more widespread you should think very carefully before using Unicode characters in web pages.

HTML Character Entities and Numeric Character References for ISO 8859/1 Characters

HTML has two mechanisms (which do not work for e-mail or news) for referring to characters that are difficult to enter from the keyboard, are subject to corruption when HTML pages are accessed by e-mail or have special meaning to HTML. These are character entities and numeric character references.

Character entities take the form &entity-name; where entity-name is one of those listed below. Entity names are case-sensitive: &thorn; and &THORN; are lower- and upper-case versions of the same character whilst &Thorn; is not a valid entity.

Numeric character references take the form &#number; where number can be any of 9, 10, 13, 32-126, 160-255.

Notes:

  1. Numeric character references should work with any browser.

  2. Character entities are generally preferable to numeric character references because it is easier to recognize them when reading HTML source. However, early drafts of the HTML specification defined character entities only for letters not in the English alphabet and accented characters. Older browsers do not understand other (non-alphabetic) character entities, which is why the table below gives a preferred form. The non-breaking space is an exception to this since there are roughly the same number of browsers that have problems with &#160; as &nbsp;.

  3. The semicolon terminating character entities and numeric character references may be omitted if the following character would not be interpreted as part of the entity or numeric reference. However, since most people don't know what characters are and aren't valid in entity names, it's safest to always include the terminating semicolon. Even if you happen to understand the rules of entity names, amending a file can result in the semicolon becoming necessary without you noticing, but if you always include the semicolon you'll never get caught out.

  4. The ", &, > and < characters all have special significance in HTML (although there are also circumstances when they behave as ordinary characters): when you need to display one of these characters, rather than using it for the purposes of HTML mark-up, you should use the character entity or numeric character reference.

Character      Entity      Numeric Reference      Preferred
"      &quot;      &#34;      &quot;
&      &amp;      &#38;      &amp;
<      &lt;      &#60;      &lt;
>      &gt;      &#62;      &gt;
[Non-breaking space]      &nbsp;      &#160;      &#160;
¡      &iexcl;      &#161;      &#161;
¢      &cent;      &#162;      &#162;
£      &pound;      &#163;      &#163;
¤      &curren;      &#164;      &#164;
¥      &yen;      &#165;      &#165;
¦      &brvbar;      &#166;      &#166;
§      &sect;      &#167;      &#167;
¨      &uml;      &#168;      &#168;
©      &copy;      &#169;      &#169;
ª      &ordf;      &#170;      &#170;
«      &laquo;      &#171;      &#171;
¬      &not;      &#172;      &#172;
[Soft Hyphen]      &shy;      &#173;      &#173;
®      &reg;      &#174;      &#174;
¯      &macron;      &#175;      &#175;
°      &deg;      &#176;      &#176;
±      &plusmn;      &#177;      &#177;
²      &sup2;      &#178;      &#178;
³      &sup3;      &#179;      &#179;
´      &acute;      &#180;      &#180;
µ      &micro;      &#181;      &#181;
¶      &para;      &#182;      &#182;
·      &middot;      &#183;      &#183;
¸      &cedil;      &#184;      &#184;
¹      &sup1;      &#185;      &#185;
º      &ordm;      &#186;      &#186;
»      &raquo;      &#187;      &#187;
¼      &frac14;      &#188;      &#188;
½      &frac12;      &#189;      &#189;
¾      &frac34;      &#190;      &#190;
¿      &iquest;      &#191;      &#191;
À      &Agrave;      &#192;      &Agrave;
Á      &Aacute;      &#193;      &Aacute;
Â      &Acirc;      &#194;      &Acirc;
Ã      &Atilde;      &#195;      &Atilde;
Ä      &Auml;      &#196;      &Auml;
Å      &Aring;      &#197;      &Aring;
Æ      &AElig;      &#198;      &AElig;
Ç      &Ccedil;      &#199;      &Ccedil;
È      &Egrave;      &#200;      &Egrave;
É      &Eacute;      &#201;      &Eacute;
Ê      &Ecirc;      &#202;      &Ecirc;
Ë      &Euml;      &#203;      &Euml;
Ì      &Igrave;      &#204;      &Igrave;
Í      &Iacute;      &#205;      &Iacute;
Î      &Icirc;      &#206;      &Icirc;
Ï      &Iuml;      &#207;      &Iuml;
Ð      &ETH;      &#208;      &ETH;
Ñ      &Ntilde;      &#209;      &Ntilde;
Ò      &Ograve;      &#210;      &Ograve;
Ó      &Oacute;      &#211;      &Oacute;
Ô      &Ocirc;      &#212;      &Ocirc;
Õ      &Otilde;      &#213;      &Otilde;
Ö      &Ouml;      &#214;      &Ouml;
×      &times;      &#215;      &#215;
Ø      &Oslash;      &#216;      &Oslash;
Ù      &Ugrave;      &#217;      &Ugrave;
Ú      &Uacute;      &#218;      &Uacute;
Û      &Ucirc;      &#219;      &Ucirc;
Ü      &Uuml;      &#220;      &Uuml;
Ý      &Yacute;      &#221;      &Yacute;
Þ      &THORN;      &#222;      &THORN;
ß      &szlig;      &#223;      &szlig;
à      &agrave;      &#224;      &agrave;
á      &aacute;      &#225;      &aacute;
â      &acirc;      &#226;      &acirc;
ã      &atilde;      &#227;      &atilde;
ä      &auml;      &#228;      &auml;
å      &aring;      &#229;      &aring;
æ      &aelig;      &#230;      &aelig;
ç      &ccedil;      &#231;      &ccedil;
è      &egrave;      &#232;      &egrave;
é      &eacute;      &#233;      &eacute;
ê      &ecirc;      &#234;      &ecirc;
ë      &euml;      &#235;      &euml;
ì      &igrave;      &#236;      &igrave;
í      &iacute;      &#237;      &iacute;
î      &icirc;      &#238;      &icirc;
ï      &iuml;      &#239;      &iuml;
ð      &eth;      &#240;      &eth;
ñ      &ntilde;      &#241;      &ntilde;
ò      &ograve;      &#242;      &ograve;
ó      &oacute;      &#243;      &oacute;
ô      &ocirc;      &#244;      &ocirc;
õ      &otilde;      &#245;      &otilde;
ö      &ouml;      &#246;      &ouml;
÷      &divide;      &#247;      &#247;
ø      &oslash;      &#248;      &oslash;
ù      &ugrave;      &#249;      &ugrave;
ú      &uacute;      &#250;      &uacute;
û      &ucirc;      &#251;      &ucirc;
ü      &uuml;      &#252;      &uuml;
ý      &yacute;      &#253;      &yacute;
þ      &thorn;      &#254;      &thorn;
ÿ      &yuml;      &#255;      &yuml;
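
The "Preferred" column follows a simple pattern: named entities for the four mark-up characters and for the letters at codes 192-255 (other than the multiplication and division signs), numeric references for the remaining characters from 160 upwards. The Python sketch below encodes that pattern. The function name preferred_reference is invented for this example; the entity names come from the standard library's html.entities table rather than from this page.

```python
from html.entities import codepoint2name

def preferred_reference(ch):
    """Return the preferred HTML form of one ISO 8859/1 character."""
    code = ord(ch)
    if ch in '"&<>':
        # Mark-up characters: always use the named entity.
        return "&" + codepoint2name[code] + ";"
    if 192 <= code <= 255 and code not in (215, 247):
        # Accented and other non-English letters: named entity.
        return "&" + codepoint2name[code] + ";"
    if 160 <= code <= 255:
        # Symbols (including the multiplication and division signs).
        return "&#%d;" % code
    return ch  # plain ASCII is left alone

def to_html(text):
    return "".join(preferred_reference(ch) for ch in text)

print(to_html("Köln <£250>"))
# → K&ouml;ln &lt;&#163;250&gt;
```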

URL-Encoding

When writing URLs (whether as actual links on web pages or when referring to URLs in e-mail and news), certain characters must be represented by means of URL-encoding, which means substituting the character with %hexadecimal-character-code. The full details of when and why URL-encoding is necessary are discussed in RFC 1630, RFC 1738 and RFC 1808. Some basic details are given here.
  1. All non-ASCII characters (i.e., valid ISO 8859/1 characters that are not also ASCII characters) must be URL-encoded. E.g., the file köln.html would appear in a URL as k%F6ln.html.

  2. Some characters are considered to be "unsafe" when web pages are requested by e-mail. These are: the tab character (%09), the space character (%20), [ (%5B), \ (%5C), ] (%5D), ^ (%5E), ` (%60), { (%7B), | (%7C), } (%7D) and ~ (%7E). When writing URLs you should URL-encode these characters.

  3. Some characters have special meanings in URLs, such as the colon (:) that separates the URL scheme from the rest of the URL, the // that indicates that the URL conforms to the Common Internet Scheme syntax and the percent sign (%). Generally, when these characters appear as parts of filenames they must be URL-encoded to distinguish them from their special meaning in URLs (this is a simplification, read the RFCs for the full details). These characters are: " (%22), # (%23), % (%25), & (%26), + (%2B), , (%2C), / (%2F), : (%3A), < (%3C), = (%3D), > (%3E), ? (%3F) and @ (%40).
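
The rules above can be tried out with urllib.parse.quote from Python's standard library. This is only a sketch: the filenames are made up, and encoding="latin-1" is passed explicitly because this document assumes ISO 8859/1, whereas quote encodes non-ASCII characters as UTF-8 by default. Note also that quote leaves "/" unencoded by default (and "~" in recent Python versions), so it does not follow the lists above exactly.

```python
from urllib.parse import quote, unquote

# Rule 1: non-ASCII characters, encoded here as ISO 8859/1 bytes.
print(quote("köln.html", encoding="latin-1"))
# → k%F6ln.html

# Rule 2: "unsafe" characters such as the space and square brackets.
print(quote("my file[1].html"))
# → my%20file%5B1%5D.html

# Rule 3: a percent sign that is part of a filename.
print(quote("50%.html"))
# → 50%25.html

# Decoding reverses the process.
print(unquote("k%F6ln.html", encoding="latin-1"))
# → köln.html
```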

© Copyright 1998, Paul L. Allen
Comments to John Hall: webmasteratjhall.co.uk (replace "at" by "@")
