---------------------------------------------

NOTE: This article is an archived copy for portfolio purposes only, and may refer to obsolete products or technologies. Old articles are not maintained for continued relevance and accuracy.
May 4, 2004

Internationalized Domain Names

The Internet is becoming increasingly international, accessed by people who speak a wide variety of different languages. However, the character set used by DNS and other core protocols hasn't kept up. This must change if IP technology is to reach broader acceptance among non-English-speaking audiences, and breaking the Internet's dependency on seven-bit ASCII is a good place to start. One critical advance toward this objective was made last year when the IETF published "Internationalizing Domain Names in Applications (IDNA)" (RFC 3490). IDNA specifies the use of Internationalized Domain Names (IDNs) to display characters from foreign languages and alphabets.

For people in predominantly English-speaking countries, international characters may seem irrelevant, or at best a distraction from more pressing needs. However, large-scale changes to the global infrastructure will affect every network whose users communicate internationally. Sending e-mail to users in another country may eventually require an upgrade to IDNs. Companies selling products or services worldwide may want to register an IDN that accurately represents their wares, and anyone with international clientele will need to be prepared for support issues.

The IDNA Transformation Model

IDNA describes algorithms for the presentation and encapsulation of IDNs. This means that an IDN can exist either in a "rich" form that uses characters from international languages, or in an encoded form that's compatible with legacy ASCII. Applications can then use whichever form is most appropriate. For example, the rich form is likely to be presented to users for display purposes, while the encoded ASCII form can be passed through to underlying applications and protocols. Eventually, underlying applications may also be able to use the rich form internally.
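As a rough illustration of the two forms, the sketch below uses the "idna" codec from Python's standard library, which follows RFC 3490. The domain name is only a sample, and the variable names are invented for this example.

```python
# A minimal sketch of the two forms of one IDN, assuming Python's
# built-in "idna" codec (an RFC 3490 implementation).
rich = "www.exämple.com"            # rich form, suitable for display

encoded = rich.encode("idna")       # encoded form, safe for legacy ASCII
restored = encoded.decode("idna")   # back to the rich form

print(encoded.decode("ascii"))      # www.xn--exmple-cua.com
```

Labels that are already plain ASCII, such as "www" and "com", pass through unchanged in both directions.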

Domain names are encoded in ASCII form by default. This way, they're compatible with legacy host name rules and can thus be supported by every Internet application. These legacy rules only allow English alphanumeric characters and the hyphen, and also impose certain ordering and length restrictions on individual labels and the overall domain name.
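The legacy rules can be approximated in a few lines. The following sketch assumes the commonly cited limits from RFC 1035 and RFC 1123 (labels of 1 to 63 characters that start and end with a letter or digit, and at most 253 characters for the whole name in text form); the function name is hypothetical.

```python
import re

# One label: letters, digits, and hyphens only, 1 to 63 characters,
# and no hyphen in the first or last position.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_legacy_hostname(name: str) -> bool:
    """Return True if name satisfies the (assumed) legacy host name rules."""
    if name.endswith("."):          # tolerate a fully qualified trailing dot
        name = name[:-1]
    if not name or len(name) > 253:
        return False
    return all(_LDH_LABEL.fullmatch(label) for label in name.split("."))
```

Under these assumptions a rich name such as "www.exämple.com" is rejected, while its encoded form "www.xn--exmple-cua.com" is accepted, which is the property that lets encoded names pass through legacy software.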

The IDNA encoding mechanism produces ASCII syntax that's compatible with legacy DNS rules. This ensures that all legacy protocols and services using domain names are able to interoperate, even if they can't display the full rich form to the end user. For example, SMTP uses e-mail addresses as identifiers, while HTTP uses URLs and host names for various operations. These protocols must continue to use ASCII until they're extended to use the full IDN.

For the moment, the internationalized form is used mainly for input and output operations that interface with a human user. For example, a user may be able to type an IDN into a Web browser's URL bar, but the browser must convert that domain name into the ASCII-encoded form before performing a DNS lookup. Similarly, HTTP will issue a GET message using ASCII.
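The browser-side conversion might look something like the sketch below. It assumes Python's standard urllib.parse module and "idna" codec; the function name and sample URL are invented for illustration, and no DNS lookup is actually performed.

```python
from urllib.parse import urlsplit

def ascii_host_for_lookup(url: str) -> str:
    """Return the ASCII-encoded host name that would be handed to DNS."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no host name in URL: {url!r}")
    # Convert the rich form typed by the user into the encoded form.
    return host.encode("idna").decode("ascii")

host = ascii_host_for_lookup("http://www.exämple.com/index.html")
# host is now "www.xn--exmple-cua.com"; the same ASCII string would be
# used for the DNS query and for the Host header of the HTTP GET request.
```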

Full IDNA support requires that this conversion also be performed in reverse, presenting encoded domain names in their rich form. If a user clicks on a link leading to a Web page in an internationalized domain, the browser should render that domain name using non-ASCII characters for display, printing, and bookmarking. Similarly, an e-mail client should allow internationalized addresses to be stored in the client's address book.

Unfortunately, not all applications support this seamless process. For instance, e-mail and Usenet messages all contain a globally unique Message-ID header field that includes a domain name. If these are converted on the fly during search and fetch operations, the resulting values won't match the originals, causing problems for indexing software and mail databases.
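The mismatch is easy to reproduce. In the sketch below, the Message-ID value and the helper name are made up for illustration; the conversion uses Python's built-in "idna" codec.

```python
def rich_message_id(message_id: str) -> str:
    """Rewrite the domain part of a Message-ID into its rich form."""
    local, _, domain = message_id.strip("<>").rpartition("@")
    rich_domain = domain.encode("ascii").decode("idna")
    return f"<{local}@{rich_domain}>"

stored = "<20040504.1234@mail.xn--exmple-cua.com>"   # as written in the header
converted = rich_message_id(stored)                  # as a converting client sees it

# A byte-for-byte comparison, which is what most indexes rely on, now fails.
ids_match = (converted == stored)                    # False
```

One way around this, under the same assumptions, is to leave identifiers such as Message-ID in their stored form and convert only for display.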

The Internationalization Wave

To date, few applications have implemented IDNA's transformation service. Among those that have, the implementations have been of variable quality and aren't always complete.

For example, the Web browser component of Mozilla 1.6 offers support for IDNA-to-ASCII conversion, but not the reverse: It will accept and process an IDN entered into the URL bar, but won't display the rich form of one reached through a hyperlink. Mozilla's e-mail client offers no support, so users can't enter an e-mail address containing an IDN. Recent versions of Opera and Konqueror offer pretty good support, although both still include some minor bugs.

Microsoft doesn't currently offer IDNA support for either Internet Explorer or Outlook, but has announced plans to implement it in future releases. Users who don't want to wait can add IDNA to current versions of Explorer and Outlook using third-party plug-ins.

Most instances of Internet software don't perform any kind of IDNA transformation yet. Everyday applications such as Traceroute will have to be extended to perform input and output conversions before the Internet can appear to be anything other than an ASCII-centric network. Similarly, basic services such as DHCP and SNMP will need to be upgraded before they can be used to reach domains containing non-ASCII characters. A 100 percent international experience requires a 100 percent replacement of every user-facing piece of code on the planet, from ping to printer drivers.

This means that the entire network needs to undergo a forklift upgrade before IDNA can truly internationalize the Internet. On top of that, several core technologies require enhancements to support IDNs internally. For example, even though the domain element within an e-mail address can be internationalized through IDNA transformation technology, the local element that defines a username or mailbox is still limited to ASCII.
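The e-mail limitation can be sketched as follows, assuming Python's built-in "idna" codec. The function name and addresses are hypothetical, and the check on the local part is deliberately simplistic.

```python
def to_wire_address(address: str) -> str:
    """Encode only the domain of an e-mail address; the local part stays ASCII."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an e-mail address: {address!r}")
    if not local.isascii():
        # IDNA covers domain names only, so there is nothing to convert this to.
        raise ValueError("local part must be ASCII")
    return f"{local}@{domain.encode('idna').decode('ascii')}"

wire = to_wire_address("info@exämple.com")   # "info@xn--exmple-cua.com"
```

An address such as "jörg@exämple.com" is rejected by this sketch, because the mailbox name falls outside what IDNA can transform.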

The IDNA-to-ASCII Conversion Process

IDNs require two different types of conversion: one in which the rich domain name is encoded into ASCII, and the other in which the ASCII form is decoded back into a rich domain name. These conversion operations are described in RFC 3490 as the ToASCII and ToUnicode functions, respectively.

The ToASCII function requires the following steps to be performed:

  1. Separate the domain name into its component labels (the fields separated by dots) and check each one for international characters. If a label only contains ordinary ASCII characters, it doesn't require conversion.
  2. Convert any extended characters into Unicode, the international standard for non-ASCII characters. Many OSs and applications use other character sets, but the encoding routines require Unicode.
  3. Normalize the characters to a particular form as specified in RFC 3491, "Nameprep: A Stringprep Profile for Internationalized Domain Names." This step is required because different Unicode strings can represent the same domain name. Uppercase characters must be converted to lowercase, and accents entered as separate characters must be combined with their base letters. For example, if a user enters "www.ExÄmple.com" with the "Ä" typed as an uppercase "A" followed by a separate diaeresis, the sequence will need to be normalized to "www.exämple.com". Some domain name registries apply additional restrictions. For instance, the Polish top-level domain (.pl) only allows characters from European and Middle Eastern languages. However, these restrictions only come into play when a domain name is registered; a lookup for a name that the registry wouldn't accept will simply return a "Name not found" error from DNS.
  4. Apply the conversion algorithm specified in RFC 3492, "Punycode: A Bootstring Encoding of Unicode for Internationalized Domain Names in Applications." At this point, the non-ASCII characters are removed from the label and replaced by a trailing sequence of ASCII characters. For instance, the "exämple" label would be converted into "exmple-cua," where the ASCII letters are copied through unchanged and the "-cua" suffix encodes which character was removed and where it belongs.
  5. Prepend a special tagging sequence, "xn--," to the beginning of the resulting label. This allows systems to recognize that the label contains an encoded domain name. A system that doesn't support rich characters will thus display "www.exämple.com" as "www.xn--exmple-cua.com".
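The steps above can be sketched in Python using the standard stringprep and unicodedata modules and the built-in "punycode" codec. This is a simplified illustration rather than a complete ToASCII: it omits Nameprep's prohibited-character and bidirectional checks as well as the label length check, and the function names are invented for the example.

```python
import stringprep
import unicodedata

ACE_PREFIX = "xn--"

def nameprep_simplified(label: str) -> str:
    """Step 3, simplified: map and normalize, without the prohibition checks."""
    mapped = "".join(
        stringprep.map_table_b2(ch)          # case folding (RFC 3454 table B.2)
        for ch in label
        if not stringprep.in_table_b1(ch)    # drop characters mapped to nothing
    )
    return unicodedata.normalize("NFKC", mapped)   # combine separated accents

def to_ascii_label(label: str) -> str:
    if label.isascii():                      # step 1: nothing to convert
        return label
    prepared = nameprep_simplified(label)    # steps 2 and 3 (str is already Unicode)
    if prepared.isascii():
        return prepared
    encoded = prepared.encode("punycode").decode("ascii")   # step 4
    return ACE_PREFIX + encoded              # step 5

def to_ascii(domain: str) -> str:
    return ".".join(to_ascii_label(label) for label in domain.split("."))

# Uppercase "A" followed by a separate combining diaeresis (U+0308).
result = to_ascii("www.ExA\u0308mple.com")   # "www.xn--exmple-cua.com"
```

For production use, the built-in "idna" codec (`name.encode("idna")`) performs the same steps along with the checks omitted here.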

When the "www.xn--exmple-cua.com" domain name is received by an IDNA-aware application, it will recognize the "xn--" tag in the middle label as indicating an encoded IDN, and can convert the domain name back to the "www.exämple.com" original. Note that since IDNA does not preserve non-normalized information during conversion and encoding, it is not possible for the original name with mixed case and separated accents to be preserved.
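The reverse direction, and the loss of the original spelling, can be sketched as follows. The helper names are hypothetical, and the sketch skips the round-trip verification that RFC 3490 requires of ToUnicode; it does follow the RFC's rule that a label which cannot be decoded is returned unchanged.

```python
ACE_PREFIX = "xn--"

def to_unicode_label(label: str) -> str:
    """Decode one label if it carries the tag; otherwise return it as is."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    try:
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        return label      # ToUnicode never fails; fall back to the input

def to_unicode(domain: str) -> str:
    return ".".join(to_unicode_label(label) for label in domain.split("."))

typed = "www.ExA\u0308mple.com"                 # mixed case, separated accent
encoded = typed.encode("idna").decode("ascii")  # "www.xn--exmple-cua.com"
decoded = to_unicode(encoded)                   # normalized "www.exämple.com"

original_survives = (decoded == typed)          # False: only the normalized form returns
```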

-- 30 --
Copyright © 2010-2017 Eric A. Hall.
Portions copyright © 2004 CMP Media, Inc. Used with permission.
---------------------------------------------