Friday, October 30, 2009

Non-ASCII domain names allowed

Today, ICANN, meeting in Seoul, has voted that from the summer of 2010, internet addresses - URLs - will be allowed to include arbitrary non-English characters from the Unicode set: see The New York Times.



Since September 2009, you have been able to register Cyrillic-based domain names in Bulgaria.

It's a pretty substantial extension of the space of possible names because the latest Unicode standard contains about 107,000 characters - imagine 500 pages of stuff like this. So far, only 26 letters, 10 digits, and the dash have been allowed as the "atoms" of URLs (37 characters that we know and love in total).

A few Unicode characters are enough to create a huge number of new possibilities for spoofing and misspelling. It's my understanding that even the tiniest difference between URLs will imply that the two addresses are inequivalent. Isn't it terrible?
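To see how tiny such a difference can be, here is a sketch (in Python, which exposes Unicode normalization directly): two strings that render identically may still differ at the code-point level, and normalization only reconciles canonical equivalents, not mere look-alikes:

    import unicodedata

    a = "caf\u00e9"    # "café" with the precomposed letter U+00E9
    b = "cafe\u0301"   # "café" as plain "e" plus the combining accent U+0301

    print(a == b)      # False: the raw strings differ despite identical rendering
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))
                       # True: normalization reconciles canonical equivalents only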




For example, in Czechia, it may be convenient because you will be able to use domain names such as "škoda.cz" for the carmaker. (By the way, it would be even nicer if they also allowed "škoda.čz", where "čz" would stand for "české země", or the Czech lands. So far, even in 2010, the top-level domains such as "com" have to be in Latin characters.) It looks better than the domain names with removed diacritics (and it sounds much better in commercials), but the fact that there exist so many variations is annoying and kind of dangerous. The spoofing risks are obvious. For example, an e-mail will offer you a great new book from
ɑМāƶơƞ.com
Needless to say, none of the letters in that "amazon" is the genuine ASCII letter it imitates. ;-) (You may play with "Run" / "charmap.exe" on Windows systems.) And one could probably find even more accurate approximations of the Latin characters. Your humble correspondent was once fooled even by a Latin-based fake Amazon, so I suspect that the amount of phishing will skyrocket.
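A minimal sketch of how such a string can be unmasked (in Python; the look-alike below is rebuilt from the characters shown above):

    import unicodedata

    real = "amazon"
    fake = "\u0251\u041c\u0101\u01b6\u01a1\u019e"   # the look-alike "amazon" above

    print(real == fake)   # False: to the DNS, these would be two unrelated names
    for ch in fake:
        # print the official Unicode name of each impostor character
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")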

A similar comment applies to brand protection. It will be easier for people to parasitize famous names by registering and using variations of the names of well-known people and brands. The costs of brand protection will increase. But in the end, I guess that the price of a single domain name should plummet.

Well, I think that the visual improvement caused by arbitrary characters in the URLs is good but the resulting huge landscape of inequivalent possibilities is bad. If I were in charge of this stuff, I would allow complex domain names but they would be identified with (and probably automatically redirected to) their "closest ASCII approximations", calculated according to some algorithm. The browsers could show you the local characters, but all the underlying internet communication would use the ASCII-restricted strings.
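For what it's worth, the existing IDN standard already realizes the second half of this idea, though without the identification step: a browser may display the local characters while the wire format is an ASCII-only "Punycode" string. A sketch using Python's built-in idna codec:

    # the "idna" codec converts each label to its ASCII-compatible form
    print("škoda.cz".encode("idna"))          # b'xn--koda-f6a.cz'
    print(b"xn--koda-f6a.cz".decode("idna"))  # back to 'škoda.cz'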

For example, the algorithm could say that characters with diacritics would simply be identified, in all domain names, with their diacritics-free counterparts. Exotic characters that can be replaced by one Latin letter could be treated in the same way. More complicated characters - Chinese, for example - could be identified e.g. with their transliteration followed by a dash.
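A minimal sketch of the diacritics-stripping part of such an algorithm (in Python; the transliteration of non-Latin scripts is omitted):

    import unicodedata

    def closest_ascii(label: str) -> str:
        """Map a label to its "closest ASCII approximation" by decomposing
        each character (NFKD) and dropping the combining marks; characters
        with no ASCII base would still need a transliteration table."""
        decomposed = unicodedata.normalize("NFKD", label)
        return "".join(c for c in decomposed if ord(c) < 128)

    print(closest_ascii("škoda"))   # skoda
    print(closest_ascii("české"))   # ceske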

The issue of special characters - and, in the Czech context, diacritics - has a long history. Recall that diacritical signs were introduced into written Czech by Jan Hus in the early 15th century.

Well, 25 years ago, when we were working with computers like the Commodore 64, most texts - including semi-professional Czech text games - would be stripped of diacritics, and people got used to it. I remember that I used to expect that the rules of written Czech would eventually be modified so that the Czechs would adapt to the ASCII-based computers. That was stupid, of course, and something else happened.

Computers got better and they were forced to learn our full national alphabets. However, 15 years ago, the situation was still very difficult because there existed many code pages specifying how to express the special characters as one byte (or, later, as sequences of bytes): ISO Latin-2 (ISO 8859-2), Windows-1250, the Kamenických encoding, Unicode, and others.
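The practical consequence was garbled text whenever the reader guessed the wrong code page, because the same byte meant different characters in different encodings. A small illustration (in Python, which ships the ISO and Windows code pages; the Kamenických encoding is not among its standard codecs):

    czech = "škola"                  # "school"
    raw = czech.encode("iso8859_2")  # ISO Latin-2 bytes: š becomes 0xB9
    print(raw.decode("cp1250"))      # read with the wrong code page: "ąkola"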

In the mid-1990s, I would spend half of my time on the web on tricks (and improvements to scripts) to help users with diverse browsers and diverse coding systems see my web pages properly. These problems were more or less solved years later. All the important computer programs eventually learned how to identify the right character set of their input and how to translate it to other coding systems when other users or programs demand it. And the underlying architecture has been increasingly dominated by Unicode, which allows "all" the characters. For example, this blog is encoded in Unicode and I switched my Gmail to Unicode, too.
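A toy version of that detection logic (in Python; real programs use richer statistical heuristics, and the choice of the fallback code page here is just an assumption): UTF-8 rejects most invalid byte sequences, so one can try it first and fall back to a legacy encoding.

    def guess_decode(data: bytes, fallback: str = "cp1250") -> str:
        """Decode bytes of unknown encoding: a failed strict UTF-8 decode
        suggests a legacy single-byte code page, so try the fallback."""
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode(fallback)

    print(guess_decode("škola".encode("utf-8")))    # škola
    print(guess_decode("škola".encode("cp1250")))   # škola, via the fallback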

However, this radical new change of the rules may revive a hassle reminiscent of what we used to have in the mid-1990s. Was it a good idea?

What do you think?
