PGTS Humble Blog
Thread: Perl Programming
|Open the pod bay doors, please HAL|
The UNICODE Tin Of Worms
Chronological Blog Entries:
Date: Sun, 21 Jun 2009 22:13:27 +1000
How to translate URI encoded strings into (UTF-8) UNICODE.
Recently a question regarding UNICODE arrived in the PGTS feedback:
I saw your script to extract Google search terms from an Apache log file - nice work! I have a question: many of the viewers of the web site, which I am maintaining the log files for, use Arabic and thus the Google search terms in the logs are URL encoded. Thus if I use your script I get a list of URL encoded items, i.e. %a3%b3 etc. Do you know of any quick way to further translate these URL "encodes" so that I can at least get a list of the Arabic search terms that were used? Thanks in advance.
John didn't follow up on his query, so I can only assume that he was looking at UNICODE search strings. If so, I would also assume that he was talking about UTF-8, which may be in the process of becoming the de-facto standard. In which case I'll have to further assume that the sample he quotes was just an example of hex URI encoding, because %a3%b3 doesn't really make sense as an Arabic string. The Arabic UNICODE block runs from U+0600 to U+06FF. In UTF-8 each of those code points becomes two bytes, and the first byte always falls in the range 0xD8 to 0xDB. The bytes 0xA3 and 0xB3 can only appear as the second or later byte of a UTF-8 sequence, never as the first.
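For anyone who wants to check a sample like this rather than take my word for it, here is a minimal sketch using the core Encode module. The helper name is_valid_utf8 is my own invention for this example and is not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if the octet string is well-formed UTF-8, 0 otherwise.
# is_valid_utf8 is a hypothetical helper name, not an Encode function.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() may modify its argument when a CHECK flag is set
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $sample = "\xa3\xb3";    # the bytes behind %a3%b3
my $seen   = "\xd8\xb3";    # U+0633 ARABIC LETTER SEEN encoded as UTF-8

printf "%%a3%%b3 valid UTF-8? %s\n", is_valid_utf8($sample) ? "yes" : "no";
printf "%%d8%%b3 valid UTF-8? %s\n", is_valid_utf8($seen)   ? "yes" : "no";
```

The first line should report "no" and the second "yes".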
Nevertheless, I must admit to a profound ignorance of UNICODE search strings. Despite claiming to be a "humble" blogger, like so many native English speakers, I have been somewhat lacking in humility when it comes to coding schemes other than the traditional American sets like ASCII and EBCDIC.
In fact, I confess that in recent years I have adopted ASCII to the exclusion of all other codes. This is rather like the way that the simple morality tales created by Hollywood present a Universe that runs on American Eastern Standard time populated by sentient beings that all speak and understand English (with an American accent). That artifice does simplify the plot considerably and allows compression of time lines ... However it is a long way from reality.
Sometimes it's hard to be humble when you are a native English speaker ... In the past, whenever I encountered unprintable ASCII strings represented in a URI by being encoded with a '%' character followed by the hex representation of the byte, I would turn them back into bytes with a simple subroutine.
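A subroutine of that kind might look something like the sketch below. The name percent_decode and the sample string are made up for this example. It only handles %XX escapes and does not treat '+' as a space, which some query strings expect.

```perl
use strict;
use warnings;

# Turn each %XX escape back into the single byte it stands for.
# percent_decode is a hypothetical name chosen for this sketch.
sub percent_decode {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Sample input: four Arabic letters, UTF-8 encoded and then percent-escaped.
my $bytes = percent_decode('%D8%B3%D9%84%D8%A7%D9%85');

# Print the byte count and the bytes in hex, since they are not yet characters.
printf "%d bytes: %s\n", length($bytes),
    join(' ', map { sprintf '%02X', ord } split //, $bytes);
```

At this stage the result is still a string of raw bytes, not a string of Arabic characters.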
If you don't want to roll your own subroutine, and you are using perl, the best place to start is the URI::Escape package. To read about it you can enter the command
perldoc URI::Escape
Or if you have installed the full perldoc package you can use the "man" command.
man URI::Escape
The uri_unescape function in URI::Escape should turn your percent-encoded strings into octets. The next step is turning them into something that makes sense on your console.
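Both steps can be sketched together as follows. This assumes that the search terms really are UTF-8, that URI::Escape is installed from CPAN, and that your console can display UTF-8. The sample string is my own, not one taken from a real log.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# Assumed sample: four Arabic letters, UTF-8 encoded and percent-escaped.
my $encoded = '%D8%B3%D9%84%D8%A7%D9%85';

my $octets = uri_unescape($encoded);      # raw bytes, one per %XX escape
my $text   = decode('UTF-8', $octets);    # perl character string

# Tell perl to encode characters as UTF-8 when writing to the console.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d octets, %d characters: %s\n", length($octets), length($text), $text;
```

The eight octets become four characters once they are decoded. Without the binmode line, perl is likely to complain about wide characters when printing.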
But where to go from there?
The most popular encoding scheme for UNICODE is UTF-8. Whenever I have started to read about UTF-8, I have gained the impression that it is a tin of worms and hastily put the lid back on. Fortunately (or perhaps unfortunately?) the UTF-8 standard has been designed so that ASCII can exist as a valid subset. That is after all quite reasonable, and probably the only way that such a new coding scheme would be widely accepted ... But it allows ASCII bigots to carry on with the Hollywood fantasy that all sentient beings speak English with an American accent and it is New York time everywhere on earth.
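The ASCII-as-a-subset point is easy to see for yourself. The following sketch uses the core Encode module to show the UTF-8 bytes for one ASCII letter and one Arabic letter. The helper name utf8_hex is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes for a code point as space-separated hex.
# utf8_hex is a hypothetical helper name for this sketch.
sub utf8_hex {
    my ($code_point) = @_;
    my $octets = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

printf "U+0041 -> %s\n", utf8_hex(0x0041);    # LATIN CAPITAL LETTER A
printf "U+0633 -> %s\n", utf8_hex(0x0633);    # ARABIC LETTER SEEN
```

The ASCII letter comes out as the single byte 41, exactly as it would in plain ASCII, while the Arabic letter needs the two bytes D8 B3.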
Reading about UNICODE is a daunting task. However for perl programmers the best place to start is the UNICODE introduction. You can read about it with this command:
perldoc perluniintro or man perluniintro
The recommendation in the Introduction is to use this command in your perl script.
If you still aren't inclined to put the lid back on the UNICODE tin of worms and pull the sheets over your head, you can then proceed to read the "perlunicode" man pages. Unfortunately, trying to pretend that the problem doesn't exist isn't really an option. For the time being, it looks as if UTF-8 is here to stay.