I wasn't impressed with the result. So a few months ago I repeated the experiment and eventually got a script that had to be cleaned up and debugged, but it worked well enough to give me a working process, which I describe here. As it turned out, there was already an excellent open source utility, dos2unix, that does this. I had been using it for many years, mainly to handle Ye Olde CR|LF issue. It wasn't until I read the man pages again that I realised that dos2unix also handles UTF-16 to ASCII conversion.
I had written a simple perl script more than 20 years ago to perform a basic version of this task. Its purpose was to clean up text that I had copied from a source such as a web page, removing unwanted characters (by replacing them with SPACE) and expanding each TAB to the next 8-column tab stop.
#!/usr/bin/perl
while (<>) {
    s/[\x00-\x08\x0e-\x1F\x80-\xFF]/ /g;
    if (/\t/) {
        my @p = split "\t", $_;
        $_ = shift(@p);
        while (@p) {
            $_ .= " " x (8 - (length($_) % 8));
            $_ .= shift(@p);
        }
    }
    print $_;
}
By giving it a name like "cln", I could copy a block of text from a web source and insert it into a shell script ... And then, because I use "vim", enter the 5 keystrokes !}cln, and the code would be "cleaned ... There was another variant of this script which would clean up the text, word-wrap it and add a '#' character to the start of each line ... Leaving some readable documentation in the shell script about where I had found something and how I might follow up on it, if I needed to, in the future.
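The comment-generating variant is not reproduced in this article. Purely as an illustration of the idea (word-wrap the text and put a '#' at the start of each line), a minimal C sketch might look like the following. The function name wrap_as_comment, its parameters and the fixed output buffer are assumptions made for this example; they are not taken from the original perl script.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not the original script): word-wrap `in` so that
 * each output line is at most `width` characters, including the leading
 * "# ", and store the result in `out`.
 * A single word longer than the width is placed on a line of its own.
 * Returns 0 on success, or -1 if `out` is too small.
 */
int wrap_as_comment(const char *in, size_t width, char *out, size_t outsz)
{
    size_t used = 0;    /* bytes written to out so far */
    size_t col = 0;     /* length of the current output line */

    assert(in != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    while (*in != '\0') {
        /* Skip the whitespace before the next word, then measure the word */
        in += strspn(in, " \t\r\n");
        size_t len = strcspn(in, " \t\r\n");
        if (len == 0)
            break;

        /* Decide what goes in front of this word */
        const char *sep;
        if (col == 0)
            sep = "# ";             /* very first word */
        else if (col + 1 + len > width)
            sep = "\n# ";           /* word does not fit: start a new line */
        else
            sep = " ";

        size_t seplen = strlen(sep);
        if (used + seplen + len + 2 > outsz)    /* keep room for "\n\0" */
            return -1;

        memcpy(out + used, sep, seplen);
        used += seplen;
        memcpy(out + used, in, len);
        used += len;

        col = (sep[0] == ' ') ? col + 1 + len : 2 + len;
        in += len;
    }

    if (col > 0)
        out[used++] = '\n';         /* terminate the last line */
    out[used] = '\0';
    return 0;
}
```

With a width of 16, the input "see  the man\tpage for dos2unix" would come back as three lines: "# see the man", "# page for" and "# dos2unix". A real filter would read standard input and would probably clean the text first, as cln does.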
So I went back to look for the code that had originally been presented to me by ChatGPT, when I carried out my first experiment with AI. It was nowhere near as well-written as the later version. ... But it looked as if some of it could be used as a starting point for a C program to replace cln, the perl script above. I cleaned it up and added indentation (which was totally lacking in the original).
The end result is shown below:
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <string.h>

/*
CLN Cleanup text and turn it into plain text [ASCII]
=== ================================================
Simple C program that reads from standard input, replaces some common
non-ASCII characters with their ASCII equivalents, and writes the result
to standard output.

It handles a few common accented characters and some special characters.
It also expands TAB to be compatible with tabstop=8 in "vim".
You can extend the mapping as needed.

See also the perl script plaintxt.

Date       Description
========== ==============================
2025-06-23 First draft written by ChatGPT
2025-07-19 First draft of u2asc written by ChatGPT 4
2025-11-16 Cleaned up and added comments to replace perl script "cln"
*/

// ------------------------------------------------------------------------
// A helper function to map UTF-8 two-byte sequences for some accented letters
// Returns ASCII equivalent or 0 if no mapping
char map_utf8_sequence(unsigned char first, unsigned char second) {
    // Common Latin-1 Supplement characters in UTF-8 start with 0xC3
    if (first == 0xC3) {
        switch (second) {
            case 0x80: return 'A'; // À
            case 0x81: return 'A'; // Á
            case 0x82: return 'A'; // Â
            case 0x83: return 'A'; // Ã
            case 0x84: return 'A'; // Ä
            case 0x85: return 'A'; // Å
            case 0x87: return 'C'; // Ç
            case 0x88: return 'E'; // È
            case 0x89: return 'E'; // É
            case 0x8A: return 'E'; // Ê
            case 0x8B: return 'E'; // Ë
            case 0x8C: return 'I'; // Ì
            case 0x8D: return 'I'; // Í
            case 0x8E: return 'I'; // Î
            case 0x8F: return 'I'; // Ï
            case 0x91: return 'N'; // Ñ
            case 0x92: return 'O'; // Ò
            case 0x93: return 'O'; // Ó
            case 0x94: return 'O'; // Ô
            case 0x95: return 'O'; // Õ
            case 0x96: return 'O'; // Ö
            case 0x99: return 'U'; // Ù
            case 0x9A: return 'U'; // Ú
            case 0x9B: return 'U'; // Û
            case 0x9C: return 'U'; // Ü
            case 0x9D: return 'Y'; // Ý (0xC3 0x9F is ß, which is left unmapped)
            case 0xA0: return 'a'; // à
            case 0xA1: return 'a'; // á
            case 0xA2: return 'a'; // â
            case 0xA3: return 'a'; // ã
            case 0xA4: return 'a'; // ä
            case 0xA5: return 'a'; // å
            case 0xA7: return 'c'; // ç
            case 0xA8: return 'e'; // è
            case 0xA9: return 'e'; // é
            case 0xAA: return 'e'; // ê
            case 0xAB: return 'e'; // ë
            case 0xAC: return 'i'; // ì
            case 0xAD: return 'i'; // í
            case 0xAE: return 'i'; // î
            case 0xAF: return 'i'; // ï
            case 0xB1: return 'n'; // ñ
            case 0xB2: return 'o'; // ò
            case 0xB3: return 'o'; // ó
            case 0xB4: return 'o'; // ô
            case 0xB5: return 'o'; // õ
            case 0xB6: return 'o'; // ö
            case 0xB9: return 'u'; // ù
            case 0xBA: return 'u'; // ú
            case 0xBB: return 'u'; // û
            case 0xBC: return 'u'; // ü
            case 0xBF: return 'y'; // ÿ
            default: return 0;
        }
    }
    return 0;
}

// ------------------------------------------------------------------------
int main(void) {
    int c;
    int l = 0; // Position of char in current line

    while ((c = getchar()) != EOF) {
        unsigned char uc = (unsigned char)c;
        if (uc < 128) {
            if (uc == 0x00) {
                putchar(' ');
            } else if (uc == 0x0A) {
                putchar(uc);
                // Reset the counter for LF
                l = 0;
                continue;
            } else if (uc == 0x0D) {
                // Drop CR
                continue;
            } else if (uc == 0x09) {
                // Expand TAB to the next multiple of 8 columns
                putchar(' ');
                while (++l % 8) {
                    putchar(' ');
                }
                continue;
            } else if (uc < 0x20) {
                // Any other control character becomes a space
                putchar(' ');
            } else {
                // Standard ASCII character, output as is
                putchar(uc);
            }
        } else {
            // Possibly multi-byte UTF-8 character
            // For simplicity, try to read next byte and map two-byte
            // sequences starting with 0xC3
            if (uc == 0xC3) {
                int c2 = getchar();
                if (c2 == EOF) {
                    // Unexpected EOF, just output first byte as '?'
                    putchar('?');
                    break;
                }
                unsigned char uc2 = (unsigned char)c2;
                char mapped = map_utf8_sequence(uc, uc2);
                if (mapped) {
                    putchar(mapped);
                } else {
                    // Unknown sequence, replace with '?'
                    putchar('?');
                }
            } else {
                // Cleanup various non-ISO MS chars often found in HTML data
                switch (uc) {
                    case 0x91:
                        uc = '`'; // MS Word smart quote (opening single)
                        break;
                    case 0x92:
                        uc = '\''; // MS Word smart quote (closing single)
                        break;
                    case 0x93:
                        uc = '"'; // Opening double quote
                        break;
                    case 0x94:
                        uc = '"'; // Closing double quote
                        break;
                    case 0x95:
                        uc = '*'; // Bullet
                        break;
                    case 0x96:
                        uc = '-'; // En dash
                        break;
                    case 0x97:
                        uc = '-'; // Em dash, written as two hyphens
                        putchar(uc);
                        l++;
                        break;
                    default:
                        uc = ' ';
                        break;
                }
                putchar(uc);
            }
        }
        l++;
    }
    return 0;
}
As with the previous example, this was largely redundant, since the existing perl scripts were well tested and reliable. However, I undertook this mainly as an exercise to gain some experience with using AI to create code.
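The header comment of the C program says the mapping can be extended as needed. One gap is that the smart quotes, dashes and bullets are only recognised as single bytes (0x91 to 0x97), whereas in UTF-8 text the same characters arrive as three-byte sequences beginning 0xE2 0x80. The sketch below shows one way this might be handled, in the same spirit as map_utf8_sequence. The names map_utf8_punct and replace_utf8_punct are hypothetical, and the sketch works on strings rather than on standard input so that it is easy to test; it is not part of the program above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: map a three-byte UTF-8 punctuation sequence to an
 * ASCII replacement string, or return NULL if there is no mapping.
 */
const char *map_utf8_punct(unsigned char b1, unsigned char b2,
                           unsigned char b3)
{
    if (b1 != 0xE2 || b2 != 0x80)
        return NULL;
    switch (b3) {
    case 0x98: return "`";      /* U+2018 left single quote  */
    case 0x99: return "'";      /* U+2019 right single quote */
    case 0x9C: return "\"";     /* U+201C left double quote  */
    case 0x9D: return "\"";     /* U+201D right double quote */
    case 0xA2: return "*";      /* U+2022 bullet             */
    case 0x93: return "-";      /* U+2013 en dash            */
    case 0x94: return "--";     /* U+2014 em dash            */
    case 0xA6: return "...";    /* U+2026 ellipsis           */
    default:   return NULL;
    }
}

/*
 * Hypothetical helper: copy `in` to `out`, replacing every mapped sequence.
 * All other bytes are copied unchanged.
 * Returns 0 on success, or -1 if `out` is too small.
 */
int replace_utf8_punct(const char *in, char *out, size_t outsz)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    assert(in != NULL && out != NULL && outsz > 0);

    while (*p != '\0') {
        const char *rep = NULL;

        /* Only look ahead when two more bytes really exist */
        if (p[0] == 0xE2 && p[1] != '\0' && p[2] != '\0')
            rep = map_utf8_punct(p[0], p[1], p[2]);

        if (rep != NULL) {
            size_t n = strlen(rep);
            if (used + n + 1 > outsz)
                return -1;
            memcpy(out + used, rep, n);
            used += n;
            p += 3;
        } else {
            if (used + 2 > outsz)
                return -1;
            out[used++] = (char)*p++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

In the stream-based program, the equivalent change would be to read two further bytes whenever 0xE2 is seen, and to advance the column counter by the length of the replacement so that TAB expansion stays aligned.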
AI has been around for years. Google incorporated AI features into its Google Home talking devices and into the summary that Google search gives at the top of each set of search results. AI does seem like a powerful tool, and I am using it to gather "proof of concept" code for some of my programming projects.
However, like any power tool, the results may depend largely on the competence of the person wielding the tool. A power saw, in the right hands, can deliver precise, perfectly straight, professional-looking cuts in various types of timber. However, it can be extremely efficient at cutting off the fingers of a careless, reckless or ignorant operator who doesn't understand the basics of carpentry and woodwork.
G. Patterson.   T/A PGTS ABN: 99885392845