PGTS G. Patterson.   T/A PGTS ABN: 99885392845

PGTS Humble Blog

Thread: Tips/Tricks For Programming etc

Gerry Patterson. The world's most humble blogger
I didn't sign up for this s**t!

Converting text to plain ASCII 7-bit


Chronological Blog Entries:



Date: Sun, 16 Nov 2025 19:16:54 +1100

Earlier this year, I asked ChatGPT to write a C program that would translate UTF-16 to 7-bit ASCII.

I wasn't impressed with the result. So a few months ago I repeated the experiment and eventually got a script that had to be cleaned up and debugged, but it worked well enough to give me a working process, which I describe here. As it turned out, there was already an excellent open source utility, dos2unix, that does this. I had been using it for many years, mainly to handle Ye Olde CR|LF issue. It wasn't until I read the man pages again that I realised that dos2unix also handles UTF-16 to ASCII.

I had written a simple Perl script more than 20 years ago to perform a basic version of this task. Its purpose was to clean up text that I had copied from a source like a web page: it removes unwanted characters (by replacing them with SPACE) and expands each TAB with spaces up to the next 8-column tab stop.

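#!/usr/bin/perl
# cln: replace unwanted characters with SPACE and expand TABs
while (<>){
	# Control characters (except TAB, LF, VT, FF, CR) and all 8-bit bytes become SPACE
	s/[\x00-\x08\x0e-\x1F\x80-\xFF]/ /g;
	if (/\t/) {
		my @p = split "\t",$_;
		$_ = shift (@p);
		while (@p) {
			# Pad with spaces to the next multiple of 8 columns
			$_ .= " " x (8 - (length($_)%8));
			$_ .= shift(@p);
		}
	}
	print $_;
}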
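The tab arithmetic in the script, 8 - (length % 8), is the part most worth seeing in isolation. The sketch below does the same calculation in C on a string held in memory. The function name expand_tabs and the TABSTOP constant are hypothetical and used only for this illustration.

```c
#include <stddef.h>

#define TABSTOP 8

/* Copy `in` to `out`, replacing each TAB with spaces up to the next
 * multiple of TABSTOP. The column counter restarts after a newline.
 * Hypothetical helper: output is truncated to fit out_size and is
 * always NUL-terminated. Returns the length written. */
size_t expand_tabs(const char *in, char *out, size_t out_size)
{
    size_t col = 0, o = 0;

    if (out_size == 0)
        return 0;

    for (; *in != '\0' && o + 1 < out_size; in++) {
        if (*in == '\t') {
            /* Same arithmetic as the perl: 8 - (length % 8) spaces */
            size_t pad = TABSTOP - (col % TABSTOP);
            while (pad-- > 0 && o + 1 < out_size) {
                out[o++] = ' ';
                col++;
            }
        } else {
            out[o++] = *in;
            col = (*in == '\n') ? 0 : col + 1;
        }
    }
    out[o] = '\0';
    return o;
}
```

A TAB at column 2 therefore becomes 6 spaces, and a TAB that already sits on a tab stop becomes a full 8.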

By giving it a name like "cln", I could copy a block of text from a web source and insert it into a shell script ... And then, because I use "vim", enter the 5 keystrokes !}cln, and the code would be "cleaned" ... There was another variant of this script which would clean up the text, word-wrap it and add a '#' character to the start of each line ... leaving some readable documentation in the shell script about where I had found something and how I might follow it up in the future, if I needed to.
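The word-wrapping variant is not reproduced here. As a rough idea of what such a filter does, the sketch below greedily wraps words to a given width and starts every line with "# ". The names comment_wrap, struct sink and put are hypothetical, and the behaviour (a word longer than the width stays unbroken on its own line) is an assumption, not a description of the original script.

```c
#include <ctype.h>
#include <stddef.h>

/* Bounded output buffer, so the wrapper cannot overrun `buf` */
struct sink { char *buf; size_t size, len; };

static void put(struct sink *s, char c)
{
    if (s->len + 1 < s->size)   /* always leave room for the NUL */
        s->buf[s->len++] = c;
}

/* Greedy word-wrap of `in`, each output line prefixed with "# ".
 * `width` is the maximum line length including the prefix.
 * Returns the length written, excluding the terminating NUL. */
size_t comment_wrap(const char *in, size_t width, char *out, size_t out_size)
{
    struct sink s = { out, out_size, 0 };
    size_t col = 0;             /* characters already on the current line */

    if (out_size == 0)
        return 0;

    for (;;) {
        /* Skip any run of whitespace between words */
        while (*in != '\0' && isspace((unsigned char)*in))
            in++;
        if (*in == '\0')
            break;

        /* Measure the next word */
        size_t len = 0;
        while (in[len] != '\0' && !isspace((unsigned char)in[len]))
            len++;

        /* Start a new line if the word (plus a space) will not fit */
        if (col > 0 && col + 1 + len > width) {
            put(&s, '\n');
            col = 0;
        }
        if (col == 0) {
            put(&s, '#');
            col = 1;
        }
        put(&s, ' ');
        for (size_t i = 0; i < len; i++)
            put(&s, in[i]);
        col += 1 + len;
        in += len;
    }
    if (col > 0)
        put(&s, '\n');
    out[s.len] = '\0';
    return s.len;
}
```

Wrapped to a width of 16, the text "see the man page for dos2unix" comes out as three comment lines: "# see the man", "# page for" and "# dos2unix".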

So I went back to look for the code that had originally been presented to me by ChatGPT, when I carried out my first experiment with AI. It was nowhere near as well-written as the later version ... But it looked as if some of it could be used as a starting point for a C program to replace cln, the Perl script above. I cleaned it up and added indentation (which was totally lacking in the original).

The end result is shown below:

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <string.h>

/*  

    CLN Cleanup text and turn it into plain text [ASCII]
    === ================================================

    Simple C program that reads from standard input, replaces some
    common non-ASCII characters with their ASCII equivalents, and writes
    the result to standard output. It handles a few common accented
    characters and some special characters. It also expands TAB to be
    compatible with tabstop=8 in "vim". You can extend the mapping as
    needed. See also the perl script plaintxt.

    Date       Description
    ========== ==============================
    2025-06-23 First draft written by ChatGPT
    2025-07-19 First draft of u2asc written by ChatGPT 4
    2025-11-16 Cleaned up and added comments to replace perl script "cln"

*/

// ------------------------------------------------------------------------

// A helper function to map UTF-8 two-byte sequences for some accented letters
// Returns ASCII equivalent or 0 if no mapping

char map_utf8_sequence(unsigned char first, unsigned char second) {
	// Common Latin-1 Supplement characters in UTF-8 start with 0xC3
	if (first == 0xC3) {
		switch (second) {
			case 0x80: return 'A'; // À
			case 0x81: return 'A'; // Á
			case 0x82: return 'A'; // Â
			case 0x83: return 'A'; // Ã
			case 0x84: return 'A'; // Ä
			case 0x85: return 'A'; // Å
			case 0x87: return 'C'; // Ç
			case 0x88: return 'E'; // È
			case 0x89: return 'E'; // É
			case 0x8A: return 'E'; // Ê
			case 0x8B: return 'E'; // Ë
			case 0x8C: return 'I'; // Ì
			case 0x8D: return 'I'; // Í
			case 0x8E: return 'I'; // Î
			case 0x8F: return 'I'; // Ï
			case 0x91: return 'N'; // Ñ
			case 0x92: return 'O'; // Ò
			case 0x93: return 'O'; // Ó
			case 0x94: return 'O'; // Ô
			case 0x95: return 'O'; // Õ
			case 0x96: return 'O'; // Ö
			case 0x99: return 'U'; // Ù
			case 0x9A: return 'U'; // Ú
			case 0x9B: return 'U'; // Û
			case 0x9C: return 'U'; // Ü
			case 0x9D: return 'Y'; // Ý
			case 0x9F: return 's'; // ß (not Ÿ, which is C5 B8 in UTF-8)
			case 0xA0: return 'a'; // à
			case 0xA1: return 'a'; // á
			case 0xA2: return 'a'; // â
			case 0xA3: return 'a'; // ã
			case 0xA4: return 'a'; // ä
			case 0xA5: return 'a'; // å
			case 0xA7: return 'c'; // ç
			case 0xA8: return 'e'; // è
			case 0xA9: return 'e'; // é
			case 0xAA: return 'e'; // ê
			case 0xAB: return 'e'; // ë
			case 0xAC: return 'i'; // ì
			case 0xAD: return 'i'; // í
			case 0xAE: return 'i'; // î
			case 0xAF: return 'i'; // ï
			case 0xB1: return 'n'; // ñ
			case 0xB2: return 'o'; // ò
			case 0xB3: return 'o'; // ó
			case 0xB4: return 'o'; // ô
			case 0xB5: return 'o'; // õ
			case 0xB6: return 'o'; // ö
			case 0xB9: return 'u'; // ù
			case 0xBA: return 'u'; // ú
			case 0xBB: return 'u'; // û
			case 0xBC: return 'u'; // ü
			case 0xBF: return 'y'; // ÿ
			default: return 0;
		}
	}
	return 0;
}

// ------------------------------------------------------------------------

int main(void) {
	int c;
	int l = 0;	// Position of char in current line
	while ((c = getchar()) != EOF) {
			unsigned char uc = (unsigned char)c;
			if (uc < 128) {
				if (uc == 0x00) {
					putchar(' ');
				} else if (uc == 0x0A) {
					putchar(uc);
					// Reset the counter for LF
					l = 0;
					continue;
				} else if (uc == 0x0D) {
					continue;
				} else if (uc == 0x09) {
					putchar(' ');
					while (++l % 8) {
						putchar(' ');
					}
					continue;
				} else if (uc < 0x20) {
					// Other control characters become a space (and still count as a column)
					putchar(' ');
				} else {
					// Standard ASCII character, output as is
					putchar(uc);
				}
			} else {
				// Possibly multi-byte UTF-8 character
				// For simplicity, try to read next byte and map two-byte sequences starting with 0xC3
				if (uc == 0xC3) {
					int c2 = getchar();
					if (c2 == EOF) {
						// Unexpected EOF, just output first byte as '?'
						putchar('?');
						break;
					}
					unsigned char uc2 = (unsigned char)c2;
					char mapped = map_utf8_sequence(uc, uc2);
					if (mapped) {
						putchar(mapped);
					} else {
						// Unknown sequence, replace with '?'
						putchar('?');
					}

				} else {
					// Cleanup various non-ISO MS chars often found in HTML data
					switch (uc) {
						case 0x91:
							uc = '`';	// MS Word smart quote
							break;

						case 0x92:
							uc = '\'';	// MS Word smart quote
							break;

						case 0x93:
							uc = '"';
							break;

						case 0x94:
							uc = '"';
							break;

						case 0x95:
							uc = '*';
							break;

						case 0x96:
							uc = '-';	// Windows-1252 en dash
							break;

						case 0x97:
							uc = '-';	// Windows-1252 em dash, written as "--"
							putchar(uc);	// First hyphen; the second is written after the switch
							l++;
							break;

						default: uc = ' ';
							break;
					}
					putchar(uc);
				}
			}
			l++;
		}
		return 0;
}
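One thing to keep in mind when extending the mapping: the bytes 0x91 to 0x97 in the switch above are single-byte Windows-1252 characters. In UTF-8 input the same punctuation arrives as three bytes beginning with 0xE2 0x80, for example E2 80 99 for the right single quotation mark (U+2019). The sketch below shows one way to handle both two-byte and three-byte sequences, by decoding to a code point first and mapping afterwards. The names utf8_decode and punct_to_ascii are hypothetical, and the helper deliberately ignores four-byte sequences and overlong forms.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence of 1 to 3 bytes starting at s (n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the bytes do not form such a sequence.
 * Hypothetical helper for illustration only. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    if (n >= 1 && s[0] < 0x80) {
        *cp = s[0];                                   /* 0xxxxxxx */
        return 1;
    }
    if (n >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        /* 110xxxxx 10xxxxxx: 0xC3 0x80..0xBF gives U+00C0..U+00FF */
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if (n >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* Map a few punctuation code points to ASCII; 0 means "no mapping". */
char punct_to_ascii(unsigned long cp)
{
    switch (cp) {
        case 0x2018: return '`';   /* left single quotation mark */
        case 0x2019: return '\'';  /* right single quotation mark */
        case 0x201C:               /* left double quotation mark */
        case 0x201D: return '"';   /* right double quotation mark */
        case 0x2022: return '*';   /* bullet */
        case 0x2013:               /* en dash */
        case 0x2014: return '-';   /* em dash */
        default:     return 0;
    }
}
```

Decoding first also makes the 0xC3 table easy to check: C3 A9 decodes to U+00E9 (é) and C3 9F decodes to U+00DF (ß).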

As with the previous example, this was largely redundant, since the existing Perl scripts were well tested and reliable. However, I undertook this mainly as an exercise to gain some experience with using AI to create code.

AI has been around for years. Google incorporated AI features into its Google Home talking devices and into the summary that Google search shows at the top of its results. AI does seem like a powerful tool, and I am using it to gather "proof of concept" for some of my programming projects.

However, like any power tool, the results may depend largely on the competence of the person wielding the tool. A power saw, in the right hands, can deliver precise, perfectly straight, professional-looking cuts to various types of timber. However, it can be extremely efficient at cutting off the fingers of a careless, reckless or ignorant operator who doesn't understand the basics of carpentry and woodwork.




Copyright 2025, Gerry Patterson. All Rights Reserved.