PGTS Humble Blog



Thread: Tips/Tricks For Programming etc
	Gerry Patterson. The world's most humble blogger
Opinions are like arseholes. Everyone has one -- Dirty Harry

Converting text to plain ASCII 7-bit

Date: Sun, 16 Nov 2025 19:16:54 +1100

Earlier this year, I asked ChatGPT to write a C program that would translate UTF-16 to 7-bit ASCII.

I wasn't impressed with the result. So a few months ago I repeated the experiment and eventually got a script that had to be cleaned up and debugged, but it worked well enough to give me a working process that I describe here.

As it turned out, there was already an excellent open source utility, dos2unix, that does this. I had been using it for many years, mainly to handle Ye Olde CR|LF issue. It wasn't until I read the man page again that I realised that dos2unix also handles UTF-16 to ASCII.

I had written a Perl script more than 20 years ago to perform a basic version of this task. Its purpose was to clean up text that I had copied from a source like a web page, by replacing unwanted characters with SPACE and expanding TABs to 8-column tab stops.

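#!/usr/bin/perl
# cln: replace unwanted characters with SPACE and expand TABs to 8-column stops
while (<>){
	# Control characters (except TAB, LF, VT, FF, CR) and 8-bit characters become SPACE
	s/[\x00-\x08\x0e-\x1F\x80-\xFF]/ /g;
	if (/\t/) {
		# Rebuild the line piece by piece, padding to the next multiple of 8 at each TAB
		my @p = split "\t",$_;
		$_ = shift (@p);
		while (@p) {
			$_ .= " " x (8 - (length($_)%8));
			$_ .= shift(@p);
		}
	}
	print $_;
}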
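The TAB arithmetic in the script, 8 - (length % 8), can be sketched in C as follows. This is only an illustration of the calculation; the names spaces_to_next_tabstop and expand_tabs are hypothetical, and the second function assumes it is given a single line without a newline in the middle.

```c
#include <stddef.h>

#define TABSTOP 8

/* Spaces needed to reach the next tab stop from a 0-based column.
   A TAB that already sits on a tab stop advances a full 8 columns. */
size_t spaces_to_next_tabstop(size_t col)
{
	return TABSTOP - (col % TABSTOP);
}

/* Hypothetical helper: copy one line from src to dst with TABs
   expanded. dst receives at most cap-1 characters plus a NUL.
   Returns the number of characters written. */
size_t expand_tabs(const char *src, char *dst, size_t cap)
{
	size_t col = 0;	/* output column, also the index into dst */

	if (cap == 0)
		return 0;
	for (; *src != '\0' && col + 1 < cap; src++) {
		if (*src == '\t') {
			size_t pad = spaces_to_next_tabstop(col);
			while (pad-- > 0 && col + 1 < cap)
				dst[col++] = ' ';
		} else {
			dst[col++] = *src;
		}
	}
	dst[col] = '\0';
	return col;
}
```

For example, a TAB at column 3 needs 5 spaces to reach column 8, and a TAB at column 8 needs a full 8 spaces to reach column 16.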

By giving it a name like "cln", I could copy a block of text from a web source and insert it into a shell script. And then, because I use "vim", enter the 5 keystrokes !}cln, and the text would be "cleaned". There was another variant of this script which would clean up the text, word-wrap it and add a '#' character to the start of each line, leaving some readable documentation in the shell script about where I had found something and how I might follow up on it, if I needed to, in the future.

So I went back to look for the code that had originally been presented to me by ChatGPT, when I carried out my first experiment with AI. It was nowhere near as well written as the later version. But it looked as if some of it could be used as a starting point for a C program to replace cln, the Perl script above. I cleaned it up and added indentation (that was totally lacking in the original).

The end result is shown below:

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <string.h>

/*  

    CLN Cleanup text and turn it into plain text [ASCII]
    === ================================================

    Simple C program that reads from standard input, replaces some
    common non-ASCII characters with their ASCII equivalents, and writes
    the result to standard output. It handles a few common accented
    characters and some special characters. It also expands TAB to be
    compatible with tabstop=8 in "vim". You can extend the mapping as
    needed. See also the perl script plaintxt.

    Date       Description
    ========== ==============================
    2025-06-23 First draft written by ChatGPT
    2025-07-19 First draft of u2asc written by ChatGPT 4
    2025-11-16 Cleaned up and added comments to replace perl script "cln"

*/

// ------------------------------------------------------------------------

// A helper function to map UTF-8 two-byte sequences for some accented letters
// Returns ASCII equivalent or 0 if no mapping

char map_utf8_sequence(unsigned char first, unsigned char second) {
	// Common Latin-1 Supplement characters in UTF-8 start with 0xC3
	if (first == 0xC3) {
		switch (second) {
			case 0x80: return 'A'; // À
			case 0x81: return 'A'; // Á
			case 0x82: return 'A'; // Â
			case 0x83: return 'A'; // Ã
			case 0x84: return 'A'; // Ä
			case 0x85: return 'A'; // Å
			case 0x87: return 'C'; // Ç
			case 0x88: return 'E'; // È
			case 0x89: return 'E'; // É
			case 0x8A: return 'E'; // Ê
			case 0x8B: return 'E'; // Ë
			case 0x8C: return 'I'; // Ì
			case 0x8D: return 'I'; // Í
			case 0x8E: return 'I'; // Î
			case 0x8F: return 'I'; // Ï
			case 0x91: return 'N'; // Ñ
			case 0x92: return 'O'; // Ò
			case 0x93: return 'O'; // Ó
			case 0x94: return 'O'; // Ô
			case 0x95: return 'O'; // Õ
			case 0x96: return 'O'; // Ö
			case 0x99: return 'U'; // Ù
			case 0x9A: return 'U'; // Ú
			case 0x9B: return 'U'; // Û
			case 0x9C: return 'U'; // Ü
			case 0x9D: return 'Y'; // Ý (C3 9F is ß, not Ÿ; Ÿ is U+0178, encoded as C5 B8)
			case 0xA0: return 'a'; // à
			case 0xA1: return 'a'; // á
			case 0xA2: return 'a'; // â
			case 0xA3: return 'a'; // ã
			case 0xA4: return 'a'; // ä
			case 0xA5: return 'a'; // å
			case 0xA7: return 'c'; // ç
			case 0xA8: return 'e'; // è
			case 0xA9: return 'e'; // é
			case 0xAA: return 'e'; // ê
			case 0xAB: return 'e'; // ë
			case 0xAC: return 'i'; // ì
			case 0xAD: return 'i'; // í
			case 0xAE: return 'i'; // î
			case 0xAF: return 'i'; // ï
			case 0xB1: return 'n'; // ñ
			case 0xB2: return 'o'; // ò
			case 0xB3: return 'o'; // ó
			case 0xB4: return 'o'; // ô
			case 0xB5: return 'o'; // õ
			case 0xB6: return 'o'; // ö
			case 0xB9: return 'u'; // ù
			case 0xBA: return 'u'; // ú
			case 0xBB: return 'u'; // û
			case 0xBC: return 'u'; // ü
			case 0xBF: return 'y'; // ÿ
			default: return 0;
		}
	}
	return 0;
}

// ------------------------------------------------------------------------

int main(void) {
	int c;
	int l = 0;	// Position of char in current line
	while ((c = getchar()) != EOF) {
			unsigned char uc = (unsigned char)c;
			if (uc < 128) {
				if (uc == 0x00) {
					putchar(' ');
				} else if (uc == 0x0A) {
					putchar(uc);
					// Reset the counter for LF
					l = 0;
					continue;
				} else if (uc == 0x0D) {
					continue;
				} else if (uc == 0x09) {
					putchar(' ');
					while (++l % 8) {
						putchar(' ');
					}
					continue;
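				} else if (uc < 0x20) {
					// Remaining control characters become SPACE. Fall through
					// to l++ below so that TAB expansion stays aligned.
					putchar(' ');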
				} else {
					// Standard ASCII character, output as is
					putchar(uc);
				}
			} else {
				// Possibly multi-byte UTF-8 character
				// For simplicity, try to read next byte and map two-byte sequences starting with 0xC3
				if (uc == 0xC3) {
					int c2 = getchar();
					if (c2 == EOF) {
						// Unexpected EOF, just output first byte as '?'
						putchar('?');
						break;
					}
					unsigned char uc2 = (unsigned char)c2;
					char mapped = map_utf8_sequence(uc, uc2);
					if (mapped) {
						putchar(mapped);
					} else {
						// Unknown sequence, replace with '?'
						putchar('?');
					}

				} else {
					// Cleanup various non-ISO MS chars often found in HTML data
					switch (uc) {
						case 0x91:
							uc = '`';	// MS Word smart quote
							break;

						case 0x92:
							uc = '\'';	// MS Word smart quote
							break;

						case 0x93:
							uc = '"';
							break;

						case 0x94:
							uc = '"';
							break;

						case 0x95:
							uc = '*';
							break;

						case 0x96:
							uc = '-';	// CP1252 en dash
							break;

						case 0x97:
							uc = '-';	// CP1252 em dash, written as "--"
							putchar(uc);	// first '-'; the second is written after the switch
							l++;
							break;

						default: uc = ' ';
							break;
					}
					putchar(uc);
				}
			}
			l++;
		}
		return 0;
}
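One limitation worth noting: the final switch treats 0x91 to 0x97 as single Windows-1252 bytes. In UTF-8 input the same punctuation arrives as three-byte sequences starting with E2 80, and main() handles each of those bytes separately, so a smart quote ends up as spaces (or, for some sequences, a stray quote character). A possible extension is sketched below. The function name map_utf8_punct and the choice of replacements are assumptions, not part of the program above.

```c
#include <stddef.h>

/* Hypothetical helper: ASCII replacement for common punctuation
   encoded in UTF-8 as E2 80 xx (U+2000 block). Returns NULL when
   the sequence is not in the table. */
const char *map_utf8_punct(unsigned char b1, unsigned char b2,
                           unsigned char b3)
{
	if (b1 != 0xE2 || b2 != 0x80)
		return NULL;
	switch (b3) {
		case 0x93: return "-";    /* U+2013 en dash */
		case 0x94: return "--";   /* U+2014 em dash */
		case 0x98: return "`";    /* U+2018 left single quote */
		case 0x99: return "'";    /* U+2019 right single quote */
		case 0x9C: return "\"";   /* U+201C left double quote */
		case 0x9D: return "\"";   /* U+201D right double quote */
		case 0xA2: return "*";    /* U+2022 bullet */
		case 0xA6: return "...";  /* U+2026 horizontal ellipsis */
		default:   return NULL;
	}
}
```

To use it, main() would need a branch for uc == 0xE2 that reads two more bytes with getchar(), writes the returned string with fputs(), and advances the column counter l by the length of that string so that later TAB expansion stays aligned.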

As with the previous example, this was largely redundant, since the existing Perl scripts were well tested and reliable. However, I undertook this mainly as an exercise to gain some experience with using AI to create code.

AI has been around for years. Google incorporated AI features into the Google Home talking devices and into the summary that Google search gives at the top of each search result. AI does seem like a powerful tool, and I am using it to gather "proof of concept" for some of my programming projects.

However, like any power tool, the results may depend largely on the competence of the person wielding the tool. A power saw, in the right hands, can deliver precise, perfectly straight, professional-looking cuts to various types of timber. However, it can be extremely efficient at cutting off the fingers of a careless, reckless or ignorant operator who doesn't understand the basics of carpentry and woodwork.
