PGTS Pty. Ltd.   ACN: 007 008 568

PGTS Humble Blog

Thread: Tips/Tricks For Programming etc

Gerry Patterson. The world's most humble blogger
Edited and endorsed by PGTS, Home of the world's most humble blogger

Detecting EBCDIC With Perl


Date: Sun, 30 Nov 2008 10:26:21 +1100

Recently I was writing a Perl script and found that I needed to identify whether certain files were ASCII or EBCDIC. There are probably many ways to do this. The easiest way that I know of is to use the "file" command.

For example, on AIX, you might use the following Perl code:

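# Ask the system "file" utility to classify the file named in $foo.
my $ftype = `file "$foo"`;
if ( $ftype =~ /ascii text\n/) {
        print "ASCII file\n";
} elsif ( $ftype =~ /data or International Language text\n/) {
        # The message that AIX gives for EBCDIC content
        print "EBCDIC file\n";
} else {
        # Unrecognised type: show the raw output of "file"
        print $ftype;
}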

However, because this runs an external command for each file, and that command reads at least 1024 bytes of the file, it can be a little slow. More bothersome was the fact that the EBCDIC files had been transferred from a mainframe using various methods. Some were fixed-length binary files, and some had an ASCII "\n" terminator at the end of each line of EBCDIC text. The message returned by the file command also varies with the operating system. For example, on (Ubuntu) Linux, the file command might return the following for EBCDIC files:

ISO-8859 text
or
ISO-8859 text, with very long lines, with no line terminators
or if the file has newline terminators
Non-ISO extended-ASCII text, with LF, NEL line terminators
Whereas for ASCII files, it might return
ASCII text
or
ASCII English text, with CRLF line terminators

There is probably a way to determine a file's type reliably, easily and quickly. However, I know that these particular files are all mainframe reports which begin with ANSI control codes. Mainframe programmers will realise that such a code is just a digit (like "0" or "1"), and the EBCDIC codes for the digits are 0xF0 - 0xF9. In fact, EBCDIC text is practically guaranteed to contain a byte in the range 0x80 - 0xFF, because the EBCDIC letters and digits all fall in that range. And of course bytes of that sort NEVER occur in a 7-bit ASCII file.
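The byte ranges mentioned above can be checked directly in Perl. This is only a minimal sketch: the helper name has_high_bit_byte and the sample strings are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the buffer holds any byte in the
# range 0x80 - 0xFF, otherwise 0.
sub has_high_bit_byte {
    my ($buf) = @_;
    return $buf =~ /[\x80-\xFF]/ ? 1 : 0;
}

# "1REPORT" as it would look in EBCDIC: the digit 1 is 0xF1 and the
# upper-case letters fall in the range 0xC1 - 0xE9.
my $ebcdic_sample = "\xF1\xD9\xC5\xD7\xD6\xD9\xE3";

# The same text in 7-bit ASCII, where every byte is below 0x80.
my $ascii_sample = "1REPORT";

print has_high_bit_byte($ebcdic_sample), "\n";   # prints 1
print has_high_bit_byte($ascii_sample), "\n";    # prints 0
```

A single digit or letter is enough to trip the test, which is why reading only a few bytes from the start of each report can work for files like these.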

Furthermore, since the files in this particular case were either ASCII or EBCDIC, I settled on the following rather quick and dirty subroutine.

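sub is_ebcdic_rpt {
        my $buf;
        # Decompress on the fly. Use "or" rather than "||" here: "||" binds
        # to the string, so the die would never run.
        open RPT,"gzip -dc $_[0]|" or die "Cannot open file $!";
        binmode RPT;
        my $n = read RPT,$buf, 8;
        close RPT;
        # Too little data to judge: treat as not EBCDIC
        return 0 if (!defined $n || $n < 8);
        # Low control bytes, or any byte with the high bit set, mean not ASCII text
        return 1 if ($buf =~ m/[\x00-\x07\x80-\xFF]/);
        return 0;
}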

In the above case the files were all gzipped. Essentially, this subroutine reads the first 8 bytes of the decompressed data and checks whether any of them is in the range 0x80 - 0xFF (it also treats the low control bytes 0x00 - 0x07 as a sign of EBCDIC). Such an approach may not be suitable for you. This would be especially true for EBCDIC text that contains no control codes and consists mostly of spaces (0x40 in EBCDIC, which is '@' in ASCII).
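For reports that might be mostly blank, one possible refinement is to compare how often the EBCDIC space (0x40) and the ASCII space (0x20) occur in a larger sample. This is not what the subroutine above does. It is a rough sketch only, and the name guess_encoding and the rule that ties go to ASCII are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: guess "EBCDIC" or "ASCII" from a sample of raw bytes.
sub guess_encoding {
    my ($buf) = @_;

    # Any byte with the high bit set rules out plain 7-bit ASCII.
    return 'EBCDIC' if $buf =~ /[\x80-\xFF]/;

    # Otherwise count the two kinds of space. tr/// with an empty
    # replacement list counts matches without changing the string.
    my $ebcdic_spaces = ($buf =~ tr/\x40//);
    my $ascii_spaces  = ($buf =~ tr/\x20//);

    # Assumption: ties (including no spaces at all) are treated as ASCII.
    return $ebcdic_spaces > $ascii_spaces ? 'EBCDIC' : 'ASCII';
}

print guess_encoding("\x40" x 16), "\n";          # sixteen EBCDIC spaces: prints EBCDIC
print guess_encoding("Total:   42 items"), "\n";  # ordinary ASCII text: prints ASCII
```

An ASCII file that really does contain long runs of '@' would fool this check, so it only makes sense where, as here, the files are known to be reports in one of the two encodings.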



Copyright     2008, Gerry Patterson. All Rights Reserved.