PGTS Humble Blog
Thread: Tips/Tricks For Programming etc
Gerry Patterson. The world's most humble blogger
Edited and endorsed by PGTS, Home of the world's most humble blogger
Detecting EBCDIC With Perl
Date: Sun, 30 Nov 2008 10:26:21 +1100

Recently I was writing a Perl script and found that I needed to identify whether certain files were ASCII or EBCDIC. There are probably many ways to do this. The easiest way that I know of is to use the "file" command.
For example in AIX, you might use the following perl code:
    my $ftype = `file $foo`;
    if ( $ftype =~ /ascii text\n/ ) {
        print "ASCII file\n";
    } elsif ( $ftype =~ /data or International Language text\n/ ) {
        print "EBCDIC File\n";
    } else {
        print $ftype;
    }
However, because this runs an external command that reads at least 1024 bytes of each file, it can be a little slow. More bothersome was the fact that the EBCDIC files had been transferred from a mainframe using various methods. Some were fixed-length binary files and some had an ASCII "\n" terminator at the end of each line of EBCDIC text. The message returned by the file command also varies with the operating system. For example, in (Ubuntu) Linux, the file command might return the following for EBCDIC files:
    ISO-8859 text
or
    ISO-8859 text, with very long lines, with no line terminators
or, if the file has newline terminators,
    Non-ISO extended-ASCII text, with LF, NEL line terminators
Whereas for ASCII files, it might return
    ASCII text
or
    ASCII English text, with CRLF line terminators
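The messages quoted above could be mapped to a guess with a small lookup. This is only a sketch: classify_file_output is a hypothetical helper name, the patterns cover just the strings shown in this post, and treating "ISO-8859 text" as EBCDIC is only reasonable when you already know each file is either ASCII or EBCDIC.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a description string from the file command
# into a guess. Only the messages quoted in this post are recognised.
sub classify_file_output {
    my ($desc) = @_;

    # Checked first, because "extended-ASCII" also contains the word ASCII.
    return 'ebcdic'
        if $desc =~ /ISO-8859 text|Non-ISO extended-ASCII text|data or International Language text/;

    # Matches "ascii text" (AIX) as well as "ASCII English text, ..." (Linux).
    return 'ascii' if $desc =~ /\bascii\b.*\btext\b/i;

    return 'unknown';
}

print classify_file_output("ASCII English text, with CRLF line terminators"), "\n";
print classify_file_output("Non-ISO extended-ASCII text, with LF, NEL line terminators"), "\n";
```

Anything the helper does not recognise comes back as 'unknown', so new message variants show up rather than being silently misfiled.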
Probably there is a way to reliably, easily (and quickly) determine a file's type. However, I know that these particular files are all mainframe reports which begin with ANSI control codes. Mainframe programmers will realise that such a code is just a digit (like "0" or "1", etc), and the EBCDIC codes for digits are 0xF0 - 0xF9. In fact EBCDIC text is practically guaranteed to have a character in the range 0x80 - 0xFF, because the EBCDIC letters and digits all fall in that range. And of course characters of that sort never occur in a pure 7-bit ASCII file.
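To see the digit range for yourself, the core Encode module can translate ASCII digits into EBCDIC. This is a small sketch; code page 1047 is my assumption here, since the post does not say which EBCDIC code page the reports used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the ten digits as EBCDIC. The code page (cp1047) is an assumption;
# the common EBCDIC code pages place the digits at 0xF0 - 0xF9.
my $ebcdic = encode( 'cp1047', '0123456789' );

# Show each resulting byte in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $ebcdic;
print "$hex\n";

# True when no byte falls in the 7-bit ASCII range 0x00 - 0x7F.
my $all_high = ( $ebcdic !~ /[\x00-\x7F]/ ) ? 1 : 0;
print "all bytes >= 0x80: $all_high\n";
```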
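Furthermore, since in this particular case the files were either ASCII or
EBCDIC, I settled on the following rather quick and dirty subroutine.
    sub is_ebcdic_rpt {
        my $buf;
        # List-form piped open avoids the shell; "or" (not "||") lets die fire.
        open( my $rpt, '-|', 'gzip', '-dc', $_[0] ) or die "Cannot open file $!";
        binmode $rpt;
        my $n = read $rpt, $buf, 8;
        close $rpt;
        return 0 if ( !defined $n || $n < 8 );
        # Low control bytes or any high-bit byte cannot be plain ASCII text.
        return 1 if ( $buf =~ m/[\x00-\x07\x80-\xFF]/ );
        return 0;
    }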
In the above case the files were all gzipped. Essentially this subroutine reads the first 8 bytes of the decompressed data and checks whether any of them is in the range 0x80 - 0xFF (or 0x00 - 0x07). Such an approach may not be suitable for you. This would be especially true for EBCDIC text whose leading bytes contained no control codes and were mostly spaces (EBCDIC 0x40, which is '@' in ASCII).
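One possible way to soften that limitation is to look at a larger sample and also treat a buffer dominated by 0x40 bytes as EBCDIC. The sketch below is my own variation, not the subroutine used for these reports: guess_encoding is a hypothetical name and the 50% threshold is an arbitrary assumption. It works on a buffer you have already read, so it does not care how the bytes were obtained.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether a byte buffer is ASCII or EBCDIC.
# Assumes the data is known to be one or the other.
sub guess_encoding {
    my ($buf) = @_;
    my $len = length $buf;
    return 'unknown' if $len == 0;

    # Any high-bit byte rules out 7-bit ASCII.
    my $high = ( $buf =~ tr/\x80-\xFF// );
    return 'ebcdic' if $high > 0;

    # EBCDIC space is 0x40 ('@' in ASCII). Real ASCII text is unlikely
    # to be mostly '@'. The 0.5 threshold is an arbitrary assumption.
    my $at = ( $buf =~ tr/\x40// );
    return 'ebcdic' if $at / $len > 0.5;

    return 'ascii';
}

print guess_encoding("\xF1\xD9\xC5\xD7\xD6\xD9\xE3\x40"), "\n";
print guess_encoding("1REPORT A"), "\n";
print guess_encoding( "\x40" x 8 ), "\n";
```

The first call is an EBCDIC "1REPORT " sample, the second is plain ASCII, and the third is the all-spaces case that a high-bit check alone would miss.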