To Compile or Not to Compile?
By Gerry Patterson
A Problem That Was Not A Problem
The transition from compiled to interpreted code did not happen overnight for me. One of the main advantages of compiled code is speed of execution. This was a widely accepted programming paradigm and I did not question it. It is especially true of C code. C compilers are lean and mean compared to other compilers, and the binaries that result from C code are very quick. With many years' experience as an assembler programmer, I had taken to C like a duck to water. The Microsoft PC community had adopted C in the eighties and I had seen it as the obvious choice for programming on a PC.
Then, getting back to my mainframe roots in the early nineties, I encountered a problem that wasn't a problem while working on an HP-UX system. To explain this self-contradiction it is necessary to go into some detail about the project I was working on. So we may have to wade through some acronyms, computer jargon and other techno-babble.
I was converting data from a DL/I IBM mainframe database to an HP-UX mid-range Oracle RDBMS. The IBM programs were written in COBOL, because that was the only compiler available. In fact, I did not write the COBOL programs; I generated them with my own generator. I developed blocks of COBOL code as templates which could be assembled into a finished COBOL program. A JCL job would read descriptions of the DL/I schemas and the output was parsed by a REXX script. Using this information the REXX procedure assembled the final COBOL program, which would read the DL/I tables and produce an output file for download. I chose a mixture of REXX and COBOL for flexibility. On the face of it, you might think it would have been easier to just roll up my sleeves and hack out a COBOL program for each download. However, if I had chosen this path, it could have taken up to three days for each separate program to be written and debugged. Cutting COBOL code was, and remains, very labour intensive, which is why there were so many COBOL programmers in days gone by. So, the reason for the complex mix of JCL and REXX producing COBOL was to produce a solution that was quick to implement as well as flexible. The final suite of programs would let me just nominate a table and some selection criteria, and within a matter of minutes I would have a COBOL program generated by the REXX and JCL. The target database was still undergoing development and was experiencing teething problems. Past experience with data conversion projects of this nature had led me to expect last-minute changes. So a flexible solution gave me the breathing space I needed during the crucial cut-over phase, and would allow me to accommodate last-minute requests from unreasonable project managers.
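The generator itself is long gone, and it ran in REXX and JCL rather than anything Unix, but the template idea can be illustrated with a toy sketch in shell and awk. The one-line schema format and the COBOL fragments here are invented for illustration:

```shell
# Toy sketch of the generator idea: a one-line schema description
# ("SEGMENT FIELD FIELD ...", an invented format) drives a canned
# COBOL template, with one MOVE statement spliced in per field.
echo "CUSTOMER CUST-NAME CUST-ADDR CUST-BALANCE" |
awk '{
    printf "       IDENTIFICATION DIVISION.\n"
    printf "       PROGRAM-ID. X%s.\n", $1
    for (i = 2; i <= NF; i++)
        printf "           MOVE %s OF %s TO WS-FIELD-%d.\n", $i, $1, i - 1
}'
```

The real thing read the DL/I schema descriptions from a JCL job's output, but the principle is the same: the schema drives a loop that splices field names into canned blocks of COBOL.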
It turned out that the best way to quickly migrate the data was via a PC running Microsoft DOS. The direct connection from the HP to the mainframe was extremely slow, because the route between them had to go via a slow link to Chicago. Now, you may find it strange that in order to send data to the machine next door, I would have to route it half-way around the world. The PC, as it turned out, had a high-speed link to the local SNA network via a token ring LAN, and a link to the high-speed TCP/IP HP network. This made it a good candidate to be used as a bridge between the mainframe and the Unix host. I don't want to get bogged down in too much technical detail. Suffice to say it was a networking thing. At the start of the nineties I was new to Oracle and did not realise the potential of the fabulous, though at the time poorly documented, SQL load utility, so I chose to transform the data with a program. While working on mainframe-PC data conversion projects in the past, I had written programs in Borland C for the Intel chip which converted data from a mainframe download to an ASCII format more amenable to non-mainframe systems. I recycled these programs to perform a similar function in the data migration. This reduced the size of the files and hence the time taken for upload to the HP.
It soon became apparent that the conversion programs on the PC were a weak link. I had re-used the original code, which had been written for a once-only download rather than as a generic solution. If I stuck to this scheme and there were last-minute additions or changes, I would have to write a C program for each additional file I chose to download. Furthermore, each program took on average half an hour to run, which would add about five hours to the cutover. Although I was willing to wear this extra time, it was the coding time that was the most serious weakness. Nor could I easily take the C programs and re-compile them for the HP. There was a fair amount of byte-level manipulation in the programs, which, like many quick and dirty once-off programs, had been written in a hurry with no thought given to portability. Because the Intel processor uses the so-called "backwords" (little-endian) byte order, they could not be ported to the HP without considerable re-writing.
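The byte-order trap is easy to demonstrate from a shell; the commands themselves (printf and od) are portable, it is only the answer that changes with the host:

```shell
# Store two raw bytes, 0x02 then 0x01, and ask od to read them back as
# one 16-bit integer in the *host's* native byte order.
printf '\002\001' | od -An -td2
# A little-endian (Intel) host reads the low byte first and prints 258
# (0x0102); a big-endian host like the HP would print 513 (0x0201).
```

C code that does byte-level arithmetic on raw integers bakes one of these orderings in, which is why those quick and dirty programs would not survive the trip to the HP.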
I decided to write a generic conversion method in awk on the HP. I had spent a considerable amount of time experimenting with awk on my home system, which was running Coherent Unix. I chose awk because I needed to get the code written quickly. The main advantages were:
- Reduced coding time. Because awk is an interpreter, it was possible to test components easily on the fly and build up a working finished product.
- Efficiency. It is possible to accomplish a task in awk with as much as eighty per cent less code. This is because awk has been designed specifically for text manipulation. The syntax is inherently terse and powerful, even more so than C's. This also reduces coding time.
- Existing algorithms. Despite these distinctive features, awk is very similar to C, and it would be easy to re-use any appropriate algorithms from the existing C code (i.e. the parts without Intel-specific byte-order code).
In summary, the main advantages of awk were flexibility and reduced coding time. These were important considerations for once-off projects like data migration. I supposed that, being interpreted, it might take a little longer to run than the existing C programs, which were compiled. I took it as an article of faith that compiled programs would always out-perform interpreted programs. The new 486 processor had been glorified as Intel's gift to programmerkind. In some circles, it was proclaimed that the amazing micro-chip could rival more expensive mid-range and mainframe processors. These opinions were often accompanied by graphs, figures and phrases like clock speed, megaflops, whetstones and other benchmark techno-babble. They seemed very well-informed, so I accepted them uncritically. Nevertheless, I decided that within reasonable limits, performance was not the most important factor. In the overall scheme of things, it was more important to save on programming time, so I set about coding an awk conversion script.
Of course, writing a program like this in awk is a doddle. Within a couple of hours I had finished a script. I still believe that ease of coding remains the most convincing argument for scripting engines like awk. I would have sweated for days over an equivalent C program, and I could easily have said goodbye to a whole fortnight, if I'd been gripped by a sudden though severe bout of temporary insanity and attempted to code it all in COBOL. Working in the shell with a powerful set of text-manipulation tools, it was possible to write a few lines of awk code, test them, write a few more, test them, write a few more ... Ok, you get the picture.
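The original script is long gone, but the kind of conversion described takes only a handful of awk lines. A minimal sketch, with invented fixed-width column positions (1-8 key, 9-28 name, 29-36 amount in cents):

```shell
# Minimal sketch: carve pipe-delimited fields out of a fixed-width
# record. The sample record and column layout are invented.
printf '%s\n' '00000042John Smith          00012550' |
awk '{
    key    = substr($0,  1,  8)
    name   = substr($0,  9, 20)
    amount = substr($0, 29,  8)
    sub(/ +$/, "", name)               # strip trailing blanks
    printf "%d|%s|%.2f\n", key, name, amount / 100
}'
```

Each `substr` call is one column of the download; growing the script a field at a time, testing as you go, is exactly the write-a-few-lines, test, write-a-few-more cycle described above.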
I was ready for a test download to really exercise my script. I copied one of the download files and ran my test. I entered a command line similar to the one that would be placed in the final conversion script, the one that would convert all the data at cut-over. And nothing happened! Ah well, I thought, it's a bit like the old adage:
If at first you do not succeed ...
You must be a programmer!
So I reviewed the initial tests I had done prior to the download. I also reviewed the awk code. Everything looked ok. It looked as though I would need to add some diagnostics to the awk script. Before doing this, I retrieved, from shell history, the command line I had entered for the test run. I inspected it carefully. It looked ok. Then I did something that I usually would not do. I pressed [Enter] (in order to run it again). On numerous occasions in the past I had made scathing comments like "There is a rumour circulating that the programmer who wrote this put in some special code to detect how many times you run it, and the thousandth time it will work!", as some misguided computer user repeatedly ran the same program again and again with the same input conditions, in the futile belief that this might make it work. Such sarcasm is often lost, as this continues to be a common response to software that does not work. How often do you see users click repeatedly on an icon in the vain hope that it will eventually work? And how often do they repeat this futile behaviour? Well, sometimes it works with old hardware, why shouldn't it work with software? How many rhetorical questions should I put in this paragraph? Anyway, on this occasion I ran the awk script again. I had expected the system to pause for half an hour while awk digested the input data. However it just came back with a blinking cursor ... except that this time, because I was more alert, I thought I detected a small delay. Perhaps it was just a tenth of a second? I checked the output area and voila! There were the output files. It had worked all along! It had worked the first time and the second time! I just thought it hadn't worked because it was so mind-bogglingly fast, and hence I had not bothered to check for output. In fact, the awk program had slurped up many thousands of records and spat out the result in the blink of an eye!
A similar algorithm on the 486 had taken half an hour to produce the same result.
And that was the problem that was not a problem.
Awk, Awk, Awk!
Needless to say, that was the last time I wrote C programs for a database conversion project. It certainly wasn't the last database conversion I have worked on. But after this experience, there seemed only one logical choice for data conversion, and that was awk! Most conversions must be undertaken within time constraints and cost constraints. And usually these are one and the same. The entire suite of Unix text manipulation tools in general, and awk in particular, offer an excellent way to save time and money in the conversion process. And if I have to give a fixed-price quote for the job, it could be my time and my money if I fall behind on the schedule. And so, until recently, it remained my choice for data conversion and many other tasks.
And I discovered that there were many other uses that awk could be put to, besides data conversion. Admittedly text manipulation is awk's forte. However, it can also be combined with other utilities like Oracle's sqlplus. The traditional approach to writing a program for Oracle is to use Pro*C or PL/SQL. A considerable amount of time needs to go into design and coding before the program can be put into production. And as a result, maintenance and tuning can become the major tasks as the database grows and requirements change. An alternative to a single monolithic process is to create the SQL parts as separate queries running in sqlplus and pipe the results into awk. The shell can act as the glue that holds the whole thing together. The advantage of this approach is flexibility. It allows the SQL parts to be tuned for maximum performance while the awk parts do what awk does best. The initial testing of the concept can become the finished product. Tuning will be easier because the procedure is already deconstructed.
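A sketch of the shape of such a pipeline. Since the point is the plumbing rather than Oracle itself, printf stands in for sqlplus here, emitting invented query output (item, quantity, unit price):

```shell
# The real pipeline would look like:
#   sqlplus -s user/pw @query.sql | awk '...' | sort
# Here printf plays the part of sqlplus, and awk does the summarising
# that a monolithic PL/SQL procedure would otherwise do.
printf '%s\n' 'WIDGET 3 19.95' 'GADGET 1 5.00' 'WIDGET 2 19.95' |
awk '{ total[$1] += $2 * $3 }                      # accumulate value per item
     END { for (item in total) printf "%s %.2f\n", item, total[item] }' |
sort
```

Each piece can now be tuned in isolation: the query inside sqlplus, the summarising in awk, with the shell pipe as the only interface between them.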
The question of performance is interesting, because I had been misled by opinions in technical journals. These opinions seemed well-informed, and the conclusions may have been true for rocket science or for manipulating graphical objects. And if you are a rocket scientist, please don't take offence, I love your work, really. However, many of us drones at the coal-face aren't always concerned with linear algebra, partial differential equations or manipulating three-dimensional objects in a two-dimensional space. Most often I find myself trying to move large amounts of digital information from one form of magnetic storage to another, changing it somewhat in the process. And it doesn't matter what clock-speed the CPU runs at or how many megaflops it is capable of. If the process is I/O bound, then that is the end of the story. In the example above, the C program running on a 486 used a simple algorithm. It would read a record from an input file, do some manipulation of the data, and write it to an output file. If the input file contains 20,000 records, the program performs 40,000 individual I/O operations. Ok, if I wanted to be fancy I could have buffered the input and the output in memory caches. However, if I start handling my own buffers, I might end up doing memory management as well. And do I want to do that? I mean, if I wanted to be fancy I could go to work dressed as The Phantom. I doubt that would cut coding time though.
Unix already does highly effective buffering on commands, and awk seems to take advantage of this, which is why the same algorithm was amazingly fast on an HP 9000. These days most people would be more likely to use a 486 as a door-stop rather than a computer. However, there is a way to get quite respectable performance out of a 486 system, provided you have sufficient RAM. Just install Linux (or BSD) on it. It is an excellent way to breathe new life into old computer hardware. And provided you don't intend to use it as a graphics workstation, the performance will be quite respectable.
Well, as we all know by now, as far as scripting engines go, there is a new kid on the block. Around about the turn of the century some users asked me if it were possible to get awk on their PCs. They'd heard about some of the things I was doing on Unix boxes and thought that some awk scripts might be handy on PCs. My answer was circumspect. I did have awk on my laptop, even though it was a Windows NT machine. There is an excellent suite of freely available Unix utilities called Cygwin, which has lately been taken over by Red Hat. The latest Windows 2000 version of this software is very good, and I would recommend it for Unix programmers and system administrators. The Windows NT version that I had on my laptop in the late nineties was not the sort of thing I would have installed on the average user's machine. It might have been possible to copy just the awk binary and the necessary DLLs to target machines. However, awk really works best if you have a shell to glue things together. I tried searching the web for awk and Microsoft NT. Google came back with page after page of hits for perl! Many of these hits contained text which claimed that perl was a scripting language that combined elements of sed, awk and shell in a single scripting engine. This seemed an extravagant claim. Now, it probably seems strange that, given my background, I had managed to live through the nineties without encountering perl. I had experimented with C++ in the nineties. I had given a fixed-price quote on a job. When the quote was accepted, I purchased Borland's C++ and set to work. I had seen a video featuring Bjarne Stroustrup at the start of the decade, and the arguments for object-oriented programming seemed convincing. So there I was, preparing to inherit the wheel rather than re-invent it. And I have to say, that as a promo, that was such a great line! Why re-invent the wheel when you can INHERIT it?
However, the reality I discovered was that I had inherited thousands of wheels within wheels, most of which I did not require. Also, the multi-threaded asynchronous nature of C++ opened countless, hitherto unimagined possibilities for errors, and at the same time presented few opportunities for effectively debugging them. The program I was attempting to write was trivial. Had I been writing it in a command line environment in C, it would have been a minuscule task. But it took me over a hundred hours in C++ on Windows NT! Ok, I was expecting a learning curve, and hence I had chosen an easy program, given a fixed quote and decided I would write off any losses to self-education. This job turned out to be, in hourly terms, probably the lowest paid of my entire career! The experience confirmed a suspicion I had started to develop about some of the fashions sweeping the globe at the time. I used to think that CICS COBOL programs were a remarkably convoluted but effective way to use up any spare time you might have. But now I could see an even better way to waste time!
Now, at last, I could see the Emperor's New Clothes, and they appeared to be object-oriented. As the emperor's state of dress came more clearly into focus, a wry smile crossed my lips when I recalled the promo on OOPs (interesting acronym, that one). And the manifest raison d'être of this new fashion statement was to save programmers' time! God have mercy on us all if anyone set out to invent a programming language that was a deliberate waste of time! You might say that I was somewhat underwhelmed by the emperor's new line of garments. Unfortunately, I lumped many of the new languages in the same category. This was partly due to the rush from the fringes to adopt the latest Imperial fashion. I had already seen perl scripts appearing in technical journals, and these had been written in an object-oriented style with what seemed to be calls to Windows OLE. The man pages for perl looked too large to digest in a single sitting, so I dismissed it as another object-oriented contender, without closer inspection.
So after a decade, I finally did investigate perl. This was mainly because I had encountered, on the web, the extraordinary claim that it was a combination of sed, awk and shell. The streaming editor sed has strong search-and-replace capabilities, because of its use of regular expressions, but its decision and programming structures are very clumsy. Awk, on the other hand, has a powerful C-like syntax, primitive array and hash capabilities, and very useful built-in functions. However, awk lacks the control capabilities of the shell, even though the enhanced versions have added support for procedures. I/O remains clumsy, and calling other processes from within an awk script is poorly supported. The shell, of course, is the Grand Vizier of command and control, but is poor in the areas of array-handling, arithmetic and string manipulation. So if anyone could take the best of these three and produce a single language, it would be worthy of investigation. However, a poor implementation would be nothing more than a useless jumble. It turned out that perl indeed lived up to the claims of containing elements of sed, awk and the shell in a single usable package. And with a good knowledge of all three, I was able to start using the interpreter effectively in very little time. However, as the nineteen-ninety-nine-dollar TV commercials say, that's not all! Perl also contains a compact but powerful report writer (hence its name), enhanced regular expressions, and enhanced array and hashing capabilities including multi-dimension support. There are also many built-in functions for array manipulation and I/O support. In addition to the diverse syntax and a surplus of library functions, add-on packages (modules) have been contributed by perl enthusiasts around the world. Perl also works very well on Microsoft platforms. As we all know, Microsoft software lacks a (useful) shell which can assist decision making in scripts.
Perl fills a much-needed market niche, and can be used for controlling and administering Windows NT and Windows 2000 servers.
Over time, I learned about some of these features. In fact perl seems to have so many features that I could imagine only a madman (or a genius) would contemplate creating such a complex mish-mash and making it work. That mad genius is Larry Wall, who is credited with being the creator of perl. And his achievement may prove to be as significant as that of Linus Torvalds. Both of them have created what, for lack of a better word, could be described as post-modern software. And I use that much over-used cliché with considerable trepidation. The word post-modern has been so over-used that we should, by now, be in the post-post-modern age. Or we would be, if the modern age didn't keep rising, phoenix-like, from its own post-modern ashes. Both Linux and perl were invented quickly, borrowing from just about everything, and set free to roam the world gathering critical mass. And the word post-modern is appropriate for both of them.
Ok, I am starting to sound like a perl salesman. I don't need to sell it. There is lots of good material about perl on the web. And this is the key to its success. Perl is powerful and flexible, with contributions from users on the web. Which is not to say that there aren't any critics. As I have already discovered, perl is probably the most eclectic scripting language in the world. It is this loose hybridization which is seen by some as its main weakness. Critics draw attention to the (possible) manipulation of arrays, hashes, variables, pointers and object classes without the rigour of strong data typing. These detractors see perl as a loaded gun that could easily shoot the owner in the foot. Of course, you could make the code a little safer by using strict. In the past, many of the same criticisms were levelled at awk, which does not have the luxury of a strict pragma. Still, programmers who prefer to program while dressed in a straitjacket and a corset, with very tight army boots, a suspender belt and nylons, can always stick to C++. In any case, I find that I often write neat code when I use awk or perl. This is out of fear of the consequences if I write sloppy code. This paradox manifests itself in other human behaviour. For instance, it is widely believed that poor driving conditions contribute to road accidents. This is not true! When driving conditions are poor there are fewer accidents. It is also believed that modern engineering and improved performance and safety in motor vehicles have reduced accidents. However, making cars safer does not decrease accidents. Because modern vehicle design and engineering have minimised road noise, improved road holding, and decreased the perception of motion and the risk of injury (for the vehicle occupants), drivers compensate by driving less safely! Actually it's not that difficult a concept to understand.
There is nothing like the threat of imminent death or permanent disability to make humans slow down, keep their eyes on the road and proceed with caution. That's why there are fewer accidents in poor driving conditions. And also why drivers will drive more carefully if the car does not feel safe. And that's why I write neat code when I am programming in awk or perl. If I wanted to do extremely tricky and highly obscure arithmetic with variable pointers, because ... well, maybe I had just visited a part of the world that had rabies and been bitten by a rabid dog, and with the illness had come the desire to do arcane pointer manipulation ... well then, awk (or perl) would let me do it. If I want to use a variable in the wrong context, then most of the time perl won't complain. Ok, the safety-conscious ones who like to wear a seat-belt can use strict, but there are no air-bags in the perlmobile. Still, you get to feel the wind in your hair.
I'm sure the debate concerning safety vs productivity will go on for some time. The principal difference between motor vehicles in the hardware universe and programs is that it is easier to appreciate the importance of reducing the total number of accidents in a soft universe, rather than minimising the effect on individual occupants as we tend to do in a hard universe (except for pedestrians). So we will see how perl develops. I'm predicting a big decade for perl (even bigger than the preceding one).