Pilot Error
Here’s a very brief story in response to Doug Burns’ post at his blog calling for stories on Human Error.
My very first job out of college was with a large oil company that had Oracle running on a bunch of VAX/VMS systems. We had a lot of code written in FORTRAN with OCI calls (this was back in the early '80s, so no pre-compilers yet). I was working late one night from home, which in itself was unusual because not many people were able to work remotely at that time. We only had 1200 baud modems, for crying out loud, so it was painful to do much of anything remotely.
At any rate, I was working on a program with some kind of iterative processing which took a while to complete. So I'd make a few changes and run it, make a few changes and run it. Well, I noticed that the execution time had slowed down somewhat, so I went looking to see what else was running on the system. (Brief digression: I had become a neophyte sys admin because I was the Oracle DBA and needed some system privileges for doing upgrades and whatnot.) Sure enough, there was a batch job running that was using a lot of CPU.
Well, I had learned about the ability of VMS to set process priorities, and so I thought to myself, "That batch job has all night to run; it shouldn't be slowing me down right now." So I decided to change the priorities so my program would not be competing so heavily with the batch job. Unfortunately, instead of lowering the priority of the batch process, I jacked the priority of my own process way up. (You'll see why I say "unfortunately" in a minute.)
So anyway, the priority change worked out great. My program began running even faster than it had before the batch job kicked off. So I went back to my programming routine. Make a few changes, execute the program, check the results, make a few changes, execute the program, check the results… until I executed the program and it didn't come back. I remember thinking, "Uh oh, I think I messed up the check for getting out of that loop."
So I thought, "Well, I'll just Ctrl-C out and fix it." Unfortunately, in a stupendous example of Murphy's Law, it was at just this point that my modem lost its connection. Great! So I tried to reconnect. The modem was able to establish a connection, but the machine was so busy running the process with the insanely high priority that it didn't have enough spare CPU to log another process in. (Unlike most systems today, VMS had a very strict priority system. The process with the highest priority basically stayed on the CPU as long as it wanted. Oh, and by the way, there was only the one CPU.) So anyway, the program ended up running most of the night and only stopped because it filled up the disk with a log file that it was writing. Needless to say, the real sys admins were not too happy with me the next morning when I showed up at work.