[Note: In 1990, after reading The Soul of a New Machine, I wrote this fun, factual, and unpublishable paper because I enjoyed the style of the book more than technical papers. Five years later, thanks to the Web, it is seeing the light of day. The names and events are true, but the dialogue was probably not this corny.]
"If you were a run-time bug, where would you hide?"
"In the sensor and intel models."
"True. So how do we find them among all that code?"
"Janet told me about a VAX tool called the Performance and Coverage Analyzer (PCA). You link the program with its Collector, which samples the process while it runs. When you are done its Analyzer tells you how much time the program is spending in each subroutine and even at each line."
"How does it know that?"
"The PCA reaches into the process every 10 milliseconds, similar to the way we use the debugger. It records the subroutine and line number that are executing and lets the program continue running. When you think you have a good enough sample run you stop the program and all of that information is stored in a file.
"Then you execute the PCA software. It takes the data out of the file and creates tables and bar charts of the subroutines in the simulation. These are ordered by the number of samples that were taken in each one. PCA tells you the total number of samples that were taken, the number falling into each subroutine, the percentage this is of the total, and the 95% confidence interval on the percentage calculation. It looks like this (Figure 1)."
"So I can use this thing like a glass wall to look into a process while it is running. This is great! They never taught us about anything like this is college."
"That's why we hire computer science majors, to help you statisticians."
We proceed to link the simulation process with the PCA tool.
When Keith and I started this task we expected to spend a couple of our free days finding two or three obvious problems, fix them, and return to our assigned projects. Now it looked like we could make a serious impact on the problem.
The program, the Air Land Battle Assessment Model (ALBAM), simulates all of the major systems involved in war. It flies airplanes, moves ground units, processes all types of sensors, conducts air strikes, operates surface to air missile defenses, engages in ground combat, and performs just about every other major type of combat imaginable. The basic concept is not new, it has been done before with simulations like JESS, AWSIM, GRWSIM, TACSIM, MTM, and many others. The important difference is that the others were training simulations meant to allow military commanders to hone their skills in military exercises. ALBAM, on the other hand, is designed to analyze the performance and interactions of all the systems and determine whether they justify the cost of procurement. Our VAX 8800's give us enough power to include the breadth of coverage and the depth of analysis needed for this type of simulation.
As the code came together, some of us started noticing that it was running slower than we had expected. As the simulation moved through the first three hours of game time it slowed noticeably as more combat occurred and the number of aircraft increased. I had been with the project from the beginning, four years, and Keith Tucker came in just a year later. This made us old hands and gave us a feeling of ownership. The model was ours, if it misbehaved, it was up to us to find out why and discipline it.
Unfortunately, run-time improvements did not interest management at the time. The director needed the simulation to run error free for two days, and he needed all of the new capabilities listed on the PERT project chart. Using two of his most experienced modelers to grease the axles was not on his priority list. Therefore, our request to work on the project was tabled "until we are through the fire drill".
But we knew that there was always another fire drill waiting when the current one was over. It was the standard mode of operation in our department. Keith and I began to steal loose hours from our other projects. Only the two of us knew what we were doing. Now that we had found PCA we knew that we would be able to find the biggest problems, which also meant that this was going to take more time than we had planned.
"Ok. It's linked."
"Right on, lets try it."
"The manual says to run the program normally. I think we want the default data collection for program counter sampling data."
"Say 'GO' already." Keith enters "GO" at the PCA prompt and we wait.
"It sure is slow. The program should have cycled by now."
"There! A three and a half minute cycle. A good sample is going to take at least an hour."
It was 4:00 pm. Some how I didn't think we would be leaving at 4:45 today. I wondered if Tonya had cooked dinner. I hoped not - less problems that way.
"Lets go call Tonya and Sandra and tell them we will be late." Keith's relationship with Sandra was still a secret to most people. Dating someone on the same project did not seem to Keith to be a career enhancing move. But, a year later they went public with the announcement of their marriage and both careers still seem intact.
"Ok, stop the simulation. Lets see what the output looks like."
"Enter 'PCA ALBAM', then at the 'PCA>' prompt enter 'TABULATE'."
"Good night! This thing is spending 28% of its time in FOOTPRINT. And look, another 16% in DYNMEM. Something has got to be wrong."
"Lets go after those two, then try it again."
"Look at this. Every time this routine is called it initializes this array with zeros. What is that, 237 X 149 is 35,313. And this routine is called for all 500 units in the game. That is 17,656,500 zeros each cycle."
"So, do we have to do that?"
"FOOTPRINT uses the array to store unit and sensor information. That array is read in DETECT to process the data. See, DETECT searches each column in the array until it hits a zero. It stops searching when it finds a column with a zero in the first entry."
"I know. Lets go back to FOOTPRINT and recode it. We will initialize the top four variables, (1,1), (1,2),(2,1),(2,2), to zero. Then, as we fill the array with information we will put a zero at the bottom of each column. When we finish the columns of data we will add one more column filled with zeros. That will create a zero fence around the new data. On the other side of the fence will be old data. But who cares. DETECT will never process the old data because it will not cross the zero fence" (Figure 2).
"Great! Compile FOOTPRINT and I will look at DYNMEM."
"Look, the memory is used in GETMEM. One slot for each piece of information in a combat pair. Every time two units enter into combat the information is stored in the slots allocated by DYNMEM. But surely the simulation does not engage in enough engagements each cycle to need one million slots. I'll bet he just picked a big number during development and never got around to modifying it once we had a good simulation to measure our real needs with."
"We could ask him if he hadn't quit. Isn't he at some oil and gas company in Houston now?"
"Yeah, I think so. I'm going to put a counter in GETMEM. It will be an array. Each cycle it will count the number of memory slots actually used out of that million. Then when the simulation stops I will dump the information to a file. That should give us a good idea about how many we need."
"Hey! Is DYNMEM compiled? I've modifies the OPT file so we can link the new stuff."
"Linking. So if this works our PCA output should show a decrease in those two routines but an increase in the others because the saved processing time is being spread out through the whole simulation."
An hour later we were looking at the new PCA output.
"FOOTPRINT is down to 2% and your counter in GETMEM says it never used more than 4,000 memory slots."
"Ok, lets cut DYNMEM to allocating 10,000 slots. That is one percent of what it was doing."
The PCA run with this change showed DYNMEM down to two tenths of one percent. This was too easy. With the PCA tool finding the problems was like shooting fish in a barrel. We saved the run-time information to a standard data file.
"It's ten o'clock. Lets go home. I'm probably in enough trouble by now."
"I can't believe how much we accomplished in only one day. At this rate we will have the model running way below real-time by the end of the week."
"Really. I think we should get a medal."
Unfortunately things did not continue "at this rate". PCA led us right to the doorsteps of several more problems, but fixing them was not quite so easy. In fact, the deeper we got the more we wondered just how big a change we wanted to attempt in fixing a problem. We found the model building link lists of units and events every cycle, then taking the lists apart and returning the memory to free space. But, the software never used the lists. There seemed to be a new method for doing the operations. All of the old plumbing had just not been removed.
We found a major array of information being constructed, used, and destroyed during a cycle. Then a new list with the most current information would be built, used, and destroyed the next cycle. Since most of the information did not change every cycle this was a waste of a lot of CPU. We devised a method to update the existing list when information changed. We build the list when the game starts and just update it with changes from then on.
Some subroutines would search a long list from top to bottom. The fact that the objective of the search was found at the 4th, 5th, or Nth entry did not stop the search. So we added a condition to the DO WHILE statement.
Keith and I found ourselves stealing more time from our official projects and working more nights. But incredible results were coming from it. The run-time was down by 30% in our tests. We were very excited and the project was not a secret any more. Co-workers dropped by to see how it was going and to offer suggestions.
"Look at our new logo."
"What? Your kidding."
I had made a red poster with "RUN-TIME LABORATORY" in large black letters and a gold lightening bolt through the center. I hung it next to the door of our work area. This drew a lot of flack and good natured teasing.
"We have some major run-time improvements here. I think it is time to present them to Davis." George Davis was the Vice-President of the project.
"Ok, but you do it."
The next day Keith came back from the meeting, but he wasn't exactly a returning hero.
"Davis says it looks great, but can we document the improvements? Exactly how much faster is it, and how many man-hours did we take to do it?"
"Well, we didn't measure any of that stuff. The best we have is our word and these horrid notes that we made while working."
"Lets estimate it, but from now on we have to measure what we do. We know what we did but management can't see it without some kind of ruler."
So we claimed a 30% improvement at a cost of 70 man-hours. Then we started keeping records.
We found utility routines that were used by all programs to access the database. These were streamlined.
One routine converted current simulated clock time into elapsed minutes from the start, this was subtracted from another number similarly converted. This was done by so many routines that it was using a significant amount of run-time. John Goodridge decided that there was no reason for the program to work with time in the conventional clock format. Time should be stored as elapsed minutes from the start so that the software could add and subtract them directly. Time would be converted to clock format only when it was displayed to the users.
Kevin Stafford realized that the simulation could process intelligence and sensor information for blue units independent of the red units processing similar information. He got Eric Nathman to help him split the work into two separate processes that ran in parallel and were controlled with event flags. This let us take advantage of both CPUs in the VAX 8800.
Since our database was accessed and changed by several processes it was necessary to lock data when a process was accessing it to prevent other processes from changing it at the same time. These locks had been applied to all of the data tables. One day while generating a new database Hai Ong forgot to apply the locks. When we got it, we found two minutes of game time being processed in just three seconds. This was phenomenal. It should be taking at least twenty-five seconds. Investigations led us back to the lock error, and a run-time opportunity. We determined exactly which tables were accessed by multiple processes and locked only those. The others were free from the overhead of locking and unlocking. This yielded one of the most significant improvements.
"Did you ever dream that we could have made such a difference?"
"Never! The model is down to 50% of the run-time it was using just a month ago. But, if I had known how hard this was going to be from the start I never would have tried it."
"Good thing we were ignorant then."
"Davis was pretty impressed once we started measuring things. Did you see what he wrote on the master savings chart? 'Good work. Congratulations and Thanks. G. Davis'. And he put it on the bulletin board."
"Do you realize how much money this is going to save the company? Twenty modelers, each running the software twenty hours a week, for one year. Now they can get the same answers in only ten hours. Therefore, 20 men X 10 hours X 50 weeks is 10,000 man-hours saved in debug and exercise time this year. All because of these run-time improvements."
"That's probably a little optimistic, but it's in the right neighborhood."
"And all of us spent what - 200 to 300 hours doing it? That is one awesome return on investment. That tiny investment will be saving us man-hours for years. When do we get our medal?"
"Right. I think Davis' note is it."
"Yeah, but it was exciting work while it lasted. The simulation can run at twice real-time now. And if it gets too slow again I'll bet we can go in and find more improvements."
"I never imagined saving this much. Our minds were pretty paralyzed when we started. We thought there was no way anyone could have made such expensive mistakes when we were putting this software together. We had to break our paradigm of what the software had to be."
"Our not being intimately familiar with the software we were working on helped a lot. We saw things from a different perspective than the original developers. And we did not appreciate the magnitude of the task until we were so far in that we couldn't turn back."
Hunting and killing run-time problems was one of the most exciting projects we have ever worked on. It was never officially sponsored nor assigned, but it was officially tolerated. It was in our blood, those stolen hours and late nights were filled with electricity and possibilities. We owned this task, we had a stake in the outcome. And as it turned out the project desperately needed those improvements six months later to perform on a contract.
Guide to VAX Performance and Coverage Analyzer, Digital Equipment Corporation, Maynard, Mass., April 1989.
(VAX is a trademark of the Digital Equipment Corporation.)
Keith Tucker - Then: Software Project Lead. Now: As a result of two acquisitions, a
Staff Analyst in the Operations Research Department at Lockheed-Martin Fort
Roger Smith - Then: Senior Operations Analyst. Now: Principal Simulation Engineer at Mystech Associates in Orlando, Florida.
George Davis - Then: Vice President of the Combined Arms Systems Engineering
Center. Now: Vice President of SINCGARS radio program in Melborne, Florida.
Kevin Stafford - Then: Operations Analyst. Now: Avionics software developer at Lockheed-Martin Forth Worth.
John Goodridge - Then: Operations Analyst. Now: Technical Specialist at Lockheed-Martin Fort Worth.
Janet Williams - Then: Software Genius. Now: Team Software in Austin, Texas.
Dave Meyers - Then: Software Genius. Now: Some oil company in Houston, Texas.
Eric Nathman - Then: Software Genius. Now: Unknown.