Editor’s note: While this topic isn’t entirely security-specific, Trend Micro leader William Malik has career expertise in the subject and shared his perspective.
There was a provocative report recently that the Governor of New Jersey told reporters that the state needed COBOL programmers. The reason was that the number of unemployment claims had spiked, and the legacy system running unemployment claims had failed. That 40-year-old system was written in COBOL, so the conclusion was that the old language had finally given out. Hiring COBOL programmers would let the state update and modernize the application to handle the increase in load.
This might be the problem, but it probably is not. Here’s why.
- Software doesn’t wear out, and it doesn’t rust. Any code that’s been running for 40 years is probably rock solid.
- Computers have a fixed amount of specific resources: processing power, memory, network capacity, disk storage. If any of these is used up, the computer cannot do any more work.
- When a computer application gets more load than it can handle, things back up. Here’s a link to a process that works fine until excessive load leads to a system failure. https://www.youtube.com/watch?v=NkQ58I53mjk Trigger warning – this may be unsettling to people working on assembly lines, or on diets.
- Adding resources only helps if they fit the machine’s architecture proportionately.
- Incidentally, throwing a bunch of people at an IT problem usually makes things worse.
From these points, we learn the following lessons.
Software Doesn’t Wear Out
Logic is indelible. A computer program is deterministic. It will do exactly what you tell it to do, even if what you tell it to do isn’t precisely what you meant it to do. Code never misbehaves – but your instructions may be incorrect. That’s why debugging is such a hard problem.
Incidentally, that’s also why good developers usually make lousy testers. The developer focuses her mind on one thing – getting a bunch of silicon to behave. The tester looks for faults, examines edge conditions, limit conditions, and odd configurations of inputs and infrastructure to see how things break. The two mindsets are antithetical.
Once a piece of software has been in production long enough, the mainline paths are usually defect-free. In fact, the rest of the code may be a hot mess, but that stuff doesn’t get executed, so those defects are latent and do not impact normal processing. Ed Adams published a report in 1984 titled “Optimizing Preventive Service of Software Products” (https://ieeexplore.ieee.org/document/5390362, originally published in the IBM Journal of Research and Development, v 28, n 1). He concluded that once a product has been in production for a sufficient time, it is safer to leave it alone: installing preventative maintenance is more likely to disrupt the system than to protect it. Most IT organizations know this, having learned the hard way. “If it ain’t broke, don’t fix it” is the mantra for this wisdom.
As a corollary, new software has a certain defect rate. Fixes to that software typically have a defect rate ten times greater. So if a typical fix is large enough, you put in a new bug for every bug you take out.
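That arithmetic can be sketched in a few lines. The 10x figure is the article’s; the defect rates and fix size below are hypothetical, for illustration only:

```python
# Illustrative numbers only; the 10x ratio is the article's rule of thumb.
BASE_DEFECTS_PER_KLOC = 1.0   # defect rate of mature, well-exercised code
FIX_DEFECTS_PER_KLOC = 10.0   # fixes typically run about ten times buggier

def net_defects_after_fix(bugs_removed: int, fix_size_kloc: float) -> float:
    """Bugs removed, minus the bugs the fix itself is expected to introduce."""
    introduced = fix_size_kloc * FIX_DEFECTS_PER_KLOC
    return bugs_removed - introduced

# A one-bug fix that touches 100 lines (0.1 KLOC) is expected to introduce
# one new defect -- a net improvement of zero.
print(net_defects_after_fix(bugs_removed=1, fix_size_kloc=0.1))  # → 0.0
```

At these rates, any fix larger than about a hundred lines per bug removed makes the code worse, which is exactly why mature systems are left alone.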
Computers Are Constrained
All computers have constraints. The relative amounts of resources mean some computers are better for some workloads than others. For mainframes, the typical constraint is processing power. That’s why mainframes are tuned to run at 100% utilization, or higher. (How do you get past 100% utilization? Technically, of course, you can’t. But what the measurements are showing you is how much work is ready to run, waiting for available processing power. The scale actually can go to 127%, if there’s enough work ready.)
Different types of computers have different constraints. Mainframes run near 100% utilization – the CPU is the most expensive and constrained resource. PCs, on the other hand, never get busy. No human can type fast enough to drive utilization above a few percent. The constrained resource on PCs is typically disk storage. That’s why different types of computers do better at different types of work. PCs are great for user interface stuff. Mainframes are perfect for chewing through a million database records. By chance we developed mainframes first; that’s not an indictment of either type. Both are useful.
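The relationship between utilization and waiting work can be sketched with the textbook M/M/1 queueing formula. This is a simplification (real mainframe workload managers are far more sophisticated), but it shows why the ready queue grows explosively as a system approaches full utilization:

```python
def mm1_avg_jobs_in_system(utilization: float) -> float:
    """Average number of jobs in an M/M/1 system: rho / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)

# At 50% busy, about 1 job is in the system; at 99% busy, about 99 are.
for rho in (0.50, 0.90, 0.99):
    print(f"{rho:.0%} busy -> {mm1_avg_jobs_in_system(rho):.0f} jobs in system")
```

Doubling utilization from 50% to nearly 100% doesn’t double the backlog; it multiplies it by two orders of magnitude. That backlog of ready-to-run work is what a “127% busy” measurement is really reporting.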
Computers Can Run Out of Resources
Any IT infrastructure has a design point for load. That is, when you put together a computer you structure it to meet the likely level of demand on the system. If you over-provision it, you waste resources that will never be used. If you under-provision it, you will not meet your service level agreements. So when you begin, you must know what the customers – your users – expect in terms of response time, number of concurrent transactions, database size, growth rates, network transaction load, transaction mix, computational complexity of transaction types, and so on. If you don’t specify what your targets are for these parameters, you probably won’t get the sizing right. You will likely buy too much of one resource or not enough of another.
Note that cloud computing can help – it allows you to dynamically add capacity to handle peak load. However, cloud isn’t a panacea. Some workloads don’t flex that much, so you end up paying a premium for flexibility you don’t use, when in-house capacity would be more economical and efficient.
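The flex-versus-fixed trade-off comes down to simple arithmetic. The rates below are hypothetical (real pricing varies enormously); the point is the comparison, not the numbers:

```python
# Hypothetical hourly rates, for illustration only.
FIXED_RATE = 0.05   # $/unit-hour for owned capacity (paid whether used or not)
CLOUD_RATE = 0.12   # $/unit-hour for on-demand cloud capacity

HOURS_PER_YEAR = 8760

def annual_cost_fixed(peak_units: float) -> float:
    """Own enough capacity for the peak; pay for all of it, all year."""
    return peak_units * FIXED_RATE * HOURS_PER_YEAR

def annual_cost_cloud(base_units: float, peak_units: float,
                      peak_hours: float) -> float:
    """Pay on demand: the baseline all year, plus the peak only while it lasts."""
    return (base_units * CLOUD_RATE * HOURS_PER_YEAR
            + (peak_units - base_units) * CLOUD_RATE * peak_hours)

# A steady workload (baseline 90, peak 100 for most of the year) favors
# owned capacity; a spiky one (baseline 10, peak 100 for 200 hours) favors cloud.
print(annual_cost_fixed(100) < annual_cost_cloud(90, 100, 8000))   # → True
print(annual_cost_fixed(100) > annual_cost_cloud(10, 100, 200))    # → True
```

The workload shape, not the technology, decides which option is cheaper.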
Add Capacity in Balance
When I was in high school our physics teacher explained that temperature wasn’t the same as heat. He said “Heat is the result of a physical or chemical reaction. Temperature is simply the change in heat over the mass involved.” One of the kids asked (snarkily) “Then why don’t drag racers have bicycle tires on the back?” The teacher was caught off guard. The answer is that the amount of heat put into the tire is the same regardless of its size, but the temperature is related to the size of the area where the tire touches the road. A bicycle tire has only about two square inches on the pavement; a fat drag tire has 100 square inches or more. So the same horsepower spinning the tire will cause the bicycle tire’s temperature to rise about 50 times more than the gumball’s will.
When you add capacity to a computing system, you need to balance related capacity elements or you’ll be wasting money. Doubling the processor’s power (MHz or MIPS) without proportionately increasing the memory or network capacity simply moves the constraint from one place to another. What used to be a system with a flat-out busy CPU now becomes a system that’s waiting for work with a queue at the memory, the disk drive, or the network card.
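The bottleneck-shifting effect can be illustrated with a toy pipeline model. The stage names and capacities below are invented; the rule they demonstrate is general:

```python
def bottleneck(capacities: dict) -> tuple:
    """A pipeline can go no faster than its slowest stage.

    Returns (stage_name, throughput) of the constraining resource."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]

stages = {"cpu": 1000, "memory": 1200, "disk": 1500}  # transactions/sec
print(bottleneck(stages))   # → ('cpu', 1000): the CPU constrains the system

stages["cpu"] *= 2          # double the processor alone...
print(bottleneck(stages))   # → ('memory', 1200): the constraint just moved
```

Doubling the CPU bought a 20% improvement, not 100%, because the queue simply re-formed at the memory. Balanced upgrades are the only ones that pay off in proportion to their cost.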
Adding Staff Makes Things Worse
Increasing any resource creates potential problems of its own, especially if the system’s underlying architecture is ignored. For the software development process (regardless of form) one such resource is staff. The book “The Mythical Man-Month” by Fred Brooks (https://www.barnesandnoble.com/w/the-mythical-man-month-frederick-p-brooks-jr/1126893908) discusses how things go wrong.
The core problem is that adding more people requires strong communications and clear goals. Too many IT projects lack both. I once was part of an organization that consulted on a complex application rewrite – forty consultants, hundreds of developers, and very little guidance. The situation degenerated rapidly when the interim project manager decided we shouldn’t waste time on documentation. A problem would surface, the PM would kick off a task force, hold a meeting, and send everybody on their way. After the meeting, people would ask what specific decisions had been reached, but since there were no minutes, nobody could be sure. That would cause the PM to schedule another meeting, and so on. Two lessons I learned concern meetings:
- If you do not have an agenda, you do not have a meeting.
- If you do not distribute minutes, you did not have a meeting.
When you add staff, you must account for the extra overhead of managing each person’s activities, and establish change-control processes that every participant must follow. Scrum is an excellent way of surfacing potentially harmful changes early. By talking face to face regularly, the team knows everything that’s going on. Omit those meetings or rely on second-hand reports and the project is already off the rails. All that remains is to see how far things go wrong before someone notices.
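Brooks’s coordination-overhead argument is often illustrated with the number of pairwise communication paths in a team, which grows quadratically with headcount:

```python
def communication_paths(team_size: int) -> int:
    """Pairwise channels among n people: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2

# A small team can keep everyone informed; forty consultants cannot.
for n in (3, 10, 40):
    print(f"{n:>2} people -> {communication_paths(n)} channels")
```

Going from 3 people to 40 multiplies headcount by about 13 but multiplies the communication channels from 3 to 780, a factor of 260. Without agendas, minutes, and disciplined process, most of those channels carry noise.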
In Conclusion …
If you have a computer system that suddenly gets a huge spike in load, do these things first:
- Review the performance reports. Look at changes in average queue length, response time, transaction flight time, and any relevant service level agreements or objectives.
- Identify likely bottlenecks
- Model the impact of additional resources
- Apply additional resources proportionately
- Continue to monitor performance
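The first step above – spotting a backlog in the performance reports – can be sketched with Little’s Law (L = λW), which relates arrival rate, response time, and in-flight work. The transaction numbers below are invented for illustration:

```python
def in_flight_transactions(arrival_rate_per_sec: float,
                           avg_response_sec: float) -> float:
    """Little's Law: average work in the system = arrival rate x time in system."""
    return arrival_rate_per_sec * avg_response_sec

# Before the spike: 50 tx/sec at 0.2 s response -> 10 transactions in flight.
print(in_flight_transactions(50, 0.2))    # → 10.0
# After the spike: 200 tx/sec at 3 s response -> 600 in flight: a clear backlog.
print(in_flight_transactions(200, 3.0))   # → 600.0
```

A 4x jump in arrival rate that produces a 60x jump in in-flight work tells you the system is past its design point; the extra 15x is queueing, not useful work.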
If you are unable to resolve the capacity constraints with these steps, examine the programs for internal limitations:
- Review program documentation, specifications, service level objectives, workload models and predictions, data flow diagrams, and design documents to understand architectural and design limits
- Determine what resource consumption assumptions were built in per transaction type, and what transaction workload mix was expected
- Verify current transaction workload mix and resource consumption per transaction type
- Design program extension alternatives to accommodate increased concurrent users, transactions, and resource demands per transaction class
- Model alternative design choices, including complexity, size, and verification (QA cost)
- Initiate refactoring based on this analysis
Note that if you do not have (or cannot find) the relevant documentation, you will need to examine the source code. At this point, you may need to bring in a small set of experts in the programming language to recreate the relevant documentation. Handy hint: before you start working on the source code, regenerate the load modules and compare them with the production stuff to identify any patches or variance between what’s in the library and what’s actually in production.
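That handy hint can be sketched as a simple hash comparison. Everything here is hypothetical: the directory layout stands in for load-module libraries, and on a real mainframe you would compare PDS members and account for link-edit timestamps rather than hashing flat files:

```python
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 of a binary, read in chunks to handle large modules."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_drift(rebuilt_dir: Path, production_dir: Path) -> list:
    """Modules whose freshly rebuilt binary differs from what production runs.

    A mismatch means a patch or variance sits between the source library
    and what is actually executing."""
    drifted = []
    for rebuilt in sorted(rebuilt_dir.iterdir()):
        prod = production_dir / rebuilt.name
        if not prod.exists() or digest(rebuilt) != digest(prod):
            drifted.append(rebuilt.name)
    return drifted
```

Any module this flags has been patched (or its source lost) since the last build, and those are the modules to document first before anyone touches the code.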
Bringing in a bunch of people before going through this analysis will cause confusion and waste resources. While to an uninformed public it may appear that something is being done, the likelihood is that what is actually being done will have to be expensively undone before the actual core problem can be resolved. Tread lightly. Plan ahead. State your assumptions, then verify them. Have a good plan and you’ll work it out. Remember, it’s just ones and zeros.
What do you think? Let me know in the comments below, or @WilliamMalikTM.