This paper explains a scientific approach to problem solving. Although it is written to address Information Technology related problems, the concepts may also apply in other disciplines. The methods, concepts, and techniques described here are nothing new, but it is shocking how many “problem solvers” fail to use them. Along the way I will include some real-life examples.
Why do problem solvers guess instead of following a scientific approach to problem solving? Maybe because it feels quicker? Maybe because of a lack of experience in efficient problem solving? Or maybe because it feels like hard work to do it scientifically? Maybe while you keep on guessing and not really solving, you generate more income and add some job security? Or maybe because you violate the first principle of problem solving: understand the problem.
Principle #1. Understand the *real* problem.
Isn’t it obvious that before you can solve a problem, you need to understand it? Maybe. But most of the time the solver will start solving without knowing the real problem. What the client or user describes as “The Problem” is normally only the symptom! “My computer does not want to switch on” is the symptom. The real problem could be that the whole building is without power. “Every time I try to add a new product, I get an error message” is the symptom. Here the real problem could be “Only the last 2 products I tried to add gave a ‘Product already exists’ error”. Another classic example: “Nothing is working”…
You start your investigation by defining the “real problem”. This entails asking questions (and sometimes verifying the answers), and doing some basic testing. Ask the user questions like “When was the last time it worked successfully?”, “How long have you been using the system?”, “Does it work on another PC or for another user?”, “What is the exact error message?”, etc. Ask for a screen print of the error if possible. Your basic testing is to ensure that the end-to-end equipment is up and running. Check the user’s PC, the network, the Web Server, Firewalls, the File Server, the Database back-end, etc. Best case, you will pinpoint the problem already. Worst case, you can eliminate a lot of areas as the cause of the problem.
A real-life example. The symptom according to the user: “The system hangs up at random times when I place orders”. The environment: The user enters the order detail on a form in a mainframe application. When all the detail is completed, the user tabs off the form. The mainframe then sends this detail via communication software to an Oracle Client/Server system at the plant. The Oracle system does capacity planning and returns either an error or an expected order date to the mainframe system. This problem is quite serious, because you can lose clients if they try to place orders and the system does not accept them! To attempt to solve this problem, people started by:
1) Investigating the load and capacity of the mainframe hardware
2) Monitoring the network load between the mainframe and the Oracle system
3) Hiring consultants to debug the communication software
4) Debugging the Oracle capacity planning system
After a couple of months they still could not solve the problem.
The “Scientific Problem Solver” was called in. It took less than a day and the problem was solved! How? The solver spent the day with the user to see what the “real problem” was. It was found that the problem only occurred with export orders. By investigating the capture screen and user actions, it was found that with export orders the last field on the form was always left blank and the user did not tab off this field. The system was not hanging; it was waiting for the user to press “tab” one more time. Problem solved. Note that the “Scientific Problem Solver” had very limited knowledge of the mainframe, of the order capturing system, of the communication software, and of the Oracle capacity planning system. And this brings us to Principle #2.
Principle #2. Do not be afraid to start the solving process, even if you do not understand the system.
How many times have you heard “I cannot touch that code, because it was developed by someone else!”, or “I cannot help because I am an HR Consultant and that is a Finance problem”? If your washing machine does not want to switch on, you do not need to be an Electrical Engineer, Washing Machine Repair Specialist, Technician, or any other specialist to do some basic fault finding. Make sure the plug is working. Check the trip switch, etc. “I have never seen this error before” should not stop you from attempting to solve the problem. With the error message and an Internet search engine, you can get lots of starting points.
In every complex system there are a couple of basic working principles. System A that reads data from System B can be horribly complex (maybe a Laboratory Spectrometer that reads data from a Programmable Logic Controller via an RS-232 port). But there are some basics to test for: Do both systems have power? Is there an error message in the event log on one of these systems? Can you “ping” or trace a network packet from the one system to the other? Try a different communication cable. Search the Internet for the error message.
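For systems that talk over a network, the “can the one system reach the other?” check is easy to script. The following is only a minimal sketch: it assumes the systems expose a TCP port, and the function names `can_connect` and `basic_checks` are made up for this illustration. It does not apply to a serial (RS-232) link, where swapping the cable remains the basic test.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def basic_checks(endpoints):
    """Check every named endpoint and report which ones answer.

    endpoints maps a label to a (host, port) pair, for example
    {"database": ("db-server", 1521)} (hypothetical names).
    """
    return {
        name: can_connect(host, port)
        for name, (host, port) in endpoints.items()
    }
```

Calling `basic_checks` with the web server, file server, and database endpoints of the system gives a quick table of which areas can be eliminated, without needing to understand any of those systems in detail.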
Once you have established what the problem is, you need to start solving it. Sometimes the initial investigation will point you directly to the solution (switch the power on, replace the faulty cable, etc.). But sometimes the real problem is complex in itself, so the next principle is to make it simple before you solve it.
Principle #3. Conquer it simple.
Let’s start this section with a real-life example. Under certain conditions, a stored procedure would hang. The stored procedure normally took about an hour to run (when it was not hanging). So the developer tried to debug it: make some changes, then wait another hour or so to see if the problem was solved. After some days the developer gave up and the “Problem Solver” took over. The “Problem Solver” had at his disposal the knowledge of the conditions under which the stored procedure would hang. So it was a simple exercise to make a copy of the procedure, and then to strip all unnecessary code from this copy. All parameters were replaced with hard-coded values. Bits of code were executed one at a time and the result sets were then hard-coded into the copy of the procedure. Within 3 hours the problem was solved. An infinite loop was discovered.
What the “Problem Solver” did was to replicate the problem and at the same time isolate the code that caused it. In doing so, the complex (and time-consuming) stored procedure became something fast and simple.
If the problem is inside an application, create a new application and try to simulate the problem inside the new application as simply as possible. If the problem occurs when a certain method of a certain control gets called, then include only this control in the empty application and call that method with hard-coded values. If the problem is with embedded SQL inside a C# application, then try to simulate the SQL inside a database query tool (like SQL*Plus for Oracle, Query Analyzer for SQL Server, or MS Excel connected to the database via ODBC).
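To make the idea concrete, here is a small sketch of such a stripped-down reproduction. It is not taken from any of the systems described in this paper: the table, the rows, and the function name `reproduce_insert` are invented for illustration, and an in-memory SQLite database stands in for the real back-end. The point is only that the suspect statement runs on its own, with hard-coded values, in seconds.

```python
import sqlite3


def reproduce_insert(rows):
    """Run the suspect INSERT against a throwaway in-memory database.

    Returns a list of (row, error_message_or_None) pairs so the failing
    input stands out from the rows that went through.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: only the columns needed to show the problem.
    conn.execute("CREATE TABLE product (code TEXT PRIMARY KEY, name TEXT)")
    results = []
    for row in rows:
        try:
            conn.execute("INSERT INTO product (code, name) VALUES (?, ?)", row)
            results.append((row, None))
        except sqlite3.IntegrityError as exc:
            # A duplicate key ends up here instead of crashing the run.
            results.append((row, str(exc)))
    conn.close()
    return results


# Hard-coded values standing in for what the user typed (made-up data).
HARD_CODED_ROWS = [("A100", "Widget"), ("A200", "Gadget"), ("A100", "Widget")]
```

Running `reproduce_insert(HARD_CODED_ROWS)` reports an error only for the third row, which repeats the code of the first. That is the kind of finding that turns “I get an error when I add a product” into “the product already exists”.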
The moment you can replicate the problem in a simple way, you are more than 80% of the way to solving it.
If you do not know where in the program the problem is, then use DEBUG.
Principle #4. Debug.
Most application development tools come standard with a debugger. Whether it is Macromedia Flash, Microsoft .NET, Delphi, or whatever development environment, there will be some sort of debugger. If the tool does not come standard with a debugger, then you can simulate one.
The first thing you want to do with the debugger is to determine where the problem is. You do this by adding breakpoints at key areas. Then you run the program in debug mode and you will know between which breakpoints the problem occurred. Drill down and you will find the spot. Now that you know where the problem is, you can “conquer it simple”.
Another nice feature of most debuggers is the facility to watch variables, values, parameters, etc. as you step through the program. With these values known at certain steps, you can hard-code them into your “simplified version” of the program.
If a development tool does not support debugging, then you can simulate it. Put steps into the program that output variable values and “hello I am here” messages to the screen, to a log file, or to a database table. Remember to take them out when the problem is resolved… you don’t want your file system to be cluttered or filled up with log files!
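As an illustration of such a simulated debugger, the sketch below uses Python’s standard `logging` module. The function `order_total` and the helpers `enable_trace` and `disable_trace` are invented for this example; only the `logging` calls are real library functions.

```python
import logging

log = logging.getLogger("trace")


def enable_trace(stream=None):
    """Switch the trace messages on; returns the handler so it can be removed."""
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(funcName)s:%(lineno)d %(message)s")
    )
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)
    return handler


def disable_trace(handler):
    """Switch the trace messages off again once the problem is resolved."""
    log.removeHandler(handler)
    log.setLevel(logging.WARNING)


def order_total(prices, discount):
    """Hypothetical function under investigation."""
    log.debug("entered with prices=%r discount=%r", prices, discount)  # "I am here"
    subtotal = sum(prices)
    log.debug("subtotal=%r", subtotal)  # watch an intermediate value
    total = subtotal * (1 - discount)
    log.debug("leaving with total=%r", total)
    return total
```

One possible design choice, different from deleting the messages afterwards, is to leave them in the code and keep them switched off by default, as `disable_trace` does here. If the output goes to a file, a size-limited handler such as `logging.handlers.RotatingFileHandler` is one way to keep the file system from filling up.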
Principle #5. There is a wealth of information on the database back-end that will help to solve a problem.
The “Problem Solver” was called to help solve a very tricky problem. A project was migrating systems from a mainframe to client-server technology. All went well during testing, but when the systems went live, all of a sudden there were quite a few, and quite random, “General Protection Faults”. (The GPF error was the general error trap in Windows 95 and 98.) Attempts were made to simplify the code and to debug it, but the problem was impossible to replicate. In the lab environment, the problem would not occur! Debugging trace messages written to log files indicated that the problem occurred very randomly. Some users experienced it more than others, but eventually all users would get it! Interesting problem.
The “Problem Solver” solved this after he started to analyze the database back-end. I am not sure whether it was by chance or because a scientific approach moved him systematically in the right direction. By tracing what was happening at the back-end level, it was found that all these applications were creating more and more connections to the database. Every time a user started a new transaction, another connection was established to the database. These connections were released only when the application was closed. As the user navigated to new windows inside the same application, more and more connections were opened, and after a specific number of connections the application had had enough and crashed. This was a programming fault in a template that was used by all the developers. The solution was to first test whether a cursor to the database was already open before opening it again.
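The fix can be pictured with a few lines of code. This is not the original template, which is not available here. It is a minimal sketch of the “test before you open” idea, using SQLite as a stand-in database and an invented class name, `ConnectionHolder`.

```python
import sqlite3


class ConnectionHolder:
    """Hands out one shared connection instead of opening one per window."""

    def __init__(self, database):
        self._database = database
        self._conn = None
        self.opened = 0  # counts real connects, to make a leak visible

    def get(self):
        # Test whether a connection is already open before opening another.
        if self._conn is None:
            self._conn = sqlite3.connect(self._database)
            self.opened += 1
        return self._conn

    def close(self):
        # Release the connection explicitly, not only at application exit.
        if self._conn is not None:
            self._conn.close()
            self._conn = None
```

With a holder like this, every window that calls `get()` receives the same connection, and the `opened` counter stays at 1 no matter how many windows the user navigates through.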
How do you trace what is happening on the back-end database? The main database providers have GUI tools that help you to trace or analyze which queries are fired against the database. They will also show you when people connect, disconnect, or are unable to connect because of security violations. Most databases also include some system dictionary tables that can be queried to get this information. These traces can sometimes tell the whole story of why something is failing. The query code you retrieve from the trace can help to “simplify the search”. You can see from the trace whether the program makes successful contact with the database. You can see how long it takes for a query to execute.
To add to Principle #2 (do not be afraid to start…): you can analyze this trace information even though you might not know anything about the detail of the application.
Remember, though, that these back-end traces can put a strain on the back-end resources. Do not leave them running for longer than necessary.
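The vendor tools differ from database to database, so the following is only a small-scale illustration of the same idea, not a substitute for them. It assumes SQLite through Python’s `sqlite3` module, whose connections offer `set_trace_callback` to report each SQL statement that is executed. The function name `run_traced` is invented for this sketch.

```python
import sqlite3
import time


def run_traced(conn, statements):
    """Execute statements with tracing on.

    Returns (seen, timings): the SQL text the database reported running,
    and a list of (sql, elapsed_seconds) measured from the client side.
    """
    seen = []
    conn.set_trace_callback(seen.append)  # called with each SQL string run
    timings = []
    try:
        for sql in statements:
            start = time.perf_counter()
            conn.execute(sql).fetchall()
            timings.append((sql, time.perf_counter() - start))
    finally:
        # Do not leave the trace running for longer than necessary.
        conn.set_trace_callback(None)
    return seen, timings
```

The `seen` list may contain more than the statements passed in (for example a transaction start that the driver issues itself), which is exactly the kind of detail a back-end trace reveals and the application code hides.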
Principle #6. Use fresh eyes.
This is the last principle. Do not spend too much time on the problem before you ask for assistance. The assistance does not have to come from someone more senior than you. The principle is that you need a pair of fresh eyes for a fresh perspective, and sometimes a bit of fresh air by taking a break. The other person will look and then ask a question or two. Sometimes it is something very obvious that was missed. Sometimes just answering the question makes you think in a new direction. Also, if you spend hours looking at the same piece of code, it is very easy to start overlooking a silly mistake. A lot of finance balancing problems get solved over a beer. It could be the change of scenery and/or the relaxed atmosphere that makes the solution pop out. Maybe it is the fresh oxygen that went to the brain while walking to the pub. Maybe it is because the problem got discussed with someone else.
Conclusion
The author hopes that, after reading this paper, you will try these principles the next time you encounter a problem to solve. Hopefully, by applying these six principles you will realize the advantages they bring over “guessing” your way to a solution.