Friday, February 06, 2009

A problem is easy to fix

I was reading an electronics trade magazine recently and one of the articles really hit home. The magazine, EDN (I think it used to be called Electronics Design News), has a regular feature called Tales from the Cube: one-page stories about interesting problems that readers have run into and how they solved them. The article that got me to write this was called "All fail down". It was about an electronic product that failed in areas that are dry (that is, they had low humidity - not that they restricted alcoholic beverages). This is usually a sign that static build-up is getting from the case of the device to the electronics and causing a malfunction.

The writer of the article was brought in as a consultant. He'd seen things like this before, but as he looked over the design of the product and offered suggestions, the engineers would just say, "We already tried that. We've tried everything." He puzzled over it for a long time until he realized that the company's engineers had tried each fix by itself and then removed it before trying the next one. So our consultant hero realized that there was really more than one problem producing the same result. When he applied all six of the fixes at once, everything worked. Then it was time to remove the fixes to see which ones were really needed, and it turned out that only two of the six were. But the fact that it was two problems was what made this such a tough thing to solve.
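For readers who think in code, here is a rough sketch of the consultant's approach as I understand it from the story: apply every fix, confirm the product works, then back the fixes out one at a time and keep only the ones whose removal brings the failure back. Everything in it is made up for illustration. The fix names, the `passes` function and the choice of which two fixes matter are my assumptions, not details from the article.

```python
def minimal_fixes(all_fixes, passes):
    """Start with every fix applied, then drop each fix that turns out not to be needed."""
    applied = list(all_fixes)
    if not passes(set(applied)):
        raise ValueError("product still fails with every fix applied")
    for fix in all_fixes:
        trial = [f for f in applied if f != fix]
        if passes(set(trial)):
            # Still works without this fix, so leave it out for good.
            applied = trial
    return applied


# Hypothetical product: it only works when two particular fixes are both present.
def passes(applied):
    return {"fix_b", "fix_e"} <= applied


fixes = ["fix_a", "fix_b", "fix_c", "fix_d", "fix_e", "fix_f"]

# What the company's engineers did: try each fix alone. None of them works by itself.
one_at_a_time = [f for f in fixes if passes({f})]

# What the consultant did: apply everything, then remove what isn't needed.
needed = minimal_fixes(fixes, passes)

print(one_at_a_time)
print(needed)
```

In this toy setup, trying each fix alone never succeeds, which is exactly why "we've tried everything" felt true to the engineers. Real hardware testing is slower and messier than calling a function, of course, so treat this as a picture of the reasoning rather than a procedure.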

This is what I've seen in my work, too. Usually, fixing one problem is fairly simple. It's when you have two or more problems that interact that you have a real problem. I recently found and fixed a problem in our products that appeared to be one problem but turned out to be two things that weren't really mistakes - most of the time. I was called in because one of our devices failed to run: our software couldn't read information from it correctly. It looked like a bug in our software, but when I tracked it down, it turned out that the information in our device was incorrect. Yet when the hardware guys loaded a new copy of that information into the device, it continued to fail. How did our devices ever work? The same information worked when loaded into brand new devices. It turned out that the information was slightly wrong, but that didn't matter when it was loaded into a brand new device. When it was loaded into a device that already had some other slightly wrong information in it (from a power glitch or mishandling), it failed. The original information was never meant to be used in anything but a brand new unit. It took three things interacting to cause the problem: the slight error in the original information, a unit that had failed in a certain way, and the use of the original information in the wrong circumstance.

These kinds of problems are what keep engineers up at night. And it is the solving of these kinds of problems that gives engineering the satisfaction that makes it exciting to come to work each day.

I'm sorry if this was boring. I'm working on trying to get more stories on engineering into this blog. After all, I call it "Adventures in Engineering". I want to make the explanations understandable by non-engineers but detailed enough to give the real story. It's not easy but I will be a better engineer if I can balance these two things.
