Back to Our Roots
“Failure Analysis” is basically the first thing I learned when I joined T H Hill back in the before times. It’s like CSI, but with broken metal instead of bodies (much better). It’s fun! That’s where I get a lot of my entertaining stories (I know you’re entertained, don’t try to deny it), and where a lot of the décor in my office comes from (“wow, what’d they do to that one?!”).
As usually happens, though, time marches on, we grow—we gain expertise, we learn about new materials and components, our range gets bigger, our pants get bigger (ahem)—and I learn to distinguish between different types of failure analysis. That steel-based CSI is what we usually call “metallurgical failure analysis,” because we’re testing and prodding those chips and chunks until they give up their mysterious secrets. Sometimes, though, we study the operations themselves—the pipe was good for 514,229 metric bawdles, and you need to know whether you applied that many real-life bawdles. We gather all the data that’s available about the well and the rig and the Baywatch rerun that was on that day, and we calculate the most likely scenarios that led to your pipe’s unfortunate demise. We refer to that as an “operational failure analysis.”
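If you want to see the bones of that comparison, here is a minimal sketch in Python. Everything in it is made up for illustration (the scenario names, the likelihoods, and of course the joke “bawdle” unit); the point is just that an operational failure analysis boils down to reconstructing the likely applied loads from the well and rig data and checking them against what the pipe was rated for.

```python
# Minimal sketch of the "operational failure analysis" comparison described above.
# All numbers, names, and the "bawdle" unit are made up for illustration.

RATED_CAPACITY_BAWDLES = 514_229  # what the pipe was good for, per the joke number above

# Reconstructed load scenarios: an estimated applied load and how likely we think
# each scenario is, based on whatever well/rig data we could gather.
scenarios = [
    {"name": "normal drilling",       "applied_bawdles": 410_000, "likelihood": 0.60},
    {"name": "stuck pipe + overpull", "applied_bawdles": 530_000, "likelihood": 0.30},
    {"name": "severe dogleg bending", "applied_bawdles": 560_000, "likelihood": 0.10},
]

for s in scenarios:
    overloaded = s["applied_bawdles"] > RATED_CAPACITY_BAWDLES
    print(f"{s['name']:<24} load={s['applied_bawdles']:>7,} bawdles  "
          f"likelihood={s['likelihood']:.0%}  exceeds rating: {overloaded}")
```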
We can keep going, though. You busted your widget …
… you busted your widget because there was too much load applied …
… there was too much load applied because the driller thought it could carry more …
… the driller thought it could carry more because the other tool that you normally use can, in fact, carry more …
… but the engineer had switched the normal tool being used (because the sales guy took him on a hunting trip) …
… the engineer didn’t tell the driller the new tool’s capacity, so the driller was working from memory …
… there’s no communication mechanism in place for the engineer to verify the particular tool in use and highlight the weak link to the driller …
… the communication of the string design from the office to the rig is ad hoc and uncontrolled, with no one person or group actually responsible for the results.
That bottom one might be called the “root cause,” the condition at the operational management level that allows or even encourages failures. Clearly, if you fix that, you not only correct this particular widget’s untimely death, but also push a broad swath of operations toward less-failure-prone-ness. We call it “Root Cause Failure Analysis.” Which, of course, we have to abbreviate to RCFA.
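If it helps to see the shape of that drill-down, here is a tiny, purely illustrative Python sketch that records the why-chain from the widget story above and treats the deepest answer as the root cause candidate:

```python
# Tiny illustrative sketch: record the why-chain and treat the deepest "why"
# as the root cause candidate. The chain itself is just the widget story above.

why_chain = [
    "widget busted",
    "too much load applied",
    "driller thought it could carry more",
    "the tool normally used can, in fact, carry more",
    "engineer switched tools without telling the driller",
    "no mechanism to verify the tool in use and flag the weak link",
    "string-design communication from office to rig is ad hoc and uncontrolled",
]

physical_cause = why_chain[0]
root_cause_candidate = why_chain[-1]  # the bottom of the chain

for depth, why in enumerate(why_chain):
    print(f"{'  ' * depth}why? {why}")
print(f"\nPhysical cause:       {physical_cause}")
print(f"Root cause candidate: {root_cause_candidate}")
```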
I think it’s obvious that finding and addressing the root cause can be wildly helpful. I think it’s also obvious that it’s a pain, about as enjoyable as extracting a molar from an angry hippo. Nobody wants to find the answer, then keep digging to find other answers and other things that we did wrong—specifically looking for the things we do wrong all the flippin’ time. Personal growth is hard, man.
Still, it’s a good idea. (Says the guy that still hasn’t cleaned out the shelves in his garage.) Imagine the general “failure analysis” that kooks like us do—we’ll tell you the “physical cause,” that is, the gizmo that broke and led to a failure. (If there is one; sometimes there’s not. Some failures are caused by poor operational choices and don’t really have a physical cause.)
In a regular failure analysis we might, if you’re lucky, also be able to speculate on the “process cause,” that is, the stuff you (or your trusty representative) did or didn’t do that led to the gizmo breaking (maybe), which led to the failure. There are always process causes for failures. It usually requires a bit more digging, but we might find that your inspection missed something, or that it was the wrong design for this scenario, or that something happened in that well that your company man isn’t telling you about.
But the root cause is the thing that allowed the process cause to happen. Things like: nobody knows what the inspection is supposed to do, or how to make sure you get a good one, so the inspection company is chosen on lowest price / best restaurant they take you to, rather than on highest quality and correct coverage.
When you fix the root cause of a failure, you keep a whole herd of future failures from happening.
Lemme give you an entirely fictional scenario; if it is based on a real-life problem, I will never admit it. The names and places have been changed to protect the guilty.
Let’s say there’s a tool, maybe even a tool that is used to communicate information from the bottom of the well to the top of it via pressure pulses in the mud system. This “pulser,” for lack of a better word, dutifully pulsed for a while, but it took a vacation before its shift was up—it stopped working while in the hole. Given the criticality of the well, things like “knowing what the flip is happening down there” are not really optional, so the rig spent several hours diagnosing the problem, failing to diagnose the problem, then tripping it out of the well to put the backup tool in its place. The hole was something like 25,000 ft deep at that point, so a round trip took the better part of a day; add in the diagnostic time and the staggering day rate, and that leads us to a failure cost that I’m going to call “substantial” so that I don’t have to call it “more than my whole life is worth even if I sell all my body parts on the black market.”
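For the curious, the “substantial” figure is just back-of-the-envelope arithmetic. Here is a hedged sketch in Python; the trip speed, diagnostic hours, and day rate are placeholder assumptions, not numbers from the actual (entirely fictional, remember) job:

```python
# Back-of-the-envelope failure cost, as described above.
# Every number here is a placeholder assumption, not data from the actual job.

hole_depth_ft = 25_000            # from the story
tripping_speed_ft_per_hr = 2_500  # assumed average tripping speed
diagnostic_hours = 6.0            # "several hours" of troubleshooting, assumed
day_rate_usd = 600_000.0          # assumed (staggering) spread/day rate

round_trip_hours = 2 * hole_depth_ft / tripping_speed_ft_per_hr  # out and back in
lost_hours = round_trip_hours + diagnostic_hours
failure_cost_usd = day_rate_usd * lost_hours / 24.0

print(f"Round trip:   {round_trip_hours:.0f} hours")
print(f"Lost time:    {lost_hours:.0f} hours")
print(f"Failure cost: ${failure_cost_usd:,.0f}  (a.k.a. 'substantial')")
```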
Why, oh why, was this “substantial” amount of money wasted? Well, in the design of the tool there’s a little tidbit that makes sure the pulser stays in its track as it moves up and down, and that tidbit was cracked—it fell apart when we disassembled it. The crack was obviously a fatigue crack, starting from an obviously sharp corner that would be sensitive to fatigue cracking.
Not to worry, sez the vendor. We have a new tidbit that is specifically designed to be less sharp, less fatigue prone, and less likely to waste “substantial” moneys.
Wait … it’s already designed? Then why wasn’t it in our current problematic tool?
Well, you know, we replaced most of them, this one was just left in inventory accidentally. Actually, you know what, we just went and checked and found another one, so we fixed that, too!
Um, great? But didn’t the assembler notice that the part was different? Were the instructions updated?
Yes! The instructions were definitely updated … just, you know, they haven’t printed out the new assembly instructions for the shop in a while. We have now, though!
For those of you keeping score, this RCFA has now prevented another similar failure (the other old part still in the warehouse), but also revealed a couple of issues that will clearly lead to even more failures. Their engineering management-of-change process is … let me think of a nice word for “haphazard flailing” … incomplete, let’s go with incomplete. They don’t have a good handle on what is actually in the parts inventory, old or new. And the assembly of their tools uses uncontrolled copies of the instructions.
Fix those root causes, and you can head off a ton of failures at the pass. (What “pass”? Maybe “nip them in the bud”? These proverbial metaphors are taking over this article, man.)
Anyway, this is your friendly neighborhood dork telling you something you don’t want to hear: you should do more work, it’s good for you. I’m off to do some crunches now …