The Case of the Field Problem

Picking up someone else's design to get it into production is a common enough situation and yet there are always interesting problems to be solved. These can get particularly challenging if that previous engineer has left the company and was the only one who really knew how the product worked. In this case, not only was it a new product, it was the first product of a new range using a new processor and new communications protocol, which we were going to develop into many more variants. So it was inevitable that there would be some bugs found by the first customers.

Of course the product worked fine on the bench - it was when multiple products were wired together in the field that the bug showed up. Although as a product manufacturer we specified how the field wiring was to be done, it is never that simple. Wiring is done by third parties who don't know the product design and system requirements, so what looks equivalent to them isn't necessarily acceptable in the complete system. We should have been able to provide sufficient guidance about the wiring requirements to guarantee acceptable performance, but this requires either a lot of in-house testing or a lot of experience, neither of which we had at that point.

Site Visit

I'd only been handed this product a few weeks before I got the call to visit the first customer site, where the bug was showing up. I had barely had time to understand the product, board, communications protocol, firmware, development system and system design, or to get to know how the product normally behaved. Reports were coming in of intermittent behaviour, which is always the hardest kind of fault to find even with all the right monitoring equipment, and the site didn't allow us to leave anything connected for more than a few minutes while we were present.

Nevertheless, within a few minutes I had observed communications errors between the boards, which seemed to be the cause of the major problem. One of the things you quickly learn about site visits when there are customer problems is that you will be told many things, only some of which are pertinent to the major problem, while others are knock-on effects of it. It's important not to get distracted by all these other reports and to concentrate on what you are there to observe and solve. Usually the minor problems will resolve themselves - a lot of them are misunderstandings or irrelevancies. Those that aren't can be better observed and fixed once the main issue is resolved, but until then they are only a distraction.

We hadn't had sufficient advance information to prepare any possible mitigations, or even to know what instrumentation to bring to diagnose the problem definitively, but the site was close enough that a return visit was possible. However, because this was a high profile installation, one of the directors was very keen to get a quick resolution without multiple visits. I had grabbed a few components from the lab just in case I had the opportunity to try something, but the installation made that impossible, and in any case I would have needed enough of them to modify all the products on that network.

By giving up on attempting to fix the problem in that first visit, I could concentrate on observing the problem. It's very important to get reliable information about site problems before attempting to fix them. For one thing, if you don't record the exact conditions under which they occur, you won't know if the fix made the improvement you were hoping for. So I came back from that site visit with sufficient information to diagnose the problematic area - the communications between the boards.

Suspicions and Trust

This product was the first we had designed using a commercially available communication network protocol, which was sold to us as being highly engineered and reliable, so we had to assume we had done something wrong in the implementation to get the problems we were seeing on site. At this early stage in diagnosis you have to start by suspecting everything, but the main focus was inevitably on the more complex parts of the design, specifically the firmware.

We see so many examples in everyday life where the more complex a thing is, the more likely it is to go wrong, that we associate complexity with likelihood of failure. So when we are diagnosing a bug in a complete system, we tend to assume it is more likely to be in the most complex part of that system, which is often the software. But it always pays to check the fundamentals first.

Once back in the lab, retesting the firmware proved how robust it was. And there didn't appear to be any major problems with the wiring, such as broken wires, bad contacts or shorts between adjacent connections, despite our using a higher density connector for field wiring than is commonly used.

Fine pitch connector for display wiring

Power and Signal

Within a couple of days, I had traced the problem to an interaction between the SMPSU and the communications. When I fed the 5V section of the board from a bench PSU and saw that the comms problem disappeared, I realised that somehow the SMPSU was the cause. The pressure was on to get to a solution quickly because the director was getting impatient and worried that this could indicate a fundamental design problem that put the whole range at risk, as well as having to fend off daily complaints from the customer.

Observing the SMPSU behaviour over a range of loads showed that specific conditions caused interference with the communications. This was an off-the-shelf SMPSU chip using the components specified by the manufacturer in the application note, so there should have been no problem with it from a circuit point of view. PCB layout is critical to an SMPSU, but the previous engineer had known that and had done a good job of keeping the loop areas and the switching node small. As far as I could tell, the components were the ones recommended too, so why was there a problem?

If you can't make it better, make it worse

Sometimes nothing you try makes any improvement. This means you've been working in the wrong area: your assumptions are wrong and you need to reset them. Try the opposite tack and see if there's anything you can do to make the symptom worse. But don't try to generate additional or different problems; the aim is to get the symptom of interest to show up more strongly. A worsening can be easier to see than a slight improvement, and it shows you which area to concentrate your efforts on. There's an additional reason for having stronger symptoms on demand - it is easier to determine the effectiveness of your fix.

This board was used in a confined space which meant that the cables had to lie close to the board. Could the problem be made worse by moving the cables around?

The communications system used a small signal transformer to isolate the network from the board, so that only a single twisted-pair cable was needed. This is supposed to improve the robustness because the cable rejects common mode noise, and it avoids the ground cable problem in which thresholds differ between boards with different grounds. All of that remained true, but the SMPSU used an open drum core inductor within an inch of the comms transformer, so some magnetic flux from the SMPSU inductor was coupled into the comms transformer. Unfortunately, enough of that interference fell within the signal bandwidth to cause the problem.

Drum core inductor near comms transformer, oblique view

The SMPSU inductor is on the left. The other drum core inductor was a d.c. power line filter and was not relevant.
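
As a rough way to picture the coupling between the SMPSU inductor and the comms transformer, the standard mutual inductance relation can be written down. The symbols are my own labels and none of the values were measured on this board: i_L is the ripple current in the SMPSU inductor, L_1 is its inductance, L_2 is the inductance of the comms transformer winding, and k is the coupling coefficient between them.

```latex
\left| v_{\text{ind}}(t) \right| = M \left| \frac{di_L(t)}{dt} \right|,
\qquad M = k \sqrt{L_1 L_2},
\qquad 0 \le k \le 1
```

An open drum core leaves much of its field outside the component, so k between nearby parts is larger than it would be with a shielded core, and the fast current edges of a switching supply make di_L/dt large. Reducing k, by shielding or by distance, reduces the induced voltage in proportion.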

Shielding the comms transformer with the metals we had in the lab didn't solve it, so my boss suggested getting some mu-metal to make a shield. I had heard this suggested at a previous job as a fix and it's never an economic solution. Costs were critical here which meant I needed to find the minimum hardware change. For example, shifting the fundamental switching frequency of the SMPSU by just changing a resistor would have been preferable to a PCB redesign to move the inductor because we had built up stock of the bare PCB and components.
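
To illustrate why shifting the switching frequency can be a candidate fix, here is a minimal sketch that lists which harmonics of a switching frequency land inside a signal band. The function name and every number in it are hypothetical, chosen only for illustration; the actual switching frequency and comms bandwidth of this product are not given here.

```python
def harmonics_in_band(f_sw_hz, band_lo_hz, band_hi_hz, max_harmonic=20):
    """Return (n, frequency) for each harmonic n = 1..max_harmonic of f_sw_hz
    that falls inside the band [band_lo_hz, band_hi_hz]."""
    hits = []
    for n in range(1, max_harmonic + 1):
        f = n * f_sw_hz
        if band_lo_hz <= f <= band_hi_hz:
            hits.append((n, f))
    return hits


if __name__ == "__main__":
    # Hypothetical values only: a 50 kHz to 150 kHz signal band and two
    # candidate switching frequencies.
    band = (50_000, 150_000)
    for f_sw in (52_000, 260_000):
        print(f_sw, harmonics_in_band(f_sw, *band))
```

With these made-up numbers, the lower switching frequency puts its fundamental and second harmonic inside the band, while the higher one keeps every harmonic above it. A real supply may also produce energy below its nominal switching frequency under some load conditions, which this simple check does not model.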

Component Change

For a few pence more than the existing open drum core inductor, I found a readily available equivalent shielded core inductor which fitted in the same footprint and within the height envelope. It worked well in terms of SMPSU operation and there was no interference with the comms transformer.

Shielded through hole drum core inductor

Lesson

We're never dealing with ideal components in practice. Fields extend beyond that neat component outline shown in your PCB editor. The same applies to multi-layer ceramic capacitors, where the magnetic flux caused by the ESL can couple to the capacitor next to it.

Site Fix

Once bench testing with a single board was completed, a set of boards was made with the shielded inductor and sent to site, and this fixed the problem. Time between the site visit and the fix was less than two weeks, and the fix was economical. The customer was happy, but a decision was made to cancel the entire range in favour of buying in a competitor product.