How can I do root-cause analysis (RCA)?

DevicePilot's powerful cohort analysis tools let you do root-cause analysis by exploring the relationship (correlation) between a problem and its possible cause.

One of the first things people often do in DevicePilot is to measure the uptime of their device estate, by:

  1. Creating a Filter e.g. "Up" defined as "Last seen is less than 1 day ago"
  2. Creating a KPI to measure this across the device estate, e.g. "Uptime" defined as "Percentage of time where: Up"
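To make the definition concrete, here is a minimal Python sketch of what a KPI like "Percentage of time where: Up" measures for a single device. It is not DevicePilot's implementation or API. The function name `uptime_fraction` and the sample timestamps are hypothetical, and it assumes a device counts as "Up" for one day after each message it sends.

```python
from datetime import datetime, timedelta


def uptime_fraction(seen, start, end, timeout=timedelta(days=1)):
    """Fraction of the window [start, end) during which the device was "Up".

    Assumption: the device is "Up" at time t if it was last seen less than
    `timeout` before t, so each message keeps it "Up" for `timeout`.
    """
    up = timedelta(0)
    covered_until = start  # end of the "Up" time already counted
    for ts in sorted(seen):
        lo = max(ts, covered_until)      # skip overlap with earlier messages
        hi = min(ts + timeout, end)      # clip to the end of the window
        if hi > lo:
            up += hi - lo
            covered_until = hi
    return up / (end - start)


# Hypothetical device: two messages on 1 Jan, then silence until 4 Jan.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 8)
messages = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 4, 0, 0),
]
print(f"{uptime_fraction(messages, window_start, window_end):.1%}")  # 35.7%
```

Averaging this fraction over every device gives one fleet-wide number, which is the kind of figure the "Uptime" KPI reports.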

Once you see this number, you may decide it isn't good enough. So how can DevicePilot help you make it better?

In the View page you can explore devices one by one, inspecting those that are working well and those that are not, and perhaps pick up some clues as to the root cause. But the Cohort page is a much more powerful tool for this analysis, because it looks across all your devices in one go.

Say one of the properties that your device reports is radio signal strength, and you are wondering whether it might be affecting your uptime. To explore this, all you have to do is add "Group By Property: Signal strength" to your KPI. At the top of the page, choose a time period long enough to be statistically meaningful, e.g. 7 days. When you press "Run KPI" you'll see a chart showing how device uptime varies with signal strength. If uptime falls as signal strength falls (roughly a diagonal line sloping down towards the bottom-left), then you have a smoking gun.
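The grouping behind such a chart can be sketched in a few lines of Python. This is an illustration of the idea only, not how DevicePilot computes it. The function `uptime_by_bucket`, the bucket width, and the sample readings (signal strength and uptime, both as percentages) are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean


def uptime_by_bucket(devices, bucket_width=10):
    """Group (signal_strength, uptime_pct) pairs into signal-strength buckets
    and return the mean uptime of each bucket, keyed by the bucket's lower edge.
    """
    buckets = defaultdict(list)
    for signal, uptime in devices:
        lower_edge = (signal // bucket_width) * bucket_width
        buckets[lower_edge].append(uptime)
    return {edge: mean(values) for edge, values in sorted(buckets.items())}


# Hypothetical fleet: (signal strength %, uptime % over the chosen period).
devices = [(15, 40.0), (18, 50.0), (45, 70.0), (55, 80.0), (85, 98.0), (88, 96.0)]
print(uptime_by_bucket(devices))
# {10: 45.0, 40: 70.0, 50: 80.0, 80: 97.0}
```

In this made-up data, mean uptime climbs steadily with signal strength, which is the diagonal pattern to look for.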

You can even estimate empirically how much of your downtime is due to this cause: it is the area of the white triangle above the diagonal. That tells you the potential benefit of putting a remediation plan in place.
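One rough way to put a number on that area is to ask how much fleet-wide uptime would rise if every group of devices performed as well as the best group. The sketch below is an assumption about how you might approximate it by hand, not a DevicePilot feature. `recoverable_uptime` and the bucket figures are hypothetical, and it treats the gap as caused entirely by the grouped property, which a correlation alone cannot prove.

```python
def recoverable_uptime(buckets):
    """Estimate the fleet-wide uptime gain, in percentage points, if every
    bucket reached the uptime of the best bucket.

    buckets: list of (device_count, mean_uptime_pct) pairs, one per group.
    """
    total_devices = sum(count for count, _ in buckets)
    best = max(uptime for _, uptime in buckets)
    # Current fleet uptime is the device-weighted mean across buckets.
    current = sum(count * uptime for count, uptime in buckets) / total_devices
    return best - current


# Hypothetical buckets, weakest signal first: (device count, mean uptime %).
signal_buckets = [(20, 60.0), (30, 80.0), (50, 98.0)]
print(recoverable_uptime(signal_buckets))  # 13.0
```

Here the fleet averages 85% uptime against 98% for the best bucket, so up to 13 percentage points could be at stake if signal strength really is the cause.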

Now you're running your Operations based on quantified, empirical facts, not just on hypothesis.

Of course, you can choose to run this correlation against any property. Maybe some versions of hardware are more reliable than others? Maybe the device performs worse when it's hot? Or when the battery is getting flat?

This analysis is especially useful when you are deploying a new version of software. As you roll it out, correlate Uptime with "Group By Property: Software Version" and you'll immediately see whether your new version is causing regressions in the field. This matters particularly for IoT because, unlike most of the things we deploy, IoT devices go out into the real world, which is a messy place full of uncontrolled externalities that will test your code better than any pre-release regression tests you may run. So measuring performance as you roll out is essential, and DevicePilot gives you the perfect tool to do it.
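As an illustration of the rollout check described above, the following Python sketch compares mean uptime per software version against a baseline version and flags any version that falls noticeably short. The function `flag_regressions`, the 2-point tolerance, and the version numbers are hypothetical choices for the example, not part of DevicePilot.

```python
from collections import defaultdict
from statistics import mean


def flag_regressions(samples, baseline, tolerance=2.0):
    """Return the versions whose mean uptime is more than `tolerance`
    percentage points below the baseline version's mean uptime.

    samples: iterable of (software_version, uptime_pct) pairs, one per device.
    """
    by_version = defaultdict(list)
    for version, uptime in samples:
        by_version[version].append(uptime)
    means = {version: mean(values) for version, values in by_version.items()}
    baseline_mean = means[baseline]
    return sorted(
        version
        for version, value in means.items()
        if version != baseline and baseline_mean - value > tolerance
    )


# Hypothetical rollout: 1.4.0 is the established version.
samples = [
    ("1.4.0", 97.0), ("1.4.0", 99.0),
    ("1.5.0", 90.0), ("1.5.0", 92.0),
    ("1.5.1", 97.5),
]
print(flag_regressions(samples, baseline="1.4.0"))  # ['1.5.0']
```

With only a handful of devices on a new version the means will be noisy, so a check like this becomes more trustworthy as the rollout widens.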