Uptime % is the most common way to understand how your device estate is performing, and for good reason: it's often the most effective.
The first decision to make is how to define "up". The most common approach is a timeout: we consider a device offline if we've not heard from it within a specified period.
This timeout period will vary between device estates, but as a rule of thumb it should be at least 1.5 times the heartbeat interval: if your devices report every 10 minutes, the timeout should be no less than 15 minutes. You might decide that one missed heartbeat is OK to ignore, in which case you could set the timeout to 25 minutes so that only devices missing two in a row count as offline.
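The rule of thumb above can be written down as a small calculation. This is a minimal sketch in Python, not part of the product: the function names `timeout_for` and `is_online` are made up for illustration, and the formula (tolerated misses plus 1.5, times the heartbeat interval) is simply one way to reproduce the 15 and 25 minute figures.

```python
from datetime import datetime, timedelta


def timeout_for(heartbeat_interval: timedelta, tolerated_misses: int = 0) -> timedelta:
    """Return a timeout of (tolerated_misses + 1.5) heartbeat intervals.

    Hypothetical helper: the extra half interval leaves slack for late
    heartbeats, as in the rule of thumb above.
    """
    if tolerated_misses < 0:
        raise ValueError("tolerated_misses must be >= 0")
    return heartbeat_interval * (tolerated_misses + 1.5)


def is_online(last_seen: datetime, now: datetime, timeout: timedelta) -> bool:
    """A device counts as online if it was heard from within the timeout."""
    return now - last_seen <= timeout


interval = timedelta(minutes=10)
strict = timeout_for(interval)        # no missed heartbeats tolerated
relaxed = timeout_for(interval, 1)    # one missed heartbeat ignored
```

With a 10 minute heartbeat, `strict` works out to 15 minutes and `relaxed` to 25 minutes, matching the figures in the text.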
Heartbeats are not the only way, however. This blog post explains how to use the AWS IoT connection topic to piggyback on AWS's own connection management.
Once you've decided, you need to express the logic in a Filter. To define a timeout, you'll use a time-based query.
As you create the filter, the preview will show you how many devices are currently online, against your total. Save the filter with a handy name like "Seen in last 15 minutes".
Now that we have a filter, we can use Cohort to create our KPI. Use the Metric "Percentage of time where" and choose your filter, e.g. "Seen in last 15 minutes".
Click Run KPI and you'll see your uptime % for the last 24 hours.
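To make "percentage of time where" concrete, here is a rough sketch of how such a figure could be derived for a single device from its heartbeat timestamps. It illustrates the idea only and is an assumption, not a description of how the KPI is actually computed; the function `uptime_percent` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def uptime_percent(heartbeats, window_start, window_end, timeout):
    """Percentage of the window during which the device had been seen
    within `timeout` (illustrative only)."""
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be after window_start")

    online = timedelta(0)
    covered_until = window_start  # end of the online time counted so far
    for seen in sorted(heartbeats):
        # Each heartbeat keeps the device "up" for one timeout period;
        # clip to the window and skip time that is already counted.
        start = max(seen, covered_until)
        end = min(seen + timeout, window_end)
        if end > start:
            online += end - start
            covered_until = end
    return 100.0 * online.total_seconds() / window


day = datetime(2024, 1, 1)
beats = [day + timedelta(minutes=m) for m in (0, 10, 20, 50)]
pct = uptime_percent(beats, day, day + timedelta(hours=1), timedelta(minutes=15))
```

In this example the device is counted as up from 00:00 to 00:35 and again from 00:50 to 01:00, so `pct` is 75.0 for the one-hour window.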
If you want to see how your uptime % has been trending, you can group it by time, or by any other property that might be interesting, such as firmware revision.