Azure Local Health Service faults — what they are, how to view them, and how to alert on them via Azure Monitor : DataON Support Portal

TL;DR: Azure Local’s Health Service continuously monitors the Azure Local cluster and surfaces problems as actionable “faults” — each carrying a severity, a plain-language reason, a recommended next step, and the affected part’s identity and physical location. View them locally with Get-HealthFault, or wire them to Azure Monitor by turning on the cluster’s Health alerts capability (no alert-rule authoring required — over 80 fault types come pre-wired). Faults are interpretive, root-cause-aware health states, not raw counters — complementary to (not a replacement for) metrics like Get-ClusterPerf.

Recommended action:

1. View current faults locally on the cluster:

Get-HealthFault

If there are no current faults, the cmdlet returns nothing. A typical fault looks like:

Severity:       MINOR
Reason:         Connectivity has been lost to the physical disk.
Recommendation: Check that the physical disk is working and properly connected.
Part:           Manufacturer Contoso, Model XYZ9000, Serial 123456789
Location:       Seattle DC, Rack B07, Node 4, Slot 11
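When several faults are active at once, it can help to list the worst first. The following is a minimal sketch, not an official pattern: Sort-FaultBySeverity and $SeverityRank are hypothetical names, and it assumes PerceivedSeverity renders as one of the labels Critical, Minor, or Degraded (check what your build actually returns).

```powershell
# Rank used only for ordering. Hashtable keys are case-insensitive, so 'MINOR' matches 'Minor'.
# Assumption: PerceivedSeverity renders as one of these labels; anything else sorts last.
$SeverityRank = @{ Critical = 0; Minor = 1; Degraded = 2 }

function Sort-FaultBySeverity {
    param([object[]]$Fault)

    # Get-HealthFault returns nothing when the cluster is healthy.
    if (-not $Fault) { return }

    $Fault | Sort-Object -Property @(
        @{ Expression = {
            $label = [string]$_.PerceivedSeverity
            if ($SeverityRank.ContainsKey($label)) { $SeverityRank[$label] } else { 99 }
        } },
        'FaultType'
    )
}

# Usage on a cluster node:
# Sort-FaultBySeverity -Fault @(Get-HealthFault) |
#     Format-List PerceivedSeverity, FaultType, Reason, RecommendedActions
```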

2. Wire faults to Azure Monitor for production alerting (this is the supported path for paging, email, SMS, ITSM integration):

Unlike a typical Azure Monitor setup, you do not author individual alert rules — when you enable the Health alerts capability on the cluster, all 80+ Health Service fault types come pre-wired as Azure Monitor alerts at no additional cost. There is no Log Analytics workspace to provision, no signal / condition / threshold configuration, no rule authoring. The hookup is two portal steps:

Turn on the Health alerts capability on the cluster. In the Azure portal, go to your Azure Local system resource → select the cluster → Capabilities tab → select the Health alerts tile → Turn on. This installs the AzureEdgeAlerts Azure Monitor extension in the background (verify under Settings > Extensions). Once the Health alerts tile shows Configured, every Health Service fault on the cluster begins flowing into Azure Monitor’s Alerts blade in near real time.
Configure alert processing rules — not alert rules. This is the part that differs from the normal Azure Monitor flow and is the most common point of confusion. You are not authoring conditions, signals, or thresholds (those are already defined by the Health Service). Instead, go to Azure Monitor → Alerts → Alert processing rules and configure how the pre-wired alerts are routed: which action groups (email, SMS, webhook, ITSM connector, Logic App) get notified for which fault types, filtering by severity or component (e.g. all storage-related faults to the storage team), and any time-window suppression. Reference: Alert processing rules and Configure an alert processing rule.

Once configured, review live alerts at the Azure Local resource under Monitoring > Alerts. To disable the integration later, uninstall the AzureEdgeAlerts extension (see Uninstall an Arc extension).

3. View or change health thresholds and enable disabled fault types via PowerShell:

Some health alerts (CPU, memory, storage volume capacity, etc.) have tunable thresholds. View current values with Get-StorageHealthSetting:

Get-StorageSubSystem Cluster* | Get-StorageHealthSetting -Name 'System.Storage.Volume.CapacityThreshold.Warning'
Get-StorageSubSystem Cluster* | Get-StorageHealthSetting -Name 'System.Storage.Volume.CapacityThreshold.Critical'

Change a threshold with Set-StorageHealthSetting:

Get-StorageSubSystem Cluster* | Set-StorageHealthSetting `
    -Name 'System.Storage.Volume.CapacityThreshold.Warning' -Value 70
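
A cautious way to apply a change like this is to read the value, set it, and read it back. This is a sketch only, assuming a session on a cluster node with rights to change storage settings. The variable names are illustrative, and the value 70 simply mirrors the example above.

```powershell
# Setting to change; the name is taken from the example above.
$name = 'System.Storage.Volume.CapacityThreshold.Warning'
$subsystem = Get-StorageSubSystem Cluster*

# Record the current value so the change can be reverted if needed.
$before = $subsystem | Get-StorageHealthSetting -Name $name
"Before: $($before.Name) = $($before.Value)"

$subsystem | Set-StorageHealthSetting -Name $name -Value 70

# Read the setting back to confirm the new value took effect.
$after = $subsystem | Get-StorageHealthSetting -Name $name
"After:  $($after.Name) = $($after.Value)"
```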

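The same cmdlet is also how you turn on fault types that ship disabled (some are off by default to keep noise low). The setting name does not mirror the fault type name, so look up the matching setting on the Health Settings sheet of Microsoft's reference workbook (see Optional details). For example, to enable Microsoft.Health.FaultType.PhysicalDisk.HighLatency.AverageIO: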

Get-StorageSubSystem Cluster* | Set-StorageHealthSetting `
    -Name 'System.Storage.PhysicalDisk.HighLatency.Threshold.Tail.Enabled' -Value 'True'

To dump every setting that has been explicitly set on the cluster:

Get-StorageSubSystem Cluster* | Get-StorageHealthSetting
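
The dump can also be saved and compared later to spot settings drift. The sketch below is one possible approach: Compare-HealthSettingSnapshot and the CSV path are hypothetical, and it assumes each setting object exposes Name and Value properties.

```powershell
# Compare two setting snapshots (objects with Name and Value) and report differences.
# Values are compared as strings so that a live snapshot and a CSV import line up.
function Compare-HealthSettingSnapshot {
    param([object[]]$Baseline, [object[]]$Current)

    $old = @{}
    foreach ($s in $Baseline) { $old[$s.Name] = [string]$s.Value }

    $seen = @{}
    foreach ($s in $Current) {
        $seen[$s.Name] = $true
        $new = [string]$s.Value
        if (-not $old.ContainsKey($s.Name)) {
            [pscustomobject]@{ Name = $s.Name; Change = 'Added'; Old = $null; New = $new }
        } elseif ($old[$s.Name] -ne $new) {
            [pscustomobject]@{ Name = $s.Name; Change = 'Changed'; Old = $old[$s.Name]; New = $new }
        }
    }

    foreach ($n in $old.Keys) {
        if (-not $seen.ContainsKey($n)) {
            [pscustomobject]@{ Name = $n; Change = 'Removed'; Old = $old[$n]; New = $null }
        }
    }
}

# Usage on a cluster node (the path is an example):
# Get-StorageSubSystem Cluster* | Get-StorageHealthSetting |
#     Select-Object Name, Value | Export-Csv C:\Temp\health-settings.csv -NoTypeInformation
# $current = Get-StorageSubSystem Cluster* | Get-StorageHealthSetting
# Compare-HealthSettingSnapshot -Baseline (Import-Csv C:\Temp\health-settings.csv) -Current $current
```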

The authoritative catalogue of fault types, setting names, defaults, collected metrics, providers, and entity types lives in Microsoft’s reference workbook: github.com/Azure-Samples/AzureLocal/health-service-faults. See Optional details below for what’s on each sheet.

Why (and how it differs from regular metrics or monitoring):

Metrics — IOPS, latency, throughput, capacity counters from Get-ClusterPerf, Get-StorageQoSFlow, performance monitor, etc. — are raw numerical telemetry. They’re useful for trending and ad-hoc investigation, but the operator still has to interpret them and decide whether a value is a problem. Health Service faults are a different layer of the stack, with three important properties:

Interpretive, not raw. A fault tells you something is wrong, in plain language, with an explicit recommended action attached. No thresholding by the operator required.
Root-cause correlated. When a node goes offline, the Health Service raises one fault for the node, not separate faults for every disk in it that’s now unreachable. It recognises chains of effect and collapses them into the underlying cause, which makes the output much less chatty than naive per-component alerts.
Self-clearing. Faults are not “acknowledged” or marked resolved by an operator. The Health Service removes a fault automatically once it can no longer observe the problem. This generally reflects that the underlying issue has been fixed.

The practical takeaway: use metrics for performance investigation and trending, use faults (locally via Get-HealthFault, or surfaced through Azure Monitor) for “is anything wrong right now, and what should I do about it.” They answer different questions.
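
Because faults clear themselves, a script that polls Get-HealthFault can infer "raised" and "cleared" transitions by comparing consecutive snapshots on FaultId. The sketch below illustrates that idea; Compare-FaultSnapshot is a hypothetical helper, and the 60-second interval in the usage comment is arbitrary.

```powershell
# Diff two fault snapshots by FaultId.
# A FaultId present only in the current snapshot was raised;
# one present only in the previous snapshot has cleared.
function Compare-FaultSnapshot {
    param([object[]]$Previous, [object[]]$Current)

    $prev = @{}
    foreach ($f in $Previous) { $prev[[string]$f.FaultId] = $f }
    $curr = @{}
    foreach ($f in $Current) { $curr[[string]$f.FaultId] = $f }

    foreach ($id in $curr.Keys) {
        if (-not $prev.ContainsKey($id)) {
            [pscustomobject]@{ Change = 'Raised'; FaultId = $id; Fault = $curr[$id] }
        }
    }
    foreach ($id in $prev.Keys) {
        if (-not $curr.ContainsKey($id)) {
            [pscustomobject]@{ Change = 'Cleared'; FaultId = $id; Fault = $prev[$id] }
        }
    }
}

# Usage on a cluster node:
# $previous = @()
# while ($true) {
#     $current = @(Get-HealthFault)
#     Compare-FaultSnapshot -Previous $previous -Current $current
#     $previous = $current
#     Start-Sleep -Seconds 60
# }
```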

What it covers (and what it doesn’t):

Over 80 health issues across storage, networking, and compute components on Azure Local clusters. Storage faults are produced by the StorHealth provider; networking faults are produced by the separate Network HUD service (e.g. flapping NICs, NIC resets, missing LLDP properties, unsupported inbox drivers). Examples include physical and virtual disks (connectivity, latency, capacity, predictive failure, bad blocks, detached drives, auto-retirement), storage pool capacity, volumes (capacity, performance, integrity, repair needs), network adapters and physical switches, Storage QoS, nodes (state, connectivity), unsupported hardware, exceeded CPU / memory / storage usage, VMs and VHDs.
Storage enclosure components (fans, power supplies, temperature sensors) depend on SES (SCSI Enclosure Services) data from the hardware vendor. If the enclosure does not expose SES, the Health Service has no way to read those components and they will not appear as faults.
Some fault types ship disabled by default (e.g. Microsoft.Health.FaultType.PhysicalDisk.HighLatency.AverageIO) to avoid alert fatigue in typical deployments. Enable on demand via Set-StorageHealthSetting (shown above).
The location field on hardware faults is derived from your fault domain configuration (datacenter / rack / node / slot). If fault domains are not configured, location output will be sparse — often just a slot number. See Fault domain awareness if you want richer location output.
Applies to Azure Local 2311.2 and later. The same underlying Health Service also runs on Windows Server 2019 and 2022 Storage Spaces Direct clusters, with the same Get-HealthFault cmdlet — though the Azure Monitor alerts integration described above is Azure Local only.

Going forward:

The Health Service itself is on by default on Azure Local clusters — there is no “enable Health Service” step. Treat Get-HealthFault as the first triage stop when a customer reports something is off, before reaching for metrics or event logs; if it returns nothing, you can rule out most common hardware and configuration problems quickly. For production deployments, hook the cluster up to Azure Monitor (Step 2 above) so faults page the right team automatically — just remember the hookup is “turn on capability, then write processing rules,” not the usual “create an alert rule” flow.

Optional details:

Microsoft’s reference workbook — Azure-Samples/AzureLocal/health-service-faults — is the canonical catalogue. The .xlsx in that folder has seven sheets and is worth bookmarking:

Health Faults — ~90 storage-side fault types (PhysicalDisk, VirtualDisk, StoragePool, Volume, Server, Subsystem, etc.) with title, description, recommended action, provider, triggering condition, and check frequency.
NetworkHUD Faults — networking fault catalogue (NetworkAdapter, PhysicalSwitch) from the Network HUD service. Covers flapping NICs, NIC resets, LLDP issues, driver validation, etc.
Health Settings — ~300 tunable settings with the exact Get-StorageHealthSetting / Set-StorageHealthSetting name to use. This is where you find the right setting string for an enable / threshold change.
Collected Metrics — every metric the Health Service collects (Compute VM, Storage, Network, etc.), keyed by metric name (e.g. Microsoft.Health.Metric.VM.CPUUtilization).
Health Fault Severity — the canonical severity mapping table: Health Urgency (UNHEALTHY 2 / WARNING 1 / HEALTHY 0) ↔ WMI Fault Severity (RED 6 / YELLOW 4 / GREEN 3) ↔ Perceived Severity (Critical / Minor / Degraded). Use this when correlating WMI fault values with Azure Monitor severity labels.
Health Providers — the providers that emit faults (Storage Health, Cluster Health, Storage Replica, Storage QoS, Compute, Perf, etc.) and their GUIDs, useful when filtering Azure Monitor alerts by source.
Fault Entities — the entity types a fault can target (Microsoft.Health.EntityType.Server, PhysicalDisk, StoragePool, VirtualDisk, Volume, etc.).
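
For scripted correlation, the Health Fault Severity mapping described above can be held in a small lookup table. This is a sketch that only restates the values quoted in this article; $SeverityMap and Get-SeverityMapping are hypothetical names, and the workbook remains the source of truth.

```powershell
# Severity mapping as quoted above from the Health Fault Severity sheet.
$SeverityMap = @(
    [pscustomobject]@{ HealthUrgency = 'UNHEALTHY'; UrgencyValue = 2; WmiSeverity = 'RED';    WmiValue = 6; PerceivedSeverity = 'Critical' }
    [pscustomobject]@{ HealthUrgency = 'WARNING';   UrgencyValue = 1; WmiSeverity = 'YELLOW'; WmiValue = 4; PerceivedSeverity = 'Minor' }
    [pscustomobject]@{ HealthUrgency = 'HEALTHY';   UrgencyValue = 0; WmiSeverity = 'GREEN';  WmiValue = 3; PerceivedSeverity = 'Degraded' }
)

# Look up the mapping row for a numeric WMI fault severity.
# Returns nothing for values that are not in the table.
function Get-SeverityMapping {
    param([int]$WmiValue)
    $SeverityMap | Where-Object { $_.WmiValue -eq $WmiValue }
}

# Usage:
# (Get-SeverityMapping -WmiValue 6).PerceivedSeverity
```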

Other reference details:

Each fault carries: FaultId (unique within the cluster), FaultType (e.g. Microsoft.Health.FaultType.Volume.Capacity), Reason, PerceivedSeverity (see the Health Fault Severity sheet for the mapping), FaultingObjectDescription (part info for hardware), FaultingObjectLocation, and RecommendedActions.
For programmatic access outside Azure Monitor (custom on-prem dashboards, scripted notification, etc.), the Health Service also exposes the WMI event class MSFT_StorageFaultEvent. Each event includes a ChangeType field (0 = Create, 1 = Remove, 2 = Update) plus all the fault properties above. The Microsoft Learn page View Health Service faults includes a complete C# subscription example.
Faults can be rediscovered after intermittent issues (failover, a flapping link, etc.). Azure Monitor deduplicates automatically via FaultId; for custom integrations, key on FaultId yourself so you don't page the on-call repeatedly for the same fault.
Full Microsoft docs: View Health Service faults, Respond to Azure Local health alerts using Azure Monitor alerts, and Modify Health Service settings.
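
For a custom integration, keying on FaultId can be as simple as remembering when each fault last paged someone. The sketch below shows one way to do that: Test-ShouldNotify, the 30-minute cooldown, and the WMI namespace in the usage comment are assumptions, not documented values.

```powershell
# Decide whether a fault event should page someone.
# Only Create events (ChangeType 0) notify, and a FaultId that already paged
# within the cooldown window is suppressed so that rediscovery does not re-page.
function Test-ShouldNotify {
    param(
        [hashtable]$LastNotified,   # FaultId -> time of the last notification
        [string]$FaultId,
        [int]$ChangeType,           # 0 = Create, 1 = Remove, 2 = Update
        [datetime]$Now = (Get-Date),
        [timespan]$Cooldown = ([timespan]::FromMinutes(30))   # assumed window
    )

    if ($ChangeType -ne 0) { return $false }

    if ($LastNotified.ContainsKey($FaultId) -and
        (($Now - $LastNotified[$FaultId]) -lt $Cooldown)) {
        return $false
    }

    $LastNotified[$FaultId] = $Now
    return $true
}

# Usage sketch. The namespace below is an assumption; confirm it for your build.
# $state = @{}
# Register-CimIndicationEvent -Namespace 'root\Microsoft\Windows\Storage' `
#     -ClassName 'MSFT_StorageFaultEvent' -SourceIdentifier 'HealthFault'
# while ($true) {
#     $e = Wait-Event -SourceIdentifier 'HealthFault'
#     $fault = $e.SourceEventArgs.NewEvent
#     if (Test-ShouldNotify -LastNotified $state -FaultId $fault.FaultId -ChangeType $fault.ChangeType) {
#         "PAGE: $($fault.Reason)"
#     }
#     Remove-Event -EventIdentifier $e.EventIdentifier
# }
```

The cooldown is measured from the last notification actually sent, so a fault that keeps flapping pages at most once per window.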
