Fix Azure Event Grid Domain Missing Diagnostic Logs

TL;DR

This check flags Azure Event Grid domains that have no diagnostic settings configured, which means delivery failures, publish errors, and security-relevant events go unrecorded. Fix it by attaching a diagnostic setting that streams logs and metrics to a Log Analytics workspace, storage account, or Event Hub.

Event Grid is one of those services that quietly sit in the middle of an event-driven architecture, routing messages between publishers and subscribers. When it works, nobody thinks about it. When it stops working, and you have no diagnostic logs, you are debugging blind. This check catches Event Grid domains where diagnostic settings are missing entirely, leaving you with zero visibility into how events are flowing (or failing to flow) through your system.

What this check detects

The eventgrid_nodiagnostics check inspects every Azure Event Grid domain in your subscription and verifies whether at least one diagnostic setting is attached. A domain is the management container that groups many topics under a single endpoint, so it tends to handle a large volume of events across multiple teams or applications.

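If a domain has no diagnostic setting, the check fails. Without one, Azure does not forward Event Grid's resource logs or platform metrics anywhere durable. Resource logs are not collected at all, and platform metrics stay only in Azure Monitor's built-in metrics store, with its fixed 93-day retention, where they cannot be queried with KQL alongside your logs.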

Note: Several Event Grid resource types support diagnostic settings, including topics and domains. This check targets domains specifically. Domains aggregate many topics, so a single misconfigured domain can blind you to a large slice of your event traffic.
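To reproduce the check by hand for a single domain, one option is to ask Azure Monitor for the domain's diagnostic settings and see whether any come back. The sketch below is not the check's actual implementation. domain_diag_status is a hypothetical helper, and it assumes a logged-in, recent Azure CLI in which az monitor diagnostic-settings list returns a JSON array.

```shell
# Hypothetical helper: prints "present" if the resource has at least one
# diagnostic setting and "missing" otherwise.
# Assumes a recent Azure CLI where the list command returns a JSON array.
domain_diag_status() {
  resource_id="$1"
  # One setting name per line; empty output means no settings exist.
  names=$(az monitor diagnostic-settings list \
    --resource "$resource_id" \
    --query "[].name" -o tsv)
  if [ -n "$names" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

# Usage (DOMAIN_ID is the full resource ID of the Event Grid domain):
#   domain_diag_status "$DOMAIN_ID"
```

A result of "missing" corresponds to the failing state this check describes.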

What you lose without diagnostics

Delivery failure logs that tell you when Event Grid could not reach a subscriber endpoint
Publish failure logs for events rejected at ingestion
Long-term, queryable copies of metrics such as delivery attempts, dropped events, and matched events
An audit trail you can correlate with incidents or security investigations

Why it matters

Diagnostic logging on a messaging backbone is not a nice-to-have. Event Grid is frequently wired into security-sensitive workflows: it fans out resource change notifications, triggers serverless functions, kicks off automated remediation, and forwards events between trust boundaries. A gap in logging here has concrete consequences.

Silent event loss

Event Grid retries failed deliveries and then dead-letters or drops events that exceed the retry policy. If a subscriber endpoint goes down or starts returning errors, events can be discarded permanently. With no diagnostic logs, you have no record that delivery ever failed, no way to know which events were lost, and no signal to alert on. The first sign of trouble becomes a downstream system that mysteriously stopped receiving data.

Warning: By default, Event Grid retries a delivery for up to 24 hours or 30 attempts, whichever comes first. After that, an event with no dead-letter destination is dropped, and without logs there is no record that it ever existed. You cannot replay what you never recorded.
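Dead-lettering is configured on each event subscription rather than on the domain itself. As a rough sketch, an existing subscription could be pointed at a blob container like this. set_dead_letter and its arguments are hypothetical names, and the storage account and container are assumed to exist already.

```shell
# Hypothetical helper: attaches a dead-letter blob container to an existing
# event subscription.
#   $1 = resource ID the subscription is attached to (domain or domain topic)
#   $2 = event subscription name
#   $3 = storage account resource ID
#   $4 = blob container name
set_dead_letter() {
  source_id="$1"; sub_name="$2"; storage_id="$3"; container="$4"
  az eventgrid event-subscription update \
    --name "$sub_name" \
    --source-resource-id "$source_id" \
    --deadletter-endpoint "$storage_id/blobServices/default/containers/$container"
}

# Usage:
#   set_dead_letter "$DOMAIN_ID" my-subscription "$STORAGE_ID" deadletter
```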

Blind spots during incident response

When something breaks in an event-driven system, the first question is always "did the event get published, and did it get delivered?" Diagnostic logs answer that in seconds. Without them, your team burns hours reconstructing event flow from application logs on both sides, assuming those logs even exist.
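Once logs reach a Log Analytics workspace, the delivery question can be asked directly. The sketch below only builds a KQL query string. It assumes the logs land in the AzureDiagnostics table under a DeliveryFailures category, which depends on how the diagnostic setting was created; if your workspace uses resource-specific tables, the table and column names will differ. delivery_failure_query is a hypothetical helper.

```shell
# Hypothetical helper: prints a KQL query that counts Event Grid delivery
# failures per hour over the last N hours (default 24).
# Assumes logs land in the AzureDiagnostics table under the
# "DeliveryFailures" category; adjust for resource-specific tables.
delivery_failure_query() {
  hours="${1:-24}"
  printf '%s' "AzureDiagnostics
| where ResourceProvider == 'MICROSOFT.EVENTGRID' and Category == 'DeliveryFailures'
| where TimeGenerated > ago(${hours}h)
| summarize failures = count() by bin(TimeGenerated, 1h), Resource
| order by TimeGenerated asc"
}

# Usage (WORKSPACE_GUID is the workspace's customer ID, not its resource ID):
#   az monitor log-analytics query \
#     --workspace "$WORKSPACE_GUID" \
#     --analytics-query "$(delivery_failure_query 6)"
```

Depending on your CLI version, the log-analytics query command may need an extension installed before it runs.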

Security and compliance gaps

If an attacker gains access to publish events into your domain or to manipulate event routing, the diagnostic logs are a primary source of evidence. Frameworks like CIS, SOC 2, and ISO 27001 expect logging to be enabled on services that move data across boundaries. A domain with no diagnostic settings is a finding waiting to happen during an audit.

How to fix it

The fix is to create a diagnostic setting on the Event Grid domain that routes logs and metrics to a destination you control. You have three destination options, and you can use more than one at a time:

Log Analytics workspace for querying with KQL and building alerts
Storage account for cheap long-term archival
Event Hub for streaming to a SIEM or third-party tool

Option 1: Azure CLI

First, grab the resource ID of the domain and the destination workspace:

DOMAIN_ID=$(az eventgrid domain show \
  --name my-event-domain \
  --resource-group my-rg \
  --query id -o tsv)

WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --workspace-name my-law \
  --resource-group monitoring-rg \
  --query id -o tsv)

Then create the diagnostic setting, enabling both logs and metrics:

az monitor diagnostic-settings create \
  --name eventgrid-diagnostics \
  --resource "$DOMAIN_ID" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"categoryGroup":"allLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

Tip: Using "categoryGroup":"allLogs" future-proofs the setting. If Azure adds new log categories for Event Grid later, they are captured automatically without you having to update the configuration.
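Because a single diagnostic setting can carry more than one destination, the same command can also archive to a storage account. This is a sketch under assumptions: create_domain_diagnostics is a hypothetical wrapper, and the storage account resource ID is something you would look up yourself, for example with az storage account show.

```shell
# Hypothetical wrapper: creates one diagnostic setting that sends logs and
# metrics to both a Log Analytics workspace and a storage account.
#   $1 = Event Grid domain resource ID
#   $2 = Log Analytics workspace resource ID
#   $3 = storage account resource ID
create_domain_diagnostics() {
  domain_id="$1"; workspace_id="$2"; storage_id="$3"
  az monitor diagnostic-settings create \
    --name eventgrid-diagnostics \
    --resource "$domain_id" \
    --workspace "$workspace_id" \
    --storage-account "$storage_id" \
    --logs '[{"categoryGroup":"allLogs","enabled":true}]' \
    --metrics '[{"category":"AllMetrics","enabled":true}]'
}

# Usage:
#   create_domain_diagnostics "$DOMAIN_ID" "$WORKSPACE_ID" "$STORAGE_ID"
```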

Option 2: Azure Portal

Open your Event Grid domain in the Azure Portal
Under Monitoring, select Diagnostic settings
Click Add diagnostic setting
Check the log categories (or the allLogs group) and AllMetrics
Choose a destination: Send to Log Analytics workspace, Archive to a storage account, or Stream to an event hub
Give the setting a name and click Save

Option 3: Terraform

If you manage infrastructure as code, define the diagnostic setting alongside the domain so it can never drift out of compliance:

resource "azurerm_eventgrid_domain" "example" {
  name                = "my-event-domain"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_monitor_diagnostic_setting" "eventgrid" {
  name                       = "eventgrid-diagnostics"
  target_resource_id         = azurerm_eventgrid_domain.example.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.example.id

  enabled_log {
    category_group = "allLogs"
  }

  metric {
    category = "AllMetrics"
  }
}

Warning: Diagnostic data sent to a Log Analytics workspace is billed per gigabyte ingested and stored. High-volume domains can generate meaningful log volume, so use a storage account for cheap long-term retention and keep only what you actively query in Log Analytics.

Option 4: Bicep

The same setting in Bicep, declared as an extension resource scoped to the domain:

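// Assumption: the domain already exists in the deployment's resource group.
resource eventGridDomain 'Microsoft.EventGrid/domains@2022-06-15' existing = {
  name: 'my-event-domain'
}

// Assumption: the workspace lives in a separate resource group, as in the CLI example.
resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' existing = {
  name: 'my-law'
  scope: resourceGroup('monitoring-rg')
}

// Diagnostic settings are extension resources, so they attach through scope.
resource diagSetting 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'eventgrid-diagnostics'
  scope: eventGridDomain
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    logs: [
      {
        categoryGroup: 'allLogs'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}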

How to prevent it from happening again

Fixing one domain by hand is fine. Making sure no future domain ships without diagnostics requires enforcement. Azure Policy is the right tool here.

Azure Policy with DeployIfNotExists

Azure ships a built-in policy that automatically deploys a diagnostic setting to Event Grid domains that lack one. Once it is assigned, new domains get configured without anyone remembering to do it. Domains that existed before the assignment are only fixed when you run a remediation task.

# Find the built-in policy for Event Grid domain diagnostics
# (the contains() match is case-sensitive; adjust the filter if nothing is returned)
az policy definition list \
  --query "[?contains(displayName, 'Event Grid') && contains(displayName, 'diagnostic')].{name:name, displayName:displayName}" \
  -o table

# Assign it at the subscription scope with a managed identity for remediation.
# Replace the <...> placeholders with your own values.
az policy assignment create \
  --name eventgrid-diag-deploy \
  --policy "<policy-definition-name-or-id>" \
  --scope "/subscriptions/<subscription-id>" \
  --location eastus \
  --mi-system-assigned \
  --params '{"logAnalytics":{"value":"<log-analytics-workspace-resource-id>"}}'

Note: A DeployIfNotExists policy needs a managed identity with permission to create diagnostic settings on the target resources, typically the Monitoring Contributor and Log Analytics Contributor roles. The assignment above provisions a system-assigned identity, but you still need to grant it those roles.
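Granting those roles could look roughly like the following. grant_policy_roles is a hypothetical helper, and it assumes the roles are assigned at the same scope as the policy assignment; if the workspace lives in a different subscription, the Log Analytics Contributor role would need to be granted there instead.

```shell
# Hypothetical helper: looks up the policy assignment's managed identity and
# grants it the two roles at the given scope.
#   $1 = policy assignment name
#   $2 = scope, e.g. /subscriptions/<subscription-id>
grant_policy_roles() {
  assignment_name="$1"; scope="$2"
  principal_id=$(az policy assignment show \
    --name "$assignment_name" \
    --scope "$scope" \
    --query identity.principalId -o tsv)
  for role in "Monitoring Contributor" "Log Analytics Contributor"; do
    az role assignment create \
      --assignee-object-id "$principal_id" \
      --assignee-principal-type ServicePrincipal \
      --role "$role" \
      --scope "$scope"
  done
}

# Usage, followed by a remediation task for domains that already exist:
#   grant_policy_roles eventgrid-diag-deploy "/subscriptions/<subscription-id>"
#   az policy remediation create \
#     --name eventgrid-diag-remediation \
#     --policy-assignment eventgrid-diag-deploy
```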

Catch it in CI/CD

If your domains are defined in Terraform or Bicep, add a policy-as-code gate to your pipeline so a domain without a linked diagnostic setting never merges. Tools like Checkov, tfsec, or a custom Conftest policy can flag an azurerm_eventgrid_domain that has no associated azurerm_monitor_diagnostic_setting.

# Example: run Checkov against your Terraform code in CI
checkov -d ./infra --framework terraform \
  --check CKV2_AZURE_  # diagnostic setting checks
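If no off-the-shelf rule fits, a custom gate can be very small. The sketch below is a crude text heuristic, not a replacement for a real policy engine. check_domain_diagnostics is a hypothetical function that scans .tf files and only recognizes a diagnostic setting written in the literal form target_resource_id = azurerm_eventgrid_domain.<name>.id, so modules, for_each, and indirect references will slip past it.

```shell
# Hypothetical CI gate: fails if any azurerm_eventgrid_domain declared under
# the given directory has no target_resource_id reference pointing at it.
# Crude text heuristic: it only recognizes the literal form
#   target_resource_id = azurerm_eventgrid_domain.<name>.id
check_domain_diagnostics() {
  dir="$1"
  status=0
  # Collect the local names of every declared Event Grid domain.
  names=$(grep -rhoE --include='*.tf' \
    'resource[[:space:]]+"azurerm_eventgrid_domain"[[:space:]]+"[^"]+"' "$dir" \
    | sed -E 's/.*"([^"]+)"$/\1/' || true)
  for name in $names; do
    if ! grep -rqE --include='*.tf' \
      "target_resource_id[[:space:]]*=[[:space:]]*azurerm_eventgrid_domain\.${name}\.id" "$dir"; then
      echo "azurerm_eventgrid_domain.${name} has no diagnostic setting"
      status=1
    fi
  done
  return "$status"
}

# Usage in a pipeline step (a non-zero exit fails the job):
#   check_domain_diagnostics ./infra
```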

Tip: Combine both layers. Azure Policy catches resources created out-of-band through the portal or scripts, while the CI gate gives developers fast feedback before anything is deployed. Defense in depth applies to governance too.

Best practices

Enable diagnostics on every domain and topic, not just the busy ones. The quiet ones are exactly where silent failures hide.
Route to multiple destinations. Send logs to Log Analytics for live querying and to a storage account for cheap, long-term retention that satisfies compliance retention windows.
Always pair logging with a dead-letter destination. Diagnostics tell you an event failed; dead-lettering lets you recover the payload. Configure dead-letter storage on your event subscriptions.
Build alerts on the failure metrics. Create alert rules on DeliveryAttemptFailCount and DroppedEventCount so you hear about problems before your users do.
Standardize the workspace. Point all Event Grid resources at a central monitoring workspace so cross-service correlation during incidents is straightforward.
Review retention settings. Match your log retention to your compliance and incident-investigation needs, not the default.
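As one illustration of the alerting practice above, a metric alert on dropped events might be created like this. create_drop_alert is a hypothetical helper, the threshold and evaluation windows are placeholder choices rather than recommendations, and the action group is assumed to exist already.

```shell
# Hypothetical helper: alerts whenever the domain drops any event within a
# five-minute window.
#   $1 = Event Grid domain resource ID
#   $2 = resource group to hold the alert rule
#   $3 = action group resource ID to notify
create_drop_alert() {
  domain_id="$1"; resource_group="$2"; action_group_id="$3"
  az monitor metrics alert create \
    --name eventgrid-dropped-events \
    --resource-group "$resource_group" \
    --scopes "$domain_id" \
    --condition "total DroppedEventCount > 0" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --action "$action_group_id" \
    --description "Event Grid domain dropped one or more events"
}

# Usage:
#   create_drop_alert "$DOMAIN_ID" monitoring-rg "$ACTION_GROUP_ID"
```

A second rule with the condition "total DeliveryAttemptFailCount > 0" would cover failed delivery attempts in the same way, though it is likely to be noisier because retries count as attempts.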

Danger: Deleting or disabling a diagnostic setting on a production domain immediately stops log collection, and there is no backfill. If you must rotate a setting, create the replacement first and confirm data is flowing before removing the old one.
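The create-first ordering can be sketched as a small function. rotate_diag_setting is a hypothetical helper, and it has two limits worth stating. It only confirms that the replacement setting exists, so confirming that data is actually flowing still means querying the destination yourself. Azure Monitor may also reject a second setting that sends the same categories to the same destination, so the replacement is assumed to target a different workspace.

```shell
# Hypothetical helper: creates a replacement diagnostic setting, checks that
# it exists, and only then deletes the old one.
# Note: `show` confirms the setting exists, not that data is flowing.
#   $1 = resource ID, $2 = old setting name, $3 = new setting name,
#   $4 = resource ID of the new Log Analytics workspace
rotate_diag_setting() {
  resource_id="$1"; old_name="$2"; new_name="$3"; new_workspace_id="$4"

  az monitor diagnostic-settings create \
    --name "$new_name" \
    --resource "$resource_id" \
    --workspace "$new_workspace_id" \
    --logs '[{"categoryGroup":"allLogs","enabled":true}]' \
    --metrics '[{"category":"AllMetrics","enabled":true}]' || return 1

  # Stop here if the replacement is not visible; the old setting stays intact.
  az monitor diagnostic-settings show \
    --name "$new_name" \
    --resource "$resource_id" >/dev/null || return 1

  az monitor diagnostic-settings delete \
    --name "$old_name" \
    --resource "$resource_id"
}

# Usage:
#   rotate_diag_setting "$DOMAIN_ID" eventgrid-diagnostics eventgrid-diagnostics-v2 "$NEW_WORKSPACE_ID"
```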

Event Grid is plumbing, and plumbing fails quietly. A diagnostic setting is the cheapest insurance you can buy against a debugging nightmare. Enable it everywhere, enforce it with policy, and back it up with dead-lettering so that when an event goes missing, you actually know about it.
