The Top 10 Azure Egress Traps (and How to Avoid Them)

If you’ve ever opened Cost Management on a Monday and whispered "how… why… who did this", you’re not alone. In EU Azure estates, egress shows up like glitter after a craft party: everywhere, hard to remove, and somehow still there next week. The cure isn’t a bigger budget; it’s a better map. Egress charges hide in paths. Shorten the path, privatize the path, or cache the path, and your bill calms down.

This is a field guide for platform engineers and FinOps teams who want fewer surprises and snappier apps. We’ll talk symptoms (the clue), why it happens (the physics), how to spot it (the detective work), and fix patterns (the move). We’ll keep it friendly, punchy, and EU-region-savvy.

1) Cross-zone traffic inside one region

You deploy everything "in West Europe", pat yourself on the back, and… egress shows up anyway. The symptom is sneaky: latency between app and DB is a few ms higher than expected, and your Data Transfer line has a quiet, steady heartbeat.

Why it happens is simple: Availability Zones are different datacenters with real distance between them. Cross-zone bytes often cost per GB. If your VMSS spreads across zones while your database or Redis sits in one, chats that hop zones add up.

How to spot it: look at your placement, not just your region. Check which zone your stateful bits live in and which zones your compute fans out to. Inspect effective routes and use Connection Monitor to compare Zone 1 ↔ Zone 1 vs Zone 1 ↔ Zone 2 latency. If "same region" still feels slow, you’re probably paying to criss-cross.
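
A minimal placement check, sketched with a placeholder resource group name:

# Hypothetical resource group; substitute your own.
RG=rg-app-weu

# Which zones do the compute resources actually occupy?
az vm list -g $RG --query "[].{name:name, zones:zones}" -o table
az vmss list -g $RG --query "[].{name:name, zones:zones}" -o table

# Zonal disks reveal where the stateful tier really lives.
az disk list -g $RG --query "[].{name:name, zones:zones, sku:sku.name}" -o table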

Fix patterns: keep chatty tiers in the same zone or deploy a zonal copy per zone. Zone-redundant storage and databases are great for durability, but if your hot path jumps zones constantly, consider zonal deployments and in-zone caching. Move one tier as a canary, watch the bill for a week, then roll the pattern out.

2) Inter-region replication (GRS, SQL, Cosmos) that leaks money at night

The classic 02:00 spike: dashboards yawn, but your bill doesn’t. Storage accounts with GRS, Azure SQL replicas, or Cosmos multi-region writes are moving bytes across regions: over the Microsoft backbone, yes, but still billable.

The symptom is periodic: the graph climbs when replication windows open. The "why" is policy drift: someone ticked "geo-redundant" out of caution years ago, and now you’re paying for an RPO/RTO you don’t actually need, or one you could meet with ZRS.

Spotting this is a quick audit. Storage: check replication (LRS/ZRS vs GRS/RA-GRS/GZRS). SQL: list auto-failover groups and linked regions. Cosmos: enumerate regions and write regions. If your compute is in West Europe but you quietly replicate to North Europe (or beyond), that overnight bump is your confirmation.
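
The audit fits in a few commands; the group, server, and account names below are placeholders.

# Storage: anything on GRS/RA-GRS/GZRS is copying bytes out of region.
az storage account list --query "[].{name:name, sku:sku.name, location:location}" -o table

# SQL: which failover groups exist, and which partner region do they link to?
az sql failover-group list -g rg-data-weu -s sql-weu-01 -o table

# Cosmos DB: enumerate read and write regions.
az cosmosdb show -g rg-data-weu -n cosmos-weu-01 \
  --query "{readRegions: readLocations[].locationName, writeRegions: writeLocations[].locationName}"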

Fix patterns: keep resilience in-region first. ZRS hits the right availability target for many workloads. If you must go cross-region, keep it intra-EU and match the replica to a real DR plan. Throttle or schedule replication into quieter hours, and avoid cross-region reads from analytics jobs: use local analytical stores, or copy snapshots into the same region as the compute.

3) PaaS public endpoints vs. Private Link: the quiet hairpin

Everything looks fine (HTTPS to Storage, Key Vault, SQL) until the line item for egress says "surprise". Public FQDNs often send traffic via public edges, and the path can step outside your virtual network even if it technically never leaves Azure.

The symptom is mundane: "we only allow 443 outbound", and yet costs creep up as throughput grows. Why it happens: DNS points your app to a public endpoint, and the journey takes the scenic route.

Spot it by asking DNS the awkward question. From a VM, nslookup <account>.blob.core.windows.net vs nslookup <account>.privatelink.blob.core.windows.net. If your app code, connection strings, or managed identities still talk to public names, you’ll see it in the answer. No Private Endpoint events? That’s another tell.

Fix patterns: create Private Endpoints for your PaaS staples and pair them with private DNS zones (e.g., privatelink.blob.core.windows.net, privatelink.vaultcore.azure.net). Lock public network access to "Selected networks" or Disabled. Then verify from your workload subnet that FQDNs resolve to 10.x addresses. Your packets and your bill will thank you.
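
Once the endpoints exist (Bicep and CLI for that are further down), shutting the public door is a sketch like this, with placeholder names:

# Refuse the public endpoint once private paths are in place.
az storage account update -g rg-data-weu -n stweuapps01 --public-network-access Disabled
az keyvault update -g rg-data-weu -n kv-weu-apps --public-network-access Disabled

# Then confirm from the workload subnet that the FQDN resolves to a private (10.x) address.
nslookup stweuapps01.blob.core.windows.net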

4) ACR pulls from the wrong region

AKS nodes take a coffee break pulling images, CI/CD is fine, and "ACR Data Out" starts whispering from an unexpected region. The registry lives in Region A; your clusters live in Region B. Every image pull is a cross-region fetch.

You’ll spot it in ACR metrics (Data Out) and in slow node provisioning. Why it happens is nothing sinister: just a registry placement that no longer matches compute growth.
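
Two queries make the mismatch obvious (the registry name is a placeholder):

# Where does the registry live, and where do the clusters pulling from it live?
az acr show -n crweuapps01 --query "{name:name, location:location, sku:sku.name}"
az aks list --query "[].{name:name, location:location}" -o table

# Any replicas already in place?
az acr replication list -r crweuapps01 -o table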

Fix patterns are joyful: enable ACR geo-replication to the regions where your clusters run; the single login server then routes pulls to the nearest replica. Warm images with ACR Tasks post-push. If you span clouds or edge sites, consider a local registry mirror to keep image gravity close to compute.

5) Per-GB processing in Firewall, App Gateway, and Front Door

You planned for egress. You forgot about processed GB. Layer-7 services often charge per GB handled, and chaining them compounds the cost: Front Door → WAF → App Gateway → Firewall → App is a beautiful diagram and a pricey conveyor belt.

Symptoms show up as "we paid twice (or thrice) for the same GB." The why is how these services work: they terminate, inspect, and forward, and each tier tallies bytes.

To spot it, map the request path end-to-end and compare "data processed" metrics per hop to egress. If you’re double-dipping, the graphs align a little too perfectly.
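
One way to pull those numbers, sketched here for Azure Firewall with placeholder names; metric names differ per service and SKU, so list the definitions first rather than trusting the one assumed below.

# Requires the azure-firewall CLI extension; substitute your own resource per hop.
FW_ID=$(az network firewall show -g rg-hub-weu -n fw-hub-weu --query id -o tsv)

# What metrics does this hop expose at all?
az monitor metrics list-definitions --resource $FW_ID -o table

# Data processed over the last day (assumes the metric is named DataProcessed).
az monitor metrics list --resource $FW_ID --metric DataProcessed \
  --interval PT1H --offset 1d -o table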

Fix patterns: flatten. Keep one L7 hop if you can. Push WAF to Front Door and remove redundant WAF at AppGW. For outbound, don’t chain Azure Firewall behind a third-party appliance unless you must. Turn on caching where appropriate; the cheapest processed GB is the one you never process.

6) CDN/Front Door origins too far from the truth

CDN looks heroic on the slide deck, but cache misses still hammer the origin. Users across the EU hit a POP, which fetches from an origin in West Europe, and your inter-region egress grows every time the cache forgets.

The symptom is modest cache ratios with large "origin bytes." Why it happens is origin distance and poor cacheability: too-short TTLs, missing ETags, or a chatty API.

Spot it by reading the CDN/Front Door reports: origin shield status, origin fetches, cache hit ratio. Trace from a client to the origin name and check where your storage or app service actually lives.

Fix patterns: put the origin in the same region as its backing data, enable Origin Shield close to origin, compress everything compressible, set sane TTLs, and serve validators (ETag/Last-Modified) so you can revalidate instead of re-download. For APIs, cache safe GETs aggressively.
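
A quick outside-in check of cacheability, against a placeholder URL: you want validators, a sensible Cache-Control, compression, and whatever cache-status header your CDN adds.

curl -sI https://www.example.com/assets/app.js \
  | grep -iE 'etag|last-modified|cache-control|content-encoding|x-cache'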

7) Cross-region diagnostic exports, the slow bleed

Telemetry is your friend until it’s 24/7 egress. Workloads in Region B ship diagnostics to a Log Analytics or Data Explorer workspace in Region A because "that’s where the team made the first workspace." Bytes flow all day, every day.

The symptom is boring but relentless. The why is centralization without residency planning. A unified workspace feels tidy; the network bill disagrees.

Spot it in each resource’s Diagnostic settings and the workspace’s region. Ingested bytes by resource tell the story: if your top talker lives in another region, you’ve found the leak.
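
A spot check of one resource’s log destination versus its own region, with placeholder names:

# Pick one of your top talkers.
RES_ID=$(az vm show -g rg-app-neu -n vm-app01 --query id -o tsv)

# Where do its diagnostics go? (Older az versions wrap this list in a "value" property; adjust the query if so.)
WS_ID=$(az monitor diagnostic-settings list --resource $RES_ID --query "[0].workspaceId" -o tsv)

# And which region does that workspace live in?
az resource show --ids $WS_ID --query location -o tsv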

Fix patterns: create regional workspaces per landing zone, use Data Collection Rules to keep logs local, and build dashboards with cross-workspace queries. If you must centralize, export summaries or aggregates, not raw firehoses.

8) Hybrid tromboning (on-prem ↔ Azure ↔ on-prem)

Requests go branch → HQ → Azure → internet → and back again, like a boomerang that never rests. Latency is weird, and egress looks like a mirror bill: one trip out, one trip back.

This happens because default routes and legacy SD-WAN habits send traffic through HQ to "inspect everything", including SaaS and Azure-to-Azure paths that would be happier and cheaper staying in Azure.

You spot it by checking effective routes (0.0.0.0/0 via VPN/ExpressRoute) and capturing packets that show two long legs where one should do. SD-WAN classifications often label Azure as "internet", sealing the fate.
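
The effective route table makes the trombone visible (NIC and group names are placeholders):

# If 0.0.0.0/0 points at a VirtualNetworkGateway (VPN/ExpressRoute), internet-bound
# traffic is taking the long way home through on-prem.
az network nic show-effective-route-table -g rg-app-weu -n nic-app01 -o table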

Fix patterns: egress in-region. Use Azure Firewall or Virtual WAN secure hub as your north-south policy point. Publish only the prefixes you actually need over BGP. Allow local internet breakout for SaaS where policy allows. And for PaaS, prefer Private Link so packets never chase their tails.

9) Data Factory/Synapse copy paths that wander

Pipelines run; SLAs pass; bills grumble. The integration runtime (IR) auto-resolves in a different region than your source or sink, or your staging blob sits somewhere "temporary" and far away.

The symptom is copies that seem slow for "nearby" data and a line item that grows with ETL. The why is staging and compute placement that doesn’t match your data gravity.

Spot it by inspecting each pipeline’s IR region and the region of the staging account. The activity run details tell you source region → IR region → sink region. If those arrows cross regions, the bill is telling the truth.
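
A rough CLI check, assuming the datafactory extension and placeholder names; the activity run details remain the most direct view of source → IR → sink.

# List the factory's integration runtimes; for Azure IRs, inspect
# typeProperties.computeProperties.location ("AutoResolve" means the service chooses).
az datafactory integration-runtime list -g rg-etl-weu --factory-name adf-weu-01 -o json

# Where does the staging account actually live?
az storage account show -n ststagingtemp01 --query location -o tsv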

Fix patterns: pin IR to the same region as source/sink, keep staging blob in the sink’s region, batch and compress where possible, and scale parallelism within a region rather than spraying across regions.

10) Centralized NAT/Firewall hairpins

One glorious hub in Region A processes all outbound for estates in Regions B and C. Simple to diagram, expensive to operate. Spokes in B route to A, then to the internet or to a Private Link in B via… A. You pay in egress and in per-GB processing.

The symptom is hub saturation (SNAT ports, data processed) and tidy but inflated costs. The why is a design optimized for simplicity, not for physics.

Spot it by checking UDRs (0.0.0.0/0 to a remote regional hub) and by reading Azure Firewall metrics in the hub. If your spoke in North Europe egresses to a Storage Private Endpoint in North Europe via West Europe, you’ve built a hair salon for packets.
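
A quick UDR sweep, with placeholder names:

# Every route table in the subscription, then the routes inside a suspect one.
az network route-table list --query "[].{name:name, rg:resourceGroup, location:location}" -o table
az network route-table route list -g rg-net-neu --route-table-name rt-spoke-neu -o table

# A 0.0.0.0/0 route whose next-hop IP sits in another region's hub is the smoking gun.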

Fix patterns: build regional hubs. Use one firewall per region or a Virtual WAN secure hub. For simple outbound, a NAT Gateway per region is boring and perfect. Best of all, avoid internet egress altogether with Private Endpoints to the PaaS services your apps use.
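
The regional NAT Gateway itself is only a few commands, sketched for a North Europe spoke with placeholder names:

# Public IP, NAT Gateway, then attach it to the workload subnet.
az network public-ip create -g rg-net-neu -n pip-nat-neu -l northeurope --sku Standard
az network nat gateway create -g rg-net-neu -n natgw-neu -l northeurope \
  --public-ip-addresses pip-nat-neu
NAT_ID=$(az network nat gateway show -g rg-net-neu -n natgw-neu --query id -o tsv)
az network vnet subnet update -g rg-net-neu --vnet-name vnet-neu -n snet-app \
  --nat-gateway $NAT_ID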

A quick summary table (for your screenshot folder)

| #  | Trap                     | Signal                      | Quick fix                          |
|----|--------------------------|-----------------------------|------------------------------------|
| 1  | Cross-zone traffic       | Same-region egress shows up | Co-locate per zone; zonal caches   |
| 2  | Inter-region replication | Nightly spikes              | Prefer ZRS; keep replicas intra-EU |
| 3  | Public PaaS endpoints    | Public DNS answers          | Private Endpoints + private DNS    |
| 4  | ACR wrong region         | Slow pulls; ACR Data Out    | Geo-replicate; use local login     |
| 5  | Per-GB L7 processing     | Multiple L7 hops            | Flatten; cache at the edge         |
| 6  | Origin distance          | Low cache hit; origin bytes | Align origin; Origin Shield        |
| 7  | Cross-region diagnostics | Workspace in other region   | Regional workspaces + DCR          |
| 8  | Hybrid tromboning        | Two long trips per flow     | Local breakout; Azure FW in-region |
| 9  | Data Factory paths       | Copy slow; staging far      | IR pinned; local staging           |
| 10 | Centralized hairpins     | Hub SNAT saturation         | Regional hubs; NAT GW per region   |

Bicep & CLI you can drop into a PR

Let’s make two common fixes tangible: Private Endpoints (with private DNS) and ACR geo-replication.

Bicep: Private Endpoint for Storage + Private DNS (EU example)

param location string = 'westeurope'
param vnetId string
param subnetId string
param saName string

resource sa 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: saName
}

resource pdns 'Microsoft.Network/privateDnsZones@2020-06-01' = {
  name: 'privatelink.blob.core.windows.net'
  location: 'global'
}

resource link 'Microsoft.Network/privateDnsZones/virtualNetworkLinks@2020-06-01' = {
  name: 'weu-link'
  parent: pdns
  location: 'global'
  properties: {
    virtualNetwork: { id: vnetId }
    registrationEnabled: false
  }
}

resource pe 'Microsoft.Network/privateEndpoints@2023-04-01' = {
  name: 'pe-blob-${saName}'
  location: location
  properties: {
    subnet: { id: subnetId }
    privateLinkServiceConnections: [
      {
        name: 'blob'
        properties: {
          privateLinkServiceId: sa.id
          groupIds: [ 'blob' ]
        }
      }
    ]
  }
}

resource pnc 'Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2022-09-01' = {
  name: 'pdz-${saName}'
  parent: pe
  properties: {
    privateDnsZoneConfigs: [
      {
        name: 'blob'
        properties: {
          privateDnsZoneId: pdns.id
        }
      }
    ]
  }
}

Bicep: ACR Geo-replication (WEU + NEU)

param acrName string
param replicaLocation string = 'northeurope'

// Geo-replication requires the Premium SKU. The registry's home region (here assumed
// to be westeurope) serves pulls natively; only additional regions need a replication.
resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: acrName
}

resource replica 'Microsoft.ContainerRegistry/registries/replications@2023-07-01' = {
  parent: acr
  name: replicaLocation
  location: replicaLocation
}

az CLI: Private Endpoint + DNS link (Storage)

RG=rg-net-weu
LOC=westeurope
VNET_ID=$(az network vnet show -g $RG -n vnet-weu --query id -o tsv)
SUBNET_ID=$(az network vnet subnet show -g $RG --vnet-name vnet-weu -n snet-privatelink --query id -o tsv)
SA=stweuapps01

az network private-dns zone create -g $RG -n privatelink.blob.core.windows.net
az network private-dns link vnet create -g $RG -n weu-link \
  --zone-name privatelink.blob.core.windows.net --virtual-network $VNET_ID --registration-enabled false

SA_ID=$(az storage account show -n $SA -g $RG --query id -o tsv)
az network private-endpoint create -g $RG -n pe-blob-$SA -l $LOC \
  --subnet $SUBNET_ID --private-connection-resource-id $SA_ID \
  --group-id blob --connection-name blob-conn

az network private-endpoint dns-zone-group create -g $RG --endpoint-name pe-blob-$SA \
  -n blob-dns --private-dns-zone privatelink.blob.core.windows.net --zone-name blob

az CLI: ACR Geo-replication

RG=rg-build-weu
ACR=crweuapps01

# Create replica in North Europe
az acr replication create -r $ACR -g $RG -l northeurope

# Verify replicas and status
az acr replication list -r $ACR -o table

# Note: the login server stays crweuapps01.azurecr.io; pulls are routed to the nearest replica.
# Dedicated data endpoints (if enabled) look like crweuapps01.northeurope.data.azurecr.io.

Quick playbooks that actually move the needle

Shrink a cross-zone hot path this week: list the noisiest pair (app ↔ DB or app ↔ Redis). Pin both to Zone 1 for a day in staging. If latency and cross-zone egress drop, roll the change in prod with maintenance windows per zone.

Make ACR pulls local: list clusters and regions; add ACR replicas to match; warm top images with an ACR Task; roll nodes to force fresh pulls; watch node provisioning times fall.

Flatten L7 hops: sketch the path from user to app. If you see three L7 layers, merge two. Move WAF rules to Front Door. Turn on compression and cache headers. Measure "data processed" before and after.

The EU wrinkles worth remembering

Data residency drives design. Keep replication inside the EU boundary when policy demands it. West Europe ↔ North Europe is close, but distance isn’t free: two milliseconds here, a few cents there, and the month has about 2.6 million seconds. Also, some SaaS and PaaS "EU endpoints" still front multi-region backends; when Private Link is available, it’s your friend both for privacy and for predictable paths.

10 guardrails to keep egress boring (the good kind)

  1. Policy: Deny public network access for Storage/SQL/Key Vault unless a deliberate allow-public=true tag is present.
  2. Private first: Require Private Endpoints for production PaaS; enforce with Policy.
  3. Telemetry local: Enforce regional Log Analytics/Data Explorer workspaces; block cross-region destinations by policy.
  4. Routes: No 0.0.0.0/0 UDRs to hubs in other regions.
  5. Registry: Every compute region needs an ACR replica. No replica, no cluster.
  6. Origins: CDN/Front Door origins must sit in the same region as data at rest.
  7. Hubs: One firewall/NAT per region; no cross-region SNAT.
  8. IR placement: Data Factory/Synapse IR pinned to source and sink region; staging storage co-located.
  9. Observability: Dashboards include origin bytes and processed GB for AFD/AppGW/Firewall.
  10. Change gates: Any new endpoint or SKU passes a DNS + route + cost checklist before merge.

Print that. Tape it to the wall. Or better, encode it in Policy and CI checks.
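
As a taste of what "encode it in Policy" can look like, here is a minimal sketch of guardrail 1 as a custom definition (deny public network access on Storage unless an allow-public=true tag is present). The alias and effect should be validated in audit mode against your own environment before you let it deny anything; names are placeholders.

# Write the policy rule, then create and assign the definition.
cat > deny-public-storage.json <<'EOF'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
      { "field": "Microsoft.Storage/storageAccounts/publicNetworkAccess", "notEquals": "Disabled" },
      { "field": "tags['allow-public']", "notEquals": "true" }
    ]
  },
  "then": { "effect": "deny" }
}
EOF

az policy definition create -n deny-public-storage --mode Indexed \
  --rules deny-public-storage.json \
  --display-name "Deny public network access on Storage (unless tagged allow-public=true)"

az policy assignment create -n deny-public-storage --policy deny-public-storage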

Try this next (a 90-minute sprint)

  1. Export the last 60 days of Cost Management. Filter for meters named like Data Transfer or Data Processed.
  2. For the top two offenders, draw the actual packet path on one page. Include DNS answer, effective route, and every L7 hop.
  3. Make one surgical fix each: add a Private Endpoint, create one ACR replica, or move one IR.
  4. Add alerts: ACR Data Out, Azure Firewall Data Processed, Front Door Origin Bytes.
  5. Open a ticket to replace any cross-region NAT with a regional NAT Gateway or secure Virtual Hub.

Final thought

Egress isn’t a villain. It’s a breadcrumb trail. Follow it and you’ll discover where your architecture disagrees with your intentions. Most fixes are satisfying: fewer hops, more locality, cleaner DNS, calmer dashboards. Your users feel it, your auditors appreciate it, and your finance team might even buy you coffee (or at least stop side-eyeing your subscription).