
Well-Architected for AI Foundry: Applying the Five Pillars Where the Guidance Still Doesn't Exist

  • peterrivera813
  • Apr 27
  • 6 min read

Microsoft has published Well-Architected Framework guidance for AI workloads. It's a good start. But if you've actually tried to deploy Azure AI Foundry into a production landing zone with real networking constraints, real cost governance, and real SLA expectations, you've probably noticed something: the guidance stops right where the hard questions begin.

I've spent the last several months integrating AI Foundry into enterprise Azure Landing Zone architectures, and this post captures the pillar-by-pillar gaps I've run into. Not theory. Not marketing fluff. Just the stuff I wish someone had written down before I had to figure it out.


The State of the Guidance Today


The Azure Well-Architected Framework now includes an AI workloads section. It covers design principles, data platform considerations, and application design patterns. There's a baseline Foundry chat reference architecture. There's even an assessment tool.

What's missing is the connective tissue between those high-level principles and the very specific, very awkward realities of running AI Foundry inside an enterprise landing zone. The guidance tends to assume you're building a greenfield AI project in isolation. In practice, you're stitching Foundry into an existing hub-spoke topology, fighting with Private DNS zones, and trying to explain PTU billing to a finance team that barely survived the Reserved Instance conversation.

Let me walk through each pillar and highlight where I've had to fill in the blanks.


Reliability: The SLA You Don't Actually Have


There isn't a single, clean composite SLA you can hand to stakeholders. Foundry orchestrates Azure OpenAI, Anthropic, AI Search, Cosmos DB, Storage, and (for agents) Container Apps with subnet delegation. Each has its own SLA, and some depend on zone-redundant deployment.


The Agent Service makes this harder:

  • All workspace resources (Cosmos DB, Storage, AI Search, the Foundry account) must be in the same region as your VNet. That's a hard constraint, not a best practice.

  • Multi-region failover means maintaining a complete parallel stack in the secondary region with its own VNet and all dependencies. There's no "just point the agent at the other region."

  • Capacity for PTU deployments is dynamic. A 429 from an undersized allocation isn't a transient blip; it's a sustained outage for your users if you haven't designed for overflow.


My recommendation: Build your own composite SLA model before you commit to an architecture. Multiply the individual service SLAs together. Stare at the resulting number (the sketch below shows the math). Then design around the weakest link, which today is often model deployment availability itself.
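
Here's the back-of-the-envelope version in Python. The SLA figures are illustrative placeholders, not authoritative numbers; substitute the current published SLA for each service and deployment option you actually use.

# Composite availability: the chance that every dependency is up at once.
# These SLA values are illustrative placeholders only.
slas = {
    "model_deployment": 0.999,
    "ai_search": 0.999,
    "cosmos_db": 0.999,
    "storage": 0.999,
    "container_apps": 0.9995,
}

composite = 1.0
for service, sla in slas.items():
    composite *= sla

downtime_minutes = (1 - composite) * 30 * 24 * 60
print(f"Composite availability: {composite:.4%}")
print(f"Implied downtime budget: ~{downtime_minutes:.0f} minutes per month")

With these placeholder numbers, five individually respectable SLAs compose to roughly 99.55%, which is a very different conversation with stakeholders than "three nines."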


Security: Private Networking Is Where the Pain Lives


Enterprise Azure networking veterans know that Private Endpoints and DNS are the hill most projects die on. AI Foundry takes this to another level.

A fully network-isolated Foundry deployment requires private endpoints for the Foundry account and every dependent resource (AI Search, Cosmos DB, Storage), each paired with its matching Private DNS zone.

Miss one DNS zone, and your agent silently resolves to a public IP. There's no error. It just works — insecurely.

Then there's the BYO VNet vs. managed VNet decision:

  • Managed VNet is simpler but doesn't expose customer-visible NICs. You can't inspect the traffic. You can't route it through your firewall.

  • BYO VNet is the only real option for organizations requiring full egress inspection, which pushes all networking complexity back onto your team.

  • Agent tools like Bing Grounding and SharePoint Grounding communicate over the public internet even in network-isolated deployments. If your compliance posture requires all traffic to stay private, those tools are off the table. The docs bury this in a table footnote.


My recommendation: Treat Foundry private networking as a dedicated workstream, not a checkbox. Stand up a test deployment early, validate DNS resolution with nslookup from a VM inside your VNet, and verify every dependent resource resolves to a private IP before you write a single line of application code.
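
Here's a minimal version of that check, assuming you run it from a VM inside the VNet. The hostnames are examples; substitute the FQDNs of your own Foundry account and its dependent resources.

import socket
import ipaddress

# Example hostnames only; use your own Foundry, Search, Cosmos DB, and Storage FQDNs.
hostnames = [
    "myfoundry.cognitiveservices.azure.com",
    "myfoundry.openai.azure.com",
    "mysearch.search.windows.net",
    "mycosmos.documents.azure.com",
    "mystorage.blob.core.windows.net",
]

for host in hostnames:
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        print(f"FAIL    {host}: did not resolve ({exc})")
        continue
    if ipaddress.ip_address(ip).is_private:
        print(f"ok      {host} -> {ip}")
    else:
        print(f"PUBLIC  {host} -> {ip}  (check the Private DNS zone for this service)")

If anything in that output resolves to a public IP, stop and fix DNS before moving on. The failure mode is silent otherwise.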


Cost Optimization: PTUs Are a Commitment, Not a Feature Toggle


Cost governance for AI Foundry is genuinely different from traditional Azure cost management, and most FinOps teams aren't ready for it.


The core tension is PTUs vs. pay-as-you-go:

  • Standard (PAYG): Billed per token. Easy to reason about, variable cost.

  • PTUs: Billed hourly based on deployed capacity, regardless of utilization. Model-independent (the same reservation covers GPT, DeepSeek, Llama, Manus) but region-specific and deployment-type-specific (Global, Data Zone, Regional). A rough break-even sketch follows this list.
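
The break-even logic itself is simple enough to sketch. The inputs below are placeholders, not real prices; plug in your own negotiated rates and observed token volumes.

def monthly_cost_paygo(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    # Pay-as-you-go: you pay only for tokens actually consumed.
    return tokens_per_month / 1000 * price_per_1k_tokens

def monthly_cost_ptu(ptu_count: int, ptu_hourly_rate: float, hours_per_month: float = 730) -> float:
    # PTU: you pay for deployed capacity every hour, whether or not it's used.
    return ptu_count * ptu_hourly_rate * hours_per_month

# Illustrative placeholder inputs only.
paygo = monthly_cost_paygo(tokens_per_month=500_000_000, price_per_1k_tokens=0.01)
ptu = monthly_cost_ptu(ptu_count=100, ptu_hourly_rate=1.00)
print(f"PAYG: ${paygo:,.0f} per month")
print(f"PTU:  ${ptu:,.0f} per month")

The point isn't the specific numbers; it's that PTU cost is a function of hours deployed, not tokens served, so the comparison only means something once you have real utilization data.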

Where organizations get burned:

  • PTU reservations don't guarantee capacity. You can buy 300 PTUs and then discover the region has no availability for your model at that moment.

  • Microsoft's guidance is clear: deploy first, reserve second. But in practice, procurement and deployment timelines rarely align.

  • Deleting a deployment doesn't cancel the reservation. You manage reservations separately through the Azure portal.

  • Savings are significant (~64% for one-month, ~70% for one-year commitments), but only if utilization is consistent. Spiky workloads need the spillover capability or a hybrid PTU + PAYG overflow pattern, sketched below.
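
Here's a minimal sketch of that overflow pattern, assuming two deployments of the same model on the same resource, one provisioned and one standard. The endpoint, key handling, and deployment names are placeholders; adapt them to however your environment manages credentials.

from openai import AzureOpenAI, RateLimitError

# Placeholder endpoint and deployment names; prefer Entra ID auth over keys in production.
client = AzureOpenAI(
    azure_endpoint="https://myfoundry.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)

def chat_with_overflow(messages: list[dict]) -> str:
    # Try the PTU-backed deployment first; spill to the PAYG deployment on a 429.
    for deployment in ("gpt-4o-ptu", "gpt-4o-paygo"):
        try:
            response = client.chat.completions.create(model=deployment, messages=messages)
            return response.choices[0].message.content
        except RateLimitError:
            continue
    raise RuntimeError("Both deployments throttled; back off and retry")

In practice many teams implement the spillover in APIM rather than application code, but the decision logic is the same.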

For chargeback across business units sharing a PTU pool, the APIM azure-openai-emit-token-metric policy is essential. It pushes token consumption into Application Insights — the only reliable way to meter per-team usage when everyone shares the same provisioned deployment.
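
Once the policy is emitting metrics, the chargeback report is one query away. This sketch assumes a workspace-based Application Insights instance; the metric name and the team dimension are my assumptions about a typical policy configuration, so verify them against what your policy actually emits.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Metric and dimension names depend on how the emit-token-metric policy is configured.
query = """
AppMetrics
| where Name == "Total Tokens"
| extend Team = tostring(Properties["Subscription ID"])
| summarize TotalTokens = sum(Sum) by Team
| order by TotalTokens desc
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(days=30),
)

for table in result.tables:
    for row in table.rows:
        print(row)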


My recommendation: Don't let your first PTU purchase be a one-year reservation. Start with hourly billing during development. Move to a one-month reservation after four weeks of stable utilization data. Only commit to one year once consumption patterns are proven predictable. And build the APIM chargeback pipeline before finance asks, because they will.


Operational Excellence: The Observability Gap Is Real


The Well-Architected guidance now covers operational maturity models for AI. Helpful conceptually. Incomplete operationally.


Model versioning is the first gap:

  • Foundry deployments pin to specific model versions with end-of-life dates. Deployments don't auto-upgrade.

  • You need a process to track deprecation announcements, test new versions against your evaluation suite, and roll forward on a schedule.

  • Hardcoding model versions in Bicep or Terraform creates IaC drift when Microsoft deprecates a pinned version. A quick inventory sketch follows this list.
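
Here's a small inventory sketch to feed that process, using the management-plane SDK. Subscription, resource group, and account names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# List every model deployment on the Foundry account with its pinned version,
# so deprecation announcements can be checked against what is actually running.
for deployment in client.deployments.list("rg-ai-foundry", "my-foundry-account"):
    model = deployment.properties.model
    print(f"{deployment.name}: {model.format} {model.name} version {model.version}")

Run it on a schedule, diff the output against your IaC, and the drift problem at least becomes visible.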

Monitoring is the second:

  • The Provisioned-Managed Utilization V2 metric gives per-minute PTU utilization — but not why utilization is high (a minimal pull of that metric is sketched after this list).

  • Is it larger prompts? Retry loops from failed tool calls? A new system prompt that tripled input tokens? You need APIM-level and application-level telemetry to answer those questions.
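
Pulling the utilization metric programmatically is the first step toward correlating spikes with application behavior. A sketch using azure-monitor-query; the metric name here is my best understanding of what backs the portal chart, so confirm it against the metrics blade for your resource.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())

# Full resource ID of the Foundry / Azure OpenAI account (placeholder).
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/rg-ai-foundry"
    "/providers/Microsoft.CognitiveServices/accounts/my-foundry-account"
)

response = client.query_resource(
    resource_id,
    metric_names=["AzureOpenAIProvisionedManagedUtilizationV2"],  # verify this name
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.MAXIMUM],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.maximum is not None:
                print(f"{point.timestamp}  max utilization: {point.maximum:.1f}%")

Joined with APIM token metrics and application traces, this is what starts to answer the "why" questions.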

Agent lifecycle is the third:

  • Multi-agent architectures involve state management, Cosmos DB-backed memory, and inter-agent coordination, none of which have standard operational runbooks yet.

  • When an agent misbehaves in production (and non-deterministic systems will), your team needs to trace a conversation thread through the orchestration layer, identify which tool call went sideways, and reconstruct the context window at the point of failure. The tooling is improving, but it's not turnkey.


My recommendation: Build your observability stack before deploying agents to production. At minimum: APIM token metrics, Foundry utilization metrics, and application-level trace correlation. Treat model version management like OS patching — scheduled, tested, tracked. Write agent failure runbooks now, while the team is small enough to agree on process.


Performance Efficiency: Latency Is the Silent Killer


AI workloads have a fundamentally different performance profile. Latency isn't just network hops — it's model inference time, token generation speed, and the number of tool calls per conversation turn.


Key considerations:

  • PTU deployments guarantee consistent processing time for accepted calls. At near-100% utilization, new requests get 429'd. Sizing with headroom pushes cost up; that's the core trade-off, and the guidance doesn't quantify it.

  • Cross-region adds up fast. If your application and Foundry are in different regions, expect 30-60ms per hop on Azure's backbone. An agentic workflow making five tool calls per turn accumulates 150-300ms of pure network latency on top of inference time.

  • Semantic caching (Azure Managed Redis) can reduce both cost and latency for repetitive query patterns, but the similarity threshold needs careful tuning. Too aggressive and you serve stale responses. Too conservative and the hit rate doesn't justify the infrastructure. A minimal threshold check is sketched after this list.
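
The threshold tuning is easier to reason about with the scoring logic in front of you. A minimal sketch of the lookup decision, independent of any particular cache backend; the 0.92 threshold is an arbitrary starting point, not a recommendation.

import numpy as np

SIMILARITY_THRESHOLD = 0.92  # arbitrary starting point; tune against real traffic

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cache_lookup(query_embedding: np.ndarray, cache: list[tuple[np.ndarray, str]]):
    # Return a cached response if any stored query is similar enough, else None.
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, cached_response
    # Lower the threshold and you serve stale or wrong answers; raise it and nothing hits.
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

Measure the hit rate and the false-positive rate against logged production queries before you trust it with customer traffic.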


My recommendation: Profile your agent's tool-call patterns before committing to a deployment topology. Count the round-trips per conversation turn, multiply by cross-region latency, then decide whether to co-locate, accept the latency, or redesign the agent to batch tool calls.
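
The arithmetic is trivial, but writing it down forces the design conversation. Placeholder numbers below; measure your own.

# Placeholder inputs; measure these for your workload and regions.
tool_calls_per_turn = 5
cross_region_rtt_ms = 45        # round trip between app region and Foundry region
inference_ms_per_call = 800     # model latency, paid regardless of placement

network_overhead_ms = tool_calls_per_turn * cross_region_rtt_ms
total_ms = tool_calls_per_turn * inference_ms_per_call + network_overhead_ms

print(f"Network overhead per turn: {network_overhead_ms} ms")
print(f"Total per turn: {total_ms} ms")
print(f"Share you could remove by co-locating: {network_overhead_ms / total_ms:.0%}")

If the co-location share comes out in the single digits, latency alone probably doesn't justify restructuring your landing zone; if tool-call round-trips dominate, batching or co-location starts to pay for itself.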


Where Does This Leave Us?


The Well-Architected Framework is the right lens for evaluating AI Foundry deployments, and the five pillars still apply. But the specific guidance (which DNS zones you need, how to model composite SLAs, when PTU reservations make sense, what to monitor when agents go haywire) is still being written.

In the meantime, architects fill in the gaps. Build test environments early. Instrument everything. Treat cost governance as a first-class design concern. Be honest with stakeholders about what "production-ready" means for a platform that's still evolving fast.

The framework gives you the principles. Your job is to apply them where the documentation hasn't caught up yet.

 
 
 
