This guide provides a detailed blueprint for constructing enterprise-grade AI solutions. By leveraging Semantic Kernel in concert with the Azure ecosystem, we can ensure our applications are safe, compliant, observable, and fully under our control.
Semantic Kernel filters are the first line of defense, acting as middleware to intercept and act upon LLM operations. They provide a structured way for us to enforce policies and inject custom logic directly into the AI execution pipeline.
Filters allow us to wrap logic around function invocations and prompt rendering. This modular approach keeps our core business logic clean while centralizing security, policy, and enrichment tasks.
Official Documentation: Concepts: Filters (Microsoft Learn): The definitive source for understanding how filters work in both .NET and Python.
These execute before and after a KernelFunction runs, giving us control over the inputs and outputs.
These filters run just before a prompt template is populated with data and sent to the LLM.
Why: It's critical to ensure that no sensitive data (PII) is ever sent to an LLM or external API, especially when prompts are dynamically populated with data from user profiles, databases, or API responses. Even if upstream APIs return PII, you can prevent accidental leakage by scanning and redacting at this stage.
How to implement:
Example Solution:
Scan the rendered prompt with a PII detection service and replace any detected entities with a placeholder (e.g., `[REDACTED]` or `***`).

Sample Pseudocode (C#):
```csharp
// Prompt-render filter with PII redaction. After the prompt template is
// rendered, the result is scanned and redacted before it reaches the LLM.
public sealed class PiiRedactionFilter : IPromptRenderFilter
{
    public async Task OnPromptRenderAsync(
        PromptRenderContext context, Func<PromptRenderContext, Task> next)
    {
        await next(context); // let the template render first
        // Detect and redact PII before the prompt leaves our boundary
        context.RenderedPrompt = PiiDetectionService.Redact(context.RenderedPrompt);
    }
}
```
References:
Code Example (PII Detection): GitHub Sample: PII Detection Filter in .NET: This practical example demonstrates how to build a filter that uses a service like Microsoft Presidio to find and handle PII in prompts.
While SK filters provide the mechanism for control, Azure AI Content Safety provides a powerful, pre-built service for detecting and filtering harmful content. This is a critical layer for protecting our brand and our users.
Official Documentation: What is Azure AI Content Safety? (Azure Docs): An overview of the service and its capabilities.
The best practice is to use the Azure AI Content Safety service within a Semantic Kernel filter.
This ensures that both malicious inputs and harmful outputs are caught.
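A minimal sketch of that pattern, assuming the `azure-ai-contentsafety` Python SDK; the endpoint, key, and severity threshold are placeholders, and the pure policy check is separated out so it can be reused for both input and output screening:

```python
# Sketch: screening text with Azure AI Content Safety inside a filter.
# Endpoint, key, and threshold below are placeholders, not real values.

SEVERITY_THRESHOLD = 2  # block anything at or above this severity


def is_blocked(categories: list[dict], threshold: int = SEVERITY_THRESHOLD) -> bool:
    """Pure policy check: True if any analyzed category meets the threshold."""
    return any(c["severity"] >= threshold for c in categories)


def screen_text(text: str) -> bool:
    """Call Content Safety and apply the policy (illustrative network call)."""
    from azure.ai.contentsafety import ContentSafetyClient
    from azure.ai.contentsafety.models import AnalyzeTextOptions
    from azure.core.credentials import AzureKeyCredential

    client = ContentSafetyClient(
        "https://<resource>.cognitiveservices.azure.com",
        AzureKeyCredential("<key>"),
    )
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    return is_blocked([{"severity": c.severity} for c in result.categories_analysis])
```

Running the same check on both the rendered prompt (input) and the model's completion (output) is what catches malicious inputs and harmful outputs alike.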
We cannot secure what we cannot see. Semantic Kernel's built-in support for telemetry provides the deep visibility needed for auditing, debugging, and anomaly detection.
Semantic Kernel uses the industry-standard OpenTelemetry framework to export rich diagnostic data. This data can be sent directly to Azure Application Insights for powerful analysis and visualization.
Official Documentation & Guides:
With this data in Azure Monitor, we can build dashboards for real-time visibility and create automated alerts for suspicious patterns (e.g., a sudden increase in prompt injection attempts).
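The kind of automated alert described above can be as simple as a baseline comparison. This illustrative sketch (thresholds and window sizes are arbitrary examples, not recommendations) flags an hour whose count of suspected prompt-injection events far exceeds the recent average:

```python
# Illustrative anomaly check over telemetry counts: flag when the current
# window's count of suspected prompt-injection events far exceeds baseline.
from statistics import mean


def is_anomalous(history: list[int], current: int,
                 factor: float = 3.0, floor: int = 5) -> bool:
    """True if `current` exceeds `factor` x the historical mean and a minimum floor."""
    baseline = mean(history) if history else 0.0
    return current >= floor and current > factor * baseline


# e.g., hourly counts of flagged prompts over recent hours vs. this hour
history = [1, 0, 2, 1, 1, 0]
print(is_anomalous(history, current=12))  # well above baseline -> True
```

In practice the same threshold logic would live in an Azure Monitor alert rule rather than application code; the sketch just makes the decision rule concrete.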
Community Discussion: GitHub Discussion: Telemetry and Observability: Provides community insights and advanced logging configurations.
To ensure full control over data and maintain compliance, we should host our LLMs within our own Azure tenancy. Semantic Kernel seamlessly integrates with Azure's secure LLM offerings, meaning we do not need to send data to external providers.
Official Documentation: Develop with Semantic Kernel and Azure AI (Microsoft Learn): Guide on connecting SK to models deployed as serverless APIs in Azure AI Studio.
Blog: Building AI agents with SK and Azure OpenAI (Will Velida): A great introductory article showing the basic wiring of SK with Azure OpenAI.
While Azure provides a secure and tightly integrated ecosystem, Semantic Kernel's greatest strength is its vendor-agnostic architecture. This allows us to avoid vendor lock-in and strategically use the best model for any given task, ensuring maximum performance and cost-effectiveness.
Semantic Kernel's extensibility makes it simple to connect to a wide range of model providers beyond Azure. This is primarily achieved through its support for OpenAI-compatible API endpoints and dedicated connectors.
Services like OpenRouter act as a unified gateway to dozens of models from different providers (Mistral, Anthropic, Google, etc.). They provide a single API key and an OpenAI-compatible endpoint.
Integration uses the standard `AddOpenAIChatCompletion` connector, simply pointing the endpoint to the router service's URL (e.g., https://openrouter.ai/api/v1) and providing our router-specific API key. Switching models then becomes a configuration change (the `modelId`), enabling A/B testing and performance comparisons with minimal code changes.

We can also connect directly to a provider's API. Semantic Kernel often has dedicated connectors for this purpose.
For example, Anthropic models can be used through a dedicated connector package (e.g., `Microsoft.SemanticKernel.Connectors.Anthropic`). This ensures full compatibility and access to model-specific features.

For ultimate control over security, cost, and customization, we can host open-source models on our own cloud infrastructure (e.g., an Azure VM with GPU support).
The process involves using an inference server like vLLM or Ollama on a provisioned cloud VM. These tools load the open-source model and expose it via an OpenAI-compatible API endpoint. Our application's Semantic Kernel configuration then simply points to our self-hosted VM's IP address and port, treating it just like any other OpenAI-standard service.
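A sketch of that wiring, assuming a vLLM or Ollama server already listening at a placeholder address; only the standard OpenAI-style `/chat/completions` request shape is used, which is exactly what Semantic Kernel's OpenAI connector would send to the same base URL:

```python
# Sketch: talking to a self-hosted vLLM/Ollama server through its
# OpenAI-compatible endpoint. Host, port, and model name are placeholders.
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    """Standard OpenAI-style chat-completions payload understood by vLLM/Ollama."""
    return {"model": model,
            "messages": [{"role": "user", "content": user_message}]}


def chat(base_url: str, model: str, user_message: str) -> str:
    """POST to the self-hosted server; SK can be pointed at the same base_url."""
    payload = json.dumps(build_chat_request(model, user_message)).encode()
    req = urllib.request.Request(f"{base_url}/chat/completions", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (assuming a vLLM server on our own VM):
# chat("http://10.0.0.4:8000/v1", "meta-llama/Llama-3-8B-Instruct", "Hello")
```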
This architectural flexibility ensures we can build a sophisticated, resilient, and cost-efficient AI strategy that evolves with the market, leveraging the best models from any provider—or our own—without being locked into a single ecosystem.
Understanding the cost structure of Large Language Models is crucial for building a sustainable and scalable AI strategy. This section breaks down the different pricing models and provides clear scenarios to help guide decision-making for us.
Think of a token as the basic currency for LLMs. A token isn't exactly one word; it's a piece of a word. On average, 100 tokens represent about 75 words.
All interactions with an LLM—both the questions we ask (input) and the answers we receive (output)—are measured in tokens. The total cost of an operation is the sum of the input token cost and the output token cost.
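The arithmetic above can be sketched directly; the rates in the example call are illustrative per-million-token prices, not a quote:

```python
# Back-of-the-envelope token math from the rule of thumb above:
# 100 tokens ~= 75 words, so tokens ~= words / 0.75.


def estimate_tokens(words: int) -> int:
    return round(words / 0.75)


def exchange_cost(input_tokens: int, output_tokens: int,
                  in_rate_per_1m: float, out_rate_per_1m: float) -> float:
    """Total cost = input token cost + output token cost."""
    return (input_tokens * in_rate_per_1m
            + output_tokens * out_rate_per_1m) / 1_000_000


print(estimate_tokens(750))                    # ~1000 tokens
print(exchange_cost(1000, 1000, 5.00, 15.00))  # illustrative rates -> 0.02
```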
This is the most flexible model. We are charged directly for what we use, measured per 1,000 tokens. It's ideal for applications with variable traffic or when starting a new project, as there are no upfront commitments.
For applications with high and predictable usage, the PTU model offers a better solution. Instead of paying per token, we reserve a fixed amount of processing capacity for a set period (e.g., hourly). This provides guaranteed performance and predictable, stable costs, which can be more economical for consistent, high-volume workloads.
Start with Pay-As-You-Go: Perfect for development, testing, and initial launch when usage patterns are unknown.
Switch to PTU for Scale: Once our application has a steady, high volume of requests, analyze our usage and switch to PTUs to optimize costs and ensure consistent performance for users.
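The PAYG-vs-PTU guidance above boils down to a breakeven comparison. A sketch with made-up placeholder prices (substitute real quotes before making any decision):

```python
# Hypothetical breakeven sketch for the PAYG-vs-PTU decision.
# Both prices below are arbitrary placeholders.

PAYG_COST_PER_1K_TOKENS = 0.01   # placeholder blended input/output rate
PTU_COST_PER_HOUR = 2.50         # placeholder reserved-capacity price


def payg_monthly_cost(tokens_per_month: int) -> float:
    return tokens_per_month / 1000 * PAYG_COST_PER_1K_TOKENS


def ptu_monthly_cost(hours: int = 730) -> float:  # ~hours in a month
    return hours * PTU_COST_PER_HOUR


def ptu_is_cheaper(tokens_per_month: int) -> bool:
    return ptu_monthly_cost() < payg_monthly_cost(tokens_per_month)


print(ptu_is_cheaper(50_000_000))   # low volume: PAYG wins
print(ptu_is_cheaper(500_000_000))  # high steady volume: PTU wins
```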
This is an optional but highly recommended service for protecting our brand. It checks both user input and AI output for harmful content.
For most applications, this cost is minimal. For example, processing 50,000 user prompts would cost around $37.50 per month.
Semantic Kernel's flexibility allows us to use models from other providers. Costs here are determined by the provider itself.
If we require maximum control, privacy, or cost efficiency at scale, hosting LLMs on our own cloud VMs is an option. This approach allows us to run open-source or commercial models on infrastructure we manage (such as AWS, Azure, GCP, or OCI), or to use bundled private LLM solutions that we operate ourselves.
| Service & Offering | What We Get | Pricing (approx.) |
|---|---|---|
| Amazon Bedrock | Fully managed LLM API (Anthropic Claude, Jurassic‑2, Mistral) | Varies by model; often competitive or lower than OpenAI. More model choices. |
| Google Vertex AI (Gemini models) | Hosted Geminis via API | Gemini Pro: input ≈$0.0035, output ≈$0.0105 per 1k tokens (very cost-efficient). |
| Deep Infra / OpenRouter / Mistral API | Hosted open-weight LLMs | OpenRouter ≈$0.10 per million tokens total; Microsoft Phi models at ~$0.07 input / $0.14 output per million tokens. |
| AnythingLLM (self-hosted bundle) | Private LLM with vector DB, hosted instance | $50/month basic; $99/month Pro; custom enterprise pricing. |
| Self-hosting on Cloud VMs (AWS/GCP/OCI, per-GPU pricing) | We manage model serving, e.g., Llama on VMs | Varies by provider and GPU. Self-hosting is hourly VM cost only (e.g., $3.91/hr for AWS g4dn.16xlarge). |
The following table compares different models based on a "2,000-token exchange": roughly a 750-word input and a 750-word output, i.e., about 1,000 input tokens and 1,000 output tokens. This provides a standardized way to compare real-world costs.
| Model / Service | Rate per 1M Tokens (Input / Output) | Est. Cost per 2K Exchange | Best Use Case |
|---|---|---|---|
| GPT-4o mini (Azure) | $0.15 / $0.60 | $0.00075 | Fast, ultra-cheap, good for simple chats, summarization, and classification. |
| GPT-3.5 Turbo (Azure) | $3.00 / $4.00 | $0.007 | Reliable and low-cost for general-purpose text and workflow automation. |
| GPT-4o (Azure) | $5.00 / $15.00 | $0.02 | High-performance vision and reasoning for complex, user-facing tasks. |
| GPT-4 Turbo (Azure) | $10.00 / $30.00 | $0.04 | Top-tier model for the most demanding creative and analytical tasks. |
Let's assume we have a feature that makes 50,000 calls per month, with each call being a 2,000-token exchange.
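A worked sketch of this scenario, computed from the per-1M-token rates in the comparison table at ~1,000 input and ~1,000 output tokens per call:

```python
# Worked scenario: 50,000 calls/month, each a 2,000-token exchange
# (~1,000 input + ~1,000 output tokens).
CALLS_PER_MONTH = 50_000
IN_TOKENS, OUT_TOKENS = 1_000, 1_000

# (input rate, output rate) per 1M tokens, from the comparison table
rates = {
    "GPT-4o mini": (0.15, 0.60),
    "GPT-3.5 Turbo": (3.00, 4.00),
    "GPT-4o": (5.00, 15.00),
    "GPT-4 Turbo": (10.00, 30.00),
}


def monthly_cost(in_rate: float, out_rate: float) -> float:
    per_call = (IN_TOKENS * in_rate + OUT_TOKENS * out_rate) / 1_000_000
    return per_call * CALLS_PER_MONTH


for model, (i, o) in rates.items():
    print(f"{model}: ${monthly_cost(i, o):,.2f}/month")
# GPT-4o mini -> $37.50, GPT-3.5 Turbo -> $350.00,
# GPT-4o -> $1,000.00, GPT-4 Turbo -> $2,000.00
```

The spread is the point: at this volume, the model choice alone moves the monthly bill from tens of dollars to thousands.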
Note: Prices are based on publicly available data as of mid-2025 and are for estimation purposes. Actual costs may vary based on region and specific agreements.
Our development strategy is designed for a low-risk, high-value rollout. We will introduce AI capabilities in carefully managed phases, building directly on top of our existing platform. This iterative approach ensures stability, allows us to gather feedback, and provides immediate value to both our internal teams and end-users without disruption.
🧩 A key advantage of this approach is that the AI acts as an intelligent interface over our current systems. We are exposing our existing, robust codebase—handling reservations, orders, club details, and reporting—as secure "plugins" for the AI. Clubs don't need to learn a new platform or migrate data. The AI simply provides a new way to access live data and trigger familiar workflows.
The core of our technical approach involves "wrapping" our existing, secure API endpoints into plugins that Semantic Kernel can understand and use. This process is deliberate and requires careful planning.
We will begin with simple, read-only endpoints (e.g., `getClubDetails`, `listRecentOrders`) before moving to more complex or write-based operations. Each endpoint will be reviewed for its suitability as a discrete "tool" for the AI.

Each phase progressively expands the AI's capabilities. Estimated Effort for each phase is additive: the time for each phase includes the work of all previous phases. For example, Phase 2's estimate includes the time for Phase 1, and so on. (E.g., "6 weeks" for Phase 2 means 2 weeks for Phase 1 plus 4 more weeks for Phase 2.)
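The endpoint-wrapping described above can be sketched as a small plugin class. This is illustrative only: the `kernel_function` stand-in defined here mimics the shape of Semantic Kernel's Python decorator so the sketch runs on its own, and the calls to our real `getClubDetails`/`listRecentOrders` endpoints are stubbed:

```python
# Sketch: wrapping existing read-only API endpoints as an AI "plugin".
# `kernel_function` below is a stand-in for Semantic Kernel's decorator,
# defined locally so this sketch is self-contained; HTTP calls are stubbed.


def kernel_function(description: str):
    """Stand-in for semantic_kernel.functions.kernel_function."""
    def wrap(fn):
        fn.__kernel_description__ = description  # what the AI sees as the tool's purpose
        return fn
    return wrap


class ClubPlugin:
    """Read-only tools exposed to the AI over our existing platform API."""

    @kernel_function(description="Get details for a club by its id.")
    def get_club_details(self, club_id: str) -> dict:
        # Real implementation would call our secure getClubDetails endpoint
        return {"id": club_id, "name": "Demo Club"}

    @kernel_function(description="List a club's most recent orders.")
    def list_recent_orders(self, club_id: str) -> list[dict]:
        # Real implementation would call listRecentOrders
        return [{"order_id": "A-1", "club_id": club_id}]
```

The descriptions matter as much as the code: they are what the model reads when deciding which tool to invoke, which is why each endpoint is reviewed for suitability as a discrete tool.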
To move forward efficiently and validate our approach, we have defined a clear set of immediate actions:
Before integrating AI into our platform, we must carefully evaluate and set up the right model provider. Here's how we approach this decision and the setup process: