Introduction

Building a Secure & Governed LLM Application for Fanbase using Semantic Kernel & Azure

This guide provides a detailed blueprint for constructing enterprise-grade AI solutions. By leveraging Semantic Kernel in concert with the Azure ecosystem, we can ensure our applications are safe, compliant, observable, and fully under our control.

1. 🛡️ Granular Control with Semantic Kernel Filters

Semantic Kernel filters are the first line of defense, acting as middleware to intercept and act upon LLM operations. They provide a structured way for us to enforce policies and inject custom logic directly into the AI execution pipeline.

Core Concepts

Filters allow us to wrap logic around function invocations and prompt rendering. This modular approach keeps our core business logic clean while centralizing security, policy, and enrichment tasks.

Official Documentation: Concepts: Filters (Microsoft Learn): The definitive source for understanding how filters work in both .NET and Python.

Types of Filters & Key Use Cases

a) Function Invocation Filters

These execute before and after a KernelFunction runs, giving us control over the inputs and outputs.

  • Pre-Execution Guardrails:
    • Input Validation & Sanitization: Check if inputs match a required format or contain malicious payloads before they are processed.
    • Authorization: Verify if the calling user or process has the correct permissions to execute a specific function.
  • Post-Execution Processing:
    • PII (Personally Identifiable Information) Redaction: Automatically detect and mask sensitive data (e.g., names, emails, financial details) from the LLM's response before it's returned to us or stored.
    • Format Enforcement: Ensure the output adheres to a strict schema, like valid JSON or XML, and attempt to repair it if it doesn't.
  • Resiliency & Fallbacks:
    • Catch exceptions during execution and implement retry logic. For example, if a primary model endpoint fails, a filter can automatically redirect the request to a secondary, fallback model.
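
The pre/post pattern above can be sketched, framework-free, as a wrapper around any callable. This is an illustration of the idea only, not Semantic Kernel's actual filter API; the validation and redaction rules are deliberately simplistic placeholders:

```python
import re
from typing import Callable

def with_guardrails(func: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a function with pre-execution validation and post-execution redaction."""
    def wrapped(user_input: str) -> str:
        # Pre-execution guardrail: reject an obviously malicious payload.
        if "ignore previous instructions" in user_input.lower():
            raise ValueError("Input rejected by pre-execution filter")
        result = func(user_input)
        # Post-execution processing: mask email addresses in the output.
        return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", result)
    return wrapped

@with_guardrails
def answer(question: str) -> str:
    # Stand-in for a real KernelFunction / LLM call.
    return f"Contact alice@example.com about: {question}"
```

In a real application the wrapper body would call out to an authorization service and a PII detector; the structure (validate, invoke, post-process) is the part that carries over.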

b) Prompt Rendering Filters

These filters run just before a prompt template is populated with data and sent to the LLM.

  • Prompt Injection Defense: Analyze user input for tell-tale signs of prompt injection or jailbreaking attempts (e.g., "ignore previous instructions"). The filter can then reject the request, sanitize the input, or flag it for review.
  • Dynamic Prompt Enrichment: Automatically inject relevant, real-time context into the prompt. This could be user history, recent data from a database, or few-shot examples to improve the accuracy of the LLM's response.
  • Semantic Caching: Before sending the prompt, check a cache (like Azure Cache for Redis) to see if a semantically similar prompt has already been answered. If so, return the cached response to save costs and reduce latency.

🔒 PII Detection & Redaction at the Prompt Rendering Stage

Why: It's critical to ensure that no sensitive data (PII) is ever sent to an LLM or external API, especially when prompts are dynamically populated with data from user profiles, databases, or API responses. Even if upstream APIs return PII, you can prevent accidental leakage by scanning and redacting at this stage.

How to implement:

  • Before the prompt is finalized and sent to the LLM, run all variables and merged data through a PII detection/redaction service (e.g., Microsoft Presidio or Azure AI Content Safety).
  • Automatically redact, mask, or remove any detected PII (such as names, emails, phone numbers, or addresses) from the prompt.
  • This ensures that even if sensitive data is present in API/database results, it is never exposed to the LLM.

Example Solution:

  1. Use a Semantic Kernel Prompt Rendering Filter to intercept the prompt just before it is sent.
  2. Pass the prompt (or just the dynamic data) to a PII detection/redaction service.
  3. Replace detected PII with masked values (e.g., [REDACTED] or ***).
  4. Only then allow the prompt to proceed to the LLM.

Sample (C#):

// A prompt-render filter that redacts PII before the prompt reaches the LLM.
// PiiDetectionService stands in for a real service such as Microsoft Presidio.
public sealed class PiiRedactionFilter : IPromptRenderFilter
{
    public async Task OnPromptRenderAsync(
        PromptRenderContext context, Func<PromptRenderContext, Task> next)
    {
        await next(context); // let the prompt template render first

        // Detect and redact PII before the prompt proceeds to the LLM.
        if (context.RenderedPrompt is not null)
        {
            context.RenderedPrompt = PiiDetectionService.Redact(context.RenderedPrompt);
        }
    }
}
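
For illustration, here is a minimal Python stand-in for the redaction step itself. The regexes below are simplistic placeholders, not production-grade PII detection; a real deployment should use a dedicated service such as Microsoft Presidio or Azure AI Content Safety:

```python
import re

# Deliberately simple patterns for demonstration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(prompt: str) -> str:
    """Replace every detected PII match with a labelled [REDACTED] marker."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED {label}]", prompt)
    return prompt
```

For example, `redact("Email bob@club.com or call +44 20 7946 0958")` yields `"Email [REDACTED email] or call [REDACTED phone]"`.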

References:

Code Example (PII Detection): GitHub Sample: PII Detection Filter in .NET: This practical example demonstrates how to build a filter that uses a service like Microsoft Presidio to find and handle PII in prompts.

2. 🔒 Off-the-Shelf Safety with Azure AI Content Safety

While SK filters provide the mechanism for control, Azure AI Content Safety provides a powerful, pre-built service for detecting and filtering harmful content. This is a critical layer for protecting our brand and our users.

Key Capabilities

  • Multi-Modal Harm Detection: Analyzes both text and images to detect content related to hate, violence, self-harm, and sexual content, assigning a severity score for nuanced filtering.
  • Advanced Threat Protection: Includes specific features to detect prompt injection attacks (jailbreaking) and to identify if the model is generating content that is ungrounded or hallucinated (a major risk in RAG scenarios).

Official Documentation: What is Azure AI Content Safety? (Azure Docs): An overview of the service and its capabilities.

Architectural Pattern: The Safety Sandwich

The best practice is to use the Azure AI Content Safety service within a Semantic Kernel filter.

  1. Incoming Request: A filter sends the user's raw input to the Content Safety API. If it's flagged, the request is blocked before it ever reaches the LLM.
  2. Model Invocation: The sanitized prompt is sent to the LLM.
  3. Outgoing Response: A filter intercepts the LLM's raw response and sends it to the Content Safety API. If harmful content is detected, the response can be redacted, replaced with a safe message, or blocked entirely.

This ensures that both malicious inputs and harmful outputs are caught.
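
The three-step flow might be sketched like this. The `is_flagged` classifier is a trivial keyword stub standing in for a call to the Azure AI Content Safety API, and `call_llm` stands in for the model invocation:

```python
BLOCKLIST = {"hate", "violence"}  # crude stand-in for Content Safety categories

def is_flagged(text: str) -> bool:
    # Stub for a call to the Azure AI Content Safety API.
    return any(term in text.lower() for term in BLOCKLIST)

def call_llm(prompt: str) -> str:
    # Stub for the actual model invocation.
    return f"Answer to: {prompt}"

def safe_chat(user_input: str) -> str:
    # 1. Screen the incoming request before it reaches the LLM.
    if is_flagged(user_input):
        return "Request blocked by input safety filter."
    # 2. Invoke the model with the screened prompt.
    response = call_llm(user_input)
    # 3. Screen the outgoing response before returning it.
    if is_flagged(response):
        return "Response withheld by output safety filter."
    return response
```

In Semantic Kernel, steps 1 and 3 would live in filters rather than in the chat function itself, keeping the policy out of the business logic.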

3. 📊 Full Observability with Telemetry & Azure Monitor

We cannot secure what we cannot see. Semantic Kernel's built-in support for telemetry provides the deep visibility needed for auditing, debugging, and anomaly detection.

Core Concepts

Semantic Kernel uses the industry-standard OpenTelemetry framework to export rich diagnostic data. This data can be sent directly to Azure Application Insights for powerful analysis and visualization.

What to Monitor

  • Traces: Visualize the entire lifecycle of a request as it flows through different filters, plugins, and AI model calls. This is essential for debugging complex chains and identifying performance bottlenecks.
  • Metrics: Track key performance indicators (KPIs) like LLM response times, token consumption (for cost management), function error rates, and the number of times content safety filters are triggered.
  • Logs: Capture structured logs that include the rendered prompts, raw model responses, and any exceptions. This creates an invaluable audit trail for compliance and security investigations.

With this data in Azure Monitor, we can build dashboards for real-time visibility and create automated alerts for suspicious patterns (e.g., a sudden increase in prompt injection attempts).
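
As a rough illustration of the kinds of counters worth tracking, here is a hand-rolled aggregator. In practice these values should be emitted through OpenTelemetry and viewed in Application Insights rather than accumulated in application code:

```python
from dataclasses import dataclass, field

@dataclass
class AiMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    errors: int = 0
    safety_blocks: int = 0
    latencies_ms: list[float] = field(default_factory=list)

    def record_call(self, prompt_tokens: int, completion_tokens: int,
                    latency_ms: float) -> None:
        # One record per LLM call: tokens drive cost, latency drives UX.
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.latencies_ms.append(latency_ms)

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @property
    def avg_latency_ms(self) -> float:
        return sum(self.latencies_ms) / len(self.latencies_ms) if self.latencies_ms else 0.0
```

An alert on a sudden jump in `safety_blocks` is exactly the "suspicious pattern" detection described above.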

Community Discussion: GitHub Discussion: Telemetry and Observability: Provides community insights and advanced logging configurations.

4. 🧠 Data Sovereignty with Azure-Hosted LLMs

To ensure full control over data and maintain compliance, we should host our LLMs within our own Azure tenancy. Semantic Kernel seamlessly integrates with Azure's secure LLM offerings, meaning we do not need to send data to external providers.

Azure Hosting Options

  • Azure OpenAI Service: Provides access to powerful models like GPT-4 with the enterprise-grade security and privacy guarantees of Azure. Our data is not used for training external models.
  • Azure AI Inference: Deploy and manage a curated catalog of powerful open-source models (e.g., from Mistral, Meta) as fully managed serverless API endpoints within our environment.

Official Documentation: Develop with Semantic Kernel and Azure AI (Microsoft Learn): Guide on connecting SK to models deployed as serverless APIs in Azure AI Studio.

Blog: Building AI agents with SK and Azure OpenAI (Will Velida): A great introductory article showing the basic wiring of SK with Azure OpenAI.

Security Benefits

  • Network Isolation: Use Azure Virtual Networks (VNets) and Private Endpoints to ensure our models are not accessible from the public internet.
  • Unified Governance: Manage access and permissions using Microsoft Entra ID (formerly Azure AD) and Role-Based Access Control (RBAC), providing a single control plane for our entire application.

5. 🌐 Model Flexibility & Open Integration

While Azure provides a secure and tightly integrated ecosystem, Semantic Kernel's greatest strength is its vendor-agnostic architecture. This allows us to avoid vendor lock-in and strategically use the best model for any given task, ensuring maximum performance and cost-effectiveness.

Connecting to Alternative Providers

Semantic Kernel's extensibility makes it simple to connect to a wide range of model providers beyond Azure. This is primarily achieved through its support for OpenAI-compatible API endpoints and dedicated connectors.

🔀 Using LLM Routers (e.g., OpenRouter)

Services like OpenRouter act as a unified gateway to dozens of models from different providers (Mistral, Anthropic, Google, etc.). They provide a single API key and an OpenAI-compatible endpoint.

  • Simple Integration: We can use Semantic Kernel's standard AddOpenAIChatCompletion connector, simply pointing the endpoint to the router service's URL (e.g., https://openrouter.ai/api/v1) and providing our router-specific API key.
  • Dynamic Model Selection: This allows us to switch the underlying LLM with a single string change (the modelId), enabling A/B testing and performance comparisons with minimal code changes.
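
Conceptually, switching models through a router amounts to changing one string in an otherwise identical OpenAI-style request. The model ids below are illustrative examples, not recommendations:

```python
ROUTER_BASE_URL = "https://openrouter.ai/api/v1"

def build_chat_request(model_id: str, user_message: str) -> dict:
    """Build an OpenAI-compatible chat-completion payload for a router service."""
    return {
        "url": f"{ROUTER_BASE_URL}/chat/completions",
        "body": {
            "model": model_id,  # the only field that changes per provider/model
            "messages": [{"role": "user", "content": user_message}],
        },
    }

# A/B testing two models is a one-string change:
req_a = build_chat_request("mistralai/mistral-small", "Summarise last month's sales.")
req_b = build_chat_request("anthropic/claude-3-haiku", "Summarise last month's sales.")
```

In Semantic Kernel the same effect is achieved by configuring the standard OpenAI connector with the router's endpoint and varying only the `modelId`.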

🔗 Direct API Integration (Mistral, Anthropic, OpenAI)

We can also connect directly to a provider's API. Semantic Kernel often has dedicated connectors for this purpose.

  • OpenAI-Compatible APIs: For providers that adhere to the OpenAI API schema, like Mistral, we can again use the standard OpenAI connector by providing the provider's base URL and API key.
  • Dedicated Connectors: For providers with unique APIs, like Anthropic, we can use a dedicated connector package (e.g., Microsoft.SemanticKernel.Connectors.Anthropic). This ensures full compatibility and access to model-specific features.

Self-Hosting Models on a Cloud VM

For ultimate control over security, cost, and customization, we can host open-source models on our own cloud infrastructure (e.g., an Azure VM with GPU support).

  • Maximum Control & Privacy: The model and data never leave our virtual network, providing the highest level of data sovereignty.
  • Cost Management: For high-throughput scenarios, self-hosting can be more cost-effective than pay-per-token APIs, as we pay for the underlying compute infrastructure.
  • Customization: Enables the use of highly specialized or fine-tuned open-source models tailored to specific Fanbase domains.

🛠️ How it Works

The process involves using an inference server like vLLM or Ollama on a provisioned cloud VM. These tools load the open-source model and expose it via an OpenAI-compatible API endpoint. Our application's Semantic Kernel configuration then simply points to our self-hosted VM's IP address and port, treating it just like any other OpenAI-standard service.

Use Cases for Local Models in a Cloud Platform

  • Cost-Free Development & Testing: The primary use case. Developers can iterate on new features, build complex agentic workflows, and write unit tests without incurring any API costs or dealing with network latency.
  • Offline Tooling: Enables the creation of powerful developer tools or CLI applications that can function without an internet connection.
  • Specialized, Low-Latency Tasks: For small, highly-specialized models (e.g., a model fine-tuned for code formatting or data classification), running them locally can sometimes provide lower latency than a round-trip to a cloud endpoint.

This architectural flexibility ensures we can build a sophisticated, resilient, and cost-efficient AI strategy that evolves with the market, leveraging the best models from any provider—or our own—without being locked into a single ecosystem.

6. 💰 Cost Analysis & Strategic Choices

Understanding the cost structure of Large Language Models is crucial for building a sustainable and scalable AI strategy. This section breaks down the different pricing models and provides clear scenarios to help guide decision-making for us.

What is a Token?

Think of a token as the basic currency for LLMs. A token isn't exactly one word; it's a piece of a word. On average, 100 tokens represent about 75 words.

All interactions with an LLM—both the questions we ask (input) and the answers we receive (output)—are measured in tokens. The total cost of an operation is the sum of the input token cost and the output token cost.
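
The arithmetic is simple enough to capture in a helper. Rates are expressed per million tokens, matching the pricing tables later in this section:

```python
def exchange_cost(input_tokens: int, output_tokens: int,
                  input_rate_per_1m: float, output_rate_per_1m: float) -> float:
    """Total cost of one exchange = input token cost + output token cost."""
    return (input_tokens * input_rate_per_1m
            + output_tokens * output_rate_per_1m) / 1_000_000

def words_to_tokens(words: int) -> int:
    # Rule of thumb from above: 100 tokens represent about 75 words.
    return round(words * 100 / 75)
```

For example, a 750-word question with a 750-word answer is roughly 1,000 tokens each way; at a listed $3.00 / $4.00 per 1M tokens, `exchange_cost(1000, 1000, 3.00, 4.00)` works out to $0.007.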

1. Azure OpenAI: The Main Pricing Models

Pay-As-You-Go (PAYG)

This is the most flexible model. We are charged directly for what we use, measured in tokens (billed per 1,000 or per million tokens, depending on the model). It's ideal for applications with variable traffic or when starting a new project, as there are no upfront commitments.

Provisioned Throughput Units (PTU)

For applications with high and predictable usage, the PTU model offers a better solution. Instead of paying per token, we reserve a fixed amount of processing capacity for a set period (e.g., hourly). This provides guaranteed performance and predictable, stable costs, which can be more economical for consistent, high-volume workloads.

Choosing Between PAYG and PTU

Start with Pay-As-You-Go: Perfect for development, testing, and initial launch when usage patterns are unknown.

Switch to PTU for Scale: Once our application has a steady, high volume of requests, analyze our usage and switch to PTUs to optimize costs and ensure consistent performance for users.

2. Azure AI Content Safety Costs

This is an optional but highly recommended service for protecting our brand. It checks both user input and AI output for harmful content.

  • Free Tier: Includes 5,000 text checks per month at no cost.
  • Paid Tier: After the free tier, it costs approximately $0.75 for every 1,000 text checks. A "text check" covers up to 1,000 characters (roughly a paragraph).

For most applications, this cost is minimal. For example, processing 50,000 user prompts would cost around $37.50 per month (about $33.75 once the 5,000-check free tier is deducted).
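
Folding in the free tier, the estimate above can be reproduced with a small helper (using the approximate rate quoted earlier):

```python
FREE_CHECKS = 5_000       # monthly free tier
RATE_PER_1K = 0.75        # approximate paid-tier rate per 1,000 text checks

def content_safety_monthly_cost(checks: int) -> float:
    """Monthly cost after the free tier is exhausted."""
    billable = max(0, checks - FREE_CHECKS)
    return billable / 1_000 * RATE_PER_1K
```

For example, `content_safety_monthly_cost(50_000)` returns 33.75, slightly below the headline $37.50 because the first 5,000 checks are free. Remember that if both input and output are checked, each prompt counts as two checks.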

3. Third-Party and Open-Source Model Costs

Semantic Kernel's flexibility allows us to use models from other providers. Costs here are determined by the provider itself.

  • Router Services (e.g., OpenRouter): We pay the router's rate for each model, which is often very close to the direct API cost. This offers great flexibility.
  • Direct APIs (e.g., Anthropic, OpenAI): Models like Anthropic's Claude 3 family can be accessed via platforms like Amazon Bedrock. They often have competitive pricing and can sometimes be cheaper than Azure equivalents for specific tasks.
  • Self-Hosting: Hosting an open-source model on a cloud VM means we pay for the underlying infrastructure (the virtual machine, GPU, storage, and data transfer). This can be cost-effective for extremely high-volume, specialized tasks but carries a higher operational overhead.

Cloud VM Hosting & Self-Hosting Options

If we require maximum control, privacy, or cost efficiency at scale, hosting LLMs on our own cloud VMs is an option. This approach allows us to run open-source or commercial models on infrastructure we manage (such as AWS, Azure, GCP, or OCI), or to use bundled private LLM solutions that we operate ourselves.

  • Amazon Bedrock: Fully managed LLM API (Anthropic Claude, Jurassic-2, Mistral). Pricing varies by model; often competitive with or lower than OpenAI, with more model choices.
  • Google Vertex AI (Gemini models): Hosted Gemini models via API. Gemini Pro: input ≈ $0.0035 / output ≈ $0.0105 per 1K tokens (very cost-efficient).
  • Deep Infra / OpenRouter / Mistral API: Hosted open-weight LLMs. OpenRouter from ≈ $0.10 per million tokens total; Microsoft Phi models at ~$0.07 input / $0.14 output per million tokens.
  • AnythingLLM (self-hosted bundle): Private LLM with vector DB, hosted instance. $50/month basic; $99/month Pro; custom enterprise pricing.
  • Self-hosting on cloud VMs (AWS/GCP/OCI, per-GPU pricing): We manage model serving (e.g., Llama on VMs). Varies by provider and GPU; the cost is the hourly VM rate only (e.g., $3.91/hr for an AWS g4dn.16xlarge).

Disclaimer: Pricing is based on public sources as of mid-2025 and may change. We should always check the provider's official pricing page for the most current rates.

🧮 Cost Comparison Summary:
  • Amazon Bedrock: We pay per model; more variety may reduce costs compared to OpenAI.
  • Google Gemini: Very cost-efficient—input $0.0035, output $0.0105 per 1k tokens.
  • Open/FOSS model APIs (Deep Infra/OpenRouter): Around $0.10–$0.20 per million tokens.
  • Hosted private stack (AnythingLLM): Fixed monthly fee ($50–$99+).
  • Self-hosting VMs: Hourly VM + GPU costs only—may offer the lowest marginal cost but requires operational overhead on our part.

4. Summary & Example Scenarios

The following table compares different models based on a "2,000-token exchange" (e.g., a 750-word input and a 750-word output). This provides a standardized way to compare real-world costs.

  • GPT-4o mini (Azure): $0.15 / $0.60 per 1M tokens (input/output); ≈ $0.00075 per 2K exchange. Fast and ultra-cheap; good for simple chats, summarization, and classification.
  • GPT-3.5 Turbo (Azure): $3.00 / $4.00 per 1M tokens; ≈ $0.007 per 2K exchange. Reliable and low-cost for general-purpose text and workflow automation.
  • GPT-4o (Azure): $5.00 / $15.00 per 1M tokens; ≈ $0.02 per 2K exchange. High-performance vision and reasoning for complex, user-facing tasks.
  • GPT-4 Turbo (Azure): $10.00 / $30.00 per 1M tokens; ≈ $0.04 per 2K exchange. Top-tier model for the most demanding creative and analytical tasks.

Example Monthly Cost Scenarios

Let's assume we have a feature that makes 50,000 calls per month, with each call being a 2,000-token exchange.

  • Using GPT-4o mini: 50,000 calls × $0.00075 ≈ $37.50 / month
  • Using GPT-3.5 Turbo: 50,000 calls × $0.007 ≈ $350 / month
  • Using GPT-4o: 50,000 calls × $0.02 ≈ $1,000 / month

Note: Prices are based on publicly available data as of mid-2025 and are for estimation purposes. Actual costs may vary based on region and specific agreements.

🔧 Strategic Recommendations

  • Tier Our Models: Use the most cost-effective model that can accomplish the task. Use GPT-4o mini for simple tasks and reserve more powerful models like GPT-4o for complex reasoning.
  • Monitor Everything: Use the observability tools in Semantic Kernel and Azure to track token consumption in real-time. This helps identify costly operations and opportunities for optimization.
  • Start with PAYG, Scale with PTUs: Begin with the flexible pay-as-you-go model and collect data. If usage becomes high and predictable, perform a cost analysis to see if switching to provisioned throughput will save money.
  • Leverage the Ecosystem: Don't hesitate to connect to third-party models if they offer better performance or pricing for a specific need. Semantic Kernel makes this easy.

7. 🗺️ Phased Development Plan & Next Steps

Our development strategy is designed for a low-risk, high-value rollout. We will introduce AI capabilities in carefully managed phases, building directly on top of our existing platform. This iterative approach ensures stability, allows us to gather feedback, and provides immediate value to both our internal teams and end-users without disruption.

A New Layer, Not a New Platform

🧩 A key advantage of this approach is that the AI acts as an intelligent interface over our current systems. We are exposing our existing, robust codebase—handling reservations, orders, club details, and reporting—as secure "plugins" for the AI. Clubs don't need to learn a new platform or migrate data. The AI simply provides a new way to access live data and trigger familiar workflows.

How We Create AI Plugins from Our Existing API

The core of our technical approach involves "wrapping" our existing, secure API endpoints into plugins that Semantic Kernel can understand and use. This process is deliberate and requires careful planning.

  1. 📄 OpenAPI Specification: The foundation of a plugin is a well-defined OpenAPI (formerly Swagger) specification file. We will generate or author these files for each API endpoint we want to expose. The descriptions within this file are critical—they are what the LLM reads to understand what a function does, what parameters it needs, and when it should be used.
  2. ⚡ Automatic Plugin Creation: Semantic Kernel can ingest an OpenAPI specification and automatically generate a native plugin from it. This allows us to rapidly convert our documented API endpoints into callable functions for the AI.
  3. 🛡️ Planning & Curation: We will not expose our entire API at once. A significant planning effort is required to decide which endpoints offer the most value and can be exposed safely. We will start with simple, read-only endpoints (e.g., getClubDetails, listRecentOrders) before moving to more complex or write-based operations. Each endpoint will be reviewed for its suitability as a discrete "tool" for the AI.
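
To make step 1 concrete, here is a fragment of what a hypothetical spec for the getClubDetails endpoint might look like, expressed as a Python dict, together with a helper showing why the description fields matter: collected together, they are effectively the tool manifest the LLM reads when deciding which function to call.

```python
# Hypothetical OpenAPI fragment for one read-only endpoint (illustrative only).
CLUB_API_SPEC = {
    "paths": {
        "/clubs/{clubId}": {
            "get": {
                "operationId": "getClubDetails",
                "description": "Returns name, venue and contact details for a club.",
                "parameters": [{
                    "name": "clubId",
                    "in": "path",
                    "required": True,
                    "description": "Unique identifier of the club.",
                }],
            }
        }
    }
}

def tool_manifest(spec: dict) -> dict[str, str]:
    """Collect operationId -> description: the text the LLM uses to pick a tool."""
    manifest = {}
    for path_item in spec["paths"].values():
        for operation in path_item.values():
            manifest[operation["operationId"]] = operation["description"]
    return manifest
```

A vague description here degrades the AI's tool selection directly, which is why authoring these files deserves the careful planning described above.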

Proposed Phased Rollout & Future Considerations

Each phase progressively expands the AI's capabilities. Estimated effort is cumulative: each phase's total includes the work of all previous phases. For example, "6 weeks" for Phase 2 means 2 weeks for Phase 1 plus 4 more weeks for Phase 2.

🚦 Phase 1: Internal Querying Tool

  • Goal: Allow our internal teams to query the Fanbase API using natural language, augmenting or replacing the existing account growth report. This provides immediate internal efficiency gains.
  • Estimated Effort: ~2 weeks (from project start)

💬 Phase 2: Interactive Club Q&A

  • Goal: Enable clubs to ask interactive questions and get consolidated answers that would normally require navigating multiple pages, tables, and charts.
  • Estimated Effort: 4-6 additional weeks (total: 6-8 weeks from project start)
  • Future Concerns & Planning:
    • Data Accuracy: The primary risk is the LLM "hallucinating" or misinterpreting data. We must ensure the AI answers are strictly grounded in the data returned by our API.
    • State Management: Conversations will become multi-turn. We need a robust strategy for managing conversation history and context so the AI can understand follow-up questions accurately.

✍️ Phase 3: Controlled Data Writing

  • Goal: Grant clubs the ability to perform controlled "write" operations, such as saving a new audience after querying fan data.
  • Estimated Effort: 4-8 additional weeks (total: 10-16 weeks from project start)
  • Future Concerns & Planning:
    • Confirmation & Authorization: This is a major increase in risk. We will implement a mandatory confirmation step for all data modification actions. The AI must always ask, "You are about to save this audience. Is that correct?" before executing the action.
    • Granular Permissions: We must ensure the AI respects the user's existing permissions. The plugin layer will need to integrate deeply with our Role-Based Access Control (RBAC) system.
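
The mandatory confirmation step can be expressed as a thin gate around any write operation. This is a sketch under stated assumptions: `save_audience` is a hypothetical stand-in for a real write endpoint, and a full implementation would surface the question through the chat flow and consult RBAC before executing.

```python
from typing import Callable

class ConfirmationRequired(Exception):
    """Raised when a write action is attempted without explicit confirmation."""

def confirm_before_write(action: Callable[..., str], description: str,
                         confirmed: bool, **kwargs) -> str:
    # The AI must surface `description` to the user and obtain an explicit yes
    # before calling again with confirmed=True.
    if not confirmed:
        raise ConfirmationRequired(f"You are about to {description}. Is that correct?")
    return action(**kwargs)

def save_audience(name: str) -> str:
    # Hypothetical stand-in for the real write endpoint.
    return f"Audience '{name}' saved."
```

The first call without confirmation raises the question back to the user; only a second, confirmed call executes the write.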

🤖 Phase 4: Full AI-Powered Account Control

  • Goal: Empower clubs to manage their Fanbase account directly through complex AI commands like, "Create a fixture for Morton on 15th July and apply the normal ticket template."
  • Estimated Effort: 8-12+ additional weeks (total: 18-28+ weeks from project start)
  • Future Concerns & Planning:
    • Complex Plan Execution: Such commands require multiple, sequential API calls. The AI needs to create a "plan" and execute it reliably.
    • Error Handling & Rollbacks: We must build logic to handle situations where one step in a multi-step plan fails. The system must be able to report the failure clearly to the user or even attempt to roll back previous steps to prevent an inconsistent state.

Project Timeline Overview

Each segment represents a project phase; weeks are counted from project start (1 "season" ≈ 2 weeks).

  • 🚦 Phase 1: Weeks 1–2 (1 season)
  • 💬 Phase 2: Weeks 3–6 (2 seasons)
  • ✍️ Phase 3: Weeks 7–12 (3 seasons)
  • 🤖 Phase 4: Weeks 13–22 (5 seasons)

Immediate Next Steps

To move forward efficiently and validate our approach, we have defined a clear set of immediate actions:

  1. 🧪 Build a Proof of Concept (POC): We will immediately build a simple POC by wrapping two or three read-only API endpoints (e.g., 'get club details', 'list recent fixtures') into plugins. The primary goal is to validate the end-to-end flow from a natural language query to an authenticated API call and back, allowing us to accurately gauge real-world costs and technical complexity. (Estimated time: 3-4 days)
  2. 📝 Define the "Fanbase AI Agent" Feature: We will scope out the high-level features by defining user stories (e.g., "As a club manager, I want to ask 'how many tickets did we sell last month?' and receive a clear number"). This process involves not just defining what the AI can do, but also defining the guardrails and limitations for each feature to ensure safety and reliability.
  3. 💼 Strategic Product Review: Finally, we will conduct a product discussion to address critical business questions: How does this feature create a competitive advantage? What are the potential support and maintenance costs? How does it align with our broader quarterly and annual goals? This ensures that our technical efforts are perfectly aligned with business strategy.

Deciding & Setting Up Azure AI (or Alternatives)

Before integrating AI into our platform, we must carefully evaluate and set up the right model provider. Here's how we approach this decision and the setup process:

  1. 🔍 Decision Criteria:
    • Cost: Compare token pricing, throughput, and projected usage for Azure AI, OpenAI, Anthropic, and self-hosted options.
    • Compliance & Data Residency: Ensure the provider meets our data privacy, residency, and compliance requirements.
    • Features & Model Support: Evaluate available models (e.g., GPT-4, Claude, open-source), safety features, and integration capabilities.
    • Integration & Ecosystem: Consider ease of integration with our stack, SDK support, and compatibility with Semantic Kernel.
  2. ⚙️ Azure AI Setup Steps:
    • 1. Provision Azure OpenAI Resource: In the Azure Portal, create an Azure OpenAI resource in the desired region.
    • 2. Deploy a Model: Select and deploy a model (e.g., GPT-4, GPT-3.5) to your resource.
    • 3. Generate API Keys: Obtain the endpoint URL and API key for secure access.
    • 4. Configure Security: Set up network restrictions, private endpoints, and RBAC as needed for compliance and safety.
    • 5. Integrate with Semantic Kernel: Use the API endpoint and key in our Semantic Kernel configuration to enable AI-powered features.
  3. 🌐 Considering Alternatives:
    • OpenAI (Direct): Simple setup, global endpoints, but data residency may not meet all compliance needs.
    • Anthropic, Google, etc.: Evaluate for specific features, pricing, or model strengths.
    • Self-Hosting: For ultimate control, deploy open-source models (e.g., Llama, Mistral) on our own infrastructure, but requires more DevOps and security work.

Tip: We recommend starting with Azure AI for its enterprise security, compliance, and seamless Semantic Kernel integration; our architecture lets us switch or add providers as business needs evolve.