Monitoring & Observability — Operating AI Agents in Production

1. Why Do Production AI Agents Need Dedicated Observability?

In the previous article, we built a Guardrails & Evaluation system to ensure the AI Agent behaves safely. But once thousands of real users interact with the agent every day, an entirely new question emerges:

“How do I know the agent is working correctly, running stably, staying within budget, and creating value right now — in production, 24/7?”

Traditional monitoring (CPU, RAM, requests/s) is not enough for an AI Agent. The agent can look completely “green” on a conventional DevOps dashboard while in reality it is:

  • Answering incorrectly (hallucination rate silently climbing)
  • Consuming 3× the usual tokens because of a prompt loop
  • Burning an extra $800/day because of one wrong model configuration
  • Stuck in a reasoning loop for 45 seconds with no timeout

This is why LLMOps — a dedicated branch of MLOps — emerged.


2. LLMOps vs. Traditional DevOps — 10 Core Differences

#  | Dimension       | Traditional DevOps                       | LLMOps for AI Agents
---|-----------------|------------------------------------------|------------------------------------------------------
1  | Determinism     | Deterministic: same input → same output  | Non-deterministic: same prompt → different outputs
2  | Cost unit       | CPU hours, bandwidth GB                  | Tokens (input + output) + API call cost
3  | Quality metrics | Latency, error rate, uptime              | Hallucination rate, groundedness, relevance score
4  | Versioning      | Code + config versioning                 | Code + config + prompt versioning + model versioning
5  | Drift           | Performance drift from hardware changes  | Model drift: the provider updates the model silently
6  | Debugging       | Clear stack traces                       | Complex multi-hop reasoning traces, hard to reproduce
7  | Testing         | Unit tests, integration tests            | Evaluation datasets, LLM-as-a-Judge, A/B testing
8  | Rollback        | Roll back code/config                    | Roll back prompt version + model version + memory state
9  | Scaling         | Straightforward horizontal scaling       | Must balance token throughput, context window, cost
10 | Compliance      | Access logs, audit trail                 | Log every LLM interaction for compliance + audit

2.1. Non-Determinism — The Biggest Challenge

DevOps:  f(x) = y                    → always the same; one test run suffices
LLMOps:  f(x) = y₁ | y₂ | y₃ | ...   → tests must sample; evaluation must be statistical

This means you cannot just monitor whether errors occur — you must monitor whether the output is correct, continuously and probabilistically.
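
A single assertion-style test therefore proves little. Here is a minimal sketch of sampling-based evaluation; the call_agent and evaluate callables are placeholders, not part of this article:

import asyncio

async def sampled_pass_rate(call_agent, evaluate, prompt: str, n: int = 20) -> float:
    """Run the same prompt n times and return the fraction of outputs that
    pass the evaluator — a probabilistic signal instead of a pass/fail bit.

    call_agent: async fn(prompt) -> str   (your agent/LLM call)
    evaluate:   fn(output) -> bool        (judge, groundedness check, regex, ...)
    """
    outputs = await asyncio.gather(*(call_agent(prompt) for _ in range(n)))
    return sum(evaluate(o) for o in outputs) / n

# Alert when the pass rate drifts below a threshold (e.g. 0.90)
# rather than asserting one deterministic output.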

2.2. Token Economy — The Invisible Cost

Scenario                                       | Tokens consumed | Estimated cost
-----------------------------------------------|-----------------|---------------
1 simple FAQ question                          | ~500 tokens     | ~$0.001
1 complex consultation session (RAG + history) | ~8,000 tokens   | ~$0.016
1 five-step agentic workflow                   | ~25,000 tokens  | ~$0.050
10,000 users/day × agentic workflow            | 250M tokens     | ~$500/day

Bottom line: a small bug in a prompt (e.g., an infinite retry loop) can burn through $2,000+ before anyone notices, if cost monitoring is not in place.
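
The table's arithmetic as a quick helper (the blended rate of $0.002 per 1K tokens is the one implied by the rows above; real pricing varies by model and changes over time):

def estimate_daily_cost(tokens_per_request: int, requests_per_day: int,
                        usd_per_1k_tokens: float = 0.002) -> float:
    """Daily cost = tokens/request × $/1K tokens × requests/day."""
    return tokens_per_request / 1000 * usd_per_1k_tokens * requests_per_day

# 10,000 users/day × a 25K-token agentic workflow:
print(estimate_daily_cost(25_000, 10_000))  # 500.0 → ≈ $500/day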


3. Overall Observability Architecture for AI Agents

┌─────────────────────────────────────────────────────────────────────────────┐
│              LLMOPS OBSERVABILITY ARCHITECTURE — AI AGENT CLUSTER           │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────────────────────────┐
  │                          AI AGENT CLUSTER                                │
  │                                                                          │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │
  │  │Orchestrator │  │  RAG Agent  │  │  Tool Agent │  │ Memory Agent  │  │
  │  │   Agent     │  │             │  │             │  │               │  │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └───────┬───────┘  │
  │         │                │                │                  │          │
  │         └────────────────┴────────────────┴──────────────────┘          │
  │                                    │                                     │
  │                    OTel SDK (Python / .NET / Java)                       │
  │                    - Traces (spans + context propagation)                │
  │                    - Metrics (counters, histograms, gauges)              │
  │                    - Logs (structured JSON + trace_id correlation)       │
  └─────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                       OPENTELEMETRY COLLECTOR                            │
  │                                                                          │
  │   Receivers: OTLP gRPC/HTTP, Prometheus scrape, Fluentd                  │
  │   Processors: Batch, Memory Limiter, Attribute Filter, Sampling          │
  │   Exporters: → Prometheus │ → Jaeger/Tempo │ → Elasticsearch            │
  └──────────────────┬──────────────────────────────────────────────────────┘
                     │
         ┌───────────┼──────────────┐
         ▼           ▼              ▼
  ┌────────────┐ ┌─────────────┐ ┌──────────────────┐
  │ PROMETHEUS │ │   JAEGER    │ │  ELASTICSEARCH   │
  │            │ │   / TEMPO   │ │  / OPENSEARCH    │
  │ Metrics    │ │             │ │                  │
  │ Storage    │ │ Distributed │ │ Log Storage      │
  │ & Query    │ │ Traces      │ │ Full-text Search │
  └─────┬──────┘ └──────┬──────┘ └────────┬─────────┘
        │               │                 │
        └───────────────┴─────────────────┘
                        │
                        ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                        GRAFANA DASHBOARD                                 │
  │                                                                          │
  │  [Overview] [Token Economy] [Quality] [Agent Health] [Business KPI]     │
  └──────────────────────────────────┬──────────────────────────────────────┘
                                     │
                                     ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                         ALERTMANAGER                                     │
  │                                                                          │
  │   Rules: Cost Spike | Latency P95 | Error Rate | Hallucination Rate     │
  │   Routing: → Slack | PagerDuty | Email | Webhook                        │
  └─────────────────────────────────────────────────────────────────────────┘

3.1. Multi-Agent Distributed Tracing Flow

  USER REQUEST (request_id: req-abc123)
       │
       ▼ [Trace Start — Span: "user_request"]
  ┌────────────────────────────────────┐
  │     API GATEWAY / LB               │
  │     Inject: traceparent header     │
  └──────────────────┬─────────────────┘
                     │
                     ▼ [Span: "orchestrator.process"]
  ┌────────────────────────────────────┐
  │     ORCHESTRATOR AGENT             │  t=0ms
  │     - Parse intent                 │
  │     - Plan sub-tasks               │
  └──┬──────────────┬──────────────┬───┘
     │              │              │
     ▼              ▼              ▼
  [Span:         [Span:         [Span:
  "rag.retrieve"] "tool.call"]  "memory.fetch"]
  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │RAG Agent │  │Tool Agent│  │Memory    │
  │t=5ms     │  │t=5ms     │  │Agent     │
  │          │  │          │  │t=5ms     │
  │  ┌─────┐ │  │  ┌─────┐ │  │  ┌─────┐│
  │  │Embed│ │  │  │API  │ │  │  │Redis││
  │  │Query│ │  │  │Call │ │  │  │Fetch││
  │  └──┬──┘ │  │  └──┬──┘ │  │  └──┬──┘│
  │     │    │  │     │    │  │     │   │
  │  ┌──▼──┐ │  │  ┌──▼──┐ │  │     │   │
  │  │Vecto│ │  │  │Tool │ │  │     │   │
  │  │rDB  │ │  │  │Resp │ │  │     │   │
  │  └─────┘ │  │  └─────┘ │  │     │   │
  └────┬─────┘  └────┬──────┘  └─────┬───┘
       │             │               │
       └──────────────┴───────────────┘
                      │
                      ▼ [Span: "llm.generate"] t=120ms
             ┌─────────────────┐
             │   LLM CALL      │
             │   GPT-4o / etc  │
             │   tokens: 2,340 │
             │   latency: 1.8s │
             └────────┬────────┘
                      │
                      ▼ [Span: "output.guard"] t=1920ms
             ┌─────────────────┐
             │ Output Guard    │
             │ Guardrails check│
             └────────┬────────┘
                      │
                      ▼ [Trace End] t=2050ms
             FINAL RESPONSE → User
             Total: 2,050ms | tokens: 2,340 | cost: $0.0047

4. The Four Pillars of LLM Observability

4.1. Pillar 1 — Metrics

Description: numeric, time-series, aggregatable data — used for trending and alerting.

Suitable tools: Prometheus, Grafana, Datadog, New Relic

Sample data:

# HELP llm_request_duration_seconds LLM request latency
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="0.5"} 42
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="1.0"} 180
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="2.0"} 312
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="5.0"} 398
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="+Inf"} 402

# HELP llm_tokens_total Total tokens consumed
# TYPE llm_tokens_total counter
llm_tokens_total{agent="rag_agent",type="input",model="gpt-4o"} 1284930
llm_tokens_total{agent="rag_agent",type="output",model="gpt-4o"} 423810

# HELP llm_cost_usd_total Total cost in USD
# TYPE llm_cost_usd_total counter
llm_cost_usd_total{agent="rag_agent",model="gpt-4o"} 24.87

4.2. Pillar 2 — Logs

Description: structured event records — used for debugging, auditing, and root-cause analysis.

Suitable tools: Elasticsearch, OpenSearch, Loki, Splunk

Sample data (JSON structured log):

{
  "timestamp": "2026-05-14T10:23:45.123Z",
  "level": "INFO",
  "request_id": "req-abc123",
  "session_id": "sess-xyz789",
  "agent_id": "rag_agent",
  "model": "gpt-4o",
  "prompt_tokens": 1840,
  "completion_tokens": 420,
  "total_tokens": 2260,
  "latency_ms": 2050,
  "cost_usd": 0.0045,
  "guardrail_status": "passed",
  "tool_calls": ["search_knowledge_base", "get_product_info"],
  "hallucination_score": 0.12,
  "user_satisfaction": null,
  "error": null
}

4.3. Pillar 3 — Traces

Description: distributed tracing — the timeline of a request as it crosses multiple services/agents.

Suitable tools: Jaeger, Grafana Tempo, Zipkin, AWS X-Ray

Sample span data:

{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "b9c7c989f97918e1",
  "operationName": "llm.generate",
  "serviceName": "rag-agent",
  "startTime": 1715677425120,
  "duration": 1823000,
  "tags": {
    "llm.model": "gpt-4o",
    "llm.input_tokens": 1840,
    "llm.output_tokens": 420,
    "llm.cost_usd": 0.0045,
    "agent.id": "rag_agent",
    "guardrail.status": "passed"
  }
}

4.4. Pillar 4 — Profiles

Description: CPU/memory profiling of the inference engine and Python code — used to find bottlenecks.

Suitable tools: Pyroscope, Grafana Phlare, py-spy, cProfile

Sample — spotting a real bottleneck:

Function                          │ CPU % │ Calls │ Avg ms
──────────────────────────────────┼───────┼───────┼───────
embed_documents()                 │ 34.2% │ 2,840 │ 12.1ms
vector_db.similarity_search()     │ 21.8% │ 2,840 │ 7.7ms
openai.chat.completions.create()  │ 18.6% │  890  │ 1,820ms
json.loads() [response parsing]   │  8.3% │ 2,840 │ 2.9ms
redis.get() [session cache]       │  5.1% │ 8,900 │ 0.57ms
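
A quick way to produce a table like this in-process is the standard library's cProfile. A sketch, where handle_request stands in for your agent's entrypoint:

import cProfile
import pstats

def profile_agent_turn(handle_request, payload) -> None:
    """Profile one synchronous agent turn and print the top time consumers."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request(payload)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10

# For a live process without code changes, py-spy can attach externally:
#   py-spy top --pid <agent_pid>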

5. Key Metrics to Track — 5 Groups

5.1. Group 1 — Latency Metrics

Metric            | Description                           | Target (Production) | Alert Threshold
------------------|---------------------------------------|---------------------|----------------
TTFT p50          | Time To First Token, median           | < 500ms             | > 1s
TTFT p95          | Time To First Token, 95th percentile  | < 1.5s              | > 3s
TTFT p99          | Time To First Token, 99th percentile  | < 3s                | > 5s
Total Latency p95 | End-to-end response time              | < 3s                | > 5s
Queue Wait Time   | Time spent waiting in the queue       | < 100ms             | > 500ms
Tool Call Latency | Latency of external API calls         | < 500ms/call        | > 2s

5.2. Group 2 — Token & Cost Metrics

Metric                 | Description                    | Target        | Alert Threshold
-----------------------|--------------------------------|---------------|-------------------
Input tokens/request   | Avg input tokens per request   | < 2,000       | > 5,000
Output tokens/request  | Avg output tokens per request  | < 500         | > 2,000
Cost/session USD       | Average cost per session       | < $0.05       | > $0.20
Daily cost USD         | Total cost per day             | Baseline ±20% | > 150% of baseline
Monthly cost trend     | Month-over-month cost trend    | Growth < 30%  | > 50% MoM
Token efficiency ratio | Output tokens / input tokens   | > 0.3         | < 0.1

5.3. Group 3 — Quality Metrics

Metric               | Description                               | Target  | Alert Threshold
---------------------|-------------------------------------------|---------|----------------
Hallucination rate   | % of responses with incorrect content     | < 3%    | > 10%
Guardrail block rate | % of requests blocked by guardrails       | 0.5-2%  | > 20% (surge)
Groundedness score   | How well RAG answers are grounded in context | > 0.85 | < 0.70
User satisfaction    | CSAT score / thumbs-up %                  | > 80%   | < 60%
Task completion rate | % of tasks completed successfully         | > 90%   | < 75%
Escalation rate      | % of sessions escalated to a human        | < 5%    | > 15%

5.4. Group 4 — Reliability Metrics

Metric                | Description                     | Target  | Alert Threshold
----------------------|---------------------------------|---------|----------------
Error rate            | % of requests returning errors  | < 1%    | > 5%
Timeout rate          | % of requests timing out        | < 0.5%  | > 2%
Retry rate            | % of requests needing a retry   | < 2%    | > 10%
Circuit breaker state | Circuit breaker status          | CLOSED  | OPEN > 5min
Memory overflow rate  | % of context window overflows   | < 1%    | > 5%
Tool failure rate     | % of failed tool calls          | < 2%    | > 10%

5.5. Group 5 — Business Metrics

Metric                  | Description                     | Target            | Alert Threshold
------------------------|---------------------------------|-------------------|------------------
Active sessions         | Number of live sessions         | Capacity planning | > 80% capacity
Daily active users      | Unique users per day            | Growth target     | Sudden drop > 30%
Task completion rate    | % of tasks completed            | > 90%             | < 75%
Avg conversation length | Average turns per session       | 3-8 turns         | > 15 turns
ROI per agent           | Value created / operating cost  | > 3x              | < 1x
Cost per resolved query | Cost to resolve a single query  | < $0.10           | > $0.50

5.6. Python — Custom Prometheus Metrics + OpenTelemetry Instrumentation

import time
import logging
from typing import Optional, Any
from functools import wraps
from prometheus_client import Counter, Histogram, Gauge, Summary
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

logger = logging.getLogger(__name__)

# ─── Prometheus Metrics ────────────────────────────────────────────────────────

# Latency histogram with buckets spanning the p50/p95/p99 ranges
LLM_REQUEST_DURATION = Histogram(
    "llm_request_duration_seconds",
    "LLM request latency in seconds",
    ["agent_id", "model", "operation"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

TTFT_DURATION = Histogram(
    "llm_ttft_seconds",
    "Time To First Token in seconds",
    ["agent_id", "model"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)

# Token counters
LLM_TOKENS_TOTAL = Counter(
    "llm_tokens_total",
    "Total tokens consumed",
    ["agent_id", "model", "token_type"],  # token_type: input | output
)

# Cost tracking
LLM_COST_USD = Counter(
    "llm_cost_usd_total",
    "Total LLM API cost in USD",
    ["agent_id", "model", "tenant_id"],
)

# Quality metrics
LLM_HALLUCINATION_SCORE = Histogram(
    "llm_hallucination_score",
    "Hallucination probability score (0.0-1.0)",
    ["agent_id"],
    buckets=[0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0],
)

GUARDRAIL_DECISIONS = Counter(
    "llm_guardrail_decisions_total",
    "Guardrail decisions",
    ["agent_id", "decision", "reason"],  # decision: allow|block|escalate
)

# Reliability
LLM_ERRORS_TOTAL = Counter(
    "llm_errors_total",
    "Total LLM errors",
    ["agent_id", "model", "error_type"],
)

# Active sessions gauge
ACTIVE_SESSIONS = Gauge(
    "llm_active_sessions",
    "Number of currently active sessions",
    ["agent_id"],
)

# ─── OpenTelemetry Setup ───────────────────────────────────────────────────────

def setup_otel(service_name: str, otel_endpoint: str = "http://otel-collector:4317"):
    """Configure OpenTelemetry Tracing + Metrics với OTLP exporter."""
    # Tracing
    tracer_provider = TracerProvider()
    otlp_span_exporter = OTLPSpanExporter(endpoint=otel_endpoint, insecure=True)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics
    otlp_metric_exporter = OTLPMetricExporter(endpoint=otel_endpoint, insecure=True)
    metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=15000)
    meter_provider = MeterProvider(metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name), metrics.get_meter(service_name)

# ─── Instrumented LLM Call Wrapper ────────────────────────────────────────────

COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}

class InstrumentedLLMClient:
    def __init__(self, agent_id: str, tracer, model: str = "gpt-4o"):
        self.agent_id = agent_id
        self.model = model
        self.tracer = tracer

    def calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        rates = COST_PER_1K_TOKENS.get(self.model, {"input": 0.005, "output": 0.015})
        return (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])

    async def chat_completion(
        self,
        messages: list[dict],
        tenant_id: str = "default",
        session_id: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        """LLM call với đầy đủ instrumentation: traces, metrics, cost tracking."""
        start_time = time.perf_counter()

        with self.tracer.start_as_current_span("llm.generate") as span:
            span.set_attribute("llm.model", self.model)
            span.set_attribute("llm.agent_id", self.agent_id)
            span.set_attribute("llm.session_id", session_id or "")
            span.set_attribute("llm.input_messages", len(messages))

            ACTIVE_SESSIONS.labels(agent_id=self.agent_id).inc()

            try:
                # Actual LLM call (swap in your real OpenAI client configuration)
                from openai import AsyncOpenAI
                client = AsyncOpenAI()
                response = await client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    **kwargs,
                )

                latency = time.perf_counter() - start_time
                usage = response.usage
                input_tokens = usage.prompt_tokens
                output_tokens = usage.completion_tokens
                cost = self.calculate_cost(input_tokens, output_tokens)

                # Prometheus metrics
                LLM_REQUEST_DURATION.labels(
                    agent_id=self.agent_id, model=self.model, operation="chat"
                ).observe(latency)

                LLM_TOKENS_TOTAL.labels(
                    agent_id=self.agent_id, model=self.model, token_type="input"
                ).inc(input_tokens)

                LLM_TOKENS_TOTAL.labels(
                    agent_id=self.agent_id, model=self.model, token_type="output"
                ).inc(output_tokens)

                LLM_COST_USD.labels(
                    agent_id=self.agent_id, model=self.model, tenant_id=tenant_id
                ).inc(cost)

                # OTel span attributes
                span.set_attribute("llm.input_tokens", input_tokens)
                span.set_attribute("llm.output_tokens", output_tokens)
                span.set_attribute("llm.cost_usd", cost)
                span.set_attribute("llm.latency_ms", int(latency * 1000))

                logger.info(
                    "llm_call_completed",
                    extra={
                        "agent_id": self.agent_id,
                        "model": self.model,
                        "input_tokens": input_tokens,
                        "output_tokens": output_tokens,
                        "latency_ms": int(latency * 1000),
                        "cost_usd": round(cost, 6),
                        "session_id": session_id,
                    },
                )

                return {"response": response, "cost_usd": cost, "latency_ms": int(latency * 1000)}

            except Exception as e:
                LLM_ERRORS_TOTAL.labels(
                    agent_id=self.agent_id, model=self.model, error_type=type(e).__name__
                ).inc()
                span.record_exception(e)
                span.set_status(trace.StatusCode.ERROR, str(e))
                logger.error("llm_call_failed", extra={"error": str(e), "agent_id": self.agent_id})
                raise
            finally:
                ACTIVE_SESSIONS.labels(agent_id=self.agent_id).dec()

6. Distributed Tracing for Multi-Agent Workflows

6.1. Core Concepts

Concept             | Description                                   | Example in an AI Agent
--------------------|-----------------------------------------------|--------------------------------------------------
Trace               | The full lifecycle of one request             | From the user's message to the final response
Span                | One unit of work within a trace               | "llm.generate", "rag.retrieve", "tool.call"
Parent Span         | A span containing child spans                 | The orchestrator span wraps all sub-agent spans
Context Propagation | Passing trace context across service bounds  | traceparent header over HTTP/gRPC
Correlation ID      | A unique ID linking logs + traces + metrics  | request_id = trace_id

6.2. Python — OpenTelemetry + LangChain Callback Handler

import uuid
import time
import logging
from typing import Any, Optional, Union
from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult, AgentAction, AgentFinish
from opentelemetry import trace, context, baggage
from opentelemetry.propagate import inject, extract
import structlog

logger = structlog.get_logger()
tracer = trace.get_tracer("langchain-agent")

class LangChainOTelCallbackHandler(BaseCallbackHandler):
    """
    LangChain callback handler integrated with OpenTelemetry tracing.
    Automatically creates spans for every LLM call, tool call, and chain run.
    """

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self._span_stack: dict[str, Any] = {}
        self._run_metadata: dict[str, dict] = {}

    def on_llm_start(self, serialized: dict, prompts: list[str], **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", uuid.uuid4()))
        model = serialized.get("kwargs", {}).get("model_name", "unknown")

        span = tracer.start_span(
            "llm.generate",
            attributes={
                "llm.model": model,
                "llm.agent_id": self.agent_id,
                "llm.prompt_count": len(prompts),
                "llm.run_id": run_id,
            },
        )
        # set_span_in_context returns a Context (use_span returns a context
        # manager, which cannot be attached directly)
        ctx = trace.set_span_in_context(span)
        token = context.attach(ctx)

        self._span_stack[run_id] = {"span": span, "token": token, "start_time": time.perf_counter()}
        self._run_metadata[run_id] = {"model": model, "prompts": prompts}

        logger.info("llm_start", agent_id=self.agent_id, model=model, run_id=run_id)

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", ""))
        if run_id not in self._span_stack:
            return

        frame = self._span_stack.pop(run_id)
        span = frame["span"]
        latency_ms = int((time.perf_counter() - frame["start_time"]) * 1000)

        # Extract token usage from the LLMResult
        total_tokens = 0
        input_tokens = 0
        output_tokens = 0

        if response.llm_output:
            token_usage = response.llm_output.get("token_usage", {})
            input_tokens = token_usage.get("prompt_tokens", 0)
            output_tokens = token_usage.get("completion_tokens", 0)
            total_tokens = token_usage.get("total_tokens", 0)

        span.set_attribute("llm.input_tokens", input_tokens)
        span.set_attribute("llm.output_tokens", output_tokens)
        span.set_attribute("llm.total_tokens", total_tokens)
        span.set_attribute("llm.latency_ms", latency_ms)
        span.end()

        context.detach(frame["token"])

        logger.info(
            "llm_end",
            agent_id=self.agent_id,
            run_id=run_id,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=latency_ms,
        )

    def on_llm_error(self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", ""))
        if run_id not in self._span_stack:
            return

        frame = self._span_stack.pop(run_id)
        span = frame["span"]
        span.record_exception(error)
        span.set_status(trace.StatusCode.ERROR, str(error))
        span.end()
        context.detach(frame["token"])

        logger.error("llm_error", agent_id=self.agent_id, error=str(error), run_id=run_id)

    def on_tool_start(self, serialized: dict, input_str: str, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", uuid.uuid4()))
        tool_name = serialized.get("name", "unknown_tool")

        span = tracer.start_span(
            f"tool.{tool_name}",
            attributes={
                "tool.name": tool_name,
                "tool.input_length": len(input_str),
                "llm.agent_id": self.agent_id,
            },
        )
        ctx = trace.set_span_in_context(span)
        token = context.attach(ctx)
        self._span_stack[run_id] = {"span": span, "token": token, "start_time": time.perf_counter()}

        logger.info("tool_start", tool=tool_name, agent_id=self.agent_id)

    def on_tool_end(self, output: str, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", ""))
        if run_id not in self._span_stack:
            return

        frame = self._span_stack.pop(run_id)
        span = frame["span"]
        span.set_attribute("tool.output_length", len(output))
        span.set_attribute("tool.latency_ms", int((time.perf_counter() - frame["start_time"]) * 1000))
        span.end()
        context.detach(frame["token"])

    def on_agent_action(self, action: AgentAction, **kwargs: Any) -> None:
        logger.info(
            "agent_action",
            agent_id=self.agent_id,
            tool=action.tool,
            tool_input=str(action.tool_input)[:200],  # tool_input may be a dict
        )

    def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> None:
        logger.info("agent_finish", agent_id=self.agent_id, output_keys=list(finish.return_values.keys()))


# ─── Context Propagation qua HTTP ─────────────────────────────────────────────

def create_propagated_headers() -> dict:
    """Create HTTP headers carrying the W3C traceparent so trace context
    propagates to the next service."""
    headers: dict = {}
    inject(headers)  # OTel injects traceparent + tracestate automatically
    return headers

def extract_trace_context(incoming_headers: dict) -> Any:
    """Extract trace context từ inbound HTTP request."""
    return extract(incoming_headers)
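
For completeness, a sketch of how these helpers might be wired into an outbound HTTP call (httpx is an arbitrary choice of client, not mandated here):

import httpx

async def call_downstream_agent(url: str, payload: dict) -> dict:
    """Call another agent service, joining its spans to the current trace."""
    headers = create_propagated_headers()  # carries traceparent/tracestate
    async with httpx.AsyncClient() as client:
        resp = await client.post(url, json=payload, headers=headers)
        resp.raise_for_status()
        return resp.json()

# On the receiving side (e.g. a FastAPI handler), restore the caller's
# context before opening new spans:
#   ctx = extract_trace_context(dict(request.headers))
#   with tracer.start_as_current_span("agent.handle", context=ctx):
#       ...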

7. Structured Logging for AI Agents

7.1. Standard JSON Log Schema

{
  "timestamp": "2026-05-14T10:23:45.123456Z",
  "level": "INFO",
  "service": "rag-agent-service",
  "version": "2.1.0",
  "environment": "production",

  "request_id": "req-4bf92f35-77b3-4da6",
  "session_id": "sess-a3ce929d-0e0e-4736",
  "correlation_id": "corr-00f067aa-0ba9-02b7",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",

  "agent_id": "rag_agent_v2",
  "tenant_id": "tenant-healthcare-001",
  "user_id": "user-hashed-789xyz",

  "model": "gpt-4o",
  "model_version": "2024-11-20",
  "operation": "chat_completion",

  "prompt_tokens": 1840,
  "completion_tokens": 420,
  "total_tokens": 2260,
  "cost_usd": 0.004530,

  "latency_ms": 2050,
  "ttft_ms": 380,
  "queue_wait_ms": 12,

  "guardrail_status": "passed",
  "guardrail_checks": {
    "prompt_injection": "clean",
    "pii_detection": "no_pii",
    "topic_filter": "in_scope",
    "toxicity": "clean"
  },

  "tool_calls": [
    {"name": "search_knowledge_base", "latency_ms": 145, "status": "success"},
    {"name": "get_product_info", "latency_ms": 89, "status": "success"}
  ],

  "rag_context": {
    "chunks_retrieved": 5,
    "top_similarity_score": 0.92,
    "retrieval_latency_ms": 145
  },

  "quality_scores": {
    "groundedness": 0.88,
    "hallucination_probability": 0.08,
    "relevance": 0.91
  },

  "error": null,
  "error_type": null,
  "retry_count": 0
}

7.2. Python Structlog Setup

import sys
import logging
import structlog
from opentelemetry import trace

def configure_structured_logging(
    service_name: str,
    environment: str = "production",
    log_level: str = "INFO",
) -> None:
    """Cấu hình structlog với OTel trace context injection."""

    # Processor chain: runs over every log record before output
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,         # Thread-local context
            structlog.stdlib.add_log_level,                   # level field
            structlog.stdlib.add_logger_name,                 # logger field
            structlog.processors.TimeStamper(fmt="iso"),      # ISO timestamp
            _inject_otel_context,                             # trace_id + span_id
            _add_service_metadata(service_name, environment), # service + env
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),              # JSON output
        ],
        wrapper_class=structlog.make_filtering_bound_logger(
            getattr(logging, log_level.upper())
        ),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(sys.stdout),
    )

def _inject_otel_context(logger, method_name: str, event_dict: dict) -> dict:
    """Inject OTel trace_id và span_id vào mọi log record."""
    current_span = trace.get_current_span()
    if current_span and current_span.is_recording():
        ctx = current_span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

def _add_service_metadata(service_name: str, environment: str):
    def processor(logger, method_name: str, event_dict: dict) -> dict:
        event_dict["service"] = service_name
        event_dict["environment"] = environment
        return event_dict
    return processor

# Usage:
# configure_structured_logging("rag-agent-service", "production")
# log = structlog.get_logger()
# log.info("llm_call_completed", agent_id="rag_agent", latency_ms=2050, cost_usd=0.0045)

7.3. Elasticsearch Index Mapping

# elasticsearch-index-mapping.yaml
---
index_template:
  name: "ai-agent-logs"
  index_patterns:
    - "ai-agent-logs-*"
  
  settings:
    number_of_shards: 3
    number_of_replicas: 1
    refresh_interval: "5s"
    
    index:
      lifecycle:
        name: "ai-agent-logs-ilm-policy"
        rollover_alias: "ai-agent-logs"
    
    analysis:
      analyzer:
        custom_log_analyzer:
          type: standard
          stopwords: "_none_"

  mappings:
    dynamic: false
    properties:
      "@timestamp":        { type: date }
      timestamp:           { type: date }
      level:               { type: keyword }
      service:             { type: keyword }
      environment:         { type: keyword }
      version:             { type: keyword }

      request_id:          { type: keyword }
      session_id:          { type: keyword }
      trace_id:            { type: keyword }
      span_id:             { type: keyword }
      correlation_id:      { type: keyword }

      agent_id:            { type: keyword }
      tenant_id:           { type: keyword }
      user_id:             { type: keyword }
      model:               { type: keyword }
      operation:           { type: keyword }

      prompt_tokens:       { type: integer }
      completion_tokens:   { type: integer }
      total_tokens:        { type: integer }
      cost_usd:            { type: float }

      latency_ms:          { type: integer }
      ttft_ms:             { type: integer }
      queue_wait_ms:       { type: integer }

      guardrail_status:    { type: keyword }
      error:               { type: text, analyzer: custom_log_analyzer }
      error_type:          { type: keyword }
      retry_count:         { type: short }

      hallucination_probability: { type: float }
      groundedness:              { type: float }
      relevance:                 { type: float }

      tool_calls:
        type: nested
        properties:
          name:        { type: keyword }
          latency_ms:  { type: integer }
          status:      { type: keyword }

# ILM Policy
ilm_policy:
  name: "ai-agent-logs-ilm-policy"
  phases:
    hot:
      min_age: "0ms"
      actions:
        rollover:
          max_primary_shard_size: "50gb"
          max_age: "1d"
        set_priority:
          priority: 100
    warm:
      min_age: "7d"
      actions:
        shrink:
          number_of_shards: 1
        forcemerge:
          max_num_segments: 1
        set_priority:
          priority: 50
    cold:
      min_age: "30d"
      actions:
        freeze: {}
        set_priority:
          priority: 0
    delete:
      min_age: "90d"
      actions:
        delete: {}

7.4. Kibana/Elasticsearch Query Examples

// Query 1: Find slow requests (latency > 5s)
{
  "query": {
    "bool": {
      "must": [
        { "term": { "environment": "production" } },
        { "range": { "latency_ms": { "gte": 5000 } } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "latency_ms": "desc" }],
  "size": 20
}

// Query 2: High-cost sessions today
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now/d" } } },
        { "range": { "cost_usd": { "gte": 0.10 } } }
      ]
    }
  },
  "aggs": {
    "by_session": {
      "terms": { "field": "session_id", "size": 20 },
      "aggs": {
        "total_cost": { "sum": { "field": "cost_usd" } },
        "total_tokens": { "sum": { "field": "total_tokens" } }
      }
    }
  },
  "size": 0
}

// Query 3: Failed tool calls by agent
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-6h" } } }
      ],
      "filter": [
        {
          "nested": {
            "path": "tool_calls",
            "query": {
              "term": { "tool_calls.status": "failed" }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_agent": {
      "terms": { "field": "agent_id" },
      "aggs": {
        "failed_tools": {
          "nested": { "path": "tool_calls" },
          "aggs": {
            "failed_only": {
              "filter": { "term": { "tool_calls.status": "failed" } },
              "aggs": {
                "tool_names": { "terms": { "field": "tool_calls.name" } }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
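
These queries can also be issued from Python with the official client. A sketch for Query 1 (the endpoint URL is environment-specific, an assumption here):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

# Query 1 from above: slow production requests in the last hour
resp = es.search(
    index="ai-agent-logs-*",
    query={
        "bool": {
            "must": [
                {"term": {"environment": "production"}},
                {"range": {"latency_ms": {"gte": 5000}}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    sort=[{"latency_ms": "desc"}],
    size=20,
)
for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src["request_id"], src["latency_ms"])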

8. Grafana Dashboard — 5 Panel Groups

8.1. Overview of the 5 Dashboard Panels

Panel         | Description                           | Metrics source  | Visualisation
--------------|---------------------------------------|-----------------|---------------------
Overview      | RPS, error rate, avg latency          | Prometheus      | Stat + Time series
Token Economy | Cost/hour, token distribution         | Prometheus      | Bar gauge + Heatmap
Quality       | Hallucination rate, guardrail blocks  | Prometheus      | Time series + Alert
Agent Health  | Per-agent latency heatmap             | Prometheus      | Heatmap
Business KPI  | Task completion, escalation funnel    | Prometheus + ES | Stat + Bar chart

8.2. Grafana Dashboard JSON Config (Partial)

{
  "title": "AI Agent — LLMOps Dashboard",
  "uid": "llmops-main-dashboard",
  "tags": ["ai-agent", "llmops", "production"],
  "refresh": "30s",
  "time": { "from": "now-3h", "to": "now" },

  "panels": [
    {
      "id": 1,
      "title": "🟢 Requests Per Second",
      "type": "stat",
      "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "sum(rate(llm_request_duration_seconds_count[2m]))",
          "legendFormat": "RPS"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 100 },
              { "color": "red", "value": 500 }
            ]
          },
          "unit": "reqps"
        }
      }
    },
    {
      "id": 2,
      "title": "🔴 Error Rate (%)",
      "type": "stat",
      "gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "100 * sum(rate(llm_errors_total[5m])) / sum(rate(llm_request_duration_seconds_count[5m]))",
          "legendFormat": "Error Rate %"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "⏱ Latency P95 (ms)",
      "type": "timeseries",
      "gridPos": { "x": 0, "y": 4, "w": 12, "h": 8 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "histogram_quantile(0.95, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000",
          "legendFormat": "P95 - {{agent_id}}"
        },
        {
          "datasource": "prometheus",
          "expr": "histogram_quantile(0.50, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000",
          "legendFormat": "P50 - {{agent_id}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "ms" }
      }
    },
    {
      "id": 4,
      "title": "💰 Cost Per Hour (USD)",
      "type": "timeseries",
      "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "sum by(agent_id) (rate(llm_cost_usd_total[1h])) * 3600",
          "legendFormat": "Cost/hr - {{agent_id}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "currencyUSD" }
      }
    },
    {
      "id": 5,
      "title": "🧠 Hallucination Rate (%)",
      "type": "timeseries",
      "gridPos": { "x": 0, "y": 12, "w": 12, "h": 8 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "100 * histogram_quantile(0.90, rate(llm_hallucination_score_bucket[10m]))",
          "legendFormat": "Hallucination P90"
        }
      ],
      "alert": {
        "conditions": [
          {
            "type": "query",
            "query": { "params": ["A", "10m", "now"] },
            "reducer": { "type": "avg" },
            "evaluator": { "type": "gt", "params": [10] }
          }
        ],
        "name": "High Hallucination Rate Alert"
      }
    }
  ]
}

9. Alerting Strategy — 8 Essential Alert Rules

9.1. Prometheus AlertManager Config

# alertmanager-rules.yaml
---
groups:
  - name: llmops_critical
    rules:
      # Alert 1: Cost spike — daily spend exceeds 150% of the 7-day baseline
      - alert: LLMCostSpike
        expr: |
          (
            sum(increase(llm_cost_usd_total[24h]))
            /
            sum(increase(llm_cost_usd_total[24h] offset 7d))
          ) > 1.5
        for: 15m
        labels:
          severity: critical
          team: llmops
        annotations:
          summary: "💰 LLM Cost Spike Detected"
          description: "Daily cost is {{ humanize $value | printf \"%.0f%%\" }} of 7-day average. Current: ${{ $value }}"
          runbook: "https://wiki.company.com/runbooks/llm-cost-spike"

      # Alert 2: Latency P95 > 5s sustained 5 minutes
      - alert: LLMHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
          team: llmops
        annotations:
          summary: "⏱ LLM P95 Latency High: {{ $labels.agent_id }}"
          description: "P95 latency is {{ $value | humanizeDuration }} for agent {{ $labels.agent_id }}"

      # Alert 3: Error rate > 5% over 10 minutes
      - alert: LLMHighErrorRate
        expr: |
          (
            sum by(agent_id) (rate(llm_errors_total[10m]))
            /
            sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
          ) * 100 > 5
        for: 10m
        labels:
          severity: critical
          team: llmops
        annotations:
          summary: "🔴 LLM Error Rate > 5%: {{ $labels.agent_id }}"
          description: "Error rate is {{ $value | printf \"%.1f%%\" }} for agent {{ $labels.agent_id }}"

      # Alert 4: Hallucination Rate > 10% (sampled evaluation)
      - alert: LLMHallucinationRateHigh
        expr: |
          histogram_quantile(0.90,
            sum by(le, agent_id) (rate(llm_hallucination_score_bucket[15m]))
          ) > 0.10
        for: 10m
        labels:
          severity: critical
          team: ai-quality
        annotations:
          summary: "🧠 Hallucination Rate Spike: {{ $labels.agent_id }}"
          description: "P90 hallucination score is {{ $value | printf \"%.2f\" }} — review recent prompts/model"

      # Alert 5: Guardrail Block Surge > 20% in 15 minutes
      - alert: LLMGuardrailBlockSurge
        expr: |
          (
            sum by(agent_id) (rate(llm_guardrail_decisions_total{decision="block"}[15m]))
            /
            sum by(agent_id) (rate(llm_request_duration_seconds_count[15m]))
          ) * 100 > 20
        for: 5m
        labels:
          severity: warning
          team: llmops
        annotations:
          summary: "🛡 Guardrail Block Surge: {{ $labels.agent_id }}"
          description: "{{ $value | printf \"%.1f%%\" }} of requests blocked — possible attack or prompt issue"

      # Alert 6: Token Quota Approaching 80% of Daily Limit
      - alert: LLMTokenQuotaWarning
        expr: |
          (
            sum by(tenant_id) (increase(llm_tokens_total[24h]))
            /
            on(tenant_id) llm_token_daily_quota
          ) * 100 > 80
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "📊 Token Quota Warning: {{ $labels.tenant_id }}"
          description: "Tenant {{ $labels.tenant_id }} has used {{ $value | printf \"%.0f%%\" }} of daily token quota"

      # Alert 7: Circuit Breaker OPEN
      - alert: LLMCircuitBreakerOpen
        expr: llm_circuit_breaker_state{state="open"} == 1
        for: 2m
        labels:
          severity: critical
          team: llmops
        annotations:
          summary: "⚡ Circuit Breaker OPEN: {{ $labels.agent_id }}"
          description: "LLM circuit breaker opened for {{ $labels.agent_id }} — service may be degraded"

      # Alert 8: Memory/Context Overflow Rate Spike
      - alert: LLMContextOverflowSpike
        expr: |
          (
            sum by(agent_id) (rate(llm_errors_total{error_type="context_length_exceeded"}[10m]))
            /
            sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: warning
          team: llmops
        annotations:
          summary: "💾 Context Overflow Spike: {{ $labels.agent_id }}"
          description: "{{ $value | printf \"%.1f%%\" }} requests hitting context limit — review chunking/truncation strategy"

# alertmanager.yaml — Routing + Slack Webhook
---
route:
  group_by: ['alertname', 'agent_id']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-llmops'

  routes:
    - match:
        severity: critical
      receiver: 'slack-critical-llmops'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        team: ai-quality
      receiver: 'slack-ai-quality'

receivers:
  - name: 'slack-llmops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llmops-alerts'
        title: '{{ template "slack.title" . }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Details:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook }}
          {{ end }}
        send_resolved: true

  - name: 'slack-critical-llmops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llmops-critical'
        color: 'danger'
        title: '🚨 CRITICAL: {{ template "slack.title" . }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Details:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook }}
          {{ end }}
        send_resolved: true

  - name: 'slack-ai-quality'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#ai-quality-alerts'
        title: '{{ template "slack.title" . }}'
        send_resolved: true
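
Alert 7 reads an llm_circuit_breaker_state gauge that the instrumentation in section 5.6 does not define. A minimal sketch of how it might be exported — the metric name follows the alert rule, but the helper itself is an assumption:

from prometheus_client import Gauge

CIRCUIT_BREAKER_STATE = Gauge(
    "llm_circuit_breaker_state",
    "Circuit breaker state per agent (1 = currently in this state)",
    ["agent_id", "state"],  # state: closed | open | half_open
)

def set_circuit_breaker_state(agent_id: str, state: str) -> None:
    """Set exactly one state label to 1 so the alert can match state="open"."""
    for s in ("closed", "open", "half_open"):
        CIRCUIT_BREAKER_STATE.labels(agent_id=agent_id, state=s).set(
            1.0 if s == state else 0.0
        )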

10. A/B Testing Prompts & Model Routing

10.1. Traffic-Splitting Architecture

                    INCOMING REQUESTS
                          │
                          ▼
              ┌───────────────────────┐
              │   FEATURE FLAG        │
              │   SERVICE             │
              │   (LaunchDarkly /     │
              │    self-hosted)       │
              └──────────┬────────────┘
                         │
           ┌─────────────┼──────────────┐
           │ 90%         │ 10%          │
           ▼             ▼              │
     ┌──────────┐  ┌──────────┐        │
     │ Prompt A │  │ Prompt B │   Shadow Mode
     │ (control)│  │(canary)  │        │
     └────┬─────┘  └────┬─────┘        │
          │             │          ┌───▼───────┐
          ▼             ▼          │ Duplicate │
     LLM Response  LLM Response   │ Request   │
                                   │ (no user  │
       Track:                      │  impact)  │
       - Latency                   └─────┬─────┘
       - Quality score                   │
       - Cost                            ▼
       - User satisfaction         Evaluation
                                   (offline)

10.2. Python — Model Router with Weighted Random Selection

import random
import time
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Callable
import structlog

logger = structlog.get_logger()

class RoutingStrategy(Enum):
    WEIGHTED_RANDOM = "weighted_random"
    TENANT_BASED = "tenant_based"
    TASK_COMPLEXITY = "task_complexity"
    CANARY = "canary"
    SHADOW = "shadow"

@dataclass
class ModelConfig:
    model: str
    weight: float            # 0.0-1.0; weights across a config set must sum to 1.0
    variant_name: str        # "control", "canary_v2", "shadow"
    max_tokens: int = 4096
    temperature: float = 0.7
    extra_params: dict = field(default_factory=dict)

@dataclass
class RoutingDecision:
    model_config: ModelConfig
    strategy_used: str
    routing_reason: str
    experiment_id: Optional[str] = None

class AIAgentModelRouter:
    """
    Model router supporting several strategies:
    - A/B testing (weighted random)
    - Per-tenant routing
    - Task-complexity routing
    - Shadow mode (duplicated traffic)
    """

    def __init__(self):
        # A/B test configurations
        self._ab_experiments: dict[str, list[ModelConfig]] = {}

        # Tenant-specific routing
        self._tenant_routing: dict[str, ModelConfig] = {}

        # Default routing by task type
        self._task_routing: dict[str, ModelConfig] = {
            "simple_faq": ModelConfig(
                model="gpt-4o-mini", weight=1.0, variant_name="control",
                max_tokens=1024, temperature=0.3,
            ),
            "complex_analysis": ModelConfig(
                model="gpt-4o", weight=1.0, variant_name="control",
                max_tokens=4096, temperature=0.7,
            ),
            "sensitive_medical": ModelConfig(
                model="ollama/llama3.1", weight=1.0, variant_name="on_premise",
                max_tokens=2048, temperature=0.1,
            ),
            "code_generation": ModelConfig(
                model="claude-3-5-sonnet", weight=1.0, variant_name="control",
                max_tokens=4096, temperature=0.2,
            ),
        }

    def register_ab_experiment(
        self,
        experiment_id: str,
        configs: list[ModelConfig],
    ) -> None:
        """Đăng ký A/B experiment với weighted configs."""
        total_weight = sum(c.weight for c in configs)
        if abs(total_weight - 1.0) > 0.001:
            raise ValueError(f"Weights must sum to 1.0, got {total_weight}")
        self._ab_experiments[experiment_id] = configs
        logger.info("ab_experiment_registered", experiment_id=experiment_id,
                    variants=[c.variant_name for c in configs])

    def route(
        self,
        task_type: str,
        tenant_id: str = "default",
        session_id: str = "",
        experiment_id: Optional[str] = None,
        force_strategy: Optional[RoutingStrategy] = None,
    ) -> RoutingDecision:
        """Chọn model config dựa trên chiến lược routing."""

        # 1. Tenant-specific override (highest priority)
        if tenant_id in self._tenant_routing and not experiment_id:
            config = self._tenant_routing[tenant_id]
            return RoutingDecision(
                model_config=config,
                strategy_used=RoutingStrategy.TENANT_BASED.value,
                routing_reason=f"Tenant {tenant_id} has dedicated model",
            )

        # 2. A/B experiment (when an experiment_id is provided)
        if experiment_id and experiment_id in self._ab_experiments:
            configs = self._ab_experiments[experiment_id]

            # Sticky routing: same session_id → same variant (consistent UX)
            if session_id:
                hash_val = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
                bucket = (hash_val % 1000) / 1000.0
            else:
                bucket = random.random()

            cumulative = 0.0
            for config in configs:
                cumulative += config.weight
                if bucket <= cumulative:
                    logger.info(
                        "ab_routing",
                        experiment_id=experiment_id,
                        variant=config.variant_name,
                        model=config.model,
                        session_id=session_id,
                    )
                    return RoutingDecision(
                        model_config=config,
                        strategy_used=RoutingStrategy.WEIGHTED_RANDOM.value,
                        routing_reason=f"A/B bucket {bucket:.3f} → {config.variant_name}",
                        experiment_id=experiment_id,
                    )

        # 3. Task complexity routing (fallback)
        config = self._task_routing.get(
            task_type,
            ModelConfig(model="gpt-4o-mini", weight=1.0, variant_name="default")
        )

        return RoutingDecision(
            model_config=config,
            strategy_used=RoutingStrategy.TASK_COMPLEXITY.value,
            routing_reason=f"Task type '{task_type}' → {config.model}",
        )


# ─── Sample Usage ──────────────────────────────────────────────────────────────

router = AIAgentModelRouter()

# Register an A/B experiment: 90% prompt A (gpt-4o-mini) vs 10% prompt B (gpt-4o)
router.register_ab_experiment(
    experiment_id="exp_prompt_v2_vs_v3",
    configs=[
        ModelConfig(model="gpt-4o-mini", weight=0.90, variant_name="prompt_v2_control"),
        ModelConfig(model="gpt-4o",      weight=0.10, variant_name="prompt_v3_canary"),
    ],
)

decision = router.route(
    task_type="simple_faq",
    tenant_id="tenant-abc",
    session_id="sess-xyz789",
    experiment_id="exp_prompt_v2_vs_v3",
)

print(f"Model: {decision.model_config.model}")
print(f"Variant: {decision.model_config.variant_name}")
print(f"Strategy: {decision.strategy_used}")

10.3. Sample A/B Test Results

Metric                    | Prompt A (control) | Prompt B (canary) | Δ       | Verdict
--------------------------|--------------------|-------------------|---------|--------------------
Latency P95 (ms)          | 1,820              | 2,340             | +28.6%  | ❌ B slower
Quality Score (LLM Judge) | 3.8/5              | 4.3/5             | +13.2%  | ✅ B better
Cost/request (USD)        | $0.0021            | $0.0047           | +123.8% | ❌ B more expensive
User satisfaction (CSAT)  | 76%                | 83%               | +7%     | ✅ B better
Task completion rate      | 88%                | 92%               | +4%     | ✅ B better
Hallucination rate        | 4.2%               | 1.8%              | -57%    | ✅ B safer
Guardrail block rate      | 1.8%               | 1.2%              | -33%    | ✅ B cleaner

Conclusion: Prompt B (canary) delivers significantly better quality but costs over 2× as much. Decision: roll out prompt B to premium tenants (happy to pay), keep prompt A for the free tier.
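
Before acting on a table like this, it is worth checking that the deltas are not noise. A sketch of a two-proportion z-test for the CSAT difference; the per-variant sample sizes below are hypothetical:

from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p_a: float, n_a: int, p_b: float, n_b: int) -> float:
    """Two-sided p-value for the difference between two proportions."""
    p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# CSAT 76% vs 83%, assuming 9,000 control and 1,000 canary sessions:
print(two_proportion_p_value(0.76, 9_000, 0.83, 1_000))  # ≈ 7e-07 → significant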


11. Model Routing by Task Type

11.1. Decision Matrix

Task type              | Complexity | Suggested model            | Cost/1K tokens  | Latency P95  | Notes
-----------------------|------------|----------------------------|-----------------|--------------|------------------------------
Simple FAQ             | Low        | GPT-4o-mini / Gemini Flash | $0.00015        | < 500ms      | 80% of traffic
Text summarization     | Low-Medium | GPT-4o-mini                | $0.00015        | < 800ms      |
Analysis, comparison   | Medium     | GPT-4o / Claude 3.5 Sonnet | $0.005          | < 2s         |
Complex reasoning      | High       | GPT-4o / Claude 3.5        | $0.005          | < 3s         | 15% of traffic
Code generation        | High       | Claude 3.5 Sonnet          | $0.003          | < 3s         |
Medical/sensitive data | Any        | Ollama on-premise          | $0 (infra cost) | < 2s         | Data never leaves the server
Real-time chat         | Low        | GPT-4o-mini (streaming)    | $0.00015        | TTFT < 200ms |
Batch processing       | Any        | GPT-4o Batch API           | 50% discount    | Hours        | Not real-time

11.2. Cost vs. Quality Trade-off

Provider     | Model             | Input $/1M | Output $/1M | Quality Score | Latency   | Data Privacy | Best fit
-------------|-------------------|------------|-------------|---------------|-----------|--------------|--------------------------
OpenAI       | GPT-4o-mini       | $0.15      | $0.60       | 4.0/5         | Fast      | Cloud        | General, cost-sensitive
OpenAI       | GPT-4o            | $5.00      | $15.00      | 4.7/5         | Medium    | Cloud        | Complex reasoning
Anthropic    | Claude 3 Haiku    | $0.25      | $1.25       | 4.0/5         | Fast      | Cloud        | Safe, structured output
Anthropic    | Claude 3.5 Sonnet | $3.00      | $15.00      | 4.8/5         | Medium    | Cloud        | High quality, coding
Google       | Gemini 1.5 Flash  | $0.075     | $0.30       | 3.9/5         | Very fast | Cloud        | Ultra low cost
Azure OpenAI | GPT-4o            | $5.00      | $15.00      | 4.7/5         | Medium    | Cloud (VNet) | Enterprise compliance
Ollama       | Llama 3.1 70B     | $0 (GPU)   | $0 (GPU)    | 4.0/5         | Medium    | On-premise   | Healthcare, banking
Ollama       | Qwen2.5 7B        | $0 (GPU)   | $0 (GPU)    | 3.6/5         | Fast      | On-premise   | Zero-cost, simpler tasks

12. Sampling Strategy for Production

12.1. The Problem

Tracing at 100% sampling in a production AI Agent means:

  • 10,000 requests/day × 5 spans/request = 50,000 spans/day
  • Storage: ~2KB/span × 50,000 = 100MB/day of traces
  • 3 months: ~9GB for trace data alone
  • Jaeger + object storage cost: ~$50-100/month

The solution: adaptive (tail-based) sampling.

12.2. Sampling Strategy

Request type                | Sampling rate | Rationale
----------------------------|---------------|----------------------------
Error requests              | 100%          | Full debugging needed
Slow requests (P95+)        | 100%          | Performance investigation
High-cost requests (>$0.10) | 100%          | Cost audit
Guardrail blocked           | 100%          | Security audit
Normal successful requests  | 10%           | Statistical representation
Health checks / internal    | 0%            | Noise reduction

Estimated trace storage (10,000 req/day):

Error rate 2% = 200 requests → 200 × 5 spans × 2KB = 2MB
Slow rate 5%  = 500 requests → 500 × 5 spans × 2KB = 5MB
Normal 10%    = 930 requests → 930 × 5 spans × 2KB = 9.3MB
Total/day ≈ 16.3MB  (vs 100MB at 100% sampling)
Savings: ~84%

12.3. Python OTel Adaptive Sampler

import random
from opentelemetry.sdk.trace.sampling import (
    Sampler,
    SamplingResult,
    Decision,
    ALWAYS_ON,
    ALWAYS_OFF,
)
from opentelemetry.trace import SpanKind
from opentelemetry.context import Context
from opentelemetry.util.types import Attributes

class AdaptiveLLMSampler(Sampler):
    """
    Adaptive sampler for LLM workloads.
    - Errors: 100%
    - Slow requests: 100%
    - Normal traffic: configurable rate (default 10%)
    Note: a Sampler decides at span start, so latency/cost attributes must be
    supplied at creation time; otherwise run the equivalent rules as tail-based
    sampling in the OTel Collector.
    """

    def __init__(
        self,
        normal_sample_rate: float = 0.10,
        slow_threshold_ms: float = 3000.0,
        high_cost_threshold_usd: float = 0.10,
    ):
        self.normal_sample_rate = normal_sample_rate
        self.slow_threshold_ms = slow_threshold_ms
        self.high_cost_threshold_usd = high_cost_threshold_usd

    def should_sample(
        self,
        parent_context: Context,
        trace_id: int,
        name: str,
        kind: SpanKind = SpanKind.INTERNAL,
        attributes: Attributes = None,
        links: list = None,
        trace_state: object = None,
    ) -> SamplingResult:
        attrs = attributes or {}

        # Rule 1: Always sample errors
        if attrs.get("error", False) or attrs.get("http.status_code", 200) >= 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 2: Always sample slow requests
        latency_ms = attrs.get("llm.latency_ms", 0)
        if latency_ms > self.slow_threshold_ms:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 3: Always sample high-cost requests
        cost_usd = attrs.get("llm.cost_usd", 0)
        if cost_usd > self.high_cost_threshold_usd:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 4: Always sample guardrail blocks
        if attrs.get("guardrail.decision") == "block":
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 5: Normal sampling (10%)
        if random.random() < self.normal_sample_rate:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return f"AdaptiveLLMSampler(normal={self.normal_sample_rate})"

# Sử dụng trong TracerProvider:
# from opentelemetry.sdk.trace import TracerProvider
# provider = TracerProvider(sampler=AdaptiveLLMSampler(normal_sample_rate=0.10))

13. LLM Cost Management

13.1. Budgeting Per Tenant / Project

```python
import time
from dataclasses import dataclass

import redis
import structlog

logger = structlog.get_logger()


@dataclass
class BudgetConfig:
    tenant_id: str
    daily_budget_usd: float
    monthly_budget_usd: float
    daily_token_limit: int
    alert_threshold_pct: float = 0.80  # warn when 80% of the budget is used
    hard_stop: bool = True             # block calls once the budget is exceeded


class LLMBudgetGuard:
    """
    Middleware that checks the budget before every LLM call.
    Uses Redis to track spending in real time.
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self._budgets: dict[str, BudgetConfig] = {}

    def register_budget(self, config: BudgetConfig) -> None:
        self._budgets[config.tenant_id] = config
        logger.info("budget_registered",
                    tenant_id=config.tenant_id,
                    daily_limit_usd=config.daily_budget_usd)

    def _get_today_key(self, tenant_id: str) -> str:
        today = time.strftime("%Y-%m-%d")
        return f"llm_budget:daily:{tenant_id}:{today}"

    def _get_month_key(self, tenant_id: str) -> str:
        month = time.strftime("%Y-%m")
        return f"llm_budget:monthly:{tenant_id}:{month}"

    def check_budget(self, tenant_id: str, estimated_cost_usd: float) -> dict:
        """
        Check the budget before calling the LLM.
        Returns: {"allowed": bool, "reason": str, "remaining_usd": float}
        """
        config = self._budgets.get(tenant_id)
        if not config:
            return {"allowed": True, "reason": "no_budget_configured", "remaining_usd": float("inf")}

        daily_key = self._get_today_key(tenant_id)
        current_daily = float(self.redis.get(daily_key) or 0)
        projected_daily = current_daily + estimated_cost_usd

        # Hard stop check
        if config.hard_stop and projected_daily > config.daily_budget_usd:
            logger.warning(
                "budget_exceeded",
                tenant_id=tenant_id,
                current_cost=current_daily,
                daily_limit=config.daily_budget_usd,
            )
            return {
                "allowed": False,
                "reason": "daily_budget_exceeded",
                "remaining_usd": max(0, config.daily_budget_usd - current_daily),
            }

        # Alert threshold check
        if projected_daily > config.daily_budget_usd * config.alert_threshold_pct:
            logger.warning(
                "budget_threshold_warning",
                tenant_id=tenant_id,
                pct_used=projected_daily / config.daily_budget_usd,
            )

        return {
            "allowed": True,
            "reason": "within_budget",
            "remaining_usd": config.daily_budget_usd - current_daily,
        }

    def record_usage(self, tenant_id: str, actual_cost_usd: float) -> None:
        """Record the actual cost after the LLM call completes."""
        daily_key = self._get_today_key(tenant_id)
        month_key = self._get_month_key(tenant_id)

        pipe = self.redis.pipeline()
        pipe.incrbyfloat(daily_key, actual_cost_usd)
        pipe.expire(daily_key, 86400 * 2)    # 2-day TTL
        pipe.incrbyfloat(month_key, actual_cost_usd)
        pipe.expire(month_key, 86400 * 35)   # 35-day TTL
        pipe.execute()
```
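
In a request path, the guard wraps every LLM call: check before, record after. A minimal sketch of the wiring, where `call_llm` and the pre-call estimate are placeholders rather than part of the guard:

```python
import redis

def call_llm(prompt: str) -> tuple[str, float]:
    """Placeholder for a real LLM client; returns (response, actual_cost_usd)."""
    return "ok", 0.008

guard = LLMBudgetGuard(redis.Redis(host="localhost", port=6379))
guard.register_budget(BudgetConfig(
    tenant_id="acme",
    daily_budget_usd=50.0,
    monthly_budget_usd=1_000.0,
    daily_token_limit=2_000_000,
))

def handle_request(tenant_id: str, prompt: str) -> str:
    estimated = 0.01  # hypothetical pre-call estimate (e.g., prompt length × price)
    verdict = guard.check_budget(tenant_id, estimated)
    if not verdict["allowed"]:
        return f"Request rejected: {verdict['reason']}"

    response, actual_cost = call_llm(prompt)
    guard.record_usage(tenant_id, actual_cost)  # record what was actually spent
    return response
```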

13.2. Tier Pricing Comparison

| Criterion | OpenAI GPT-4o | Anthropic Claude 3.5 | Azure OpenAI | Ollama Self-hosted |
|---|---|---|---|---|
| Input price | $5/1M tokens | $3/1M tokens | $5/1M tokens | ~$0.15/1M (GPU cost) |
| Output price | $15/1M tokens | $15/1M tokens | $15/1M tokens | ~$0.15/1M (GPU cost) |
| Data privacy | OpenAI servers | Anthropic servers | Azure VNet | Fully on-premise |
| Compliance | SOC 2, GDPR (opt-out) | SOC 2, HIPAA add-on | HIPAA, FedRAMP | Self-managed |
| Rate limits | 10K RPM | 5K RPM | Custom | Unlimited |
| SLA uptime | 99.9% | 99.9% | 99.9% | Self-managed |
| Setup complexity | Low | Low | Medium | High (GPU infra) |
| Upfront cost | $0 | $0 | Azure subscription | GPU server ~$2,000+ |
| Cost at 1M requests/day | ~$3,500/day | ~$2,100/day | ~$3,500/day | ~$50/day (amortized) |
| Best fit | General, startups | High quality | Enterprise | Healthcare, banking |
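
The per-day cost figures in the table depend on an assumed traffic profile that is not stated. A small helper makes that assumption explicit (a sketch; the ~500 input / ~65 output tokens per request are illustrative numbers, not provider data):

```python
PRICES_PER_1M = {  # USD per 1M tokens (input, output), from the table above
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def daily_cost_usd(model: str, requests_per_day: int,
                   avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Projected daily spend for a given traffic profile."""
    in_price, out_price = PRICES_PER_1M[model]
    per_request = (avg_input_tokens * in_price
                   + avg_output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day

# Illustrative profile: ~500 input + ~65 output tokens per request
print(daily_cost_usd("gpt-4o", 1_000_000, 500, 65))  # ≈ $3,475/day, i.e. ~$3,500
```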

14. Incident Response for AI Agents

14.1. Runbook — When the Hallucination Rate Rises

```
INCIDENT: Hallucination Rate > 10%
════════════════════════════════════
T+0min:  Alert received via Slack #llmops-critical
T+2min:  On-call engineer acknowledges the alert

INVESTIGATION STEPS:
1. Grafana → Quality Dashboard → Hallucination Timeline
   - Determine: when did it start? All agents or one specific agent?
   - Review the top sessions with the highest hallucination_score

2. Elasticsearch query:
   GET ai-agent-logs-*/_search
   { "query": { "range": { "hallucination_probability": { "gte": 0.3 } } },
     "sort": [{"@timestamp": "desc"}], "size": 20 }

3. Check: any recent prompt version change?
   git log --oneline prompts/ | head -20

4. Check: did the model provider update the model?
   - OpenAI model version log
   - Pinned model version in config

MITIGATION:
- If caused by a prompt change → roll back the prompt version immediately
- If caused by a model update → pin a specific model version (gpt-4o-2024-11-20)
- If the cause is still unclear → activate HITL mode (escalate all uncertain responses)
- Notify stakeholders via #llmops-incidents

RESOLUTION CRITERIA:
- Hallucination rate < 5% sustained for 15 minutes

POST-INCIDENT:
- Post-mortem within 48 hours
- Update the runbook if needed
```

14.2. Runbook — When Cost Spikes

```
INCIDENT: Daily LLM Cost > 150% of Baseline
══════════════════════════════════════════
T+0min:  Cost spike alert
T+2min:  Acknowledge, begin investigation

INVESTIGATION:
1. Prometheus query: which tenant is spending the most?
   sum by(tenant_id) (rate(llm_cost_usd_total[1h])) * 3600

2. Elasticsearch: which sessions have abnormally high cost?
   (Query 2 from Section 7.4)

3. Check: abnormal token counts?
   - Input tokens > 5,000 per request → likely context stuffing
   - Output tokens > 2,000 → likely a verbose prompt

4. Check: retry loop?
   sum by(agent_id) (rate(llm_errors_total{error_type="RateLimitError"}[10m]))

MITIGATION (in order):
1. Disable the offending tenant if activity is suspicious
2. Enable a hard token quota immediately
3. Temporarily lower max_tokens in the model config
4. Scale down replicas if there is a request flood

POST-INCIDENT: Review per-tenant token quotas, update budget config
```

14.3. Post-Mortem Template

```markdown
# Post-Mortem: [Incident Name]

**Date**: YYYY-MM-DD
**Severity**: Critical / High / Medium
**Duration**: X hours Y minutes
**MTTR**: X hours Y minutes

## Impact
- Users affected: XXX
- Revenue impact: $XXX
- Extra cost incurred: $XXX

## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert triggered |
| HH:MM | On-call engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |

## Root Cause
[Describe the root cause]

## Contributing Factors
1. [Factor 1]
2. [Factor 2]

## What Went Well
- [...]

## What Could Be Improved
- [...]

## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [...] | [...] | [...] | High |

## Lessons Learned
[...]
```

14.4. MTTR Targets for AI Incidents

| Severity | Example | Response Time | MTTR Target |
|---|---|---|---|
| P0 - Critical | Cost spike $1K+, mass data leak | 5 minutes | 30 minutes |
| P1 - High | Error rate > 10%, hallucination surge | 15 minutes | 2 hours |
| P2 - Medium | Latency degradation, quality drop | 1 hour | 8 hours |
| P3 - Low | Logging gap, minor metric anomaly | Next business day | 3 days |

15. Production Readiness Checklist — 3 Levels

🥉 MVP Level (Minimum to Go Live)

Basic monitoring (10 items):

- Prometheus /metrics endpoint exposed
- LLM latency (p50, p95) tracked
- Error rate tracked per agent_id
- Token counts (input + output) recorded
- Daily cost tracking
- Basic Grafana dashboard with latency + errors
- Alert for error rate > 10%
- Alert for cost spike > 200% of baseline
- Structured JSON logging (request_id, session_id, latency, tokens)
- Logs shipped to Elasticsearch / Loki

Basic reliability (8 items):

- Timeout configured (max 30s per LLM call)
- Retry with exponential backoff (max 3 retries; see the sketch after this list)
- Rate-limit handling (429 error → respect retry-after)
- Circuit breaker configured for the LLM provider
- Graceful degradation when the LLM is unavailable
- /health endpoint reports LLM connectivity status
- Token limit guard (max_tokens configured)
- Context length check before calling the LLM
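
A minimal sketch of the retry item above, assuming a generic `call` function (real clients raise library-specific exceptions, so the retryable set here is illustrative):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    retryable: tuple = (TimeoutError, ConnectionError),  # illustrative set
) -> T:
    """Retry a flaky LLM call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # 1s, 2s, 4s, ... capped; jitter avoids synchronized retry storms
            delay = min(base_delay_s * (2 ** attempt), max_delay_s)
            time.sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```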

🥈 Production Level (Enterprise-Ready)

Advanced observability (12 items):

- OpenTelemetry SDK fully integrated (traces + metrics + logs)
- Distributed tracing with context propagation across all agents
- TTFT (Time To First Token) tracking for streaming responses
- Per-tenant cost breakdown dashboard
- Hallucination rate monitoring (sampled evaluation pipeline)
- Guardrail decision logging with reason codes
- Tool-call latency histogram per tool
- Memory/context usage tracking
- Session timeline reconstruction from traces
- Kibana/Grafana Explore for ad-hoc investigation
- Automated daily cost report → email/Slack
- ILM policy for log retention (hot/warm/cold/delete)

Complete alerting (8 items):

- All 8 alert rules from Section 9 configured
- Alert routing by team/severity
- PagerDuty / on-call rotation integrated
- Runbook link in every alert annotation
- Alert fatigue review (tune thresholds after 2 weeks)
- Dead man's switch (alert if metrics stop flowing)
- Per-tenant cost budget alerts
- SLA breach prediction alert (leading indicator)

Production reliability (10 items):

- Multi-region LLM provider failover
- Budget guard middleware for every tenant
- Token quota enforcement per tenant per day
- Adaptive sampling for traces (not 100%)
- A/B testing framework ready
- Model versions pinned (never “latest”)
- Prompt versioning with git + experiment tracking
- Shadow-mode testing for model upgrades
- Load testing with a realistic token distribution
- Chaos engineering: LLM provider outage drill

🥇 Enterprise Level (Most Complete)

Advanced LLMOps (12 items):

- Full MLflow / LangSmith experiment tracking integration
- Automated evaluation pipeline running hourly on sampled traffic
- Model drift detection with statistical tests (KS test, Chi-square)
- Prompt regression test suite run on every deployment
- Multi-model cost optimization engine (auto-routing by task)
- LLM request caching (semantic cache with Redis + vector similarity; sketched after this list)
- Streaming token profiling (generation speed, jitter)
- Custom SLOs: error budget tracking, burn-rate alerts
- Capacity planning dashboard (projected cost at 30/60/90 days)
- Fine-tuning pipeline with an evaluation gate before deploy
- Cross-tenant benchmarking (anonymized)
- Regulatory audit trail with on-demand PDF/Excel report export
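
The semantic-cache item above is worth a sketch: if a new query is close enough in embedding space to one already answered, return the cached answer and skip the LLM call entirely. A minimal in-memory version (`embed` is a placeholder for any embedding model; a production build would keep vectors in Redis, as the checklist item says):

```python
from typing import Callable, Optional

import numpy as np

class SemanticCache:
    """Return a cached answer when a new query is close to a previous one."""

    def __init__(self, embed: Callable[[str], np.ndarray], threshold: float = 0.95):
        self.embed = embed          # placeholder: text -> embedding vector
        self.threshold = threshold  # cosine-similarity cutoff for a "hit"
        self._entries: list = []    # [(unit_vector, cached_answer), ...]

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        # Linear scan is fine for a sketch; use Redis vector search at scale.
        for vec, answer in self._entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return answer       # cache hit: the LLM call is skipped
        return None

    def put(self, query: str, answer: str) -> None:
        v = self.embed(query)
        self._entries.append((v / np.linalg.norm(v), answer))
```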

Security & Compliance (10 items):

- Every LLM interaction logged to an immutable audit trail
- PII detection and masking in the log pipeline
- Data residency enforcement (EU data → EU LLM endpoint)
- Penetration testing of prompt injection vectors
- GDPR Article 22 compliance (explainable AI decisions)
- Automated SOC 2 Type II evidence collection
- Monthly third-party security review of LLM configs
- Quarterly incident response drills
- Vendor lock-in mitigation plan (multi-provider routing)
- Contractual SLAs with LLM providers documented

16. Operational KPIs, Platform Cost, ROI Analysis

16.1. Monthly Operational KPIs

| KPI | MVP Target | Production Target | Enterprise Target |
|---|---|---|---|
| System Uptime | 99.0% | 99.5% | 99.9% |
| Avg Response Latency | < 5s | < 3s | < 2s |
| Error Rate | < 5% | < 1% | < 0.5% |
| Hallucination Rate | < 10% | < 3% | < 1% |
| Task Completion Rate | > 80% | > 90% | > 95% |
| Cost per Query | < $0.20 | < $0.08 | < $0.03 |
| MTTR (P1 incident) | 4h | 2h | 30min |
| User Satisfaction | > 70% | > 80% | > 90% |

16.2. Observability Stack Platform Cost

| Component | Self-hosted / Free Tier | SaaS / Managed | Notes |
|---|---|---|---|
| OpenTelemetry Collector | $0 (self-hosted) | $0 (open source) | K8s deployment |
| Prometheus | $0 | $0 | Add Thanos for HA |
| Grafana | $0 (OSS) | $29-299/mo | OSS is sufficient |
| Jaeger/Tempo | $0 + S3 storage | $50-200/mo | Tempo is cheaper than Jaeger |
| Elasticsearch | $200-500/mo (3 nodes) | $95-500/mo (ES Cloud) | ES Cloud if managed |
| Alertmanager | $0 | $0 | Bundled with Prometheus |
| Pyroscope | $0 | $0 | Grafana Phlare |
| Total (Self-hosted) | ~$200-500/mo | | 10K req/day |
| Total (Full SaaS) | | ~$500-1,200/mo | Managed, less ops effort |

16.3. ROI Analysis

```
Scenario: 10,000 LLM queries/day, team of 5

BEFORE (no LLMOps):
- Incident detection lag: 4-6 hours
- Each incident: 3-4 hours of engineer debugging time = ~$300 loss/incident
- 2 incidents/month = $600/month wasted
- Overspend from untracked cost: ~$400/month (estimated 20% waste)
- Hallucination → user churn: 5% of users/month = $2,000 MRR loss
Total monthly loss without LLMOps: ~$3,000

AFTER (full LLMOps stack):
- Platform cost: $500/month
- Incident MTTR cut from 4h → 30min (P1): saves $250/incident × 2 = $500/month
- Cost optimization (routing + quota): saves 15-25% = ~$300-500/month
- Hallucination detection → user churn down 60%: saves $1,200/month
Total monthly saving: ~$2,000 - 2,500

ROI = (2,000 - 500) / 500 × 100 = 300%
Payback period: < 1 month
```

17. Operational Risk Matrix

| # | Risk | Probability | Impact | Severity | Mitigation |
|---|---|---|---|---|---|
| 1 | LLM provider outage (OpenAI, Anthropic) | Medium | High | 🔴 High | Multi-provider failover; local Ollama fallback |
| 2 | Cost runaway (prompt loop, token exploit) | Low-Medium | Very High | 🔴 High | Budget guard; hard token quota; real-time cost alerts |
| 3 | Silent model degradation (provider updates the model) | Medium | High | 🔴 High | Pin model versions; automated weekly regression eval |
| 4 | Log/trace data explosion (misconfigured sampler) | Low | Medium | 🟠 Medium | Adaptive sampling; storage quota alert |
| 5 | Alert fatigue (too many false positives) | High | Medium | 🟠 Medium | Tune thresholds after 2 weeks; alert review cadence |
| 6 | PII leak via logs (unmasked user data in structured logs) | Low | Very High | 🔴 High | Log scrubber middleware; PII regex masking pipeline |
| 7 | Dashboard blind spot (metric not instrumented) | Medium | Medium | 🟠 Medium | Coverage checklist; quarterly observability audit |
| 8 | Observer effect (OTel overhead degrades performance) | Low | Low | 🟡 Low | Benchmark OTel overhead (<1ms target); async exporters |

18. LLMOps Rollout Roadmap — 3 Phases

🚀 Phase 1 — Foundation (Weeks 1-2)

Week 1:

- Deploy the OTel Collector + Prometheus + Grafana to K8s (Helm charts)
- Integrate the OTel SDK into all agent services
- Set up a basic Grafana dashboard (latency, errors, cost)
- Configure basic alerts (error rate, cost spike)
- Ship structured logs to Elasticsearch

Week 2:

- End-to-end distributed tracing (orchestrator → sub-agents)
- Token + cost tracking per agent per tenant
- Budget guard middleware deployed
- ILM policy for Elasticsearch
- On-call rotation set up, runbooks written

Deliverable: the system can detect a P1 incident in < 5 minutes


⚙️ Phase 2 — Quality & Cost (Weeks 3-6)

Weeks 3-4:

- Hallucination evaluation pipeline (sampled, async)
- Complete guardrail decision logging
- Per-tenant cost dashboard + daily email report
- A/B testing framework (canary deployments)
- Model router by task complexity

Weeks 5-6:

- Adaptive sampling replaces 100% sampling
- Semantic cache (Redis + vector similarity)
- Formal post-mortem process
- Alert tuning (fewer false positives)
- Quality SLO dashboard (error budget, burn rate)

Deliverable: cost down 20-30%; hallucination rate visible and monitored


🏆 Phase 3 — Enterprise Grade (Weeks 7-12)

Weeks 7-9:

- Full MLflow / LangSmith integration
- Automated model drift detection
- Prompt regression test suite in CI/CD
- Multi-provider failover (OpenAI → Azure OpenAI → Anthropic)
- Capacity planning dashboard

Weeks 10-12:

- Compliance audit trail (immutable, exportable)
- PII masking in the log pipeline
- Chaos engineering drill (LLM outage simulation)
- Security penetration test for LLM attack vectors
- Documentation, runbooks, and team training

Deliverable: Full LLMOps maturity — incident MTTR < 30min, cost optimized, compliance-ready


19. Conclusion

In this article we built a complete Monitoring & Observability system for AI Agents in production — from theory to working code:

| Component Built | Value |
|---|---|
| LLMOps vs DevOps — 10-dimension comparison | Understand why AI agents need their own observability |
| OTel architecture with 4 pillars | A complete framework for any scale |
| 25+ metrics in 5 groups | Know exactly what to measure |
| OTel instrumentation (Python) | Implementable today |
| LangChain callback handler | Automatic tracing for LangChain agents |
| Structured log schema + ES mapping | Standardized, searchable, auditable logs |
| Grafana JSON config | Production-ready dashboards |
| 8 alert rules + Prometheus YAML | Full alert coverage |
| A/B testing framework | Data-driven prompt/model improvements |
| Model router (Python) | Automatic cost optimization |
| Adaptive sampler | ~84% lower trace storage cost |
| Budget guard | Stops cost runaway |
| Incident runbooks | Fast response, low MTTR |
| 70-item checklist at 3 levels | A clear roadmap from MVP to Enterprise |
| 300% ROI | Justifies the investment to stakeholders |

The Golden Rule of LLMOps

“You cannot manage what you cannot measure. In the LLM world this is truer than anywhere else, because an LLM can fail silently in ways that no traditional DevOps metric will catch.”


📌 Next Article

Part 8: Real-World Use Cases — AI Agents in Vietnamese Enterprises

With the full foundation in place (architecture, memory, guardrails, monitoring), the next article puts everything into practice with three real-world use cases from Vietnamese enterprises:

- Healthcare: an AI Agent that helps doctors look up treatment protocols, integrated with HIS/EMR
- Banking/Fintech: an AI Agent for financial product advisory and KYC automation
- Retail/E-commerce: an omnichannel customer care AI Agent (Zalo, Web, App)

Each use case covers: detailed architecture, tech stack, cost, rollout timeline, and lessons learned.


💡 Field tip: Start Phase 1 (Weeks 1-2) as soon as your first AI Agent reaches production. Don't wait until you “have time”: a cost runaway or a hallucination incident arrives without warning and will force you to build observability in crisis mode, which is both expensive and stressful. Ship observability together with the feature; that is what a mature LLMOps culture looks like.
