Monitoring & Observability — Operating AI Agents in Production

1. Why Do Production AI Agents Need Dedicated Observability?

In the previous article, we built a Guardrails & Evaluation system to ensure the AI Agent behaves safely. But once thousands of real users interact with the agent every day, an entirely new question emerges:

“How do I know the agent is working correctly, running stably, staying within budget, and creating value right now — in production, 24/7?”

Traditional monitoring (CPU, RAM, requests/s) is not enough for an AI Agent. The agent can look completely “green” on a conventional DevOps dashboard while in reality it is:

  • Answering incorrectly (hallucination rate silently climbing)
  • Consuming 3× the usual tokens because of a prompt loop
  • Burning an extra $800/day because of one wrong model configuration
  • Stuck in a reasoning loop for 45 seconds with no timeout

This is why LLMOps — a dedicated branch of MLOps — emerged.


2. LLMOps vs. Traditional DevOps — 10 Core Differences

#  | Dimension       | Traditional DevOps                       | LLMOps for AI Agents
---|-----------------|------------------------------------------|------------------------------------------------------
1  | Determinism     | Deterministic: same input → same output  | Non-deterministic: same prompt → different outputs
2  | Cost unit       | CPU hours, bandwidth GB                  | Tokens (input + output) + API call cost
3  | Quality metrics | Latency, error rate, uptime              | Hallucination rate, groundedness, relevance score
4  | Versioning      | Code + config versioning                 | Code + config + prompt versioning + model versioning
5  | Drift           | Performance drift from hardware changes  | Model drift: the provider updates the model silently
6  | Debugging       | Clear stack traces                       | Complex multi-hop reasoning traces, hard to reproduce
7  | Testing         | Unit tests, integration tests            | Evaluation datasets, LLM-as-a-Judge, A/B testing
8  | Rollback        | Roll back code/config                    | Roll back prompt version + model version + memory state
9  | Scaling         | Straightforward horizontal scaling       | Must balance token throughput, context window, cost
10 | Compliance      | Access logs, audit trail                 | Log every LLM interaction for compliance + audit

2.1. Non-Determinism — The Biggest Challenge

DevOps:  f(x) = y                    → always the same; one test run suffices
LLMOps:  f(x) = y₁ | y₂ | y₃ | ...   → tests must sample; evaluation must be statistical

This means you cannot just monitor whether errors occur — you must monitor whether the output is correct, continuously and probabilistically.
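
A single assertion-style test therefore proves little. Here is a minimal sketch of sampling-based evaluation; the call_agent and evaluate callables are placeholders, not part of this article:

import asyncio

async def sampled_pass_rate(call_agent, evaluate, prompt: str, n: int = 20) -> float:
    """Run the same prompt n times and return the fraction of outputs that
    pass the evaluator — a probabilistic signal instead of a pass/fail bit.

    call_agent: async fn(prompt) -> str   (your agent/LLM call)
    evaluate:   fn(output) -> bool        (judge, groundedness check, regex, ...)
    """
    outputs = await asyncio.gather(*(call_agent(prompt) for _ in range(n)))
    return sum(evaluate(o) for o in outputs) / n

# Alert when the pass rate drifts below a threshold (e.g. 0.90)
# rather than asserting one deterministic output.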

2.2. Token Economy — The Invisible Cost

Scenario                                       | Tokens consumed | Estimated cost
-----------------------------------------------|-----------------|---------------
1 simple FAQ question                          | ~500 tokens     | ~$0.001
1 complex consultation session (RAG + history) | ~8,000 tokens   | ~$0.016
1 five-step agentic workflow                   | ~25,000 tokens  | ~$0.050
10,000 users/day × agentic workflow            | 250M tokens     | ~$500/day

Bottom line: a small bug in a prompt (e.g., an infinite retry loop) can burn through $2,000+ before anyone notices, if cost monitoring is not in place.
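
The table's arithmetic as a quick helper (the blended rate of $0.002 per 1K tokens is the one implied by the rows above; real pricing varies by model and changes over time):

def estimate_daily_cost(tokens_per_request: int, requests_per_day: int,
                        usd_per_1k_tokens: float = 0.002) -> float:
    """Daily cost = tokens/request × $/1K tokens × requests/day."""
    return tokens_per_request / 1000 * usd_per_1k_tokens * requests_per_day

# 10,000 users/day × a 25K-token agentic workflow:
print(estimate_daily_cost(25_000, 10_000))  # 500.0 → ≈ $500/day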


3. Overall Observability Architecture for AI Agents

┌─────────────────────────────────────────────────────────────────────────────┐
│              LLMOPS OBSERVABILITY ARCHITECTURE — AI AGENT CLUSTER           │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────────────────────────┐
  │                          AI AGENT CLUSTER                                │
  │                                                                          │
  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │
  │  │Orchestrator │  │  RAG Agent  │  │  Tool Agent │  │ Memory Agent  │  │
  │  │   Agent     │  │             │  │             │  │               │  │
  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └───────┬───────┘  │
  │         │                │                │                  │          │
  │         └────────────────┴────────────────┴──────────────────┘          │
  │                                    │                                     │
  │                    OTel SDK (Python / .NET / Java)                       │
  │                    - Traces (spans + context propagation)                │
  │                    - Metrics (counters, histograms, gauges)              │
  │                    - Logs (structured JSON + trace_id correlation)       │
  └─────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                       OPENTELEMETRY COLLECTOR                            │
  │                                                                          │
  │   Receivers: OTLP gRPC/HTTP, Prometheus scrape, Fluentd                  │
  │   Processors: Batch, Memory Limiter, Attribute Filter, Sampling          │
  │   Exporters: → Prometheus │ → Jaeger/Tempo │ → Elasticsearch            │
  └──────────────────┬──────────────────────────────────────────────────────┘
                     │
         ┌───────────┼──────────────┐
         ▼           ▼              ▼
  ┌────────────┐ ┌─────────────┐ ┌──────────────────┐
  │ PROMETHEUS │ │   JAEGER    │ │  ELASTICSEARCH   │
  │            │ │   / TEMPO   │ │  / OPENSEARCH    │
  │ Metrics    │ │             │ │                  │
  │ Storage    │ │ Distributed │ │ Log Storage      │
  │ & Query    │ │ Traces      │ │ Full-text Search │
  └─────┬──────┘ └──────┬──────┘ └────────┬─────────┘
        │               │                 │
        └───────────────┴─────────────────┘
                        │
                        ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                        GRAFANA DASHBOARD                                 │
  │                                                                          │
  │  [Overview] [Token Economy] [Quality] [Agent Health] [Business KPI]     │
  └──────────────────────────────────┬──────────────────────────────────────┘
                                     │
                                     ▼
  ┌─────────────────────────────────────────────────────────────────────────┐
  │                         ALERTMANAGER                                     │
  │                                                                          │
  │   Rules: Cost Spike | Latency P95 | Error Rate | Hallucination Rate     │
  │   Routing: → Slack | PagerDuty | Email | Webhook                        │
  └─────────────────────────────────────────────────────────────────────────┘

3.1. Multi-Agent Distributed Tracing Flow

  USER REQUEST (request_id: req-abc123)
       │
       ▼ [Trace Start — Span: "user_request"]
  ┌────────────────────────────────────┐
  │     API GATEWAY / LB               │
  │     Inject: traceparent header     │
  └──────────────────┬─────────────────┘
                     │
                     ▼ [Span: "orchestrator.process"]
  ┌────────────────────────────────────┐
  │     ORCHESTRATOR AGENT             │  t=0ms
  │     - Parse intent                 │
  │     - Plan sub-tasks               │
  └──┬──────────────┬──────────────┬───┘
     │              │              │
     ▼              ▼              ▼
  [Span:         [Span:         [Span:
  "rag.retrieve"] "tool.call"]  "memory.fetch"]
  ┌──────────┐  ┌──────────┐  ┌──────────┐
  │RAG Agent │  │Tool Agent│  │Memory    │
  │t=5ms     │  │t=5ms     │  │Agent     │
  │          │  │          │  │t=5ms     │
  │  ┌─────┐ │  │  ┌─────┐ │  │  ┌─────┐│
  │  │Embed│ │  │  │API  │ │  │  │Redis││
  │  │Query│ │  │  │Call │ │  │  │Fetch││
  │  └──┬──┘ │  │  └──┬──┘ │  │  └──┬──┘│
  │     │    │  │     │    │  │     │   │
  │  ┌──▼──┐ │  │  ┌──▼──┐ │  │     │   │
  │  │Vecto│ │  │  │Tool │ │  │     │   │
  │  │rDB  │ │  │  │Resp │ │  │     │   │
  │  └─────┘ │  │  └─────┘ │  │     │   │
  └────┬─────┘  └────┬──────┘  └─────┬───┘
       │             │               │
       └──────────────┴───────────────┘
                      │
                      ▼ [Span: "llm.generate"] t=120ms
             ┌─────────────────┐
             │   LLM CALL      │
             │   GPT-4o / etc  │
             │   tokens: 2,340 │
             │   latency: 1.8s │
             └────────┬────────┘
                      │
                      ▼ [Span: "output.guard"] t=1920ms
             ┌─────────────────┐
             │ Output Guard    │
             │ Guardrails check│
             └────────┬────────┘
                      │
                      ▼ [Trace End] t=2050ms
             FINAL RESPONSE → User
             Total: 2,050ms | tokens: 2,340 | cost: $0.0047

4. The Four Pillars of LLM Observability

4.1. Pillar 1 — Metrics

Description: numeric, time-series, aggregatable data — used for trending and alerting.

Suitable tools: Prometheus, Grafana, Datadog, New Relic

Sample data:

# HELP llm_request_duration_seconds LLM request latency
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="0.5"} 42
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="1.0"} 180
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="2.0"} 312
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="5.0"} 398
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="+Inf"} 402

# HELP llm_tokens_total Total tokens consumed
# TYPE llm_tokens_total counter
llm_tokens_total{agent="rag_agent",type="input",model="gpt-4o"} 1284930
llm_tokens_total{agent="rag_agent",type="output",model="gpt-4o"} 423810

# HELP llm_cost_usd_total Total cost in USD
# TYPE llm_cost_usd_total counter
llm_cost_usd_total{agent="rag_agent",model="gpt-4o"} 24.87

4.2. Pillar 2 — Logs

Description: structured event records — used for debugging, auditing, and root-cause analysis.

Suitable tools: Elasticsearch, OpenSearch, Loki, Splunk

Sample data (JSON structured log):

{
  "timestamp": "2026-05-14T10:23:45.123Z",
  "level": "INFO",
  "request_id": "req-abc123",
  "session_id": "sess-xyz789",
  "agent_id": "rag_agent",
  "model": "gpt-4o",
  "prompt_tokens": 1840,
  "completion_tokens": 420,
  "total_tokens": 2260,
  "latency_ms": 2050,
  "cost_usd": 0.0045,
  "guardrail_status": "passed",
  "tool_calls": ["search_knowledge_base", "get_product_info"],
  "hallucination_score": 0.12,
  "user_satisfaction": null,
  "error": null
}

4.3. Pillar 3 — Traces

Description: distributed tracing — the timeline of a request as it crosses multiple services/agents.

Suitable tools: Jaeger, Grafana Tempo, Zipkin, AWS X-Ray

Sample span data:

{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "b9c7c989f97918e1",
  "operationName": "llm.generate",
  "serviceName": "rag-agent",
  "startTime": 1715677425120,
  "duration": 1823000,
  "tags": {
    "llm.model": "gpt-4o",
    "llm.input_tokens": 1840,
    "llm.output_tokens": 420,
    "llm.cost_usd": 0.0045,
    "agent.id": "rag_agent",
    "guardrail.status": "passed"
  }
}

4.4. Pillar 4 — Profiles

Description: CPU/memory profiling of the inference engine and Python code — used to find bottlenecks.

Suitable tools: Pyroscope, Grafana Phlare, py-spy, cProfile

Sample — spotting a real bottleneck:

Function                          │ CPU % │ Calls │ Avg ms
──────────────────────────────────┼───────┼───────┼───────
embed_documents()                 │ 34.2% │ 2,840 │ 12.1ms
vector_db.similarity_search()     │ 21.8% │ 2,840 │ 7.7ms
openai.chat.completions.create()  │ 18.6% │  890  │ 1,820ms
json.loads() [response parsing]   │  8.3% │ 2,840 │ 2.9ms
redis.get() [session cache]       │  5.1% │ 8,900 │ 0.57ms
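
A quick way to produce a table like this in-process is the standard library's cProfile. A sketch, where handle_request stands in for your agent's entrypoint:

import cProfile
import pstats

def profile_agent_turn(handle_request, payload) -> None:
    """Profile one synchronous agent turn and print the top time consumers."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request(payload)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10

# For a live process without code changes, py-spy can attach externally:
#   py-spy top --pid <agent_pid>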

5. Key Metrics to Track — 5 Groups

5.1. Group 1 — Latency Metrics

Metric            | Description                           | Target (Production) | Alert Threshold
------------------|---------------------------------------|---------------------|----------------
TTFT p50          | Time To First Token, median           | < 500ms             | > 1s
TTFT p95          | Time To First Token, 95th percentile  | < 1.5s              | > 3s
TTFT p99          | Time To First Token, 99th percentile  | < 3s                | > 5s
Total Latency p95 | End-to-end response time              | < 3s                | > 5s
Queue Wait Time   | Time spent waiting in the queue       | < 100ms             | > 500ms
Tool Call Latency | Latency of external API calls         | < 500ms/call        | > 2s

5.2. Group 2 — Token & Cost Metrics

Metric                 | Description                    | Target        | Alert Threshold
-----------------------|--------------------------------|---------------|-------------------
Input tokens/request   | Avg input tokens per request   | < 2,000       | > 5,000
Output tokens/request  | Avg output tokens per request  | < 500         | > 2,000
Cost/session USD       | Average cost per session       | < $0.05       | > $0.20
Daily cost USD         | Total cost per day             | Baseline ±20% | > 150% of baseline
Monthly cost trend     | Month-over-month cost trend    | Growth < 30%  | > 50% MoM
Token efficiency ratio | Output tokens / input tokens   | > 0.3         | < 0.1

5.3. Group 3 — Quality Metrics

Metric               | Description                               | Target  | Alert Threshold
---------------------|-------------------------------------------|---------|----------------
Hallucination rate   | % of responses with incorrect content     | < 3%    | > 10%
Guardrail block rate | % of requests blocked by guardrails       | 0.5-2%  | > 20% (surge)
Groundedness score   | How well RAG answers are grounded in context | > 0.85 | < 0.70
User satisfaction    | CSAT score / thumbs-up %                  | > 80%   | < 60%
Task completion rate | % of tasks completed successfully         | > 90%   | < 75%
Escalation rate      | % of sessions escalated to a human        | < 5%    | > 15%

5.4. Group 4 — Reliability Metrics

Metric                | Description                     | Target  | Alert Threshold
----------------------|---------------------------------|---------|----------------
Error rate            | % of requests returning errors  | < 1%    | > 5%
Timeout rate          | % of requests timing out        | < 0.5%  | > 2%
Retry rate            | % of requests needing a retry   | < 2%    | > 10%
Circuit breaker state | Circuit breaker status          | CLOSED  | OPEN > 5min
Memory overflow rate  | % of context window overflows   | < 1%    | > 5%
Tool failure rate     | % of failed tool calls          | < 2%    | > 10%

5.5. Group 5 — Business Metrics

Metric                  | Description                     | Target            | Alert Threshold
------------------------|---------------------------------|-------------------|------------------
Active sessions         | Number of live sessions         | Capacity planning | > 80% capacity
Daily active users      | Unique users per day            | Growth target     | Sudden drop > 30%
Task completion rate    | % of tasks completed            | > 90%             | < 75%
Avg conversation length | Average turns per session       | 3-8 turns         | > 15 turns
ROI per agent           | Value created / operating cost  | > 3x              | < 1x
Cost per resolved query | Cost to resolve a single query  | < $0.10           | > $0.50

5.6. Python — Custom Prometheus Metrics + OpenTelemetry Instrumentation

import time
import logging
from typing import Optional, Any
from functools import wraps
from prometheus_client import Counter, Histogram, Gauge, Summary
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

logger = logging.getLogger(__name__)

# ─── Prometheus Metrics ────────────────────────────────────────────────────────

# Latency histogram with buckets spanning the p50/p95/p99 ranges
LLM_REQUEST_DURATION = Histogram(
    "llm_request_duration_seconds",
    "LLM request latency in seconds",
    ["agent_id", "model", "operation"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

TTFT_DURATION = Histogram(
    "llm_ttft_seconds",
    "Time To First Token in seconds",
    ["agent_id", "model"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)

# Token counters
LLM_TOKENS_TOTAL = Counter(
    "llm_tokens_total",
    "Total tokens consumed",
    ["agent_id", "model", "token_type"],  # token_type: input | output
)

# Cost tracking
LLM_COST_USD = Counter(
    "llm_cost_usd_total",
    "Total LLM API cost in USD",
    ["agent_id", "model", "tenant_id"],
)

# Quality metrics
LLM_HALLUCINATION_SCORE = Histogram(
    "llm_hallucination_score",
    "Hallucination probability score (0.0-1.0)",
    ["agent_id"],
    buckets=[0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0],
)

GUARDRAIL_DECISIONS = Counter(
    "llm_guardrail_decisions_total",
    "Guardrail decisions",
    ["agent_id", "decision", "reason"],  # decision: allow|block|escalate
)

# Reliability
LLM_ERRORS_TOTAL = Counter(
    "llm_errors_total",
    "Total LLM errors",
    ["agent_id", "model", "error_type"],
)

# Active sessions gauge
ACTIVE_SESSIONS = Gauge(
    "llm_active_sessions",
    "Number of currently active sessions",
    ["agent_id"],
)

# ─── OpenTelemetry Setup ───────────────────────────────────────────────────────

def setup_otel(service_name: str, otel_endpoint: str = "http://otel-collector:4317"):
    """Configure OpenTelemetry Tracing + Metrics với OTLP exporter."""
    # Tracing
    tracer_provider = TracerProvider()
    otlp_span_exporter = OTLPSpanExporter(endpoint=otel_endpoint, insecure=True)
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
    trace.set_tracer_provider(tracer_provider)

    # Metrics
    otlp_metric_exporter = OTLPMetricExporter(endpoint=otel_endpoint, insecure=True)
    metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=15000)
    meter_provider = MeterProvider(metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    return trace.get_tracer(service_name), metrics.get_meter(service_name)

# ─── Instrumented LLM Call Wrapper ────────────────────────────────────────────

COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}

class InstrumentedLLMClient:
    def __init__(self, agent_id: str, tracer, model: str = "gpt-4o"):
        self.agent_id = agent_id
        self.model = model
        self.tracer = tracer

    def calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        rates = COST_PER_1K_TOKENS.get(self.model, {"input": 0.005, "output": 0.015})
        return (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])

    async def chat_completion(
        self,
        messages: list[dict],
        tenant_id: str = "default",
        session_id: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        """LLM call với đầy đủ instrumentation: traces, metrics, cost tracking."""
        start_time = time.perf_counter()

        with self.tracer.start_as_current_span("llm.generate") as span:
            span.set_attribute("llm.model", self.model)
            span.set_attribute("llm.agent_id", self.agent_id)
            span.set_attribute("llm.session_id", session_id or "")
            span.set_attribute("llm.input_messages", len(messages))

            ACTIVE_SESSIONS.labels(agent_id=self.agent_id).inc()

            try:
                # Actual LLM call (swap in your real OpenAI client configuration)
                from openai import AsyncOpenAI
                client = AsyncOpenAI()
                response = await client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    **kwargs,
                )

                latency = time.perf_counter() - start_time
                usage = response.usage
                input_tokens = usage.prompt_tokens
                output_tokens = usage.completion_tokens
                cost = self.calculate_cost(input_tokens, output_tokens)

                # Prometheus metrics
                LLM_REQUEST_DURATION.labels(
                    agent_id=self.agent_id, model=self.model, operation="chat"
                ).observe(latency)

                LLM_TOKENS_TOTAL.labels(
                    agent_id=self.agent_id, model=self.model, token_type="input"
                ).inc(input_tokens)

                LLM_TOKENS_TOTAL.labels(
                    agent_id=self.agent_id, model=self.model, token_type="output"
                ).inc(output_tokens)

                LLM_COST_USD.labels(
                    agent_id=self.agent_id, model=self.model, tenant_id=tenant_id
                ).inc(cost)

                # OTel span attributes
                span.set_attribute("llm.input_tokens", input_tokens)
                span.set_attribute("llm.output_tokens", output_tokens)
                span.set_attribute("llm.cost_usd", cost)
                span.set_attribute("llm.latency_ms", int(latency * 1000))

                logger.info(
                    "llm_call_completed",
                    extra={
                        "agent_id": self.agent_id,
                        "model": self.model,
                        "input_tokens": input_tokens,
                        "output_tokens": output_tokens,
                        "latency_ms": int(latency * 1000),
                        "cost_usd": round(cost, 6),
                        "session_id": session_id,
                    },
                )

                return {"response": response, "cost_usd": cost, "latency_ms": int(latency * 1000)}

            except Exception as e:
                LLM_ERRORS_TOTAL.labels(
                    agent_id=self.agent_id, model=self.model, error_type=type(e).__name__
                ).inc()
                span.record_exception(e)
                span.set_status(trace.StatusCode.ERROR, str(e))
                logger.error("llm_call_failed", extra={"error": str(e), "agent_id": self.agent_id})
                raise
            finally:
                ACTIVE_SESSIONS.labels(agent_id=self.agent_id).dec()

6. Distributed Tracing for Multi-Agent Workflows

6.1. Core Concepts

Concept             | Description                                   | Example in an AI Agent
--------------------|-----------------------------------------------|--------------------------------------------------
Trace               | The full lifecycle of one request             | From the user's message to the final response
Span                | One unit of work within a trace               | "llm.generate", "rag.retrieve", "tool.call"
Parent Span         | A span containing child spans                 | The orchestrator span wraps all sub-agent spans
Context Propagation | Passing trace context across service bounds  | traceparent header over HTTP/gRPC
Correlation ID      | A unique ID linking logs + traces + metrics  | request_id = trace_id

6.2. Python — OpenTelemetry + LangChain Callback Handler

import uuid
import time
import logging
from typing import Any, Optional, Union
from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult, AgentAction, AgentFinish
from opentelemetry import trace, context, baggage
from opentelemetry.propagate import inject, extract
import structlog

logger = structlog.get_logger()
tracer = trace.get_tracer("langchain-agent")

class LangChainOTelCallbackHandler(BaseCallbackHandler):
    """
    LangChain callback handler integrated with OpenTelemetry tracing.
    Automatically creates spans for every LLM call, tool call, and chain run.
    """

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self._span_stack: dict[str, Any] = {}
        self._run_metadata: dict[str, dict] = {}

    def on_llm_start(self, serialized: dict, prompts: list[str], **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", uuid.uuid4()))
        model = serialized.get("kwargs", {}).get("model_name", "unknown")

        span = tracer.start_span(
            "llm.generate",
            attributes={
                "llm.model": model,
                "llm.agent_id": self.agent_id,
                "llm.prompt_count": len(prompts),
                "llm.run_id": run_id,
            },
        )
        # set_span_in_context returns a Context (use_span returns a context
        # manager, which cannot be attached directly)
        ctx = trace.set_span_in_context(span)
        token = context.attach(ctx)

        self._span_stack[run_id] = {"span": span, "token": token, "start_time": time.perf_counter()}
        self._run_metadata[run_id] = {"model": model, "prompts": prompts}

        logger.info("llm_start", agent_id=self.agent_id, model=model, run_id=run_id)

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", ""))
        if run_id not in self._span_stack:
            return

        frame = self._span_stack.pop(run_id)
        span = frame["span"]
        latency_ms = int((time.perf_counter() - frame["start_time"]) * 1000)

        # Extract token usage from the LLMResult
        total_tokens = 0
        input_tokens = 0
        output_tokens = 0

        if response.llm_output:
            token_usage = response.llm_output.get("token_usage", {})
            input_tokens = token_usage.get("prompt_tokens", 0)
            output_tokens = token_usage.get("completion_tokens", 0)
            total_tokens = token_usage.get("total_tokens", 0)

        span.set_attribute("llm.input_tokens", input_tokens)
        span.set_attribute("llm.output_tokens", output_tokens)
        span.set_attribute("llm.total_tokens", total_tokens)
        span.set_attribute("llm.latency_ms", latency_ms)
        span.end()

        context.detach(frame["token"])

        logger.info(
            "llm_end",
            agent_id=self.agent_id,
            run_id=run_id,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=latency_ms,
        )

    def on_llm_error(self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", ""))
        if run_id not in self._span_stack:
            return

        frame = self._span_stack.pop(run_id)
        span = frame["span"]
        span.record_exception(error)
        span.set_status(trace.StatusCode.ERROR, str(error))
        span.end()
        context.detach(frame["token"])

        logger.error("llm_error", agent_id=self.agent_id, error=str(error), run_id=run_id)

    def on_tool_start(self, serialized: dict, input_str: str, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", uuid.uuid4()))
        tool_name = serialized.get("name", "unknown_tool")

        span = tracer.start_span(
            f"tool.{tool_name}",
            attributes={
                "tool.name": tool_name,
                "tool.input_length": len(input_str),
                "llm.agent_id": self.agent_id,
            },
        )
        ctx = trace.set_span_in_context(span)
        token = context.attach(ctx)
        self._span_stack[run_id] = {"span": span, "token": token, "start_time": time.perf_counter()}

        logger.info("tool_start", tool=tool_name, agent_id=self.agent_id)

    def on_tool_end(self, output: str, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", ""))
        if run_id not in self._span_stack:
            return

        frame = self._span_stack.pop(run_id)
        span = frame["span"]
        span.set_attribute("tool.output_length", len(output))
        span.set_attribute("tool.latency_ms", int((time.perf_counter() - frame["start_time"]) * 1000))
        span.end()
        context.detach(frame["token"])

    def on_agent_action(self, action: AgentAction, **kwargs: Any) -> None:
        logger.info(
            "agent_action",
            agent_id=self.agent_id,
            tool=action.tool,
            tool_input=str(action.tool_input)[:200],  # tool_input may be a dict
        )

    def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> None:
        logger.info("agent_finish", agent_id=self.agent_id, output_keys=list(finish.return_values.keys()))


# ─── Context Propagation qua HTTP ─────────────────────────────────────────────

def create_propagated_headers() -> dict:
    """Create HTTP headers carrying the W3C traceparent so trace context
    propagates to the next service."""
    headers: dict = {}
    inject(headers)  # OTel injects traceparent + tracestate automatically
    return headers

def extract_trace_context(incoming_headers: dict) -> Any:
    """Extract trace context từ inbound HTTP request."""
    return extract(incoming_headers)
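
For completeness, a sketch of how these helpers might be wired into an outbound HTTP call (httpx is an arbitrary choice of client, not mandated here):

import httpx

async def call_downstream_agent(url: str, payload: dict) -> dict:
    """Call another agent service, joining its spans to the current trace."""
    headers = create_propagated_headers()  # carries traceparent/tracestate
    async with httpx.AsyncClient() as client:
        resp = await client.post(url, json=payload, headers=headers)
        resp.raise_for_status()
        return resp.json()

# On the receiving side (e.g. a FastAPI handler), restore the caller's
# context before opening new spans:
#   ctx = extract_trace_context(dict(request.headers))
#   with tracer.start_as_current_span("agent.handle", context=ctx):
#       ...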

7. Structured Logging for AI Agents

7.1. Standard JSON Log Schema

{
  "timestamp": "2026-05-14T10:23:45.123456Z",
  "level": "INFO",
  "service": "rag-agent-service",
  "version": "2.1.0",
  "environment": "production",

  "request_id": "req-4bf92f35-77b3-4da6",
  "session_id": "sess-a3ce929d-0e0e-4736",
  "correlation_id": "corr-00f067aa-0ba9-02b7",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",

  "agent_id": "rag_agent_v2",
  "tenant_id": "tenant-healthcare-001",
  "user_id": "user-hashed-789xyz",

  "model": "gpt-4o",
  "model_version": "2024-11-20",
  "operation": "chat_completion",

  "prompt_tokens": 1840,
  "completion_tokens": 420,
  "total_tokens": 2260,
  "cost_usd": 0.004530,

  "latency_ms": 2050,
  "ttft_ms": 380,
  "queue_wait_ms": 12,

  "guardrail_status": "passed",
  "guardrail_checks": {
    "prompt_injection": "clean",
    "pii_detection": "no_pii",
    "topic_filter": "in_scope",
    "toxicity": "clean"
  },

  "tool_calls": [
    {"name": "search_knowledge_base", "latency_ms": 145, "status": "success"},
    {"name": "get_product_info", "latency_ms": 89, "status": "success"}
  ],

  "rag_context": {
    "chunks_retrieved": 5,
    "top_similarity_score": 0.92,
    "retrieval_latency_ms": 145
  },

  "quality_scores": {
    "groundedness": 0.88,
    "hallucination_probability": 0.08,
    "relevance": 0.91
  },

  "error": null,
  "error_type": null,
  "retry_count": 0
}

7.2. Python Structlog Setup

import sys
import logging
import structlog
from opentelemetry import trace

def configure_structured_logging(
    service_name: str,
    environment: str = "production",
    log_level: str = "INFO",
) -> None:
    """Cấu hình structlog với OTel trace context injection."""

    # Processor chain: runs over every log record before output
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,         # Thread-local context
            structlog.stdlib.add_log_level,                   # level field
            structlog.stdlib.add_logger_name,                 # logger field
            structlog.processors.TimeStamper(fmt="iso"),      # ISO timestamp
            _inject_otel_context,                             # trace_id + span_id
            _add_service_metadata(service_name, environment), # service + env
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),              # JSON output
        ],
        wrapper_class=structlog.make_filtering_bound_logger(
            getattr(logging, log_level.upper())
        ),
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(sys.stdout),
    )

def _inject_otel_context(logger, method_name: str, event_dict: dict) -> dict:
    """Inject OTel trace_id và span_id vào mọi log record."""
    current_span = trace.get_current_span()
    if current_span and current_span.is_recording():
        ctx = current_span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

def _add_service_metadata(service_name: str, environment: str):
    def processor(logger, method_name: str, event_dict: dict) -> dict:
        event_dict["service"] = service_name
        event_dict["environment"] = environment
        return event_dict
    return processor

# Usage:
# configure_structured_logging("rag-agent-service", "production")
# log = structlog.get_logger()
# log.info("llm_call_completed", agent_id="rag_agent", latency_ms=2050, cost_usd=0.0045)

7.3. Elasticsearch Index Mapping

# elasticsearch-index-mapping.yaml
---
index_template:
  name: "ai-agent-logs"
  index_patterns:
    - "ai-agent-logs-*"
  
  settings:
    number_of_shards: 3
    number_of_replicas: 1
    refresh_interval: "5s"
    
    index:
      lifecycle:
        name: "ai-agent-logs-ilm-policy"
        rollover_alias: "ai-agent-logs"
    
    analysis:
      analyzer:
        custom_log_analyzer:
          type: standard
          stopwords: "_none_"

  mappings:
    dynamic: false
    properties:
      "@timestamp":        { type: date }
      timestamp:           { type: date }
      level:               { type: keyword }
      service:             { type: keyword }
      environment:         { type: keyword }
      version:             { type: keyword }

      request_id:          { type: keyword }
      session_id:          { type: keyword }
      trace_id:            { type: keyword }
      span_id:             { type: keyword }
      correlation_id:      { type: keyword }

      agent_id:            { type: keyword }
      tenant_id:           { type: keyword }
      user_id:             { type: keyword }
      model:               { type: keyword }
      operation:           { type: keyword }

      prompt_tokens:       { type: integer }
      completion_tokens:   { type: integer }
      total_tokens:        { type: integer }
      cost_usd:            { type: float }

      latency_ms:          { type: integer }
      ttft_ms:             { type: integer }
      queue_wait_ms:       { type: integer }

      guardrail_status:    { type: keyword }
      error:               { type: text, analyzer: custom_log_analyzer }
      error_type:          { type: keyword }
      retry_count:         { type: short }

      hallucination_probability: { type: float }
      groundedness:              { type: float }
      relevance:                 { type: float }

      tool_calls:
        type: nested
        properties:
          name:        { type: keyword }
          latency_ms:  { type: integer }
          status:      { type: keyword }

# ILM Policy
ilm_policy:
  name: "ai-agent-logs-ilm-policy"
  phases:
    hot:
      min_age: "0ms"
      actions:
        rollover:
          max_primary_shard_size: "50gb"
          max_age: "1d"
        set_priority:
          priority: 100
    warm:
      min_age: "7d"
      actions:
        shrink:
          number_of_shards: 1
        forcemerge:
          max_num_segments: 1
        set_priority:
          priority: 50
    cold:
      min_age: "30d"
      actions:
        freeze: {}
        set_priority:
          priority: 0
    delete:
      min_age: "90d"
      actions:
        delete: {}

7.4. Kibana/Elasticsearch Query Examples

// Query 1: Find slow requests (latency > 5s)
{
  "query": {
    "bool": {
      "must": [
        { "term": { "environment": "production" } },
        { "range": { "latency_ms": { "gte": 5000 } } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "latency_ms": "desc" }],
  "size": 20
}

// Query 2: High-cost sessions today
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now/d" } } },
        { "range": { "cost_usd": { "gte": 0.10 } } }
      ]
    }
  },
  "aggs": {
    "by_session": {
      "terms": { "field": "session_id", "size": 20 },
      "aggs": {
        "total_cost": { "sum": { "field": "cost_usd" } },
        "total_tokens": { "sum": { "field": "total_tokens" } }
      }
    }
  },
  "size": 0
}

// Query 3: Failed tool calls by agent
{
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-6h" } } }
      ],
      "filter": [
        {
          "nested": {
            "path": "tool_calls",
            "query": {
              "term": { "tool_calls.status": "failed" }
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_agent": {
      "terms": { "field": "agent_id" },
      "aggs": {
        "failed_tools": {
          "nested": { "path": "tool_calls" },
          "aggs": {
            "failed_only": {
              "filter": { "term": { "tool_calls.status": "failed" } },
              "aggs": {
                "tool_names": { "terms": { "field": "tool_calls.name" } }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
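
These queries can also be issued from Python with the official client. A sketch for Query 1 (the endpoint URL is environment-specific, an assumption here):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

# Query 1 from above: slow production requests in the last hour
resp = es.search(
    index="ai-agent-logs-*",
    query={
        "bool": {
            "must": [
                {"term": {"environment": "production"}},
                {"range": {"latency_ms": {"gte": 5000}}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    sort=[{"latency_ms": "desc"}],
    size=20,
)
for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src["request_id"], src["latency_ms"])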

8. Grafana Dashboard — 5 Panel Groups

8.1. Overview of the 5 Dashboard Panels

Panel         | Description                           | Metrics source  | Visualisation
--------------|---------------------------------------|-----------------|---------------------
Overview      | RPS, error rate, avg latency          | Prometheus      | Stat + Time series
Token Economy | Cost/hour, token distribution         | Prometheus      | Bar gauge + Heatmap
Quality       | Hallucination rate, guardrail blocks  | Prometheus      | Time series + Alert
Agent Health  | Per-agent latency heatmap             | Prometheus      | Heatmap
Business KPI  | Task completion, escalation funnel    | Prometheus + ES | Stat + Bar chart

8.2. Grafana Dashboard JSON Config (Partial)

{
  "title": "AI Agent — LLMOps Dashboard",
  "uid": "llmops-main-dashboard",
  "tags": ["ai-agent", "llmops", "production"],
  "refresh": "30s",
  "time": { "from": "now-3h", "to": "now" },

  "panels": [
    {
      "id": 1,
      "title": "🟢 Requests Per Second",
      "type": "stat",
      "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "sum(rate(llm_request_duration_seconds_count[2m]))",
          "legendFormat": "RPS"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 100 },
              { "color": "red", "value": 500 }
            ]
          },
          "unit": "reqps"
        }
      }
    },
    {
      "id": 2,
      "title": "🔴 Error Rate (%)",
      "type": "stat",
      "gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "100 * sum(rate(llm_errors_total[5m])) / sum(rate(llm_request_duration_seconds_count[5m]))",
          "legendFormat": "Error Rate %"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 5 }
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "⏱ Latency P95 (ms)",
      "type": "timeseries",
      "gridPos": { "x": 0, "y": 4, "w": 12, "h": 8 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "histogram_quantile(0.95, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000",
          "legendFormat": "P95 - {{agent_id}}"
        },
        {
          "datasource": "prometheus",
          "expr": "histogram_quantile(0.50, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000",
          "legendFormat": "P50 - {{agent_id}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "ms" }
      }
    },
    {
      "id": 4,
      "title": "💰 Cost Per Hour (USD)",
      "type": "timeseries",
      "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "sum by(agent_id) (rate(llm_cost_usd_total[1h])) * 3600",
          "legendFormat": "Cost/hr - {{agent_id}}"
        }
      ],
      "fieldConfig": {
        "defaults": { "unit": "currencyUSD" }
      }
    },
    {
      "id": 5,
      "title": "🧠 Hallucination Rate (%)",
      "type": "timeseries",
      "gridPos": { "x": 0, "y": 12, "w": 12, "h": 8 },
      "targets": [
        {
          "datasource": "prometheus",
          "expr": "100 * histogram_quantile(0.90, rate(llm_hallucination_score_bucket[10m]))",
          "legendFormat": "Hallucination P90"
        }
      ],
      "alert": {
        "conditions": [
          {
            "type": "query",
            "query": { "params": ["A", "10m", "now"] },
            "reducer": { "type": "avg" },
            "evaluator": { "type": "gt", "params": [10] }
          }
        ],
        "name": "High Hallucination Rate Alert"
      }
    }
  ]
}

9. Alerting Strategy — 8 Essential Alert Rules

9.1. Prometheus AlertManager Config

# alertmanager-rules.yaml
---
groups:
  - name: llmops_critical
    rules:
      # Alert 1: Cost spike — daily spend exceeds 150% of the 7-day baseline
      - alert: LLMCostSpike
        expr: |
          (
            sum(increase(llm_cost_usd_total[24h]))
            /
            sum(increase(llm_cost_usd_total[24h] offset 7d))
          ) > 1.5
        for: 15m
        labels:
          severity: critical
          team: llmops
        annotations:
          summary: "💰 LLM Cost Spike Detected"
          description: "Daily cost is {{ humanize $value | printf \"%.0f%%\" }} of 7-day average. Current: ${{ $value }}"
          runbook: "https://wiki.company.com/runbooks/llm-cost-spike"

      # Alert 2: Latency P95 > 5s sustained 5 minutes
      - alert: LLMHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
          team: llmops
        annotations:
          summary: "⏱ LLM P95 Latency High: {{ $labels.agent_id }}"
          description: "P95 latency is {{ $value | humanizeDuration }} for agent {{ $labels.agent_id }}"

      # Alert 3: Error rate > 5% over 10 minutes
      - alert: LLMHighErrorRate
        expr: |
          (
            sum by(agent_id) (rate(llm_errors_total[10m]))
            /
            sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
          ) * 100 > 5
        for: 10m
        labels:
          severity: critical
          team: llmops
        annotations:
          summary: "🔴 LLM Error Rate > 5%: {{ $labels.agent_id }}"
          description: "Error rate is {{ $value | printf \"%.1f%%\" }} for agent {{ $labels.agent_id }}"

      # Alert 4: Hallucination Rate > 10% (sampled evaluation)
      - alert: LLMHallucinationRateHigh
        expr: |
          histogram_quantile(0.90,
            sum by(le, agent_id) (rate(llm_hallucination_score_bucket[15m]))
          ) > 0.10
        for: 10m
        labels:
          severity: critical
          team: ai-quality
        annotations:
          summary: "🧠 Hallucination Rate Spike: {{ $labels.agent_id }}"
          description: "P90 hallucination score is {{ $value | printf \"%.2f\" }} — review recent prompts/model"

      # Alert 5: Guardrail Block Surge > 20% in 15 minutes
      - alert: LLMGuardrailBlockSurge
        expr: |
          (
            sum by(agent_id) (rate(llm_guardrail_decisions_total{decision="block"}[15m]))
            /
            sum by(agent_id) (rate(llm_request_duration_seconds_count[15m]))
          ) * 100 > 20
        for: 5m
        labels:
          severity: warning
          team: llmops
        annotations:
          summary: "🛡 Guardrail Block Surge: {{ $labels.agent_id }}"
          description: "{{ $value | printf \"%.1f%%\" }} of requests blocked — possible attack or prompt issue"

      # Alert 6: Token Quota Approaching 80% of Daily Limit
      - alert: LLMTokenQuotaWarning
        expr: |
          (
            sum by(tenant_id) (increase(llm_tokens_total[24h]))
            /
            on(tenant_id) llm_token_daily_quota
          ) * 100 > 80
        for: 0m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "📊 Token Quota Warning: {{ $labels.tenant_id }}"
          description: "Tenant {{ $labels.tenant_id }} has used {{ $value | printf \"%.0f%%\" }} of daily token quota"

      # Alert 7: Circuit Breaker OPEN
      - alert: LLMCircuitBreakerOpen
        expr: llm_circuit_breaker_state{state="open"} == 1
        for: 2m
        labels:
          severity: critical
          team: llmops
        annotations:
          summary: "⚡ Circuit Breaker OPEN: {{ $labels.agent_id }}"
          description: "LLM circuit breaker opened for {{ $labels.agent_id }} — service may be degraded"

      # Alert 8: Memory/Context Overflow Rate Spike
      - alert: LLMContextOverflowSpike
        expr: |
          (
            sum by(agent_id) (rate(llm_errors_total{error_type="context_length_exceeded"}[10m]))
            /
            sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: warning
          team: llmops
        annotations:
          summary: "💾 Context Overflow Spike: {{ $labels.agent_id }}"
          description: "{{ $value | printf \"%.1f%%\" }} requests hitting context limit — review chunking/truncation strategy"

# alertmanager.yaml — Routing + Slack Webhook
---
route:
  group_by: ['alertname', 'agent_id']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-llmops'

  routes:
    - match:
        severity: critical
      receiver: 'slack-critical-llmops'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        team: ai-quality
      receiver: 'slack-ai-quality'

receivers:
  - name: 'slack-llmops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llmops-alerts'
        title: '{{ template "slack.title" . }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Details:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook }}
          {{ end }}
        send_resolved: true

  - name: 'slack-critical-llmops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llmops-critical'
        color: 'danger'
        title: '🚨 CRITICAL: {{ template "slack.title" . }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Details:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook }}
          {{ end }}
        send_resolved: true

  - name: 'slack-ai-quality'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#ai-quality-alerts'
        title: '{{ template "slack.title" . }}'
        send_resolved: true
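
Alert 7 reads an llm_circuit_breaker_state gauge that the instrumentation in section 5.6 does not define. A minimal sketch of how it might be exported — the metric name follows the alert rule, but the helper itself is an assumption:

from prometheus_client import Gauge

CIRCUIT_BREAKER_STATE = Gauge(
    "llm_circuit_breaker_state",
    "Circuit breaker state per agent (1 = currently in this state)",
    ["agent_id", "state"],  # state: closed | open | half_open
)

def set_circuit_breaker_state(agent_id: str, state: str) -> None:
    """Set exactly one state label to 1 so the alert can match state="open"."""
    for s in ("closed", "open", "half_open"):
        CIRCUIT_BREAKER_STATE.labels(agent_id=agent_id, state=s).set(
            1.0 if s == state else 0.0
        )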

10. A/B Testing Prompts & Model Routing

10.1. Traffic-Splitting Architecture

                    INCOMING REQUESTS
                          │
                          ▼
              ┌───────────────────────┐
              │   FEATURE FLAG        │
              │   SERVICE             │
              │   (LaunchDarkly /     │
              │    self-hosted)       │
              └──────────┬────────────┘
                         │
           ┌─────────────┼──────────────┐
           │ 90%         │ 10%          │
           ▼             ▼              │
     ┌──────────┐  ┌──────────┐        │
     │ Prompt A │  │ Prompt B │   Shadow Mode
     │ (control)│  │(canary)  │        │
     └────┬─────┘  └────┬─────┘        │
          │             │          ┌───▼───────┐
          ▼             ▼          │ Duplicate │
     LLM Response  LLM Response   │ Request   │
                                   │ (no user  │
       Track:                      │  impact)  │
       - Latency                   └─────┬─────┘
       - Quality score                   │
       - Cost                            ▼
       - User satisfaction         Evaluation
                                   (offline)

10.2. Python — Model Router with Weighted Random Selection

import random
import time
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Callable
import structlog

logger = structlog.get_logger()

class RoutingStrategy(Enum):
    WEIGHTED_RANDOM = "weighted_random"
    TENANT_BASED = "tenant_based"
    TASK_COMPLEXITY = "task_complexity"
    CANARY = "canary"
    SHADOW = "shadow"

@dataclass
class ModelConfig:
    model: str
    weight: float            # 0.0-1.0; weights across a config set must sum to 1.0
    variant_name: str        # "control", "canary_v2", "shadow"
    max_tokens: int = 4096
    temperature: float = 0.7
    extra_params: dict = field(default_factory=dict)

@dataclass
class RoutingDecision:
    model_config: ModelConfig
    strategy_used: str
    routing_reason: str
    experiment_id: Optional[str] = None

class AIAgentModelRouter:
    """
    Model router supporting several strategies:
    - A/B testing (weighted random)
    - Per-tenant routing
    - Task-complexity routing
    - Shadow mode (duplicated traffic)
    """

    def __init__(self):
        # A/B test configurations
        self._ab_experiments: dict[str, list[ModelConfig]] = {}

        # Tenant-specific routing
        self._tenant_routing: dict[str, ModelConfig] = {}

        # Default routing by task type
        self._task_routing: dict[str, ModelConfig] = {
            "simple_faq": ModelConfig(
                model="gpt-4o-mini", weight=1.0, variant_name="control",
                max_tokens=1024, temperature=0.3,
            ),
            "complex_analysis": ModelConfig(
                model="gpt-4o", weight=1.0, variant_name="control",
                max_tokens=4096, temperature=0.7,
            ),
            "sensitive_medical": ModelConfig(
                model="ollama/llama3.1", weight=1.0, variant_name="on_premise",
                max_tokens=2048, temperature=0.1,
            ),
            "code_generation": ModelConfig(
                model="claude-3-5-sonnet", weight=1.0, variant_name="control",
                max_tokens=4096, temperature=0.2,
            ),
        }

    def register_ab_experiment(
        self,
        experiment_id: str,
        configs: list[ModelConfig],
    ) -> None:
        """Đăng ký A/B experiment với weighted configs."""
        total_weight = sum(c.weight for c in configs)
        if abs(total_weight - 1.0) > 0.001:
            raise ValueError(f"Weights must sum to 1.0, got {total_weight}")
        self._ab_experiments[experiment_id] = configs
        logger.info("ab_experiment_registered", experiment_id=experiment_id,
                    variants=[c.variant_name for c in configs])

    def route(
        self,
        task_type: str,
        tenant_id: str = "default",
        session_id: str = "",
        experiment_id: Optional[str] = None,
        force_strategy: Optional[RoutingStrategy] = None,
    ) -> RoutingDecision:
        """Chọn model config dựa trên chiến lược routing."""

        # 1. Tenant-specific override (highest priority)
        if tenant_id in self._tenant_routing and not experiment_id:
            config = self._tenant_routing[tenant_id]
            return RoutingDecision(
                model_config=config,
                strategy_used=RoutingStrategy.TENANT_BASED.value,
                routing_reason=f"Tenant {tenant_id} has dedicated model",
            )

        # 2. A/B experiment (when an experiment_id is provided)
        if experiment_id and experiment_id in self._ab_experiments:
            configs = self._ab_experiments[experiment_id]

            # Sticky routing: same session_id → same variant (consistent UX)
            if session_id:
                hash_val = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
                bucket = (hash_val % 1000) / 1000.0
            else:
                bucket = random.random()

            cumulative = 0.0
            for config in configs:
                cumulative += config.weight
                if bucket <= cumulative:
                    logger.info(
                        "ab_routing",
                        experiment_id=experiment_id,
                        variant=config.variant_name,
                        model=config.model,
                        session_id=session_id,
                    )
                    return RoutingDecision(
                        model_config=config,
                        strategy_used=RoutingStrategy.WEIGHTED_RANDOM.value,
                        routing_reason=f"A/B bucket {bucket:.3f} → {config.variant_name}",
                        experiment_id=experiment_id,
                    )

        # 3. Task complexity routing (fallback)
        config = self._task_routing.get(
            task_type,
            ModelConfig(model="gpt-4o-mini", weight=1.0, variant_name="default")
        )

        return RoutingDecision(
            model_config=config,
            strategy_used=RoutingStrategy.TASK_COMPLEXITY.value,
            routing_reason=f"Task type '{task_type}' → {config.model}",
        )


# ─── Sample Usage ──────────────────────────────────────────────────────────────

router = AIAgentModelRouter()

# Register an A/B experiment: 90% prompt A (gpt-4o-mini) vs 10% prompt B (gpt-4o)
router.register_ab_experiment(
    experiment_id="exp_prompt_v2_vs_v3",
    configs=[
        ModelConfig(model="gpt-4o-mini", weight=0.90, variant_name="prompt_v2_control"),
        ModelConfig(model="gpt-4o",      weight=0.10, variant_name="prompt_v3_canary"),
    ],
)

decision = router.route(
    task_type="simple_faq",
    tenant_id="tenant-abc",
    session_id="sess-xyz789",
    experiment_id="exp_prompt_v2_vs_v3",
)

print(f"Model: {decision.model_config.model}")
print(f"Variant: {decision.model_config.variant_name}")
print(f"Strategy: {decision.strategy_used}")

10.3. Sample A/B Test Results

Metric                    | Prompt A (control) | Prompt B (canary) | Δ       | Verdict
--------------------------|--------------------|-------------------|---------|--------------------
Latency P95 (ms)          | 1,820              | 2,340             | +28.6%  | ❌ B slower
Quality Score (LLM Judge) | 3.8/5              | 4.3/5             | +13.2%  | ✅ B better
Cost/request (USD)        | $0.0021            | $0.0047           | +123.8% | ❌ B more expensive
User satisfaction (CSAT)  | 76%                | 83%               | +7%     | ✅ B better
Task completion rate      | 88%                | 92%               | +4%     | ✅ B better
Hallucination rate        | 4.2%               | 1.8%              | -57%    | ✅ B safer
Guardrail block rate      | 1.8%               | 1.2%              | -33%    | ✅ B cleaner

Conclusion: Prompt B (canary) delivers significantly better quality but costs over 2× as much. Decision: roll out prompt B to premium tenants (happy to pay), keep prompt A for the free tier.
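
Before acting on a table like this, it is worth checking that the deltas are not noise. A sketch of a two-proportion z-test for the CSAT difference; the per-variant sample sizes below are hypothetical:

from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p_a: float, n_a: int, p_b: float, n_b: int) -> float:
    """Two-sided p-value for the difference between two proportions."""
    p_pool = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# CSAT 76% vs 83%, assuming 9,000 control and 1,000 canary sessions:
print(two_proportion_p_value(0.76, 9_000, 0.83, 1_000))  # ≈ 7e-07 → significant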


11. Model Routing by Task Type

11.1. Decision Matrix

Task type              | Complexity | Suggested model            | Cost/1K tokens  | Latency P95  | Notes
-----------------------|------------|----------------------------|-----------------|--------------|------------------------------
Simple FAQ             | Low        | GPT-4o-mini / Gemini Flash | $0.00015        | < 500ms      | 80% of traffic
Text summarization     | Low-Medium | GPT-4o-mini                | $0.00015        | < 800ms      |
Analysis, comparison   | Medium     | GPT-4o / Claude 3.5 Sonnet | $0.005          | < 2s         |
Complex reasoning      | High       | GPT-4o / Claude 3.5        | $0.005          | < 3s         | 15% of traffic
Code generation        | High       | Claude 3.5 Sonnet          | $0.003          | < 3s         |
Medical/sensitive data | Any        | Ollama on-premise          | $0 (infra cost) | < 2s         | Data never leaves the server
Real-time chat         | Low        | GPT-4o-mini (streaming)    | $0.00015        | TTFT < 200ms |
Batch processing       | Any        | GPT-4o Batch API           | 50% discount    | Hours        | Not real-time

11.2. Cost vs. Quality Trade-off

Provider     | Model             | Input $/1M | Output $/1M | Quality Score | Latency   | Data Privacy | Best fit
-------------|-------------------|------------|-------------|---------------|-----------|--------------|--------------------------
OpenAI       | GPT-4o-mini       | $0.15      | $0.60       | 4.0/5         | Fast      | Cloud        | General, cost-sensitive
OpenAI       | GPT-4o            | $5.00      | $15.00      | 4.7/5         | Medium    | Cloud        | Complex reasoning
Anthropic    | Claude 3 Haiku    | $0.25      | $1.25       | 4.0/5         | Fast      | Cloud        | Safe, structured output
Anthropic    | Claude 3.5 Sonnet | $3.00      | $15.00      | 4.8/5         | Medium    | Cloud        | High quality, coding
Google       | Gemini 1.5 Flash  | $0.075     | $0.30       | 3.9/5         | Very fast | Cloud        | Ultra low cost
Azure OpenAI | GPT-4o            | $5.00      | $15.00      | 4.7/5         | Medium    | Cloud (VNet) | Enterprise compliance
Ollama       | Llama 3.1 70B     | $0 (GPU)   | $0 (GPU)    | 4.0/5         | Medium    | On-premise   | Healthcare, banking
Ollama       | Qwen2.5 7B        | $0 (GPU)   | $0 (GPU)    | 3.6/5         | Fast      | On-premise   | Zero-cost, simpler tasks

12. Sampling Strategy for Production

12.1. The Problem

Tracing at 100% sampling in a production AI Agent means:

  • 10,000 requests/day × 5 spans/request = 50,000 spans/day
  • Storage: ~2KB/span × 50,000 = 100MB/day of traces
  • 3 months: ~9GB for trace data alone
  • Jaeger + object storage cost: ~$50-100/month

The solution: adaptive (tail-based) sampling.

12.2. Sampling Strategy

Request type                | Sampling rate | Rationale
----------------------------|---------------|----------------------------
Error requests              | 100%          | Full debugging needed
Slow requests (P95+)        | 100%          | Performance investigation
High-cost requests (>$0.10) | 100%          | Cost audit
Guardrail blocked           | 100%          | Security audit
Normal successful requests  | 10%           | Statistical representation
Health checks / internal    | 0%            | Noise reduction

Estimated trace storage (10,000 req/day):

Error rate 2% = 200 requests → 200 × 5 spans × 2KB = 2MB
Slow rate 5%  = 500 requests → 500 × 5 spans × 2KB = 5MB
Normal 10%    = 930 requests → 930 × 5 spans × 2KB = 9.3MB
Total/day ≈ 16.3MB  (vs 100MB at 100% sampling)
Savings: ~84%

12.3. Python OTel Adaptive Sampler

import random
from opentelemetry.sdk.trace.sampling import (
    Sampler,
    SamplingResult,
    Decision,
    ALWAYS_ON,
    ALWAYS_OFF,
)
from opentelemetry.trace import SpanKind
from opentelemetry.context import Context
from opentelemetry.util.types import Attributes

class AdaptiveLLMSampler(Sampler):
    """
    Adaptive sampler for LLM workloads.
    - Errors: 100%
    - Slow requests: 100%
    - Normal traffic: configurable rate (default 10%)
    Note: a Sampler decides at span start, so latency/cost attributes must be
    supplied at creation time; otherwise run the equivalent rules as tail-based
    sampling in the OTel Collector.
    """

    def __init__(
        self,
        normal_sample_rate: float = 0.10,
        slow_threshold_ms: float = 3000.0,
        high_cost_threshold_usd: float = 0.10,
    ):
        self.normal_sample_rate = normal_sample_rate
        self.slow_threshold_ms = slow_threshold_ms
        self.high_cost_threshold_usd = high_cost_threshold_usd

    def should_sample(
        self,
        parent_context: Context,
        trace_id: int,
        name: str,
        kind: SpanKind = SpanKind.INTERNAL,
        attributes: Attributes = None,
        links: list = None,
        trace_state: object = None,
    ) -> SamplingResult:
        attrs = attributes or {}

        # Rule 1: Always sample errors
        if attrs.get("error", False) or attrs.get("http.status_code", 200) >= 500:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 2: Always sample slow requests
        latency_ms = attrs.get("llm.latency_ms", 0)
        if latency_ms > self.slow_threshold_ms:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 3: Always sample high-cost requests
        cost_usd = attrs.get("llm.cost_usd", 0)
        if cost_usd > self.high_cost_threshold_usd:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 4: Always sample guardrail blocks
        if attrs.get("guardrail.decision") == "block":
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        # Rule 5: Normal sampling (10%)
        if random.random() < self.normal_sample_rate:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return f"AdaptiveLLMSampler(normal={self.normal_sample_rate})"

# Sử dụng trong TracerProvider:
# from opentelemetry.sdk.trace import TracerProvider
# provider = TracerProvider(sampler=AdaptiveLLMSampler(normal_sample_rate=0.10))

13. LLM Cost Management

13.1. Budgeting Per Tenant / Project

```python
import time
from dataclasses import dataclass

import redis
import structlog

logger = structlog.get_logger()


@dataclass
class BudgetConfig:
    tenant_id: str
    daily_budget_usd: float
    monthly_budget_usd: float
    daily_token_limit: int
    alert_threshold_pct: float = 0.80  # warn when 80% of the budget is used
    hard_stop: bool = True             # block calls once the budget is exceeded


class LLMBudgetGuard:
    """
    Middleware that checks the budget before every LLM call.
    Uses Redis to track spending in real time.
    """

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self._budgets: dict[str, BudgetConfig] = {}

    def register_budget(self, config: BudgetConfig) -> None:
        self._budgets[config.tenant_id] = config
        logger.info("budget_registered",
                    tenant_id=config.tenant_id,
                    daily_limit_usd=config.daily_budget_usd)

    def _get_today_key(self, tenant_id: str) -> str:
        today = time.strftime("%Y-%m-%d")
        return f"llm_budget:daily:{tenant_id}:{today}"

    def _get_month_key(self, tenant_id: str) -> str:
        month = time.strftime("%Y-%m")
        return f"llm_budget:monthly:{tenant_id}:{month}"

    def check_budget(self, tenant_id: str, estimated_cost_usd: float) -> dict:
        """
        Check the budget before calling the LLM.
        Returns: {"allowed": bool, "reason": str, "remaining_usd": float}
        """
        config = self._budgets.get(tenant_id)
        if not config:
            return {"allowed": True, "reason": "no_budget_configured", "remaining_usd": float("inf")}

        daily_key = self._get_today_key(tenant_id)
        current_daily = float(self.redis.get(daily_key) or 0)
        projected_daily = current_daily + estimated_cost_usd

        # Hard stop check
        if config.hard_stop and projected_daily > config.daily_budget_usd:
            logger.warning(
                "budget_exceeded",
                tenant_id=tenant_id,
                current_cost=current_daily,
                daily_limit=config.daily_budget_usd,
            )
            return {
                "allowed": False,
                "reason": "daily_budget_exceeded",
                "remaining_usd": max(0, config.daily_budget_usd - current_daily),
            }

        # Alert threshold check
        if projected_daily > config.daily_budget_usd * config.alert_threshold_pct:
            logger.warning(
                "budget_threshold_warning",
                tenant_id=tenant_id,
                pct_used=projected_daily / config.daily_budget_usd,
            )

        return {
            "allowed": True,
            "reason": "within_budget",
            "remaining_usd": config.daily_budget_usd - current_daily,
        }

    def record_usage(self, tenant_id: str, actual_cost_usd: float) -> None:
        """Record the actual cost after the LLM call completes."""
        daily_key = self._get_today_key(tenant_id)
        month_key = self._get_month_key(tenant_id)

        pipe = self.redis.pipeline()
        pipe.incrbyfloat(daily_key, actual_cost_usd)
        pipe.expire(daily_key, 86400 * 2)    # 2-day TTL
        pipe.incrbyfloat(month_key, actual_cost_usd)
        pipe.expire(month_key, 86400 * 35)   # 35-day TTL
        pipe.execute()
```
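
In a request path, the guard wraps every LLM call: check before, record after. A minimal sketch of the wiring, where `call_llm` and the pre-call estimate are placeholders rather than part of the guard:

```python
import redis

def call_llm(prompt: str) -> tuple[str, float]:
    """Placeholder for a real LLM client; returns (response, actual_cost_usd)."""
    return "ok", 0.008

guard = LLMBudgetGuard(redis.Redis(host="localhost", port=6379))
guard.register_budget(BudgetConfig(
    tenant_id="acme",
    daily_budget_usd=50.0,
    monthly_budget_usd=1_000.0,
    daily_token_limit=2_000_000,
))

def handle_request(tenant_id: str, prompt: str) -> str:
    estimated = 0.01  # hypothetical pre-call estimate (e.g., prompt length × price)
    verdict = guard.check_budget(tenant_id, estimated)
    if not verdict["allowed"]:
        return f"Request rejected: {verdict['reason']}"

    response, actual_cost = call_llm(prompt)
    guard.record_usage(tenant_id, actual_cost)  # record what was actually spent
    return response
```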

13.2. Tier Pricing Comparison

| Criterion | OpenAI GPT-4o | Anthropic Claude 3.5 | Azure OpenAI | Ollama Self-hosted |
|---|---|---|---|---|
| Input price | $5/1M tokens | $3/1M tokens | $5/1M tokens | ~$0.15/1M (GPU cost) |
| Output price | $15/1M tokens | $15/1M tokens | $15/1M tokens | ~$0.15/1M (GPU cost) |
| Data privacy | OpenAI servers | Anthropic servers | Azure VNet | Fully on-premise |
| Compliance | SOC 2, GDPR (opt-out) | SOC 2, HIPAA add-on | HIPAA, FedRAMP | Self-managed |
| Rate limits | 10K RPM | 5K RPM | Custom | Unlimited |
| SLA uptime | 99.9% | 99.9% | 99.9% | Self-managed |
| Setup complexity | Low | Low | Medium | High (GPU infra) |
| Upfront cost | $0 | $0 | Azure subscription | GPU server ~$2,000+ |
| Cost at 1M requests/day | ~$3,500/day | ~$2,100/day | ~$3,500/day | ~$50/day (amortized) |
| Best fit | General, startups | High quality | Enterprise | Healthcare, banking |
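
The per-day cost figures in the table depend on an assumed traffic profile that is not stated. A small helper makes that assumption explicit (a sketch; the ~500 input / ~65 output tokens per request are illustrative numbers, not provider data):

```python
PRICES_PER_1M = {  # USD per 1M tokens (input, output), from the table above
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def daily_cost_usd(model: str, requests_per_day: int,
                   avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Projected daily spend for a given traffic profile."""
    in_price, out_price = PRICES_PER_1M[model]
    per_request = (avg_input_tokens * in_price
                   + avg_output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day

# Illustrative profile: ~500 input + ~65 output tokens per request
print(daily_cost_usd("gpt-4o", 1_000_000, 500, 65))  # ≈ $3,475/day, i.e. ~$3,500
```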

14. Incident Response for AI Agents

14.1. Runbook — When the Hallucination Rate Rises

```
INCIDENT: Hallucination Rate > 10%
════════════════════════════════════
T+0min:  Alert received via Slack #llmops-critical
T+2min:  On-call engineer acknowledges the alert

INVESTIGATION STEPS:
1. Grafana → Quality Dashboard → Hallucination Timeline
   - Determine: when did it start? All agents or one specific agent?
   - Review the top sessions with the highest hallucination_score

2. Elasticsearch query:
   GET ai-agent-logs-*/_search
   { "query": { "range": { "hallucination_probability": { "gte": 0.3 } } },
     "sort": [{"@timestamp": "desc"}], "size": 20 }

3. Check: any recent prompt version change?
   git log --oneline prompts/ | head -20

4. Check: did the model provider update the model?
   - OpenAI model version log
   - Pinned model version in config

MITIGATION:
- If caused by a prompt change → roll back the prompt version immediately
- If caused by a model update → pin a specific model version (gpt-4o-2024-11-20)
- If the cause is still unclear → activate HITL mode (escalate all uncertain responses)
- Notify stakeholders via #llmops-incidents

RESOLUTION CRITERIA:
- Hallucination rate < 5% sustained for 15 minutes

POST-INCIDENT:
- Post-mortem within 48 hours
- Update the runbook if needed
```

14.2. Runbook — When Cost Spikes

```
INCIDENT: Daily LLM Cost > 150% of Baseline
══════════════════════════════════════════
T+0min:  Cost spike alert
T+2min:  Acknowledge, begin investigation

INVESTIGATION:
1. Prometheus query: which tenant is spending the most?
   sum by(tenant_id) (rate(llm_cost_usd_total[1h])) * 3600

2. Elasticsearch: which sessions have abnormally high cost?
   (Query 2 from Section 7.4)

3. Check: abnormal token counts?
   - Input tokens > 5,000 per request → likely context stuffing
   - Output tokens > 2,000 → likely a verbose prompt

4. Check: retry loop?
   sum by(agent_id) (rate(llm_errors_total{error_type="RateLimitError"}[10m]))

MITIGATION (in order):
1. Disable the offending tenant if activity is suspicious
2. Enable a hard token quota immediately
3. Temporarily lower max_tokens in the model config
4. Scale down replicas if there is a request flood

POST-INCIDENT: Review per-tenant token quotas, update budget config
```

14.3. Post-Mortem Template

```markdown
# Post-Mortem: [Incident Name]

**Date**: YYYY-MM-DD
**Severity**: Critical / High / Medium
**Duration**: X hours Y minutes
**MTTR**: X hours Y minutes

## Impact
- Users affected: XXX
- Revenue impact: $XXX
- Extra cost incurred: $XXX

## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert triggered |
| HH:MM | On-call engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |

## Root Cause
[Describe the root cause]

## Contributing Factors
1. [Factor 1]
2. [Factor 2]

## What Went Well
- [...]

## What Could Be Improved
- [...]

## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [...] | [...] | [...] | High |

## Lessons Learned
[...]
```

14.4. MTTR Targets for AI Incidents

| Severity | Example | Response Time | MTTR Target |
|---|---|---|---|
| P0 - Critical | Cost spike $1K+, mass data leak | 5 minutes | 30 minutes |
| P1 - High | Error rate > 10%, hallucination surge | 15 minutes | 2 hours |
| P2 - Medium | Latency degradation, quality drop | 1 hour | 8 hours |
| P3 - Low | Logging gap, minor metric anomaly | Next business day | 3 days |

15. Production Readiness Checklist — 3 Levels

🥉 MVP Level (Minimum to Go Live)

Basic monitoring (10 items):

- Prometheus /metrics endpoint exposed
- LLM latency (p50, p95) tracked
- Error rate tracked per agent_id
- Token counts (input + output) recorded
- Daily cost tracking
- Basic Grafana dashboard with latency + errors
- Alert for error rate > 10%
- Alert for cost spike > 200% of baseline
- Structured JSON logging (request_id, session_id, latency, tokens)
- Logs shipped to Elasticsearch / Loki

Basic reliability (8 items):

- Timeout configured (max 30s per LLM call)
- Retry with exponential backoff (max 3 retries; see the sketch after this list)
- Rate-limit handling (429 error → respect retry-after)
- Circuit breaker configured for the LLM provider
- Graceful degradation when the LLM is unavailable
- /health endpoint reports LLM connectivity status
- Token limit guard (max_tokens configured)
- Context length check before calling the LLM
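
A minimal sketch of the retry item above, assuming a generic `call` function (real clients raise library-specific exceptions, so the retryable set here is illustrative):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    retryable: tuple = (TimeoutError, ConnectionError),  # illustrative set
) -> T:
    """Retry a flaky LLM call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # 1s, 2s, 4s, ... capped; jitter avoids synchronized retry storms
            delay = min(base_delay_s * (2 ** attempt), max_delay_s)
            time.sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```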

🥈 Production Level (Enterprise-Ready)

Advanced observability (12 items):

- OpenTelemetry SDK fully integrated (traces + metrics + logs)
- Distributed tracing with context propagation across all agents
- TTFT (Time To First Token) tracking for streaming responses
- Per-tenant cost breakdown dashboard
- Hallucination rate monitoring (sampled evaluation pipeline)
- Guardrail decision logging with reason codes
- Tool-call latency histogram per tool
- Memory/context usage tracking
- Session timeline reconstruction from traces
- Kibana/Grafana Explore for ad-hoc investigation
- Automated daily cost report → email/Slack
- ILM policy for log retention (hot/warm/cold/delete)

Complete alerting (8 items):

- All 8 alert rules from Section 9 configured
- Alert routing by team/severity
- PagerDuty / on-call rotation integrated
- Runbook link in every alert annotation
- Alert fatigue review (tune thresholds after 2 weeks)
- Dead man's switch (alert if metrics stop flowing)
- Per-tenant cost budget alerts
- SLA breach prediction alert (leading indicator)

Production reliability (10 items):

- Multi-region LLM provider failover
- Budget guard middleware for every tenant
- Token quota enforcement per tenant per day
- Adaptive sampling for traces (not 100%)
- A/B testing framework ready
- Model versions pinned (never “latest”)
- Prompt versioning with git + experiment tracking
- Shadow-mode testing for model upgrades
- Load testing with a realistic token distribution
- Chaos engineering: LLM provider outage drill

🥇 Enterprise Level (Most Complete)

Advanced LLMOps (12 items):

- Full MLflow / LangSmith experiment tracking integration
- Automated evaluation pipeline running hourly on sampled traffic
- Model drift detection with statistical tests (KS test, Chi-square)
- Prompt regression test suite run on every deployment
- Multi-model cost optimization engine (auto-routing by task)
- LLM request caching (semantic cache with Redis + vector similarity; sketched after this list)
- Streaming token profiling (generation speed, jitter)
- Custom SLOs: error budget tracking, burn-rate alerts
- Capacity planning dashboard (projected cost at 30/60/90 days)
- Fine-tuning pipeline with an evaluation gate before deploy
- Cross-tenant benchmarking (anonymized)
- Regulatory audit trail with on-demand PDF/Excel report export
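
The semantic-cache item above is worth a sketch: if a new query is close enough in embedding space to one already answered, return the cached answer and skip the LLM call entirely. A minimal in-memory version (`embed` is a placeholder for any embedding model; a production build would keep vectors in Redis, as the checklist item says):

```python
from typing import Callable, Optional

import numpy as np

class SemanticCache:
    """Return a cached answer when a new query is close to a previous one."""

    def __init__(self, embed: Callable[[str], np.ndarray], threshold: float = 0.95):
        self.embed = embed          # placeholder: text -> embedding vector
        self.threshold = threshold  # cosine-similarity cutoff for a "hit"
        self._entries: list = []    # [(unit_vector, cached_answer), ...]

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        # Linear scan is fine for a sketch; use Redis vector search at scale.
        for vec, answer in self._entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return answer       # cache hit: the LLM call is skipped
        return None

    def put(self, query: str, answer: str) -> None:
        v = self.embed(query)
        self._entries.append((v / np.linalg.norm(v), answer))
```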

Security & Compliance (10 items):

- Every LLM interaction logged to an immutable audit trail
- PII detection and masking in the log pipeline
- Data residency enforcement (EU data → EU LLM endpoint)
- Penetration testing of prompt injection vectors
- GDPR Article 22 compliance (explainable AI decisions)
- Automated SOC 2 Type II evidence collection
- Monthly third-party security review of LLM configs
- Quarterly incident response drills
- Vendor lock-in mitigation plan (multi-provider routing)
- Contractual SLAs with LLM providers documented

16. Operational KPIs, Platform Cost, ROI Analysis

16.1. Monthly Operational KPIs

| KPI | MVP Target | Production Target | Enterprise Target |
|---|---|---|---|
| System Uptime | 99.0% | 99.5% | 99.9% |
| Avg Response Latency | < 5s | < 3s | < 2s |
| Error Rate | < 5% | < 1% | < 0.5% |
| Hallucination Rate | < 10% | < 3% | < 1% |
| Task Completion Rate | > 80% | > 90% | > 95% |
| Cost per Query | < $0.20 | < $0.08 | < $0.03 |
| MTTR (P1 incident) | 4h | 2h | 30min |
| User Satisfaction | > 70% | > 80% | > 90% |

16.2. Observability Stack Platform Cost

| Component | Self-hosted / Free Tier | SaaS / Managed | Notes |
|---|---|---|---|
| OpenTelemetry Collector | $0 (self-hosted) | $0 (open source) | K8s deployment |
| Prometheus | $0 | $0 | Add Thanos for HA |
| Grafana | $0 (OSS) | $29-299/mo | OSS is sufficient |
| Jaeger/Tempo | $0 + S3 storage | $50-200/mo | Tempo is cheaper than Jaeger |
| Elasticsearch | $200-500/mo (3 nodes) | $95-500/mo (ES Cloud) | ES Cloud if managed |
| Alertmanager | $0 | $0 | Bundled with Prometheus |
| Pyroscope | $0 | $0 | Grafana Phlare |
| Total (Self-hosted) | ~$200-500/mo | | 10K req/day |
| Total (Full SaaS) | | ~$500-1,200/mo | Managed, less ops effort |

16.3. ROI Analysis

```
Scenario: 10,000 LLM queries/day, team of 5

BEFORE (no LLMOps):
- Incident detection lag: 4-6 hours
- Each incident: 3-4 hours of engineer debugging time = ~$300 loss/incident
- 2 incidents/month = $600/month wasted
- Overspend from untracked cost: ~$400/month (estimated 20% waste)
- Hallucination → user churn: 5% of users/month = $2,000 MRR loss
Total monthly loss without LLMOps: ~$3,000

AFTER (full LLMOps stack):
- Platform cost: $500/month
- Incident MTTR cut from 4h → 30min (P1): saves $250/incident × 2 = $500/month
- Cost optimization (routing + quota): saves 15-25% = ~$300-500/month
- Hallucination detection → user churn down 60%: saves $1,200/month
Total monthly saving: ~$2,000 - 2,500

ROI = (2,000 - 500) / 500 × 100 = 300%
Payback period: < 1 month
```

17. Operational Risk Matrix

| # | Risk | Probability | Impact | Severity | Mitigation |
|---|---|---|---|---|---|
| 1 | LLM provider outage (OpenAI, Anthropic) | Medium | High | 🔴 High | Multi-provider failover; local Ollama fallback |
| 2 | Cost runaway (prompt loop, token exploit) | Low-Medium | Very High | 🔴 High | Budget guard; hard token quota; real-time cost alerts |
| 3 | Silent model degradation (provider updates the model) | Medium | High | 🔴 High | Pin model versions; automated weekly regression eval |
| 4 | Log/trace data explosion (misconfigured sampler) | Low | Medium | 🟠 Medium | Adaptive sampling; storage quota alert |
| 5 | Alert fatigue (too many false positives) | High | Medium | 🟠 Medium | Tune thresholds after 2 weeks; alert review cadence |
| 6 | PII leak via logs (unmasked user data in structured logs) | Low | Very High | 🔴 High | Log scrubber middleware; PII regex masking pipeline |
| 7 | Dashboard blind spot (metric not instrumented) | Medium | Medium | 🟠 Medium | Coverage checklist; quarterly observability audit |
| 8 | Observer effect (OTel overhead degrades performance) | Low | Low | 🟡 Low | Benchmark OTel overhead (<1ms target); async exporters |

18. LLMOps Rollout Roadmap — 3 Phases

🚀 Phase 1 — Foundation (Weeks 1-2)

Week 1:

- Deploy the OTel Collector + Prometheus + Grafana to K8s (Helm charts)
- Integrate the OTel SDK into all agent services
- Set up a basic Grafana dashboard (latency, errors, cost)
- Configure basic alerts (error rate, cost spike)
- Ship structured logs to Elasticsearch

Week 2:

- End-to-end distributed tracing (orchestrator → sub-agents)
- Token + cost tracking per agent per tenant
- Budget guard middleware deployed
- ILM policy for Elasticsearch
- On-call rotation set up, runbooks written

Deliverable: the system can detect a P1 incident in < 5 minutes


⚙️ Phase 2 — Quality & Cost (Weeks 3-6)

Weeks 3-4:

- Hallucination evaluation pipeline (sampled, async)
- Complete guardrail decision logging
- Per-tenant cost dashboard + daily email report
- A/B testing framework (canary deployments)
- Model router by task complexity

Weeks 5-6:

- Adaptive sampling replaces 100% sampling
- Semantic cache (Redis + vector similarity)
- Formal post-mortem process
- Alert tuning (fewer false positives)
- Quality SLO dashboard (error budget, burn rate)

Deliverable: cost down 20-30%; hallucination rate visible and monitored


🏆 Phase 3 — Enterprise Grade (Weeks 7-12)

Weeks 7-9:

- Full MLflow / LangSmith integration
- Automated model drift detection
- Prompt regression test suite in CI/CD
- Multi-provider failover (OpenAI → Azure OpenAI → Anthropic)
- Capacity planning dashboard

Weeks 10-12:

- Compliance audit trail (immutable, exportable)
- PII masking in the log pipeline
- Chaos engineering drill (LLM outage simulation)
- Security penetration test for LLM attack vectors
- Documentation, runbooks, and team training

Deliverable: Full LLMOps maturity — incident MTTR < 30min, cost optimized, compliance-ready


19. Conclusion

In this article we built a complete Monitoring & Observability system for AI Agents in production — from theory to working code:

| Component Built | Value |
|---|---|
| LLMOps vs DevOps — 10-dimension comparison | Understand why AI agents need their own observability |
| OTel architecture with 4 pillars | A complete framework for any scale |
| 25+ metrics in 5 groups | Know exactly what to measure |
| OTel instrumentation (Python) | Implementable today |
| LangChain callback handler | Automatic tracing for LangChain agents |
| Structured log schema + ES mapping | Standardized, searchable, auditable logs |
| Grafana JSON config | Production-ready dashboards |
| 8 alert rules + Prometheus YAML | Full alert coverage |
| A/B testing framework | Data-driven prompt/model improvements |
| Model router (Python) | Automatic cost optimization |
| Adaptive sampler | ~84% lower trace storage cost |
| Budget guard | Stops cost runaway |
| Incident runbooks | Fast response, low MTTR |
| 70-item checklist at 3 levels | A clear roadmap from MVP to Enterprise |
| 300% ROI | Justifies the investment to stakeholders |

The Golden Rule of LLMOps

“You cannot manage what you cannot measure. In the LLM world this is truer than anywhere else, because an LLM can fail silently in ways that no traditional DevOps metric will catch.”


📌 Next Article

Part 8: Real-World Use Cases — AI Agents in Vietnamese Enterprises

With the full foundation in place (architecture, memory, guardrails, monitoring), the next article puts everything into practice with three real-world use cases from Vietnamese enterprises:

- Healthcare: an AI Agent that helps doctors look up treatment protocols, integrated with HIS/EMR
- Banking/Fintech: an AI Agent for financial product advisory and KYC automation
- Retail/E-commerce: an omnichannel customer care AI Agent (Zalo, Web, App)

Each use case covers: detailed architecture, tech stack, cost, rollout timeline, and lessons learned.


💡 Field tip: Start Phase 1 (Weeks 1-2) as soon as your first AI Agent reaches production. Don't wait until you “have time”: a cost runaway or a hallucination incident arrives without warning and will force you to build observability in crisis mode, which is both expensive and stressful. Ship observability together with the feature; that is what a mature LLMOps culture looks like.
