1. Why Do Production AI Agents Need Dedicated Observability?
In the previous article, we built a Guardrails & Evaluation system to keep the AI Agent safe. But once thousands of real users hit the agent every day, an entirely new question emerges:
"How do I know the agent is working correctly, staying stable, staying on budget, and creating value right now — in production, 24/7?"
Traditional monitoring (CPU, RAM, requests/s) is not enough for an AI Agent. The agent can look completely "green" on a standard DevOps dashboard while in reality it is:
- Answering incorrectly (hallucination rate creeping up silently)
- Burning 3x the normal token count because of a prompt loop
- Costing an extra $800/day because of a wrong model configuration
- Stuck in a reasoning loop for 45 seconds with no timeout
This is why LLMOps — a dedicated branch of MLOps — was born.
2. LLMOps vs Traditional DevOps — 10 Core Differences
| # | Dimension | Traditional DevOps | LLMOps for AI Agents |
|---|---|---|---|
| 1 | Determinism | Deterministic: same input → same output | Non-deterministic: same prompt → different outputs |
| 2 | Cost unit | CPU hours, bandwidth GB | Tokens (input + output) + API call cost |
| 3 | Quality metrics | Latency, error rate, uptime | Hallucination rate, groundedness, relevance score |
| 4 | Versioning | Code + config versioning | Code + config + prompt versioning + model versioning |
| 5 | Drift | Performance drift from hardware changes | Model drift: the provider silently updates the model |
| 6 | Debugging | Clear stack traces | Complex, multi-hop reasoning traces that are hard to reproduce |
| 7 | Testing | Unit tests, integration tests | Evaluation datasets, LLM-as-a-Judge, A/B testing |
| 8 | Rollback | Roll back code/config | Roll back prompt version + model version + memory state |
| 9 | Scaling | Simple horizontal scaling | Must balance token throughput, context window, and cost |
| 10 | Compliance | Access logs, audit trail | Log every LLM interaction for compliance + audit |
2.1. Non-Determinism — The Biggest Challenge
DevOps: f(x) = y → always holds; testing once is enough
LLMOps: f(x) = y₁ | y₂ | y₃ | ... → tests must sample, evaluation must be statistical
This means you cannot just monitor whether there are errors — you have to continuously monitor whether the output is correct, in probabilistic terms.
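A minimal sketch of what "statistical" evaluation means in practice: call the agent repeatedly on the same prompt and summarize the score distribution instead of asserting a single expected output. `agent_call` and `score_output` are hypothetical hooks (your agent entry point and your evaluator, for example an LLM-as-a-Judge), not part of any library used later in this article.

```python
import math
import statistics
from typing import Callable

def evaluate_prompt(
    agent_call: Callable[[str], str],
    score_output: Callable[[str, str], float],
    prompt: str,
    n_samples: int = 20,
) -> dict:
    """Run the same prompt n_samples times and summarize quality statistically."""
    scores = [score_output(prompt, agent_call(prompt)) for _ in range(n_samples)]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if n_samples > 1 else 0.0
    margin = 1.96 * stdev / math.sqrt(n_samples)  # 95% CI under a normal approximation
    return {"mean_score": mean, "ci_95": (mean - margin, mean + margin), "n": n_samples}
```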
2.2. Token Economy — The Invisible Cost
| Scenario | Tokens consumed | Estimated cost |
|---|---|---|
| 1 simple FAQ question | ~500 tokens | ~$0.001 |
| 1 complex advisory session (RAG + history) | ~8,000 tokens | ~$0.016 |
| 1 five-step agentic workflow | ~25,000 tokens | ~$0.050 |
| 10,000 users/day × agentic workflow | 250M tokens | ~$500/day |
Bottom line: a small bug in a prompt (for example, an infinite retry loop) can burn $2,000+ before anyone notices if there is no cost monitoring.
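The figures above are simple arithmetic; the short sketch below makes the model explicit. The blended per-1K-token rate is an illustrative assumption, in the same spirit as the `COST_PER_1K_TOKENS` table used later in Section 5.6.

```python
# Illustrative blended rate: ~$2 per 1M tokens (assumption for a mixed input/output workload)
BLENDED_COST_PER_1K_TOKENS = 0.002

def estimate_daily_cost(users_per_day: int, tokens_per_session: int) -> float:
    """Back-of-the-envelope daily cost for an agentic workload."""
    daily_tokens = users_per_day * tokens_per_session
    return daily_tokens / 1000 * BLENDED_COST_PER_1K_TOKENS

# 10,000 users/day × ~25,000 tokens per agentic workflow ≈ 250M tokens ≈ $500/day
print(estimate_daily_cost(10_000, 25_000))  # -> 500.0
```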
3. Overall Observability Architecture for AI Agents
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLMOPS OBSERVABILITY ARCHITECTURE — AI AGENT CLUSTER │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ AI AGENT CLUSTER │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │
│ │Orchestrator │ │ RAG Agent │ │ Tool Agent │ │ Memory Agent │ │
│ │ Agent │ │ │ │ │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └───────┬───────┘ │
│ │ │ │ │ │
│ └────────────────┴────────────────┴──────────────────┘ │
│ │ │
│ OTel SDK (Python / .NET / Java) │
│ - Traces (spans + context propagation) │
│ - Metrics (counters, histograms, gauges) │
│ - Logs (structured JSON + trace_id correlation) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ OPENTELEMETRY COLLECTOR │
│ │
│ Receivers: OTLP gRPC/HTTP, Prometheus scrape, Fluentd │
│ Processors: Batch, Memory Limiter, Attribute Filter, Sampling │
│ Exporters: → Prometheus │ → Jaeger/Tempo │ → Elasticsearch │
└──────────────────┬──────────────────────────────────────────────────────┘
│
┌───────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────────────┐
│ PROMETHEUS │ │ JAEGER │ │ ELASTICSEARCH │
│ │ │ / TEMPO │ │ / OPENSEARCH │
│ Metrics │ │ │ │ │
│ Storage │ │ Distributed│ │ Log Storage │
│ & Query │ │ Traces │ │ Full-text Search │
└─────┬──────┘ └────┬─────┘ └────────┬─────────┘
│ │ │
└─────────────┴────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ GRAFANA DASHBOARD │
│ │
│ [Overview] [Token Economy] [Quality] [Agent Health] [Business KPI] │
└──────────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ ALERTMANAGER │
│ │
│ Rules: Cost Spike | Latency P95 | Error Rate | Hallucination Rate │
│ Routing: → Slack | PagerDuty | Email | Webhook │
└─────────────────────────────────────────────────────────────────────────┘
3.1. Multi-Agent Distributed Tracing Flow
USER REQUEST (request_id: req-abc123)
│
▼ [Trace Start — Span: "user_request"]
┌────────────────────────────────────┐
│ API GATEWAY / LB │
│ Inject: traceparent header │
└──────────────────┬─────────────────┘
│
▼ [Span: "orchestrator.process"]
┌────────────────────────────────────┐
│ ORCHESTRATOR AGENT │ t=0ms
│ - Parse intent │
│ - Plan sub-tasks │
└──┬──────────────┬──────────────────┘
│ │ │
▼ ▼ ▼
[Span: [Span: [Span:
"rag.retrieve"] "tool.call"] "memory.fetch"]
┌──────────┐ ┌──────────┐ ┌──────────┐
│RAG Agent │ │Tool Agent│ │Memory │
│t=5ms │ │t=5ms │ │Agent │
│ │ │ │ │t=5ms │
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐│
│ │Embed│ │ │ │API │ │ │ │Redis││
│ │Query│ │ │ │Call │ │ │ │Fetch││
│ └──┬──┘ │ │ └──┬──┘ │ │ └──┬──┘│
│ │ │ │ │ │ │ │ │
│ ┌──▼──┐ │ │ ┌──▼──┐ │ │ │ │
│ │Vecto│ │ │ │Tool │ │ │ │ │
│ │rDB │ │ │ │Resp │ │ │ │ │
│ └─────┘ │ │ └─────┘ │ │ │ │
└────┬─────┘ └────┬──────┘ └─────┬───┘
│ │ │
└──────────────┴───────────────┘
│
▼ [Span: "llm.generate"] t=120ms
┌─────────────────┐
│ LLM CALL │
│ GPT-4o / etc │
│ tokens: 2,340 │
│ latency: 1.8s │
└────────┬────────┘
│
▼ [Span: "output.guard"] t=1920ms
┌─────────────────┐
│ Output Guard │
│ Guardrails check│
└────────┬────────┘
│
▼ [Trace End] t=2050ms
FINAL RESPONSE → User
Total: 2,050ms | tokens: 2,340 | cost: $0.0047
4. The Four Pillars of LLM Observability
4.1. Pillar 1 — Metrics
Description: numeric, time-series, aggregatable data — used for trending and alerting.
Suitable tools: Prometheus, Grafana, Datadog, New Relic
Sample data:
# HELP llm_request_duration_seconds LLM request latency
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="0.5"} 42
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="1.0"} 180
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="2.0"} 312
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="5.0"} 398
llm_request_duration_seconds_bucket{agent="rag_agent",model="gpt-4o",le="+Inf"} 402
# HELP llm_tokens_total Total tokens consumed
# TYPE llm_tokens_total counter
llm_tokens_total{agent="rag_agent",type="input",model="gpt-4o"} 1284930
llm_tokens_total{agent="rag_agent",type="output",model="gpt-4o"} 423810
# HELP llm_cost_usd_total Total cost in USD
# TYPE llm_cost_usd_total counter
llm_cost_usd_total{agent="rag_agent",model="gpt-4o"} 24.87
4.2. Pillar 2 — Logs
Description: structured event records — used for debugging, auditing, and root-cause analysis.
Suitable tools: Elasticsearch, OpenSearch, Loki, Splunk
Sample data (JSON structured log):
{
"timestamp": "2026-05-14T10:23:45.123Z",
"level": "INFO",
"request_id": "req-abc123",
"session_id": "sess-xyz789",
"agent_id": "rag_agent",
"model": "gpt-4o",
"prompt_tokens": 1840,
"completion_tokens": 420,
"total_tokens": 2260,
"latency_ms": 2050,
"cost_usd": 0.0045,
"guardrail_status": "passed",
"tool_calls": ["search_knowledge_base", "get_product_info"],
"hallucination_score": 0.12,
"user_satisfaction": null,
"error": null
}
4.3. Pillar 3 — Traces
Description: distributed tracing — the timeline of a request across multiple services/agents.
Suitable tools: Jaeger, Grafana Tempo, Zipkin, AWS X-Ray
Sample span data:
{
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"parentSpanId": "b9c7c989f97918e1",
"operationName": "llm.generate",
"serviceName": "rag-agent",
"startTime": 1715677425120,
"duration": 1823000,
"tags": {
"llm.model": "gpt-4o",
"llm.input_tokens": 1840,
"llm.output_tokens": 420,
"llm.cost_usd": 0.0045,
"agent.id": "rag_agent",
"guardrail.status": "passed"
}
}
4.4. Pillar 4 — Profiles
Description: CPU/memory profiling of the inference engine and Python code — used to find bottlenecks.
Suitable tools: Pyroscope, Grafana Phlare, py-spy, cProfile
Sample — a bottleneck found in practice:
Function │ CPU % │ Calls │ Avg ms
──────────────────────────────────┼───────┼───────┼───────
embed_documents() │ 34.2% │ 2,840 │ 12.1ms
vector_db.similarity_search() │ 21.8% │ 2,840 │ 7.7ms
openai.chat.completions.create() │ 18.6% │ 890 │ 1,820ms
json.loads() [response parsing] │ 8.3% │ 2,840 │ 2.9ms
redis.get() [session cache] │ 5.1% │ 8,900 │ 0.57ms
5. Key Metrics to Track — 5 Groups
5.1. Group 1 — Latency Metrics
| Metric | Description | Target (Production) | Alert Threshold |
|---|---|---|---|
| TTFT p50 | Time To First Token, median | < 500ms | > 1s |
| TTFT p95 | Time To First Token, 95th percentile | < 1.5s | > 3s |
| TTFT p99 | Time To First Token, 99th percentile | < 3s | > 5s |
| Total Latency p95 | End-to-end response time | < 3s | > 5s |
| Queue Wait Time | Time spent waiting in the queue | < 100ms | > 500ms |
| Tool Call Latency | Latency of external API calls | < 500ms/call | > 2s |
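TTFT only exists for streaming responses, so it has to be measured when the first chunk arrives. A minimal sketch, assuming the OpenAI streaming Chat Completions API and the `TTFT_DURATION` Prometheus histogram defined later in Section 5.6:

```python
import time
from typing import Optional
from openai import AsyncOpenAI

async def stream_with_ttft(messages: list[dict], agent_id: str = "rag_agent", model: str = "gpt-4o-mini") -> str:
    """Stream a completion and record Time To First Token when the first content chunk arrives."""
    client = AsyncOpenAI()
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    chunks: list[str] = []
    stream = await client.chat.completions.create(model=model, messages=messages, stream=True)
    async for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_at is None:
            first_token_at = time.perf_counter()
            # TTFT_DURATION is the Prometheus histogram defined in Section 5.6
            TTFT_DURATION.labels(agent_id=agent_id, model=model).observe(first_token_at - start)
        chunks.append(delta)
    return "".join(chunks)
```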
5.2. Group 2 — Token & Cost Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Input tokens/request | Avg input tokens per request | < 2,000 | > 5,000 |
| Output tokens/request | Avg output tokens per request | < 500 | > 2,000 |
| Cost/session USD | Average cost per session | < $0.05 | > $0.20 |
| Daily cost USD | Total cost per day | Baseline ±20% | > 150% baseline |
| Monthly cost trend | Month-over-month cost trend | Growth < 30% | > 50% MoM |
| Token efficiency ratio | Output tokens / Input tokens | > 0.3 | < 0.1 |
5.3. Group 3 — Quality Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Hallucination rate | % responses containing incorrect information | < 3% | > 10% |
| Guardrail block rate | % requests blocked by guardrails | 0.5-2% | > 20% (surge) |
| Groundedness score | How well the RAG answer is grounded in context | > 0.85 | < 0.70 |
| User satisfaction | CSAT score / thumbs-up % | > 80% | < 60% |
| Task completion rate | % tasks completed successfully | > 90% | < 75% |
| Escalation rate | % sessions escalated to a human | < 5% | > 15% |
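Groundedness and hallucination scores usually come from a sampled evaluation pipeline rather than from every request. A hedged sketch of such a sampler using an LLM-as-a-Judge; the judge prompt, the 10% sampling rate, the judge model, and the use of the `LLM_HALLUCINATION_SCORE` histogram from Section 5.6 are all illustrative assumptions.

```python
import json
import random
from typing import Optional
from openai import AsyncOpenAI

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a CONTEXT and an ANSWER, return JSON "
    '{"groundedness": <0..1>, "hallucination_probability": <0..1>}.'
)

async def maybe_score_response(context: str, answer: str, agent_id: str, sample_rate: float = 0.10) -> Optional[dict]:
    """Score a sampled fraction of responses with an LLM-as-a-Judge."""
    if random.random() > sample_rate:
        return None  # skip: keeps evaluation cost bounded
    client = AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    scores = json.loads(resp.choices[0].message.content)
    # Feed the Prometheus histogram defined in Section 5.6
    LLM_HALLUCINATION_SCORE.labels(agent_id=agent_id).observe(scores["hallucination_probability"])
    return scores
```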
5.4. Group 4 — Reliability Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Error rate | % requests returning an error | < 1% | > 5% |
| Timeout rate | % requests timing out | < 0.5% | > 2% |
| Retry rate | % requests that needed a retry | < 2% | > 10% |
| Circuit breaker state | Current circuit breaker state | CLOSED | OPEN > 5min |
| Memory overflow rate | % context window overflows | < 1% | > 5% |
| Tool failure rate | % failed tool calls | < 2% | > 10% |
5.5. Group 5 — Business Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Active sessions | Number of currently active sessions | Capacity planning | > 80% capacity |
| Daily active users | Unique users per day | Growth target | Sudden drop > 30% |
| Task completion rate | % tasks completed | > 90% | < 75% |
| Avg conversation length | Average turns per session | 3-8 turns | > 15 turns |
| ROI per agent | Value created / operating cost | > 3x | < 1x |
| Cost per resolved query | Cost to resolve one query | < $0.10 | > $0.50 |
5.6. Python — Custom Prometheus Metrics + OpenTelemetry Instrumentation
import time
import logging
from typing import Optional, Any
from functools import wraps
from prometheus_client import Counter, Histogram, Gauge, Summary
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
logger = logging.getLogger(__name__)
# ─── Prometheus Metrics ────────────────────────────────────────────────────────
# Latency histogram with p50/p95/p99 buckets
LLM_REQUEST_DURATION = Histogram(
"llm_request_duration_seconds",
"LLM request latency in seconds",
["agent_id", "model", "operation"],
buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)
TTFT_DURATION = Histogram(
"llm_ttft_seconds",
"Time To First Token in seconds",
["agent_id", "model"],
buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
)
# Token counters
LLM_TOKENS_TOTAL = Counter(
"llm_tokens_total",
"Total tokens consumed",
["agent_id", "model", "token_type"], # token_type: input | output
)
# Cost tracking
LLM_COST_USD = Counter(
"llm_cost_usd_total",
"Total LLM API cost in USD",
["agent_id", "model", "tenant_id"],
)
# Quality metrics
LLM_HALLUCINATION_SCORE = Histogram(
"llm_hallucination_score",
"Hallucination probability score (0.0-1.0)",
["agent_id"],
buckets=[0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0],
)
GUARDRAIL_DECISIONS = Counter(
"llm_guardrail_decisions_total",
"Guardrail decisions",
["agent_id", "decision", "reason"], # decision: allow|block|escalate
)
# Reliability
LLM_ERRORS_TOTAL = Counter(
"llm_errors_total",
"Total LLM errors",
["agent_id", "model", "error_type"],
)
# Active sessions gauge
ACTIVE_SESSIONS = Gauge(
"llm_active_sessions",
"Number of currently active sessions",
["agent_id"],
)
# ─── OpenTelemetry Setup ───────────────────────────────────────────────────────
def setup_otel(service_name: str, otel_endpoint: str = "http://otel-collector:4317"):
"""Configure OpenTelemetry Tracing + Metrics với OTLP exporter."""
# Tracing
tracer_provider = TracerProvider()
otlp_span_exporter = OTLPSpanExporter(endpoint=otel_endpoint, insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_span_exporter))
trace.set_tracer_provider(tracer_provider)
# Metrics
otlp_metric_exporter = OTLPMetricExporter(endpoint=otel_endpoint, insecure=True)
metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis=15000)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
return trace.get_tracer(service_name), metrics.get_meter(service_name)
# ─── Instrumented LLM Call Wrapper ────────────────────────────────────────────
COST_PER_1K_TOKENS = {
"gpt-4o": {"input": 0.005, "output": 0.015},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
"claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}
class InstrumentedLLMClient:
def __init__(self, agent_id: str, tracer, model: str = "gpt-4o"):
self.agent_id = agent_id
self.model = model
self.tracer = tracer
def calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
rates = COST_PER_1K_TOKENS.get(self.model, {"input": 0.005, "output": 0.015})
return (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
async def chat_completion(
self,
messages: list[dict],
tenant_id: str = "default",
session_id: Optional[str] = None,
**kwargs: Any,
) -> dict:
"""LLM call với đầy đủ instrumentation: traces, metrics, cost tracking."""
start_time = time.perf_counter()
with self.tracer.start_as_current_span("llm.generate") as span:
span.set_attribute("llm.model", self.model)
span.set_attribute("llm.agent_id", self.agent_id)
span.set_attribute("llm.session_id", session_id or "")
span.set_attribute("llm.input_messages", len(messages))
ACTIVE_SESSIONS.labels(agent_id=self.agent_id).inc()
try:
                # Call the actual LLM here (replace with your real OpenAI client setup)
from openai import AsyncOpenAI
client = AsyncOpenAI()
response = await client.chat.completions.create(
model=self.model,
messages=messages,
**kwargs,
)
latency = time.perf_counter() - start_time
usage = response.usage
input_tokens = usage.prompt_tokens
output_tokens = usage.completion_tokens
cost = self.calculate_cost(input_tokens, output_tokens)
# Prometheus metrics
LLM_REQUEST_DURATION.labels(
agent_id=self.agent_id, model=self.model, operation="chat"
).observe(latency)
LLM_TOKENS_TOTAL.labels(
agent_id=self.agent_id, model=self.model, token_type="input"
).inc(input_tokens)
LLM_TOKENS_TOTAL.labels(
agent_id=self.agent_id, model=self.model, token_type="output"
).inc(output_tokens)
LLM_COST_USD.labels(
agent_id=self.agent_id, model=self.model, tenant_id=tenant_id
).inc(cost)
# OTel span attributes
span.set_attribute("llm.input_tokens", input_tokens)
span.set_attribute("llm.output_tokens", output_tokens)
span.set_attribute("llm.cost_usd", cost)
span.set_attribute("llm.latency_ms", int(latency * 1000))
logger.info(
"llm_call_completed",
extra={
"agent_id": self.agent_id,
"model": self.model,
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"latency_ms": int(latency * 1000),
"cost_usd": round(cost, 6),
"session_id": session_id,
},
)
return {"response": response, "cost_usd": cost, "latency_ms": int(latency * 1000)}
except Exception as e:
LLM_ERRORS_TOTAL.labels(
agent_id=self.agent_id, model=self.model, error_type=type(e).__name__
).inc()
span.record_exception(e)
span.set_status(trace.StatusCode.ERROR, str(e))
logger.error("llm_call_failed", extra={"error": str(e), "agent_id": self.agent_id})
raise
finally:
ACTIVE_SESSIONS.labels(agent_id=self.agent_id).dec()
6. Distributed Tracing for Multi-Agent Workflows
6.1. Core Concepts
| Concept | Description | Example in an AI Agent |
|---|---|---|
| Trace | The full lifecycle of one request | From the moment the user sends a message → the response is returned |
| Span | One unit of work inside a trace | "llm.generate", "rag.retrieve", "tool.call" |
| Parent Span | A span containing child spans | The orchestrator span contains all sub-agent spans |
| Context Propagation | Carrying trace context across service boundaries | traceparent header over HTTP/gRPC |
| Correlation ID | A single ID linking logs + traces + metrics | request_id = trace_id |
6.2. Python — OpenTelemetry + LangChain Callback Handler
import uuid
import time
import logging
from typing import Any, Optional, Union
from langchain.callbacks.base import BaseCallbackHandler
from langchain.schema import LLMResult, AgentAction, AgentFinish
from opentelemetry import trace, context, baggage
from opentelemetry.propagate import inject, extract
import structlog
logger = structlog.get_logger()
tracer = trace.get_tracer("langchain-agent")
class LangChainOTelCallbackHandler(BaseCallbackHandler):
"""
LangChain callback handler tích hợp OpenTelemetry tracing.
Tự động tạo spans cho mọi LLM call, tool call, chain execution.
"""
def __init__(self, agent_id: str):
self.agent_id = agent_id
self._span_stack: dict[str, Any] = {}
self._run_metadata: dict[str, dict] = {}
def on_llm_start(self, serialized: dict, prompts: list[str], **kwargs: Any) -> None:
run_id = str(kwargs.get("run_id", uuid.uuid4()))
model = serialized.get("kwargs", {}).get("model_name", "unknown")
span = tracer.start_span(
"llm.generate",
attributes={
"llm.model": model,
"llm.agent_id": self.agent_id,
"llm.prompt_count": len(prompts),
"llm.run_id": run_id,
},
)
        ctx = trace.set_span_in_context(span)  # build a Context that carries this span as current
        token = context.attach(ctx)
self._span_stack[run_id] = {"span": span, "token": token, "start_time": time.perf_counter()}
self._run_metadata[run_id] = {"model": model, "prompts": prompts}
logger.info("llm_start", agent_id=self.agent_id, model=model, run_id=run_id)
def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
run_id = str(kwargs.get("run_id", ""))
if run_id not in self._span_stack:
return
frame = self._span_stack.pop(run_id)
span = frame["span"]
latency_ms = int((time.perf_counter() - frame["start_time"]) * 1000)
        # Extract token usage from the LLMResult
total_tokens = 0
input_tokens = 0
output_tokens = 0
if response.llm_output:
token_usage = response.llm_output.get("token_usage", {})
input_tokens = token_usage.get("prompt_tokens", 0)
output_tokens = token_usage.get("completion_tokens", 0)
total_tokens = token_usage.get("total_tokens", 0)
span.set_attribute("llm.input_tokens", input_tokens)
span.set_attribute("llm.output_tokens", output_tokens)
span.set_attribute("llm.total_tokens", total_tokens)
span.set_attribute("llm.latency_ms", latency_ms)
span.end()
context.detach(frame["token"])
logger.info(
"llm_end",
agent_id=self.agent_id,
run_id=run_id,
input_tokens=input_tokens,
output_tokens=output_tokens,
latency_ms=latency_ms,
)
def on_llm_error(self, error: Union[Exception, KeyboardInterrupt], **kwargs: Any) -> None:
run_id = str(kwargs.get("run_id", ""))
if run_id not in self._span_stack:
return
frame = self._span_stack.pop(run_id)
span = frame["span"]
span.record_exception(error)
span.set_status(trace.StatusCode.ERROR, str(error))
span.end()
context.detach(frame["token"])
logger.error("llm_error", agent_id=self.agent_id, error=str(error), run_id=run_id)
def on_tool_start(self, serialized: dict, input_str: str, **kwargs: Any) -> None:
run_id = str(kwargs.get("run_id", uuid.uuid4()))
tool_name = serialized.get("name", "unknown_tool")
span = tracer.start_span(
f"tool.{tool_name}",
attributes={
"tool.name": tool_name,
"tool.input_length": len(input_str),
"llm.agent_id": self.agent_id,
},
)
        ctx = trace.set_span_in_context(span)  # build a Context that carries this span as current
        token = context.attach(ctx)
self._span_stack[run_id] = {"span": span, "token": token, "start_time": time.perf_counter()}
logger.info("tool_start", tool=tool_name, agent_id=self.agent_id)
def on_tool_end(self, output: str, **kwargs: Any) -> None:
run_id = str(kwargs.get("run_id", ""))
if run_id not in self._span_stack:
return
frame = self._span_stack.pop(run_id)
span = frame["span"]
span.set_attribute("tool.output_length", len(output))
span.set_attribute("tool.latency_ms", int((time.perf_counter() - frame["start_time"]) * 1000))
span.end()
context.detach(frame["token"])
def on_agent_action(self, action: AgentAction, **kwargs: Any) -> None:
logger.info(
"agent_action",
agent_id=self.agent_id,
tool=action.tool,
            tool_input=str(action.tool_input)[:200],  # tool_input may be a dict, so stringify before truncating
)
def on_agent_finish(self, finish: AgentFinish, **kwargs: Any) -> None:
logger.info("agent_finish", agent_id=self.agent_id, output_keys=list(finish.return_values.keys()))
# ─── Context Propagation over HTTP ────────────────────────────────────────────
def create_propagated_headers(current_span: Optional[Any] = None) -> dict:
    """Create HTTP headers carrying the W3C traceparent so context propagates to downstream services."""
    headers: dict = {}
    inject(headers)  # OTel injects traceparent + tracestate automatically
    return headers
def extract_trace_context(incoming_headers: dict) -> Any:
    """Extract the trace context from an inbound HTTP request's headers."""
    return extract(incoming_headers)
7. Structured Logging for AI Agents
7.1. Standard JSON Log Schema
{
"timestamp": "2026-05-14T10:23:45.123456Z",
"level": "INFO",
"service": "rag-agent-service",
"version": "2.1.0",
"environment": "production",
"request_id": "req-4bf92f35-77b3-4da6",
"session_id": "sess-a3ce929d-0e0e-4736",
"correlation_id": "corr-00f067aa-0ba9-02b7",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"agent_id": "rag_agent_v2",
"tenant_id": "tenant-healthcare-001",
"user_id": "user-hashed-789xyz",
"model": "gpt-4o",
"model_version": "2024-11-20",
"operation": "chat_completion",
"prompt_tokens": 1840,
"completion_tokens": 420,
"total_tokens": 2260,
"cost_usd": 0.004530,
"latency_ms": 2050,
"ttft_ms": 380,
"queue_wait_ms": 12,
"guardrail_status": "passed",
"guardrail_checks": {
"prompt_injection": "clean",
"pii_detection": "no_pii",
"topic_filter": "in_scope",
"toxicity": "clean"
},
"tool_calls": [
{"name": "search_knowledge_base", "latency_ms": 145, "status": "success"},
{"name": "get_product_info", "latency_ms": 89, "status": "success"}
],
"rag_context": {
"chunks_retrieved": 5,
"top_similarity_score": 0.92,
"retrieval_latency_ms": 145
},
"quality_scores": {
"groundedness": 0.88,
"hallucination_probability": 0.08,
"relevance": 0.91
},
"error": null,
"error_type": null,
"retry_count": 0
}
7.2. Python Structlog Setup
import sys
import logging
import structlog
from opentelemetry import trace
def configure_structured_logging(
service_name: str,
environment: str = "production",
log_level: str = "INFO",
) -> None:
"""Cấu hình structlog với OTel trace context injection."""
# Processor chain: xử lý log record trước khi output
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars, # Thread-local context
structlog.stdlib.add_log_level, # level field
structlog.stdlib.add_logger_name, # logger field
structlog.processors.TimeStamper(fmt="iso"), # ISO timestamp
_inject_otel_context, # trace_id + span_id
_add_service_metadata(service_name, environment), # service + env
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.JSONRenderer(), # JSON output
],
wrapper_class=structlog.make_filtering_bound_logger(
getattr(logging, log_level.upper())
),
context_class=dict,
logger_factory=structlog.PrintLoggerFactory(sys.stdout),
)
def _inject_otel_context(logger, method_name: str, event_dict: dict) -> dict:
"""Inject OTel trace_id và span_id vào mọi log record."""
current_span = trace.get_current_span()
if current_span and current_span.is_recording():
ctx = current_span.get_span_context()
event_dict["trace_id"] = format(ctx.trace_id, "032x")
event_dict["span_id"] = format(ctx.span_id, "016x")
return event_dict
def _add_service_metadata(service_name: str, environment: str):
def processor(logger, method_name: str, event_dict: dict) -> dict:
event_dict["service"] = service_name
event_dict["environment"] = environment
return event_dict
return processor
# Usage:
# configure_structured_logging("rag-agent-service", "production")
# log = structlog.get_logger()
# log.info("llm_call_completed", agent_id="rag_agent", latency_ms=2050, cost_usd=0.0045)
7.3. Elasticsearch Index Mapping
# elasticsearch-index-mapping.yaml
---
index_template:
name: "ai-agent-logs"
index_patterns:
- "ai-agent-logs-*"
settings:
number_of_shards: 3
number_of_replicas: 1
refresh_interval: "5s"
index:
lifecycle:
name: "ai-agent-logs-ilm-policy"
rollover_alias: "ai-agent-logs"
analysis:
analyzer:
custom_log_analyzer:
type: standard
stopwords: "_none_"
mappings:
dynamic: false
properties:
"@timestamp": { type: date }
timestamp: { type: date }
level: { type: keyword }
service: { type: keyword }
environment: { type: keyword }
version: { type: keyword }
request_id: { type: keyword }
session_id: { type: keyword }
trace_id: { type: keyword }
span_id: { type: keyword }
correlation_id: { type: keyword }
agent_id: { type: keyword }
tenant_id: { type: keyword }
user_id: { type: keyword }
model: { type: keyword }
operation: { type: keyword }
prompt_tokens: { type: integer }
completion_tokens: { type: integer }
total_tokens: { type: integer }
cost_usd: { type: float }
latency_ms: { type: integer }
ttft_ms: { type: integer }
queue_wait_ms: { type: integer }
guardrail_status: { type: keyword }
error: { type: text, analyzer: custom_log_analyzer }
error_type: { type: keyword }
retry_count: { type: short }
hallucination_probability: { type: float }
groundedness: { type: float }
relevance: { type: float }
tool_calls:
type: nested
properties:
name: { type: keyword }
latency_ms: { type: integer }
status: { type: keyword }
# ILM Policy
ilm_policy:
name: "ai-agent-logs-ilm-policy"
phases:
hot:
min_age: "0ms"
actions:
rollover:
max_primary_shard_size: "50gb"
max_age: "1d"
set_priority:
priority: 100
warm:
min_age: "7d"
actions:
shrink:
number_of_shards: 1
forcemerge:
max_num_segments: 1
set_priority:
priority: 50
cold:
min_age: "30d"
actions:
freeze: {}
set_priority:
priority: 0
delete:
min_age: "90d"
actions:
delete: {}
7.4. Kibana/Elasticsearch Query Examples
// Query 1: Find slow requests (latency > 5s)
{
"query": {
"bool": {
"must": [
{ "term": { "environment": "production" } },
{ "range": { "latency_ms": { "gte": 5000 } } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
},
"sort": [{ "latency_ms": "desc" }],
"size": 20
}
// Query 2: High-cost sessions today
{
"query": {
"bool": {
"must": [
{ "range": { "@timestamp": { "gte": "now/d" } } },
{ "range": { "cost_usd": { "gte": 0.10 } } }
]
}
},
"aggs": {
"by_session": {
"terms": { "field": "session_id", "size": 20 },
"aggs": {
"total_cost": { "sum": { "field": "cost_usd" } },
"total_tokens": { "sum": { "field": "total_tokens" } }
}
}
},
"size": 0
}
// Query 3: Failed tool calls by agent
{
"query": {
"bool": {
"must": [
{ "range": { "@timestamp": { "gte": "now-6h" } } }
],
"filter": [
{
"nested": {
"path": "tool_calls",
"query": {
"term": { "tool_calls.status": "failed" }
}
}
}
]
}
},
"aggs": {
"by_agent": {
"terms": { "field": "agent_id" },
"aggs": {
"failed_tools": {
"nested": { "path": "tool_calls" },
"aggs": {
"failed_only": {
"filter": { "term": { "tool_calls.status": "failed" } },
"aggs": {
"tool_names": { "terms": { "field": "tool_calls.name" } }
}
}
}
}
}
}
},
"size": 0
}
8. Grafana Dashboard — 5 Panel Groups
8.1. Overview of the 5 Dashboard Panel Groups
| Panel | Description | Metric source | Visualisation |
|---|---|---|---|
| Overview | RPS, error rate, avg latency | Prometheus | Stat + Time series |
| Token Economy | Cost/hour, token distribution | Prometheus | Bar gauge + Heatmap |
| Quality | Hallucination rate, guardrail blocks | Prometheus | Time series + Alert |
| Agent Health | Per-agent latency heatmap | Prometheus | Heatmap |
| Business KPI | Task completion, escalation funnel | Prometheus + ES | Stat + Bar chart |
8.2. Grafana Dashboard JSON Config (Partial)
{
"title": "AI Agent — LLMOps Dashboard",
"uid": "llmops-main-dashboard",
"tags": ["ai-agent", "llmops", "production"],
"refresh": "30s",
"time": { "from": "now-3h", "to": "now" },
"panels": [
{
"id": 1,
"title": "🟢 Requests Per Second",
"type": "stat",
"gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
"targets": [
{
"datasource": "prometheus",
"expr": "sum(rate(llm_request_duration_seconds_count[2m]))",
"legendFormat": "RPS"
}
],
"fieldConfig": {
"defaults": {
"color": { "mode": "thresholds" },
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 100 },
{ "color": "red", "value": 500 }
]
},
"unit": "reqps"
}
}
},
{
"id": 2,
"title": "🔴 Error Rate (%)",
"type": "stat",
"gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
"targets": [
{
"datasource": "prometheus",
"expr": "100 * sum(rate(llm_errors_total[5m])) / sum(rate(llm_request_duration_seconds_count[5m]))",
"legendFormat": "Error Rate %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
}
}
}
},
{
"id": 3,
"title": "⏱ Latency P95 (ms)",
"type": "timeseries",
"gridPos": { "x": 0, "y": 4, "w": 12, "h": 8 },
"targets": [
{
"datasource": "prometheus",
"expr": "histogram_quantile(0.95, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "P95 - {{agent_id}}"
},
{
"datasource": "prometheus",
"expr": "histogram_quantile(0.50, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000",
"legendFormat": "P50 - {{agent_id}}"
}
],
"fieldConfig": {
"defaults": { "unit": "ms" }
}
},
{
"id": 4,
"title": "💰 Cost Per Hour (USD)",
"type": "timeseries",
"gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
"targets": [
{
"datasource": "prometheus",
"expr": "sum by(agent_id) (rate(llm_cost_usd_total[1h])) * 3600",
"legendFormat": "Cost/hr - {{agent_id}}"
}
],
"fieldConfig": {
"defaults": { "unit": "currencyUSD" }
}
},
{
"id": 5,
"title": "🧠 Hallucination Rate (%)",
"type": "timeseries",
"gridPos": { "x": 0, "y": 12, "w": 12, "h": 8 },
"targets": [
{
"datasource": "prometheus",
"expr": "100 * histogram_quantile(0.90, rate(llm_hallucination_score_bucket[10m]))",
"legendFormat": "Hallucination P90"
}
],
"alert": {
"conditions": [
{
"type": "query",
"query": { "params": ["A", "10m", "now"] },
"reducer": { "type": "avg" },
"evaluator": { "type": "gt", "params": [10] }
}
],
"name": "High Hallucination Rate Alert"
}
}
]
}
9. Alerting Strategy — 8 Essential Alert Rules
9.1. Prometheus AlertManager Config
# alertmanager-rules.yaml
---
groups:
- name: llmops_critical
rules:
      # Alert 1: Cost Spike — daily cost exceeds 150% of the 7-day baseline
- alert: LLMCostSpike
expr: |
(
sum(increase(llm_cost_usd_total[24h]))
/
sum(increase(llm_cost_usd_total[24h] offset 7d))
) > 1.5
for: 15m
labels:
severity: critical
team: llmops
annotations:
summary: "💰 LLM Cost Spike Detected"
description: "Daily cost is {{ humanize $value | printf \"%.0f%%\" }} of 7-day average. Current: ${{ $value }}"
runbook: "https://wiki.company.com/runbooks/llm-cost-spike"
# Alert 2: Latency P95 > 5s sustained 5 minutes
- alert: LLMHighLatencyP95
expr: |
histogram_quantile(0.95,
sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))
) > 5
for: 5m
labels:
severity: warning
team: llmops
annotations:
summary: "⏱ LLM P95 Latency High: {{ $labels.agent_id }}"
description: "P95 latency is {{ $value | humanizeDuration }} for agent {{ $labels.agent_id }}"
      # Alert 3: Error rate > 5% over 10 minutes
- alert: LLMHighErrorRate
expr: |
(
sum by(agent_id) (rate(llm_errors_total[10m]))
/
sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
) * 100 > 5
for: 10m
labels:
severity: critical
team: llmops
annotations:
summary: "🔴 LLM Error Rate > 5%: {{ $labels.agent_id }}"
description: "Error rate is {{ $value | printf \"%.1f%%\" }} for agent {{ $labels.agent_id }}"
# Alert 4: Hallucination Rate > 10% (sampled evaluation)
- alert: LLMHallucinationRateHigh
expr: |
histogram_quantile(0.90,
sum by(le, agent_id) (rate(llm_hallucination_score_bucket[15m]))
) > 0.10
for: 10m
labels:
severity: critical
team: ai-quality
annotations:
summary: "🧠 Hallucination Rate Spike: {{ $labels.agent_id }}"
description: "P90 hallucination score is {{ $value | printf \"%.2f\" }} — review recent prompts/model"
# Alert 5: Guardrail Block Surge > 20% in 15 minutes
- alert: LLMGuardrailBlockSurge
expr: |
(
sum by(agent_id) (rate(llm_guardrail_decisions_total{decision="block"}[15m]))
/
sum by(agent_id) (rate(llm_request_duration_seconds_count[15m]))
) * 100 > 20
for: 5m
labels:
severity: warning
team: llmops
annotations:
summary: "🛡 Guardrail Block Surge: {{ $labels.agent_id }}"
description: "{{ $value | printf \"%.1f%%\" }} of requests blocked — possible attack or prompt issue"
# Alert 6: Token Quota Approaching 80% of Daily Limit
- alert: LLMTokenQuotaWarning
expr: |
(
sum by(tenant_id) (increase(llm_tokens_total[24h]))
/
on(tenant_id) llm_token_daily_quota
) * 100 > 80
for: 0m
labels:
severity: warning
team: platform
annotations:
summary: "📊 Token Quota Warning: {{ $labels.tenant_id }}"
description: "Tenant {{ $labels.tenant_id }} has used {{ $value | printf \"%.0f%%\" }} of daily token quota"
# Alert 7: Circuit Breaker OPEN
- alert: LLMCircuitBreakerOpen
expr: llm_circuit_breaker_state{state="open"} == 1
for: 2m
labels:
severity: critical
team: llmops
annotations:
summary: "⚡ Circuit Breaker OPEN: {{ $labels.agent_id }}"
description: "LLM circuit breaker opened for {{ $labels.agent_id }} — service may be degraded"
# Alert 8: Memory/Context Overflow Rate Spike
- alert: LLMContextOverflowSpike
expr: |
(
sum by(agent_id) (rate(llm_errors_total{error_type="context_length_exceeded"}[10m]))
/
sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
) * 100 > 5
for: 5m
labels:
severity: warning
team: llmops
annotations:
summary: "💾 Context Overflow Spike: {{ $labels.agent_id }}"
description: "{{ $value | printf \"%.1f%%\" }} requests hitting context limit — review chunking/truncation strategy"
# alertmanager.yaml — Routing + Slack Webhook
---
route:
group_by: ['alertname', 'agent_id']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'slack-llmops'
routes:
- match:
severity: critical
receiver: 'slack-critical-llmops'
group_wait: 10s
repeat_interval: 1h
- match:
team: ai-quality
receiver: 'slack-ai-quality'
receivers:
- name: 'slack-llmops'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#llmops-alerts'
title: '{{ template "slack.title" . }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Details:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook }}
{{ end }}
send_resolved: true
- name: 'slack-critical-llmops'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#llmops-critical'
color: 'danger'
title: '🚨 CRITICAL: {{ template "slack.title" . }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Details:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook }}
{{ end }}
send_resolved: true
- name: 'slack-ai-quality'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#ai-quality-alerts'
title: '{{ template "slack.title" . }}'
send_resolved: true
10. A/B Testing Prompts & Model Routing
10.1. Traffic Splitting Architecture
INCOMING REQUESTS
│
▼
┌───────────────────────┐
│ FEATURE FLAG │
│ SERVICE │
│ (LaunchDarkly / │
│ self-hosted) │
└──────────┬────────────┘
│
┌─────────────┼──────────────┐
│ 90% │ 10% │
▼ ▼ │
┌──────────┐ ┌──────────┐ │
│ Prompt A │ │ Prompt B │ Shadow Mode
│ (control)│ │(canary) │ │
└────┬─────┘ └────┬─────┘ │
│ │ ┌───▼───────┐
▼ ▼ │ Duplicate │
LLM Response LLM Response │ Request │
│ (no user │
Track: │ impact) │
- Latency └─────┬─────┘
- Quality score │
- Cost ▼
- User satisfaction Evaluation
(offline)
10.2. Python — Model Router with Weighted Random Selection
import random
import time
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Callable
import structlog
logger = structlog.get_logger()
class RoutingStrategy(Enum):
WEIGHTED_RANDOM = "weighted_random"
TENANT_BASED = "tenant_based"
TASK_COMPLEXITY = "task_complexity"
CANARY = "canary"
SHADOW = "shadow"
@dataclass
class ModelConfig:
model: str
    weight: float  # 0.0 - 1.0; the weights of all configs in an experiment must sum to 1.0
variant_name: str # "control", "canary_v2", "shadow"
max_tokens: int = 4096
temperature: float = 0.7
extra_params: dict = field(default_factory=dict)
@dataclass
class RoutingDecision:
model_config: ModelConfig
strategy_used: str
routing_reason: str
experiment_id: Optional[str] = None
class AIAgentModelRouter:
"""
Model Router với nhiều chiến lược:
- A/B test (weighted random)
- Per-tenant routing
- Task complexity routing
- Shadow mode (duplicate traffic)
"""
def __init__(self):
# A/B test configurations
self._ab_experiments: dict[str, list[ModelConfig]] = {}
# Tenant-specific routing
self._tenant_routing: dict[str, ModelConfig] = {}
# Default routing by task type
self._task_routing: dict[str, ModelConfig] = {
"simple_faq": ModelConfig(
model="gpt-4o-mini", weight=1.0, variant_name="control",
max_tokens=1024, temperature=0.3,
),
"complex_analysis": ModelConfig(
model="gpt-4o", weight=1.0, variant_name="control",
max_tokens=4096, temperature=0.7,
),
"sensitive_medical": ModelConfig(
model="ollama/llama3.1", weight=1.0, variant_name="on_premise",
max_tokens=2048, temperature=0.1,
),
"code_generation": ModelConfig(
model="claude-3-5-sonnet", weight=1.0, variant_name="control",
max_tokens=4096, temperature=0.2,
),
}
def register_ab_experiment(
self,
experiment_id: str,
configs: list[ModelConfig],
) -> None:
"""Đăng ký A/B experiment với weighted configs."""
total_weight = sum(c.weight for c in configs)
if abs(total_weight - 1.0) > 0.001:
raise ValueError(f"Weights must sum to 1.0, got {total_weight}")
self._ab_experiments[experiment_id] = configs
logger.info("ab_experiment_registered", experiment_id=experiment_id,
variants=[c.variant_name for c in configs])
def route(
self,
task_type: str,
tenant_id: str = "default",
session_id: str = "",
experiment_id: Optional[str] = None,
force_strategy: Optional[RoutingStrategy] = None,
) -> RoutingDecision:
"""Chọn model config dựa trên chiến lược routing."""
# 1. Tenant-specific override (highest priority)
if tenant_id in self._tenant_routing and not experiment_id:
config = self._tenant_routing[tenant_id]
return RoutingDecision(
model_config=config,
strategy_used=RoutingStrategy.TENANT_BASED.value,
routing_reason=f"Tenant {tenant_id} has dedicated model",
)
        # 2. A/B experiment (when an experiment_id is provided)
if experiment_id and experiment_id in self._ab_experiments:
configs = self._ab_experiments[experiment_id]
            # Sticky routing: the same session_id always gets the same variant (consistent UX)
if session_id:
hash_val = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
bucket = (hash_val % 1000) / 1000.0
else:
bucket = random.random()
cumulative = 0.0
for config in configs:
cumulative += config.weight
if bucket <= cumulative:
logger.info(
"ab_routing",
experiment_id=experiment_id,
variant=config.variant_name,
model=config.model,
session_id=session_id,
)
return RoutingDecision(
model_config=config,
strategy_used=RoutingStrategy.WEIGHTED_RANDOM.value,
routing_reason=f"A/B bucket {bucket:.3f} → {config.variant_name}",
experiment_id=experiment_id,
)
# 3. Task complexity routing (fallback)
config = self._task_routing.get(
task_type,
ModelConfig(model="gpt-4o-mini", weight=1.0, variant_name="default")
)
return RoutingDecision(
model_config=config,
strategy_used=RoutingStrategy.TASK_COMPLEXITY.value,
routing_reason=f"Task type '{task_type}' → {config.model}",
)
# ─── Sample Usage ──────────────────────────────────────────────────────────────
router = AIAgentModelRouter()
# Register an A/B experiment: 90% prompt A (gpt-4o-mini) vs 10% prompt B (gpt-4o)
router.register_ab_experiment(
experiment_id="exp_prompt_v2_vs_v3",
configs=[
ModelConfig(model="gpt-4o-mini", weight=0.90, variant_name="prompt_v2_control"),
ModelConfig(model="gpt-4o", weight=0.10, variant_name="prompt_v3_canary"),
],
)
decision = router.route(
task_type="simple_faq",
tenant_id="tenant-abc",
session_id="sess-xyz789",
experiment_id="exp_prompt_v2_vs_v3",
)
print(f"Model: {decision.model_config.model}")
print(f"Variant: {decision.model_config.variant_name}")
print(f"Strategy: {decision.strategy_used}")
10.3. Sample A/B Test Results
| Metric | Prompt A (control) | Prompt B (canary) | Δ | Verdict |
|---|---|---|---|---|
| Latency P95 (ms) | 1,820 | 2,340 | +28.6% | ❌ B is slower |
| Quality Score (LLM Judge) | 3.8/5 | 4.3/5 | +13.2% | ✅ B is better |
| Cost/request (USD) | $0.0021 | $0.0047 | +123.8% | ❌ B is more expensive |
| User satisfaction (CSAT) | 76% | 83% | +7% | ✅ B is better |
| Task completion rate | 88% | 92% | +4% | ✅ B is better |
| Hallucination rate | 4.2% | 1.8% | -57% | ✅ B is safer |
| Guardrail block rate | 1.8% | 1.2% | -33% | ✅ B is cleaner |
Conclusion: Prompt B (canary) delivers noticeably better quality but costs roughly 2x more. Decision: roll prompt B out to premium tenants (happy to pay) and keep prompt A for the free tier.
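Before acting on deltas like the CSAT lift above, it is worth checking that the difference is statistically significant at your traffic volume. A small sketch using a two-proportion z-test; the session counts are illustrative assumptions.

```python
import math

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Z statistic for the difference between two proportions (e.g. CSAT of variant A vs B)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative: 76% CSAT over 9,000 control sessions vs 83% over 1,000 canary sessions
z = two_proportion_z_test(6_840, 9_000, 830, 1_000)
print(f"z = {z:.2f}")  # |z| > 1.96 → significant at the 95% level
```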
11. Model Routing by Task Type
11.1. Decision Matrix
| Task type | Complexity | Recommended model | Cost/1K tokens | Latency P95 | Notes |
|---|---|---|---|---|---|
| Simple FAQ | Low | GPT-4o-mini / Gemini Flash | $0.00015 | < 500ms | 80% of traffic |
| Text summarization | Low-Medium | GPT-4o-mini | $0.00015 | < 800ms | |
| Analysis, comparison | Medium | GPT-4o / Claude 3.5 Sonnet | $0.005 | < 2s | |
| Complex reasoning | High | GPT-4o / Claude 3.5 | $0.005 | < 3s | 15% of traffic |
| Code generation | High | Claude 3.5 Sonnet | $0.003 | < 3s | |
| Medical/sensitive data | Any | Ollama on-premise | $0 (infra cost) | < 2s | Data never leaves the server |
| Real-time chat | Low | GPT-4o-mini (streaming) | $0.00015 | TTFT < 200ms | |
| Batch processing | Any | GPT-4o Batch API | 50% discount | Hours | Not real-time |
11.2. Cost vs Quality Trade-off Table
| Provider | Model | Input $/1M | Output $/1M | Quality Score | Latency | Data Privacy | Best for |
|---|---|---|---|---|---|---|---|
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 4.0/5 | Fast | Cloud | General, cost-sensitive |
| OpenAI | GPT-4o | $5.00 | $15.00 | 4.7/5 | Medium | Cloud | Complex reasoning |
| Anthropic | Claude 3 Haiku | $0.25 | $1.25 | 4.0/5 | Fast | Cloud | Safe, structured output |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 4.8/5 | Medium | Cloud | High quality, coding |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 3.9/5 | Very Fast | Cloud | Ultra low cost |
| Azure OpenAI | GPT-4o | $5.00 | $15.00 | 4.7/5 | Medium | Cloud (VNet) | Enterprise compliance |
| Ollama | Llama 3.1 70B | $0 (GPU) | $0 (GPU) | 4.0/5 | Medium | On-premise | Healthcare, banking |
| Ollama | Qwen2.5 7B | $0 (GPU) | $0 (GPU) | 3.6/5 | Fast | On-premise | Zero-cost, low-complexity tasks |
12. Sampling Strategy for Production
12.1. The Problem
Sampling 100% of traces in a production AI Agent:
- 10,000 requests/day × 5 spans/request = 50,000 spans/day
- Storage: ~2KB/span × 50,000 = 100MB/day of traces
- 3 months: ~9GB just for trace data
- Jaeger + object storage cost: ~$50-100/month
The solution: adaptive (tail-based) sampling.
12.2. Sampling Strategy
| Request type | Sampling Rate | Rationale |
|---|---|---|
| Error requests | 100% | Full debugging data needed |
| Slow requests (P95+) | 100% | Performance investigation |
| High-cost requests (>$0.10) | 100% | Cost audit |
| Guardrail blocked | 100% | Security audit |
| Normal successful requests | 10% | Statistical representation |
| Health checks / internal | 0% | Noise reduction |
Estimated trace storage (10,000 req/day):
Error rate 2% = 200 requests → 200 × 5 spans × 2KB = 2MB
Slow rate 5% = 500 requests → 500 × 5 spans × 2KB = 5MB
Normal 10% = 930 requests → 930 × 5 spans × 2KB = 9.3MB
Total/day ≈ 16.3MB (vs 100MB at 100% sampling)
Savings: ~84%
12.3. Python OTel Adaptive Sampler
import random
from opentelemetry.sdk.trace.sampling import (
Sampler,
SamplingResult,
Decision,
ALWAYS_ON,
ALWAYS_OFF,
)
from opentelemetry.trace import SpanKind
from opentelemetry.context import Context
from opentelemetry.util.types import Attributes
class AdaptiveLLMSampler(Sampler):
"""
Tail-based adaptive sampler cho LLM workload.
- Errors: 100%
- Slow requests: 100%
- Normal: configurable rate (default 10%)
"""
def __init__(
self,
normal_sample_rate: float = 0.10,
slow_threshold_ms: float = 3000.0,
high_cost_threshold_usd: float = 0.10,
):
self.normal_sample_rate = normal_sample_rate
self.slow_threshold_ms = slow_threshold_ms
self.high_cost_threshold_usd = high_cost_threshold_usd
def should_sample(
self,
parent_context: Context,
trace_id: int,
name: str,
kind: SpanKind = SpanKind.INTERNAL,
attributes: Attributes = None,
links: list = None,
trace_state: object = None,
) -> SamplingResult:
attrs = attributes or {}
# Rule 1: Always sample errors
if attrs.get("error", False) or attrs.get("http.status_code", 200) >= 500:
return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)
# Rule 2: Always sample slow requests
latency_ms = attrs.get("llm.latency_ms", 0)
if latency_ms > self.slow_threshold_ms:
return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)
# Rule 3: Always sample high-cost requests
cost_usd = attrs.get("llm.cost_usd", 0)
if cost_usd > self.high_cost_threshold_usd:
return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)
# Rule 4: Always sample guardrail blocks
if attrs.get("guardrail.decision") == "block":
return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)
# Rule 5: Normal sampling (10%)
if random.random() < self.normal_sample_rate:
return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes=attrs)
return SamplingResult(Decision.DROP)
def get_description(self) -> str:
return f"AdaptiveLLMSampler(normal={self.normal_sample_rate})"
# Usage with a TracerProvider:
# from opentelemetry.sdk.trace import TracerProvider
# provider = TracerProvider(sampler=AdaptiveLLMSampler(normal_sample_rate=0.10))
13. LLM Cost Management
13.1. Budgeting Per Tenant / Project
import time
import redis
from dataclasses import dataclass
from typing import Optional
import structlog
logger = structlog.get_logger()
@dataclass
class BudgetConfig:
tenant_id: str
daily_budget_usd: float
monthly_budget_usd: float
daily_token_limit: int
    alert_threshold_pct: float = 0.80  # Alert once 80% of the budget is used
    hard_stop: bool = True  # Stop LLM calls once the budget is exceeded
class LLMBudgetGuard:
"""
Middleware kiểm tra budget trước mỗi LLM call.
Sử dụng Redis để track real-time spending.
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self._budgets: dict[str, BudgetConfig] = {}
def register_budget(self, config: BudgetConfig) -> None:
self._budgets[config.tenant_id] = config
logger.info("budget_registered",
tenant_id=config.tenant_id,
daily_limit_usd=config.daily_budget_usd)
def _get_today_key(self, tenant_id: str) -> str:
today = time.strftime("%Y-%m-%d")
return f"llm_budget:daily:{tenant_id}:{today}"
def _get_month_key(self, tenant_id: str) -> str:
month = time.strftime("%Y-%m")
return f"llm_budget:monthly:{tenant_id}:{month}"
def check_budget(self, tenant_id: str, estimated_cost_usd: float) -> dict:
"""
Kiểm tra budget trước khi gọi LLM.
Returns: {"allowed": bool, "reason": str, "remaining_usd": float}
"""
config = self._budgets.get(tenant_id)
if not config:
return {"allowed": True, "reason": "no_budget_configured", "remaining_usd": float("inf")}
daily_key = self._get_today_key(tenant_id)
current_daily = float(self.redis.get(daily_key) or 0)
projected_daily = current_daily + estimated_cost_usd
# Hard stop check
if config.hard_stop and projected_daily > config.daily_budget_usd:
logger.warning(
"budget_exceeded",
tenant_id=tenant_id,
current_cost=current_daily,
daily_limit=config.daily_budget_usd,
)
return {
"allowed": False,
"reason": "daily_budget_exceeded",
"remaining_usd": max(0, config.daily_budget_usd - current_daily),
}
# Alert threshold check
if projected_daily > config.daily_budget_usd * config.alert_threshold_pct:
logger.warning(
"budget_threshold_warning",
tenant_id=tenant_id,
pct_used=projected_daily / config.daily_budget_usd,
)
return {
"allowed": True,
"reason": "within_budget",
"remaining_usd": config.daily_budget_usd - current_daily,
}
def record_usage(self, tenant_id: str, actual_cost_usd: float) -> None:
"""Ghi nhận chi phí thực tế sau khi LLM call hoàn thành."""
daily_key = self._get_today_key(tenant_id)
month_key = self._get_month_key(tenant_id)
pipe = self.redis.pipeline()
pipe.incrbyfloat(daily_key, actual_cost_usd)
        pipe.expire(daily_key, 86400 * 2)  # 2-day TTL
        pipe.incrbyfloat(month_key, actual_cost_usd)
        pipe.expire(month_key, 86400 * 35)  # 35-day TTL
pipe.execute()
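A short usage sketch for the budget guard above, wiring it around the instrumented client from Section 5.6. The Redis URL, tenant ID, budget figures, and the $0.01 pre-call estimate are illustrative assumptions.

```python
import redis

redis_client = redis.Redis.from_url("redis://localhost:6379/0")  # assumed local Redis
budget_guard = LLMBudgetGuard(redis_client)
budget_guard.register_budget(BudgetConfig(
    tenant_id="tenant-abc",
    daily_budget_usd=50.0,
    monthly_budget_usd=1_000.0,
    daily_token_limit=5_000_000,
))

async def guarded_chat(client: InstrumentedLLMClient, messages: list[dict], tenant_id: str) -> dict:
    """Check the budget, call the LLM, then record the actual spend."""
    check = budget_guard.check_budget(tenant_id, estimated_cost_usd=0.01)
    if not check["allowed"]:
        raise RuntimeError(f"LLM budget exceeded for {tenant_id}: {check['reason']}")
    result = await client.chat_completion(messages, tenant_id=tenant_id)
    budget_guard.record_usage(tenant_id, result["cost_usd"])
    return result
```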
13.2. Pricing Tier Comparison
| Criterion | OpenAI GPT-4o | Anthropic Claude 3.5 | Azure OpenAI | Ollama Self-hosted |
|---|---|---|---|---|
| Input price | $5/1M tokens | $3/1M tokens | $5/1M tokens | ~$0.15/1M (GPU cost) |
| Output price | $15/1M tokens | $15/1M tokens | $15/1M tokens | ~$0.15/1M (GPU cost) |
| Data privacy | OpenAI servers | Anthropic servers | Azure VNet | Fully on-premise |
| Compliance | SOC2, GDPR (opt-out) | SOC2, HIPAA add-on | HIPAA, FedRAMP | Self-managed |
| Rate limits | 10K RPM | 5K RPM | Custom | Unlimited |
| SLA uptime | 99.9% | 99.9% | 99.9% | Self-managed |
| Setup complexity | Low | Low | Medium | High (GPU infra) |
| Upfront cost | $0 | $0 | Azure subscription | GPU server ~$2,000+ |
| Cost at 1M requests/day | ~$3,500/day | ~$2,100/day | ~$3,500/day | ~$50/day (amortized) |
| Best for | General, startups | High quality | Enterprise | Healthcare, Banking |
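Per-day figures like those in the last comparison row come from multiplying request volume, average tokens per request, and the per-million-token rates. A minimal estimator sketch; the average token counts per request are assumptions (short, FAQ-style traffic), so treat the output as an order-of-magnitude estimate.

```python
# Illustrative per-million-token rates (USD) taken from the comparison table above
PROVIDER_RATES = {
    "openai_gpt4o": {"input": 5.00, "output": 15.00},
    "anthropic_claude_35_sonnet": {"input": 3.00, "output": 15.00},
}

def daily_cost(requests_per_day: int, rates: dict, avg_input_tokens: int = 500, avg_output_tokens: int = 65) -> float:
    """Rough daily spend for one provider at a given request volume (token averages are assumptions)."""
    input_cost = requests_per_day * avg_input_tokens / 1_000_000 * rates["input"]
    output_cost = requests_per_day * avg_output_tokens / 1_000_000 * rates["output"]
    return input_cost + output_cost

for name, rates in PROVIDER_RATES.items():
    print(f"{name}: ~${daily_cost(1_000_000, rates):,.0f}/day")
```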
14. Incident Response for AI Agents
14.1. Runbook — Hallucination Rate Spike
INCIDENT: Hallucination Rate > 10%
════════════════════════════════════
T+0min: Alert received via Slack #llmops-critical
T+2min: On-call engineer acknowledges the alert
INVESTIGATION STEPS:
1. Grafana → Quality Dashboard → Hallucination Timeline
- Pinpoint: when did it start? All agents or one specific agent?
- Review the top sessions with the highest hallucination_score
2. Elasticsearch query:
GET ai-agent-logs-*/_search
{ "query": { "range": { "hallucination_probability": { "gte": 0.3 } } },
"sort": [{"@timestamp": "desc"}], "size": 20 }
3. Check: was there a recent prompt version change?
git log --oneline prompts/ | head -20
4. Check: did the model provider update the model?
- OpenAI model version log
- The pinned model version in the config
MITIGATION:
- If caused by a prompt change → roll back the prompt version immediately
- If caused by a model update → pin a specific model version (gpt-4o-2024-11-20)
- If the root cause is unclear → enable HITL mode (escalate all uncertain responses)
- Notify stakeholders via #llmops-incidents
RESOLUTION CRITERIA:
- Hallucination rate < 5% sustained for 15 minutes
POST-INCIDENT:
- Post-mortem within 48h
- Update the runbook if needed
14.2. Runbook — Cost Spike
INCIDENT: Daily LLM Cost > 150% of Baseline
══════════════════════════════════════════
T+0min: Cost spike alert
T+2min: Acknowledge and start investigating
INVESTIGATION:
1. Prometheus query: which tenant is consuming the most cost?
sum by(tenant_id) (rate(llm_cost_usd_total[1h])) * 3600
2. Elasticsearch: which sessions have unusually high cost?
(Query 2 from Section 7.4)
3. Check: abnormal token counts?
- Input tokens > 5,000 per request → likely context stuffing
- Output tokens > 2,000 → likely a verbose prompt
4. Check: retry loop?
sum by(agent_id) (rate(llm_errors_total{error_type="RateLimitError"}[10m]))
MITIGATION (in order):
1. Disable the offending tenant if the activity looks suspicious
2. Enable a hard token quota limit immediately
3. Temporarily reduce max_tokens in the model config
4. Scale down replicas if there is a request flood
POST-INCIDENT: Review per-tenant token quotas and update the budget config
14.3. Post-Mortem Template
# Post-Mortem: [Incident Name]
**Date**: YYYY-MM-DD
**Severity**: Critical / High / Medium
**Duration**: X hours Y minutes
**MTTR**: X hours Y minutes
## Impact
- Users affected: XXX
- Revenue impact: $XXX
- Extra cost incurred: $XXX
## Timeline
| Time | Event |
|-----------|---------|
| HH:MM | Alert triggered |
| HH:MM | On-call engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |
## Root Cause
[Describe the root cause]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## What Went Well
- [...]
## What Could Be Improved
- [...]
## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [...] | [...] | [...] | High |
## Lessons Learned
[...]
14.4. MTTR Targets for AI Incidents
| Severity | Examples | Response Time | MTTR Target |
|---|---|---|---|
| P0 - Critical | Cost spike $1K+, mass data leak | 5 minutes | 30 minutes |
| P1 - High | Error rate > 10%, hallucination surge | 15 minutes | 2 hours |
| P2 - Medium | Latency degradation, quality drop | 1 hour | 8 hours |
| P3 - Low | Logging gap, minor metric anomaly | Next business day | 3 days |
15. Production Readiness Checklist — 3 Levels
🥉 MVP Level (the minimum to go live)
Basic monitoring (10 items):
- Prometheus `/metrics` endpoint is exposed
- LLM latency (p50, p95) is tracked
- Error rate is tracked per agent_id
- Token counts (input + output) are counted
- Cost is tracked per day
- Basic Grafana dashboard with latency + errors
- Alert for error rate > 10%
- Alert for cost spike > 200% of baseline
- Structured JSON logging (request_id, session_id, latency, tokens)
- Logs are shipped to Elasticsearch / Loki
Basic reliability (8 items):
- Timeout configured (max 30s per LLM call)
- Retry with exponential backoff (max 3 retries; see the sketch after this list)
- Rate limit handling (429 error → honor retry-after)
- Circuit breaker configured for the LLM provider
- Graceful degradation when the LLM is unavailable
- Health check endpoint `/health` reporting LLM connectivity
- Token limit guard (max_tokens configured)
- Context length check before calling the LLM
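As referenced in the reliability list above, here is a minimal sketch of a per-call timeout plus retry with exponential backoff, assuming the AsyncOpenAI client; the delay schedule and the set of retryable exceptions are illustrative choices.

```python
import asyncio
import random
from openai import APIConnectionError, APITimeoutError, AsyncOpenAI, RateLimitError

async def call_llm_with_retry(messages: list[dict], model: str = "gpt-4o-mini", max_retries: int = 3):
    """Call the LLM with a 30s per-call timeout and exponential backoff on transient errors."""
    client = AsyncOpenAI(timeout=30.0)  # max 30s per LLM call, as in the checklist
    for attempt in range(max_retries + 1):
        try:
            return await client.chat.completions.create(model=model, messages=messages)
        except (APITimeoutError, RateLimitError, APIConnectionError):
            if attempt == max_retries:
                raise  # out of retries: let the circuit breaker / fallback layer handle it
            delay = (2 ** attempt) + random.uniform(0, 0.5)  # 1s, 2s, 4s plus jitter
            await asyncio.sleep(delay)
```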
🥈 Cấp Production (Đầy đủ cho Enterprise)
Observability nâng cao (12 items):
- OpenTelemetry SDK integrated đầy đủ (traces + metrics + logs)
- Distributed tracing với context propagation qua tất cả agents
- TTFT (Time To First Token) tracking cho streaming responses
- Per-tenant cost breakdown dashboard
- Hallucination rate monitoring (sampled evaluation pipeline)
- Guardrail decision logging với reason codes
- Tool call latency histogram per tool
- Memory/context usage tracking
- Session timeline reconstruction từ traces
- Kibana/Grafana Explore for ad-hoc investigation
- Automated daily cost report → email/Slack
- ILM policy cho log retention (hot/warm/cold/delete)
Alerting đầy đủ (8 items):
- Tất cả 8 alert rules từ Section 9 được configured
- Alert routing theo team/severity
- PagerDuty / on-call rotation integrated
- Runbook link trong mọi alert annotation
- Alert fatigue review (tune thresholds sau 2 tuần)
- Dead man's switch (alert nếu metrics stop flowing)
- Cost budget alerts per tenant
- SLA breach prediction alert (leading indicator)
Reliability production (10 items):
- Multi-region LLM provider failover
- Budget guard middleware cho mọi tenant
- Token quota enforcement per tenant per day
- Adaptive sampling cho traces (không 100%)
- A/B testing framework ready
- Model versioning pinned (không dùng “latest”)
- Prompt versioning với git + experiment tracking
- Shadow mode testing cho model upgrades
- Load testing với realistic token distribution
- Chaos engineering: LLM provider outage drill
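Mục multi-provider failover ở trên có thể bắt đầu bằng một chuỗi provider thử tuần tự, trước khi cần tới giải pháp routing phức tạp hơn. Sketch dưới đây chỉ là minh họa; các hàm `call_openai`, `call_azure_openai`, `call_anthropic` là placeholder giả định cho wrapper của từng provider:

```python
from typing import Callable

class AllProvidersFailedError(Exception):
    pass

def call_openai(prompt: str) -> str:
    raise NotImplementedError  # giả định: wrapper provider chính

def call_azure_openai(prompt: str) -> str:
    raise NotImplementedError  # giả định: failover thứ nhất

def call_anthropic(prompt: str) -> str:
    raise NotImplementedError  # giả định: failover thứ hai

PROVIDER_CHAIN: list[tuple[str, Callable[[str], str]]] = [
    ("openai", call_openai),
    ("azure_openai", call_azure_openai),
    ("anthropic", call_anthropic),
]

def call_with_failover(prompt: str) -> str:
    """Thử lần lượt từng provider; nên log provider được dùng để theo dõi tỷ lệ failover."""
    errors: list[str] = []
    for name, provider in PROVIDER_CHAIN:
        try:
            return provider(prompt)
        except Exception as exc:
            # Production: kết hợp circuit breaker per provider để không thử provider đang chết
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailedError("; ".join(errors))
```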
🥇 Cấp Enterprise (Đầy đủ nhất)
Advanced LLMOps (12 items):
- Full MLflow / LangSmith experiment tracking integration
- Automated evaluation pipeline chạy hourly trên sampled traffic
- Model drift detection với statistical tests (KS test, Chi-square)
- Prompt regression test suite chạy trên mỗi deployment
- Multi-model cost optimization engine (auto-route based on task)
- LLM request caching (semantic cache với Redis + vector similarity)
- Streaming token profiling (tốc độ generation, jitter)
- Custom SLOs: error budget tracking, burn rate alerts
- Capacity planning dashboard (projected cost 30/60/90 days)
- Fine-tuning pipeline với evaluation gate trước deploy
- Cross-tenant benchmarking (ẩn danh)
- Regulatory audit trail xuất report PDF/Excel on demand
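Với mục model drift detection, một cách tiếp cận đơn giản là so sánh phân phối của một chỉ số đầu ra (relevance score, output length...) giữa tuần baseline và tuần hiện tại bằng KS test. Sketch minh họa với `scipy`; ngưỡng p-value 0.01 và hàm `load_scores` là giả định:

```python
from scipy.stats import ks_2samp

def detect_drift(baseline_scores: list[float],
                 current_scores: list[float],
                 p_threshold: float = 0.01) -> bool:
    """Trả về True nếu hai phân phối khác nhau có ý nghĩa thống kê → nghi ngờ drift."""
    result = ks_2samp(baseline_scores, current_scores)
    return result.pvalue < p_threshold

# Ví dụ dùng trong evaluation pipeline chạy định kỳ (load_scores là giả định):
# baseline = load_scores(week_offset=-1)
# current = load_scores(week_offset=0)
# if detect_drift(baseline, current):
#     alert("Model drift: phân phối score thay đổi đáng kể, kiểm tra model version")
```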
Security & Compliance (10 items):
- Mọi LLM interaction được log với immutable audit trail
- PII detection và masking trong log pipeline
- Data residency enforcement (EU data → EU LLM endpoint)
- Penetration test cho prompt injection vectors
- GDPR Article 22 compliance (explain AI decision)
- SOC 2 Type II evidence collection automated
- Monthly third-party security review của LLM configs
- Incident response drill quarterly
- Vendor lock-in mitigation plan (multi-provider routing)
- Contractual SLA với LLM providers documented
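Mục PII detection và masking trong log pipeline có thể khởi đầu bằng một scrubber regex chạy trước khi ship log, sau đó mới nâng cấp lên NER hoặc các công cụ như Presidio. Sketch tối giản dưới đây chỉ mang tính minh họa; pattern chắc chắn chưa phủ hết các định dạng PII (CCCD, số tài khoản, địa chỉ...) trong dữ liệu thật:

```python
import re

# Pattern minh họa: email và số điện thoại VN; cần mở rộng theo dữ liệu thực tế
PII_PATTERNS = [
    (re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"), "<EMAIL>"),
    (re.compile(r"(?:\+84|0)\d{8,10}\b"), "<PHONE>"),
]

def scrub_pii(record: dict) -> dict:
    """Mask PII trong mọi giá trị string của một log record trước khi ship đi."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            for pattern, replacement in PII_PATTERNS:
                value = pattern.sub(replacement, value)
        cleaned[key] = value
    return cleaned

# Ví dụ:
# record = {"user_query": "Liên hệ tôi qua anh.a@example.com hoặc 0912345678"}
# ship_to_elasticsearch(scrub_pii(record))  # ship_to_elasticsearch là giả định
```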
16. KPI Vận Hành, Chi Phí Platform, ROI Analysis
16.1. KPI Vận Hành Theo Tháng
| KPI | MVP Target | Production Target | Enterprise Target |
|---|---|---|---|
| System Uptime | 99.0% | 99.5% | 99.9% |
| Avg Response Latency | < 5s | < 3s | < 2s |
| Error Rate | < 5% | < 1% | < 0.5% |
| Hallucination Rate | < 10% | < 3% | < 1% |
| Task Completion Rate | > 80% | > 90% | > 95% |
| Cost per Query | < $0.20 | < $0.08 | < $0.03 |
| MTTR (P1 incident) | 4h | 2h | 30min |
| User Satisfaction | > 70% | > 80% | > 90% |
16.2. Chi Phí Platform Observability Stack
| Thành phần | Self-hosted / Free tier | SaaS / Managed | Ghi chú |
|---|---|---|---|
| OpenTelemetry Collector | $0 (self-hosted) | $0 (open source) | K8s deployment |
| Prometheus | $0 | $0 | Thêm Thanos cho HA |
| Grafana | $0 (OSS) | $29-299/mo | OSS đủ dùng |
| Jaeger/Tempo | $0 + S3 storage | $50-200/mo | Tempo rẻ hơn Jaeger |
| Elasticsearch | $200-500/mo (3 nodes) | $95-500/mo (ES Cloud) | ES Cloud cho managed |
| Alertmanager | $0 | $0 | Bundled với Prometheus |
| Pyroscope | $0 | $0 | Continuous profiling; Grafana Phlare đã hợp nhất vào Pyroscope |
| Total (Self-hosted) | ~$200-500/mo | - | Cho quy mô ~10K req/ngày |
| Total (Full SaaS) | - | ~$500-1,200/mo | Managed, ít ops effort |
16.3. ROI Analysis
Scenario: 10,000 LLM queries/ngày, team 5 người
BEFORE (không có LLMOps):
- Incident detection lag: 4-6 giờ
- Mỗi incident: 3-4 giờ engineer time debug = ~$300 loss/incident
- 2 incidents/tháng = $600/month waste
- Overspending do không track cost: ~$400/month (estimated 20% waste)
- Hallucination → user churn: 5% users/month = $2,000 MRR loss
Total monthly loss without LLMOps: ~$3,000
AFTER (với LLMOps stack đầy đủ):
- Platform cost: $500/month
- Incident MTTR giảm từ 4h → 30min (P1): save $250/incident × 2 = $500/month
- Cost optimization (routing + quota): save 15-25% = ~$300-500/month
- Hallucination detection → user churn giảm 60%: save $1,200/month
Total monthly saving: ~$2,000 - $2,200
ROI = (2,000 - 500) / 500 × 100 ≈ 300% (tính theo mức saving thận trọng $2,000/tháng)
Payback period: < 1 month
17. Ma Trận Rủi Ro Vận Hành
| # | Rủi ro | Xác suất | Tác động | Mức độ | Biện pháp giảm thiểu |
|---|---|---|---|---|---|
| 1 | LLM provider outage (OpenAI, Anthropic) | Trung bình | Cao | 🔴 High | Multi-provider failover; local Ollama fallback |
| 2 | Cost runaway (prompt loop, token exploit) | Thấp-TB | Rất cao | 🔴 High | Budget guard; hard token quota; real-time cost alert |
| 3 | Silent model degradation (provider update model) | Trung bình | Cao | 🔴 High | Pin model version; automated regression eval weekly |
| 4 | Log/trace data explosion (misconfigured sampler) | Thấp | Trung bình | 🟠 Medium | Adaptive sampling; storage quota alert |
| 5 | Alert fatigue (too many false positives) | Cao | Trung bình | 🟠 Medium | Tune thresholds sau 2 tuần; alert review cadence |
| 6 | PII leak via logs (unmasked user data in structured logs) | Thấp | Rất cao | 🔴 High | Log scrubber middleware; PII regex masking pipeline |
| 7 | Dashboard blindspot (metric not instrumented) | Trung bình | Trung bình | 🟠 Medium | Coverage checklist; quarterly observability audit |
| 8 | Observer effect (OTel overhead degrades performance) | Thấp | Thấp | 🟡 Low | Benchmark OTel overhead (<1ms target); async exporters |
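Với rủi ro #8 (observer effect), điểm mấu chốt là không export span đồng bộ trong request path. Sketch dưới đây cấu hình OTel Python SDK với `BatchSpanProcessor` (ghi span vào queue in-memory, export theo batch ở background thread); endpoint collector, service name và các tham số batch là giả định, nên benchmark lại trên hệ thống của bạn:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "rag-agent"})  # tên service chỉ là ví dụ
)

# BatchSpanProcessor export bất đồng bộ: request path chỉ tốn chi phí enqueue span,
# không phải chờ network call tới collector như SimpleSpanProcessor.
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True),
        max_queue_size=2048,          # giả định, tune theo throughput thực tế
        schedule_delay_millis=5000,
        max_export_batch_size=512,
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
```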
18. Roadmap Triển Khai LLMOps — 3 Giai Đoạn
🚀 Giai Đoạn 1 — Foundation (Tuần 1-2)
Tuần 1:
- Deploy OTel Collector + Prometheus + Grafana lên K8s (Helm charts)
- Integrate OTel SDK vào tất cả agent services
- Setup basic Grafana dashboard (latency, errors, cost)
- Configure basic alerts (error rate, cost spike)
- Ship structured logs vào Elasticsearch
Tuần 2:
- Distributed tracing end-to-end (orchestrator → sub-agents)
- Token + cost tracking per agent per tenant
- Budget guard middleware deployed
- ILM policy cho Elasticsearch
- On-call rotation setup, runbooks viết xong
Deliverable: Hệ thống có thể detect P1 incident trong < 5 phút
⚙️ Giai Đoạn 2 — Quality & Cost (Tuần 3-6)
Tuần 3-4:
- Hallucination evaluation pipeline (sampled, async)
- Guardrail decision logging đầy đủ
- Per-tenant cost dashboard + daily email report
- A/B testing framework (canary deployment)
- Model router theo task complexity
Tuần 5-6:
- Adaptive sampling thay thế 100% sampling
- Semantic cache (Redis + vector similarity)
- Post-mortem process chính thức
- Alert tuning (giảm false positives)
- Quality SLO dashboard (error budget, burn rate)
Deliverable: Cost giảm 20-30%, hallucination rate visible và monitored
🏆 Giai Đoạn 3 — Enterprise Grade (Tuần 7-12)
Tuần 7-9:
- Full MLflow / LangSmith integration
- Model drift detection automated
- Prompt regression test suite CI/CD
- Multi-provider failover (OpenAI → Azure OpenAI → Anthropic)
- Capacity planning dashboard
Tuần 10-12:
- Compliance audit trail (immutable, exportable)
- PII masking trong log pipeline
- Chaos engineering drill (LLM outage simulation)
- Security penetration test cho LLM attack vectors
- Documentation, runbooks, training cho team
Deliverable: Full LLMOps maturity — incident MTTR < 30min, cost optimized, compliance-ready
19. Kết Luận
Trong bài này chúng ta đã xây dựng hoàn chỉnh hệ thống Monitoring & Observability cho AI Agent trong production — từ lý thuyết đến code thực tế:
| Thành phần đã xây dựng | Giá trị |
|---|---|
| LLMOps vs DevOps — 10 chiều so sánh | Hiểu rõ tại sao cần observability riêng |
| Kiến trúc OTel 4 pillars | Framework đầy đủ cho mọi quy mô |
| 25+ metrics với 5 nhóm | Biết chính xác cần đo gì |
| OTel instrumentation (Python) | Có thể implement ngay hôm nay |
| LangChain callback handler | Tracing tự động cho LangChain agents |
| Structured log schema + ES mapping | Log chuẩn, searchable, auditable |
| Grafana JSON config | Dashboard production-ready |
| 8 alert rules + Prometheus YAML | Alert coverage đầy đủ |
| A/B testing framework | Cải tiến prompt/model dựa trên data |
| Model router (Python) | Cost optimization tự động |
| Adaptive sampler | Giảm 84% storage cost traces |
| Budget guard | Ngăn cost runaway |
| Incident runbooks | Response nhanh, MTTR thấp |
| Checklist 70 items, 3 cấp độ | Roadmap rõ ràng từ MVP đến Enterprise |
| ROI 300% | Justify investment với stakeholders |
Nguyên Tắc Vàng Cho LLMOps
“Bạn không thể quản lý những gì bạn không đo lường được. Trong thế giới LLM, điều này còn đúng hơn bất kỳ lĩnh vực nào khác — vì LLM có thể fail silently theo những cách mà không có metric nào trong DevOps truyền thống bắt được.”
📌 Bài Tiếp Theo
Bài 8: Use Case Thực Chiến — AI Agent trong Doanh nghiệp Việt Nam
Sau khi đã có đầy đủ nền tảng từ kiến trúc, memory, guardrails đến monitoring, bài tiếp theo sẽ đưa tất cả vào thực tế với 3 use case thực chiến tại doanh nghiệp Việt Nam:
- Healthcare: AI Agent hỗ trợ bác sĩ tra cứu phác đồ điều trị, tích hợp HIS/EMR
- Banking/Fintech: AI Agent tư vấn sản phẩm tài chính, KYC automation
- Retail/E-commerce: AI Agent chăm sóc khách hàng đa kênh (Zalo, Web, App)
Mỗi use case đều bao gồm: kiến trúc chi tiết, tech stack, chi phí, timeline triển khai và bài học thực tế.
💡 Tip thực chiến: Bắt đầu với Giai Đoạn 1 (tuần 1-2) ngay khi có AI Agent đầu tiên lên production. Đừng chờ “có thời gian” — một cost runaway hay hallucination incident không báo trước sẽ khiến bạn phải xây observability trong tình trạng khủng hoảng, vừa tốn kém vừa stress. Ship observability cùng lúc với feature — đó là văn hóa LLMOps trưởng thành.