<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Elasticsearch on &lt;Vunb /></title><link>https://vunb.github.io/tags/elasticsearch/</link><description>Recent content in Elasticsearch on &lt;Vunb /></description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>Vunb &amp;copy; {year}</copyright><lastBuildDate>Thu, 14 May 2026 00:00:00 +0700</lastBuildDate><atom:link href="https://vunb.github.io/tags/elasticsearch/index.xml" rel="self" type="application/rss+xml"/><item><title>Monitoring &amp; Observability: Operating AI Agents in Production</title><link>https://vunb.github.io/tutorials/ai-agent/monitoring-va-observability-van-hanh-ai-agent-trong-production/</link><pubDate>Thu, 14 May 2026 00:00:00 +0700</pubDate><guid>https://vunb.github.io/tutorials/ai-agent/monitoring-va-observability-van-hanh-ai-agent-trong-production/</guid><description>&lt;h2 id="1-ti-sao-production-ai-agent-cn-observability-ring">1. Why Do Production AI Agents Need Their Own Observability?&lt;/h2>
&lt;p>In the previous article, we built a Guardrails &amp;amp; Evaluation system to keep the AI Agent operating safely. But once thousands of real users rely on the agent every day, an entirely new question emerges:&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;How do I know the agent is working &lt;strong>correctly&lt;/strong>, &lt;strong>reliably&lt;/strong>, &lt;strong>within budget&lt;/strong>, and &lt;strong>delivering value&lt;/strong> right now, in production, 24/7?&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;p>Traditional monitoring (CPU, RAM, requests/s) is &lt;strong>not enough&lt;/strong> for an AI Agent. An agent can look completely &amp;ldquo;green&amp;rdquo; on a conventional DevOps dashboard while in reality it is:&lt;/p>
&lt;ul>
&lt;li>Answering incorrectly (the hallucination rate creeps up silently)&lt;/li>
&lt;li>Burning 3x the usual number of tokens because of a prompt loop&lt;/li>
&lt;li>Costing an extra $800/day because of one misconfigured model&lt;/li>
&lt;li>Stuck in a reasoning loop for 45 seconds without timing out&lt;/li>
&lt;/ul>
&lt;p>This is why &lt;strong>LLMOps&lt;/strong>, a distinct branch of MLOps, came into being.&lt;/p>
&lt;hr>
&lt;h2 id="2-llmops-vs-devops-truyn-thng--10-im-khc-bit-ct-li">2. LLMOps vs Traditional DevOps: 10 Core Differences&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>#&lt;/th>
&lt;th>Dimension&lt;/th>
&lt;th>Traditional DevOps&lt;/th>
&lt;th>LLMOps for AI Agents&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>&lt;strong>Determinism&lt;/strong>&lt;/td>
&lt;td>Deterministic: same input → same output&lt;/td>
&lt;td>Non-deterministic: same prompt → different outputs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>&lt;strong>Cost unit&lt;/strong>&lt;/td>
&lt;td>CPU hours, bandwidth (GB)&lt;/td>
&lt;td>Token (input + output) + API call cost&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>&lt;strong>Quality metrics&lt;/strong>&lt;/td>
&lt;td>Latency, error rate, uptime&lt;/td>
&lt;td>Hallucination rate, groundedness, relevance score&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>&lt;strong>Versioning&lt;/strong>&lt;/td>
&lt;td>Code + config versioning&lt;/td>
&lt;td>Code + config + &lt;strong>prompt versioning&lt;/strong> + &lt;strong>model versioning&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5&lt;/td>
&lt;td>&lt;strong>Drift&lt;/strong>&lt;/td>
&lt;td>Performance drift caused by hardware changes&lt;/td>
&lt;td>&lt;strong>Model drift&lt;/strong>: the provider silently updates the model&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6&lt;/td>
&lt;td>&lt;strong>Debugging&lt;/strong>&lt;/td>
&lt;td>Clear stack traces&lt;/td>
&lt;td>Complex, multi-hop reasoning traces that are hard to reproduce&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7&lt;/td>
&lt;td>&lt;strong>Testing&lt;/strong>&lt;/td>
&lt;td>Unit test, integration test&lt;/td>
&lt;td>Evaluation dataset, LLM-as-a-Judge, A/B testing&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>8&lt;/td>
&lt;td>&lt;strong>Rollback&lt;/strong>&lt;/td>
&lt;td>Rollback code/config&lt;/td>
&lt;td>Rollback prompt version + model version + memory state&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>9&lt;/td>
&lt;td>&lt;strong>Scaling&lt;/strong>&lt;/td>
&lt;td>Simple horizontal scaling&lt;/td>
&lt;td>Must balance token throughput, context window, and cost&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10&lt;/td>
&lt;td>&lt;strong>Compliance&lt;/strong>&lt;/td>
&lt;td>Log access, audit trail&lt;/td>
&lt;td>Log &lt;strong>every LLM interaction&lt;/strong> for compliance + audit&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="21-non-determinism--thch-thc-ln-nht">2.1. Non-Determinism: The Biggest Challenge&lt;/h3>
&lt;pre>&lt;code>DevOps: f(x) = y → always holds, testing once is enough
LLMOps: f(x) = y₁ | y₂ | y₃ | ... → tests must sample, evaluation must be statistical
&lt;/code>&lt;/pre>&lt;p>This means you cannot only monitor &lt;strong>whether errors occur&lt;/strong>; you must monitor &lt;strong>whether the output is correct&lt;/strong>, continuously and in probabilistic terms.&lt;/p>
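&lt;p>As a rough sketch of what sampling plus statistical evaluation can look like, the Python snippet below calls an agent several times with the same prompt and reports a pass rate with a 95% Wilson score interval instead of a single pass/fail. The names &lt;code>fake_agent&lt;/code>, &lt;code>sample_pass_rate&lt;/code>, and the grading lambda are hypothetical stand-ins and not part of any library; a real setup would call the actual agent and a real grader (for example LLM-as-a-Judge).&lt;/p>

```python
import math
import random


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (95% for z=1.96) for a pass rate of passes/n."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def sample_pass_rate(agent, is_correct, prompt: str, n: int = 50) -> dict:
    """Call the agent n times with the same prompt and summarise the results."""
    passes = sum(1 for _ in range(n) if is_correct(agent(prompt)))
    low, high = wilson_interval(passes, n)
    return {"n": n, "pass_rate": passes / n, "ci_low": low, "ci_high": high}


# Hypothetical stand-in for a non-deterministic agent (right roughly 90% of
# the time). The seeded default rng keeps the demo reproducible.
def fake_agent(prompt: str, rng=random.Random(7)) -> str:
    return "Paris" if rng.random() > 0.1 else "Lyon"


report = sample_pass_rate(fake_agent, lambda out: out == "Paris", "Capital of France?")
print(report)
```

&lt;p>Reporting an interval rather than a single number makes it visible when a sample is too small to tell a real regression from noise.&lt;/p>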
&lt;h3 id="22-token-economy--chi-ph-v-hnh">2.2. Token Economy: The Invisible Cost&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Scenario&lt;/th>
&lt;th>Tokens consumed&lt;/th>
&lt;th>Estimated cost&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1 simple FAQ question&lt;/td>
&lt;td>~500 tokens&lt;/td>
&lt;td>~$0.001&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1 complex consulting session (RAG + history)&lt;/td>
&lt;td>~8,000 tokens&lt;/td>
&lt;td>~$0.016&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>1 five-step agentic workflow&lt;/td>
&lt;td>~25,000 tokens&lt;/td>
&lt;td>~$0.050&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>10,000 users/day × agentic workflow&lt;/td>
&lt;td>250M tokens&lt;/td>
&lt;td>~$500/day&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Takeaway&lt;/strong>: Without cost monitoring, a small prompt bug (for example, an infinite retry loop) can burn through $2,000+ before anyone notices.&lt;/p>
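&lt;p>The figures in the table above are consistent with a blended price of about $2 per 1M tokens. That rate is an assumption inferred from the table, not a quoted provider price. Under that assumption, the arithmetic can be sketched in Python as follows; &lt;code>estimate_cost_usd&lt;/code> and &lt;code>daily_cost_usd&lt;/code> are illustrative helper names.&lt;/p>

```python
# Assumption: one blended price for input + output tokens, inferred from the
# table above (500 tokens ~ $0.001). Real providers usually price input and
# output tokens separately, so substitute your actual rates.
BLENDED_USD_PER_1M_TOKENS = 2.0


def estimate_cost_usd(tokens: float, usd_per_1m: float = BLENDED_USD_PER_1M_TOKENS) -> float:
    """Cost in USD for a given number of tokens at a flat per-million rate."""
    return tokens / 1_000_000 * usd_per_1m


def daily_cost_usd(users_per_day: int, tokens_per_workflow: int, runs_per_user: float = 1.0) -> float:
    """Daily cost when every user triggers the workflow runs_per_user times."""
    total_tokens = users_per_day * runs_per_user * tokens_per_workflow
    return estimate_cost_usd(total_tokens)


baseline = daily_cost_usd(10_000, 25_000)
# A retry loop that repeats every workflow 4 times multiplies the bill by 4.
with_retry_bug = daily_cost_usd(10_000, 25_000, runs_per_user=4)
print(f"baseline:  ${baseline:,.2f}/day")
print(f"retry bug: ${with_retry_bug:,.2f}/day")
```

&lt;p>With these assumed inputs, a bug that quadruples the number of workflow runs is enough to turn the $500/day baseline into the $2,000 range within a single day.&lt;/p>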
&lt;hr>
&lt;h2 id="3-kin-trc-observability-tng-th-cho-ai-agent">3. Overall Observability Architecture for AI Agents&lt;/h2>
&lt;pre>&lt;code>┌─────────────────────────────────────────────────────────────────────────────┐
│ LLMOPS OBSERVABILITY ARCHITECTURE — AI AGENT CLUSTER │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ AI AGENT CLUSTER │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────────┐ │
│ │Orchestrator │ │ RAG Agent │ │ Tool Agent │ │ Memory Agent │ │
│ │ Agent │ │ │ │ │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └───────┬───────┘ │
│ │ │ │ │ │
│ └────────────────┴────────────────┴──────────────────┘ │
│ │ │
│ OTel SDK (Python / .NET / Java) │
│ - Traces (spans + context propagation) │
│ - Metrics (counters, histograms, gauges) │
│ - Logs (structured JSON + trace_id correlation) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ OPENTELEMETRY COLLECTOR │
│ │
│ Receivers: OTLP gRPC/HTTP, Prometheus scrape, Fluentd │
│ Processors: Batch, Memory Limiter, Attribute Filter, Sampling │
│ Exporters: → Prometheus │ → Jaeger/Tempo │ → Elasticsearch │
└──────────────────┬──────────────────────────────────────────────────────┘
│
┌───────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌──────────┐ ┌──────────────────┐
│ PROMETHEUS │ │ JAEGER │ │ ELASTICSEARCH │
│ │ │ / TEMPO │ │ / OPENSEARCH │
│ Metrics │ │ │ │ │
│ Storage │ │ Distributed│ │ Log Storage │
│ &amp;amp; Query │ │ Traces │ │ Full-text Search │
└─────┬──────┘ └────┬─────┘ └────────┬─────────┘
│ │ │
└─────────────┴────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ GRAFANA DASHBOARD │
│ │
│ [Overview] [Token Economy] [Quality] [Agent Health] [Business KPI] │
└──────────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ ALERTMANAGER │
│ │
│ Rules: Cost Spike | Latency P95 | Error Rate | Hallucination Rate │
│ Routing: → Slack | PagerDuty | Email | Webhook │
└─────────────────────────────────────────────────────────────────────────┘
&lt;/code>&lt;/pre>&lt;h3 id="31-multi-agent-distributed-tracing-flow">3.1. Multi-Agent Distributed Tracing Flow&lt;/h3>
&lt;pre>&lt;code> USER REQUEST (request_id: req-abc123)
│
▼ [Trace Start — Span: &amp;quot;user_request&amp;quot;]
┌────────────────────────────────────┐
│ API GATEWAY / LB │
│ Inject: traceparent header │
└──────────────────┬─────────────────┘
│
▼ [Span: &amp;quot;orchestrator.process&amp;quot;]
┌────────────────────────────────────┐
│ ORCHESTRATOR AGENT │ t=0ms
│ - Parse intent │
│ - Plan sub-tasks │
└──┬──────────────┬──────────────────┘
│ │ │
▼ ▼ ▼
[Span: [Span: [Span:
&amp;quot;rag.retrieve&amp;quot;] &amp;quot;tool.call&amp;quot;] &amp;quot;memory.fetch&amp;quot;]
┌──────────┐ ┌──────────┐ ┌──────────┐
│RAG Agent │ │Tool Agent│ │Memory │
│t=5ms │ │t=5ms │ │Agent │
│ │ │ │ │t=5ms │
│ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐│
│ │Embed│ │ │ │API │ │ │ │Redis││
│ │Query│ │ │ │Call │ │ │ │Fetch││
│ └──┬──┘ │ │ └──┬──┘ │ │ └──┬──┘│
│ │ │ │ │ │ │ │ │
│ ┌──▼──┐ │ │ ┌──▼──┐ │ │ │ │
│ │Vecto│ │ │ │Tool │ │ │ │ │
│ │rDB │ │ │ │Resp │ │ │ │ │
│ └─────┘ │ │ └─────┘ │ │ │ │
└────┬─────┘ └────┬──────┘ └─────┬───┘
│ │ │
└──────────────┴───────────────┘
│
▼ [Span: &amp;quot;llm.generate&amp;quot;] t=120ms
┌─────────────────┐
│ LLM CALL │
│ GPT-4o / etc │
│ tokens: 2,340 │
│ latency: 1.8s │
└────────┬────────┘
│
▼ [Span: &amp;quot;output.guard&amp;quot;] t=1920ms
┌─────────────────┐
│ Output Guard │
│ Guardrails check│
└────────┬────────┘
│
▼ [Trace End] t=2050ms
FINAL RESPONSE → User
Total: 2,050ms | tokens: 2,340 | cost: $0.0047
&lt;/code>&lt;/pre>&lt;hr>
&lt;h2 id="4-bn-tr-ct-ca-llm-observability">4. The Four Pillars of LLM Observability&lt;/h2>
&lt;h3 id="41-pillar-1--metrics">4.1. Pillar 1 — Metrics&lt;/h3>
&lt;p>&lt;strong>Description&lt;/strong>: Numeric, time-series, aggregatable data, used for trending and alerting.&lt;/p>
&lt;p>&lt;strong>Suitable tools&lt;/strong>: Prometheus, Grafana, Datadog, New Relic&lt;/p>
&lt;p>&lt;strong>Sample data:&lt;/strong>&lt;/p>
&lt;pre>&lt;code># HELP llm_request_duration_seconds LLM request latency
# TYPE llm_request_duration_seconds histogram
llm_request_duration_seconds_bucket{agent=&amp;quot;rag_agent&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;,le=&amp;quot;0.5&amp;quot;} 42
llm_request_duration_seconds_bucket{agent=&amp;quot;rag_agent&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;,le=&amp;quot;1.0&amp;quot;} 180
llm_request_duration_seconds_bucket{agent=&amp;quot;rag_agent&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;,le=&amp;quot;2.0&amp;quot;} 312
llm_request_duration_seconds_bucket{agent=&amp;quot;rag_agent&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;,le=&amp;quot;5.0&amp;quot;} 398
llm_request_duration_seconds_bucket{agent=&amp;quot;rag_agent&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;,le=&amp;quot;+Inf&amp;quot;} 402
# HELP llm_tokens_total Total tokens consumed
# TYPE llm_tokens_total counter
llm_tokens_total{agent=&amp;quot;rag_agent&amp;quot;,type=&amp;quot;input&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;} 1284930
llm_tokens_total{agent=&amp;quot;rag_agent&amp;quot;,type=&amp;quot;output&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;} 423810
# HELP llm_cost_usd_total Total cost in USD
# TYPE llm_cost_usd_total counter
llm_cost_usd_total{agent=&amp;quot;rag_agent&amp;quot;,model=&amp;quot;gpt-4o&amp;quot;} 24.87
&lt;/code>&lt;/pre>&lt;h3 id="42-pillar-2--logs">4.2. Pillar 2 — Logs&lt;/h3>
&lt;p>&lt;strong>Description&lt;/strong>: Structured event records, used for debugging, auditing, and root-cause analysis.&lt;/p>
&lt;p>&lt;strong>Suitable tools&lt;/strong>: Elasticsearch, OpenSearch, Loki, Splunk&lt;/p>
&lt;p>&lt;strong>Sample data (JSON structured log):&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-json" data-lang="json">{
&lt;span style="color:#f92672">&amp;#34;timestamp&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2026-05-14T10:23:45.123Z&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;level&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;INFO&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;request_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;req-abc123&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;session_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;sess-xyz789&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;agent_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;rag_agent&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;model&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;gpt-4o&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;prompt_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1840&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;completion_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">420&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;total_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">2260&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;latency_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">2050&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;cost_usd&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.0045&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;guardrail_status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;passed&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;tool_calls&amp;#34;&lt;/span>: [&lt;span style="color:#e6db74">&amp;#34;search_knowledge_base&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;get_product_info&amp;#34;&lt;/span>],
&lt;span style="color:#f92672">&amp;#34;hallucination_score&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.12&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;user_satisfaction&amp;#34;&lt;/span>: &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;error&amp;#34;&lt;/span>: &lt;span style="color:#66d9ef">null&lt;/span>
}
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="43-pillar-3--traces">4.3. Pillar 3 — Traces&lt;/h3>
&lt;p>&lt;strong>Description&lt;/strong>: Distributed tracing, i.e. the timeline of a request as it passes through multiple services/agents.&lt;/p>
&lt;p>&lt;strong>Suitable tools&lt;/strong>: Jaeger, Grafana Tempo, Zipkin, AWS X-Ray&lt;/p>
&lt;p>&lt;strong>Sample span data:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-json" data-lang="json">{
&lt;span style="color:#f92672">&amp;#34;traceId&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;4bf92f3577b34da6a3ce929d0e0e4736&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;spanId&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;00f067aa0ba902b7&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;parentSpanId&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;b9c7c989f97918e1&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;operationName&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;llm.generate&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;serviceName&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;rag-agent&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;startTime&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1715677425120&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;duration&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1823000&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;tags&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;llm.model&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;gpt-4o&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;llm.input_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1840&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;llm.output_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">420&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;llm.cost_usd&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.0045&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;agent.id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;rag_agent&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;guardrail.status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;passed&amp;#34;&lt;/span>
}
}
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="44-pillar-4--profiles">4.4. Pillar 4 — Profiles&lt;/h3>
&lt;p>&lt;strong>Description&lt;/strong>: CPU/memory profiling of the inference engine and Python code, used to find bottlenecks.&lt;/p>
&lt;p>&lt;strong>Suitable tools&lt;/strong>: Pyroscope, Grafana Phlare, py-spy, cProfile&lt;/p>
&lt;p>&lt;strong>Sample (a real-world bottleneck uncovered by profiling):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>Function │ CPU % │ Calls │ Avg ms
──────────────────────────────────┼───────┼───────┼───────
embed_documents() │ 34.2% │ 2,840 │ 12.1ms
vector_db.similarity_search() │ 21.8% │ 2,840 │ 7.7ms
openai.chat.completions.create() │ 18.6% │ 890 │ 1,820ms
json.loads() [response parsing] │ 8.3% │ 2,840 │ 2.9ms
redis.get() [session cache] │ 5.1% │ 8,900 │ 0.57ms
&lt;/code>&lt;/pre>&lt;hr>
&lt;h2 id="5-metrics-quan-trng-cn-theo-di--5-nhm">5. Key Metrics to Track: 5 Groups&lt;/h2>
&lt;h3 id="51-nhm-1--latency-metrics">5.1. Group 1: Latency Metrics&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Target (Production)&lt;/th>
&lt;th>Alert Threshold&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>TTFT p50&lt;/strong>&lt;/td>
&lt;td>Time To First Token, median&lt;/td>
&lt;td>&amp;lt; 500ms&lt;/td>
&lt;td>&amp;gt; 1s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>TTFT p95&lt;/strong>&lt;/td>
&lt;td>Time To First Token, 95th percentile&lt;/td>
&lt;td>&amp;lt; 1.5s&lt;/td>
&lt;td>&amp;gt; 3s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>TTFT p99&lt;/strong>&lt;/td>
&lt;td>Time To First Token, 99th percentile&lt;/td>
&lt;td>&amp;lt; 3s&lt;/td>
&lt;td>&amp;gt; 5s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Total Latency p95&lt;/strong>&lt;/td>
&lt;td>End-to-end response time&lt;/td>
&lt;td>&amp;lt; 3s&lt;/td>
&lt;td>&amp;gt; 5s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Queue Wait Time&lt;/strong>&lt;/td>
&lt;td>Time spent waiting in the queue&lt;/td>
&lt;td>&amp;lt; 100ms&lt;/td>
&lt;td>&amp;gt; 500ms&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Tool Call Latency&lt;/strong>&lt;/td>
&lt;td>Latency of external API calls&lt;/td>
&lt;td>&amp;lt; 500ms/call&lt;/td>
&lt;td>&amp;gt; 2s&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
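&lt;p>Percentiles such as p50/p95/p99 are usually not stored directly; they are estimated from histogram buckets, which is what PromQL's &lt;code>histogram_quantile&lt;/code> does by interpolating linearly inside the matching bucket. The sketch below approximates that idea in plain Python. It is a simplified illustration and not the Prometheus implementation; &lt;code>estimate_quantile&lt;/code> is a hypothetical helper, and the bucket counts are borrowed from the request-duration sample in section 4.1.&lt;/p>

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative bucket counts taken from the sample exposition in section 4.1.
sample = [(0.5, 42), (1.0, 180), (2.0, 312), (5.0, 398), (math.inf, 402)]
for q in (0.50, 0.95, 0.99):
    print(f"p{round(q * 100)} ~ {estimate_quantile(q, sample):.2f}s")
```

&lt;p>Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the thresholds you alert on; with the sample counts above, p95 comes out at roughly 4.4s, above the 3s target but still under the 5s alert line for total latency.&lt;/p>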
&lt;h3 id="52-nhm-2--token--cost-metrics">5.2. Group 2: Token &amp;amp; Cost Metrics&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Target&lt;/th>
&lt;th>Alert Threshold&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Input tokens/request&lt;/strong>&lt;/td>
&lt;td>Avg input token per request&lt;/td>
&lt;td>&amp;lt; 2,000&lt;/td>
&lt;td>&amp;gt; 5,000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Output tokens/request&lt;/strong>&lt;/td>
&lt;td>Avg output token per request&lt;/td>
&lt;td>&amp;lt; 500&lt;/td>
&lt;td>&amp;gt; 2,000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cost/session USD&lt;/strong>&lt;/td>
&lt;td>Average cost per session&lt;/td>
&lt;td>&amp;lt; $0.05&lt;/td>
&lt;td>&amp;gt; $0.20&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Daily cost USD&lt;/strong>&lt;/td>
&lt;td>Total cost per day&lt;/td>
&lt;td>Baseline ±20%&lt;/td>
&lt;td>&amp;gt; 150% baseline&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Monthly cost trend&lt;/strong>&lt;/td>
&lt;td>Month-over-month cost trend&lt;/td>
&lt;td>Growth &amp;lt; 30%&lt;/td>
&lt;td>&amp;gt; 50% MoM&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Token efficiency ratio&lt;/strong>&lt;/td>
&lt;td>Output tokens / Input tokens&lt;/td>
&lt;td>&amp;gt; 0.3&lt;/td>
&lt;td>&amp;lt; 0.1&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
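&lt;p>To make two of the rows above concrete, here is a small Python sketch that computes the token efficiency ratio and checks the daily-cost alert. The thresholds are taken from the table; the function names and the intermediate &lt;code>watch&lt;/code> label (for values between target and alert) are illustrative assumptions.&lt;/p>

```python
def token_efficiency_ratio(input_tokens: int, output_tokens: int) -> float:
    """Output tokens divided by input tokens, as defined in the table above."""
    if input_tokens == 0:
        return 0.0
    return output_tokens / input_tokens


def classify_efficiency(ratio: float) -> str:
    # Thresholds from the table: target is above 0.3, alert is below 0.1.
    # "watch" is a hypothetical label for the band in between.
    if ratio > 0.3:
        return "ok"
    if ratio >= 0.1:
        return "watch"
    return "alert"


def daily_cost_alert(today_usd: float, baseline_usd: float) -> bool:
    # Alert threshold from the table: more than 150% of the baseline.
    return today_usd > 1.5 * baseline_usd


# Token counts taken from the sample structured log in section 4.2.
ratio = token_efficiency_ratio(1840, 420)
print(round(ratio, 3), classify_efficiency(ratio))
# Hypothetical daily totals, for illustration only.
print(daily_cost_alert(today_usd=800.0, baseline_usd=500.0))
```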
&lt;h3 id="53-nhm-3--quality-metrics">5.3. Group 3: Quality Metrics&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Target&lt;/th>
&lt;th>Alert Threshold&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Hallucination rate&lt;/strong>&lt;/td>
&lt;td>% of responses containing incorrect information&lt;/td>
&lt;td>&amp;lt; 3%&lt;/td>
&lt;td>&amp;gt; 10%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Guardrail block rate&lt;/strong>&lt;/td>
&lt;td>% of requests blocked by guardrails&lt;/td>
&lt;td>0.5-2%&lt;/td>
&lt;td>&amp;gt; 20% (surge)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Groundedness score&lt;/strong>&lt;/td>
&lt;td>RAG answer grounded in context&lt;/td>
&lt;td>&amp;gt; 0.85&lt;/td>
&lt;td>&amp;lt; 0.70&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>User satisfaction&lt;/strong>&lt;/td>
&lt;td>CSAT score / thumbs up %&lt;/td>
&lt;td>&amp;gt; 80%&lt;/td>
&lt;td>&amp;lt; 60%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Task completion rate&lt;/strong>&lt;/td>
&lt;td>% tasks completed successfully&lt;/td>
&lt;td>&amp;gt; 90%&lt;/td>
&lt;td>&amp;lt; 75%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Escalation rate&lt;/strong>&lt;/td>
&lt;td>% sessions escalated to human&lt;/td>
&lt;td>&amp;lt; 5%&lt;/td>
&lt;td>&amp;gt; 15%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="54-nhm-4--reliability-metrics">5.4. Group 4: Reliability Metrics&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Target&lt;/th>
&lt;th>Alert Threshold&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Error rate&lt;/strong>&lt;/td>
&lt;td>% of requests returning an error&lt;/td>
&lt;td>&amp;lt; 1%&lt;/td>
&lt;td>&amp;gt; 5%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Timeout rate&lt;/strong>&lt;/td>
&lt;td>% of requests that time out&lt;/td>
&lt;td>&amp;lt; 0.5%&lt;/td>
&lt;td>&amp;gt; 2%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Retry rate&lt;/strong>&lt;/td>
&lt;td>% of requests that need a retry&lt;/td>
&lt;td>&amp;lt; 2%&lt;/td>
&lt;td>&amp;gt; 10%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Circuit breaker state&lt;/strong>&lt;/td>
&lt;td>Current state of the circuit breaker&lt;/td>
&lt;td>CLOSED&lt;/td>
&lt;td>OPEN &amp;gt; 5min&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Memory overflow rate&lt;/strong>&lt;/td>
&lt;td>% context window overflow&lt;/td>
&lt;td>&amp;lt; 1%&lt;/td>
&lt;td>&amp;gt; 5%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Tool failure rate&lt;/strong>&lt;/td>
&lt;td>% of tool calls that fail&lt;/td>
&lt;td>&amp;lt; 2%&lt;/td>
&lt;td>&amp;gt; 10%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
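&lt;p>The circuit breaker row assumes a component that exposes a CLOSED/OPEN state. For readers unfamiliar with the pattern, a minimal sketch in Python might look like the following. The class, its parameters, and the HALF_OPEN probing behaviour are illustrative assumptions and not the API of any particular library.&lt;/p>

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN after repeated failures, then
    HALF_OPEN after a cool-down to probe whether the dependency recovered."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            raise RuntimeError("circuit open: call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures in a row, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0)
print(breaker.state)
```

&lt;p>Exporting &lt;code>breaker.state&lt;/code> as a gauge is one way to drive the alert on a circuit that stays OPEN for more than 5 minutes.&lt;/p>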
&lt;h3 id="55-nhm-5--business-metrics">5.5. Group 5: Business Metrics&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Target&lt;/th>
&lt;th>Alert Threshold&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Active sessions&lt;/strong>&lt;/td>
&lt;td>Number of currently active sessions&lt;/td>
&lt;td>Capacity planning&lt;/td>
&lt;td>&amp;gt; 80% capacity&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Daily active users&lt;/strong>&lt;/td>
&lt;td>Unique users per day&lt;/td>
&lt;td>Growth target&lt;/td>
&lt;td>Sudden drop &amp;gt; 30%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Task completion rate&lt;/strong>&lt;/td>
&lt;td>% of tasks completed&lt;/td>
&lt;td>&amp;gt; 90%&lt;/td>
&lt;td>&amp;lt; 75%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Avg conversation length&lt;/strong>&lt;/td>
&lt;td>Average number of turns per session&lt;/td>
&lt;td>3-8 turns&lt;/td>
&lt;td>&amp;gt; 15 turns&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>ROI per agent&lt;/strong>&lt;/td>
&lt;td>Value generated / operating cost&lt;/td>
&lt;td>&amp;gt; 3x&lt;/td>
&lt;td>&amp;lt; 1x&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cost per resolved query&lt;/strong>&lt;/td>
&lt;td>Cost to resolve one query&lt;/td>
&lt;td>&amp;lt; $0.10&lt;/td>
&lt;td>&amp;gt; $0.50&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
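&lt;p>The last two rows are simple ratios. A minimal Python sketch is shown below; the helper names are hypothetical, and all input figures are made-up illustration values rather than measurements.&lt;/p>

```python
def cost_per_resolved_query(total_cost_usd: float, resolved_queries: int) -> float:
    """Operating cost divided by the number of queries actually resolved."""
    if resolved_queries == 0:
        return float("inf")
    return total_cost_usd / resolved_queries


def roi(value_generated_usd: float, operating_cost_usd: float) -> float:
    """Value generated divided by operating cost (3.0 means 3x)."""
    if operating_cost_usd == 0:
        return float("inf")
    return value_generated_usd / operating_cost_usd


# Hypothetical daily figures, for illustration only.
daily_cost = 500.0
resolved = 9_000
value = 2_000.0

print(f"cost per resolved query: ${cost_per_resolved_query(daily_cost, resolved):.4f}")
print(f"ROI: {roi(value, daily_cost):.1f}x")
```

&lt;p>Dividing by resolved queries rather than by all queries means failed or escalated sessions raise the cost per resolved query instead of hiding in the average.&lt;/p>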
&lt;h3 id="56-python--custom-prometheus-metrics--opentelemetry-instrumentation">5.6. Python — Custom Prometheus Metrics + OpenTelemetry Instrumentation&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> time
&lt;span style="color:#f92672">import&lt;/span> logging
&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Optional, Any
&lt;span style="color:#f92672">from&lt;/span> functools &lt;span style="color:#f92672">import&lt;/span> wraps
&lt;span style="color:#f92672">from&lt;/span> prometheus_client &lt;span style="color:#f92672">import&lt;/span> Counter, Histogram, Gauge, Summary
&lt;span style="color:#f92672">from&lt;/span> opentelemetry &lt;span style="color:#f92672">import&lt;/span> trace, metrics
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.sdk.trace &lt;span style="color:#f92672">import&lt;/span> TracerProvider
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.sdk.metrics &lt;span style="color:#f92672">import&lt;/span> MeterProvider
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.exporter.otlp.proto.grpc.trace_exporter &lt;span style="color:#f92672">import&lt;/span> OTLPSpanExporter
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.exporter.otlp.proto.grpc.metric_exporter &lt;span style="color:#f92672">import&lt;/span> OTLPMetricExporter
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.sdk.trace.export &lt;span style="color:#f92672">import&lt;/span> BatchSpanProcessor
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.sdk.metrics.export &lt;span style="color:#f92672">import&lt;/span> PeriodicExportingMetricReader
logger &lt;span style="color:#f92672">=&lt;/span> logging&lt;span style="color:#f92672">.&lt;/span>getLogger(__name__)
&lt;span style="color:#75715e"># ─── Prometheus Metrics ────────────────────────────────────────────────────────&lt;/span>
&lt;span style="color:#75715e"># Latency histogram; buckets are chosen so that p50/p95/p99 can be estimated&lt;/span>
LLM_REQUEST_DURATION &lt;span style="color:#f92672">=&lt;/span> Histogram(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_request_duration_seconds&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">LLM request latency in seconds&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">operation&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>],
buckets&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#ae81ff">0.1&lt;/span>, &lt;span style="color:#ae81ff">0.25&lt;/span>, &lt;span style="color:#ae81ff">0.5&lt;/span>, &lt;span style="color:#ae81ff">1.0&lt;/span>, &lt;span style="color:#ae81ff">2.0&lt;/span>, &lt;span style="color:#ae81ff">5.0&lt;/span>, &lt;span style="color:#ae81ff">10.0&lt;/span>, &lt;span style="color:#ae81ff">30.0&lt;/span>],
)
TTFT_DURATION &lt;span style="color:#f92672">=&lt;/span> Histogram(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_ttft_seconds&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Time To First Token in seconds&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>],
buckets&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#ae81ff">0.05&lt;/span>, &lt;span style="color:#ae81ff">0.1&lt;/span>, &lt;span style="color:#ae81ff">0.25&lt;/span>, &lt;span style="color:#ae81ff">0.5&lt;/span>, &lt;span style="color:#ae81ff">1.0&lt;/span>, &lt;span style="color:#ae81ff">2.0&lt;/span>, &lt;span style="color:#ae81ff">5.0&lt;/span>],
)
&lt;span style="color:#75715e"># Token counters&lt;/span>
LLM_TOKENS_TOTAL &lt;span style="color:#f92672">=&lt;/span> Counter(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_tokens_total&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Total tokens consumed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token_type&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>], &lt;span style="color:#75715e"># token_type: input | output&lt;/span>
)
&lt;span style="color:#75715e"># Cost tracking&lt;/span>
LLM_COST_USD &lt;span style="color:#f92672">=&lt;/span> Counter(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_cost_usd_total&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Total LLM API cost in USD&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tenant_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>],
)
&lt;span style="color:#75715e"># Quality metrics&lt;/span>
LLM_HALLUCINATION_SCORE &lt;span style="color:#f92672">=&lt;/span> Histogram(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_hallucination_score&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Hallucination probability score (0.0-1.0)&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>],
buckets&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#ae81ff">0.0&lt;/span>, &lt;span style="color:#ae81ff">0.1&lt;/span>, &lt;span style="color:#ae81ff">0.2&lt;/span>, &lt;span style="color:#ae81ff">0.3&lt;/span>, &lt;span style="color:#ae81ff">0.5&lt;/span>, &lt;span style="color:#ae81ff">0.7&lt;/span>, &lt;span style="color:#ae81ff">1.0&lt;/span>],
)
GUARDRAIL_DECISIONS &lt;span style="color:#f92672">=&lt;/span> Counter(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_guardrail_decisions_total&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Guardrail decisions&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">decision&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">reason&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>], &lt;span style="color:#75715e"># decision: allow|block|escalate&lt;/span>
)
&lt;span style="color:#75715e"># Reliability&lt;/span>
LLM_ERRORS_TOTAL &lt;span style="color:#f92672">=&lt;/span> Counter(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_errors_total&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Total LLM errors&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">error_type&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>],
)
&lt;span style="color:#75715e"># Active sessions gauge&lt;/span>
ACTIVE_SESSIONS &lt;span style="color:#f92672">=&lt;/span> Gauge(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_active_sessions&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Number of currently active sessions&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>],
)
&lt;span style="color:#75715e"># ─── OpenTelemetry Setup ───────────────────────────────────────────────────────&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">setup_otel&lt;/span>(service_name: str, otel_endpoint: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">http://otel-collector:4317&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>):
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Configure OpenTelemetry tracing and metrics with an OTLP exporter.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Tracing&lt;/span>
tracer_provider &lt;span style="color:#f92672">=&lt;/span> TracerProvider()
otlp_span_exporter &lt;span style="color:#f92672">=&lt;/span> OTLPSpanExporter(endpoint&lt;span style="color:#f92672">=&lt;/span>otel_endpoint, insecure&lt;span style="color:#f92672">=&lt;/span>True)
tracer_provider&lt;span style="color:#f92672">.&lt;/span>add_span_processor(BatchSpanProcessor(otlp_span_exporter))
trace&lt;span style="color:#f92672">.&lt;/span>set_tracer_provider(tracer_provider)
&lt;span style="color:#75715e"># Metrics&lt;/span>
otlp_metric_exporter &lt;span style="color:#f92672">=&lt;/span> OTLPMetricExporter(endpoint&lt;span style="color:#f92672">=&lt;/span>otel_endpoint, insecure&lt;span style="color:#f92672">=&lt;/span>True)
metric_reader &lt;span style="color:#f92672">=&lt;/span> PeriodicExportingMetricReader(otlp_metric_exporter, export_interval_millis&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">15000&lt;/span>)
meter_provider &lt;span style="color:#f92672">=&lt;/span> MeterProvider(metric_readers&lt;span style="color:#f92672">=&lt;/span>[metric_reader])
metrics&lt;span style="color:#f92672">.&lt;/span>set_meter_provider(meter_provider)
&lt;span style="color:#66d9ef">return&lt;/span> trace&lt;span style="color:#f92672">.&lt;/span>get_tracer(service_name), metrics&lt;span style="color:#f92672">.&lt;/span>get_meter(service_name)
&lt;span style="color:#75715e"># ─── Instrumented LLM Call Wrapper ────────────────────────────────────────────&lt;/span>
COST_PER_1K_TOKENS &lt;span style="color:#f92672">=&lt;/span> {
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.005&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.015&lt;/span>},
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o-mini&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.00015&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.0006&lt;/span>},
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">claude-3-5-sonnet&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.003&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.015&lt;/span>},
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">claude-3-haiku&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.00025&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.00125&lt;/span>},
}
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">InstrumentedLLMClient&lt;/span>:
&lt;span style="color:#66d9ef">def&lt;/span> __init__(self, agent_id: str, tracer, model: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>):
self&lt;span style="color:#f92672">.&lt;/span>agent_id &lt;span style="color:#f92672">=&lt;/span> agent_id
self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> model
self&lt;span style="color:#f92672">.&lt;/span>tracer &lt;span style="color:#f92672">=&lt;/span> tracer
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">calculate_cost&lt;/span>(self, input_tokens: int, output_tokens: int) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> float:
rates &lt;span style="color:#f92672">=&lt;/span> COST_PER_1K_TOKENS&lt;span style="color:#f92672">.&lt;/span>get(self&lt;span style="color:#f92672">.&lt;/span>model, {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.005&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.015&lt;/span>})
&lt;span style="color:#66d9ef">return&lt;/span> (input_tokens &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span> &lt;span style="color:#f92672">*&lt;/span> rates[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]) &lt;span style="color:#f92672">+&lt;/span> (output_tokens &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span> &lt;span style="color:#f92672">*&lt;/span> rates[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>])
async &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">chat_completion&lt;/span>(
self,
messages: list[dict],
tenant_id: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">default&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
session_id: Optional[str] &lt;span style="color:#f92672">=&lt;/span> None,
&lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any,
) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> dict:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">LLM call with full instrumentation: traces, metrics, and cost tracking.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
start_time &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>perf_counter()
&lt;span style="color:#66d9ef">with&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tracer&lt;span style="color:#f92672">.&lt;/span>start_as_current_span(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.generate&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> span:
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, self&lt;span style="color:#f92672">.&lt;/span>model)
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, self&lt;span style="color:#f92672">.&lt;/span>agent_id)
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.session_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, session_id &lt;span style="color:#f92672">or&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.input_messages&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, len(messages))
ACTIVE_SESSIONS&lt;span style="color:#f92672">.&lt;/span>labels(agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id)&lt;span style="color:#f92672">.&lt;/span>inc()
&lt;span style="color:#66d9ef">try&lt;/span>:
&lt;span style="color:#75715e"># Call the actual LLM (swap in your real OpenAI client here)&lt;/span>
&lt;span style="color:#f92672">from&lt;/span> openai &lt;span style="color:#f92672">import&lt;/span> AsyncOpenAI
client &lt;span style="color:#f92672">=&lt;/span> AsyncOpenAI()
response &lt;span style="color:#f92672">=&lt;/span> await client&lt;span style="color:#f92672">.&lt;/span>chat&lt;span style="color:#f92672">.&lt;/span>completions&lt;span style="color:#f92672">.&lt;/span>create(
model&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>model,
messages&lt;span style="color:#f92672">=&lt;/span>messages,
&lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs,
)
latency &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>perf_counter() &lt;span style="color:#f92672">-&lt;/span> start_time
usage &lt;span style="color:#f92672">=&lt;/span> response&lt;span style="color:#f92672">.&lt;/span>usage
input_tokens &lt;span style="color:#f92672">=&lt;/span> usage&lt;span style="color:#f92672">.&lt;/span>prompt_tokens
output_tokens &lt;span style="color:#f92672">=&lt;/span> usage&lt;span style="color:#f92672">.&lt;/span>completion_tokens
cost &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>calculate_cost(input_tokens, output_tokens)
&lt;span style="color:#75715e"># Prometheus metrics&lt;/span>
LLM_REQUEST_DURATION&lt;span style="color:#f92672">.&lt;/span>labels(
agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, model&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>model, operation&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">chat&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
)&lt;span style="color:#f92672">.&lt;/span>observe(latency)
LLM_TOKENS_TOTAL&lt;span style="color:#f92672">.&lt;/span>labels(
agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, model&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>model, token_type&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
)&lt;span style="color:#f92672">.&lt;/span>inc(input_tokens)
LLM_TOKENS_TOTAL&lt;span style="color:#f92672">.&lt;/span>labels(
agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, model&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>model, token_type&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
)&lt;span style="color:#f92672">.&lt;/span>inc(output_tokens)
LLM_COST_USD&lt;span style="color:#f92672">.&lt;/span>labels(
agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, model&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>model, tenant_id&lt;span style="color:#f92672">=&lt;/span>tenant_id
)&lt;span style="color:#f92672">.&lt;/span>inc(cost)
&lt;span style="color:#75715e"># OTel span attributes&lt;/span>
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.input_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, input_tokens)
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.output_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, output_tokens)
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.cost_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, cost)
span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.latency_ms&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, int(latency &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span>))
logger&lt;span style="color:#f92672">.&lt;/span>info(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_call_completed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
extra&lt;span style="color:#f92672">=&lt;/span>{
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: self&lt;span style="color:#f92672">.&lt;/span>agent_id,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: self&lt;span style="color:#f92672">.&lt;/span>model,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">input_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: input_tokens,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">output_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: output_tokens,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">latency_ms&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: int(latency &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span>),
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">cost_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: round(cost, &lt;span style="color:#ae81ff">6&lt;/span>),
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">session_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: session_id,
},
)
&lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">response&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: response, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">cost_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: cost, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">latency_ms&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: int(latency &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span>)}
&lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:
LLM_ERRORS_TOTAL&lt;span style="color:#f92672">.&lt;/span>labels(
agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, model&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>model, error_type&lt;span style="color:#f92672">=&lt;/span>type(e)&lt;span style="color:#f92672">.&lt;/span>__name__
)&lt;span style="color:#f92672">.&lt;/span>inc()
span&lt;span style="color:#f92672">.&lt;/span>record_exception(e)
span&lt;span style="color:#f92672">.&lt;/span>set_status(trace&lt;span style="color:#f92672">.&lt;/span>StatusCode&lt;span style="color:#f92672">.&lt;/span>ERROR, str(e))
logger&lt;span style="color:#f92672">.&lt;/span>error(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_call_failed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, extra&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">error&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: str(e), &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: self&lt;span style="color:#f92672">.&lt;/span>agent_id})
&lt;span style="color:#66d9ef">raise&lt;/span>
&lt;span style="color:#66d9ef">finally&lt;/span>:
ACTIVE_SESSIONS&lt;span style="color:#f92672">.&lt;/span>labels(agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id)&lt;span style="color:#f92672">.&lt;/span>dec()
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="6-distributed-tracing-cho-multi-agent-workflow">6. Distributed Tracing Cho Multi-Agent Workflow&lt;/h2>
&lt;h3 id="61-khi-nim-c-bn">6.1. Basic Concepts&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Concept&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Example in an AI Agent&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Trace&lt;/strong>&lt;/td>
&lt;td>The full lifecycle of a single request&lt;/td>
&lt;td>From the moment the user sends a message until the response is received&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Span&lt;/strong>&lt;/td>
&lt;td>A single unit of work within a trace&lt;/td>
&lt;td>&amp;ldquo;llm.generate&amp;rdquo;, &amp;ldquo;rag.retrieve&amp;rdquo;, &amp;ldquo;tool.call&amp;rdquo;&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Parent Span&lt;/strong>&lt;/td>
&lt;td>A span that contains child spans&lt;/td>
&lt;td>The orchestrator span contains all sub-agent spans&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Context Propagation&lt;/strong>&lt;/td>
&lt;td>Passing trace context across service boundaries&lt;/td>
&lt;td>traceparent header over HTTP/gRPC&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Correlation ID&lt;/strong>&lt;/td>
&lt;td>A unique ID that ties logs, traces, and metrics together&lt;/td>
&lt;td>request_id = trace_id&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
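&lt;p>To make the &lt;strong>Context Propagation&lt;/strong> and &lt;strong>Correlation ID&lt;/strong> rows concrete, here is a minimal, standard-library-only sketch of the W3C &lt;code>traceparent&lt;/code> header layout (&lt;code>version-trace_id-parent_id-flags&lt;/code>). The helper names &lt;code>new_traceparent&lt;/code>, &lt;code>parse_traceparent&lt;/code>, and &lt;code>child_traceparent&lt;/code> are hypothetical and exist only for illustration. In a real service you would normally let an OpenTelemetry propagator write and read this header instead of building it by hand.&lt;/p>

```python
import secrets


def new_traceparent(sampled=True):
    """Build a traceparent value for a brand-new trace (illustrative helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes, 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes, 16 hex characters
    flags = "01" if sampled else "00"  # 01 means the trace is sampled
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Split a traceparent value into its four fields, checking field lengths only."""
    version, trace_id, parent_id, flags = header.split("-")
    lengths = (len(version), len(trace_id), len(parent_id), len(flags))
    if lengths != (2, 32, 16, 2):
        raise ValueError(f"malformed traceparent: {header!r}")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def child_traceparent(incoming):
    """Continue the caller's trace: keep trace_id, mint a new span id."""
    fields = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)
    return f"{fields['version']}-{fields['trace_id']}-{new_span_id}-{fields['flags']}"


# The orchestrator starts a trace and attaches the header to its outgoing request.
outgoing_headers = {"traceparent": new_traceparent()}

# The sub-agent reads the header and continues the same trace with its own span id.
sub_agent_header = child_traceparent(outgoing_headers["traceparent"])
```

&lt;p>Because the sub-agent keeps the orchestrator's &lt;code>trace_id&lt;/code> and only changes the span id, every span, log line, and metric exemplar tagged with that &lt;code>trace_id&lt;/code> can be joined back to the same user request.&lt;/p>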
&lt;h3 id="62-python--opentelemetry--langchain-callback-handler">6.2. Python — OpenTelemetry + LangChain Callback Handler&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> uuid
&lt;span style="color:#f92672">import&lt;/span> time
&lt;span style="color:#f92672">import&lt;/span> logging
&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Any, Optional, Union
&lt;span style="color:#f92672">from&lt;/span> langchain.callbacks.base &lt;span style="color:#f92672">import&lt;/span> BaseCallbackHandler
&lt;span style="color:#f92672">from&lt;/span> langchain.schema &lt;span style="color:#f92672">import&lt;/span> LLMResult, AgentAction, AgentFinish
&lt;span style="color:#f92672">from&lt;/span> opentelemetry &lt;span style="color:#f92672">import&lt;/span> trace, context, baggage
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.propagate &lt;span style="color:#f92672">import&lt;/span> inject, extract
&lt;span style="color:#f92672">import&lt;/span> structlog
logger &lt;span style="color:#f92672">=&lt;/span> structlog&lt;span style="color:#f92672">.&lt;/span>get_logger()
tracer &lt;span style="color:#f92672">=&lt;/span> trace&lt;span style="color:#f92672">.&lt;/span>get_tracer(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">langchain-agent&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LangChainOTelCallbackHandler&lt;/span>(BaseCallbackHandler):
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> LangChain callback handler that integrates OpenTelemetry tracing.&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Automatically creates spans for every LLM call, tool call, and chain execution.&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> __init__(self, agent_id: str):
self&lt;span style="color:#f92672">.&lt;/span>agent_id &lt;span style="color:#f92672">=&lt;/span> agent_id
self&lt;span style="color:#f92672">.&lt;/span>_span_stack: dict[str, Any] &lt;span style="color:#f92672">=&lt;/span> {}
self&lt;span style="color:#f92672">.&lt;/span>_run_metadata: dict[str, dict] &lt;span style="color:#f92672">=&lt;/span> {}
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_llm_start&lt;/span>(self, serialized: dict, prompts: list[str], &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
run_id &lt;span style="color:#f92672">=&lt;/span> str(kwargs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, uuid&lt;span style="color:#f92672">.&lt;/span>uuid4()))
model &lt;span style="color:#f92672">=&lt;/span> serialized&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">kwargs&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, {})&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model_name&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">unknown&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
span &lt;span style="color:#f92672">=&lt;/span> tracer&lt;span style="color:#f92672">.&lt;/span>start_span(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.generate&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
attributes&lt;span style="color:#f92672">=&lt;/span>{
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: model,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: self&lt;span style="color:#f92672">.&lt;/span>agent_id,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.prompt_count&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: len(prompts),
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: run_id,
},
)
&lt;span style="color:#75715e"># Build a Context with this span set as current, so child spans nest under it&lt;/span>
ctx &lt;span style="color:#f92672">=&lt;/span> trace&lt;span style="color:#f92672">.&lt;/span>set_span_in_context(span)
token &lt;span style="color:#f92672">=&lt;/span> context&lt;span style="color:#f92672">.&lt;/span>attach(ctx)
self&lt;span style="color:#f92672">.&lt;/span>_span_stack[run_id] &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: span, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: token, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">start_time&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: time&lt;span style="color:#f92672">.&lt;/span>perf_counter()}
self&lt;span style="color:#f92672">.&lt;/span>_run_metadata[run_id] &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: model, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">prompts&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: prompts}
logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_start&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, model&lt;span style="color:#f92672">=&lt;/span>model, run_id&lt;span style="color:#f92672">=&lt;/span>run_id)

    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_llm_end&lt;/span>(self, response: LLMResult, &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        run_id &lt;span style="color:#f92672">=&lt;/span> str(kwargs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>))
        &lt;span style="color:#66d9ef">if&lt;/span> run_id &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#f92672">in&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack:
            &lt;span style="color:#66d9ef">return&lt;/span>
        frame &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack&lt;span style="color:#f92672">.&lt;/span>pop(run_id)
        span &lt;span style="color:#f92672">=&lt;/span> frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]
        latency_ms &lt;span style="color:#f92672">=&lt;/span> int((time&lt;span style="color:#f92672">.&lt;/span>perf_counter() &lt;span style="color:#f92672">-&lt;/span> frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">start_time&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span>)
        &lt;span style="color:#75715e"># Extract token usage from LLMResult (stays 0 when the provider reports none)&lt;/span>
        total_tokens &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
        input_tokens &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
        output_tokens &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
        &lt;span style="color:#66d9ef">if&lt;/span> response&lt;span style="color:#f92672">.&lt;/span>llm_output:
            token_usage &lt;span style="color:#f92672">=&lt;/span> response&lt;span style="color:#f92672">.&lt;/span>llm_output&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token_usage&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, {})
            input_tokens &lt;span style="color:#f92672">=&lt;/span> token_usage&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">prompt_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>)
            output_tokens &lt;span style="color:#f92672">=&lt;/span> token_usage&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">completion_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>)
            total_tokens &lt;span style="color:#f92672">=&lt;/span> token_usage&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">total_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>)
        span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.input_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, input_tokens)
        span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.output_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, output_tokens)
        span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.total_tokens&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, total_tokens)
        span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.latency_ms&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, latency_ms)
        span&lt;span style="color:#f92672">.&lt;/span>end()
        context&lt;span style="color:#f92672">.&lt;/span>detach(frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>])
        logger&lt;span style="color:#f92672">.&lt;/span>info(
            &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_end&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
            agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id,
            run_id&lt;span style="color:#f92672">=&lt;/span>run_id,
            input_tokens&lt;span style="color:#f92672">=&lt;/span>input_tokens,
            output_tokens&lt;span style="color:#f92672">=&lt;/span>output_tokens,
            latency_ms&lt;span style="color:#f92672">=&lt;/span>latency_ms,
        )

    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_llm_error&lt;/span>(self, error: Union[&lt;span style="color:#a6e22e">Exception&lt;/span>, &lt;span style="color:#a6e22e">KeyboardInterrupt&lt;/span>], &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        run_id &lt;span style="color:#f92672">=&lt;/span> str(kwargs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>))
        &lt;span style="color:#66d9ef">if&lt;/span> run_id &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#f92672">in&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack:
            &lt;span style="color:#66d9ef">return&lt;/span>
        frame &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack&lt;span style="color:#f92672">.&lt;/span>pop(run_id)
        span &lt;span style="color:#f92672">=&lt;/span> frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]
        span&lt;span style="color:#f92672">.&lt;/span>record_exception(error)
        span&lt;span style="color:#f92672">.&lt;/span>set_status(trace&lt;span style="color:#f92672">.&lt;/span>StatusCode&lt;span style="color:#f92672">.&lt;/span>ERROR, str(error))
        span&lt;span style="color:#f92672">.&lt;/span>end()
        context&lt;span style="color:#f92672">.&lt;/span>detach(frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>])
        logger&lt;span style="color:#f92672">.&lt;/span>error(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_error&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, error&lt;span style="color:#f92672">=&lt;/span>str(error), run_id&lt;span style="color:#f92672">=&lt;/span>run_id)

    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_tool_start&lt;/span>(self, serialized: dict, input_str: str, &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        run_id &lt;span style="color:#f92672">=&lt;/span> str(kwargs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, uuid&lt;span style="color:#f92672">.&lt;/span>uuid4()))
        tool_name &lt;span style="color:#f92672">=&lt;/span> serialized&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">name&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">unknown_tool&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
        span &lt;span style="color:#f92672">=&lt;/span> tracer&lt;span style="color:#f92672">.&lt;/span>start_span(
            f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tool.{tool_name}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
            attributes&lt;span style="color:#f92672">=&lt;/span>{
                &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tool.name&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: tool_name,
                &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tool.input_length&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: len(input_str),
                &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.agent_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: self&lt;span style="color:#f92672">.&lt;/span>agent_id,
            },
        )
        &lt;span style="color:#75715e"># context.attach() needs a Context object; trace.use_span() returns a context manager instead&lt;/span>
        ctx &lt;span style="color:#f92672">=&lt;/span> trace&lt;span style="color:#f92672">.&lt;/span>set_span_in_context(span)
        token &lt;span style="color:#f92672">=&lt;/span> context&lt;span style="color:#f92672">.&lt;/span>attach(ctx)
        self&lt;span style="color:#f92672">.&lt;/span>_span_stack[run_id] &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: span, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: token, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">start_time&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: time&lt;span style="color:#f92672">.&lt;/span>perf_counter()}
        logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tool_start&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, tool&lt;span style="color:#f92672">=&lt;/span>tool_name, agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id)
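
    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_tool_end&lt;/span>(self, output: str, &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        run_id &lt;span style="color:#f92672">=&lt;/span> str(kwargs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>))
        &lt;span style="color:#66d9ef">if&lt;/span> run_id &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#f92672">in&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack:
            &lt;span style="color:#66d9ef">return&lt;/span>
        frame &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack&lt;span style="color:#f92672">.&lt;/span>pop(run_id)
        span &lt;span style="color:#f92672">=&lt;/span> frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]
        &lt;span style="color:#75715e"># str() guards against tools that return a non-string object&lt;/span>
        span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tool.output_length&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, len(str(output)))
        span&lt;span style="color:#f92672">.&lt;/span>set_attribute(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tool.latency_ms&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, int((time&lt;span style="color:#f92672">.&lt;/span>perf_counter() &lt;span style="color:#f92672">-&lt;/span> frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">start_time&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]) &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span>))
        span&lt;span style="color:#f92672">.&lt;/span>end()
        context&lt;span style="color:#f92672">.&lt;/span>detach(frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>])

    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_tool_error&lt;/span>(self, error: Union[&lt;span style="color:#a6e22e">Exception&lt;/span>, &lt;span style="color:#a6e22e">KeyboardInterrupt&lt;/span>], &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        &lt;span style="color:#75715e"># Close the tool span on failure so it does not stay open in _span_stack&lt;/span>
        run_id &lt;span style="color:#f92672">=&lt;/span> str(kwargs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">run_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>))
        &lt;span style="color:#66d9ef">if&lt;/span> run_id &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#f92672">in&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack:
            &lt;span style="color:#66d9ef">return&lt;/span>
        frame &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_span_stack&lt;span style="color:#f92672">.&lt;/span>pop(run_id)
        span &lt;span style="color:#f92672">=&lt;/span> frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>]
        span&lt;span style="color:#f92672">.&lt;/span>record_exception(error)
        span&lt;span style="color:#f92672">.&lt;/span>set_status(trace&lt;span style="color:#f92672">.&lt;/span>StatusCode&lt;span style="color:#f92672">.&lt;/span>ERROR, str(error))
        span&lt;span style="color:#f92672">.&lt;/span>end()
        context&lt;span style="color:#f92672">.&lt;/span>detach(frame[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">token&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>])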

    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_agent_action&lt;/span>(self, action: AgentAction, &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        logger&lt;span style="color:#f92672">.&lt;/span>info(
            &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_action&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
            agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id,
            tool&lt;span style="color:#f92672">=&lt;/span>action&lt;span style="color:#f92672">.&lt;/span>tool,
            &lt;span style="color:#75715e"># tool_input may be a str or a dict, so stringify before truncating&lt;/span>
            tool_input&lt;span style="color:#f92672">=&lt;/span>str(action&lt;span style="color:#f92672">.&lt;/span>tool_input)[:&lt;span style="color:#ae81ff">200&lt;/span>],
        )

    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">on_agent_finish&lt;/span>(self, finish: AgentFinish, &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#f92672">*&lt;/span>kwargs: Any) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
        logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">agent_finish&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, agent_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>agent_id, output_keys&lt;span style="color:#f92672">=&lt;/span>list(finish&lt;span style="color:#f92672">.&lt;/span>return_values&lt;span style="color:#f92672">.&lt;/span>keys()))


&lt;span style="color:#75715e"># ─── Context propagation over HTTP ────────────────────────────────────────────&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">create_propagated_headers&lt;/span>(current_span: Optional[Any] &lt;span style="color:#f92672">=&lt;/span> None) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> dict:
    &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Build HTTP headers carrying the W3C traceparent so the context reaches other services.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
    headers: dict &lt;span style="color:#f92672">=&lt;/span> {}
    inject(headers)  &lt;span style="color:#75715e"># OTel injects traceparent + tracestate from the current context&lt;/span>
    &lt;span style="color:#66d9ef">return&lt;/span> headers


&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">extract_trace_context&lt;/span>(incoming_headers: dict) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> Any:
    &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Extract the trace context from an inbound HTTP request.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
    &lt;span style="color:#66d9ef">return&lt;/span> extract(incoming_headers)
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="7-structured-logging-cho-ai-agent">7. Structured Logging for AI Agents&lt;/h2>
&lt;h3 id="71-log-schema-json-chun">7.1. Standard JSON Log Schema&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-json" data-lang="json">{
  &lt;span style="color:#f92672">&amp;#34;timestamp&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2026-05-14T10:23:45.123456Z&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;level&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;INFO&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;service&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;rag-agent-service&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;version&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2.1.0&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;environment&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;production&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;request_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;req-4bf92f35-77b3-4da6&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;session_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;sess-a3ce929d-0e0e-4736&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;correlation_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;corr-00f067aa-0ba9-02b7&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;trace_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;4bf92f3577b34da6a3ce929d0e0e4736&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;span_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;00f067aa0ba902b7&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;agent_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;rag_agent_v2&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;tenant_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;tenant-healthcare-001&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;user_id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;user-hashed-789xyz&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;model&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;gpt-4o&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;model_version&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2024-11-20&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;operation&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;chat_completion&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;prompt_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1840&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;completion_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">420&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;total_tokens&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">2260&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;cost_usd&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.004530&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;latency_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">2050&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;ttft_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">380&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;queue_wait_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">12&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;guardrail_status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;passed&amp;#34;&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;guardrail_checks&amp;#34;&lt;/span>: {
    &lt;span style="color:#f92672">&amp;#34;prompt_injection&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;clean&amp;#34;&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;pii_detection&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;no_pii&amp;#34;&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;topic_filter&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;in_scope&amp;#34;&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;toxicity&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;clean&amp;#34;&lt;/span>
  },
  &lt;span style="color:#f92672">&amp;#34;tool_calls&amp;#34;&lt;/span>: [
    {&lt;span style="color:#f92672">&amp;#34;name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;search_knowledge_base&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;latency_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">145&lt;/span>, &lt;span style="color:#f92672">&amp;#34;status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;success&amp;#34;&lt;/span>},
    {&lt;span style="color:#f92672">&amp;#34;name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;get_product_info&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;latency_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">89&lt;/span>, &lt;span style="color:#f92672">&amp;#34;status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;success&amp;#34;&lt;/span>}
  ],
  &lt;span style="color:#f92672">&amp;#34;rag_context&amp;#34;&lt;/span>: {
    &lt;span style="color:#f92672">&amp;#34;chunks_retrieved&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">5&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;top_similarity_score&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.92&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;retrieval_latency_ms&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">145&lt;/span>
  },
  &lt;span style="color:#f92672">&amp;#34;quality_scores&amp;#34;&lt;/span>: {
    &lt;span style="color:#f92672">&amp;#34;groundedness&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.88&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;hallucination_probability&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.08&lt;/span>,
    &lt;span style="color:#f92672">&amp;#34;relevance&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.91&lt;/span>
  },
  &lt;span style="color:#f92672">&amp;#34;error&amp;#34;&lt;/span>: &lt;span style="color:#66d9ef">null&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;error_type&amp;#34;&lt;/span>: &lt;span style="color:#66d9ef">null&lt;/span>,
  &lt;span style="color:#f92672">&amp;#34;retry_count&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>
}
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="72-python-structlog-setup">7.2. Python Structlog Setup&lt;/h3>
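&lt;p>The processor chain in this section begins with &lt;code>structlog.contextvars.merge_contextvars&lt;/code>, which copies context-local fields such as &lt;code>request_id&lt;/code> into every log record. The sketch below is a stdlib-only approximation of that idea, not structlog's implementation; &lt;code>bind_context&lt;/code>, &lt;code>merge_context&lt;/code> and &lt;code>render&lt;/code> are hypothetical names used only for illustration.&lt;/p>

```python
import contextvars
import json

# Hypothetical context-local store; structlog ships its own equivalent.
_log_context = contextvars.ContextVar("log_context", default={})


def bind_context(**fields):
    """Add fields to the context-local dict and return a token for reset()."""
    merged = {**_log_context.get(), **fields}
    return _log_context.set(merged)


def merge_context(logger, method_name, event_dict):
    """Processor-shaped function: bound fields first, call-site fields win."""
    return {**_log_context.get(), **event_dict}


def render(event, **fields):
    """Run the one-processor chain and serialize the result as a JSON line."""
    event_dict = merge_context(None, "info", {"event": event, **fields})
    return json.dumps(event_dict, sort_keys=True)


# Bind once per request, then every record emitted in that context carries the ids.
token = bind_context(request_id="req-123", session_id="sess-456")
line = render("llm_call_completed", latency_ms=2050)
print(line)

# Reset at the end of the request so the ids do not leak into later records.
_log_context.reset(token)
print(render("after_reset"))
```

&lt;p>Because the store is a &lt;code>ContextVar&lt;/code>, each asyncio task or thread sees its own bindings, which is what keeps concurrent agent requests from mixing their ids.&lt;/p>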
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> sys
&lt;span style="color:#f92672">import&lt;/span> logging
&lt;span style="color:#f92672">import&lt;/span> structlog
&lt;span style="color:#f92672">from&lt;/span> opentelemetry &lt;span style="color:#f92672">import&lt;/span> trace


&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">configure_structured_logging&lt;/span>(
    service_name: str,
    environment: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">production&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
    log_level: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">INFO&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
    &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Configure structlog with OTel trace context injection.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
    &lt;span style="color:#75715e"># Processor chain: transforms each log record before it is written&lt;/span>
    structlog&lt;span style="color:#f92672">.&lt;/span>configure(
        processors&lt;span style="color:#f92672">=&lt;/span>[
            structlog&lt;span style="color:#f92672">.&lt;/span>contextvars&lt;span style="color:#f92672">.&lt;/span>merge_contextvars,  &lt;span style="color:#75715e"># context-local bindings (contextvars)&lt;/span>
            structlog&lt;span style="color:#f92672">.&lt;/span>stdlib&lt;span style="color:#f92672">.&lt;/span>add_log_level,  &lt;span style="color:#75715e"># level field&lt;/span>
            &lt;span style="color:#75715e"># add_logger_name is omitted: it reads logger.name, which PrintLogger does not have&lt;/span>
            structlog&lt;span style="color:#f92672">.&lt;/span>processors&lt;span style="color:#f92672">.&lt;/span>TimeStamper(fmt&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">iso&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>),  &lt;span style="color:#75715e"># ISO timestamp&lt;/span>
            _inject_otel_context,  &lt;span style="color:#75715e"># trace_id + span_id&lt;/span>
            _add_service_metadata(service_name, environment),  &lt;span style="color:#75715e"># service + env&lt;/span>
            structlog&lt;span style="color:#f92672">.&lt;/span>processors&lt;span style="color:#f92672">.&lt;/span>StackInfoRenderer(),
            structlog&lt;span style="color:#f92672">.&lt;/span>processors&lt;span style="color:#f92672">.&lt;/span>format_exc_info,
            structlog&lt;span style="color:#f92672">.&lt;/span>processors&lt;span style="color:#f92672">.&lt;/span>JSONRenderer(),  &lt;span style="color:#75715e"># JSON output&lt;/span>
        ],
        wrapper_class&lt;span style="color:#f92672">=&lt;/span>structlog&lt;span style="color:#f92672">.&lt;/span>make_filtering_bound_logger(
            getattr(logging, log_level&lt;span style="color:#f92672">.&lt;/span>upper())
        ),
        context_class&lt;span style="color:#f92672">=&lt;/span>dict,
        logger_factory&lt;span style="color:#f92672">=&lt;/span>structlog&lt;span style="color:#f92672">.&lt;/span>PrintLoggerFactory(sys&lt;span style="color:#f92672">.&lt;/span>stdout),
    )


&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">_inject_otel_context&lt;/span>(logger, method_name: str, event_dict: dict) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> dict:
    &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Inject the OTel trace_id and span_id into every log record.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
    current_span &lt;span style="color:#f92672">=&lt;/span> trace&lt;span style="color:#f92672">.&lt;/span>get_current_span()
    &lt;span style="color:#66d9ef">if&lt;/span> current_span &lt;span style="color:#f92672">and&lt;/span> current_span&lt;span style="color:#f92672">.&lt;/span>is_recording():
        ctx &lt;span style="color:#f92672">=&lt;/span> current_span&lt;span style="color:#f92672">.&lt;/span>get_span_context()
        event_dict[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">trace_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> format(ctx&lt;span style="color:#f92672">.&lt;/span>trace_id, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">032x&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
        event_dict[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">span_id&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> format(ctx&lt;span style="color:#f92672">.&lt;/span>span_id, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">016x&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
    &lt;span style="color:#66d9ef">return&lt;/span> event_dict


&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">_add_service_metadata&lt;/span>(service_name: str, environment: str):
    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">processor&lt;/span>(logger, method_name: str, event_dict: dict) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> dict:
        event_dict[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">service&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> service_name
        event_dict[&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">environment&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> environment
        &lt;span style="color:#66d9ef">return&lt;/span> event_dict
    &lt;span style="color:#66d9ef">return&lt;/span> processor

&lt;span style="color:#75715e"># Usage:&lt;/span>
&lt;span style="color:#75715e"># configure_structured_logging(&amp;#34;rag-agent-service&amp;#34;, &amp;#34;production&amp;#34;)&lt;/span>
&lt;span style="color:#75715e"># log = structlog.get_logger()&lt;/span>
&lt;span style="color:#75715e"># log.info(&amp;#34;llm_call_completed&amp;#34;, agent_id=&amp;#34;rag_agent&amp;#34;, latency_ms=2050, cost_usd=0.0045)&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="73-elasticsearch-index-mapping">7.3. Elasticsearch Index Mapping&lt;/h3>
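&lt;p>The template in this section sets &lt;code>dynamic: false&lt;/code>. With that setting Elasticsearch keeps unmapped fields in &lt;code>_source&lt;/code> but does not index them, so they cannot be searched or aggregated. A minimal sketch, assuming a hand-maintained set of mapped field paths (&lt;code>MAPPED_FIELDS&lt;/code>, &lt;code>flatten_paths&lt;/code> and &lt;code>unmapped_fields&lt;/code> are hypothetical names, and the set is only a subset of the real mapping), shows how a log record could be checked for such fields before shipping:&lt;/p>

```python
# Hypothetical subset of the mapped field paths; extend it to match the real template.
MAPPED_FIELDS = {
    "timestamp", "level", "service", "trace_id", "span_id", "latency_ms",
    "tool_calls.name", "tool_calls.latency_ms", "tool_calls.status",
}


def flatten_paths(value, prefix=""):
    """Yield dotted leaf paths of a JSON-like value, descending into dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten_paths(child, path)
    elif isinstance(value, list):
        for child in value:
            yield from flatten_paths(child, prefix)
    else:
        yield prefix


def unmapped_fields(record, mapped=MAPPED_FIELDS):
    """Return the sorted leaf paths that the mapping does not declare."""
    return sorted(set(flatten_paths(record)) - set(mapped))


record = {
    "timestamp": "2026-05-14T10:23:45.123456Z",
    "level": "INFO",
    "latency_ms": 2050,
    "model_version": "2024-11-20",
    "tool_calls": [{"name": "search_knowledge_base", "latency_ms": 145, "status": "success"}],
    "rag_context": {"chunks_retrieved": 5},
}
missing = unmapped_fields(record)
print(missing)
```

&lt;p>Running a check like this in CI against sample records is one way to notice when the log schema and the index mapping drift apart.&lt;/p>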
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="color:#75715e"># elasticsearch-index-mapping.yaml&lt;/span>
---
index_template:
  name: &lt;span style="color:#e6db74">&amp;#34;ai-agent-logs&amp;#34;&lt;/span>
  index_patterns:
    - &lt;span style="color:#e6db74">&amp;#34;ai-agent-logs-*&amp;#34;&lt;/span>
  settings:
    number_of_shards: &lt;span style="color:#ae81ff">3&lt;/span>
    number_of_replicas: &lt;span style="color:#ae81ff">1&lt;/span>
    refresh_interval: &lt;span style="color:#e6db74">&amp;#34;5s&amp;#34;&lt;/span>
    index:
      lifecycle:
        name: &lt;span style="color:#e6db74">&amp;#34;ai-agent-logs-ilm-policy&amp;#34;&lt;/span>
        rollover_alias: &lt;span style="color:#e6db74">&amp;#34;ai-agent-logs&amp;#34;&lt;/span>
    analysis:
      analyzer:
        custom_log_analyzer:
          type: standard
          stopwords: &lt;span style="color:#e6db74">&amp;#34;_none_&amp;#34;&lt;/span>
  mappings:
    dynamic: &lt;span style="color:#66d9ef">false&lt;/span>
    properties:
      &lt;span style="color:#e6db74">&amp;#34;@timestamp&amp;#34;&lt;/span>: { type: date }
      timestamp: { type: date }
      level: { type: keyword }
      service: { type: keyword }
      environment: { type: keyword }
      version: { type: keyword }
      request_id: { type: keyword }
      session_id: { type: keyword }
      trace_id: { type: keyword }
      span_id: { type: keyword }
      correlation_id: { type: keyword }
      agent_id: { type: keyword }
      tenant_id: { type: keyword }
      user_id: { type: keyword }
      model: { type: keyword }
      operation: { type: keyword }
      prompt_tokens: { type: integer }
      completion_tokens: { type: integer }
      total_tokens: { type: integer }
      cost_usd: { type: float }
      latency_ms: { type: integer }
      ttft_ms: { type: integer }
      queue_wait_ms: { type: integer }
      guardrail_status: { type: keyword }
      error: { type: text, analyzer: custom_log_analyzer }
      error_type: { type: keyword }
      retry_count: { type: short }
      &lt;span style="color:#75715e"># Nested under quality_scores to match the log schema in 7.1; with dynamic: false,&lt;/span>
      &lt;span style="color:#75715e"># top-level copies of these fields would never be populated.&lt;/span>
      quality_scores:
        properties:
          hallucination_probability: { type: float }
          groundedness: { type: float }
          relevance: { type: float }
      tool_calls:
        type: nested
        properties:
          name: { type: keyword }
          latency_ms: { type: integer }
          status: { type: keyword }
&lt;span style="color:#75715e"># ILM Policy&lt;/span>
ilm_policy:
name: &lt;span style="color:#e6db74">&amp;#34;ai-agent-logs-ilm-policy&amp;#34;&lt;/span>
phases:
hot:
min_age: &lt;span style="color:#e6db74">&amp;#34;0ms&amp;#34;&lt;/span>
actions:
rollover:
max_primary_shard_size: &lt;span style="color:#e6db74">&amp;#34;50gb&amp;#34;&lt;/span>
max_age: &lt;span style="color:#e6db74">&amp;#34;1d&amp;#34;&lt;/span>
set_priority:
priority: &lt;span style="color:#ae81ff">100&lt;/span>
warm:
min_age: &lt;span style="color:#e6db74">&amp;#34;7d&amp;#34;&lt;/span>
actions:
shrink:
number_of_shards: &lt;span style="color:#ae81ff">1&lt;/span>
forcemerge:
max_num_segments: &lt;span style="color:#ae81ff">1&lt;/span>
set_priority:
priority: &lt;span style="color:#ae81ff">50&lt;/span>
cold:
min_age: &lt;span style="color:#e6db74">&amp;#34;30d&amp;#34;&lt;/span>
actions:
freeze: {}
set_priority:
priority: &lt;span style="color:#ae81ff">0&lt;/span>
delete:
min_age: &lt;span style="color:#e6db74">&amp;#34;90d&amp;#34;&lt;/span>
actions:
delete: {}
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="74-kibanaelasticsearch-query-examples">7.4. Kibana/Elasticsearch Query Examples&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-json" data-lang="json">&lt;span style="color:#75715e">// Query 1: Find slow requests (latency &amp;gt;= 5s)&lt;/span>
{
&lt;span style="color:#f92672">&amp;#34;query&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;bool&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;must&amp;#34;&lt;/span>: [
{ &lt;span style="color:#f92672">&amp;#34;term&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;environment&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;production&amp;#34;&lt;/span> } },
{ &lt;span style="color:#f92672">&amp;#34;range&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;latency_ms&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;gte&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">5000&lt;/span> } } },
{ &lt;span style="color:#f92672">&amp;#34;range&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;@timestamp&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;gte&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;now-1h&amp;#34;&lt;/span> } } }
]
}
},
&lt;span style="color:#f92672">&amp;#34;sort&amp;#34;&lt;/span>: [{ &lt;span style="color:#f92672">&amp;#34;latency_ms&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;desc&amp;#34;&lt;/span> }],
&lt;span style="color:#f92672">&amp;#34;size&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">20&lt;/span>
}
&lt;span style="color:#75715e">// Query 2: High-cost sessions today&lt;/span>
{
&lt;span style="color:#f92672">&amp;#34;query&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;bool&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;must&amp;#34;&lt;/span>: [
{ &lt;span style="color:#f92672">&amp;#34;range&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;@timestamp&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;gte&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;now/d&amp;#34;&lt;/span> } } },
{ &lt;span style="color:#f92672">&amp;#34;range&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;cost_usd&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;gte&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0.10&lt;/span> } } }
]
}
},
&lt;span style="color:#f92672">&amp;#34;aggs&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;by_session&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;terms&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;field&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;session_id&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;size&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">20&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;aggs&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;total_cost&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;sum&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;field&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;cost_usd&amp;#34;&lt;/span> } },
&lt;span style="color:#f92672">&amp;#34;total_tokens&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;sum&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;field&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;total_tokens&amp;#34;&lt;/span> } }
}
}
},
&lt;span style="color:#f92672">&amp;#34;size&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>
}
&lt;span style="color:#75715e">// Query 3: Failed tool calls by agent&lt;/span>
{
&lt;span style="color:#f92672">&amp;#34;query&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;bool&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;must&amp;#34;&lt;/span>: [
{ &lt;span style="color:#f92672">&amp;#34;range&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;@timestamp&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;gte&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;now-6h&amp;#34;&lt;/span> } } }
],
&lt;span style="color:#f92672">&amp;#34;filter&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;nested&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;path&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;tool_calls&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;query&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;term&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;tool_calls.status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;failed&amp;#34;&lt;/span> }
}
}
}
]
}
},
&lt;span style="color:#f92672">&amp;#34;aggs&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;by_agent&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;terms&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;field&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;agent_id&amp;#34;&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;aggs&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;failed_tools&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;nested&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;path&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;tool_calls&amp;#34;&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;aggs&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;failed_only&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;filter&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;term&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;tool_calls.status&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;failed&amp;#34;&lt;/span> } },
&lt;span style="color:#f92672">&amp;#34;aggs&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;tool_names&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;terms&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;field&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;tool_calls.name&amp;#34;&lt;/span> } }
}
}
}
}
}
}
},
&lt;span style="color:#f92672">&amp;#34;size&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>
}
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
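&lt;p>The query bodies in section 7.4 can also be built in code, so that the latency threshold and time window are not hard-coded. The following is a minimal sketch under stated assumptions: the helper name &lt;code>build_slow_request_query&lt;/code> and its parameters are hypothetical, and it only returns the request body as a plain dict that mirrors Query 1. Sending it to Elasticsearch is left to whichever client you already use.&lt;/p>

```python
import json


def build_slow_request_query(
    min_latency_ms: int = 5000,
    window: str = "now-1h",
    environment: str = "production",
    size: int = 20,
) -> dict:
    """Build a search body that mirrors Query 1 (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "must": [
                    # exact match on the keyword field
                    {"term": {"environment": environment}},
                    # latency_ms is mapped as an integer, so a range works
                    {"range": {"latency_ms": {"gte": min_latency_ms}}},
                    # date math string, e.g. "now-1h" or "now/d"
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        },
        # slowest requests first
        "sort": [{"latency_ms": "desc"}],
        "size": size,
    }


if __name__ == "__main__":
    print(json.dumps(build_slow_request_query(min_latency_ms=8000), indent=2))
```

&lt;p>Keeping the body as a plain dict makes it straightforward to unit-test the query shape without a running cluster.&lt;/p>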
&lt;h2 id="8-grafana-dashboard--5-panel-groups">8. Grafana Dashboard — 5 Panel Groups&lt;/h2>
&lt;h3 id="81-tng-quan-5-dashboard-panels">8.1. Overview of the 5 Dashboard Panels&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Panel&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Source metrics&lt;/th>
&lt;th>Visualisation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Overview&lt;/strong>&lt;/td>
&lt;td>RPS, Error Rate, Avg Latency&lt;/td>
&lt;td>Prometheus&lt;/td>
&lt;td>Stat + Time series&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Token Economy&lt;/strong>&lt;/td>
&lt;td>Cost/hour, token distribution&lt;/td>
&lt;td>Prometheus&lt;/td>
&lt;td>Bar gauge + Heatmap&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Quality&lt;/strong>&lt;/td>
&lt;td>Hallucination rate, guardrail blocks&lt;/td>
&lt;td>Prometheus&lt;/td>
&lt;td>Time series + Alert&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Agent Health&lt;/strong>&lt;/td>
&lt;td>Per-agent latency heatmap&lt;/td>
&lt;td>Prometheus&lt;/td>
&lt;td>Heatmap&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Business KPI&lt;/strong>&lt;/td>
&lt;td>Task completion, escalation funnel&lt;/td>
&lt;td>Prometheus + ES&lt;/td>
&lt;td>Stat + Bar chart&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="82-grafana-dashboard-json-config-partial">8.2. Grafana Dashboard JSON Config (Partial)&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-json" data-lang="json">{
&lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;AI Agent — LLMOps Dashboard&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;uid&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;llmops-main-dashboard&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;tags&amp;#34;&lt;/span>: [&lt;span style="color:#e6db74">&amp;#34;ai-agent&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;llmops&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;production&amp;#34;&lt;/span>],
&lt;span style="color:#f92672">&amp;#34;refresh&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;30s&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;time&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;from&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;now-3h&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;to&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;now&amp;#34;&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;panels&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;id&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;🟢 Requests Per Second&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;stat&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;gridPos&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;x&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#f92672">&amp;#34;y&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#f92672">&amp;#34;w&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">6&lt;/span>, &lt;span style="color:#f92672">&amp;#34;h&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">4&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;targets&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;datasource&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;prometheus&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;expr&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;sum(rate(llm_request_duration_seconds_count[2m]))&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;legendFormat&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;RPS&amp;#34;&lt;/span>
}
],
&lt;span style="color:#f92672">&amp;#34;fieldConfig&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;defaults&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;mode&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;thresholds&amp;#34;&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;thresholds&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;steps&amp;#34;&lt;/span>: [
{ &lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;green&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;value&amp;#34;&lt;/span>: &lt;span style="color:#66d9ef">null&lt;/span> },
{ &lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;yellow&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;value&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">100&lt;/span> },
{ &lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;red&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;value&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">500&lt;/span> }
]
},
&lt;span style="color:#f92672">&amp;#34;unit&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;reqps&amp;#34;&lt;/span>
}
}
},
{
&lt;span style="color:#f92672">&amp;#34;id&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">2&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;🔴 Error Rate (%)&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;stat&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;gridPos&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;x&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">6&lt;/span>, &lt;span style="color:#f92672">&amp;#34;y&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#f92672">&amp;#34;w&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">6&lt;/span>, &lt;span style="color:#f92672">&amp;#34;h&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">4&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;targets&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;datasource&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;prometheus&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;expr&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;100 * sum(rate(llm_errors_total[5m])) / sum(rate(llm_request_duration_seconds_count[5m]))&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;legendFormat&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Error Rate %&amp;#34;&lt;/span>
}
],
&lt;span style="color:#f92672">&amp;#34;fieldConfig&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;defaults&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;unit&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;percent&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;thresholds&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;steps&amp;#34;&lt;/span>: [
{ &lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;green&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;value&amp;#34;&lt;/span>: &lt;span style="color:#66d9ef">null&lt;/span> },
{ &lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;yellow&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;value&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1&lt;/span> },
{ &lt;span style="color:#f92672">&amp;#34;color&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;red&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;value&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">5&lt;/span> }
]
}
}
}
},
{
&lt;span style="color:#f92672">&amp;#34;id&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">3&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;⏱ Latency P95 (ms)&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;timeseries&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;gridPos&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;x&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#f92672">&amp;#34;y&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">4&lt;/span>, &lt;span style="color:#f92672">&amp;#34;w&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">12&lt;/span>, &lt;span style="color:#f92672">&amp;#34;h&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">8&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;targets&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;datasource&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;prometheus&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;expr&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;histogram_quantile(0.95, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;legendFormat&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;P95 - {{agent_id}}&amp;#34;&lt;/span>
},
{
&lt;span style="color:#f92672">&amp;#34;datasource&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;prometheus&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;expr&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;histogram_quantile(0.50, sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))) * 1000&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;legendFormat&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;P50 - {{agent_id}}&amp;#34;&lt;/span>
}
],
&lt;span style="color:#f92672">&amp;#34;fieldConfig&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;defaults&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;unit&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ms&amp;#34;&lt;/span> }
}
},
{
&lt;span style="color:#f92672">&amp;#34;id&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">4&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;💰 Cost Per Hour (USD)&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;timeseries&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;gridPos&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;x&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">12&lt;/span>, &lt;span style="color:#f92672">&amp;#34;y&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#f92672">&amp;#34;w&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">12&lt;/span>, &lt;span style="color:#f92672">&amp;#34;h&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">8&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;targets&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;datasource&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;prometheus&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;expr&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;sum by(agent_id) (rate(llm_cost_usd_total[1h])) * 3600&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;legendFormat&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Cost/hr - {{agent_id}}&amp;#34;&lt;/span>
}
],
&lt;span style="color:#f92672">&amp;#34;fieldConfig&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;defaults&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;unit&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;currencyUSD&amp;#34;&lt;/span> }
}
},
{
&lt;span style="color:#f92672">&amp;#34;id&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">5&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;🧠 Hallucination Rate (%)&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;timeseries&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;gridPos&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;x&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#f92672">&amp;#34;y&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">12&lt;/span>, &lt;span style="color:#f92672">&amp;#34;w&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">12&lt;/span>, &lt;span style="color:#f92672">&amp;#34;h&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">8&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;targets&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;datasource&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;prometheus&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;expr&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;100 * histogram_quantile(0.90, sum by(le) (rate(llm_hallucination_score_bucket[10m])))&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;legendFormat&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Hallucination P90&amp;#34;&lt;/span>
}
],
&lt;span style="color:#f92672">&amp;#34;alert&amp;#34;&lt;/span>: {
&lt;span style="color:#f92672">&amp;#34;conditions&amp;#34;&lt;/span>: [
{
&lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;query&amp;#34;&lt;/span>,
&lt;span style="color:#f92672">&amp;#34;query&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;params&amp;#34;&lt;/span>: [&lt;span style="color:#e6db74">&amp;#34;A&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;10m&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;now&amp;#34;&lt;/span>] },
&lt;span style="color:#f92672">&amp;#34;reducer&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;avg&amp;#34;&lt;/span> },
&lt;span style="color:#f92672">&amp;#34;evaluator&amp;#34;&lt;/span>: { &lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;gt&amp;#34;&lt;/span>, &lt;span style="color:#f92672">&amp;#34;params&amp;#34;&lt;/span>: [&lt;span style="color:#ae81ff">10&lt;/span>] }
}
],
&lt;span style="color:#f92672">&amp;#34;name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;High Hallucination Rate Alert&amp;#34;&lt;/span>
}
}
]
}
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
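&lt;p>The &lt;code>thresholds.steps&lt;/code> arrays in the panels above are evaluated in ascending order: the step with &lt;code>value: null&lt;/code> is the base color, and a reading takes the color of the highest step whose value it has reached. The sketch below mimics that rule in plain Python for the Error Rate panel. &lt;code>color_for&lt;/code> is a hypothetical helper for illustration only, not a Grafana API.&lt;/p>

```python
from typing import Optional

# Steps copied from the Error Rate panel; None plays the role of JSON null.
ERROR_RATE_STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 1},
    {"color": "red", "value": 5},
]


def color_for(reading: float, steps: list) -> Optional[str]:
    """Return the color of the highest step that the reading has reached."""
    chosen = None
    for step in steps:  # steps are assumed to be sorted by ascending value
        threshold = step["value"]
        if threshold is None or reading >= threshold:
            chosen = step["color"]
    return chosen


if __name__ == "__main__":
    for reading in (0.4, 1.0, 7.5):
        print(reading, color_for(reading, ERROR_RATE_STEPS))
```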
&lt;h2 id="9-alerting-strategy--8-alert-rules-quan-trng">9. Alerting Strategy: 8 Key Alert Rules&lt;/h2>
&lt;h3 id="91-prometheus-alertmanager-config">9.1. Prometheus AlertManager Config&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-yaml" data-lang="yaml">&lt;span style="color:#75715e"># alertmanager-rules.yaml&lt;/span>
---
groups:
- name: llmops_critical
rules:
&lt;span style="color:#75715e"># Alert 1: Cost Spike, daily cost exceeds 150% of baseline&lt;/span>
- alert: LLMCostSpike
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">(&lt;/span>
sum(increase(llm_cost_usd_total[24h]))
/
sum(increase(llm_cost_usd_total[24h] offset 7d))
) &amp;gt; &lt;span style="color:#ae81ff">1.5&lt;/span>
for: 15m
labels:
severity: critical
team: llmops
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;💰 LLM Cost Spike Detected&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;24h cost is {{ $value | humanizePercentage }} of the same 24h window one week earlier.&amp;#34;&lt;/span>
runbook: &lt;span style="color:#e6db74">&amp;#34;https://wiki.company.com/runbooks/llm-cost-spike&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 2: Latency P95 &amp;gt; 5s sustained 5 minutes&lt;/span>
- alert: LLMHighLatencyP95
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">histogram_quantile(0.95,&lt;/span>
sum by(le, agent_id) (rate(llm_request_duration_seconds_bucket[5m]))
) &amp;gt; &lt;span style="color:#ae81ff">5&lt;/span>
for: 5m
labels:
severity: warning
team: llmops
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;⏱ LLM P95 Latency High: {{ $labels.agent_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;P95 latency is {{ $value | humanizeDuration }} for agent {{ $labels.agent_id }}&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 3: Error Rate &amp;gt; 5% over 10 minutes&lt;/span>
- alert: LLMHighErrorRate
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">(&lt;/span>
sum by(agent_id) (rate(llm_errors_total[10m]))
/
sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
) * &lt;span style="color:#ae81ff">100&lt;/span> &amp;gt; &lt;span style="color:#ae81ff">5&lt;/span>
for: 10m
labels:
severity: critical
team: llmops
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;🔴 LLM Error Rate &amp;gt; 5%: {{ $labels.agent_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;Error rate is {{ $value | printf \&amp;#34;%.1f%%\&amp;#34; }} for agent {{ $labels.agent_id }}&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 4: Hallucination score P90 &amp;gt; 0.10 (sampled evaluation)&lt;/span>
- alert: LLMHallucinationRateHigh
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">histogram_quantile(0.90,&lt;/span>
sum by(le, agent_id) (rate(llm_hallucination_score_bucket[15m]))
) &amp;gt; &lt;span style="color:#ae81ff">0.10&lt;/span>
for: 10m
labels:
severity: critical
team: ai-quality
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;🧠 Hallucination Rate Spike: {{ $labels.agent_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;P90 hallucination score is {{ $value | printf \&amp;#34;%.2f\&amp;#34; }} — review recent prompts/model&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 5: Guardrail Block Surge &amp;gt; 20% in 15 minutes&lt;/span>
- alert: LLMGuardrailBlockSurge
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">(&lt;/span>
sum by(agent_id) (rate(llm_guardrail_decisions_total{decision=&lt;span style="color:#e6db74">&amp;#34;block&amp;#34;&lt;/span>}[15m]))
/
sum by(agent_id) (rate(llm_request_duration_seconds_count[15m]))
) * &lt;span style="color:#ae81ff">100&lt;/span> &amp;gt; &lt;span style="color:#ae81ff">20&lt;/span>
for: 5m
labels:
severity: warning
team: llmops
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;🛡 Guardrail Block Surge: {{ $labels.agent_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;{{ $value | printf \&amp;#34;%.1f%%\&amp;#34; }} of requests blocked — possible attack or prompt issue&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 6: Token Quota Approaching 80% of Daily Limit&lt;/span>
- alert: LLMTokenQuotaWarning
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">(&lt;/span>
sum by(tenant_id) (increase(llm_tokens_total[24h]))
/
on(tenant_id) llm_token_daily_quota
) * &lt;span style="color:#ae81ff">100&lt;/span> &amp;gt; &lt;span style="color:#ae81ff">80&lt;/span>
for: 0m
labels:
severity: warning
team: platform
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;📊 Token Quota Warning: {{ $labels.tenant_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;Tenant {{ $labels.tenant_id }} has used {{ $value | printf \&amp;#34;%.0f%%\&amp;#34; }} of daily token quota&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 7: Circuit Breaker OPEN&lt;/span>
- alert: LLMCircuitBreakerOpen
expr: llm_circuit_breaker_state{state=&lt;span style="color:#e6db74">&amp;#34;open&amp;#34;&lt;/span>} == &lt;span style="color:#ae81ff">1&lt;/span>
for: 2m
labels:
severity: critical
team: llmops
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;⚡ Circuit Breaker OPEN: {{ $labels.agent_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;LLM circuit breaker opened for {{ $labels.agent_id }} — service may be degraded&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Alert 8: Memory/Context Overflow Rate Spike&lt;/span>
- alert: LLMContextOverflowSpike
expr: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">(&lt;/span>
sum by(agent_id) (rate(llm_errors_total{error_type=&lt;span style="color:#e6db74">&amp;#34;context_length_exceeded&amp;#34;&lt;/span>}[10m]))
/
sum by(agent_id) (rate(llm_request_duration_seconds_count[10m]))
) * &lt;span style="color:#ae81ff">100&lt;/span> &amp;gt; &lt;span style="color:#ae81ff">5&lt;/span>
for: 5m
labels:
severity: warning
team: llmops
annotations:
summary: &lt;span style="color:#e6db74">&amp;#34;💾 Context Overflow Spike: {{ $labels.agent_id }}&amp;#34;&lt;/span>
description: &lt;span style="color:#e6db74">&amp;#34;{{ $value | printf \&amp;#34;%.1f%%\&amp;#34; }} requests hitting context limit — review chunking/truncation strategy&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># alertmanager.yaml — Routing + Slack Webhook&lt;/span>
---
route:
group_by: [&lt;span style="color:#e6db74">&amp;#39;alertname&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;agent_id&amp;#39;&lt;/span>]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: &lt;span style="color:#e6db74">&amp;#39;slack-llmops&amp;#39;&lt;/span>
routes:
- match:
severity: critical
receiver: &lt;span style="color:#e6db74">&amp;#39;slack-critical-llmops&amp;#39;&lt;/span>
group_wait: 10s
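repeat_interval: 1h
&lt;span style="color:#75715e"># keep evaluating sibling routes so critical ai-quality alerts also reach #ai-quality-alerts&lt;/span>
continue: &lt;span style="color:#66d9ef">true&lt;/span>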
- match:
team: ai-quality
receiver: &lt;span style="color:#e6db74">&amp;#39;slack-ai-quality&amp;#39;&lt;/span>
receivers:
- name: &lt;span style="color:#e6db74">&amp;#39;slack-llmops&amp;#39;&lt;/span>
slack_configs:
- api_url: &lt;span style="color:#e6db74">&amp;#39;https://hooks.slack.com/services/YOUR/WEBHOOK/URL&amp;#39;&lt;/span>
channel: &lt;span style="color:#e6db74">&amp;#39;#llmops-alerts&amp;#39;&lt;/span>
title: &lt;span style="color:#e6db74">&amp;#39;{{ template &amp;#34;slack.title&amp;#34; . }}&amp;#39;&lt;/span>
text: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">{{ range .Alerts }}&lt;/span>
&lt;span style="color:#75715e">*Alert:*&lt;/span> {{ .Annotations.summary }}
&lt;span style="color:#75715e">*Details:*&lt;/span> {{ .Annotations.description }}
&lt;span style="color:#75715e">*Runbook:*&lt;/span> {{ .Annotations.runbook }}
{{ end }}
send_resolved: &lt;span style="color:#66d9ef">true&lt;/span>
- name: &lt;span style="color:#e6db74">&amp;#39;slack-critical-llmops&amp;#39;&lt;/span>
slack_configs:
- api_url: &lt;span style="color:#e6db74">&amp;#39;https://hooks.slack.com/services/YOUR/WEBHOOK/URL&amp;#39;&lt;/span>
channel: &lt;span style="color:#e6db74">&amp;#39;#llmops-critical&amp;#39;&lt;/span>
color: &lt;span style="color:#e6db74">&amp;#39;danger&amp;#39;&lt;/span>
title: &lt;span style="color:#e6db74">&amp;#39;🚨 CRITICAL: {{ template &amp;#34;slack.title&amp;#34; . }}&amp;#39;&lt;/span>
text: &lt;span style="color:#e6db74">|
&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">{{ range .Alerts }}&lt;/span>
&lt;span style="color:#75715e">*Alert:*&lt;/span> {{ .Annotations.summary }}
&lt;span style="color:#75715e">*Details:*&lt;/span> {{ .Annotations.description }}
&lt;span style="color:#75715e">*Runbook:*&lt;/span> {{ .Annotations.runbook }}
{{ end }}
send_resolved: &lt;span style="color:#66d9ef">true&lt;/span>
- name: &lt;span style="color:#e6db74">&amp;#39;slack-ai-quality&amp;#39;&lt;/span>
slack_configs:
- api_url: &lt;span style="color:#e6db74">&amp;#39;https://hooks.slack.com/services/YOUR/WEBHOOK/URL&amp;#39;&lt;/span>
channel: &lt;span style="color:#e6db74">&amp;#39;#ai-quality-alerts&amp;#39;&lt;/span>
title: &lt;span style="color:#e6db74">&amp;#39;{{ template &amp;#34;slack.title&amp;#34; . }}&amp;#39;&lt;/span>
send_resolved: &lt;span style="color:#66d9ef">true&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
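&lt;p>One detail of the routing tree in section 9.1 is easy to miss: Alertmanager walks sibling routes in order and stops at the first one that matches, unless that route sets &lt;code>continue: true&lt;/code>. The hallucination alert carries both &lt;code>severity: critical&lt;/code> and &lt;code>team: ai-quality&lt;/code>, so whether it also reaches &lt;code>slack-ai-quality&lt;/code> depends on that flag. The sketch below is a simplified Python model of that rule, not Alertmanager's own code: the &lt;code>pick_receivers&lt;/code> helper and the dict-based route format are hypothetical and only cover exact label matching.&lt;/p>

```python
def pick_receivers(alert_labels, routes, default_receiver):
    """Model of sibling-route evaluation: first match wins unless 'continue' is set."""
    receivers = []
    for route in routes:
        # A route matches when every label in its 'match' block equals the alert's label.
        matches = all(alert_labels.get(k) == v for k, v in route["match"].items())
        if not matches:
            continue
        receivers.append(route["receiver"])
        # Without 'continue', evaluation stops at the first matching route.
        if not route.get("continue", False):
            break
    # Alerts that match no child route fall back to the top-level receiver.
    return receivers or [default_receiver]


# Labels taken from the LLMHallucinationRateHigh rule.
hallucination_alert = {
    "alertname": "LLMHallucinationRateHigh",
    "severity": "critical",
    "team": "ai-quality",
}

first_match_only = [
    {"match": {"severity": "critical"}, "receiver": "slack-critical-llmops"},
    {"match": {"team": "ai-quality"}, "receiver": "slack-ai-quality"},
]
with_continue = [
    {"match": {"severity": "critical"}, "receiver": "slack-critical-llmops", "continue": True},
    {"match": {"team": "ai-quality"}, "receiver": "slack-ai-quality"},
]

print(pick_receivers(hallucination_alert, first_match_only, "slack-llmops"))
print(pick_receivers(hallucination_alert, with_continue, "slack-llmops"))
```

&lt;p>In this model the first call returns only &lt;code>slack-critical-llmops&lt;/code>, while the second returns both receivers. Reordering the routes so that the &lt;code>team: ai-quality&lt;/code> route comes first is an alternative if the quality team should own those alerts exclusively.&lt;/p>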
&lt;h2 id="10-ab-testing-prompt--model-routing">10. A/B Testing Prompt &amp;amp; Model Routing&lt;/h2>
&lt;h3 id="101-kin-trc-traffic-splitting">10.1. Kiến Trúc Traffic Splitting&lt;/h3>
&lt;pre>&lt;code>           INCOMING REQUESTS
                   │
                   ▼
        ┌───────────────────────┐
        │     FEATURE FLAG      │
        │        SERVICE        │
        │    (LaunchDarkly /    │
        │     self-hosted)      │
        └──────────┬────────────┘
                   │
     ┌─────────────┼──────────────┐
     │ 90%         │ 10%          │
     ▼             ▼              │
┌──────────┐  ┌──────────┐        │
│ Prompt A │  │ Prompt B │   Shadow Mode
│ (control)│  │ (canary) │        │
└────┬─────┘  └────┬─────┘        │
     │             │          ┌───▼───────┐
     ▼             ▼          │ Duplicate │
LLM Response  LLM Response    │ Request   │
                              │ (no user  │
Track:                        │  impact)  │
- Latency                     └─────┬─────┘
- Quality score                     │
- Cost                              ▼
- User satisfaction            Evaluation
                                (offline)
&lt;/code>&lt;/pre>&lt;h3 id="102-python--model-router-vi-weighted-random-selection">10.2. Python: Model Router with Weighted Random Selection&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> random
&lt;span style="color:#f92672">import&lt;/span> time
&lt;span style="color:#f92672">import&lt;/span> hashlib
&lt;span style="color:#f92672">from&lt;/span> dataclasses &lt;span style="color:#f92672">import&lt;/span> dataclass, field
&lt;span style="color:#f92672">from&lt;/span> enum &lt;span style="color:#f92672">import&lt;/span> Enum
&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Optional, Callable
&lt;span style="color:#f92672">import&lt;/span> structlog
logger &lt;span style="color:#f92672">=&lt;/span> structlog&lt;span style="color:#f92672">.&lt;/span>get_logger()
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">RoutingStrategy&lt;/span>(Enum):
WEIGHTED_RANDOM &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">weighted_random&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
TENANT_BASED &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tenant_based&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
TASK_COMPLEXITY &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">task_complexity&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
CANARY &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">canary&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
SHADOW &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">shadow&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
&lt;span style="color:#a6e22e">@dataclass&lt;/span>
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ModelConfig&lt;/span>:
model: str
weight: float &lt;span style="color:#75715e"># 0.0 - 1.0; weights within one experiment must sum to 1.0&lt;/span>
variant_name: str &lt;span style="color:#75715e"># &amp;#34;control&amp;#34;, &amp;#34;canary_v2&amp;#34;, &amp;#34;shadow&amp;#34;&lt;/span>
max_tokens: int &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">4096&lt;/span>
temperature: float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.7&lt;/span>
extra_params: dict &lt;span style="color:#f92672">=&lt;/span> field(default_factory&lt;span style="color:#f92672">=&lt;/span>dict)
&lt;span style="color:#a6e22e">@dataclass&lt;/span>
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">RoutingDecision&lt;/span>:
model_config: ModelConfig
strategy_used: str
routing_reason: str
experiment_id: Optional[str] &lt;span style="color:#f92672">=&lt;/span> None
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">AIAgentModelRouter&lt;/span>:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Model router supporting several strategies:&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - A/B test (weighted random)&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - Per-tenant routing&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - Task complexity routing&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - Shadow mode (duplicate traffic)&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> __init__(self):
&lt;span style="color:#75715e"># A/B test configurations&lt;/span>
self&lt;span style="color:#f92672">.&lt;/span>_ab_experiments: dict[str, list[ModelConfig]] &lt;span style="color:#f92672">=&lt;/span> {}
&lt;span style="color:#75715e"># Tenant-specific routing&lt;/span>
self&lt;span style="color:#f92672">.&lt;/span>_tenant_routing: dict[str, ModelConfig] &lt;span style="color:#f92672">=&lt;/span> {}
&lt;span style="color:#75715e"># Default routing by task type&lt;/span>
self&lt;span style="color:#f92672">.&lt;/span>_task_routing: dict[str, ModelConfig] &lt;span style="color:#f92672">=&lt;/span> {
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">simple_faq&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: ModelConfig(
model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o-mini&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">control&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
max_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1024&lt;/span>, temperature&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.3&lt;/span>,
),
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">complex_analysis&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: ModelConfig(
model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">control&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
max_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">4096&lt;/span>, temperature&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.7&lt;/span>,
),
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">sensitive_medical&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: ModelConfig(
model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">ollama/llama3.1&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">on_premise&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
max_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">2048&lt;/span>, temperature&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.1&lt;/span>,
),
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">code_generation&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: ModelConfig(
model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">claude-3-5-sonnet&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">control&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
max_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">4096&lt;/span>, temperature&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.2&lt;/span>,
),
}
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">register_ab_experiment&lt;/span>(
self,
experiment_id: str,
configs: list[ModelConfig],
) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Register an A/B experiment with weighted configs.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
total_weight &lt;span style="color:#f92672">=&lt;/span> sum(c&lt;span style="color:#f92672">.&lt;/span>weight &lt;span style="color:#66d9ef">for&lt;/span> c &lt;span style="color:#f92672">in&lt;/span> configs)
&lt;span style="color:#66d9ef">if&lt;/span> abs(total_weight &lt;span style="color:#f92672">-&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>) &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0.001&lt;/span>:
&lt;span style="color:#66d9ef">raise&lt;/span> &lt;span style="color:#a6e22e">ValueError&lt;/span>(f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Weights must sum to 1.0, got {total_weight}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
self&lt;span style="color:#f92672">.&lt;/span>_ab_experiments[experiment_id] &lt;span style="color:#f92672">=&lt;/span> configs
logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">ab_experiment_registered&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, experiment_id&lt;span style="color:#f92672">=&lt;/span>experiment_id,
variants&lt;span style="color:#f92672">=&lt;/span>[c&lt;span style="color:#f92672">.&lt;/span>variant_name &lt;span style="color:#66d9ef">for&lt;/span> c &lt;span style="color:#f92672">in&lt;/span> configs])
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">route&lt;/span>(
self,
task_type: str,
tenant_id: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">default&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
session_id: str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
experiment_id: Optional[str] &lt;span style="color:#f92672">=&lt;/span> None,
force_strategy: Optional[RoutingStrategy] &lt;span style="color:#f92672">=&lt;/span> None,
) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> RoutingDecision:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Select a model config based on the routing strategy.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># 1. Tenant-specific override (applies only when no experiment_id is passed)&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> tenant_id &lt;span style="color:#f92672">in&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_tenant_routing &lt;span style="color:#f92672">and&lt;/span> &lt;span style="color:#f92672">not&lt;/span> experiment_id:
config &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_tenant_routing[tenant_id]
&lt;span style="color:#66d9ef">return&lt;/span> RoutingDecision(
model_config&lt;span style="color:#f92672">=&lt;/span>config,
strategy_used&lt;span style="color:#f92672">=&lt;/span>RoutingStrategy&lt;span style="color:#f92672">.&lt;/span>TENANT_BASED&lt;span style="color:#f92672">.&lt;/span>value,
routing_reason&lt;span style="color:#f92672">=&lt;/span>f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Tenant {tenant_id} has dedicated model&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
)
&lt;span style="color:#75715e"># 2. A/B experiment (if an experiment_id is given)&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> experiment_id &lt;span style="color:#f92672">and&lt;/span> experiment_id &lt;span style="color:#f92672">in&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_ab_experiments:
configs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_ab_experiments[experiment_id]
&lt;span style="color:#75715e"># Sticky routing: same session_id → same variant (consistent UX)&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> session_id:
hash_val &lt;span style="color:#f92672">=&lt;/span> int(hashlib&lt;span style="color:#f92672">.&lt;/span>md5(session_id&lt;span style="color:#f92672">.&lt;/span>encode())&lt;span style="color:#f92672">.&lt;/span>hexdigest(), &lt;span style="color:#ae81ff">16&lt;/span>)
bucket &lt;span style="color:#f92672">=&lt;/span> (hash_val &lt;span style="color:#f92672">%&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span>) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">1000.0&lt;/span>
&lt;span style="color:#66d9ef">else&lt;/span>:
bucket &lt;span style="color:#f92672">=&lt;/span> random&lt;span style="color:#f92672">.&lt;/span>random()
cumulative &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;span style="color:#66d9ef">for&lt;/span> config &lt;span style="color:#f92672">in&lt;/span> configs:
cumulative &lt;span style="color:#f92672">+&lt;/span>&lt;span style="color:#f92672">=&lt;/span> config&lt;span style="color:#f92672">.&lt;/span>weight
&lt;span style="color:#66d9ef">if&lt;/span> bucket &lt;span style="color:#f92672">&amp;lt;&lt;/span> cumulative: &lt;span style="color:#75715e"># half-open interval: a 0.90 weight gets exactly 900 of the 1000 hash buckets&lt;/span>
logger&lt;span style="color:#f92672">.&lt;/span>info(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">ab_routing&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
experiment_id&lt;span style="color:#f92672">=&lt;/span>experiment_id,
variant&lt;span style="color:#f92672">=&lt;/span>config&lt;span style="color:#f92672">.&lt;/span>variant_name,
model&lt;span style="color:#f92672">=&lt;/span>config&lt;span style="color:#f92672">.&lt;/span>model,
session_id&lt;span style="color:#f92672">=&lt;/span>session_id,
)
&lt;span style="color:#66d9ef">return&lt;/span> RoutingDecision(
model_config&lt;span style="color:#f92672">=&lt;/span>config,
strategy_used&lt;span style="color:#f92672">=&lt;/span>RoutingStrategy&lt;span style="color:#f92672">.&lt;/span>WEIGHTED_RANDOM&lt;span style="color:#f92672">.&lt;/span>value,
routing_reason&lt;span style="color:#f92672">=&lt;/span>f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">A/B bucket {bucket:.3f} → {config.variant_name}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
experiment_id&lt;span style="color:#f92672">=&lt;/span>experiment_id,
)
&lt;span style="color:#75715e"># 3. Task complexity routing (fallback)&lt;/span>
config &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_task_routing&lt;span style="color:#f92672">.&lt;/span>get(
task_type,
ModelConfig(model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o-mini&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">default&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
)
&lt;span style="color:#66d9ef">return&lt;/span> RoutingDecision(
model_config&lt;span style="color:#f92672">=&lt;/span>config,
strategy_used&lt;span style="color:#f92672">=&lt;/span>RoutingStrategy&lt;span style="color:#f92672">.&lt;/span>TASK_COMPLEXITY&lt;span style="color:#f92672">.&lt;/span>value,
routing_reason&lt;span style="color:#f92672">=&lt;/span>f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Task type &lt;/span>&lt;span style="color:#e6db74">&amp;#39;&lt;/span>&lt;span style="color:#e6db74">{task_type}&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&lt;/span>&lt;span style="color:#e6db74"> → {config.model}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
)
&lt;span style="color:#75715e"># ─── Sample Usage ──────────────────────────────────────────────────────────────&lt;/span>
router &lt;span style="color:#f92672">=&lt;/span> AIAgentModelRouter()
&lt;span style="color:#75715e"># Register an A/B experiment: 90% prompt A (gpt-4o-mini) vs 10% prompt B (gpt-4o)&lt;/span>
router&lt;span style="color:#f92672">.&lt;/span>register_ab_experiment(
experiment_id&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">exp_prompt_v2_vs_v3&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
configs&lt;span style="color:#f92672">=&lt;/span>[
ModelConfig(model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o-mini&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.90&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">prompt_v2_control&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>),
ModelConfig(model&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">gpt-4o&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, weight&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.10&lt;/span>, variant_name&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">prompt_v3_canary&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>),
],
)
decision &lt;span style="color:#f92672">=&lt;/span> router&lt;span style="color:#f92672">.&lt;/span>route(
task_type&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">simple_faq&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
tenant_id&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">tenant-abc&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
session_id&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">sess-xyz789&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
experiment_id&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">exp_prompt_v2_vs_v3&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
)
&lt;span style="color:#66d9ef">print&lt;/span>(f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Model: {decision.model_config.model}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;span style="color:#66d9ef">print&lt;/span>(f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Variant: {decision.model_config.variant_name}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;span style="color:#66d9ef">print&lt;/span>(f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Strategy: {decision.strategy_used}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="103-bng-kt-qu-ab-test-sample">10.3. Sample A/B Test Results&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Prompt A (control)&lt;/th>
&lt;th>Prompt B (canary)&lt;/th>
&lt;th>Δ&lt;/th>
&lt;th>Verdict&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Latency P95 (ms)&lt;/strong>&lt;/td>
&lt;td>1,820&lt;/td>
&lt;td>2,340&lt;/td>
&lt;td>+28.6%&lt;/td>
&lt;td>❌ B is slower&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Quality Score (LLM Judge)&lt;/strong>&lt;/td>
&lt;td>3.8/5&lt;/td>
&lt;td>4.3/5&lt;/td>
&lt;td>+13.2%&lt;/td>
&lt;td>✅ B is better&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cost/request (USD)&lt;/strong>&lt;/td>
&lt;td>$0.0021&lt;/td>
&lt;td>$0.0047&lt;/td>
&lt;td>+123.8%&lt;/td>
&lt;td>❌ B is more expensive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>User satisfaction (CSAT)&lt;/strong>&lt;/td>
&lt;td>76%&lt;/td>
&lt;td>83%&lt;/td>
&lt;td>+7 pp&lt;/td>
&lt;td>✅ B is better&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Task completion rate&lt;/strong>&lt;/td>
&lt;td>88%&lt;/td>
&lt;td>92%&lt;/td>
&lt;td>+4 pp&lt;/td>
&lt;td>✅ B is better&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Hallucination rate&lt;/strong>&lt;/td>
&lt;td>4.2%&lt;/td>
&lt;td>1.8%&lt;/td>
&lt;td>-57%&lt;/td>
&lt;td>✅ B is safer&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Guardrail block rate&lt;/strong>&lt;/td>
&lt;td>1.8%&lt;/td>
&lt;td>1.2%&lt;/td>
&lt;td>-33%&lt;/td>
&lt;td>✅ B is cleaner&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
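&lt;p>&lt;strong>Conclusion&lt;/strong>: Prompt B (canary) delivers clearly better quality (higher judge score, CSAT, and task completion; lower hallucination rate), but it costs about 2.2x more per request and adds roughly 29% to P95 latency. Decision: roll out prompt B to premium tenants, who are willing to pay for the quality, and keep prompt A for the free tier.&lt;/p>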
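&lt;p>The relative deltas in the results table can be reproduced from the two measurement columns. A minimal sketch, assuming the latency, quality, cost, hallucination, and guardrail rows report relative change against the control; the helper name &lt;code>relative_delta_pct&lt;/code> is illustrative and not part of the article's codebase:&lt;/p>

```python
def relative_delta_pct(control: float, canary: float) -> float:
    """Relative change of canary vs control, in percent, rounded to 1 decimal."""
    return round((canary - control) / control * 100, 1)


# (control, canary) pairs copied from the sample results table.
metrics = {
    "latency_p95_ms": (1820, 2340),
    "quality_score": (3.8, 4.3),
    "cost_per_request_usd": (0.0021, 0.0047),
    "hallucination_rate_pct": (4.2, 1.8),
    "guardrail_block_rate_pct": (1.8, 1.2),
}

deltas = {name: relative_delta_pct(a, b) for name, (a, b) in metrics.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

&lt;p>CSAT and task completion are left out because their Δ values in the table are differences in percentage points (83 minus 76, 92 minus 88) rather than relative changes.&lt;/p>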
&lt;hr>
&lt;h2 id="11-model-routing-theo-tc-v">11. Task-Based Model Routing&lt;/h2>
&lt;h3 id="111-decision-matrix">11.1. Decision Matrix&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Task type&lt;/th>
&lt;th>Complexity&lt;/th>
&lt;th>Recommended model&lt;/th>
&lt;th>Cost/1K input tokens&lt;/th>
&lt;th>Latency P95&lt;/th>
&lt;th>Notes&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Simple FAQ&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>GPT-4o-mini / Gemini Flash&lt;/td>
&lt;td>$0.00015&lt;/td>
&lt;td>&amp;lt; 500ms&lt;/td>
&lt;td>80% traffic&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Text summarization&lt;/td>
&lt;td>Low to medium&lt;/td>
&lt;td>GPT-4o-mini&lt;/td>
&lt;td>$0.00015&lt;/td>
&lt;td>&amp;lt; 800ms&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Analysis, comparison&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>GPT-4o / Claude 3.5 Sonnet&lt;/td>
&lt;td>$0.005&lt;/td>
&lt;td>&amp;lt; 2s&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex reasoning&lt;/td>
&lt;td>High&lt;/td>
&lt;td>GPT-4o / Claude 3.5&lt;/td>
&lt;td>$0.005&lt;/td>
&lt;td>&amp;lt; 3s&lt;/td>
&lt;td>15% traffic&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Code generation&lt;/td>
&lt;td>High&lt;/td>
&lt;td>Claude 3.5 Sonnet&lt;/td>
&lt;td>$0.003&lt;/td>
&lt;td>&amp;lt; 3s&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Medical/sensitive data&lt;/td>
&lt;td>Any&lt;/td>
&lt;td>Ollama on-premise&lt;/td>
&lt;td>$0 (infra cost)&lt;/td>
&lt;td>&amp;lt; 2s&lt;/td>
&lt;td>Data never leaves the server&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Real-time chat&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>GPT-4o-mini (streaming)&lt;/td>
&lt;td>$0.00015&lt;/td>
&lt;td>TTFT &amp;lt; 200ms&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Batch processing&lt;/td>
&lt;td>Any&lt;/td>
&lt;td>GPT-4o Batch API&lt;/td>
&lt;td>50% discount&lt;/td>
&lt;td>Hours&lt;/td>
&lt;td>Not real-time&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
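&lt;p>The decision matrix translates naturally into a lookup table. The following is a minimal sketch, not the article's router: the task labels, the &lt;code>Route&lt;/code> class, and &lt;code>pick_route&lt;/code> are hypothetical names, and the model strings are display labels rather than exact provider model IDs.&lt;/p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    cost_per_1k_tokens_usd: float  # 0.0 means "infrastructure cost only"


# Task labels are hypothetical; models and prices mirror the decision matrix.
ROUTES = {
    "faq": Route("gpt-4o-mini", 0.00015),
    "summarize": Route("gpt-4o-mini", 0.00015),
    "analysis": Route("gpt-4o", 0.005),
    "reasoning": Route("gpt-4o", 0.005),
    "code": Route("claude-3.5-sonnet", 0.003),
}
ON_PREM = Route("ollama-on-premise", 0.0)
DEFAULT = ROUTES["faq"]


def pick_route(task_type: str, sensitive: bool = False) -> Route:
    """Sensitive data always stays on-premise; unknown tasks use the cheap tier."""
    if sensitive:
        return ON_PREM
    return ROUTES.get(task_type, DEFAULT)


print(pick_route("code").model)
print(pick_route("faq", sensitive=True).model)
```

&lt;p>Checking the sensitivity flag before the task type keeps the privacy rule from being bypassed by a misclassified task.&lt;/p>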
&lt;h3 id="112-bng-cost-vs-quality-trade-off">11.2. Cost vs Quality Trade-off&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Provider&lt;/th>
&lt;th>Model&lt;/th>
&lt;th>Input $/1M&lt;/th>
&lt;th>Output $/1M&lt;/th>
&lt;th>Quality Score&lt;/th>
&lt;th>Latency&lt;/th>
&lt;th>Data Privacy&lt;/th>
&lt;th>Best for&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>OpenAI&lt;/td>
&lt;td>GPT-4o-mini&lt;/td>
&lt;td>$0.15&lt;/td>
&lt;td>$0.60&lt;/td>
&lt;td>4.0/5&lt;/td>
&lt;td>Fast&lt;/td>
&lt;td>Cloud&lt;/td>
&lt;td>General, cost-sensitive&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>OpenAI&lt;/td>
&lt;td>GPT-4o&lt;/td>
&lt;td>$5.00&lt;/td>
&lt;td>$15.00&lt;/td>
&lt;td>4.7/5&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>Cloud&lt;/td>
&lt;td>Complex reasoning&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Anthropic&lt;/td>
&lt;td>Claude 3 Haiku&lt;/td>
&lt;td>$0.25&lt;/td>
&lt;td>$1.25&lt;/td>
&lt;td>4.0/5&lt;/td>
&lt;td>Fast&lt;/td>
&lt;td>Cloud&lt;/td>
&lt;td>Safe, structured output&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Anthropic&lt;/td>
&lt;td>Claude 3.5 Sonnet&lt;/td>
&lt;td>$3.00&lt;/td>
&lt;td>$15.00&lt;/td>
&lt;td>4.8/5&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>Cloud&lt;/td>
&lt;td>High quality, coding&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Google&lt;/td>
&lt;td>Gemini 1.5 Flash&lt;/td>
&lt;td>$0.075&lt;/td>
&lt;td>$0.30&lt;/td>
&lt;td>3.9/5&lt;/td>
&lt;td>Very Fast&lt;/td>
&lt;td>Cloud&lt;/td>
&lt;td>Ultra low cost&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Azure OpenAI&lt;/td>
&lt;td>GPT-4o&lt;/td>
&lt;td>$5.00&lt;/td>
&lt;td>$15.00&lt;/td>
&lt;td>4.7/5&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>Cloud (VNet)&lt;/td>
&lt;td>Enterprise compliance&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Ollama&lt;/td>
&lt;td>Llama 3.1 70B&lt;/td>
&lt;td>$0 (GPU)&lt;/td>
&lt;td>$0 (GPU)&lt;/td>
&lt;td>4.0/5&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>On-premise&lt;/td>
&lt;td>Healthcare, banking&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Ollama&lt;/td>
&lt;td>Qwen2.5 7B&lt;/td>
&lt;td>$0 (GPU)&lt;/td>
&lt;td>$0 (GPU)&lt;/td>
&lt;td>3.6/5&lt;/td>
&lt;td>Fast&lt;/td>
&lt;td>On-premise&lt;/td>
&lt;td>Zero API cost, low-complexity tasks&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
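&lt;p>Per-million-token prices are easier to compare once they are converted into a per-request cost. A small sketch using the list prices from the table; the function name and the 1,000-input / 500-output request shape are assumptions chosen for illustration:&lt;/p>

```python
# (input USD per 1M tokens, output USD per 1M tokens), copied from the table.
PRICES_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (5.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gemini-1.5-flash": (0.075, 0.30),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request at the list prices above."""
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical request shape: 1,000 input tokens and 500 output tokens.
for model in PRICES_PER_1M:
    print(f"{model}: ${request_cost_usd(model, 1000, 500):.6f}")
```

&lt;p>With this shape, the 500 output tokens on GPT-4o cost more than the 1,000 input tokens, so capping output length is often the cheaper lever.&lt;/p>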
&lt;hr>
&lt;h2 id="12-sampling-strategy-cho-production">12. Sampling Strategy for Production&lt;/h2>
&lt;h3 id="121-vn-">12.1. The Problem&lt;/h3>
&lt;p>What 100% sampling costs for a production AI agent:&lt;/p>
&lt;ul>
&lt;li>10,000 requests/day × 5 spans/request = 50,000 spans/day&lt;/li>
&lt;li>Storage: ~2KB/span × 50,000 = 100MB/day of traces&lt;/li>
&lt;li>3 months: ~9GB of trace data alone&lt;/li>
&lt;li>Jaeger + object storage cost: ~$50-100/month&lt;/li>
&lt;/ul>
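&lt;p>The arithmetic behind these bullets, written out so the inputs can be swapped for your own traffic. The constants are the article's rough figures; decimal units (1 MB = 1000 KB) and a 90-day month count are assumptions:&lt;/p>

```python
REQUESTS_PER_DAY = 10_000
SPANS_PER_REQUEST = 5
KB_PER_SPAN = 2        # rough average span size from the list above
RETENTION_DAYS = 90    # "3 months"

spans_per_day = REQUESTS_PER_DAY * SPANS_PER_REQUEST
mb_per_day = spans_per_day * KB_PER_SPAN / 1000      # 1 MB = 1000 KB
gb_retained = mb_per_day * RETENTION_DAYS / 1000     # 1 GB = 1000 MB

print(spans_per_day, mb_per_day, gb_retained)
```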
&lt;p>&lt;strong>Solution&lt;/strong>: Adaptive (tail-based) sampling.&lt;/p>
&lt;h3 id="122-chin-lc-sampling">12.2. Sampling Strategy&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Request Type&lt;/th>
&lt;th>Sampling Rate&lt;/th>
&lt;th>Reason&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Error requests&lt;/td>
&lt;td>&lt;strong>100%&lt;/strong>&lt;/td>
&lt;td>Full debugging context needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Slow requests (P95+)&lt;/td>
&lt;td>&lt;strong>100%&lt;/strong>&lt;/td>
&lt;td>Performance investigation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High-cost requests (&amp;gt;$0.10)&lt;/td>
&lt;td>&lt;strong>100%&lt;/strong>&lt;/td>
&lt;td>Cost audit&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Guardrail blocked&lt;/td>
&lt;td>&lt;strong>100%&lt;/strong>&lt;/td>
&lt;td>Security audit&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Normal successful requests&lt;/td>
&lt;td>&lt;strong>10%&lt;/strong>&lt;/td>
&lt;td>Statistical representation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Health checks / internal&lt;/td>
&lt;td>&lt;strong>0%&lt;/strong>&lt;/td>
&lt;td>Noise reduction&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
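&lt;p>As a rough check on what these rates mean for storage, the sketch below applies them to 10,000 requests/day. The traffic mix (2% errors, 5% slow, the rest normal) is an illustrative assumption, and high-cost and guardrail-blocked requests are ignored for simplicity:&lt;/p>

```python
REQUESTS_PER_DAY = 10_000
SPANS_PER_REQUEST = 5
KB_PER_SPAN = 2

# Assumed traffic mix (illustrative): 2% errors, 5% slow, the rest normal.
errors = REQUESTS_PER_DAY * 2 // 100
slow = REQUESTS_PER_DAY * 5 // 100
normal = REQUESTS_PER_DAY - errors - slow

# Errors and slow requests are kept in full; normal traffic is sampled at 10%.
kept_requests = errors + slow + normal * 10 // 100
kept_kb = kept_requests * SPANS_PER_REQUEST * KB_PER_SPAN
full_kb = REQUESTS_PER_DAY * SPANS_PER_REQUEST * KB_PER_SPAN

savings_pct = 100 * (full_kb - kept_kb) / full_kb
print(kept_requests, kept_kb / 1000, round(savings_pct, 1))
```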
&lt;p>&lt;strong>Estimated storage cost (10,000 req/day):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>Error rate 2% = 200 requests → 200 × 5 spans × 2KB = 2MB
Slow rate 5% = 500 requests → 500 × 5 spans × 2KB = 5MB
Normal 10% = 930 requests → 930 × 5 spans × 2KB = 9.3MB
Total/day ≈ 16.3MB (vs 100MB with 100% sampling)
Savings: ~84%
&lt;/code>&lt;/pre>&lt;h3 id="123-python-otel-adaptive-sampler">12.3. Python OTel Adaptive Sampler&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> random
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.sdk.trace.sampling &lt;span style="color:#f92672">import&lt;/span> (
Sampler,
SamplingResult,
Decision,
ALWAYS_ON,
ALWAYS_OFF,
)
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.trace &lt;span style="color:#f92672">import&lt;/span> SpanKind
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.context &lt;span style="color:#f92672">import&lt;/span> Context
&lt;span style="color:#f92672">from&lt;/span> opentelemetry.util.types &lt;span style="color:#f92672">import&lt;/span> Attributes
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">AdaptiveLLMSampler&lt;/span>(Sampler):
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Adaptive sampler for LLM workloads. should_sample runs when a span is created, so the rules only see attributes passed at span start.&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - Errors: 100&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - Slow requests: 100&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> - Normal: configurable rate (default 10&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">)&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> __init__(
self,
normal_sample_rate: float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.10&lt;/span>,
slow_threshold_ms: float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3000.0&lt;/span>,
high_cost_threshold_usd: float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.10&lt;/span>,
):
self&lt;span style="color:#f92672">.&lt;/span>normal_sample_rate &lt;span style="color:#f92672">=&lt;/span> normal_sample_rate
self&lt;span style="color:#f92672">.&lt;/span>slow_threshold_ms &lt;span style="color:#f92672">=&lt;/span> slow_threshold_ms
self&lt;span style="color:#f92672">.&lt;/span>high_cost_threshold_usd &lt;span style="color:#f92672">=&lt;/span> high_cost_threshold_usd
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">should_sample&lt;/span>(
self,
parent_context: Context,
trace_id: int,
name: str,
kind: SpanKind &lt;span style="color:#f92672">=&lt;/span> SpanKind&lt;span style="color:#f92672">.&lt;/span>INTERNAL,
attributes: Attributes &lt;span style="color:#f92672">=&lt;/span> None,
links: list &lt;span style="color:#f92672">=&lt;/span> None,
trace_state: object &lt;span style="color:#f92672">=&lt;/span> None,
) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> SamplingResult:
attrs &lt;span style="color:#f92672">=&lt;/span> attributes &lt;span style="color:#f92672">or&lt;/span> {}
&lt;span style="color:#75715e"># Rule 1: Always sample errors&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> attrs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">error&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, False) &lt;span style="color:#f92672">or&lt;/span> attrs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">http.status_code&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#ae81ff">200&lt;/span>) &lt;span style="color:#f92672">&amp;gt;&lt;/span>&lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">500&lt;/span>:
&lt;span style="color:#66d9ef">return&lt;/span> SamplingResult(Decision&lt;span style="color:#f92672">.&lt;/span>RECORD_AND_SAMPLE, attributes&lt;span style="color:#f92672">=&lt;/span>attrs)
&lt;span style="color:#75715e"># Rule 2: Always sample slow requests&lt;/span>
latency_ms &lt;span style="color:#f92672">=&lt;/span> attrs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.latency_ms&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>)
&lt;span style="color:#66d9ef">if&lt;/span> latency_ms &lt;span style="color:#f92672">&amp;gt;&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>slow_threshold_ms:
&lt;span style="color:#66d9ef">return&lt;/span> SamplingResult(Decision&lt;span style="color:#f92672">.&lt;/span>RECORD_AND_SAMPLE, attributes&lt;span style="color:#f92672">=&lt;/span>attrs)
&lt;span style="color:#75715e"># Rule 3: Always sample high-cost requests&lt;/span>
cost_usd &lt;span style="color:#f92672">=&lt;/span> attrs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm.cost_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>)
&lt;span style="color:#66d9ef">if&lt;/span> cost_usd &lt;span style="color:#f92672">&amp;gt;&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>high_cost_threshold_usd:
&lt;span style="color:#66d9ef">return&lt;/span> SamplingResult(Decision&lt;span style="color:#f92672">.&lt;/span>RECORD_AND_SAMPLE, attributes&lt;span style="color:#f92672">=&lt;/span>attrs)
&lt;span style="color:#75715e"># Rule 4: Always sample guardrail blocks&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> attrs&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">guardrail.decision&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>) &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">block&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>:
&lt;span style="color:#66d9ef">return&lt;/span> SamplingResult(Decision&lt;span style="color:#f92672">.&lt;/span>RECORD_AND_SAMPLE, attributes&lt;span style="color:#f92672">=&lt;/span>attrs)
&lt;span style="color:#75715e"># Rule 5: Normal sampling (10%)&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> random&lt;span style="color:#f92672">.&lt;/span>random() &lt;span style="color:#f92672">&amp;lt;&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>normal_sample_rate:
&lt;span style="color:#66d9ef">return&lt;/span> SamplingResult(Decision&lt;span style="color:#f92672">.&lt;/span>RECORD_AND_SAMPLE, attributes&lt;span style="color:#f92672">=&lt;/span>attrs)
&lt;span style="color:#66d9ef">return&lt;/span> SamplingResult(Decision&lt;span style="color:#f92672">.&lt;/span>DROP)
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_description&lt;/span>(self) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> str:
&lt;span style="color:#66d9ef">return&lt;/span> f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">AdaptiveLLMSampler(normal={self.normal_sample_rate})&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
&lt;span style="color:#75715e"># Usage with TracerProvider:&lt;/span>
&lt;span style="color:#75715e"># from opentelemetry.sdk.trace import TracerProvider&lt;/span>
&lt;span style="color:#75715e"># provider = TracerProvider(sampler=AdaptiveLLMSampler(normal_sample_rate=0.10))&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
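&lt;p>One caveat on the sampler above: the OpenTelemetry SDK calls &lt;code>should_sample&lt;/code> when a span is created, so values that only exist after the LLM call returns (latency, cost, guardrail outcome) are not available to it unless the caller already knows them. Decisions based on outcomes are usually made after the trace has finished, for example in the OpenTelemetry Collector's tail sampling processor. The sketch below shows the same rules applied to a finished request in plain Python; &lt;code>FinishedTrace&lt;/code> and &lt;code>keep_trace&lt;/code> are hypothetical names, not OpenTelemetry APIs:&lt;/p>

```python
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    """Summary of a completed request; field names are hypothetical."""
    trace_id: int
    error: bool = False
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    guardrail_blocked: bool = False


def keep_trace(
    t: FinishedTrace,
    normal_rate_pct: int = 10,
    slow_threshold_ms: float = 3000.0,
    high_cost_threshold_usd: float = 0.10,
) -> bool:
    """Decide after the request finished, when latency and cost are known."""
    if t.error or t.guardrail_blocked:
        return True
    if t.latency_ms > slow_threshold_ms or t.cost_usd > high_cost_threshold_usd:
        return True
    # Deterministic bucket from the trace id, so every service that sees
    # the same trace reaches the same decision.
    return normal_rate_pct > t.trace_id % 100


print(keep_trace(FinishedTrace(trace_id=55, latency_ms=3500.0)))
print(keep_trace(FinishedTrace(trace_id=55)))
```

&lt;p>Bucketing on the trace id instead of calling &lt;code>random.random()&lt;/code> keeps the decision reproducible, which also makes the rule easy to unit test.&lt;/p>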
&lt;h2 id="13-llm-cost-management">13. LLM Cost Management&lt;/h2>
&lt;h3 id="131-budgeting-per-tenant--project">13.1. Budgeting Per Tenant / Project&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> time
&lt;span style="color:#f92672">import&lt;/span> redis
&lt;span style="color:#f92672">from&lt;/span> dataclasses &lt;span style="color:#f92672">import&lt;/span> dataclass
&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Optional
&lt;span style="color:#f92672">import&lt;/span> structlog
logger &lt;span style="color:#f92672">=&lt;/span> structlog&lt;span style="color:#f92672">.&lt;/span>get_logger()
&lt;span style="color:#a6e22e">@dataclass&lt;/span>
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">BudgetConfig&lt;/span>:
tenant_id: str
daily_budget_usd: float
monthly_budget_usd: float
daily_token_limit: int
alert_threshold_pct: float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.80&lt;/span> &lt;span style="color:#75715e"># Alert at 80% of the budget&lt;/span>
hard_stop: bool &lt;span style="color:#f92672">=&lt;/span> True &lt;span style="color:#75715e"># Block calls once the budget is exceeded&lt;/span>
&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LLMBudgetGuard&lt;/span>:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Middleware that checks the budget before each LLM call.&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Uses Redis to track spending in real time.&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> __init__(self, redis_client: redis&lt;span style="color:#f92672">.&lt;/span>Redis):
self&lt;span style="color:#f92672">.&lt;/span>redis &lt;span style="color:#f92672">=&lt;/span> redis_client
self&lt;span style="color:#f92672">.&lt;/span>_budgets: dict[str, BudgetConfig] &lt;span style="color:#f92672">=&lt;/span> {}
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">register_budget&lt;/span>(self, config: BudgetConfig) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
self&lt;span style="color:#f92672">.&lt;/span>_budgets[config&lt;span style="color:#f92672">.&lt;/span>tenant_id] &lt;span style="color:#f92672">=&lt;/span> config
logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">budget_registered&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
tenant_id&lt;span style="color:#f92672">=&lt;/span>config&lt;span style="color:#f92672">.&lt;/span>tenant_id,
daily_limit_usd&lt;span style="color:#f92672">=&lt;/span>config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd)
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">_get_today_key&lt;/span>(self, tenant_id: str) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> str:
today &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>strftime(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">Y-&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">m-&lt;/span>&lt;span style="color:#e6db74">%d&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;span style="color:#66d9ef">return&lt;/span> f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_budget:daily:{tenant_id}:{today}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">_get_month_key&lt;/span>(self, tenant_id: str) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> str:
month &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>strftime(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">Y-&lt;/span>&lt;span style="color:#e6db74">%&lt;/span>&lt;span style="color:#e6db74">m&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;span style="color:#66d9ef">return&lt;/span> f&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">llm_budget:monthly:{tenant_id}:{month}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">check_budget&lt;/span>(self, tenant_id: str, estimated_cost_usd: float) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> dict:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Check the budget before calling the LLM.&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> Returns: {&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">allowed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">: bool, &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">reason&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">: str, &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">remaining_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">: float}&lt;/span>&lt;span style="color:#e6db74">
&lt;/span>&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74"> &lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
config &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_budgets&lt;span style="color:#f92672">.&lt;/span>get(tenant_id)
&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> config:
&lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">allowed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: True, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">reason&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">no_budget_configured&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">remaining_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: float(&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">inf&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)}
daily_key &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_get_today_key(tenant_id)
current_daily &lt;span style="color:#f92672">=&lt;/span> float(self&lt;span style="color:#f92672">.&lt;/span>redis&lt;span style="color:#f92672">.&lt;/span>get(daily_key) &lt;span style="color:#f92672">or&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>)
projected_daily &lt;span style="color:#f92672">=&lt;/span> current_daily &lt;span style="color:#f92672">+&lt;/span> estimated_cost_usd
&lt;span style="color:#75715e"># Hard stop check&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> config&lt;span style="color:#f92672">.&lt;/span>hard_stop &lt;span style="color:#f92672">and&lt;/span> projected_daily &lt;span style="color:#f92672">&amp;gt;&lt;/span> config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd:
logger&lt;span style="color:#f92672">.&lt;/span>warning(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">budget_exceeded&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
tenant_id&lt;span style="color:#f92672">=&lt;/span>tenant_id,
current_cost&lt;span style="color:#f92672">=&lt;/span>current_daily,
daily_limit&lt;span style="color:#f92672">=&lt;/span>config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd,
)
&lt;span style="color:#66d9ef">return&lt;/span> {
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">allowed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: False,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">reason&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">daily_budget_exceeded&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">remaining_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: max(&lt;span style="color:#ae81ff">0&lt;/span>, config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd &lt;span style="color:#f92672">-&lt;/span> current_daily),
}
&lt;span style="color:#75715e"># Alert threshold check&lt;/span>
&lt;span style="color:#66d9ef">if&lt;/span> projected_daily &lt;span style="color:#f92672">&amp;gt;&lt;/span> config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd &lt;span style="color:#f92672">*&lt;/span> config&lt;span style="color:#f92672">.&lt;/span>alert_threshold_pct:
logger&lt;span style="color:#f92672">.&lt;/span>warning(
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">budget_threshold_warning&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
tenant_id&lt;span style="color:#f92672">=&lt;/span>tenant_id,
pct_used&lt;span style="color:#f92672">=&lt;/span>projected_daily &lt;span style="color:#f92672">/&lt;/span> config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd,
)
&lt;span style="color:#66d9ef">return&lt;/span> {
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">allowed&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: True,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">reason&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">within_budget&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>,
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">remaining_usd&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>: config&lt;span style="color:#f92672">.&lt;/span>daily_budget_usd &lt;span style="color:#f92672">-&lt;/span> current_daily,
}
&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">record_usage&lt;/span>(self, tenant_id: str, actual_cost_usd: float) &lt;span style="color:#f92672">-&lt;/span>&lt;span style="color:#f92672">&amp;gt;&lt;/span> None:
&lt;span style="color:#e6db74">&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>&lt;span style="color:#e6db74">Record the actual cost after the LLM call completes.&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;&lt;/span>
daily_key &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_get_today_key(tenant_id)
month_key &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>_get_month_key(tenant_id)
pipe &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>redis&lt;span style="color:#f92672">.&lt;/span>pipeline()
pipe&lt;span style="color:#f92672">.&lt;/span>incrbyfloat(daily_key, actual_cost_usd)
pipe&lt;span style="color:#f92672">.&lt;/span>expire(daily_key, &lt;span style="color:#ae81ff">86400&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#75715e"># 2-day TTL&lt;/span>
pipe&lt;span style="color:#f92672">.&lt;/span>incrbyfloat(month_key, actual_cost_usd)
pipe&lt;span style="color:#f92672">.&lt;/span>expire(month_key, &lt;span style="color:#ae81ff">86400&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">35&lt;/span>) &lt;span style="color:#75715e"># 35-day TTL&lt;/span>
pipe&lt;span style="color:#f92672">.&lt;/span>execute()
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="132-bng-tier-pricing-so-snh">13.2. Tier Pricing Comparison&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Criterion&lt;/th>
&lt;th>OpenAI GPT-4o&lt;/th>
&lt;th>Anthropic Claude 3.5&lt;/th>
&lt;th>Azure OpenAI&lt;/th>
&lt;th>Ollama Self-hosted&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>Input price&lt;/strong>&lt;/td>
&lt;td>$5/1M tokens&lt;/td>
&lt;td>$3/1M tokens&lt;/td>
&lt;td>$5/1M tokens&lt;/td>
&lt;td>~$0.15/1M (GPU cost)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Output price&lt;/strong>&lt;/td>
&lt;td>$15/1M tokens&lt;/td>
&lt;td>$15/1M tokens&lt;/td>
&lt;td>$15/1M tokens&lt;/td>
&lt;td>~$0.15/1M (GPU cost)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Data privacy&lt;/strong>&lt;/td>
&lt;td>OpenAI servers&lt;/td>
&lt;td>Anthropic servers&lt;/td>
&lt;td>Azure VNet&lt;/td>
&lt;td>Fully on-premise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Compliance&lt;/strong>&lt;/td>
&lt;td>SOC2, GDPR (opt-out)&lt;/td>
&lt;td>SOC2, HIPAA add-on&lt;/td>
&lt;td>HIPAA, FedRAMP&lt;/td>
&lt;td>Self-managed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Rate limits&lt;/strong>&lt;/td>
&lt;td>10K RPM&lt;/td>
&lt;td>5K RPM&lt;/td>
&lt;td>Custom&lt;/td>
&lt;td>No provider limit (hardware-bound)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>SLA uptime&lt;/strong>&lt;/td>
&lt;td>99.9%&lt;/td>
&lt;td>99.9%&lt;/td>
&lt;td>99.9%&lt;/td>
&lt;td>Self-managed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Setup complexity&lt;/strong>&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>High (GPU infra)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Upfront cost&lt;/strong>&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>Azure subscription&lt;/td>
&lt;td>GPU server ~$2,000+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cost at 1M requests/day&lt;/strong>&lt;/td>
&lt;td>~$3,500/day&lt;/td>
&lt;td>~$2,100/day&lt;/td>
&lt;td>~$3,500/day&lt;/td>
&lt;td>~$50/day (amortized)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Best for&lt;/strong>&lt;/td>
&lt;td>General, startup&lt;/td>
&lt;td>High quality&lt;/td>
&lt;td>Enterprise&lt;/td>
&lt;td>Healthcare, Banking&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
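&lt;p>The daily cost row depends on how many tokens each request uses, which the table does not state. The GPT-4o and Claude 3.5 figures are reproduced by roughly 700 input tokens per request with output tokens left out; that profile is inferred here, not given by the article. A sketch for recomputing the row with your own token profile (function and parameter names are illustrative):&lt;/p>

```python
def daily_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Daily spend at list prices for a fixed per-request token profile."""
    price_weighted_tokens = (
        input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    )
    return requests_per_day * price_weighted_tokens / 1_000_000


# Assumed profile: 700 input tokens per request, output tokens not counted.
gpt4o = daily_cost_usd(1_000_000, 700, 0, 5.0, 15.0)
claude = daily_cost_usd(1_000_000, 700, 0, 3.0, 15.0)

# Same traffic with a hypothetical 200 output tokens per request.
gpt4o_with_output = daily_cost_usd(1_000_000, 700, 200, 5.0, 15.0)

print(gpt4o, claude, gpt4o_with_output)
```

&lt;p>Adding even 200 output tokens per request nearly doubles the GPT-4o figure, so the row is best read as a lower bound.&lt;/p>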
&lt;hr>
&lt;h2 id="14-incident-response-cho-ai-agent">14. Incident Response Cho AI Agent&lt;/h2>
&lt;h3 id="141-runbook--khi-hallucination-rate-tng">14.1. Runbook: When the Hallucination Rate Rises&lt;/h3>
&lt;pre>&lt;code>INCIDENT: Hallucination Rate &amp;gt; 10%
════════════════════════════════════
T+0min: Alert received in Slack #llmops-critical
T+2min: On-call engineer acknowledges the alert
INVESTIGATION STEPS:
1. Grafana → Quality Dashboard → Hallucination Timeline
- Determine: when did it start? All agents or one specific agent?
- Review the sessions with the highest hallucination_score
2. Elasticsearch query:
GET ai-agent-logs-*/_search
{ &amp;quot;query&amp;quot;: { &amp;quot;range&amp;quot;: { &amp;quot;hallucination_probability&amp;quot;: { &amp;quot;gte&amp;quot;: 0.3 } } },
&amp;quot;sort&amp;quot;: [{&amp;quot;@timestamp&amp;quot;: &amp;quot;desc&amp;quot;}], &amp;quot;size&amp;quot;: 20 }
3. Check: was there a recent prompt version change?
git log --oneline prompts/ | head -20
4. Check: did the model provider update the model?
- OpenAI model version log
- Pinned model version trong config
MITIGATION:
- Nếu do prompt change → rollback prompt version ngay
- Nếu do model update → pin model version cụ thể (gpt-4o-2024-11-20)
- Nếu nguyên nhân chưa rõ → kích hoạt HITL mode (escalate tất cả uncertain responses)
- Notify stakeholders qua #llmops-incidents
RESOLUTION CRITERIA:
- Hallucination rate &amp;lt; 5% sustained 15 minutes
POST-INCIDENT:
- Post-mortem trong 48h
- Update runbook nếu cần
&lt;/code>&lt;/pre>&lt;h3 id="142-runbook--khi-cost-spike">14.2. Runbook — Khi Cost Spike&lt;/h3>
&lt;pre>&lt;code>INCIDENT: Daily LLM Cost &amp;gt; 150% Baseline
══════════════════════════════════════════
T+0min: Cost spike alert
T+2min: Acknowledge and start investigating
INVESTIGATION:
1. Prometheus query: which tenant is driving the most cost?
sum by(tenant_id) (rate(llm_cost_usd_total[1h])) * 3600
2. Elasticsearch: which sessions have abnormally high cost?
(Query 2 from Section 7.4)
3. Check: abnormal token counts?
- Input tokens &amp;gt; 5,000 per request → likely context stuffing
- Output tokens &amp;gt; 2,000 → likely verbose prompt
4. Check: retry loop?
sum by(agent_id) (rate(llm_errors_total{error_type=&amp;quot;RateLimitError&amp;quot;}[10m]))
MITIGATION (in order):
1. Disable the offending tenant if the activity looks suspicious
2. Enable the hard token quota limit immediately
3. Temporarily lower max_tokens in the model config
4. Scale down replicas in case of a request flood
POST-INCIDENT: Review token quota per tenant, update budget config
&lt;/code>&lt;/pre>&lt;h3 id="143-post-mortem-template">14.3. Post-Mortem Template&lt;/h3>
&lt;div class="highlight">&lt;pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-markdown" data-lang="markdown"># Post-Mortem: [Incident Name]
&lt;span style="font-weight:bold">**Date&lt;/span>&lt;span style="font-weight:bold">**&lt;/span>: YYYY-MM-DD
&lt;span style="font-weight:bold">**Severity&lt;/span>&lt;span style="font-weight:bold">**&lt;/span>: Critical / High / Medium
&lt;span style="font-weight:bold">**Duration&lt;/span>&lt;span style="font-weight:bold">**&lt;/span>: X hours Y minutes
&lt;span style="font-weight:bold">**MTTR&lt;/span>&lt;span style="font-weight:bold">**&lt;/span>: X hours Y minutes
&lt;span style="color:#75715e">## Impact
&lt;/span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">-&lt;/span> Users affected: XXX
&lt;span style="color:#66d9ef">-&lt;/span> Revenue affected: $XXX
&lt;span style="color:#66d9ef">-&lt;/span> Additional cost incurred: $XXX
&lt;span style="color:#75715e">## Timeline
&lt;/span>&lt;span style="color:#75715e">&lt;/span>| Time | Event |
|-----------|---------|
| HH:MM | Alert triggered |
| HH:MM | On-call engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |
&lt;span style="color:#75715e">## Root Cause
&lt;/span>&lt;span style="color:#75715e">&lt;/span>[Describe the root cause]
&lt;span style="color:#75715e">## Contributing Factors
&lt;/span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">1.&lt;/span> [Factor 1]
&lt;span style="color:#66d9ef">2.&lt;/span> [Factor 2]
&lt;span style="color:#75715e">## What Went Well
&lt;/span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">-&lt;/span> [...]
&lt;span style="color:#75715e">## What Could Be Improved
&lt;/span>&lt;span style="color:#75715e">&lt;/span>&lt;span style="color:#66d9ef">-&lt;/span> [...]
&lt;span style="color:#75715e">## Action Items
&lt;/span>&lt;span style="color:#75715e">&lt;/span>| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| [...] | [...] | [...] | High |
&lt;span style="color:#75715e">## Lessons Learned
&lt;/span>&lt;span style="color:#75715e">&lt;/span>[...]
&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="144-mttr-targets-cho-ai-incidents">14.4. MTTR Targets for AI Incidents&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Severity&lt;/th>
&lt;th>Example&lt;/th>
&lt;th>Response Time&lt;/th>
&lt;th>MTTR Target&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>P0 - Critical&lt;/strong>&lt;/td>
&lt;td>Cost spike $1K+, mass data leak&lt;/td>
&lt;td>5 minutes&lt;/td>
&lt;td>30 minutes&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>P1 - High&lt;/strong>&lt;/td>
&lt;td>Error rate &amp;gt; 10%, hallucination surge&lt;/td>
&lt;td>15 minutes&lt;/td>
&lt;td>2 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>P2 - Medium&lt;/strong>&lt;/td>
&lt;td>Latency degradation, quality drop&lt;/td>
&lt;td>1 hour&lt;/td>
&lt;td>8 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>P3 - Low&lt;/strong>&lt;/td>
&lt;td>Logging gap, minor metric anomaly&lt;/td>
&lt;td>Next business day&lt;/td>
&lt;td>3 days&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
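&lt;p>The targets in the table above can be checked mechanically against an incident timeline. The following is a minimal sketch, not part of the original tooling: &lt;code>SEVERITY_TARGETS&lt;/code> and &lt;code>check_incident&lt;/code> are hypothetical names, and the P3 response time ("next business day") is approximated as 24 hours.&lt;/p>

```python
from datetime import datetime, timedelta

# Targets copied from the MTTR table in Section 14.4.
# Assumption: the P3 response time ("next business day") is treated as 24 hours.
SEVERITY_TARGETS = {
    "P0": {"response": timedelta(minutes=5), "mttr": timedelta(minutes=30)},
    "P1": {"response": timedelta(minutes=15), "mttr": timedelta(hours=2)},
    "P2": {"response": timedelta(hours=1), "mttr": timedelta(hours=8)},
    "P3": {"response": timedelta(hours=24), "mttr": timedelta(days=3)},
}


def check_incident(severity, alerted_at, acknowledged_at, resolved_at):
    """Compare one incident's response time and time-to-resolve with its targets."""
    targets = SEVERITY_TARGETS[severity]
    response_time = acknowledged_at - alerted_at
    time_to_resolve = resolved_at - alerted_at
    return {
        "response_time": response_time,
        "time_to_resolve": time_to_resolve,
        "response_within_target": targets["response"] >= response_time,
        "mttr_within_target": targets["mttr"] >= time_to_resolve,
    }


if __name__ == "__main__":
    alerted = datetime(2026, 5, 14, 9, 0)
    report = check_incident(
        "P1",
        alerted_at=alerted,
        acknowledged_at=alerted + timedelta(minutes=2),
        resolved_at=alerted + timedelta(minutes=95),
    )
    print(report)
```

&lt;p>The timestamps map directly onto the Timeline rows of the post-mortem template (alert triggered, acknowledged, resolved), so the same function could be run over past post-mortems to see how often each target was met.&lt;/p>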
&lt;hr>
&lt;h2 id="15-production-readiness-checklist--3-cp-">15. Production Readiness Checklist: 3 Levels&lt;/h2>
&lt;h3 id="-cp-mvp-ti-thiu--go-live">🥉 MVP Level (Minimum for Go-Live)&lt;/h3>
&lt;p>&lt;strong>Basic monitoring (10 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Prometheus endpoint &lt;code>/metrics&lt;/code> is exposed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">LLM latency (p50, p95) is tracked&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Error rate is tracked per agent_id&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Token count (input + output) is recorded&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Daily cost tracking&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Basic Grafana dashboard with latency + errors&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Alert for error rate &amp;gt; 10%&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Alert for cost spike &amp;gt; 200% baseline&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Structured JSON logging (request_id, session_id, latency, tokens)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Logs are shipped to Elasticsearch / Loki&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Basic reliability (8 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Timeout configured (max 30s per LLM call)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Retry with exponential backoff (max 3 retries)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Rate limit handling (429 error → retry-after)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Circuit breaker configured for the LLM provider&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Graceful degradation when the LLM is unavailable&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Health check endpoint &lt;code>/health&lt;/code> reports LLM connectivity status&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Token limit guard (max_tokens configured)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Context length check before calling the LLM&lt;/li>
&lt;/ul>
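&lt;p>Two of the reliability items above (retry with exponential backoff, and honoring retry-after on a 429) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: &lt;code>RateLimitError&lt;/code> and &lt;code>call_with_retry&lt;/code> are hypothetical names, and a real provider SDK raises its own exception types, which should be caught instead of the broad &lt;code>Exception&lt;/code> used here.&lt;/p>

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error type; real SDKs expose their own 429 exception."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the provider, if any


def call_with_retry(call, max_retries=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception as exc:  # a sketch: narrow this to retryable errors in practice
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            retry_after = getattr(exc, "retry_after", None)
            if retry_after is not None:
                # A provider-supplied retry-after wins over a shorter computed backoff.
                delay = max(delay, retry_after)
            # Up to 10% jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
            attempt += 1
```

&lt;p>With the defaults, the waits grow as roughly 1s, 2s, 4s before the fourth and final attempt. The per-call timeout from the checklist (30s) would be passed to the LLM client inside &lt;code>call&lt;/code>; it is not handled by this wrapper.&lt;/p>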
&lt;h3 id="-cp-production-y--cho-enterprise">🥈 Production Level (Complete for Enterprise)&lt;/h3>
&lt;p>&lt;strong>Advanced observability (12 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">OpenTelemetry SDK fully integrated (traces + metrics + logs)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Distributed tracing with context propagation across all agents&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">TTFT (Time To First Token) tracking for streaming responses&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Per-tenant cost breakdown dashboard&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Hallucination rate monitoring (sampled evaluation pipeline)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Guardrail decision logging with reason codes&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Tool call latency histogram per tool&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Memory/context usage tracking&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Session timeline reconstruction from traces&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Kibana/Grafana Explore for ad-hoc investigation&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Automated daily cost report → email/Slack&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">ILM policy for log retention (hot/warm/cold/delete)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Complete alerting (8 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">All 8 alert rules from Section 9 are configured&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Alert routing by team/severity&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">PagerDuty / on-call rotation integrated&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Runbook link in every alert annotation&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Alert fatigue review (tune thresholds after 2 weeks)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Dead man's switch (alert if metrics stop flowing)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Cost budget alerts per tenant&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">SLA breach prediction alert (leading indicator)&lt;/li>
&lt;/ul>
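&lt;p>The dead man's switch item in the list above guards against the worst kind of false calm: a dashboard that is green only because nothing is being reported. In a Prometheus setup this is usually expressed as an alert rule (for example with the &lt;code>absent()&lt;/code> function); the idea itself can be sketched in application code as follows. &lt;code>HeartbeatMonitor&lt;/code> and the 300-second window are hypothetical choices for illustration.&lt;/p>

```python
import time


class HeartbeatMonitor:
    """Tracks when the last metric sample arrived and flags prolonged silence."""

    def __init__(self, max_silence_seconds=300, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_seen = clock()

    def record_sample(self):
        """Call this whenever a metric sample is received."""
        self._last_seen = self._clock()

    def is_silent(self):
        """True when no sample has arrived within the allowed window."""
        return self._clock() - self._last_seen > self.max_silence_seconds
```

&lt;p>A scheduler would poll &lt;code>is_silent()&lt;/code> and page the on-call engineer when it returns true; the check has to run outside the pipeline it watches, otherwise it goes silent together with the metrics.&lt;/p>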
&lt;p>&lt;strong>Production reliability (10 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Multi-region LLM provider failover&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Budget guard middleware for every tenant&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Token quota enforcement per tenant per day&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Adaptive sampling for traces (not 100%)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">A/B testing framework ready&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Model version pinned (do not use &amp;ldquo;latest&amp;rdquo;)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Prompt versioning with git + experiment tracking&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Shadow mode testing for model upgrades&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Load testing with realistic token distribution&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Chaos engineering: LLM provider outage drill&lt;/li>
&lt;/ul>
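&lt;p>Token quota enforcement per tenant per day, listed above, comes down to a counter keyed by tenant and date that is checked before each LLM call. A minimal in-memory sketch, assuming a single process: &lt;code>DailyTokenQuota&lt;/code> is a hypothetical name, and a multi-replica deployment would need a shared store with atomic increments instead of a local dictionary.&lt;/p>

```python
from collections import defaultdict
from datetime import date


class DailyTokenQuota:
    """In-memory per-tenant daily token quota (single-process sketch)."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self._used = defaultdict(int)  # (tenant_id, day) -> tokens consumed

    def try_consume(self, tenant_id, tokens, today=None):
        """Reserve tokens for a tenant; return False if the hard limit would be exceeded."""
        key = (tenant_id, today or date.today())
        if self._used[key] + tokens > self.daily_limit:
            return False
        self._used[key] += tokens
        return True

    def remaining(self, tenant_id, today=None):
        """Tokens the tenant can still spend on the given day."""
        key = (tenant_id, today or date.today())
        return self.daily_limit - self._used[key]
```

&lt;p>Because the key includes the date, the quota resets on its own at midnight without a cleanup job; old keys would still need to be expired eventually to bound memory.&lt;/p>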
&lt;h3 id="-cp-enterprise-y--nht">🥇 Enterprise Level (Most Complete)&lt;/h3>
&lt;p>&lt;strong>Advanced LLMOps (12 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Full MLflow / LangSmith experiment tracking integration&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Automated evaluation pipeline running hourly on sampled traffic&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Model drift detection with statistical tests (KS test, Chi-square)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Prompt regression test suite running on every deployment&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Multi-model cost optimization engine (auto-route based on task)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">LLM request caching (semantic cache with Redis + vector similarity)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Streaming token profiling (generation speed, jitter)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Custom SLOs: error budget tracking, burn rate alerts&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Capacity planning dashboard (projected cost 30/60/90 days)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Fine-tuning pipeline with an evaluation gate before deploy&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Cross-tenant benchmarking (anonymized)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Regulatory audit trail with PDF/Excel report export on demand&lt;/li>
&lt;/ul>
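&lt;p>The "error budget tracking, burn rate alerts" item above rests on one ratio: the observed error rate divided by the error budget (1 minus the SLO target). A burn rate of 1 spends the budget exactly over the SLO window; higher values exhaust it proportionally sooner. A small sketch of that arithmetic, with hypothetical function names and a 30-day window assumed:&lt;/p>

```python
def burn_rate(failed, total, slo_target):
    """Burn rate = observed error rate / error budget, where budget = 1 - SLO target."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def days_until_budget_exhausted(rate, window_days=30):
    """At a constant burn rate, the window's whole budget lasts window_days / rate."""
    if rate == 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # 20 failed requests out of 1,000 against a 99.5% success SLO.
    rate = burn_rate(failed=20, total=1000, slo_target=0.995)
    print(round(rate, 2), round(days_until_budget_exhausted(rate), 1))
```

&lt;p>In the example, a 2% error rate against a 0.5% budget is a burn rate of about 4, so a 30-day budget would be gone in roughly 7.5 days. Alerting on the burn rate rather than on the raw error rate makes the alert threshold independent of which SLO tier a service is on.&lt;/p>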
&lt;p>&lt;strong>Security &amp;amp; Compliance (10 items):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Every LLM interaction is logged with an immutable audit trail&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">PII detection and masking in the log pipeline&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Data residency enforcement (EU data → EU LLM endpoint)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Penetration test for prompt injection vectors&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">GDPR Article 22 compliance (explain AI decision)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">SOC 2 Type II evidence collection automated&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Monthly third-party security review of LLM configs&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Incident response drill quarterly&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Vendor lock-in mitigation plan (multi-provider routing)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Contractual SLA with LLM providers documented&lt;/li>
&lt;/ul>
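&lt;p>PII masking in the log pipeline, listed above, can start as a scrubbing step applied to each structured log record before it is shipped. The sketch below is illustrative only: the two regular expressions and the names &lt;code>mask_text&lt;/code> and &lt;code>mask_record&lt;/code> are assumptions, and they cover only email addresses and phone-like digit runs.&lt;/p>

```python
import re

# Illustrative patterns only. Real PII detection needs broader coverage
# (names, addresses, national IDs) and usually a dedicated library or service.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_text(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def mask_record(record):
    """Return a copy of a structured log record with every string value masked."""
    if isinstance(record, dict):
        return {key: mask_record(value) for key, value in record.items()}
    if isinstance(record, list):
        return [mask_record(item) for item in record]
    if isinstance(record, str):
        return mask_text(record)
    return record  # numbers, booleans, None pass through unchanged


if __name__ == "__main__":
    entry = {
        "request_id": "r-1",
        "prompt": "Mail jane.doe@example.com or call +84 912 345 678",
        "tokens": 42,
    }
    print(mask_record(entry))
```

&lt;p>The phone pattern is deliberately loose and will also match other long digit runs, including dates stored as strings, so in practice the scrubber would be applied to free-text fields such as prompts and responses rather than to every field.&lt;/p>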
&lt;hr>
&lt;h2 id="16-kpi-vn-hnh-chi-ph-platform-roi-analysis">16. Operational KPIs, Platform Cost, ROI Analysis&lt;/h2>
&lt;h3 id="161-kpi-vn-hnh-theo-thng">16.1. Monthly Operational KPIs&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>KPI&lt;/th>
&lt;th>MVP Target&lt;/th>
&lt;th>Production Target&lt;/th>
&lt;th>Enterprise Target&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>System Uptime&lt;/strong>&lt;/td>
&lt;td>99.0%&lt;/td>
&lt;td>99.5%&lt;/td>
&lt;td>99.9%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Avg Response Latency&lt;/strong>&lt;/td>
&lt;td>&amp;lt; 5s&lt;/td>
&lt;td>&amp;lt; 3s&lt;/td>
&lt;td>&amp;lt; 2s&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Error Rate&lt;/strong>&lt;/td>
&lt;td>&amp;lt; 5%&lt;/td>
&lt;td>&amp;lt; 1%&lt;/td>
&lt;td>&amp;lt; 0.5%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Hallucination Rate&lt;/strong>&lt;/td>
&lt;td>&amp;lt; 10%&lt;/td>
&lt;td>&amp;lt; 3%&lt;/td>
&lt;td>&amp;lt; 1%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Task Completion Rate&lt;/strong>&lt;/td>
&lt;td>&amp;gt; 80%&lt;/td>
&lt;td>&amp;gt; 90%&lt;/td>
&lt;td>&amp;gt; 95%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cost per Query&lt;/strong>&lt;/td>
&lt;td>&amp;lt; $0.20&lt;/td>
&lt;td>&amp;lt; $0.08&lt;/td>
&lt;td>&amp;lt; $0.03&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>MTTR (P1 incident)&lt;/strong>&lt;/td>
&lt;td>4h&lt;/td>
&lt;td>2h&lt;/td>
&lt;td>30min&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>User Satisfaction&lt;/strong>&lt;/td>
&lt;td>&amp;gt; 70%&lt;/td>
&lt;td>&amp;gt; 80%&lt;/td>
&lt;td>&amp;gt; 90%&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
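&lt;p>The System Uptime targets in the table above are easier to reason about as a downtime budget. A short worked calculation, assuming a 30-day month (&lt;code>allowed_downtime_minutes&lt;/code> is a hypothetical helper name):&lt;/p>

```python
def allowed_downtime_minutes(uptime_target_percent, days=30):
    """Minutes of downtime allowed in the window for a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_target_percent) / 100.0


if __name__ == "__main__":
    for tier, target in [("MVP", 99.0), ("Production", 99.5), ("Enterprise", 99.9)]:
        minutes = allowed_downtime_minutes(target)
        print(f"{tier}: {minutes:.1f} minutes per 30 days")
```

&lt;p>A 30-day month has 43,200 minutes, so the three tiers allow about 7.2 hours, 3.6 hours, and 43 minutes of downtime respectively. The Enterprise budget is smaller than the 2-hour P1 MTTR target in Section 14.4, which is one reason the Enterprise MTTR target is 30 minutes.&lt;/p>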
&lt;h3 id="162-chi-ph-platform-observability-stack">16.2. Observability Stack Platform Cost&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Component&lt;/th>
&lt;th>Self-hosted / Free tier&lt;/th>
&lt;th>SaaS / Managed&lt;/th>
&lt;th>Notes&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>OpenTelemetry Collector&lt;/strong>&lt;/td>
&lt;td>$0 (self-hosted)&lt;/td>
&lt;td>$0 (open source)&lt;/td>
&lt;td>K8s deployment&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Prometheus&lt;/strong>&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>Add Thanos for HA&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Grafana&lt;/strong>&lt;/td>
&lt;td>$0 (OSS)&lt;/td>
&lt;td>$29-299/mo&lt;/td>
&lt;td>OSS is sufficient&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Jaeger/Tempo&lt;/strong>&lt;/td>
&lt;td>$0 + S3 storage&lt;/td>
&lt;td>$50-200/mo&lt;/td>
&lt;td>Tempo is cheaper than Jaeger&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Elasticsearch&lt;/strong>&lt;/td>
&lt;td>$200-500/mo (3 nodes)&lt;/td>
&lt;td>$95-500/mo (ES Cloud)&lt;/td>
&lt;td>ES Cloud for a managed option&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Alertmanager&lt;/strong>&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>Bundled with Prometheus&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Pyroscope&lt;/strong>&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>$0&lt;/td>
&lt;td>Now Grafana Pyroscope (merged with Phlare)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Total (Self-hosted)&lt;/strong>&lt;/td>
&lt;td>&lt;strong>~$200-500/mo&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>10K req/day&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Total (Full SaaS)&lt;/strong>&lt;/td>
&lt;td>&lt;/td>
&lt;td>&lt;strong>~$500-1,200/mo&lt;/strong>&lt;/td>
&lt;td>Managed, low ops effort&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="163-roi-analysis">16.3. ROI Analysis&lt;/h3>
&lt;pre>&lt;code>Scenario: 10,000 LLM queries/day, team of 5
BEFORE (without LLMOps):
- Incident detection lag: 4-6 hours
- Each incident: 3-4 hours of engineer debugging time = ~$300 loss/incident
- 2 incidents/month = $600/month waste
- Overspending from untracked cost: ~$400/month (estimated 20% waste)
- Hallucination → user churn: 5% users/month = $2,000 MRR loss
Total monthly loss without LLMOps: ~$3,000
AFTER (with the full LLMOps stack):
- Platform cost: $500/month
- Incident MTTR drops from 4h → 30min (P1): save $250/incident × 2 = $500/month
- Cost optimization (routing + quota): save 15-25% = ~$300-500/month
- Hallucination detection → user churn reduced by 60%: save $1,200/month
Total monthly saving: ~$2,000 - $2,200
ROI = (2,000 - 500) / 500 × 100 = 300% (using the low end of the savings range)
Payback period: &amp;lt; 1 month
&lt;/code>&lt;/pre>&lt;hr>
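&lt;p>The ROI arithmetic in Section 16.3 can be re-derived from its own line items, which makes it easy to substitute a team's real numbers. This is a sketch with hypothetical variable names; every figure is taken from the scenario, in USD per month.&lt;/p>

```python
def roi_percent(monthly_saving, monthly_platform_cost):
    """ROI = (saving - cost) / cost * 100."""
    return (monthly_saving - monthly_platform_cost) / monthly_platform_cost * 100


# Figures from the Section 16.3 scenario (USD per month).
platform_cost = 500
mttr_saving = 250 * 2            # $250 saved per incident, 2 incidents per month
cost_optimization = (300, 500)   # low and high end of the 15-25% saving
churn_saving = 1200              # 60% of the $2,000 MRR loss avoided

saving_low = mttr_saving + cost_optimization[0] + churn_saving
saving_high = mttr_saving + cost_optimization[1] + churn_saving

if __name__ == "__main__":
    print("monthly saving range:", saving_low, "to", saving_high)
    print("ROI range (%):", roi_percent(saving_low, platform_cost),
          "to", roi_percent(saving_high, platform_cost))
```

&lt;p>Summing the three saving lines gives $2,000 to $2,200 per month, so the ROI lands between 300% and 340% for a $500 platform cost. The churn figure dominates the result, so it is the assumption most worth validating with real retention data.&lt;/p>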
&lt;h2 id="17-ma-trn-ri-ro-vn-hnh">17. Operational Risk Matrix&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>#&lt;/th>
&lt;th>Risk&lt;/th>
&lt;th>Likelihood&lt;/th>
&lt;th>Impact&lt;/th>
&lt;th>Severity&lt;/th>
&lt;th>Mitigation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1&lt;/td>
&lt;td>&lt;strong>LLM provider outage&lt;/strong> (OpenAI, Anthropic)&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>High&lt;/td>
&lt;td>🔴 High&lt;/td>
&lt;td>Multi-provider failover; local Ollama fallback&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2&lt;/td>
&lt;td>&lt;strong>Cost runaway&lt;/strong> (prompt loop, token exploit)&lt;/td>
&lt;td>Low-Medium&lt;/td>
&lt;td>Very high&lt;/td>
&lt;td>🔴 High&lt;/td>
&lt;td>Budget guard; hard token quota; real-time cost alert&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3&lt;/td>
&lt;td>&lt;strong>Silent model degradation&lt;/strong> (provider update model)&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>High&lt;/td>
&lt;td>🔴 High&lt;/td>
&lt;td>Pin model version; automated regression eval weekly&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>4&lt;/td>
&lt;td>&lt;strong>Log/trace data explosion&lt;/strong> (misconfigured sampler)&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>🟠 Medium&lt;/td>
&lt;td>Adaptive sampling; storage quota alert&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>5&lt;/td>
&lt;td>&lt;strong>Alert fatigue&lt;/strong> (too many false positives)&lt;/td>
&lt;td>High&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>🟠 Medium&lt;/td>
&lt;td>Tune thresholds after 2 weeks; alert review cadence&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>6&lt;/td>
&lt;td>&lt;strong>PII leak via logs&lt;/strong> (unmasked user data in structured logs)&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>Very high&lt;/td>
&lt;td>🔴 High&lt;/td>
&lt;td>Log scrubber middleware; PII regex masking pipeline&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>7&lt;/td>
&lt;td>&lt;strong>Dashboard blindspot&lt;/strong> (metric not instrumented)&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>Medium&lt;/td>
&lt;td>🟠 Medium&lt;/td>
&lt;td>Coverage checklist; quarterly observability audit&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>8&lt;/td>
&lt;td>&lt;strong>Observer effect&lt;/strong> (OTel overhead degrades performance)&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>Low&lt;/td>
&lt;td>🟡 Low&lt;/td>
&lt;td>Benchmark OTel overhead (&amp;lt;1ms target); async exporters&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="18-roadmap-trin-khai-llmops--3-giai-on">18. LLMOps Rollout Roadmap: 3 Phases&lt;/h2>
&lt;h3 id="-giai-on-1--foundation-tun-1-2">🚀 Phase 1: Foundation (Weeks 1-2)&lt;/h3>
&lt;p>&lt;strong>Week 1:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Deploy OTel Collector + Prometheus + Grafana to K8s (Helm charts)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Integrate the OTel SDK into all agent services&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Set up a basic Grafana dashboard (latency, errors, cost)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Configure basic alerts (error rate, cost spike)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Ship structured logs to Elasticsearch&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Week 2:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Distributed tracing end-to-end (orchestrator → sub-agents)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Token + cost tracking per agent per tenant&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Budget guard middleware deployed&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">ILM policy for Elasticsearch&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">On-call rotation set up, runbooks written&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Deliverable&lt;/strong>: The system can detect a P1 incident in &amp;lt; 5 minutes&lt;/p>
&lt;hr>
&lt;h3 id="-giai-on-2--quality--cost-tun-3-6">⚙️ Phase 2: Quality &amp;amp; Cost (Weeks 3-6)&lt;/h3>
&lt;p>&lt;strong>Weeks 3-4:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Hallucination evaluation pipeline (sampled, async)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Complete guardrail decision logging&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Per-tenant cost dashboard + daily email report&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">A/B testing framework (canary deployment)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Model router based on task complexity&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Weeks 5-6:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Adaptive sampling replaces 100% sampling&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Semantic cache (Redis + vector similarity)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Formal post-mortem process&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Alert tuning (reduce false positives)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Quality SLO dashboard (error budget, burn rate)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Deliverable&lt;/strong>: Cost reduced by 20-30%; hallucination rate visible and monitored&lt;/p>
&lt;hr>
&lt;h3 id="-giai-on-3--enterprise-grade-tun-7-12">🏆 Phase 3: Enterprise Grade (Weeks 7-12)&lt;/h3>
&lt;p>&lt;strong>Weeks 7-9:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Full MLflow / LangSmith integration&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Model drift detection automated&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Prompt regression test suite CI/CD&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Multi-provider failover (OpenAI → Azure OpenAI → Anthropic)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Capacity planning dashboard&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Weeks 10-12:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;input disabled="" type="checkbox">Compliance audit trail (immutable, exportable)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">PII masking in the log pipeline&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Chaos engineering drill (LLM outage simulation)&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Security penetration test for LLM attack vectors&lt;/li>
&lt;li>&lt;input disabled="" type="checkbox">Documentation, runbooks, and training for the team&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Deliverable&lt;/strong>: Full LLMOps maturity: incident MTTR &amp;lt; 30min, cost optimized, compliance-ready&lt;/p>
&lt;hr>
&lt;h2 id="19-kt-lun">19. Conclusion&lt;/h2>
&lt;p>In this article we built a complete &lt;strong>Monitoring &amp;amp; Observability&lt;/strong> system for AI Agents in production, from theory to working code:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Component built&lt;/th>
&lt;th>Value&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>LLMOps vs DevOps&lt;/strong>: a 10-dimension comparison&lt;/td>
&lt;td>Makes clear why AI agents need dedicated observability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>OTel 4-pillar architecture&lt;/strong>&lt;/td>
&lt;td>A complete framework for any scale&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>25+ metrics&lt;/strong> in 5 groups&lt;/td>
&lt;td>Know exactly what to measure&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>OTel instrumentation&lt;/strong> (Python)&lt;/td>
&lt;td>Can be implemented today&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>LangChain callback handler&lt;/strong>&lt;/td>
&lt;td>Automatic tracing for LangChain agents&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Structured log schema&lt;/strong> + ES mapping&lt;/td>
&lt;td>Standardized, searchable, auditable logs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Grafana JSON config&lt;/strong>&lt;/td>
&lt;td>Production-ready dashboard&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>8 alert rules&lt;/strong> + Prometheus YAML&lt;/td>
&lt;td>Full alert coverage&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>A/B testing framework&lt;/strong>&lt;/td>
&lt;td>Data-driven prompt and model improvement&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Model router&lt;/strong> (Python)&lt;/td>
&lt;td>Automatic cost optimization&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Adaptive sampler&lt;/strong>&lt;/td>
&lt;td>Cuts trace storage cost by 84%&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Budget guard&lt;/strong>&lt;/td>
&lt;td>Prevents cost runaway&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Incident runbooks&lt;/strong>&lt;/td>
&lt;td>Fast response, low MTTR&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>70-item checklist&lt;/strong> across 3 levels&lt;/td>
&lt;td>A clear roadmap from MVP to Enterprise&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>ROI 300%&lt;/strong>&lt;/td>
&lt;td>Justifies the investment to stakeholders&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="nguyn-tc-vng-cho-llmops">The Golden Rule of LLMOps&lt;/h3>
&lt;blockquote>
&lt;p>&amp;ldquo;You cannot manage what you cannot measure.
In the world of LLMs this is truer than in any other field,
because LLMs &lt;strong>can fail silently&lt;/strong> in ways that no traditional DevOps metric will catch.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="-bi-tip-theo">📌 Next Article&lt;/h2>
&lt;p>&lt;strong>Part 8: Real-World Use Cases for AI Agents in Vietnamese Enterprises&lt;/strong>&lt;/p>
&lt;p>With the full foundation in place, from architecture, memory, and guardrails to monitoring, the next article puts it all into practice with &lt;strong>3 real-world use cases&lt;/strong> from Vietnamese enterprises:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Healthcare&lt;/strong>: an AI Agent that helps doctors look up treatment protocols, integrated with HIS/EMR&lt;/li>
&lt;li>&lt;strong>Banking/Fintech&lt;/strong>: an AI Agent for financial product advice and KYC automation&lt;/li>
&lt;li>&lt;strong>Retail/E-commerce&lt;/strong>: an AI Agent for omnichannel customer care (Zalo, Web, App)&lt;/li>
&lt;/ul>
&lt;p>Each use case covers the detailed architecture, tech stack, cost, rollout timeline, and lessons learned.&lt;/p>
&lt;hr>
&lt;blockquote>
&lt;p>💡 &lt;strong>Practical tip&lt;/strong>: Start &lt;strong>Phase 1&lt;/strong> (weeks 1-2) as soon as your first AI Agent reaches production. Don't wait until you &amp;ldquo;have time&amp;rdquo;: a cost runaway or hallucination incident arrives without warning and forces you to build observability in crisis mode, which is both expensive and stressful. &lt;strong>Ship observability together with the feature&lt;/strong>; that is what a mature LLMOps culture looks like.&lt;/p>
&lt;/blockquote></description></item></channel></rss>