Memory & Context Management: Helping AI Agents Remember and Understand Context

1. Why do AI agents need memory?

In the previous article, we gave the AI agent the ability to act through Tool Use & Function Calling. Yet even when the agent calls the right tool, one fundamental problem still makes the user experience feel disjointed:

“I told the chatbot last week that I'm allergic to latex. Why is it recommending latex products to me today?”

“Every time I open a new chat I have to explain the whole context from scratch. It's exhausting.”

This is the core limitation of a bare LLM: the language model is stateless, so it remembers nothing between API calls on its own. Every request starts from a blank page.

1.1. Context window limits

Model	Context window	Approximate equivalent
GPT-4o-mini	128,000 tokens	~96,000 English words (roughly 200 A4 pages at ~500 words per page)
GPT-4o	128,000 tokens	~96,000 words
Claude 3.5 Sonnet	200,000 tokens	~150,000 words
Gemini 1.5 Pro	1,000,000 tokens	~750,000 words
Llama 3.1 70B	128,000 tokens	~96,000 words

A large context window does not solve the problem:

Cost: input is billed per token, so sending 100,000 tokens with every request makes API cost grow linearly with context length
Latency: a longer context noticeably increases TTFT (time-to-first-token)
Lost-in-the-middle: research shows that LLMs use information at the beginning and end of the context better than information in the middle
Still stateless: closing the browser tab loses everything; there is no notion of "remember this next time"
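
To make the cost point concrete, the sketch below shows how input cost scales with request size, and how the total grows when every turn resends the full history. The price is a made-up placeholder (not any provider's real rate), and the function names are illustrative only.

```python
def estimate_input_cost(tokens_per_request: int,
                        requests: int,
                        usd_per_million_tokens: float) -> float:
    """Input-token cost for `requests` calls that each send `tokens_per_request` tokens."""
    return tokens_per_request * requests * usd_per_million_tokens / 1_000_000


def cumulative_tokens_when_resending(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens if every turn resends the whole history so far.

    Turn k sends k * tokens_per_turn tokens, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


HYPOTHETICAL_PRICE = 1.0  # USD per 1M input tokens; placeholder, not a real price

small = estimate_input_cost(2_000, requests=1_000,
                            usd_per_million_tokens=HYPOTHETICAL_PRICE)
large = estimate_input_cost(100_000, requests=1_000,
                            usd_per_million_tokens=HYPOTHETICAL_PRICE)
total_tokens = cumulative_tokens_when_resending(500, turns=50)

print(f"1,000 requests of 2k tokens:   ${small:.2f}")
print(f"1,000 requests of 100k tokens: ${large:.2f}")
print(f"tokens sent over 50 turns when resending history: {total_tokens}")
```

Per request the cost is linear in the number of tokens sent, but a conversation that keeps resending its whole history accumulates tokens quadratically in the number of turns, which is why trimming and summarizing matter.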

1.2. Stateless vs Stateful Agent

Characteristic	Stateless Agent	Stateful Agent
Remembers the conversation	Only within the session	Across many sessions
Remembers user preferences	❌	✅
Personalization	❌	✅
Token cost	High (history must be resent)	More efficient (only the relevant part is sent)
Implementation complexity	Low	Medium to high
Suitable applications	Simple FAQ	CRM AI, Healthcare AI, personal assistants

1.3. Real-world pain points

E-commerce: the chatbot keeps recommending a product the customer has already declined three times.

Healthcare: patients have to restate their medical history every time they interact with the clinic's AI assistant.

HR Automation: employees have to re-explain a process the AI walked them through two weeks ago.

Conclusion: memory is not a "nice-to-have" feature. It is a prerequisite for an AI agent to deliver lasting value to the business.

2. AI agent memory taxonomy: 4 types

No single type of memory fits every need. An effective memory system combines four types in layers:

┌─────────────────────────────────────────────────────────────────┐
│                    AI AGENT MEMORY TAXONOMY                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  TYPE 1: IN-CONTEXT MEMORY (Working Memory)              │   │
│  │  • Lives inside the LLM's context window                 │   │
│  │  • Current conversation, system prompt, tool results    │   │
│  │  • Speed: very fast (already in the prompt)              │   │
│  │  • Limit: lost when the session or context runs out      │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  TYPE 2: SESSION MEMORY (External Short-term)            │   │
│  │  • Stored outside the LLM, in Redis/Valkey               │   │
│  │  • Full conversation history for one working session     │   │
│  │  • TTL: a few hours to a few days                        │   │
│  │  • Speed: fast (~1–5ms)                                  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  TYPE 3: PERSISTENT MEMORY (External Long-term)          │   │
│  │  • Stored in PostgreSQL / SQL Server                     │   │
│  │  • User profiles, preferences, long-term summaries       │   │
│  │  • TTL: unlimited (or policy-driven)                     │   │
│  │  • Speed: moderate (~5–50ms)                             │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  TYPE 4: SEMANTIC MEMORY (Vector Store)                  │   │
│  │  • Stores embeddings of important memories               │   │
│  │  • Queried by semantic similarity (no key needed)        │   │
│  │  • Combined with a RAG pipeline                          │   │
│  │  • Qdrant / Weaviate / pgvector / Chroma                 │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

         Access speed: Type 1 > 2 > 3 > 4 (typical; consistent with the latencies in section 7.1)
         Storage capacity: Type 3 > 4 > 2 > 1
         Storage cost: Type 1 < 2 < 3 ≈ 4

2.1. When to use which type?

Type	Typical use case	Example
In-Context	Ongoing conversation, immediate tool results	"The order just looked up is ORD-001, out for delivery"
Session	Switching tabs, refreshing the page, WebSocket reconnect	Resuming a conversation after the network drops
Persistent	Personal preferences, purchase history, contract details	"This customer prefers early-morning delivery"
Semantic	Recalling by meaning rather than by chronological order	"At some point the customer mentioned a problem with product X"

3. In-Context Memory: techniques for managing conversation history

In-Context Memory is the simplest memory layer, but it needs the most careful management because it directly affects API cost and answer quality.

3.1. Technique 1: Sliding Window

Keep the N most recent messages and drop older ones:

from collections import deque
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Message:
    role: Literal["user", "assistant", "system", "tool"]
    content: str
    token_count: int = 0

class SlidingWindowMemory:
    """
    Sliding window that keeps the N most recent messages.
    The system prompt is always preserved (it does not count toward the window).
    """
    def __init__(self, max_messages: int = 20, system_prompt: str = ""):
        self.max_messages = max_messages
        self.system_prompt = system_prompt
        self._history: deque[Message] = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self._history.append(Message(role=role, content=content))

    def get_context(self) -> list[dict]:
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(
            {"role": m.role, "content": m.content}
            for m in self._history
        )
        return messages

    def clear(self) -> None:
        self._history.clear()

Pros: simple and easy to implement.
Cons: important information is lost if it appeared early in the conversation.
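
The drawback is easy to reproduce. The following self-contained sketch uses a bare `deque` (the same mechanism as `SlidingWindowMemory`) with a made-up conversation to show an early fact sliding out of the window.

```python
from collections import deque

# Keep only the 4 most recent messages, as a sliding window would
window = deque(maxlen=4)

conversation = [
    ("user", "I'm allergic to latex."),
    ("assistant", "Noted, I will avoid latex products."),
    ("user", "Show me gloves."),
    ("assistant", "Here are some nitrile gloves."),
    ("user", "Any cheaper options?"),
    ("assistant", "Yes, here are two budget options."),
]
for role, content in conversation:
    window.append({"role": role, "content": content})

kept = [m["content"] for m in window]
allergy_still_visible = any("latex" in text for text in kept)

print(kept)
print("allergy fact still in context:", allergy_still_visible)
```

After six messages the two turns about the allergy are gone, so the model can no longer see them. This is the gap the summarization and long-term memory techniques later in the article are meant to close.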

3.2. Technique 2: Token Budget Management

Control the history by token count rather than by number of messages:

import tiktoken

class TokenBudgetMemory:
    """
    Manages history against a token budget.
    When the budget is exceeded, the oldest messages are dropped automatically (the system prompt is kept).
    """
    def __init__(
        self,
        max_tokens: int = 4_000,       # Token budget reserved for history
        model: str = "gpt-4o-mini",
        system_prompt: str = ""
    ):
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self._history: list[Message] = []
        self._encoder = tiktoken.encoding_for_model(model)

    def _count_tokens(self, text: str) -> int:
        return len(self._encoder.encode(text))

    def _total_history_tokens(self) -> int:
        # Reuse the counts cached in add() instead of re-encoding every message
        return sum(m.token_count for m in self._history)

    def add(self, role: str, content: str) -> None:
        new_tokens = self._count_tokens(content)
        # Drop the oldest messages until the new one fits the budget.
        # Note: a single message larger than max_tokens is still appended.
        while (
            self._history
            and self._total_history_tokens() + new_tokens > self.max_tokens
        ):
            self._history.pop(0)  # Drop the oldest message
        self._history.append(Message(role=role, content=content,
                                     token_count=new_tokens))

    def get_context(self) -> list[dict]:
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend({"role": m.role, "content": m.content}
                        for m in self._history)
        return messages

    @property
    def used_tokens(self) -> int:
        return self._total_history_tokens()
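
A related way to apply a token budget is a pure function that keeps the newest messages that fit. The sketch below is a variant of the idea, not the class above: `trim_to_budget` and `rough_token_count` are hypothetical names, and the 4-characters-per-token estimate is only a stand-in so the example runs without a tokenizer.

```python
from typing import Callable


def trim_to_budget(messages: list[dict],
                   max_tokens: int,
                   count_tokens: Callable[[str], int]) -> list[dict]:
    """Return the newest messages whose combined token count fits max_tokens."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break                        # this and all older messages are dropped
        kept.append(msg)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept


def rough_token_count(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


history = [
    {"role": "user", "content": "a" * 40},       # ~10 tokens
    {"role": "assistant", "content": "b" * 80},  # ~20 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens
]
trimmed = trim_to_budget(history, max_tokens=30, count_tokens=rough_token_count)
print([m["content"][0] for m in trimmed])
```

Passing the counter as a parameter makes it straightforward to swap the rough estimate for a real tokenizer such as the `tiktoken` encoder used above.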

3.3. Technique 3: Message summarization when approaching the limit

When the history fills up, summarize the older messages instead of deleting them outright, so that important information is kept in fewer tokens:

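class SummarizingMemory:
    """
    When tokens exceed the threshold, calls the LLM to summarize the older
    half of the history. The summary is kept as a special 'system' message.
    `llm_client` is assumed to expose `async complete(prompt) -> str`.
    """
    SUMMARY_THRESHOLD = 0.80  # Summarize at 80% of the token budget

    def __init__(self, max_tokens: int = 6_000, llm_client=None):
        self.max_tokens = max_tokens
        self._history: list[Message] = []
        self._summary: str = ""
        self._llm = llm_client

    @staticmethod
    def _estimate_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # Rough estimate: ~4 characters per token

    async def add(self, role: str, content: str) -> None:
        self._history.append(Message(role=role, content=content,
                                     token_count=self._estimate_tokens(content)))
        used = sum(m.token_count for m in self._history)
        if used > self.SUMMARY_THRESHOLD * self.max_tokens and len(self._history) > 1:
            await self._summarize_older_half()

    async def _summarize_older_half(self) -> None:
        midpoint = len(self._history) // 2
        to_summarize = self._history[:midpoint]

        conversation_text = "\n".join(
            f"{m.role.upper()}: {m.content}" for m in to_summarize
        )
        prompt = (
            "Summarize the following conversation concisely, keeping "
            "important information such as: order details, "
            "issues the user reported, decisions that were made:\n\n"
            f"{conversation_text}"
        )
        response = await self._llm.complete(prompt)
        # Drop the old messages only after the summary call has succeeded
        self._history = self._history[midpoint:]
        # Keep summaries in chronological order: earlier summary first
        self._summary = (
            (f"[EARLIER SUMMARY]: {self._summary}\n" if self._summary else "")
            + f"[SUMMARY OF PREVIOUS CONVERSATION]: {response}"
        )

    def get_context(self) -> list[dict]:
        messages = []
        if self._summary:
            messages.append({"role": "system", "content": self._summary})
        messages.extend({"role": m.role, "content": m.content}
                        for m in self._history)
        return messages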
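
The summarization step can be tried without a real model. In the sketch below, `FakeLLM` is a hypothetical stand-in that returns a canned string, and `compact` is an illustrative helper (not part of the class above) that performs one "summarize the older half" pass.

```python
import asyncio


class FakeLLM:
    """Hypothetical stand-in for an LLM client; returns a canned summary."""

    def __init__(self) -> None:
        self.calls = 0

    async def complete(self, prompt: str) -> str:
        self.calls += 1
        turns = [line for line in prompt.splitlines()
                 if line.startswith(("USER:", "ASSISTANT:"))]
        return f"{len(turns)} earlier messages summarized"


async def compact(history: list[dict], llm: FakeLLM) -> tuple[str, list[dict]]:
    """Summarize the older half of `history`; return (summary, remaining messages)."""
    midpoint = len(history) // 2
    older, recent = history[:midpoint], history[midpoint:]
    transcript = "\n".join(f"{m['role'].upper()}: {m['content']}" for m in older)
    summary = await llm.complete("Summarize this conversation:\n\n" + transcript)
    return summary, recent


history = [
    {"role": "user", "content": "Where is order ORD-001?"},
    {"role": "assistant", "content": "It is out for delivery."},
    {"role": "user", "content": "Can I change the address?"},
    {"role": "assistant", "content": "Yes, until the courier departs."},
]
llm = FakeLLM()
summary, recent = asyncio.run(compact(history, llm))

# The summary replaces the older half as a single system message
context = [{"role": "system", "content": f"[SUMMARY]: {summary}"}] + recent
print(context[0]["content"])
```

With a real model, the quality of the summary prompt decides which facts survive, so it is worth listing the fields that must be preserved (order IDs, reported issues, decisions) as the prompt in the class above does.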

4. Session Memory: short-term storage with Redis/Valkey

Session Memory solves the problem of losing the conversation on reconnect, without having to store it forever.

4.1. Session Schema (JSON)

{
  "session_id": "sess_abc123xyz",
  "user_id": "usr_456",
  "tenant_id": "tenant_healthcare_01",
  "created_at": "2026-05-14T08:30:00+07:00",
  "last_active": "2026-05-14T09:15:42+07:00",
  "ttl_seconds": 86400,
  "metadata": {
    "channel": "web",
    "agent_id": "support-agent-v2",
    "language": "vi"
  },
  "context": {
    "current_topic": "đơn hàng ORD-78901",
    "entities_mentioned": ["ORD-78901", "sản phẩm laptop X1"],
    "user_intent": "track_order"
  },
  "messages": [
    {
      "id": "msg_001",
      "role": "user",
      "content": "Đơn hàng ORD-78901 của tôi đến chưa?",
      "timestamp": "2026-05-14T08:30:05+07:00",
      "token_count": 18
    },
    {
      "id": "msg_002",
      "role": "assistant",
      "content": "Đơn hàng ORD-78901 hiện đang trong quá trình giao, dự kiến đến ngày 15/05.",
      "timestamp": "2026-05-14T08:30:08+07:00",
      "token_count": 32,
      "tool_calls_used": ["get_order_status"]
    }
  ],
  "summary": "",
  "total_tokens_used": 50
}
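
As a small illustration of how the `last_active` and `ttl_seconds` fields work together, the sketch below computes a sliding expiry time from a trimmed-down copy of the session document. The helper names are hypothetical, and in a Redis deployment the server's own TTL would normally enforce this.

```python
import json
from datetime import datetime, timedelta


def session_expires_at(session: dict) -> datetime:
    """Sliding expiry: the session lives ttl_seconds past its last activity."""
    last_active = datetime.fromisoformat(session["last_active"])
    return last_active + timedelta(seconds=session["ttl_seconds"])


def is_expired(session: dict, now: datetime) -> bool:
    return now >= session_expires_at(session)


raw = """
{
  "session_id": "sess_abc123xyz",
  "last_active": "2026-05-14T09:15:42+07:00",
  "ttl_seconds": 86400,
  "messages": [{"token_count": 18}, {"token_count": 32}],
  "total_tokens_used": 50
}
"""
session = json.loads(raw)

same_evening = datetime.fromisoformat("2026-05-14T20:00:00+07:00")
next_day_late = datetime.fromisoformat("2026-05-15T10:00:00+07:00")

print("expires at:", session_expires_at(session).isoformat())
print("expired the same evening?", is_expired(session, same_evening))
print("expired the next morning?", is_expired(session, next_day_late))

# Sanity check: the running total should match the per-message counts
tokens_from_messages = sum(m["token_count"] for m in session["messages"])
```

Storing timestamps with an explicit UTC offset, as the schema does, keeps this comparison correct regardless of the server's local time zone.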

4.2. C#: Semantic Kernel with Redis Session Memory

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI; // OpenAIPromptExecutionSettings, ToolCallBehavior
using StackExchange.Redis;
using System.Text.Json;

// ============================================================
// Step 1: RedisSessionStore, session CRUD on Redis
// ============================================================
public class RedisSessionStore
{
    private readonly IDatabase _redis;
    private readonly TimeSpan _defaultTtl = TimeSpan.FromHours(24);

    public RedisSessionStore(IConnectionMultiplexer redis)
    {
        _redis = redis.GetDatabase();
    }

    private static string Key(string sessionId) => $"session:{sessionId}";

    public async Task<SessionData?> GetAsync(string sessionId)
    {
        var raw = await _redis.StringGetAsync(Key(sessionId));
        if (raw.IsNullOrEmpty) return null;

        // Refresh the TTL on every access (sliding expiry)
        await _redis.KeyExpireAsync(Key(sessionId), _defaultTtl);
        return JsonSerializer.Deserialize<SessionData>(raw!);
    }

    public async Task SaveAsync(SessionData session)
    {
        var json = JsonSerializer.Serialize(session);
        await _redis.StringSetAsync(
            Key(session.SessionId),
            json,
            _defaultTtl);
    }

    public async Task DeleteAsync(string sessionId)
        => await _redis.KeyDeleteAsync(Key(sessionId));

    public async Task AppendMessageAsync(
        string sessionId,
        string role,
        string content)
    {
        var session = await GetAsync(sessionId)
            ?? new SessionData { SessionId = sessionId };

        session.Messages.Add(new SessionMessage
        {
            Id = $"msg_{Guid.NewGuid():N}",
            Role = role,
            Content = content,
            Timestamp = DateTimeOffset.UtcNow
        });
        session.LastActive = DateTimeOffset.UtcNow;
        session.TotalTokensUsed += EstimateTokens(content);

        await SaveAsync(session);
    }

    private static int EstimateTokens(string text)
        => (int)Math.Ceiling(text.Length / 4.0); // Rough estimate (~4 chars per token, an English-oriented heuristic)
}

// ============================================================
// Step 2: AgentWithSessionMemory, Semantic Kernel integration
// ============================================================
public class AgentWithSessionMemory
{
    private readonly Kernel _kernel;
    private readonly RedisSessionStore _sessionStore;
    private readonly IChatCompletionService _chat;

    public AgentWithSessionMemory(
        Kernel kernel,
        RedisSessionStore sessionStore)
    {
        _kernel = kernel;
        _sessionStore = sessionStore;
        _chat = kernel.GetRequiredService<IChatCompletionService>();
    }

    public async Task<string> ChatAsync(
        string sessionId,
        string userId,
        string userMessage)
    {
        // 1. Load the session from Redis
        var session = await _sessionStore.GetAsync(sessionId)
            ?? new SessionData
            {
                SessionId = sessionId,
                UserId = userId,
                CreatedAt = DateTimeOffset.UtcNow,
                LastActive = DateTimeOffset.UtcNow
            };

        // 2. Rebuild ChatHistory from the session
        var history = new ChatHistory(BuildSystemPrompt(session));
        foreach (var msg in TrimToTokenBudget(session.Messages, maxTokens: 3000))
        {
            if (msg.Role == "user")
                history.AddUserMessage(msg.Content);
            else if (msg.Role == "assistant")
                history.AddAssistantMessage(msg.Content);
        }
        history.AddUserMessage(userMessage);

        // 3. Call the LLM
        var settings = new OpenAIPromptExecutionSettings
        {
            ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions,
            MaxTokens = 1024
        };

        var response = await _chat.GetChatMessageContentAsync(
            history, settings, _kernel);
        var assistantReply = response.Content
            ?? "Xin lỗi, tôi chưa xử lý được yêu cầu này.";

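        // 4. Save both turns to the session.
        //    Persist the session first: for a new session, AppendMessageAsync would
        //    otherwise create a fresh SessionData without UserId/CreatedAt.
        //    Note: this read-modify-write is not atomic; concurrent requests on the
        //    same session need a lock or an append-only structure.
        await _sessionStore.SaveAsync(session);
        await _sessionStore.AppendMessageAsync(sessionId, "user", userMessage);
        await _sessionStore.AppendMessageAsync(sessionId, "assistant", assistantReply);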

        return assistantReply;
    }

    private static string BuildSystemPrompt(SessionData session)
        => $"""
            You are an AI customer-support assistant.
            Session ID: {session.SessionId}
            User ID: {session.UserId}
            Session created: {session.CreatedAt:dd/MM/yyyy HH:mm}
            Answer concisely and professionally, in Vietnamese.
            """;

    private static IEnumerable<SessionMessage> TrimToTokenBudget(
        List<SessionMessage> messages,
        int maxTokens)
    {
        // Take messages from newest to oldest until the budget is used up
        var result = new List<SessionMessage>();
        int used = 0;
        foreach (var msg in messages.AsEnumerable().Reverse())
        {
            int t = (int)Math.Ceiling(msg.Content.Length / 4.0);
            if (used + t > maxTokens) break;
            result.Insert(0, msg);
            used += t;
        }
        return result;
    }
}

// ============================================================
// Step 3: Data models
// ============================================================
public record SessionData
{
    public string SessionId { get; set; } = "";
    public string UserId { get; set; } = "";
    public string TenantId { get; set; } = "";
    public DateTimeOffset CreatedAt { get; set; }
    public DateTimeOffset LastActive { get; set; }
    public List<SessionMessage> Messages { get; set; } = new();
    public string Summary { get; set; } = "";
    public int TotalTokensUsed { get; set; }
}

public record SessionMessage
{
    public string Id { get; set; } = "";
    public string Role { get; set; } = "";
    public string Content { get; set; } = "";
    public DateTimeOffset Timestamp { get; set; }
}

4.3. Redis configuration for Session Memory

# redis-session.yml: recommended application-level settings for production
redis:
  connection: "redis://redis-host:6379"
  database: 0               # Redis Cluster only supports database 0; isolate sessions with key_prefix
  key_prefix: "session:"
  default_ttl: 86400         # 24 hours (sliding)
  max_memory: "2gb"
  max_memory_policy: "allkeys-lru"  # Evict least-recently-used keys when memory runs out
  
  # Cluster mode for production scale
  cluster:
    enabled: true
    nodes:
      - "redis-1:6379"
      - "redis-2:6379"
      - "redis-3:6379"

5. Persistent Long-term Memory: PostgreSQL schema

Long-term memory stores information that is not deleted when a session ends: user profiles, preferences, and interaction history accumulated over many sessions and many months.

5.1. PostgreSQL schema

-- ============================================================
-- Schema: ai_memory
-- Description: Long-term memory for the AI Agent
-- ============================================================

CREATE SCHEMA IF NOT EXISTS ai_memory;

-- Accumulated user profile
CREATE TABLE ai_memory.user_profiles (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         VARCHAR(128) NOT NULL UNIQUE,
    tenant_id       VARCHAR(128) NOT NULL,
    display_name    VARCHAR(256),
    language        VARCHAR(10)  DEFAULT 'vi',
    timezone        VARCHAR(64)  DEFAULT 'Asia/Ho_Chi_Minh',
    
    -- Accumulated preferences and behavior (JSONB for flexibility)
    preferences     JSONB NOT NULL DEFAULT '{}'::jsonb,
    /*
      Example preferences:
      {
        "communication_style": "formal",
        "preferred_channel": "email",
        "product_interests": ["laptop", "accessories"],
        "delivery_preference": "morning",
        "language_level": "technical"
      }
    */
    
    -- Context summary from previous sessions
    context_summary TEXT,
    
    -- Metadata
    first_seen_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_seen_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    total_sessions  INT         NOT NULL DEFAULT 0,
    total_messages  INT         NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Long-term interaction log (stores only important events, not every message)
CREATE TABLE ai_memory.interaction_logs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         VARCHAR(128) NOT NULL REFERENCES ai_memory.user_profiles(user_id),
    tenant_id       VARCHAR(128) NOT NULL,
    session_id      VARCHAR(256),
    
    event_type      VARCHAR(64)  NOT NULL,
    -- Sample event_type values:
    -- 'preference_update', 'issue_reported', 'purchase_intent',
    -- 'complaint', 'compliment', 'topic_discussed', 'goal_achieved'
    
    summary         TEXT         NOT NULL,  -- Short summary of the event
    detail          JSONB,                  -- Full detail if needed
    importance      SMALLINT     NOT NULL DEFAULT 3 CHECK (importance BETWEEN 1 AND 5),
    -- 1=trivial, 2=low, 3=medium, 4=high, 5=critical
    
    tags            TEXT[]       DEFAULT '{}',
    
    -- Memory decay: low-importance rows get an expires_at; a scheduled cleanup job must delete them
    expires_at      TIMESTAMPTZ,
    
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Key-value store for memory that is shorter-lived than long-term records but must persist (when Redis is not an option)
CREATE TABLE ai_memory.memory_items (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         VARCHAR(128) NOT NULL,
    tenant_id       VARCHAR(128) NOT NULL,
    
    memory_key      VARCHAR(256) NOT NULL,
    memory_value    TEXT         NOT NULL,
    memory_type     VARCHAR(64)  NOT NULL DEFAULT 'fact',
    -- 'fact', 'preference', 'goal', 'constraint', 'skill', 'relationship'
    
    source          VARCHAR(128),           -- Originating session ID
    confidence      DECIMAL(3,2) DEFAULT 1.0 CHECK (confidence BETWEEN 0 AND 1),
    importance      SMALLINT     DEFAULT 3  CHECK (importance BETWEEN 1 AND 5),
    
    -- Deduplication
    content_hash    VARCHAR(64)  GENERATED ALWAYS AS (
        encode(sha256(user_id::bytea || memory_key::bytea || memory_value::bytea), 'hex')
    ) STORED,
    
    access_count    INT          NOT NULL DEFAULT 0,
    last_accessed   TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    
    UNIQUE(user_id, tenant_id, memory_key)
);

-- Indexes for performance
CREATE INDEX idx_user_profiles_tenant  ON ai_memory.user_profiles(tenant_id);
CREATE INDEX idx_user_profiles_last    ON ai_memory.user_profiles(last_seen_at DESC);
CREATE INDEX idx_interaction_logs_user ON ai_memory.interaction_logs(user_id, created_at DESC);
CREATE INDEX idx_interaction_logs_type ON ai_memory.interaction_logs(event_type, tenant_id);
CREATE INDEX idx_interaction_logs_tags ON ai_memory.interaction_logs USING GIN(tags);
CREATE INDEX idx_memory_items_user     ON ai_memory.memory_items(user_id, tenant_id);
CREATE INDEX idx_memory_items_type     ON ai_memory.memory_items(memory_type, user_id);
CREATE INDEX idx_memory_items_expires  ON ai_memory.memory_items(expires_at)
    WHERE expires_at IS NOT NULL;

-- Trigger that keeps updated_at current automatically
CREATE OR REPLACE FUNCTION ai_memory.set_updated_at()
RETURNS TRIGGER AS $$
BEGIN NEW.updated_at = NOW(); RETURN NEW; END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_user_profiles_updated
    BEFORE UPDATE ON ai_memory.user_profiles
    FOR EACH ROW EXECUTE FUNCTION ai_memory.set_updated_at();

CREATE TRIGGER trg_memory_items_updated
    BEFORE UPDATE ON ai_memory.memory_items
    FOR EACH ROW EXECUTE FUNCTION ai_memory.set_updated_at();
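
The `UNIQUE(user_id, tenant_id, memory_key)` constraint on `memory_items` is what makes an upsert possible: writing the same key again updates the stored value instead of creating a duplicate. The sketch below demonstrates the pattern with Python's built-in `sqlite3` and a simplified table purely so it runs without a database server; this is an assumption for illustration, and against PostgreSQL the same `INSERT ... ON CONFLICT ... DO UPDATE` statement would be sent through a PostgreSQL driver.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; the table is a simplified memory_items
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memory_items (
        user_id      TEXT NOT NULL,
        tenant_id    TEXT NOT NULL,
        memory_key   TEXT NOT NULL,
        memory_value TEXT NOT NULL,
        UNIQUE (user_id, tenant_id, memory_key)
    )
""")


def remember(user_id: str, tenant_id: str, key: str, value: str) -> None:
    """Insert a memory, or overwrite the value if the key already exists."""
    conn.execute(
        """
        INSERT INTO memory_items (user_id, tenant_id, memory_key, memory_value)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (user_id, tenant_id, memory_key)
        DO UPDATE SET memory_value = excluded.memory_value
        """,
        (user_id, tenant_id, key, value),
    )


remember("usr_456", "tenant_01", "delivery_preference", "morning")
remember("usr_456", "tenant_01", "delivery_preference", "evening")  # same key: update

rows = conn.execute(
    "SELECT memory_key, memory_value FROM memory_items"
).fetchall()
print(rows)
```

Keeping one row per key means the agent always reads the latest stated preference; if the history of changes matters, it belongs in `interaction_logs` rather than here.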

5.2. Migration Strategy

When the schema needs to change in production:

-- Migration V2: add the emotion_profile column to user_profiles
-- File: migrations/V2__add_emotion_profile.sql

ALTER TABLE ai_memory.user_profiles
    ADD COLUMN IF NOT EXISTS emotion_profile JSONB DEFAULT '{}'::jsonb;

COMMENT ON COLUMN ai_memory.user_profiles.emotion_profile IS
    'Accumulated emotional tendencies: { "avg_sentiment": 0.7, "frustration_signals": 2 }';

-- Backfill: existing rows already read the DEFAULT value.
-- On PostgreSQL 11+ a constant DEFAULT is applied without rewriting the table, so no full-table UPDATE is needed.

6. Semantic Memory: a vector store for semantic recall

Semantic Memory lets the agent retrieve relevant memories without knowing a key or the chronological order; a description that is semantically close to the stored content is enough.

6.1. Semantic Memory + RAG architecture

User: "Have I ever complained about a problem with this product?"
                │
                ▼
    ┌───────────────────────┐
    │  Semantic Memory      │
    │  Retrieval Pipeline   │
    └────────┬──────────────┘
             │
    ┌────────▼──────────────────────────────────────────┐
    │  1. Embed the query → vector [0.12, -0.34...]     │
    └────────┬──────────────────────────────────────────┘
             │
    ┌────────▼──────────────────────────────────────────┐
    │  2. Similarity search in Qdrant/pgvector          │
    │     Filter: user_id = "usr_456"                   │
    │     Top-K: the 5 most relevant memories           │
    └────────┬──────────────────────────────────────────┘
             │
    ┌────────▼──────────────────────────────────────────┐
    │  3. Re-rank by:                                   │
    │     - Similarity score                            │
    │     - Importance (1-5)                            │
    │     - Recency (more recent = higher priority)     │
    └────────┬──────────────────────────────────────────┘
             │
    ┌────────▼──────────────────────────────────────────┐
    │  4. Inject into the context window:               │
    │     [RELEVANT MEMORIES]:                          │
    │     - 12/03: Laptop battery drains quickly        │
    │     - 05/04: Reported a stuck Space key           │
    └────────┬──────────────────────────────────────────┘
             │
             ▼
    LLM generates an answer with full context
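
Steps 2 and 3 of the pipeline can be sketched in plain Python. The vectors below are toy 2-dimensional values rather than real embeddings, and the weights (0.6 / 0.25 / 0.15) and the 30-day half-life are illustrative assumptions, not values prescribed by this article.

```python
import math
from datetime import datetime, timezone


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_score(similarity: float, importance: int,
                 created_at: datetime, now: datetime,
                 half_life_days: float = 30.0) -> float:
    """Blend similarity, importance (1-5) and recency into one score."""
    age_days = max(0.0, (now - created_at).total_seconds() / 86_400)
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 today, 0.5 after one half-life
    return 0.6 * similarity + 0.25 * (importance / 5) + 0.15 * recency


now = datetime(2026, 5, 14, tzinfo=timezone.utc)
query_vec = [1.0, 0.0]
memories = [
    {"text": "likes morning delivery", "vec": [0.0, 1.0], "importance": 5,
     "created_at": datetime(2026, 5, 13, tzinfo=timezone.utc)},
    {"text": "space key stuck", "vec": [0.8, 0.2], "importance": 3,
     "created_at": datetime(2026, 4, 5, tzinfo=timezone.utc)},
    {"text": "battery drains fast", "vec": [0.9, 0.1], "importance": 4,
     "created_at": datetime(2026, 3, 12, tzinfo=timezone.utc)},
]

ranked = sorted(
    memories,
    key=lambda m: rerank_score(
        cosine_similarity(query_vec, m["vec"]),
        m["importance"], m["created_at"], now),
    reverse=True,
)
for m in ranked:
    print(m["text"])
```

In this toy data the unrelated memory ranks last even though it is the newest and most important, because similarity carries the largest weight; shifting the weights changes how strongly recency and importance can override relevance.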

6.2. Python: LangChain + Qdrant Semantic Memory

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain.memory import VectorStoreRetrieverMemory
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate
from qdrant_client import QdrantClient, models
from qdrant_client.models import Distance, VectorParams
from datetime import datetime, timezone
import uuid


# ============================================================
# Step 1: Initialize the Qdrant collection for semantic memory
# ============================================================
def init_semantic_memory_store(
    qdrant_url: str,
    collection_name: str = "agent_memories",
    vector_size: int = 1536  # OpenAI text-embedding-3-small
) -> QdrantVectorStore:
    client = QdrantClient(url=qdrant_url)

    # Create the collection if it does not exist yet
    existing = [c.name for c in client.get_collections().collections]
    if collection_name not in existing:
        client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(
                size=vector_size,
                distance=Distance.COSINE
            )
        )

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return QdrantVectorStore(
        client=client,
        collection_name=collection_name,
        embedding=embeddings
    )


def build_user_filter(user_id: str, memory_type: str | None = None) -> models.Filter:
    """
    Build a Qdrant filter for one user (optionally one memory_type).
    QdrantVectorStore expects a qdrant_client Filter rather than a dict, and
    LangChain stores document metadata under the "metadata" payload key.
    """
    conditions = [
        models.FieldCondition(
            key="metadata.user_id",
            match=models.MatchValue(value=user_id),
        )
    ]
    if memory_type:
        conditions.append(
            models.FieldCondition(
                key="metadata.memory_type",
                match=models.MatchValue(value=memory_type),
            )
        )
    return models.Filter(must=conditions)


# ============================================================
# Step 2: SemanticMemoryManager, saving and querying memories
# ============================================================
class SemanticMemoryManager:
    """
    Manages semantic memory for one specific user.
    Each memory is a piece of text with metadata: user_id, importance, timestamp.
    """

    IMPORTANCE_THRESHOLD = 3  # Only store memories with importance >= 3

    def __init__(self, vector_store: QdrantVectorStore, user_id: str):
        self._store = vector_store
        self._user_id = user_id

    def save_memory(
        self,
        content: str,
        memory_type: str = "fact",
        importance: int = 3,
        tags: list[str] | None = None
    ) -> str | None:
        """
        Save one memory to the vector store.
        Only saved if importance >= the threshold.
        Returns the memory_id on success, None if skipped.
        """
        if importance < self.IMPORTANCE_THRESHOLD:
            return None  # Not important enough for long-term memory

        memory_id = str(uuid.uuid4())
        metadata = {
            "user_id": self._user_id,
            "memory_id": memory_id,
            "memory_type": memory_type,
            "importance": importance,
            "tags": tags or [],
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            "created_at": datetime.now(timezone.utc).isoformat(),
        }

        self._store.add_texts(
            texts=[content],
            metadatas=[metadata],
            ids=[memory_id]
        )
        return memory_id

    def recall(
        self,
        query: str,
        top_k: int = 5,
        memory_type: str | None = None
    ) -> list[dict]:
        """
        Search for related memories by meaning.
        Can optionally filter by memory_type.
        """
        results = self._store.similarity_search_with_score(
            query=query,
            k=top_k,
            filter=build_user_filter(self._user_id, memory_type)
        )

        # Re-rank: combine similarity score and importance
        # (recency from the diagram above is not included in this simple formula)
        memories = []
        for doc, score in results:
            importance = doc.metadata.get("importance", 3)
            # Simple re-rank formula: 0.7 * similarity + 0.3 * (importance/5)
            combined_score = 0.7 * score + 0.3 * (importance / 5)
            memories.append({
                "content": doc.page_content,
                "metadata": doc.metadata,
                "similarity": round(score, 4),
                "combined_score": round(combined_score, 4)
            })

        # Sort by combined_score, highest first
        memories.sort(key=lambda x: x["combined_score"], reverse=True)
        return memories

    def format_for_context(self, memories: list[dict]) -> str:
        """Format memories for injection into the context window."""
        if not memories:
            return ""
        lines = ["[RELEVANT USER MEMORIES]:"]
        for m in memories:
            date = m["metadata"].get("created_at", "")[:10]
            mtype = m["metadata"].get("memory_type", "fact")
            lines.append(f"- [{date}][{mtype}] {m['content']}")
        return "\n".join(lines)


# ============================================================
# Step 3: Integrate with LangChain ConversationChain
# ============================================================
def build_agent_with_semantic_memory(
    qdrant_url: str,
    user_id: str
) -> tuple[ConversationChain, SemanticMemoryManager]:
    vector_store = init_semantic_memory_store(qdrant_url)
    memory_mgr = SemanticMemoryManager(vector_store, user_id)

    retriever = vector_store.as_retriever(
        search_kwargs={
            "k": 4,
            "filter": build_user_filter(user_id)
        }
    )

    # Note: turns written back by VectorStoreRetrieverMemory carry no user_id
    # metadata, so this filter only matches memories saved via memory_mgr.
    lc_memory = VectorStoreRetrieverMemory(retriever=retriever)

    prompt = PromptTemplate(
        input_variables=["history", "input"],
        template="""You are an intelligent AI customer-support assistant.

Information from previous interactions:
{history}

Current conversation:
User: {input}
Assistant:"""
    )

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
    chain = ConversationChain(
        llm=llm,
        prompt=prompt,
        memory=lc_memory,
        verbose=False
    )
    return chain, memory_mgr

7. Memory Retrieval Strategy: when to use which type

7.1. Decision Tree: choosing the right memory type

Start: the agent receives a new request from the user
                        │
                        ▼
        ┌───────────────────────────────┐
        │  Is the information already   │
        │  in the current context?      │
        └──────────┬────────────────────┘
                   │
         ┌─────────┴─────────┐
        YES                  NO
         │                   │
         ▼                   ▼
   Use IN-CONTEXT      Where to look?
   MEMORY directly           │
                    ┌────────┴─────────────────────┐
                    │                              │
         ┌──────────▼──────────┐      ┌────────────▼──────────┐
         │  Information from   │      │  Information from     │
         │  the same session   │      │  earlier sessions?    │
         │  today?             │      └────────────┬──────────┘
         └──────────┬──────────┘                   │
                    │                    ┌──────────┴──────────┐
                   YES                  │                     │
                    │            Tìm theo KEY         Tìm theo NGỮ NGHĨA
                    ▼            (user_id, type)      (không biết key cụ thể)
              SESSION MEMORY            │                     │
              (Redis, ~1ms)             ▼                     ▼
                               PERSISTENT MEMORY       SEMANTIC MEMORY
                               (PostgreSQL, ~20ms)     (Qdrant, ~30-50ms)

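7.2. Hybrid Retrieval: Combining Session + Semantic

The best-performing strategy for production is to always query the sources in parallel (session memory, semantic memory, and the user profile in the code that follows), then merge the results: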

import asyncio
from dataclasses import dataclass

@dataclass
class MemoryContext:
    session_messages: list[dict]
    semantic_memories: list[dict]
    user_profile: dict | None

async def hybrid_memory_retrieval(
    session_id: str,
    user_id: str,
    current_query: str,
    session_store: RedisSessionStore,      # type: ignore
    semantic_mgr: SemanticMemoryManager,
    profile_repo: object                    # type: ignore
) -> MemoryContext:
    """Query session memory, semantic memory, and the user profile in parallel."""

    session_task = session_store.get_async(session_id)
    semantic_task = asyncio.to_thread(
        semantic_mgr.recall, current_query, top_k=4
    )
    profile_task = asyncio.to_thread(
        profile_repo.get_by_user_id, user_id  # type: ignore
    )

    session_data, semantic_results, profile = await asyncio.gather(
        session_task, semantic_task, profile_task
    )

    return MemoryContext(
        session_messages=session_data.messages if session_data else [],
        semantic_memories=semantic_results,
        user_profile=profile
    )
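
The retrieval code returns the sources side by side; the merge step itself is not shown. A minimal sketch of one possible merge follows. It assumes each message and memory is a dict with a `content` field and each memory has a `similarity` score (the latter appears in the write policy later on); `merge_memory_sources` is a hypothetical helper.

```python
def merge_memory_sources(
    session_messages: list[dict],
    semantic_memories: list[dict],
    max_semantic: int = 4,
) -> list[dict]:
    """Keep the best semantic memories that the session does not already contain."""
    # Anything already present in the session is redundant in the prompt.
    seen = {m["content"].strip().lower() for m in session_messages}
    merged: list[dict] = []
    ranked = sorted(semantic_memories, key=lambda m: m["similarity"], reverse=True)
    for memory in ranked:
        if len(merged) >= max_semantic:
            break
        key = memory["content"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        merged.append(memory)
    return merged
```

Exact-text matching is the simplest possible duplicate check; a production system might compare embeddings instead.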

8. Advanced Context Window Management

8.1. Four main strategies

Strategy	Description	Pros	Cons	Best suited for
Sliding Window	Keep the N most recent messages	Simple, easy to implement	Loses important information from early in the session	FAQ bots, short sessions
Summary Buffer	Summarize older turns when the buffer fills up	Keeps key information, token-efficient	Needs extra LLM calls for summarization	Customer-support bots, medium-length sessions
Entity Memory	Track the entities mentioned (names, order IDs, products)	Keeps important facts with few tokens	Needs an NER pipeline	Sales bots, healthcare bots
ConversationKG	Knowledge graph built from the conversation	Represents complex relationships	Complex to deploy	Research agents, contract analysis
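
As an illustration of the simplest strategy in the table, here is a minimal Sliding Window sketch. It assumes messages are dicts with a `role` field in the usual chat format and that system messages should always survive trimming; `sliding_window` is a hypothetical helper, not a LangChain API.

```python
def sliding_window(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep every system message plus the N most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]
```

The weakness listed in the table is visible here: anything said before the last `max_messages` turns is simply dropped.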

8.2. Detailed comparison

Criterion	Sliding Window	Summary Buffer	Entity Memory	ConversationKG
Implementation complexity	★☆☆☆☆	★★★☆☆	★★★☆☆	★★★★★
Token efficiency	★★☆☆☆	★★★★☆	★★★★★	★★★☆☆
Long-term retention	★☆☆☆☆	★★★☆☆	★★★★☆	★★★★★
Speed	★★★★★	★★★☆☆	★★★★☆	★★☆☆☆
API cost	Low	Medium	Low	High
LangChain support	✅	✅	✅	✅ (beta)
Semantic Kernel support	✅	Custom implementation	Custom implementation	❌

8.3. Recommended strategy by use case

Use case               │ Recommended strategy
───────────────────────┼──────────────────────────────────────────
Simple FAQ chatbot     │ Sliding Window (20 messages)
Customer Support AI    │ Summary Buffer + Entity Memory
Healthcare AI          │ Entity Memory + Persistent Memory
Sales/CRM AI           │ Entity Memory + Semantic Memory
Contract analysis      │ ConversationKG + Semantic Memory
Personal Assistant     │ Summary Buffer + Semantic Memory + Profile

9. User Profiling & Personalization

9.1. Building a cumulative user profile

A user profile is not created once and left alone; it accumulates and updates itself with every interaction:

{
  "user_id": "usr_456",
  "tenant_id": "tenant_ecommerce_01",
  "display_name": "Nguyễn Văn An",
  "language": "vi",
  "timezone": "Asia/Ho_Chi_Minh",
  
  "preferences": {
    "communication_style": "casual",
    "response_length": "concise",
    "preferred_channel": "zalo",
    "delivery_time": "morning",
    "payment_method": "momo",
    "product_categories": ["laptop", "phụ kiện gaming"],
    "price_sensitivity": "medium",
    "brand_preferences": ["Dell", "ASUS"]
  },
  
  "behavioral_patterns": {
    "avg_session_duration_minutes": 12.5,
    "peak_active_hours": ["08:00-10:00", "20:00-22:00"],
    "typical_query_types": ["order_tracking", "product_comparison"],
    "escalation_rate": 0.05,
    "satisfaction_trend": "improving"
  },
  
  "known_issues": [
    {
      "type": "allergy",
      "detail": "dị ứng latex",
      "recorded_at": "2026-03-12",
      "source_session": "sess_xyz789"
    }
  ],
  
  "interaction_summary": "Khách hàng thân thiết, thường mua laptop gaming. Đã từng phàn nàn về thời gian giao hàng chậm vào tháng 3. Ưa phong cách giao tiếp thân mật, không thích câu trả lời dài dòng.",
  
  "metrics": {
    "total_sessions": 28,
    "total_messages": 312,
    "purchases_assisted": 4,
    "tickets_raised": 2,
    "last_purchase_date": "2026-04-20",
    "lifetime_value_vnd": 18500000
  },
  
  "privacy": {
    "consent_given": true,
    "consent_date": "2026-01-15",
    "data_retention_until": "2029-01-15",
    "pii_masked": false
  }
}
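
To make "accumulates with every interaction" concrete, the sketch below folds one finished session into a profile shaped like the JSON above. The session fields `message_count` and `mentioned_categories` are assumptions made for this example, as is the helper name `apply_session_to_profile`.

```python
from copy import deepcopy


def apply_session_to_profile(profile: dict, session: dict) -> dict:
    """Return a new profile with counters and preferences updated from one session."""
    updated = deepcopy(profile)  # never mutate the caller's copy

    metrics = updated.setdefault("metrics", {})
    metrics["total_sessions"] = metrics.get("total_sessions", 0) + 1
    metrics["total_messages"] = (
        metrics.get("total_messages", 0) + session.get("message_count", 0)
    )

    # Union of product categories, preserving first-seen order.
    prefs = updated.setdefault("preferences", {})
    categories = prefs.setdefault("product_categories", [])
    for category in session.get("mentioned_categories", []):
        if category not in categories:
            categories.append(category)

    return updated
```

Fields such as `interaction_summary` would normally be refreshed by an LLM summarization step, which is outside the scope of this sketch.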

9.2. Privacy Considerations

Separate PII: email, phone number, and CCCD (citizen ID) are not stored in the profile summary
Consent tracking: record exactly when the user agreed to data storage
Data minimization: store only what personalization actually needs
Right to be forgotten: see section 12 for the mechanism that deletes all memory on request
Tenant isolation: each tenant has its own namespace, with no cross-tenant queries
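
Consent tracking only helps if it is checked before every write. A minimal sketch of such a gate, using the `privacy` block from the profile in 9.1, might look like this. Treating a missing `data_retention_until` as "no end date" is an assumption of this example, and `may_store_memory` is a hypothetical name.

```python
from datetime import date


def may_store_memory(profile: dict, today: date) -> bool:
    """Allow long-term writes only with consent and inside the retention period."""
    privacy = profile.get("privacy", {})
    if not privacy.get("consent_given", False):
        return False
    until = privacy.get("data_retention_until")
    if until is None:
        return True  # assumption: consent without an end date has no time limit
    return today <= date.fromisoformat(until)
```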

10. Memory Write Policy: When to Write, When to Skip

Not every message deserves a place in long-term memory. Writing indiscriminately adds noise to the memory store and drives up cost.

10.1. Importance Scoring

from enum import IntEnum

class MemoryImportance(IntEnum):
    TRIVIAL   = 1   # "Ok", "Cảm ơn" (thanks), greetings
    LOW       = 2   # General, non-personal questions
    MEDIUM    = 3   # Useful but not critical information
    HIGH      = 4   # Clear preferences, problems that occurred
    CRITICAL  = 5   # Allergies, special requirements, serious complaints

# Simple rule table for scoring (keywords stay in Vietnamese to match user messages)
IMPORTANCE_RULES = [
    # (keywords, importance)
    (["dị ứng", "không dùng được", "cấm", "tuyệt đối không"], MemoryImportance.CRITICAL),
    (["thích", "muốn", "ưa", "hay dùng", "thường xuyên"],     MemoryImportance.HIGH),
    (["từng", "lần trước", "hôm qua", "tuần trước"],          MemoryImportance.HIGH),
    (["phàn nàn", "tức", "bực", "thất vọng", "tệ"],          MemoryImportance.HIGH),
    (["hỏi về", "muốn biết", "giá bao nhiêu"],                MemoryImportance.LOW),
    (["ok", "được", "cảm ơn", "bye", "tạm biệt"],            MemoryImportance.TRIVIAL),
]

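def score_importance(message: str) -> MemoryImportance:
    """Return the highest importance among matching rules; LOW when nothing matches."""
    message_lower = message.lower()
    matched = [
        importance
        for keywords, importance in IMPORTANCE_RULES
        if any(kw in message_lower for kw in keywords)
    ]
    # Take the max over matched rules only. Starting from a LOW floor would make
    # TRIVIAL unreachable, so greetings could never be skipped by the write policy.
    return max(matched, default=MemoryImportance.LOW)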

10.2. Memory Write Decision Flow

async def decide_and_write_memory(
    user_id: str,
    message: str,
    session_context: dict,
    memory_mgr: SemanticMemoryManager,
    pg_repo: object  # type: ignore
) -> None:
    """
    Decide whether to save to long-term memory, and where.
    """
    importance = score_importance(message)

    # Rule 1: skip messages that are too trivial
    if importance <= MemoryImportance.TRIVIAL:
        return

    # Rule 2: deduplication check (is there already a similar memory?)
    similar = memory_mgr.recall(message, top_k=1)
    if similar and similar[0]["similarity"] > 0.95:
        return  # A near-identical memory already exists, skip

    # Rule 3: write to Semantic Memory when importance >= 3
    if importance >= MemoryImportance.MEDIUM:
        memory_mgr.save_memory(
            content=message,
            memory_type=classify_memory_type(message),
            importance=int(importance)
        )

    # Rule 4: also write to the PostgreSQL interaction_log when importance >= 4
    if importance >= MemoryImportance.HIGH:
        await pg_repo.log_interaction(  # type: ignore
            user_id=user_id,
            event_type=classify_event_type(message),
            summary=message[:500],
            importance=int(importance),
            # Memory decay: only HIGH and CRITICAL reach this branch.
            # CRITICAL never expires; HIGH expires after 2 years (see 10.3).
            # Assumes log_interaction accepts a SQL interval expression here.
            expires_at=(
                None if importance >= MemoryImportance.CRITICAL
                else "NOW() + INTERVAL '2 years'"
            )
        )

10.3. Memory Decay: TTL for Long-term Memory

Not every memory needs to be kept forever. Set the TTL according to importance:

Importance Level	Recommended TTL	Example
CRITICAL (5)	Never expires	Allergies, special health-related requirements
HIGH (4)	2 years	Purchase preferences, resolved complaints
MEDIUM (3)	6 months	Answered questions, viewed products
LOW (2)	90 days	Session context information
TRIVIAL (1)	Not stored	Greetings, short acknowledgements

11. Multi-session Continuity

11.1. Welcoming returning users

When a user starts a new session, the agent should pre-load context and greet them in a personalized way:

async def build_welcome_context(
    user_id: str,
    current_query: str,
    memory_mgr: SemanticMemoryManager,
    pg_repo: object  # type: ignore
) -> str:
    """
    Build a rich context when the user comes back.
    The lookups run in parallel to minimize latency.
    """
    import asyncio

    profile_task = asyncio.to_thread(pg_repo.get_profile, user_id)  # type: ignore
    memories_task = asyncio.to_thread(
        memory_mgr.recall, current_query, top_k=3
    )

    profile, relevant_memories = await asyncio.gather(
        profile_task, memories_task
    )

    context_parts = []

    # 1. Basic profile information
    if profile:
        context_parts.append(f"""
[HỒ SƠ NGƯỜI DÙNG]:
- Tên: {profile.get('display_name', 'Khách hàng')}
- Số phiên: {profile.get('metrics', {}).get('total_sessions', 0)}
- Tóm tắt: {profile.get('interaction_summary', '')}
- Sở thích nổi bật: {', '.join(profile.get('preferences', {}).get('product_categories', []))}
        """.strip())

    # 2. Memories related to the current question
    if relevant_memories:
        context_parts.append(
            memory_mgr.format_for_context(relevant_memories)
        )

    return "\n\n".join(context_parts)

11.2. Prompt Augmentation Template

Template for injecting memory context into the system prompt:

SYSTEM PROMPT TEMPLATE (with Memory Augmentation):
─────────────────────────────────────────────────────────
You are the AI assistant of {company_name}.

{user_context}
━━ Guidelines for answering ━━
- If the user returns after several days, greet them warmly and mention the
  most recent interaction when it is relevant to the current question.
- Prioritize the information in [KÝ ỨC LIÊN QUAN] when it relates to the question.
- Do NOT mention unrelated memories, so the user never feels "watched".
- Communication style: {communication_style}
─────────────────────────────────────────────────────────

Example result after augmentation (the tags [HỒ SƠ NGƯỜI DÙNG], "user profile", and [KÝ ỨC LIÊN QUAN], "relevant memories", stay in Vietnamese because the context builder in 11.1 emits them verbatim):
─────────────────────────────────────────────────────────
[HỒ SƠ NGƯỜI DÙNG]:
- Tên: Nguyễn Văn An (28 phiên, khách thân thiết)
- Tóm tắt: Thường mua laptop gaming, thích giao hàng buổi sáng

[KÝ ỨC LIÊN QUAN]:
- [2026-03-12][constraint] dị ứng latex — KHÔNG gợi ý sản phẩm chứa latex
- [2026-04-05][complaint] Phàn nàn giao hàng chậm 3 ngày so với cam kết

Chào mừng anh An quay lại! Hôm nay anh cần hỗ trợ gì ạ?
─────────────────────────────────────────────────────────
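
Filling the template is plain string substitution. The sketch below uses a shortened English version of the template; `SYSTEM_TEMPLATE`, `render_system_prompt`, and the fallback text for users without stored context are assumptions made for this example.

```python
SYSTEM_TEMPLATE = """You are the AI assistant of {company_name}.

{user_context}
Guidelines:
- Prioritize information under [KÝ ỨC LIÊN QUAN] when it relates to the question.
- Do not mention unrelated memories.
- Communication style: {communication_style}"""


def render_system_prompt(
    company_name: str, user_context: str, communication_style: str = "casual"
) -> str:
    """Insert the memory context into the system prompt template."""
    # First-time users have no context; say so explicitly rather than
    # leaving a blank gap the model might try to fill.
    context = user_context.strip() or "(no stored context for this user)"
    return SYSTEM_TEMPLATE.format(
        company_name=company_name,
        user_context=context,
        communication_style=communication_style,
    )
```

Because `str.format` does not re-parse substituted values, braces inside `user_context` are passed through unchanged.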

12. Security & Privacy for Memory

12.1. Core principles

Data Isolation (Multi-tenant): each tenant/organization has its own namespace in Redis, its own schema in PostgreSQL, and its own collection in the vector store. Never run cross-tenant queries.
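
One way to make isolation hard to bypass is to build every storage key through a single helper that always includes the tenant. The key layout and the names `session_key` and `collection_name` below are hypothetical; the point of the sketch is that identifiers are validated so that one tenant cannot craft an ID that lands in another tenant's namespace.

```python
import re

_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def _check(part: str) -> str:
    """Reject IDs containing separators or other unexpected characters."""
    if not _SAFE_ID.fullmatch(part):
        raise ValueError(f"unsafe identifier: {part!r}")
    return part


def session_key(tenant_id: str, user_id: str, session_id: str) -> str:
    """Redis key that is always scoped by tenant first, then user."""
    return (
        f"tenant:{_check(tenant_id)}:user:{_check(user_id)}"
        f":session:{_check(session_id)}"
    )


def collection_name(tenant_id: str) -> str:
    """One vector-store collection per tenant."""
    return f"memory_{_check(tenant_id)}"
```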

PII masking before storage: always mask PII before writing to semantic memory or the interaction log:

import re

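PII_PATTERNS = {
    # "[A-Z|a-z]" would also accept a literal "|", so the TLD class is [A-Za-z]
    "email":       r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    # A leading \b never matches before "+", so use a lookbehind instead
    "phone_vn":    r'(?<!\w)(?:\+84|0)[35789]\d{8}\b',
    "cccd":        r'\b\d{9}(\d{3})?\b',  # 9 or 12 digits
    "credit_card": r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
}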

def mask_pii(text: str) -> str:
    """Replace PII with placeholders before saving to memory."""
    masked = text
    for pii_type, pattern in PII_PATTERNS.items():
        placeholder = f"[{pii_type.upper()}_MASKED]"
        masked = re.sub(pattern, placeholder, masked, flags=re.IGNORECASE)
    return masked

# Usage:
# "Email tôi là abc@gmail.com và SĐT 0912345678"
# → "Email tôi là [EMAIL_MASKED] và SĐT [PHONE_VN_MASKED]"

12.2. Right to Be Forgotten: Deleting All Memory on Request

When a user asks to be forgotten, every storage tier has to be purged in one pass, and the deletion itself has to be recorded for audit:

async def delete_all_user_memory(
    user_id: str,
    tenant_id: str,
    session_store: object,    # type: ignore
    vector_store: object,     # type: ignore
    pg_repo: object           # type: ignore
) -> dict:
    """
    Delete all of a user's memory in response to a GDPR request.
    Returns a deletion report for auditing.
    """
    import asyncio
    from datetime import datetime, timezone

    results = {}

    # 1. Delete all sessions in Redis
    session_keys = await session_store.find_by_user(user_id, tenant_id)  # type: ignore
    for key in session_keys:
        await session_store.delete(key)  # type: ignore
    results["sessions_deleted"] = len(session_keys)

    # 2. Delete semantic memories in the vector store
    deleted_vectors = await asyncio.to_thread(
        vector_store.delete,  # type: ignore
        filter={"user_id": user_id, "tenant_id": tenant_id}
    )
    results["vectors_deleted"] = deleted_vectors

    # 3. Delete PostgreSQL records
    pg_deleted = await pg_repo.delete_user_data(user_id, tenant_id)  # type: ignore
    results.update(pg_deleted)

    # 4. Audit log (mandatory; never deleted)
    await pg_repo.log_gdpr_deletion(  # type: ignore
        user_id=user_id,
        tenant_id=tenant_id,
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        deleted_at=datetime.now(timezone.utc).isoformat(),
        deletion_report=results
    )

    return results

12.3. Memory security configuration (YAML)

# memory-security.yml
memory_security:
  
  # Encryption at rest
  encryption:
    redis:
      enabled: true
      algorithm: "AES-256-GCM"
      key_rotation_days: 90
    postgresql:
      # Transparent Data Encryption is not built into community PostgreSQL;
      # this flag assumes a TDE-capable distribution/extension or disk-level encryption
      tde_enabled: true
      column_encryption:
        - table: user_profiles
          columns: [preferences, context_summary, interaction_summary]
    vector_store:
      enabled: true
      provider: "qdrant-cloud"   # Qdrant Cloud has built-in encryption

  # Access control
  access_control:
    rbac_enabled: true
    roles:
      agent_read:       ["session:read", "memory:read"]
      agent_write:      ["session:write", "memory:write"]
      admin:            ["session:*", "memory:*", "gdpr:*"]
    tenant_isolation:   strict   # Cross-tenant queries are not allowed

  # PII
  pii:
    mask_before_store:  true
    patterns:           ["email", "phone_vn", "cccd", "credit_card"]
    log_masking_events: true

  # Retention policy
  retention:
    default_ttl_days:   180
    critical_memory:    "no_expiry"
    gdpr_deletion:      "immediate"
    audit_logs:         "7_years"   # Retention period cited for Vietnamese legal requirements; confirm the figure for your sector

  # Monitoring
  monitoring:
    alert_on_cross_tenant_query: true
    alert_on_bulk_read:          true   # > 1000 records within 1 minute
    alert_on_pii_in_log:         true
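
The `roles` block above maps each role to permission patterns such as `memory:*`. A minimal sketch of how an agent could enforce them follows. The `ROLES` dict mirrors the YAML by hand (loading the file is left out to keep the example free of third-party dependencies), and `is_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Mirrors access_control.roles in memory-security.yml
ROLES: dict[str, list[str]] = {
    "agent_read": ["session:read", "memory:read"],
    "agent_write": ["session:write", "memory:write"],
    "admin": ["session:*", "memory:*", "gdpr:*"],
}


def is_allowed(role: str, permission: str) -> bool:
    """True if any of the role's patterns matches the requested permission."""
    patterns = ROLES.get(role, [])  # unknown roles get no permissions
    return any(fnmatchcase(permission, pattern) for pattern in patterns)
```

Denying by default for unknown roles keeps a typo in a role name from silently granting access.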

13. Memory System Deployment Checklist

✅ Level 1: In-Context Memory (Weeks 1–2)

Choose a context management strategy: Sliding Window / Summary Buffer / Entity Memory
Implement accurate token counting for the model in use (tiktoken or equivalent)
Set an automatic summarization threshold (recommended: 80% of the token budget)
Unit test: make sure the system prompt is always preserved
Measure average token usage per request to establish a cost baseline
Verify: the prompt never exceeds the model's context window, with room left for the response's max_tokens
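
The 80% threshold from the checklist can be sketched as a simple budget check. The token counter here is a rough character-based stand-in, clearly an assumption; in a real deployment `count_tokens` would be replaced by the tokenizer for the model in use. `needs_summarization` and `estimate_tokens` are hypothetical names.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Very rough stand-in: about 4 characters per token. Not model-accurate."""
    return max(1, len(text) // 4)


def needs_summarization(
    messages: list[dict],
    max_context_tokens: int,
    threshold: float = 0.8,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> bool:
    """True once the conversation uses at least `threshold` of the token budget."""
    used = sum(count_tokens(m["content"]) for m in messages)
    return used >= threshold * max_context_tokens
```

Passing the counter in as a parameter makes it easy to swap in an exact tokenizer without touching the budget logic.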

✅ Level 2: Session Memory (Weeks 2–4)

Install Redis/Valkey with persistence (AOF + RDB)
Design a complete JSON session schema (session_id, user_id, tenant_id, messages, metadata)
Implement a sliding TTL (refresh the TTL on every access)
Set the Redis eviction policy: allkeys-lru
Test: the session still loads after a network drop and reconnect
Test: sessions never leak across users or tenants (tenant isolation)
Monitoring: Redis memory usage, key count, hit rate
Backup: configure Redis persistence for production

✅ Level 3: Long-term Memory (Weeks 4–8)

Deploy the PostgreSQL schema (ai_memory.user_profiles, interaction_logs, memory_items)
Implement importance scoring for every message before it is stored
Implement the PII masking pipeline (email, phone, CCCD)
Set memory decay TTLs by importance level
Implement deduplication with content_hash
Set up the vector store (Qdrant or pgvector) and the indexing pipeline
Implement hybrid retrieval (session + semantic, run in parallel)
GDPR: implement the delete_all_user_memory API endpoint
Encrypt sensitive columns in PostgreSQL
Load test: hybrid retrieval < 100ms P95 with 100K users
Audit log: every write operation to long-term memory
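
The checklist mentions deduplication with `content_hash`. One plausible way to compute such a hash is sketched below: normalize Unicode, case, and whitespace, then take SHA-256. The normalization rules and the `ContentDeduplicator` class are assumptions of this example; in the real system the "seen" set would be a unique index in PostgreSQL.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Stable hash of a memory's text, ignoring case, spacing, and Unicode form."""
    # NFC matters for Vietnamese: the same accented letter can be encoded
    # as one code point or as a base letter plus combining marks.
    normalized = unicodedata.normalize("NFC", text).lower()
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ContentDeduplicator:
    """Remembers hashes already written and rejects exact repeats."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, text: str) -> bool:
        digest = content_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

A hash only catches exact repeats after normalization; paraphrases still need the similarity check from the write policy in 10.2.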

14. KPIs, Cost, and ROI

14.1. KPIs for the memory system

KPI	Definition	MVP target	Production target
Session Continuity Rate	% of sessions restored successfully after a reconnect	≥ 95%	≥ 99.5%
Memory Retrieval Latency (P95)	P95 time of a hybrid retrieval	≤ 200ms	≤ 80ms
Memory Relevance Score	% of retrieved memories that are actually relevant	≥ 70%	≥ 85%
Context Token Efficiency	Reduction in tokens sent to the LLM compared with no memory	≥ 20%	≥ 40%
Personalization Acceptance Rate	% of memory-backed responses that the user does not flag as wrong	≥ 90%	≥ 97%
Memory Write Noise Rate	% of long-term records that are never retrieved again	≤ 30%	≤ 10%
GDPR Deletion SLA	Time to complete a right-to-be-forgotten request after it is received	≤ 72 hours	≤ 24 hours
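
Two of these KPIs are easy to compute from raw logs. The sketch below uses the nearest-rank definition of P95 (other percentile definitions give slightly different values) and computes the noise rate from sets of record IDs; `p95` and `noise_rate` are hypothetical helpers.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it."""
    if not samples:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def noise_rate(written_ids: list[str], retrieved_ids: list[str]) -> float:
    """Share of written records that were never retrieved afterwards."""
    written = set(written_ids)
    if not written:
        return 0.0
    never_used = written - set(retrieved_ids)
    return len(never_used) / len(written)
```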

14.2. Cost estimate (SMB scale, 10,000 sessions/day)

Item	Setup cost	Operating cost/month	Notes
Redis (2 GB, HA)	$0 (self-hosted)	$30–80	Or Upstash Redis at ~$20/month
PostgreSQL (memory schema)	$0 (added to an existing instance)	$10–30	~50 GB storage for 1M users
Qdrant Cloud (1M vectors)	$0	$25–75	Depends on the number of memories per user
Embedding API	n/a	$20–60	10K sessions × avg 10 memories × $0.0001/embed; taken literally this is about $10/day, so the range only holds with far fewer embedded memories or a lower per-embed price
LLM for summarization	n/a	$15–40	Only when context summarization is triggered
Engineering (design + implementation)	$3,000–8,000	$500–1,500	Maintenance and improvements
Estimated total	$3,000–8,000	$100–285	Infrastructure only; excludes the main LLM and the monthly engineering line

14.3. Reference ROI

Scenario (illustrative): an e-commerce company with 50,000 active customers. Before the memory system:

Every new session: customers spend 2–3 minutes re-explaining context → 30% of them give up
The CS team receives 20% of its tickets as "repeat of an already-resolved issue" because the agent does not remember

After deploying the memory system:

Returning customers resume right where they left off → 30% lower abandonment → +15% conversion
Fewer repeat tickets: the agent recalls context on its own → -20% ticket volume → $2,000–5,000/month saved on CS staffing
CSAT rises from 3.8 → 4.3/5 (reference example from CRM AI projects) → +18% customer retention

First-year ROI (conservative estimate):

CS staffing savings: $2,500/month × 12 = $30,000/year
Higher conversion: hard to measure directly, estimated at $10,000–30,000/year
System cost: $285/month × 12 + $5,000 setup = $8,420/year
Total benefit: $40,000–60,000/year, so net ROI = (benefit − cost) / cost ≈ 375–610% | Payback: 2–3 months
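
The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses net ROI, (benefit − cost) / cost, and a deliberately conservative payback definition: the number of months of benefit needed to cover the whole first-year cost. Both definitions and the helper name `roi_summary` are assumptions of this example.

```python
def roi_summary(
    annual_benefit: float, monthly_cost: float, setup_cost: float
) -> dict[str, float]:
    """First-year cost, net ROI in percent, and conservative payback in months."""
    first_year_cost = monthly_cost * 12 + setup_cost
    net_roi_pct = (annual_benefit - first_year_cost) / first_year_cost * 100
    payback_months = first_year_cost / (annual_benefit / 12)
    return {
        "first_year_cost": first_year_cost,
        "net_roi_pct": net_roi_pct,
        "payback_months": payback_months,
    }


# Low and high scenarios: $30,000 CS savings plus $10,000 or $30,000 conversion gain.
low = roi_summary(annual_benefit=40_000, monthly_cost=285, setup_cost=5_000)
high = roi_summary(annual_benefit=60_000, monthly_cost=285, setup_cost=5_000)
print(round(low["net_roi_pct"]), round(high["net_roi_pct"]))
```

With these inputs the first-year cost is $8,420 in both scenarios, and the net ROI lands at roughly 375% in the low case and 613% in the high case.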

15. Risks and Mitigations

Risk	Severity	Likelihood	Mitigation
Memory contamination: the agent uses another user's memories	Very high	Low (with a correct design)	Strict tenant + user isolation; unit tests for cross-user queries
Stale memory: old preferences no longer apply	High	High	Memory decay TTL + a confidence score that decreases over time
Hallucinated memory: the agent "remembers" something that is not in the store	High	Medium	Inject only verified memories; state clearly in the prompt "use only memories from [KÝ ỨC LIÊN QUAN]"
PII leak in logs/memory	Very high	Medium	Mandatory PII masking pipeline before storage; periodic audits
Redis out-of-memory	High	Medium	LRU eviction policy + monitoring alert at 80% RAM; Redis Cluster
High latency on cold start (pre-loading a lot of memory)	Medium	Medium	Async pre-load; cache top-K profiles; limit recall to top-3
Memories built from bad data (garbage in, garbage out)	High	Medium	Strict importance scoring; human review for importance=5
GDPR non-compliance: deletion not completed in time after a user request	Very high	Low	Automated deletion pipeline; 24h SLA; audit log for every deletion
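
The "confidence score that decreases over time" mitigation for stale memory can be sketched as exponential decay with a half-life. The 180-day default and the name `decayed_confidence` are assumptions chosen for illustration, not values from the table.

```python
def decayed_confidence(
    initial: float, age_days: float, half_life_days: float = 180.0
) -> float:
    """Halve a memory's confidence every `half_life_days`."""
    if age_days < 0 or half_life_days <= 0:
        raise ValueError("age_days must be >= 0 and half_life_days > 0")
    return initial * 0.5 ** (age_days / half_life_days)
```

At retrieval time the decayed score can be multiplied with the similarity score so that an old preference ranks below a fresh one with the same semantic match.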

16. Three-Phase Deployment Roadmap

Phase 1 (Weeks 1–3): In-Context + Session Memory

Goal: the agent never "forgets" within a single working session.

Implement Token Budget Memory with summarization triggered at the 80% threshold
Install Redis/Valkey and design the session schema
Implement RedisSessionStore with a 24h sliding TTL
Integrate session memory into the existing agent loop
Test: the session still loads after reconnecting 1h and 8h later
Monitoring: Redis memory, session hit rate, token usage per session
Measurable KPIs: Session Continuity Rate ≥ 95%, Memory Retrieval Latency ≤ 200ms

Phase 2 (Weeks 4–8): Long-term Memory + User Profiling

Goal: the agent knows who the customer is and remembers the important history.

Deploy the PostgreSQL memory schema (3 main tables)
Implement importance scoring and the memory write policy
Build the user profile accumulation pipeline (updated after every session)
Implement PII masking before writing to any storage
Deploy Qdrant or pgvector for semantic memory
Implement hybrid retrieval (session + semantic in parallel)
Build the GDPR deletion endpoint
Test: right-to-be-forgotten requests complete in < 24h
Measurable KPIs: Memory Relevance Score ≥ 70%, Context Token Efficiency +20%

Phase 3 (Weeks 9–12): Optimization & Advanced Personalization

Goal: a truly personalized experience that runs reliably at scale.

Implement Memory Decay (TTL by importance)
Build the personalization engine: automatically adjust communication style
A/B test: compare CSAT for the agent with and without long-term memory
Optimize hybrid retrieval: cache top profiles, async pre-load
KPI dashboard: memory hit rate, relevance score, noise rate
Set up alerts: cross-tenant query, PII in log, bulk read anomaly
Load test: 100K concurrent users, latency P95 < 80ms
Measurable KPIs: CSAT up by at least 0.3 points, Memory Write Noise Rate ≤ 10%

17. Conclusion and Bridge to Part 6

Memory & Context Management is the foundation of the user experience. It is not a side feature but a prerequisite for an AI Agent to create lasting value:

Without Session Memory → the agent forgets everything when the user refreshes the page
Without Long-term Memory → the agent treats VIP customers like strangers
Without Semantic Memory → the agent cannot "recall" what matters when it is needed
Without a Memory Policy → garbage in, garbage out; PII risk and uncontrolled cost

Three core principles for a successful memory system:

Layer by layer: start with Session Memory (simple, clear ROI), then move on to Long-term and Semantic Memory
Write less, write right: score importance strictly; it is better to miss 30% of memories than to store 80% junk
Privacy first: PII masking and tenant isolation must be requirements from day one, not an afterthought

The next article in the series dives into Planning & ReAct Loop: how an AI Agent goes beyond responding immediately and learns to plan and reason over multiple steps before acting. This is the foundation for complex agents such as automated insurance-claim processing, credit-file analysis, or employee-onboarding orchestration, all problems that require the agent to "think" before it "acts".

Author: AI Agent Series | Updated: 14/05/2026
