Building an AI Agent with Long-Term Memory

I built a prototype AI customer-support agent with semantic long-term memory using ChromaDB (a vector database), Ollama (a local LLM), and TypeScript. The agent remembers conversations across sessions, understands context semantically, and costs nothing to run. It's an MVP that demonstrates the architecture, with full source code.

1. The Problem: AI Agents That Forget

Traditional chatbots have a critical flaw: they forget.

You tell them your name is Alice on Monday. By Wednesday, they're asking again. You explain your technical background in one conversation; in the next session, they re-explain basic concepts you already know.

This isn't just annoying; for production-grade AI applications, it's a fatal flaw.

Why do AI agents forget?

Most implementations use one of the following approaches:

  • Session-only memory — everything disappears when you close the tab
  • SQL keyword search — SELECT * WHERE message LIKE '%payment%' misses semantic meaning
  • Full conversation history — works until you hit the token limit (expensive and hard to scale)

None of these solves the real problem: AI agents need semantic long-term memory.

2. The Solution: Semantic Memory with Vector Embeddings

What if your AI agent could remember the way a human does?

  • "Payment problem" → recalls the "checkout failure" from three days ago
  • "Explain it technically" → remembers you're a developer and adjusts its tone
  • "What's my name?" → instantly recalls your name from any previous session

That's exactly what I built. Here's how it works.

Architecture: The Memory Stack

┌─────────────────┐
│   Web UI        │  ← User interacts here
│  (HTML/CSS/JS)  │
└────────┬────────┘
         │ HTTP/REST
         ▼
┌─────────────────┐
│  Express API    │  ← Routes requests
│  (TypeScript)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Support Agent  │  ← Brain (Genkit + Ollama)
│   (Genkit)      │
└────┬───────┬────┘
     │       │
     ▼       ▼
┌─────────┐ ┌──────────┐
│ Memory  │ │  Ollama  │  ← LLM
│ Manager │ │  Client  │
└────┬────┘ └──────────┘
     │
     ▼
┌─────────────────┐
│   ChromaDB      │  ← Persistent vector storage
│ (Vector Store)  │
└─────────────────┘

Key components:

  1. Ollama — local LLM (qwen2.5:7b) for chat plus embeddings (nomic-embed-text)
  2. ChromaDB — vector database for semantic search, with HNSW indexing
  3. Genkit — agent framework with session management
  4. Express — REST API server with CORS and error handling
  5. TypeScript — type-safe development with strict validation

Total cost: $0 (fully local, no API calls)

Production features:

  • LLM-based fact extraction with confidence scores
  • Automatic memory conflict resolution (corrects outdated information)
  • Context window management (2000-token limit)
  • Type validation and sanitization
  • PII extraction (names, emails, phone numbers, locations)
  • Sentiment analysis separated from bug reports

3. How Semantic Memory Works

3.1 Storing Memories

When the user says: "My name is Alice"

  • Extract the key fact: "Customer's name is Alice"
  • Generate an embedding: convert it to a 768-dimensional vector with nomic-embed-text
  • Store it in ChromaDB: save it along with metadata (userId, timestamp, importance score)
// Memory storage example
async addMemory(memory: MemoryInput): Promise<string> {
  const embedding = await this.ollama.generateEmbedding(memory.content);
  const memoryId = generateId(); // keep the id so we can return it below

  await this.vectorStore.addMemory({
    id: memoryId,
    userId: memory.userId,
    content: memory.content,
    embedding: embedding,
    importance: this.calculateImportance(memory),
    timestamp: Date.now(),
    metadata: memory.metadata
  });
  return memoryId;
}

Importance scoring:

  • Personal information (names, preferences): 0.9–1.0
  • Technical issues: 0.7–0.9
  • General questions: 0.3–0.5

3.2 Retrieving Memories

When the user asks: "What's my name?"

  1. Generate a query embedding — convert the question into a 768-dimensional vector
  2. Semantic search — ChromaDB finds similar vectors using cosine similarity
  3. Re-rank the results — combine relevance, importance, and recency
  4. Return the top memories — use them as context to generate the answer
// Memory retrieval example
async searchRelevantMemories(
  userId: string,
  query: string
): Promise<Memory[]> {
  // Semantic search in ChromaDB
  const results = await this.vectorStore.searchMemories(query, userId, 10);

  // Re-rank: relevance(50%) + importance(30%) + recency(20%)
  const reRanked = results.map(r => ({
    ...r,
    score: (r.relevance * 0.5) + (r.importance * 0.3) + (r.recency * 0.2)
  }));
  return reRanked.sort((a, b) => b.score - a.score).slice(0, 5);
}

The re-ranking formula:

final_score = (relevance × 0.5) + (importance × 0.3) + (recency × 0.2)
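
The retrieval code assumes each result already carries a recency score, which the listing doesn't compute. One reasonable choice is exponential decay over the memory's age; here is a sketch (the helper and the 30-day half-life are my assumptions):

// Hypothetical recency scoring: exponential decay by age (assumption, not from the source)
// A memory stored just now scores ~1.0; one ~30 days old scores ~0.5.
function recencyScore(timestamp: number, halfLifeDays: number = 30): number {
  const ageDays = (Date.now() - timestamp) / (1000 * 60 * 60 * 24);
  return Math.pow(0.5, ageDays / halfLifeDays);
}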

3.3 Context-Aware Responses

The agent uses the retrieved memories as context:

// Agent response generation
async chat(userId: string, message: string): Promise<ChatResponse> {
  // Get relevant memories
  const memories = await this.memoryManager.searchRelevantMemories(
    userId,
    message // returns the top 5 after re-ranking (see section 3.2)
  );

  // Build context
  const context = memories.map(m =>
    `[${m.timestamp}] ${m.content} (relevance: ${m.relevance})`
  ).join('\n');
  // Generate response with context
  const prompt = `
You are a customer support agent with memory of previous conversations.
CONTEXT FROM MEMORY:
${context}
CURRENT MESSAGE: ${message}
Respond naturally, referencing relevant context when helpful.
  `;
  const response = await this.ollama.chat([
    { role: 'system', content: prompt },
    { role: 'user', content: message }
  ]);
  // Store this interaction
  await this.memoryManager.addInteraction({
    userId,
    userMessage: message,
    assistantMessage: response
  });
  return { response, context: memories };
}

Result: the agent still answers "Your name is Alice!" after a server restart or in a completely new session.

4. The Magic: Semantic Search vs. Keyword Search

Let's look at the difference in practice.

Test case:

The user says "checkout is broken"

Three days later, the user asks: "Any update on my payment issue?"

Traditional SQL search:

SELECT * FROM conversations
WHERE user_id = 'alice'
AND (message LIKE '%payment%' OR message LIKE '%issue%')
ORDER BY timestamp DESC;

Result: 0 matches ❌ (the keywords don't match)

Semantic vector search:

const embedding = await generateEmbedding("payment issue");
const results = await chromaDB.query(embedding, { userId: 'alice' });

Result: finds "checkout is broken" at 87% similarity ✅

Why it works:

  • Both phrases relate to a purchasing problem
  • Vector embeddings capture semantic meaning
  • Cosine similarity: 0.87 (highly related)
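
For intuition, here is a minimal sketch of the cosine-similarity computation ChromaDB performs under the hood (the helper and the usage comment are mine):

// Minimal cosine similarity between two embedding vectors (illustration only)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// cosineSimilarity(embed("checkout is broken"), embed("payment issue")) ≈ 0.87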

5. Implementation Details

5.1 Setting Up ChromaDB

// src/memory/vectorStore.ts
import { ChromaClient, Collection } from 'chromadb';
import { OllamaClient } from '../models/ollama';

export class VectorStore {
  private client: ChromaClient;
  private collection: Collection;
  private ollama: OllamaClient; // used below to embed queries

  async initialize(): Promise<void> {
    this.client = new ChromaClient();

    this.collection = await this.client.getOrCreateCollection({
      name: 'customer_support_memories',
      metadata: {
        'hnsw:space': 'cosine',  // Cosine similarity
        'hnsw:M': 16,             // HNSW graph connections
        'hnsw:construction_ef': 200,
        'hnsw:search_ef': 50
      }
    });
  }

  async addMemory(memory: VectorMemory): Promise<void> {
    await this.collection.add({
      ids: [memory.id],
      embeddings: [memory.embedding],
      documents: [memory.content],
      metadatas: [{
        userId: memory.userId,
        importance: memory.importance,
        timestamp: memory.timestamp,
        category: memory.category
      }]
    });
  }

  async searchMemories(
    query: string,
    userId: string,
    limit: number = 5
  ): Promise<SearchResult[]> {
    const queryEmbedding = await this.ollama.generateEmbedding(query);

    const results = await this.collection.query({
      queryEmbeddings: [queryEmbedding],
      nResults: limit,
      where: { userId: userId }
    });

    return results.ids[0].map((id, idx) => ({
      id: id,
      content: results.documents[0][idx],
      relevance: 1 - results.distances[0][idx], // Convert distance to similarity
      metadata: results.metadatas[0][idx]
    }));
  }
}

Key decisions:

  • Cosine similarity — the best fit for text embeddings
  • HNSW index — fast approximate nearest-neighbor search
  • Per-user isolation — each user gets a separate memory space

5.2 Caching Embeddings

Generating embeddings is expensive, so let's cache them.

// src/models/ollama.ts
import { createHash } from 'crypto';
import { Ollama } from 'ollama';

export class OllamaClient {
  private ollama = new Ollama();
  private embeddingCache = new Map<string, number[]>();
  private cacheHits = 0;
  private cacheMisses = 0;

  async generateEmbedding(
    text: string,
    useCache: boolean = true
  ): Promise<number[]> {
    // Create cache key (hash of text)
    const cacheKey = createHash('sha256')
      .update(text.toLowerCase().trim())
      .digest('hex');
    // Check cache
    if (useCache && this.embeddingCache.has(cacheKey)) {
      this.cacheHits++;
      return this.embeddingCache.get(cacheKey)!;
    }
    // Generate embedding via Ollama
    this.cacheMisses++;
    const response = await this.ollama.embeddings({
      model: 'nomic-embed-text',
      prompt: text
    });
    const embedding = response.embedding; // 768-dimensional vector
    // Cache it (simple FIFO eviction when > 1000 entries; Map preserves insertion order)
    if (this.embeddingCache.size >= 1000) {
      const firstKey = this.embeddingCache.keys().next().value;
      this.embeddingCache.delete(firstKey);
    }
    this.embeddingCache.set(cacheKey, embedding);
    return embedding;
  }
  getCacheStats() {
    const total = this.cacheHits + this.cacheMisses;
    return {
      hits: this.cacheHits,
      misses: this.cacheMisses,
      hitRate: total > 0 ? (this.cacheHits / total * 100).toFixed(1) : '0'
    };
  }
}

Performance impact:

  • 40–60% cache hit rate in production
  • Embedding generation takes roughly 200 ms per call
  • A cache lookup takes <1 ms
  • Net savings: about 40–50% of response time
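
A quick usage sketch (the repeated text and the log call are mine) shows the cache at work:

// Usage sketch: the second call with identical text is served from the cache
const client = new OllamaClient();
await client.generateEmbedding('My name is Alice'); // miss: ~200 ms via Ollama
await client.generateEmbedding('My name is Alice'); // hit: <1 ms from the Map
console.log(client.getCacheStats()); // { hits: 1, misses: 1, hitRate: '50.0' }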

5.3 A Memory Manager with Smart Extraction

Not every piece of information needs to be remembered. Extract the key facts intelligently.

// src/memory/memoryManager.ts
export class MemoryManager {
  async addInteraction(interaction: Interaction): Promise<void> {
    // Record the raw conversation
    await this.vectorStore.addMemory({
      content: `User: ${interaction.userMessage}\nAssistant: ${interaction.assistantMessage}`,
      type: MemoryType.CONVERSATION,
      importance: 0.5
    });

    // Extract key facts with the LLM
    const facts = await this.extractKeyFacts(interaction);

    // Store each fact separately, with higher importance
    for (const fact of facts) {
      await this.vectorStore.addMemory({
        content: fact.content,
        type: MemoryType.EXTRACTED_FACT,
        importance: fact.importance
      });
    }
  }

  private async extractKeyFacts(interaction: Interaction): Promise<Memory[]> {
    const prompt = `
Extract key facts from this conversation that should be remembered long-term.
Focus on: personal info, preferences, issues, requests, decisions.

Conversation:
User: ${interaction.userMessage}
Assistant: ${interaction.assistantMessage}

Return JSON array of facts:
[{ "content": "Customer's name is Alice", "importance": 0.95 }]
    `;

    const response = await this.ollama.chat([
      { role: 'system', content: 'You extract key facts from conversations.' },
      { role: 'user', content: prompt }
    ]);

    return JSON.parse(response);
  }

  private calculateImportance(memory: Memory): number {
    let score = 0.5; // Base score

    // Personal information
    if (memory.content.match(/name is|called|prefer/i)) score += 0.4;

    // Technical issues
    if (memory.content.match(/bug|error|broken|issue/i)) score += 0.3;

    // Strong sentiment
    if (memory.content.match(/love|hate|frustrated|excited/i)) score += 0.2;

    return Math.min(score, 1.0);
  }
}

Memory types:

  • CONVERSATION — the full exchange (importance: 0.3–0.5)
  • EXTRACTED_FACT — key facts (importance: 0.7–1.0)
  • PREFERENCE — user preferences (importance: 0.8–0.9)
  • EVENT — significant actions (importance: 0.7–0.9)
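
The MemoryManager above references a MemoryType enum that the listing doesn't define; a minimal definition matching these four types might look like this (a sketch, not the original source):

// Hypothetical MemoryType enum matching the four types above (assumed)
export enum MemoryType {
  CONVERSATION = 'conversation',
  EXTRACTED_FACT = 'extracted_fact',
  PREFERENCE = 'preference',
  EVENT = 'event'
}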

5.4 The REST API

// src/api/routes.ts
import express from 'express';

const router = express.Router();

// Chat endpoint
router.post('/chat', async (req, res) => {
  try {
    const { userId, message, sessionId } = req.body;

    // Validate input
    if (!userId || !message) {
      return res.status(400).json({ error: 'userId and message required' });
    }

    // Get response from agent
    const response = await agent.chat(userId, message, sessionId);

    res.json({
      response: response.response,
      memoriesUsed: response.context.length,
      // Guard against division by zero when no memories were retrieved
      avgRelevance: response.context.length
        ? (response.context.reduce((sum, m) => sum + m.relevance, 0)
            / response.context.length * 100).toFixed(0)
        : '0',
      responseTime: response.metadata.responseTime,
      timestamp: response.timestamp
    });
  } catch (error) {
    console.error('Chat error:', error);
    res.status(500).json({ error: 'Internal server error' });
  }
});

// Get memories
router.get('/memories/:userId', async (req, res) => {
  try {
    const { userId } = req.params;
    const limit = parseInt(req.query.limit as string) || 10;

    const memories = await memoryManager.getRecentMemories(userId, limit);

    res.json({ memories, count: memories.length });
  } catch (error) {
    res.status(500).json({ error: 'Failed to fetch memories' });
  }
});

// Delete user data (GDPR compliance)
router.delete('/memories/:userId', async (req, res) => {
  try {
    const { userId } = req.params;

    await memoryManager.deleteUserData(userId);

    res.json({ message: 'User data deleted successfully' });
  } catch (error) {
    res.status(500).json({ error: 'Failed to delete user data' });
  }
});

// Health check
router.get('/health', async (req, res) => {
  const ollamaStatus = await ollama.isAvailable();
  const memoryCount = await memoryManager.getMemoryCount();

  res.json({
    status: 'healthy',
    ollama: ollamaStatus ? 'connected' : 'disconnected',
    memories: memoryCount,
    uptime: process.uptime()
  });
});

export default router;
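
For completeness, here is a minimal client-side sketch of calling the chat endpoint (the host, port, and mount path are my assumptions):

// Usage sketch: calling POST /chat (http://localhost:3000 is assumed)
const res = await fetch('http://localhost:3000/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ userId: 'alice', message: "What's my name?" })
});
const data = await res.json();
console.log(data.response);     // e.g. "Your name is Alice!"
console.log(data.memoriesUsed); // how many memories informed the answer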

6. Production-Grade Features

6.1 Smart Fact Extraction

The system uses LLM-based extraction (qwen2.5:7b) combined with comprehensive pattern matching:

// Extraction prompt covers:
- Names: "my name is X", "I am X", "call me X"
- Contact: "my email is X", "call me at Y", "my phone is Z"
- Location: "I'm from X", "I live in Y"
- Problems: "X is broken", "error with X"
- Requests: "I need X", "can you add X"
- Preferences: "I prefer X", "I like X"
- Sentiment: "I love X", "I hate Y", "frustrated with Z"

Example:

User: "Hi, I'm Alice and my email is alice@example.com"
✅ Extracted 2 facts:
1. "User's name is Alice" (importance: 1.0, confidence: 1.0)
2. "User's email is alice@example.com" (importance: 1.0, confidence: 1.0)

6.2 Automatic Memory Conflict Resolution

The system detects and resolves conflicting information:

User: "My name is Alice"
→ Stored: "User's name is Alice"
User: "Actually, it's spelled Alicia"
→ Detects conflict with existing name memory
→ Deletes old memory
→ Stores corrected memory with version tracking
User: "What's my name?"
→ Agent: "Your name is Alicia" ✅

Conflict detection covers:

  • Name corrections/changes
  • Email updates
  • Company/job changes
  • Location updates
  • Phone number changes
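
A minimal sketch of what the resolution step could look like, assuming the VectorStore exposes a deleteMemory method (that method and the 0.85 similarity threshold are my assumptions):

// Hypothetical resolution step; deleteMemory and the threshold are assumptions
async resolveConflict(userId: string, newFact: Memory): Promise<void> {
  // Find semantically similar existing facts (e.g. an older "name" memory)
  const similar = await this.vectorStore.searchMemories(newFact.content, userId, 3);

  for (const old of similar) {
    if (old.relevance > 0.85) { // likely the same attribute, now outdated
      await this.vectorStore.deleteMemory(old.id);
      newFact.metadata = {
        ...newFact.metadata,
        replacedMemoryId: old.id, // audit trail, as shown in section 6.7
        replacedAt: Date.now()
      };
    }
  }
  await this.vectorStore.addMemory(newFact);
}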

6.3 A Robust Validation Pipeline

Every extracted fact passes through strict validation:

// Validation checks:
✅ Required fields (content, type, category)
✅ Type enum validation (preference, event, sentiment, extracted_fact)
✅ Category enum validation (bug_report, feature_request, general, etc.)
✅ Content length limits (max 500 characters)
✅ Importance score range (0-1)
✅ Confidence score range (0-1)
✅ Minimum confidence threshold (>0.5)

Invalid types are corrected automatically:

LLM returns type: "contact"  (invalid)
→ System corrects to: "extracted_fact" ✅
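
A sketch of a validation function implementing the checks above (the names and shapes are mine):

// Hypothetical validator implementing the checks listed above
const VALID_TYPES = ['preference', 'event', 'sentiment', 'extracted_fact'];

function validateFact(fact: {
  content?: string; type?: string; importance?: number; confidence?: number;
}): boolean {
  if (!fact.content || !fact.type) return false;                      // required fields
  if (!VALID_TYPES.includes(fact.type)) fact.type = 'extracted_fact'; // auto-correct invalid type
  if (fact.content.length > 500) return false;                        // content length limit
  if (fact.importance !== undefined &&
      (fact.importance < 0 || fact.importance > 1)) return false;     // importance range
  if ((fact.confidence ?? 0) <= 0.5) return false;                    // minimum confidence threshold
  return true;
}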

6.4 Context Window Management

Smart prioritization prevents context overflow:

// Token estimation: ~4 characters = 1 token
// Max context: 2000 tokens
Priority order:
1. Customer Profile (preferences, facts)     - Most important
2. Known Issues (bug reports)                - High priority
3. Feature Requests                          - Medium priority
4. Sentiment History                         - Context
5. Recent Interactions (last 3)              - Temporal context

The logs show token usage:

[Context] Built context: 64/2000 tokens, 3 memories
[Context] Built context: 1847/2000 tokens, 47 memories
... (15 items truncated due to context limit)
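
A minimal sketch of the budgeting logic behind those log lines, using the ~4-characters-per-token estimate from above (the function shape is my assumption):

// Hypothetical context builder enforcing the 2000-token budget
function buildContext(memories: { content: string }[], maxTokens: number = 2000): string {
  const lines: string[] = [];
  let tokens = 0;

  // Assumes memories arrive pre-sorted in the priority order listed above
  for (const m of memories) {
    const estimate = Math.ceil(m.content.length / 4); // ~4 characters per token
    if (tokens + estimate > maxTokens) break;         // truncate the rest
    lines.push(m.content);
    tokens += estimate;
  }
  console.log(`[Context] Built context: ${tokens}/${maxTokens} tokens, ${lines.length} memories`);
  return lines.join('\n');
}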

6.5 Sentiment Analysis

Sentiment context is extracted on its own, separate from bug reports:

User: "The app crashes when I login, I'm so frustrated"
✅ Extracted 2 facts:
1. "User reported issue: app crashes" 
   (type: extracted_fact, category: bug_report, importance: 0.85)
2. "User expressed frustration about app crashing" 
   (type: sentiment, category: general, importance: 0.7)

This makes it possible to track user-satisfaction trends on their own, without muddying the technical issues.

6.6 Confidence Scoring

Every extraction carries a confidence score:

{
  "content": "User's name is Alice",
  "confidence": 0.98,  // High confidence
  "importance": 0.95,
  "type": "preference"
}

Facts with a confidence score below 0.5 are rejected automatically, which keeps quality high.
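
In code, that gate is a simple filter over the extraction results (a sketch):

// Reject low-confidence facts before they reach the vector store
const accepted = extractedFacts.filter(fact => fact.confidence > 0.5);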

6.7 Memory Versioning and Audit Trail

Memory updates are fully tracked:

{
  "content": "User's name is Alicia",
  "metadata": {
    "replacedMemoryId": "5c67f3fd-7751-4d10-bd02-5984ea6cf8bf",
    "replacedAt": 1771042850123,
    "confidence": 0.98,
    "extractedFrom": "llm",
    "originalMessage": "Actually, it's spelled Alicia..."
  }
}

7. Use Cases

This architecture is a good fit for:

Personal AI assistants

  • Remember your preferences, schedule, and habits
  • Adapt to your communication style
  • Track tasks and goals across sessions

Healthcare applications

  • Patient history with HIPAA-friendly local data storage
  • Medical context that carries across visits
  • Personalized treatment recommendations

Education and tutoring

  • A student's learning style and progress
  • Difficulty that adapts to performance
  • Remembering misconceptions so they can be corrected

Sales CRM

  • Customer conversation history
  • Deal stages and objections
  • Relationship insights

Research assistants

  • Paper summaries and the connections between them
  • Research questions and findings
  • Literature review management

8. Closing Thoughts

Building an AI agent with semantic long-term memory isn't just possible; it's practical and affordable.

With ChromaDB, Ollama, and TypeScript, you can create a working prototype that:

  • Remembers conversations across sessions
  • Understands semantic meaning, not just keywords
  • Costs $0 to run (fully local)
  • Respects user privacy (data never leaves your server)

This prototype demonstrates the core architecture. For a production deployment, you would add:

  • Session persistence (SQLite/PostgreSQL)
  • Authentication and authorization
  • Monitoring and observability
  • Rate limiting and error handling
  • Load testing and optimization

The future of AI agents isn't bigger models; it's better memory.


Original article: Building an AI Agent with Long-Term Memory: ChromaDB + Ollama + TypeScript
