Building an AI Agent with Long-Term Memory
I built a prototype AI customer-support agent with semantic long-term memory using ChromaDB (a vector database), Ollama (a local LLM), and TypeScript. The agent remembers conversations across sessions, understands context semantically, and costs nothing to run. It is an MVP that demonstrates the architecture, with full source code included.
1. The Problem: AI Agents That Forget
Traditional chatbots have a critical flaw: they forget.
You tell them your name is Alice on Monday. By Wednesday, they're asking again. You explain your technical background in one conversation. Next session, they re-explain basic concepts you already know.
That's not just annoying; for production AI applications, it's a fatal flaw.
Why do AI agents forget?
Most implementations use one of these approaches:
- Session-only memory — everything vanishes when the tab closes
- SQL keyword search — SELECT * WHERE message LIKE '%payment%' misses semantic meaning
- Full conversation history — works only until you hit the token limit (expensive, hard to scale)
None of these solves the real problem: AI agents need semantic long-term memory.
2. The Solution: Semantic Memory with Vector Embeddings
What if your AI agent could remember the way a human does?
- "payment problem" → recalls the "checkout failure" from three days ago
- "explain it technically" → remembers you're a developer and adjusts its tone
- "What's my name?" → instantly recalls your name from any previous session
That's exactly what I built. Here's how it works.
Architecture: The Memory Stack
┌─────────────────┐
│ Web UI │ ← User interacts here
│ (HTML/CSS/JS) │
└────────┬────────┘
│ HTTP/REST
▼
┌─────────────────┐
│ Express API │ ← Routes requests
│ (TypeScript) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Support Agent │ ← Brain (Genkit + Ollama)
│ (Genkit) │
└────┬───────┬────┘
│ │
▼ ▼
┌─────────┐ ┌──────────┐
│ Memory │ │ Ollama │ ← LLM
│ Manager │ │ Client │
└────┬────┘ └──────────┘
│
▼
┌─────────────────┐
│ ChromaDB │ ← Persistent vector storage
│ (Vector Store) │
└─────────────────┘
Key components:
- Ollama — local LLM (qwen2.5:7b) for chat and embeddings (nomic-embed-text)
- ChromaDB — vector database with HNSW indexing for semantic search
- Genkit — agent framework with session management
- Express — REST API server with CORS and error handling
- TypeScript — type-safe development with strict validation
Total cost: $0 (fully local, no API calls)
Production features:
- LLM-based fact extraction with confidence scoring
- Automatic memory conflict resolution (corrects outdated information)
- Context window management (2000-token limit)
- Type validation and sanitization
- PII extraction (names, emails, phone numbers, locations)
- Sentiment analysis separated from bug reports
3. How Semantic Memory Works
3.1 Memory Storage
When the user says: "My name is Alice"
- Extract the key fact: "Customer's name is Alice"
- Generate an embedding: convert the fact into a 768-dimensional vector with nomic-embed-text
- Store in ChromaDB: save it with metadata (userId, timestamp, importance score)
// Memory storage example
async addMemory(memory: MemoryInput): Promise<string> {
  const memoryId = generateId(); // generate once so we can return it below
  const embedding = await this.ollama.generateEmbedding(memory.content);
  await this.vectorStore.addMemory({
    id: memoryId,
userId: memory.userId,
content: memory.content,
embedding: embedding,
importance: this.calculateImportance(memory),
timestamp: Date.now(),
metadata: memory.metadata
});
return memoryId;
}
Importance scoring:
- Personal information (names, preferences): 0.9–1.0
- Technical issues: 0.7–0.9
- General questions: 0.3–0.5
3.2 Memory Retrieval
When the user asks: "What's my name?"
- Generate a query embedding — convert the question into a 768-dimensional vector
- Semantic search — ChromaDB finds similar vectors using cosine similarity
- Re-rank results — combine relevance, importance, and recency
- Return the top memories — use them as context to generate the answer
// Memory retrieval example
async searchRelevantMemories(
userId: string,
query: string
): Promise<Memory[]> {
// Semantic search in ChromaDB
const results = await this.vectorStore.searchMemories(query, userId, 10);
// Re-rank: relevance(50%) + importance(30%) + recency(20%)
const reRanked = results.map(r => ({
...r,
score: (r.relevance * 0.5) + (r.importance * 0.3) + (r.recency * 0.2)
}));
return reRanked.sort((a, b) => b.score - a.score).slice(0, 5);
}
Re-ranking formula:
final_score = (relevance × 0.5) + (importance × 0.3) + (recency × 0.2)
3.3 Context-Aware Responses
The agent uses the retrieved memories as context:
// Agent response generation
async chat(userId: string, message: string): Promise<ChatResponse> {
  // Get relevant memories (searchRelevantMemories re-ranks and returns the top 5)
  const memories = await this.memoryManager.searchRelevantMemories(
    userId,
    message
  );
// Build context
const context = memories.map(m =>
`[${m.timestamp}] ${m.content} (relevance: ${m.relevance})`
).join('\n');
// Generate response with context
const prompt = `
You are a customer support agent with memory of previous conversations.
CONTEXT FROM MEMORY:
${context}
CURRENT MESSAGE: ${message}
Respond naturally, referencing relevant context when helpful.
`;
const response = await this.ollama.chat([
{ role: 'system', content: prompt },
{ role: 'user', content: message }
]);
// Store this interaction
await this.memoryManager.addInteraction({
userId,
userMessage: message,
assistantMessage: response
});
return { response, context: memories };
}
**Result:** The agent still answers "Your name is Alice!" after a server restart or in a brand-new session.
4. The Magic: Semantic Search vs. Keyword Search
Let's look at the difference in practice.
Test case:
User says: "checkout is broken"
3 days later, the user asks: "Any update on my payment issue?"
Traditional SQL search:
SELECT * FROM conversations
WHERE user_id = 'alice'
AND (message LIKE '%payment%' OR message LIKE '%issue%')
ORDER BY timestamp DESC;
Result: 0 matches ❌ (no keyword overlap)
Semantic vector search:
const embedding = await generateEmbedding("payment issue");
const results = await chromaDB.query(embedding, { userId: 'alice' });
Result: finds "checkout is broken" at 87% similarity ✅
Why it works:
- Both phrases describe purchase-related problems
- Vector embeddings capture semantic meaning
- Cosine similarity: 0.87 (highly related)
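To make the 0.87 figure concrete, here is a minimal sketch of the cosine-similarity math ChromaDB applies when a collection is configured with 'hnsw:space': 'cosine' (the commented embedding calls are illustrative; real nomic-embed-text vectors have 768 dimensions):
// Cosine similarity: dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Hypothetical usage:
// const a = await generateEmbedding('checkout is broken');
// const b = await generateEmbedding('payment issue');
// cosineSimilarity(a, b); // ≈ 0.87 for these two phrases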
5. Implementation Details
5.1 Setting Up ChromaDB
// src/memory/vectorStore.ts
import { ChromaClient, Collection } from 'chromadb';
export class VectorStore {
  private client: ChromaClient;
  private collection: Collection;
  private ollama: OllamaClient; // injected; generates query embeddings in searchMemories below
async initialize(): Promise<void> {
this.client = new ChromaClient();
this.collection = await this.client.getOrCreateCollection({
name: 'customer_support_memories',
metadata: {
'hnsw:space': 'cosine', // Cosine similarity
'hnsw:M': 16, // HNSW graph connections
'hnsw:construction_ef': 200,
'hnsw:search_ef': 50
}
});
}
async addMemory(memory: VectorMemory): Promise<void> {
await this.collection.add({
ids: [memory.id],
embeddings: [memory.embedding],
documents: [memory.content],
metadatas: [{
userId: memory.userId,
importance: memory.importance,
timestamp: memory.timestamp,
category: memory.category
}]
});
}
async searchMemories(
query: string,
userId: string,
limit: number = 5
): Promise<SearchResult[]> {
const queryEmbedding = await this.ollama.generateEmbedding(query);
const results = await this.collection.query({
queryEmbeddings: [queryEmbedding],
nResults: limit,
where: { userId: userId }
});
return results.ids[0].map((id, idx) => ({
id: id,
content: results.documents[0][idx],
relevance: 1 - results.distances[0][idx], // Convert distance to similarity
metadata: results.metadatas[0][idx]
}));
}
}
Key decisions:
- Cosine similarity — the best fit for text embeddings
- HNSW indexing — fast approximate nearest-neighbor search
- User isolation — each user gets a separate memory space
5.2 Embedding Caching
Generating embeddings is expensive. Let's cache them.
// src/models/ollama.ts
import { createHash } from 'crypto';
import { Ollama } from 'ollama';
export class OllamaClient {
  private ollama = new Ollama(); // talks to the local Ollama server (default http://localhost:11434)
  private embeddingCache = new Map<string, number[]>();
  private cacheHits = 0;
  private cacheMisses = 0;
async generateEmbedding(
text: string,
useCache: boolean = true
): Promise<number[]> {
// Create cache key (hash of text)
const cacheKey = createHash('sha256')
.update(text.toLowerCase().trim())
.digest('hex');
// Check cache
if (useCache && this.embeddingCache.has(cacheKey)) {
this.cacheHits++;
return this.embeddingCache.get(cacheKey)!;
}
// Generate embedding via Ollama
this.cacheMisses++;
const response = await this.ollama.embeddings({
model: 'nomic-embed-text',
prompt: text
});
const embedding = response.embedding; // 768-dimensional vector
// Cache it (LRU eviction when > 1000 entries)
if (this.embeddingCache.size >= 1000) {
const firstKey = this.embeddingCache.keys().next().value;
this.embeddingCache.delete(firstKey);
}
this.embeddingCache.set(cacheKey, embedding);
return embedding;
}
getCacheStats() {
const total = this.cacheHits + this.cacheMisses;
return {
hits: this.cacheHits,
misses: this.cacheMisses,
hitRate: total > 0 ? (this.cacheHits / total * 100).toFixed(1) : '0'
};
}
}
Performance impact:
- 40–60% cache hit rate in production
- Embedding generation: ~200ms per call
- Cache retrieval: <1ms
- Net savings: roughly 40–50% of response time
5.3 Memory Manager with Intelligent Extraction
Not everything is worth remembering. Extract the key facts intelligently.
// src/memory/memoryManager.ts
export class MemoryManager {
async addInteraction(interaction: Interaction): Promise<void> {
// Store the raw conversation
await this.vectorStore.addMemory({
content: `User: ${interaction.userMessage}\nAssistant: ${interaction.assistantMessage}`,
type: MemoryType.CONVERSATION,
importance: 0.5
});
// Extract key facts with the LLM
const facts = await this.extractKeyFacts(interaction);
// Store each fact separately, with higher importance
for (const fact of facts) {
await this.vectorStore.addMemory({
content: fact.content,
type: MemoryType.EXTRACTED_FACT,
importance: fact.importance
});
}
}
private async extractKeyFacts(interaction: Interaction): Promise<Memory[]> {
const prompt = `
Extract key facts from this conversation that should be remembered long-term.
Focus on: personal info, preferences, issues, requests, decisions.
Conversation:
User: ${interaction.userMessage}
Assistant: ${interaction.assistantMessage}
Return JSON array of facts:
[{ "content": "Customer's name is Alice", "importance": 0.95 }]
`;
const response = await this.ollama.chat([
{ role: 'system', content: 'You extract key facts from conversations.' },
{ role: 'user', content: prompt }
]);
return JSON.parse(response);
}
private calculateImportance(memory: Memory): number {
let score = 0.5; // Base score
// Personal information
if (memory.content.match(/name is|called|prefer/i)) score += 0.4;
// Technical issues
if (memory.content.match(/bug|error|broken|issue/i)) score += 0.3;
// Strong sentiment
if (memory.content.match(/love|hate|frustrated|excited/i)) score += 0.2;
return Math.min(score, 1.0);
}
}
Memory types:
- CONVERSATION — full exchanges (importance: 0.3–0.5)
- EXTRACTED_FACT — key facts (importance: 0.7–1.0)
- PREFERENCE — user preferences (importance: 0.8–0.9)
- EVENT — significant actions (importance: 0.7–0.9)
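A minimal sketch of how these types might be declared; the enum name follows the MemoryType references in the code above, though the exact declaration in the repo may differ:
// Memory types with their typical importance ranges
export enum MemoryType {
  CONVERSATION = 'conversation',     // full exchanges (0.3–0.5)
  EXTRACTED_FACT = 'extracted_fact', // key facts (0.7–1.0)
  PREFERENCE = 'preference',         // user preferences (0.8–0.9)
  EVENT = 'event'                    // significant actions (0.7–0.9)
}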
5.4 REST API
// src/api/routes.ts
import express from 'express';
const router = express.Router();
// Chat endpoint
router.post('/chat', async (req, res) => {
try {
const { userId, message, sessionId } = req.body;
// Validate input
if (!userId || !message) {
return res.status(400).json({ error: 'userId and message required' });
}
// Get response from agent
const response = await agent.chat(userId, message, sessionId);
res.json({
response: response.response,
memoriesUsed: response.context.length,
      avgRelevance: response.context.length > 0
        ? (response.context.reduce((sum, m) =>
            sum + m.relevance, 0) / response.context.length * 100).toFixed(0)
        : '0', // avoid NaN when no memories were used
responseTime: response.metadata.responseTime,
timestamp: response.timestamp
});
} catch (error) {
console.error('Chat error:', error);
res.status(500).json({ error: 'Internal server error' });
}
});
// Get memories
router.get('/memories/:userId', async (req, res) => {
try {
const { userId } = req.params;
const limit = parseInt(req.query.limit as string) || 10;
const memories = await memoryManager.getRecentMemories(userId, limit);
res.json({ memories, count: memories.length });
} catch (error) {
res.status(500).json({ error: 'Failed to fetch memories' });
}
});
// Delete user data (GDPR compliance)
router.delete('/memories/:userId', async (req, res) => {
try {
const { userId } = req.params;
await memoryManager.deleteUserData(userId);
res.json({ message: 'User data deleted successfully' });
} catch (error) {
res.status(500).json({ error: 'Failed to delete user data' });
}
});
// Health check
router.get('/health', async (req, res) => {
const ollamaStatus = await ollama.isAvailable();
const memoryCount = await memoryManager.getMemoryCount();
res.json({
status: 'healthy',
ollama: ollamaStatus ? 'connected' : 'disconnected',
memories: memoryCount,
uptime: process.uptime()
});
});
export default router;
6. Production-Grade Features
6.1 Intelligent Fact Extraction
The system uses LLM-based extraction (qwen2.5:7b) backed by comprehensive pattern matching:
// Extraction prompt covers:
- Names: "my name is X", "I am X", "call me X"
- Contact: "my email is X", "call me at Y", "my phone is Z"
- Location: "I'm from X", "I live in Y"
- Problems: "X is broken", "error with X"
- Requests: "I need X", "can you add X"
- Preferences: "I prefer X", "I like X"
- Sentiment: "I love X", "I hate Y", "frustrated with Z"
Example:
User: "Hi, I'm Alice and my email is alice@example.com"
✅ Extracted 2 facts:
1. "User's name is Alice" (importance: 1.0, confidence: 1.0)
2. "User's email is alice@example.com" (importance: 1.0, confidence: 1.0)
6.2 Automatic Memory Conflict Resolution
The system detects and resolves conflicting information:
User: "My name is Alice"
→ Stored: "User's name is Alice"
User: "Actually, it's spelled Alicia"
→ Detects conflict with existing name memory
→ Deletes old memory
→ Stores corrected memory with version tracking
User: "What's my name?"
→ Agent: "Your name is Alicia" ✅
Conflict detection handles:
- Name corrections and changes
- Email updates
- Company/job changes
- Location updates
- Phone number changes
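The resolution code isn't shown in the article; this is a minimal sketch of the detect-and-replace flow, assuming a hypothetical deleteMemory helper alongside the searchMemories API from section 5.1:
// Sketch: replace a semantically similar fact of the same category,
// keeping the audit-trail metadata shown in section 6.7
async function resolveConflict(userId: string, newFact: VectorMemory): Promise<void> {
  const similar = await vectorStore.searchMemories(newFact.content, userId, 3);
  const conflict = similar.find(m =>
    m.metadata.category === newFact.category && m.relevance > 0.85
  );
  if (conflict) {
    await vectorStore.deleteMemory(conflict.id); // hypothetical helper
    newFact.metadata = {
      ...newFact.metadata,
      replacedMemoryId: conflict.id,
      replacedAt: Date.now()
    };
  }
  await vectorStore.addMemory(newFact);
}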
6.3 Robust Validation Pipeline
Every extracted fact passes through strict validation:
// Validation checks:
✅ Required fields (content, type, category)
✅ Type enum validation (preference, event, sentiment, extracted_fact)
✅ Category enum validation (bug_report, feature_request, general, etc.)
✅ Content length limits (max 500 characters)
✅ Importance score range (0-1)
✅ Confidence score range (0-1)
✅ Minimum confidence threshold (>0.5)
Invalid types are corrected automatically:
LLM returns type: "contact" (invalid)
→ System corrects to: "extracted_fact" ✅
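A sketch of what this pipeline might look like in code; the field and constant names are assumptions based on the checks listed above:
interface ExtractedFact {
  content: string; type: string; category: string;
  importance: number; confidence: number;
}
const VALID_TYPES = ['preference', 'event', 'sentiment', 'extracted_fact'];
const VALID_CATEGORIES = ['bug_report', 'feature_request', 'general'];

function validateFact(fact: ExtractedFact): ExtractedFact | null {
  if (!fact.content || !fact.type || !fact.category) return null; // required fields
  if (fact.content.length > 500) return null;                     // content length limit
  if (fact.confidence < 0.5) return null;                         // confidence threshold
  if (!VALID_TYPES.includes(fact.type)) fact.type = 'extracted_fact';       // auto-correct
  if (!VALID_CATEGORIES.includes(fact.category)) fact.category = 'general'; // auto-correct
  fact.importance = Math.min(Math.max(fact.importance, 0), 1);    // clamp to [0, 1]
  return fact;
}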
6.4 Context Window Management
Context overflow is avoided through intelligent prioritization:
// Token estimation: ~4 characters = 1 token
// Max context: 2000 tokens
Priority order:
1. Customer Profile (preferences, facts) - Most important
2. Known Issues (bug reports) - High priority
3. Feature Requests - Medium priority
4. Sentiment History - Context
5. Recent Interactions (last 3) - Temporal context
Logs show token usage:
[Context] Built context: 64/2000 tokens, 3 memories
[Context] Built context: 1847/2000 tokens, 47 memories
... (15 items truncated due to context limit)
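A minimal sketch of the context builder implied by those logs, using the ~4-characters-per-token estimate; the priority map and memory shape are assumptions:
const MAX_CONTEXT_TOKENS = 2000;
const estimateTokens = (text: string) => Math.ceil(text.length / 4); // ~4 chars per token

// Assemble memories in priority order until the token budget is spent
function buildContext(memories: Array<{ content: string; category: string }>): string {
  const priority: Record<string, number> = {
    profile: 1, bug_report: 2, feature_request: 3, sentiment: 4, conversation: 5
  };
  const sorted = [...memories].sort(
    (a, b) => (priority[a.category] ?? 9) - (priority[b.category] ?? 9)
  );
  let used = 0;
  const lines: string[] = [];
  for (const m of sorted) {
    const cost = estimateTokens(m.content);
    if (used + cost > MAX_CONTEXT_TOKENS) break;
    used += cost;
    lines.push(m.content);
  }
  console.log(`[Context] Built context: ${used}/${MAX_CONTEXT_TOKENS} tokens, ${lines.length} memories`);
  return lines.join('\n');
}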
6.5 Sentiment Analysis
Sentiment context is extracted independently, separate from bug reports:
User: "The app crashes when I login, I'm so frustrated"
✅ Extracted 2 facts:
1. "User reported issue: app crashes"
(type: extracted_fact, category: bug_report, importance: 0.85)
2. "User expressed frustration about app crashing"
(type: sentiment, category: general, importance: 0.7)
This makes it possible to track user-satisfaction trends on their own, without conflating them with technical issues.
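As an illustration, a sketch of what that trend tracking could look like; getMemoriesByType is a hypothetical helper, and the keyword scoring stands in for a real sentiment model:
// Naive satisfaction trend: +1 / -1 per sentiment memory, ordered by time
async function sentimentTrend(userId: string): Promise<number[]> {
  const sentiments = await memoryManager.getMemoriesByType(userId, 'sentiment'); // hypothetical
  return sentiments
    .sort((a, b) => a.timestamp - b.timestamp)
    .map(m => /love|excited|happy/i.test(m.content) ? 1
            : /hate|frustrated|angry/i.test(m.content) ? -1 : 0);
}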
6.6 Confidence Scoring
Every extraction carries a confidence score:
{
"content": "User's name is Alice",
"confidence": 0.98, // High confidence
"importance": 0.95,
"type": "preference"
}
Facts with confidence below 0.5 are rejected automatically, which keeps quality high.
6.7 Memory Versioning and Audit Trail
Memory updates are fully tracked:
{
"content": "User's name is Alicia",
"metadata": {
"replacedMemoryId": "5c67f3fd-7751-4d10-bd02-5984ea6cf8bf",
"replacedAt": 1771042850123,
"confidence": 0.98,
"extractedFrom": "llm",
"originalMessage": "Actually, it's spelled Alicia..."
}
}
7. Use Cases
This architecture works well for:
Personal AI assistants
- Remember your preferences, schedule, and habits
- Adapt to your communication style
- Track tasks and goals across sessions
Healthcare applications
- Patient history with HIPAA-friendly local data storage
- Medical context carried across visits
- Personalized treatment suggestions
Education and tutoring
- A student's learning style and progress
- Difficulty adjusted to performance
- Remembered misconceptions, so they can be corrected
Sales CRM
- Customer conversation history
- Deal stages and objections
- Relationship insights
Research assistants
- Paper summaries and connections
- Research questions and findings
- Literature review management
8. Closing Thoughts
Building an AI agent with semantic long-term memory isn't just possible; it's practical and affordable.
With ChromaDB, Ollama, and TypeScript, you can create a working prototype that:
- Remembers conversations across sessions
- Understands semantic meaning, not just keywords
- Costs $0 to run (fully local)
- Respects user privacy (data never leaves your server)
This prototype demonstrates the core architecture. For a production deployment, you would add:
- Session persistence (SQLite/PostgreSQL)
- Authentication and authorization
- Monitoring and observability
- Rate limiting and error handling
- Load testing and optimization
The future of AI agents isn't bigger models; it's better memory.
Original article: Building an AI Agent with Long-Term Memory: ChromaDB + Ollama + TypeScript