Six Rules for AI Context Management

How do you keep only the useful context while still recalling key facts across multiple conversations and restarts? This article walks through six core rules.

Long chats bloat: tool calls pile up and the prompt gets heavy. Quality drops, latency climbs, and costs spike.

The challenge: keep only the useful context while still recalling key facts across multiple conversations and restarts.

The solution: six rules working together:

  1. Three buffers (recent turns, a running summary, an entity ledger)
  2. An agent graph with resumable state
  3. Hot vs. cold memory (swapped in and out)
  4. Adaptive retrieval (gate → draft → critique)
  5. HyDE query expansion (stronger recall for thin queries)
  6. Compression + reranking (more signal per token)

Products that show these patterns in action:

  • Cursor: thread summaries and fast context around your latest edits.
  • Claude Code / Copilot Chat: scoped to the active files/diffs, with lightweight state.
  • Notion AI: section-level summaries so you can pull in only the parts you need.
  • Linear/Height AI assistants: entity-style memory (issue IDs, assignees, dates) that informs answers.

Treat them as mental models for your own stack.

All LangGraph examples avoid naming collisions: state keys are nouns, node IDs are verbs. Swap in your own model/API key.

1. Three Buffers (Recent, Summary, Entities)

A huge history turns into noise. Keep three channels: a small window of recent turns, a short factual summary, and an entity ledger (people/orgs/times/IDs). Feed the model only what it needs.

Example (LangChain, minimal and direct)

# pip install -U langchain langchain-openai  
from langchain_openai import ChatOpenAI  
from langchain_core.prompts import PromptTemplate  

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  
recent = []           # ["U: ...", "A: ..."]  
summary = ""          # <= 8 lines  
entities = {}         # {"person": ["Priya"], "org": ["Acme"], "when": ["Tue 3pm"]}  
prompt = PromptTemplate.from_template("""Be concise and factual.  
Entities: {entities}  
Summary: {summary}  
Recent:  
{recent}  
User: {q}  
Assistant:""")  
def ask(q: str):  
    global summary, recent  
    out = (prompt | llm).invoke({  
        "entities": entities,  
        "summary": summary,  
        "recent": "\n".join(recent[-8:]),  
        "q": q,  
    }).content  
    recent += [f"U: {q}", f"A: {out}"]  
    summary = llm.invoke(  
        f"Summarize key facts only (<=8 lines):\n{summary}\n\nLatest:\nU:{q}\nA:{out}\n---\nSummary:"  
    ).content  
    return out
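
The snippet above reads `entities` but never refreshes it. Here is a hypothetical helper (not part of the original example) that keeps the ledger current by extracting entities from each turn and merging them in, reusing the same `llm`:

import json

def update_entities(q: str, a: str):
    # Ask the model for new entities from the latest exchange; merge without duplicates.
    raw = llm.invoke(
        'Extract entities from this exchange as JSON with keys "person", "org", "when" '
        "(lists of strings). Return {} if none.\n"
        f"U: {q}\nA: {a}"
    ).content
    try:
        for kind, values in json.loads(raw).items():
            if not isinstance(values, list):
                continue
            bucket = entities.setdefault(kind, [])
            bucket.extend(v for v in values if v not in bucket)
    except json.JSONDecodeError:
        pass  # model did not return clean JSON; keep the existing ledger

# Usage sketch: out = ask(q); update_entities(q, out)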

2. Checkpoint Plans, Not Chatter

Persist plans/steps/results so a crash or retry doesn't replay the whole thread.

Example (LangGraph + checkpointer)

# pip install -U langgraph  
from typing import TypedDict  
from langgraph.graph import StateGraph, END  
from langgraph.checkpoint.memory import MemorySaver  

class AState(TypedDict):  
    plan: str  
    step: int  
    result: str  
    query: str  
def plan_step(s: AState):
    return {"plan": "1) search 2) answer", "step": 1}
def act_step(s: AState):
    return {"result": f"(search for '{s['query']}')", "step": 2}
def finalize_step(s: AState):
    return {"result": f"Answer based on {s['result']}"}
g = StateGraph(AState)  
g.add_node("plan_step", plan_step)  
g.add_node("act_step", act_step)  
g.add_node("finalize_step", finalize_step)  
g.set_entry_point("plan_step")  
g.add_edge("plan_step", "act_step")  
g.add_edge("act_step", "finalize_step")  
g.add_edge("finalize_step", END)  
app = g.compile(checkpointer=MemorySaver())  
cfg = {"configurable": {"thread_id": "ctx-graph-1"}}  
state = app.invoke({"query":"Acme India HQ","step":0,"plan":"","result":""}, config=cfg)  
print(state["result"])
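
Recovery is the point of the checkpointer, so it helps to see what actually persists. A minimal sketch, assuming the `app` and `cfg` above (`MemorySaver` lives in process memory; for real crash recovery you would swap in a durable checkpointer such as LangGraph's SQLite saver):

# Read back the persisted state for thread "ctx-graph-1" without re-running the graph.
snapshot = app.get_state(cfg)
print(snapshot.values["step"], snapshot.values["result"])
# If a run is interrupted mid-graph, resuming with the same thread_id and a None input
# continues from the last checkpoint instead of replaying finished nodes:
# resumed = app.invoke(None, config=cfg)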

3. Hot vs. Cold Memory

Hot = a small raw window. Cold = notes/summaries/tool outputs in storage. Swap between them to keep the prompt lean.

Example (a small spiller + recall)

from collections import deque  
from langchain_openai import ChatOpenAI  
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  

HOT_MAX_CHARS = 1500  
hot = deque([], maxlen=12)  # raw turns  
cold = []                   # summarized notes  
def spill():  
    global hot, cold  
    while sum(len(x) for x in hot) > HOT_MAX_CHARS and len(hot) > 6:  
        cold.append(" | ".join(list(hot)[:4]))  
        for _ in range(4): hot.popleft()  
def recall(q: str):  
    return [n for n in cold if any(w in n.lower() for w in q.lower().split())][:2]  
def answer(q: str):  
    ctx = "\n".join(recall(q))  
    hist = "\n".join(list(hot)[-6:])  
    out = llm.invoke(f"Notes:\n{ctx}\n\nRecent:\n{hist}\n\nUser:{q}\nAnswer:").content  
    hot.extend([f"U:{q}", f"A:{out}"])  
    spill()  
    return out
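
A quick usage sketch with made-up turns (same objects as above): once the character budget is exceeded, the oldest exchanges spill into `cold` and come back into the prompt only when a query mentions them.

for i in range(6):
    answer(f"Note {i}: the Acme kickoff is scheduled for Friday")  # may trigger spill()
print(len(hot), len(cold))                   # hot stays bounded; cold collects summarized batches
print(recall("When is the Acme kickoff?"))   # at most two matching cold notes are pulled back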

4. Adaptive Retrieval (Gate → Draft → Critique)

Don't fetch documents unless you need to. Gate first. If retrieval is needed, retrieve, draft, then critique against the sources.

Example (LangGraph)

# pip install -U langgraph langchain langchain-openai faiss-cpu  
from typing import TypedDict, List  
from langgraph.graph import StateGraph, END  
from langgraph.checkpoint.memory import MemorySaver  
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  
from langchain_community.vectorstores import FAISS  

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  
emb = OpenAIEmbeddings()  
store = FAISS.from_texts([  
    "Acme policy on leave",  
    "Acme address Bangalore: MG Road, Bengaluru 560001",  
    "Client Priya notes"  
], emb)  
retriever = store.as_retriever(search_kwargs={"k": 4})  
class RState(TypedDict):  
    q: str  
    ctx: List[str]  
    draft_text: str  
    final_answer: str  
def gate_node(s: RState):
    # Ask the model whether retrieval is needed; only hit the vector store on YES.
    decision = llm.invoke(f"Q: {s['q']}\nNeed external retrieval? YES or NO.").content.strip().upper()
    ctx = [d.page_content for d in retriever.get_relevant_documents(s["q"])] if "YES" in decision else []
    return {"ctx": ctx}
def write_draft(s: RState):  
    draft = llm.invoke(f"Context:\n{s['ctx']}\n\nQ:{s['q']}\nDraft:").content  
    return {"draft_text": draft}  
def critique(s: RState):  
    final = llm.invoke(f"Draft:\n{s['draft_text']}\nSources:\n{s['ctx']}\nRevise if needed; else return draft.").content  
    return {"final_answer": final}  
g = StateGraph(RState)  
g.add_node("gate_node", gate_node)  
g.add_node("write_draft", write_draft)  
g.add_node("critique", critique)  
g.set_entry_point("gate_node")  
g.add_edge("gate_node","write_draft")  
g.add_edge("write_draft","critique")  
g.add_edge("critique", END)  
app = g.compile(checkpointer=MemorySaver())  
cfg = {"configurable": {"thread_id": "selfrag-1"}}  
out = app.invoke({"q":"Where is Acme's Bangalore office?","ctx":[],"draft_text":"","final_answer":""}, config=cfg)  
print(out["final_answer"])
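
A quick check of the gating path, assuming the compiled `app` above: a question that needs no documents should leave `ctx` empty, so nothing is retrieved or stuffed into the prompt.

out2 = app.invoke(
    {"q": "Reply with the single word OK.", "ctx": [], "draft_text": "", "final_answer": ""},
    config={"configurable": {"thread_id": "selfrag-2"}},
)
print(out2["ctx"])            # expected: [] when the gate answers NO
print(out2["final_answer"])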

5. HyDE Query Expansion

Thin queries retrieve poorly. Generate a short hypothetical answer, embed that answer, then retrieve the real documents that sit close to it.

Example (LangGraph)

from typing import TypedDict, List  
from langgraph.graph import StateGraph, END  
from langgraph.checkpoint.memory import MemorySaver  
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  
from langchain_community.vectorstores import FAISS  

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  
emb = OpenAIEmbeddings()  
store = FAISS.from_texts(["Tue 3pm meeting with Priya", "Room 201", "Agenda: status + next steps"], emb)  
class HState(TypedDict):  
    q: str  
    pseudo_text: str  
    ctx: List[str]  
    answer_text: str  
def make_pseudo(s: HState):  
    p = llm.invoke(f"Write a concise hypothetical answer (<=2 lines) to:\n{s['q']}").content  
    return {"pseudo_text": p}  
def retrieve_ctx(s: HState):  
    vec = emb.embed_query(s["pseudo_text"])  
    docs = store.similarity_search_by_vector(vec, k=3)  
    return {"ctx": [d.page_content for d in docs]}  
def compose_answer(s: HState):  
    a = llm.invoke(f"Context:\n{s['ctx']}\n\nQ:{s['q']}\nA:").content  
    return {"answer_text": a}  
g = StateGraph(HState)  
g.add_node("make_pseudo", make_pseudo)  
g.add_node("retrieve_ctx", retrieve_ctx)  
g.add_node("compose_answer", compose_answer)  
g.set_entry_point("make_pseudo")  
g.add_edge("make_pseudo","retrieve_ctx")  
g.add_edge("retrieve_ctx","compose_answer")  
g.add_edge("compose_answer", END)  
app = g.compile(checkpointer=MemorySaver())  
cfg = {"configurable": {"thread_id": "hyde-1"}}  
print(app.invoke({"q":"Meeting time with Priya?","pseudo_text":"","ctx":[],"answer_text":""}, config=cfg)["answer_text"])

6. Compression + Reranking

Don't stuff everything into the prompt. Retrieve broadly, rerank with the model, then compress what finally goes in.

Example (LangChain)

# pip install -U langchain langchain-openai faiss-cpu  
from typing import List  
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  
from langchain_community.vectorstores import FAISS  

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  
store = FAISS.from_texts(  
    ["Doc about pricing", "Doc about schedules", "Doc about addresses", "Doc about leave policy"],  
    OpenAIEmbeddings()  
)  
def rerank(q: str, cands: List[str]) -> List[str]:  
    scored = []  
    for c in cands:  
        s = llm.invoke(f"Query:{q}\nCandidate:{c}\nRelevance 0-1 (float only):").content.strip()  
        try:
            scored.append((float(s), c))
        except ValueError:
            pass  # non-numeric score from the model; skip this candidate
    scored.sort(reverse=True)  
    return [c for _, c in scored[:3]]  
def compress(texts: List[str], max_chars=1200) -> str:  
    out, total = [], 0  
    for t in texts:  
        chunk = t.split("\n")[0][:400]  
        if total + len(chunk) > max_chars: break  
        out.append(chunk); total += len(chunk)  
    return "\n".join(out)  
def answer(q: str):  
    raw = [d.page_content for d in store.similarity_search(q, k=8)]  
    top = rerank(q, raw)  
    ctx = compress(top)  
    return llm.invoke(f"Context:\n{ctx}\n\nQ:{q}\nA:").content  
print(answer("What is the leave policy?"))
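
If you would rather not hand-roll rerank/compress, LangChain also ships a contextual-compression retriever that trims retrieved documents down to the query-relevant parts. A sketch reusing the `llm` and `store` above (import paths can shift between LangChain versions):

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)                    # LLM keeps only relevant spans
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=store.as_retriever(search_kwargs={"k": 8}),  # retrieve broadly first
)
docs = compressed_retriever.invoke("What is the leave policy?")
print([d.page_content for d in docs])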

Repo: akhil-reni/context-management

7. Closing Thoughts

Keep context tight, state explicit, and retrieval selective. You get clearer answers, lower cost, and higher throughput without turning your prompt into a junkyard.


Original article: Context Management — a practical guide for agentic AI

Translated and compiled by 汇智网; please credit the source when reposting.