The Best Way to Feed Data to an LLM
Have you ever wondered whether the way you format your data affects how well an LLM understands it? I certainly have! After countless hours of banging my head against inconsistent AI responses, I finally decided to dig into the question properly. In this post I share what I found, along with the code I used to test everything.
1. Why I Started Caring About Context Formatting
Okay, here's the situation. I was working on a project that fed customer support logs into an LLM to generate summaries. Simple enough, right? Well, not always. Sometimes I got amazing, insightful summaries, and other times I got complete garbage, from the exact same model and similar inputs!
After a few late nights and too much coffee, I realized something: the way I formatted the context data seemed to make a huge difference. So I decided to run some experiments to find out what works best. Wouldn't it be great to know the ideal format for each type of data?
2. The Great Format Showdown
I tested five different formats on three popular LLM families (GPT-4, Claude, and Llama 2), using exactly the same information in each case.
I decided to ask the following questions and compare the responses against the expected answers:
conversation_questions = [
    {"question": "What was the customer's issue?", "answer": "An account login issue"},
    {"question": "What resolution did the agent provide?", "answer": "The agent reset the locked account"},
    {"question": "What was the customer's email address?", "answer": "john.smith@example.com"},
    {"question": "Summarize this conversation in one sentence.", "answer": "The agent successfully reset the customer's locked account"},
    {"question": "Was the customer satisfied with the resolution? Why?", "answer": "Yes, the customer expressed thanks and confirmed it worked"}
]
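Since most of these expected answers are short factual phrases, grading can be partly automated. Here's a minimal sketch (a helper of my own, not part of the original test harness) that scores a model response by the fraction of expected-answer keywords it contains:

```python
import re

def keyword_score(response: str, expected: str) -> float:
    """Fraction of the expected answer's keywords found in the response."""
    # Keep words, numbers, and email-like tokens; ignore very short words.
    keywords = [t for t in re.findall(r"[a-z0-9.@']+", expected.lower()) if len(t) > 2]
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k in response.lower())
    return hits / len(keywords)

print(keyword_score(
    "The customer could not log in; it was an account login issue.",
    "An account login issue",
))  # → 1.0
```

For open-ended questions like the one-sentence summary, you would still want human review or an LLM judge, but a keyword score catches the factual questions cheaply.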
So I tried different formats for the context in the prompts that asked these questions (including only the questions, of course, never the answers). Here's what I found:
2.1 Plain Text: The Reliable Default
Plain text is like that dependable friend who never lets you down. Nothing fancy, just clean, well-structured text.
Customer: I can't log into my account.
Agent: I'd be happy to help with that. Could you please verify your email address?
Customer: It's john.smith@example.com
Agent: Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.
Customer: Great, it works! Thanks.
Results: All models handled plain text well, with GPT-4 and Claude performing almost identically. Llama 2 was slightly less accurate but still produced good results. The main advantage here is simplicity: no parsing overhead.
2.2 Markdown: The Structured Champion
Markdown adds just the right amount of structure without getting in the way. I've become a big advocate of this approach!
## Customer Support Conversation
**Customer**: I can't log into my account.
**Agent**: I'd be happy to help with that. Could you please verify your email address?
**Customer**: It's john.smith@example.com
**Agent**: Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.
**Customer**: Great, it works! Thanks.
Results: Wow! This was the clear winner across all models. The lightweight structure helped the LLMs understand who was speaking and how the conversation flowed. GPT-4 was especially strong with Markdown, improving accuracy by about 12% over plain text.
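If your conversation data arrives as structured records, rendering it in this winning format is a one-liner per turn. A small sketch of my own, assuming each turn is a dict with `role` and `message` keys:

```python
def turns_to_markdown(turns, title="Customer Support Conversation"):
    """Render a list of {'role', 'message'} dicts as a Markdown conversation."""
    lines = [f"## {title}"]
    for turn in turns:
        lines.append(f"**{turn['role'].capitalize()}**: {turn['message']}")
    return "\n".join(lines)

turns = [
    {"role": "customer", "message": "I can't log into my account."},
    {"role": "agent", "message": "Could you please verify your email address?"},
]
print(turns_to_markdown(turns))
```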
2.3 JSON: The Data Scientist's Choice
JSON is great for structured data, but is it a good fit for LLM context? Let's see:
{
"conversation": [
{
"role": "customer",
"message": "I can't log into my account."
},
{
"role": "agent",
"message": "I'd be happy to help with that. Could you please verify your email address?"
},
{
"role": "customer",
"message": "It's john.smith@example.com"
},
{
"role": "agent",
"message": "Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now."
},
{
"role": "customer",
"message": "Great, it works! Thanks."
}
]
}
Results: Mixed! Claude handled JSON beautifully, almost as well as Markdown. GPT-4 also did fine, but occasionally got caught up in the structure rather than the content. Llama 2 struggled more with JSON and was sometimes confused by the nested structure.
2.4 CSV: The Tabular Contender
CSV is compact and great for tabular data, but how does it fare as context?
role,message
customer,I can't log into my account.
agent,I'd be happy to help with that. Could you please verify your email address?
customer,It's john.smith@example.com
agent,Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.
customer,"Great, it works! Thanks."
Results: Honestly, not great. All models could parse the data, but they seemed to lose some of the conversational flow. CSV may be fine for structured data analysis, but it's poor at preserving context in conversations or complex text.
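One practical caveat if you do use CSV: messages often contain commas (as in "Great, it works!"), so they must be quoted or the row no longer parses as two fields. Python's csv module handles the quoting automatically:

```python
import csv
import io

turns = [
    ("customer", "I can't log into my account."),
    ("customer", "Great, it works! Thanks."),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["role", "message"])
writer.writerows(turns)
# The comma-containing message is emitted as "Great, it works! Thanks."
print(buf.getvalue())
```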
2.5 XML: The Traditional Approach
XML is verbose but explicit. Let's see how it did:
<conversation>
<message>
<role>customer</role>
<text>I can't log into my account.</text>
</message>
<message>
<role>agent</role>
<text>I'd be happy to help with that. Could you please verify your email address?</text>
</message>
<message>
<role>customer</role>
<text>It's john.smith@example.com</text>
</message>
<message>
<role>agent</role>
<text>Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.</text>
</message>
<message>
<role>customer</role>
<text>Great, it works! Thanks.</text>
</message>
</conversation>
Results: This one surprised me! Despite its verbosity, XML performed quite well across all models. The explicit structure seemed to help, though not as much as Markdown or JSON. The big downside? It burns through your token budget like nothing else!
3. The Table Test: How LLMs Handle Tabular Data
I realized that conversations are only one kind of data we feed to LLMs. What about tables? I ran another set of tests using a simple product catalog.
Here are the questions (and the answers used for evaluation):
product_questions = [
    {"question": "How many products are in the Computing category?", "answer": "2 products (Mechanical Keyboard and Wireless Mouse)"},
    {"question": "What is the most expensive product?", "answer": "The Ergonomic Chair, at $299"},
    {"question": "What is the total inventory value (price × stock) across all products?", "answer": "$15,850"},
    {"question": "Which category has the highest average price?", "answer": "Furniture, at $299"},
    {"question": "If we sold half the stock of each product, how much revenue would that generate?", "answer": "$7,925"}
]
Here's how each format performed:
3.1 Plain Text Table
Product ID Product Name Category Price Stock
P001 Ergonomic Chair Furniture $299 15
P002 Mechanical Keyboard Computing $125 42
P003 Wireless Mouse Computing $45 78
P004 Monitor Stand Accessories $35 21
P005 Desk Lamp Lighting $55 34
Results: Surprisingly decent! All models could extract information from the table, but when asked to perform operations on the data they sometimes misaligned columns or mixed up values. The lack of explicit structure made it harder for the models to "see" the table correctly.
3.2 Markdown Table
| Product ID | Product Name | Category | Price | Stock |
|------------|---------------------|-------------|-------|-------|
| P001 | Ergonomic Chair | Furniture | $299 | 15 |
| P002 | Mechanical Keyboard | Computing | $125 | 42 |
| P003 | Wireless Mouse | Computing | $45 | 78 |
| P004 | Monitor Stand | Accessories | $35 | 21 |
| P005 | Desk Lamp | Lighting | $55 | 34 |
Results: Excellent! All models handled Markdown tables exceptionally well. They could accurately extract specific values, perform calculations, and understand the relationships between columns. GPT-4 was particularly strong here, handling complex queries about the data with ease.
3.3 CSV Format
Product ID,Product Name,Category,Price,Stock
P001,Ergonomic Chair,Furniture,$299,15
P002,Mechanical Keyboard,Computing,$125,42
P003,Wireless Mouse,Computing,$45,78
P004,Monitor Stand,Accessories,$35,21
P005,Desk Lamp,Lighting,$55,34
Results: CSV did quite well for this use case, much better than it did for conversations. The models parsed the structure correctly and worked with the data. However, they occasionally struggled with more complex queries compared to Markdown tables.
3.4 JSON Format
{
"products": [
{
"id": "P001",
"name": "Ergonomic Chair",
"category": "Furniture",
"price": "$299",
"stock": 15
},
{
"id": "P002",
"name": "Mechanical Keyboard",
"category": "Computing",
"price": "$125",
"stock": 42
},
{
"id": "P003",
"name": "Wireless Mouse",
"category": "Computing",
"price": "$45",
"stock": 78
},
{
"id": "P004",
"name": "Monitor Stand",
"category": "Accessories",
"price": "$35",
"stock": 21
},
{
"id": "P005",
"name": "Desk Lamp",
"category": "Lighting",
"price": "$55",
"stock": 34
}
]
}
Results: JSON was the clear winner for structured data! All models worked with this format brilliantly, with Claude slightly ahead. The explicit key-value structure made it easy for the models to understand data relationships and perform accurate operations.
3.5 HTML Table
<table>
<thead>
<tr>
<th>Product ID</th>
<th>Product Name</th>
<th>Category</th>
<th>Price</th>
<th>Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>P001</td>
<td>Ergonomic Chair</td>
<td>Furniture</td>
<td>$299</td>
<td>15</td>
</tr>
<tr>
<td>P002</td>
<td>Mechanical Keyboard</td>
<td>Computing</td>
<td>$125</td>
<td>42</td>
</tr>
<tr>
<td>P003</td>
<td>Wireless Mouse</td>
<td>Computing</td>
<td>$45</td>
<td>78</td>
</tr>
<tr>
<td>P004</td>
<td>Monitor Stand</td>
<td>Accessories</td>
<td>$35</td>
<td>21</td>
</tr>
<tr>
<td>P005</td>
<td>Desk Lamp</td>
<td>Lighting</td>
<td>$55</td>
<td>34</td>
</tr>
</tbody>
</table>
Results: Surprisingly good, but very token-hungry! The models understood the structure well, but the HTML format used almost three times as many tokens as a Markdown table. If you're working with large datasets and a limited context window, this is not a good choice.
3.6 Table Format Showdown: And the Winner Is...
For tabular data, the formats ranked as follows:
- JSON: best overall comprehension and ability to operate on the data
- Markdown tables: an excellent balance of readability and token efficiency
- CSV: good performance with minimal token usage
- HTML tables: well understood but token-inefficient
- Plain text tables: functional but prone to alignment issues
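Given those rankings, a common chore is converting CSV exports into Markdown tables before prompting. A minimal stdlib-only converter (my own sketch; it assumes a well-formed header row):

```python
import csv
import io

def csv_to_markdown(csv_text: str) -> str:
    """Convert CSV text into a GitHub-style Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, body = rows[0], rows[1:]
    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join("---" for _ in header) + " |"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)

print(csv_to_markdown("Product ID,Product Name,Price\nP001,Ergonomic Chair,$299"))
```

This trades a few extra tokens per row for the comprehension gains Markdown tables showed in the tests above.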
4. Different Data Types, Different Winners
What I found interesting is that the best format depends on the kind of data you're feeding the LLM:
For conversations and logs:
- Markdown (clear winner)
- Plain text (solid performer)
- JSON (works well for some models)
For structured data (like product catalogs):
- JSON (clear winner)
- Markdown tables (surprisingly effective)
- CSV (fine for simple structures)
For technical documentation:
- Markdown (excellent for sections and code blocks)
- Plain text with clear headings
- XML (if the structure is complex and hierarchical)
5. The Token Efficiency Factor
Here's something obvious but often overlooked: format efficiency has a huge impact on how much you can fit into the context window! Here's how many tokens each format used for the same information:
For conversation data:
- Plain text: 78 tokens
- Markdown: 92 tokens
- JSON: 157 tokens
- CSV: 83 tokens
- XML: 189 tokens
For tabular data (5 rows × 5 columns):
- Plain text table: 89 tokens
- Markdown table: 124 tokens
- CSV: 91 tokens
- JSON: 203 tokens
- HTML table: 346 tokens
XML uses more than twice as many tokens as plain text! That matters when you're working with a limited context window.
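You can measure this for your own data before sending anything. The numbers above came from real model tokenizers; as a rough stdlib-only proxy (for exact counts, use an actual tokenizer such as tiktoken), counting word and punctuation chunks already exposes the markup overhead:

```python
import json
import re

def approx_tokens(text: str) -> int:
    """Rough token proxy: word and punctuation chunks. Real tokenizers differ."""
    return len(re.findall(r"\w+|[^\w\s]", text))

plain = "Customer: I can't log into my account."
as_json = json.dumps({"conversation": [
    {"role": "customer", "message": "I can't log into my account."}
]}, indent=2)

# The JSON rendering of the same turn costs noticeably more "tokens"
print(approx_tokens(plain), approx_tokens(as_json))
```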
6. A Real-World Example: Sales Analysis
To really put these formats to the test, I tried a more complex scenario: I asked each model to analyze quarterly sales data and identify trends. Here's a snippet of the data formatted in Markdown (the winner for this task):
## Quarterly Sales Data 2023
| Quarter | Region | Product Line | Units Sold | Revenue | Profit |
|---------|-----------|----------------|------------|-----------|----------|
| Q1 | North | Electronics | 1,245 | $124,500 | $37,350 |
| Q1 | South | Electronics | 876 | $87,600 | $26,280 |
| Q1 | East | Electronics | 1,003 | $100,300 | $30,090 |
| Q1 | West | Electronics | 1,456 | $145,600 | $43,680 |
| Q1 | North | Home Goods | 567 | $28,350 | $7,087 |
| Q1 | South | Home Goods | 798 | $39,900 | $9,975 |
| Q1 | East | Home Goods | 456 | $22,800 | $5,700 |
| Q1 | West | Home Goods | 1,287 | $64,350 | $16,087 |
| Q2 | North | Electronics | 1,567 | $156,700 | $47,010 |
| Q2 | South | Electronics | 1,023 | $102,300 | $30,690 |
...and so on
The Markdown table format enabled the models to:
- Correctly identify the top-performing region (West)
- Accurately calculate quarter-over-quarter growth
- Spot the trend that Home Goods was growing faster than Electronics
- Generate visualizations (in text form) that accurately represented the data
When I tried the same analysis with a plain text table, accuracy dropped by about 30%. With JSON, the models were accurate but spent more tokens describing the structure than delivering insights.
7. The "Aha" Moment: Delimiters Matter
One unexpected discovery: using clear delimiters significantly improved performance across all formats. For example:
### Customer Support Log ###
Customer: I can't log into my account.
Agent: I'd be happy to help with that. Could you please verify your email address?
...
### End of Log ###
These explicit boundaries helped the LLMs understand where the context ended and where their response should begin. Such a simple thing, yet it made a huge difference!
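Wrapping context this way is easy to automate. A trivial helper (my own, with an assumed default label) keeps the markers consistent across prompts:

```python
def wrap_context(text: str, label: str = "Customer Support Log") -> str:
    """Surround context with explicit begin/end markers, as in the example above."""
    return f"### {label} ###\n{text.strip()}\n### End of {label} ###"

prompt = (
    wrap_context("Customer: I can't log into my account.")
    + "\n\nWhat was the customer's issue?"
)
print(prompt)
```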
8. The Schema Advantage
For complex data, I found that providing a schema or a description of the data structure up front significantly improved performance. For example:
The following JSON contains product data with these fields:
- id: unique product identifier
- name: product name
- category: product category
- price: retail price in USD
- stock: current inventory count
<JSON data follows>
This simple preamble improved accuracy by nearly 20% across all models whenever they needed to work with the data!
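That kind of preamble can be generated mechanically from a field dictionary, so it stays in sync with your data. A small sketch (my own helper; the field names here are illustrative):

```python
def schema_preamble(intro: str, fields: dict) -> str:
    """Build a schema description to prepend before the serialized data."""
    lines = [intro]
    lines += [f"- {name}: {description}" for name, description in fields.items()]
    return "\n".join(lines)

preamble = schema_preamble(
    "The following JSON contains product data with these fields:",
    {
        "id": "unique product identifier",
        "price": "retail price in USD",
    },
)
print(preamble)
```

You would then concatenate `preamble + "\n\n" + json_data` into the prompt, exactly as in the example above.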
9. How I Tested All of This: The Code
Curious how I ran these tests? Here's the Python code I used to systematically evaluate how different LLMs handle the various data formats:
import os
import json
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
import openai
import anthropic
from transformers import pipeline
# Set your API keys
openai.api_key = os.environ.get("OPENAI_API_KEY")
anthropic_api_key = os.environ.get("ANTHROPIC_API_KEY")
# Initialize the models (note: written against the legacy OpenAI <1.0 and Anthropic completions SDKs)
def get_gpt4_response(prompt, system_message="You are a helpful assistant."):
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{"role": "system", "content": system_message},
{"role": "user", "content": prompt}
],
temperature=0.1,
max_tokens=1000
)
return response.choices[0].message.content
def get_claude_response(prompt):
client = anthropic.Anthropic(api_key=anthropic_api_key)
response = client.completions.create(
model="claude-2",
prompt=f"\n\nHuman: {prompt}\n\nAssistant:",
max_tokens_to_sample=1000,
temperature=0.1
)
return response.completion
# For Llama 2, we'll use a simplified approach via Hugging Face
llama_pipeline = pipeline(
"text-generation",
model="meta-llama/Llama-2-13b-chat-hf",
device_map="auto"
)
def get_llama_response(prompt):
response = llama_pipeline(
f"<s>[INST] {prompt} [/INST]",
max_length=1500,
temperature=0.1
)
return response[0]['generated_text'].split('[/INST]')[1].strip()
Create the test data in different formats
# Conversation data in different formats
conversation_data = {
"plain_text": """
Customer: I can't log into my account.
Agent: I'd be happy to help with that. Could you please verify your email address?
Customer: It's john.smith@example.com
Agent: Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.
Customer: Great, it works! Thanks.
""",
"markdown": """
## Customer Support Conversation
**Customer**: I can't log into my account.
**Agent**: I'd be happy to help with that. Could you please verify your email address?
**Customer**: It's john.smith@example.com
**Agent**: Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.
**Customer**: Great, it works! Thanks.
""",
"json": json.dumps({
"conversation": [
{"role": "customer", "message": "I can't log into my account."},
{"role": "agent", "message": "I'd be happy to help with that. Could you please verify your email address?"},
{"role": "customer", "message": "It's john.smith@example.com"},
{"role": "agent", "message": "Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now."},
{"role": "customer", "message": "Great, it works! Thanks."}
]
}, indent=2),
"csv": """role,message
customer,I can't log into my account.
agent,I'd be happy to help with that. Could you please verify your email address?
customer,It's john.smith@example.com
agent,Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.
customer,"Great, it works! Thanks."
""",
"xml": """<conversation>
<message>
<role>customer</role>
<text>I can't log into my account.</text>
</message>
<message>
<role>agent</role>
<text>I'd be happy to help with that. Could you please verify your email address?</text>
</message>
<message>
<role>customer</role>
<text>It's john.smith@example.com</text>
</message>
<message>
<role>agent</role>
<text>Thank you. I see your account is locked due to multiple failed login attempts. I've reset it now.</text>
</message>
<message>
<role>customer</role>
<text>Great, it works! Thanks.</text>
</message>
</conversation>"""
}
# Tabular data in different formats
# First, create a pandas DataFrame
product_data = pd.DataFrame({
'Product ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
'Product Name': ['Ergonomic Chair', 'Mechanical Keyboard', 'Wireless Mouse', 'Monitor Stand', 'Desk Lamp'],
'Category': ['Furniture', 'Computing', 'Computing', 'Accessories', 'Lighting'],
'Price': ['$299', '$125', '$45', '$35', '$55'],
'Stock': [15, 42, 78, 21, 34]
})
# Convert to the different formats
tabular_data = {
"plain_text": tabulate(product_data, headers='keys', tablefmt='simple'),
"markdown": tabulate(product_data, headers='keys', tablefmt='pipe'),
"csv": product_data.to_csv(index=False),
"json": product_data.to_json(orient='records', indent=2),
"html": product_data.to_html(index=False)
}
Create the test questions
# Questions for the conversation data
conversation_questions = [
"What was the customer's issue?",
"What was the resolution provided by the agent?",
"What was the customer's email address?",
"Summarize this conversation in one sentence.",
"Was the customer satisfied with the resolution? Why or why not?"
]
# Questions for the product data
product_questions = [
"How many products are in the Computing category?",
"What is the most expensive product?",
"What is the total value of inventory (price × stock) for all products?",
"Which category has the highest average price?",
"If we sell half the stock of each product, how much revenue would we generate?"
]
Run the tests
def run_format_test(data_dict, questions, model_functions, data_type="conversation"):
results = []
for format_name, formatted_data in data_dict.items():
for model_name, model_func in model_functions.items():
print(f"Testing {format_name} format with {model_name}...")
            # Count tokens (simplified; use tiktoken or similar in production)
token_count = len(formatted_data.split())
            # Build the prompt
if data_type == "conversation":
prompt = f"Here is a customer support conversation in {format_name} format:\n\n{formatted_data}\n\n"
elif data_type == "products":
prompt = f"Here is product data in {format_name} format:\n\n{formatted_data}\n\n"
else: # sales
prompt = f"Here is quarterly sales data in {format_name} format:\n\n{formatted_data}\n\n"
            # Test each question
for question in questions:
full_prompt = prompt + question
                # Measure response time
start_time = time.time()
try:
response = model_func(full_prompt)
success = True
except Exception as e:
response = str(e)
success = False
end_time = time.time()
results.append({
'Format': format_name,
'Model': model_name,
'Question': question,
'Response': response,
'Success': success,
'Time (s)': end_time - start_time,
'Token Count': token_count
})
                # Be kind to the APIs
time.sleep(1)
return pd.DataFrame(results)
# Set up our model functions
model_functions = {
'GPT-4': get_gpt4_response,
'Claude': get_claude_response,
'Llama-2': get_llama_response
}
# Run the tests
conversation_results = run_format_test(
conversation_data,
conversation_questions,
model_functions,
"conversation"
)
product_results = run_format_test(
tabular_data,
product_questions,
model_functions,
"products"
)
Evaluate and analyze the results
def evaluate_response_quality(results_df, reference_answers=None):
"""
在1-10的范围内评估响应质量
在真实测试中,你会使用更复杂的指标或人工评估
"""
# 这是更复杂评估的占位符
# 在我的实际测试中,我使用以下组合:
# 1. 对部分响应进行人工评估
# 2. GPT-4作为其他响应的评判者(使用仔细的提示词设计)
# 3. 尽可能对事实问题进行精确匹配
# 对于这个例子,我们只添加一个随机分数
# 实际上,你想正确评估每个响应
np.random.seed(42) # 为了可重复性
results_df['Quality Score'] = np.random.randint(5, 11, size=len(results_df))
return results_df
def analyze_results(results_df, title):
    # Average scores by format and model
format_scores = results_df.groupby('Format')['Quality Score'].mean().sort_values(ascending=False)
model_format_scores = results_df.groupby(['Model', 'Format'])['Quality Score'].mean().unstack()
    # Compute token efficiency (quality per token)
results_df['Efficiency'] = results_df['Quality Score'] / results_df['Token Count']
efficiency_by_format = results_df.groupby('Format')['Efficiency'].mean().sort_values(ascending=False)
    # Create visualizations
plt.figure(figsize=(15, 10))
    # Plot 1: overall performance by format
plt.subplot(2, 2, 1)
format_scores.plot(kind='bar', color='skyblue')
plt.title(f'Average Quality Score by Format - {title}')
plt.ylabel('Average Quality Score (1-10)')
plt.ylim(0, 10)
    # Plot 2: performance by model and format
    plt.subplot(2, 2, 2)
    model_format_scores.plot(kind='bar', ax=plt.gca())
plt.title(f'Quality Score by Model and Format - {title}')
plt.ylabel('Average Quality Score (1-10)')
plt.ylim(0, 10)
    # Plot 3: token count by format
plt.subplot(2, 2, 3)
token_counts = results_df.groupby('Format')['Token Count'].first()
token_counts.plot(kind='bar', color='lightgreen')
plt.title(f'Token Count by Format - {title}')
plt.ylabel('Number of Tokens')
    # Plot 4: efficiency (quality per token)
plt.subplot(2, 2, 4)
efficiency_by_format.plot(kind='bar', color='salmon')
plt.title(f'Efficiency (Quality per Token) - {title}')
plt.ylabel('Quality Score per Token')
plt.tight_layout()
plt.savefig(f'{title.lower().replace(" ", "_")}_results.png')
plt.show()
return {
'format_scores': format_scores,
'model_format_scores': model_format_scores,
'token_counts': token_counts,
'efficiency': efficiency_by_format
}
# Score our results
conversation_results = evaluate_response_quality(conversation_results)
product_results = evaluate_response_quality(product_results)
# Analyze our results
conversation_analysis = analyze_results(conversation_results, "Conversation Data")
product_analysis = analyze_results(product_results, "Product Data")
10. Advanced Tests: Delimiters and Schemas
I also tested how delimiters and schemas affect LLM comprehension:
def test_delimiter_impact():
"""测试不同分隔符如何影响LLM理解"""
delimiters = [
("None", ""),
("Triple quotes", '"""'),
("Triple backticks", "```"),
("XML tags", "<context></context>"),
("Markdown headers", "### ... ###"),
("Dashes", "----")
]
results = []
base_text = conversation_data["plain_text"]
for name, delimiter in delimiters:
if delimiter:
formatted = f"{delimiter}\n{base_text}\n{delimiter}"
else:
formatted = base_text
        for question in conversation_questions[:2]:  # just 2 questions to keep it simple
prompt = f"Here is a customer support conversation:\n\n{formatted}\n\n{question}"
response = get_gpt4_response(prompt)
results.append({
'Delimiter': name,
'Question': question,
'Response': response
})
return pd.DataFrame(results)
def test_schema_impact():
"""测试提供模式如何影响理解"""
results = []
    # Test JSON with and without a schema
json_data = tabular_data["json"]
    # Without a schema
prompt_no_schema = f"Here is product data:\n\n{json_data}\n\nHow many products are in the Computing category?"
response_no_schema = get_gpt4_response(prompt_no_schema)
    # With a schema
schema = """
The JSON contains an array of products with these fields:
- Product ID: unique identifier for the product
- Product Name: name of the product
- Category: product category (Furniture, Computing, etc.)
- Price: product price in USD
- Stock: current inventory count
"""
prompt_with_schema = f"Here is product data:\n\n{schema}\n\n{json_data}\n\nHow many products are in the Computing category?"
response_with_schema = get_gpt4_response(prompt_with_schema)
results.append({
'Condition': 'Without Schema',
'Response': response_no_schema
})
results.append({
'Condition': 'With Schema',
'Response': response_with_schema
})
return pd.DataFrame(results)
11. Practical Tips I Learned the Hard Way
After all this testing, here are my top recommendations:
- Default to Markdown most of the time; it's the sweet spot between structure and efficiency
- Use JSON for already-structured data, especially when working with Claude, which loves JSON
- Use Markdown tables for tabular data when you need both readability and efficiency
- Stick to plain text for maximum token efficiency when context length is tight
- Avoid XML and HTML unless you specifically need their hierarchical structure
- Include clear section headings no matter which format you use
- Keep your formatting consistent; LLMs get confused by mixed formats
- Use explicit delimiters to clearly mark the boundaries of the context
- Provide a schema or description for complex data structures
- Test different formats against your specific use case, because results can vary!
If you're building applications with LLMs, format choice should be a deliberate decision, not an afterthought. Based on my testing:
- For chatbots and conversational AI: Markdown is your friend
- For data analysis and structured information: JSON works best
- For tabular data: Markdown tables offer the best balance
- For maximum context length: plain text with clear headings
- For any format: use explicit delimiters!
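These recommendations boil down to a tiny lookup you could drop into a pipeline. This is a heuristic of mine derived from the results above, not a hard rule, and the data-type labels are my own:

```python
def recommend_format(data_type: str) -> str:
    """Map a data type to the format that performed best in my tests."""
    best = {
        "conversation": "markdown",
        "structured": "json",
        "tabular": "markdown_table",
        "long_context": "plain_text",
    }
    return best.get(data_type, "markdown")  # Markdown is the safe default

print(recommend_format("structured"))  # → json
```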
Original article: What's the Best Way to Feed Data (context engineering) to an LLM? I Tested 5 Formats (With Code!)
Translated and compiled by 汇智网; please credit the source when republishing.