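LLM Evaluation Pipeline Setup Guide

Model evaluation is the "compass" of AI development. This guide covers the full path from running existing frameworks to building your own evaluation system, so that your model evaluations are reproducible and extensible.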

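Overview

Evaluation answers one core question: how well does this model perform on my task?

A complete evaluation pipeline has four stages:

Data loading → Model inference → Metric computation → Result analysis

This guide is organized in three levels of depth:

Quick start: run an evaluation with an existing framework
Going deeper: custom tasks, metrics, and model backends
Building your own pipeline: a scalable evaluation system built from scratch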

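1. Quick Start: Existing Evaluation Frameworks

1.1 EleutherAI lm-evaluation-harness

The most active open-source evaluation framework in the community. It supports 1000+ tasks and is used by the Hugging Face Open LLM Leaderboard, among others.

Installation: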

bash

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Basic usage:

bash

# Evaluate one model on several benchmarks
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa_mc1 \
  --batch_size auto \
  --output_path ./results/llama2-7b

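Key parameters:

Parameter	Description	Example
`--model`	Model backend type	`hf`, `vllm`, `openai-completions`
`--model_args`	Model loading arguments	`pretrained=...,dtype=bfloat16`
`--tasks`	Comma-separated list of tasks	`mmlu,gsm8k,hellaswag`
`--num_fewshot`	Number of few-shot examples	`5` (defaults to the task's own configuration)
`--batch_size`	Inference batch size	`auto` or an explicit number
`--device`	Device to run on	`cuda`, `cuda:0`, `cpu`

Output (simplified for illustration; the exact metric keys depend on the harness version and the task):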

json

{
  "results": {
    "mmlu": {
      "acc": 0.4634,
      "acc_stderr": 0.0041
    },
    "gsm8k": {
      "acc": 0.1251,
      "acc_stderr": 0.0091
    }
  },
  "config": {
    "model": "hf",
    "batch_size": "auto",
    "device": "cuda"
  }
}
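A report like the one above can be flattened into rows for quick inspection. The sketch below assumes the report has the shape shown above. `summarize_results` is a hypothetical helper, not part of lm-evaluation-harness. It also tolerates suffixed metric keys such as `acc,none`, which some harness versions emit.

```python
import json
from typing import Any, Dict, List, Tuple


def summarize_results(report: Dict[str, Any]) -> List[Tuple[str, str, float, float]]:
    """Flatten report["results"] into (task, metric, value, stderr) rows."""
    rows = []
    for task, metrics in sorted(report.get("results", {}).items()):
        for name, value in metrics.items():
            # Skip the stderr entries themselves and any non-numeric fields.
            if "_stderr" in name or not isinstance(value, (int, float)):
                continue
            # Handles "acc" -> "acc_stderr" and "acc,none" -> "acc_stderr,none".
            base, sep, suffix = name.partition(",")
            stderr = metrics.get(f"{base}_stderr{sep}{suffix}")
            if not isinstance(stderr, (int, float)):
                stderr = float("nan")  # stderr missing or reported as text
            rows.append((task, name, float(value), float(stderr)))
    return rows


# Usage with an inline report; in practice, json.load() the saved results file.
report = json.loads('{"results": {"mmlu": {"acc": 0.4634, "acc_stderr": 0.0041}}}')
for task, metric, value, stderr in summarize_results(report):
    print(f"{task:10s} {metric:10s} {value:.4f} (stderr {stderr:.4f})")
```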

1.2 Hugging Face evaluate

A lightweight metric library, well suited to serve as the metric component inside a pipeline.

python

import evaluate

# Load metrics
accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Compute
results = accuracy.compute(
    predictions=[0, 1, 0, 1],
    references=[0, 1, 1, 1]
)
print(results)  # {'accuracy': 0.75}

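1.3 OpenAI Evals

OpenAI's open-source evaluation framework, suited to quick prototyping and model-graded evaluation.

bash

# Install from a clone of the repository
git clone https://github.com/openai/evals.git
cd evals
pip install -e .

# Run an eval
oaieval gpt-4 my-custom-eval

Features:

Driven by YAML/JSON configuration, so many evals need no code
Supports model-graded evals (for example, using GPT-4 to judge output quality)
Built-in matching modes: exact match, fuzzy match, and includes (substring) checks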
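To make the three matching modes concrete, here is a minimal sketch in plain Python. It is not OpenAI Evals' implementation. The function names and the normalization rule (lowercase, strip punctuation, collapse whitespace) are assumptions chosen for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(output: str, expected: str) -> bool:
    # Strict comparison, ignoring only leading/trailing whitespace.
    return output.strip() == expected.strip()


def includes(output: str, expected: str) -> bool:
    # The expected answer must appear verbatim somewhere in the output.
    return expected in output


def fuzzy_match(output: str, expected: str) -> bool:
    # After normalization, either string may contain the other.
    a, b = normalize(output), normalize(expected)
    return bool(a) and bool(b) and (a in b or b in a)


print(exact_match("The capital is Paris.", "Paris"))  # False
print(includes("The capital is Paris.", "Paris"))     # True
print(fuzzy_match("paris!", "The capital is Paris"))  # True
```

Stricter modes produce fewer false positives but penalize verbose answers, so the right mode depends on how constrained the model's output format is.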

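2. Common Benchmarks

2.1 Knowledge and Reasoning

Benchmark	Task type	Size	Core capability
MMLU	Multiple choice across 57 subjects	15,908 questions	World knowledge, subject understanding
GSM8K	Grade-school math word problems	8.5K problems	Multi-step mathematical reasoning
BBH	23 hard tasks	6,511 examples	Complex reasoning, causal inference
HumanEval	Python programming	164 problems	Code generation
TruthfulQA	Adversarial question answering	817 questions	Avoiding imitative falsehoods
HellaSwag	Commonsense reasoning	70K examples	Commonsense about the physical world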
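Benchmark size determines how much a score can move by chance. The sketch below uses the binomial standard error of an accuracy, sqrt(p(1-p)/n), with sizes taken from the table above. It treats each size as the number of scored items, which overstates n wherever only a test split is scored, and the accuracy of 0.5 is an arbitrary worst-case assumption.

```python
import math


def accuracy_stderr(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


# Sizes from the table above; 0.5 maximizes p * (1 - p), i.e. the widest interval.
sizes = {"MMLU": 15908, "GSM8K": 8500, "BBH": 6511, "TruthfulQA": 817, "HumanEval": 164}
for name, n in sizes.items():
    half_width = 1.96 * accuracy_stderr(0.5, n)
    print(f"{name:10s} n={n:6d}  95% CI half-width ~ {half_width:.3f}")
```

On a small benchmark such as HumanEval, a gap of a few points between two models can sit entirely inside this interval.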

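2.2 Reference Configurations

bash

# General academic evaluation (Open LLM Leaderboard-style task set).
# Note: --num_fewshot applies one value to every task, whereas the original
# leaderboard used per-task shot counts; run tasks separately to match it.
lm_eval --model hf \
  --model_args pretrained=your-model,dtype=bfloat16 \
  --tasks mmlu,arc_challenge,hellaswag,truthfulqa_mc1,winogrande \
  --batch_size auto:4 \
  --num_fewshot 5 \
  --output_path ./results

# Math reasoning
lm_eval --model hf \
  --model_args pretrained=your-model \
  --tasks gsm8k,minerva_math,mathqa \
  --num_fewshot 5 \
  --output_path ./results/math

# Code generation (these tasks execute model-generated code, so run them in a sandbox)
lm_eval --model hf \
  --model_args pretrained=your-model \
  --tasks humaneval,mbpp \
  --batch_size 1 \
  --output_path ./results/code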

3. Custom Evaluation Tasks

3.1 Adding a Task to lm-evaluation-harness

Step 1: Create a YAML configuration file

yaml

# lm_eval/tasks/my_custom_task/my_custom_task.yaml
task: my_custom_task
dataset_path: json
dataset_name: null
dataset_kwargs:
  data_files:
    test: path/to/test.jsonl
test_split: test  # split to evaluate on; must match a key under data_files
output_type: multiple_choice  # or generate_until, loglikelihood
doc_to_text: "{{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_target: "{{answer}}"  # index into doc_to_choice
doc_to_choice: ["A", "B", "C", "D"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
num_fewshot: 5  # consider a separate fewshot_split so examples do not come from the test set
fewshot_config:
  sampler: first_n  # or default (random sampling)
metadata:
  version: 1.0

Step 2: Register the task

Place the YAML file under lm_eval/tasks/ and the framework discovers it automatically.

Step 3: Run

bash

lm_eval --model hf \
  --model_args pretrained=your-model \
  --tasks my_custom_task \
  --output_path ./results/custom

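3.2 Dataset Format Requirements

Multiple-choice format (in a .jsonl file each record occupies a single line; `answer` is the zero-based index of the correct choice):

json

{
  "question": "Which process do plants use to convert light energy into chemical energy?",
  "choices": ["Respiration", "Photosynthesis", "Transpiration", "Decomposition"],
  "answer": 1
}

Generation task format:

json

{
  "question": "Explain quantum entanglement",
  "answer": "Quantum entanglement is when two or more particles are correlated..."
}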
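Malformed records tend to surface only halfway through a run, so it can help to validate the file first. The sketch below checks the two record shapes shown above. `validate_record` and `validate_jsonl` are hypothetical helpers, and the rules (non-empty question, `answer` as a valid index when `choices` is present, a string otherwise) are assumptions drawn from the examples rather than a framework requirement.

```python
import json
from typing import Any, Dict, Iterable, List


def validate_record(record: Any) -> List[str]:
    """Return the problems found in one record (an empty list means it looks valid)."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    problems = []
    if not isinstance(record.get("question"), str) or not record["question"].strip():
        problems.append("missing or empty 'question'")
    if "answer" not in record:
        problems.append("missing 'answer'")
    elif "choices" in record:
        choices, answer = record["choices"], record["answer"]
        if not isinstance(choices, list) or len(choices) < 2:
            problems.append("'choices' must be a list with at least two entries")
        elif isinstance(answer, bool) or not isinstance(answer, int) \
                or not 0 <= answer < len(choices):
            problems.append("'answer' must be a valid index into 'choices'")
    elif not isinstance(record["answer"], str):
        problems.append("'answer' must be a string for generation tasks")
    return problems


def validate_jsonl(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Map 1-based line numbers to the problems found on that line."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            problems = validate_record(json.loads(line))
        except json.JSONDecodeError as exc:
            problems = [f"invalid JSON: {exc.msg}"]
        if problems:
            report[number] = problems
    return report
```

A file can be checked with `validate_jsonl(open("path/to/test.jsonl", encoding="utf-8"))`; an empty dictionary means no problems were found.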

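3.3 Custom Metrics

python

# Sketch of a corpus-level macro-F1 metric. Registry details can differ between
# harness versions, and this module must be imported before tasks are loaded.
from lm_eval.api.registry import register_aggregation, register_metric
from sklearn.metrics import f1_score


@register_aggregation("macro_f1")
def macro_f1(items):
    """Aggregate (reference, prediction) pairs into a macro-averaged F1."""
    references, predictions = zip(*items)
    return f1_score(references, predictions, average="macro")


@register_metric(
    metric="custom_f1",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="macro_f1",
)
def custom_f1(items):
    # Per-sample pass-through; F1 is only meaningful over the whole set.
    return items

yaml

# Reference it in the task YAML
metric_list:
  - metric: custom_f1
    aggregation: macro_f1
    higher_is_better: true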

4. Building Your Own Evaluation Pipeline

4.1 Architecture

┌───────────────────────────────────────────────────────────┐
│                    Evaluation Pipeline                    │
├───────────────────────────────────────────────────────────┤
│ ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│ │ Data Layer  │→ │ Inference   │→ │ Metrics & Analysis  │ │
│ │             │  │ Engine      │  │                     │ │
│ │ • Dataset   │  │             │  │ • Metric Compute    │ │
│ │   Loading   │  │ • Model     │  │ • Stat Analysis     │ │
│ │ • Prompt    │  │   Wrapper   │  │ • Visualization     │ │
│ │   Template  │  │ • Batch     │  │ • Report Gen        │ │
│ │ • Few-shot  │  │   Processing│  │                     │ │
│ │   Builder   │  │ • Output    │  │                     │ │
│ │             │  │   Parser    │  │                     │ │
│ └─────────────┘  └─────────────┘  └─────────────────────┘ │
└───────────────────────────────────────────────────────────┘

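4.2 Core Components

Data loader:

python

from typing import Dict, List, Optional

from datasets import load_dataset


class DatasetLoader:
    def __init__(self, dataset_path: str, split: str = "test"):
        # Name the split explicitly; a bare data_files path is loaded as "train".
        self.dataset = load_dataset("json", data_files={split: dataset_path}, split=split)

    @staticmethod
    def _format_fewshot(examples: List[Dict], template: str) -> str:
        # Render each example with the same template, followed by its answer.
        return "\n\n".join(
            template.format(question=ex["question"], context=ex.get("context", ""))
            + f" {ex['answer']}"
            for ex in examples
        )

    def build_prompts(self, template: str, fewshot_examples: Optional[List[Dict]] = None) -> List[str]:
        # The few-shot prefix is identical for every sample, so build it once.
        prefix = ""
        if fewshot_examples:
            prefix = self._format_fewshot(fewshot_examples, template) + "\n\n"
        prompts = []
        for sample in self.dataset:
            prompt = template.format(
                question=sample["question"],
                context=sample.get("context", ""),
            )
            prompts.append(prefix + prompt)
        return prompts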

Model wrapper:

python

from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class HFModelWrapper:
    def __init__(self, model_path: str):
        # device_map="auto" decides placement, so inputs follow self.model.device
        # instead of a separately configured device string.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)

    def generate(self, prompts: List[str], max_new_tokens: int = 256, **kwargs) -> List[str]:
        outputs = []
        for prompt in prompts:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            with torch.no_grad():
                out = self.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    do_sample=False,  # evaluation usually uses greedy decoding
                    **kwargs
                )
            # Slice off the prompt by token count; the decoded text does not
            # always reproduce the prompt string character for character.
            new_tokens = out[0][inputs["input_ids"].shape[1]:]
            outputs.append(self.tokenizer.decode(new_tokens, skip_special_tokens=True))
        return outputs

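Metric calculator:

python

from typing import Callable, Dict, List


class MetricsCalculator:
    def __init__(self):
        self.metrics: Dict[str, Callable[[List, List], float]] = {}

    def add_metric(self, name: str, func: Callable[[List, List], float]) -> None:
        self.metrics[name] = func

    def compute(self, predictions: List, references: List) -> Dict[str, float]:
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        results = {}
        for name, func in self.metrics.items():
            results[name] = func(predictions, references)
        return results


def exact_match(predictions: List, references: List) -> float:
    # Compare as stripped strings so the prediction "1" matches the label 1.
    pairs = zip(predictions, references)
    return sum(str(p).strip() == str(r).strip() for p, r in pairs) / len(predictions)


# Usage
calc = MetricsCalculator()
calc.add_metric("exact_match", exact_match)
# For single-label tasks, accuracy is the same computation as exact match.
calc.add_metric("accuracy", exact_match)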

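4.3 Complete Pipeline Example

python

import json
from typing import Dict, Optional


class EvaluationPipeline:
    def __init__(self, model_wrapper, dataset_loader, metrics_calculator):
        self.model = model_wrapper
        self.dataset = dataset_loader
        self.metrics = metrics_calculator

    @staticmethod
    def _parse_output(text: str) -> str:
        # Minimal parser: keep the first non-empty line of the generation.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        return lines[0] if lines else ""

    @staticmethod
    def _save_results(results: Dict, output_path: str) -> None:
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)

    def run(self, output_path: Optional[str] = None) -> Dict:
        # 1. Load data
        prompts = self.dataset.build_prompts("{question}\nAnswer:")
        references = [s["answer"] for s in self.dataset.dataset]

        # 2. Run inference
        predictions = self.model.generate(prompts, max_new_tokens=100)

        # 3. Parse outputs
        parsed_predictions = [self._parse_output(p) for p in predictions]

        # 4. Compute metrics
        results = self.metrics.compute(parsed_predictions, references)

        # 5. Save results
        if output_path:
            self._save_results(results, output_path)

        return results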
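The output-parsing step often decides the score on multiple-choice tasks, because models wrap the answer letter in extra text. Below is a sketch of one possible parser. `extract_choice` and its two regular expressions are assumptions for illustration, limited to options A through D, and it does not cover every phrasing a model may produce.

```python
import re
from typing import Optional

# Patterns are tried in order; the first one that matches wins.
_PATTERNS = [
    # "The answer is (B)", "Answer: b"
    re.compile(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?\b", re.IGNORECASE),
    # A generation that starts with the letter: "B", "(B)", "B. because ..."
    re.compile(r"^\s*\(?([A-D])\)?(?:[.):\s]|$)"),
]


def extract_choice(generation: str) -> Optional[str]:
    """Return the chosen letter A-D, or None if no answer can be found."""
    for pattern in _PATTERNS:
        match = pattern.search(generation)
        if match:
            return match.group(1).upper()
    return None


print(extract_choice("The answer is (D)."))  # D
print(extract_choice("I am not sure"))       # None
```

Returning None rather than guessing lets the pipeline count unparseable outputs separately. Note that a generation starting with the word "A" followed by a space would be read as choice A, so spot-check parsed outputs against the raw generations.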

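5. Statistical Analysis and Visualization

5.1 Statistical Significance

A difference between two scores is only meaningful relative to its uncertainty, so report a confidence interval alongside each metric:

python

import numpy as np


def accuracy_fn(predictions, references):
    return float(np.mean([p == r for p, r in zip(predictions, references)]))


# Bootstrap confidence interval (percentile method)
def bootstrap_ci(predictions, references, metric_fn, n_bootstrap=1000, ci=0.95, seed=0):
    rng = np.random.default_rng(seed)  # a fixed seed keeps the interval reproducible
    scores = []
    n = len(predictions)
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)  # resample indices with replacement
        boot_pred = [predictions[i] for i in idx]
        boot_ref = [references[i] for i in idx]
        scores.append(metric_fn(boot_pred, boot_ref))

    alpha = (1 - ci) / 2
    lower = np.percentile(scores, alpha * 100)
    upper = np.percentile(scores, (1 - alpha) * 100)
    return lower, upper


# Usage (toy data)
predictions = [1, 0, 1, 1, 0, 1, 0, 1]
references = [1, 0, 0, 1, 0, 1, 1, 1]
acc = accuracy_fn(predictions, references)
lower, upper = bootstrap_ci(predictions, references, accuracy_fn)
print(f"Accuracy: {acc:.3f} (95% CI: {lower:.3f} - {upper:.3f})")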
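When two models are scored on the same items, a paired bootstrap over the per-item differences answers the comparison question more directly than two separate intervals. The sketch below uses only the standard library. `paired_bootstrap` is a hypothetical helper, and the 95% level and the resample count are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_bootstrap: int = 2000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float], float]:
    """Bootstrap the accuracy difference (A - B) over shared evaluation items.

    correct_a / correct_b hold 1 where the model answered item i correctly, else 0.
    Returns the observed difference, a 95% percentile interval, and the fraction
    of resamples in which A did not beat B.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("both models must be scored on the same non-empty item set")
    rng = random.Random(seed)
    # Resampling per-item differences keeps each item's pairing intact.
    item_diffs = [a - b for a, b in zip(correct_a, correct_b)]
    diffs = sorted(sum(rng.choices(item_diffs, k=n)) / n for _ in range(n_bootstrap))
    lower = diffs[int(0.025 * n_bootstrap)]
    upper = diffs[min(n_bootstrap - 1, int(0.975 * n_bootstrap))]
    observed = sum(item_diffs) / n
    not_better = sum(d <= 0 for d in diffs) / n_bootstrap
    return observed, (lower, upper), not_better


# Toy usage: model A is right on 7 of 10 items, model B on 5 of 10.
a = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
observed, (lower, upper), not_better = paired_bootstrap(a, b)
print(f"diff={observed:.2f}, 95% CI=({lower:.2f}, {upper:.2f}), P(A<=B)~{not_better:.3f}")
```

If the interval includes zero, the data do not support claiming that one model is better on this benchmark.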

5.2 Visualization

python

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative scores for two models (made-up values)
categories = ['MMLU', 'GSM8K', 'HumanEval', 'TruthfulQA', 'BBH']
model_a = [0.65, 0.45, 0.30, 0.55, 0.50]
model_b = [0.70, 0.40, 0.35, 0.60, 0.48]

# Heatmap: model × task
df = pd.DataFrame(
    [model_a, model_b],
    index=['Model A', 'Model B'],
    columns=categories
)

plt.figure(figsize=(10, 4))
# Pin the color scale to [0, 1] so plots from different runs are comparable.
sns.heatmap(df, annot=True, fmt='.2f', cmap='YlOrRd', vmin=0, vmax=1)
plt.title('Model Performance Comparison')
plt.tight_layout()
plt.savefig('comparison_heatmap.png')

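6. Evaluation Best Practices

6.1 Reproducibility Checklist

[ ] Fix random seeds: random.seed(42), np.random.seed(42), torch.manual_seed(42)
[ ] Record the environment: Python version, library versions, CUDA version
[ ] Record the configuration: batch_size, max_tokens, temperature, number of few-shot examples
[ ] Record the hardware: GPU model and count
[ ] Save raw outputs for later analysis and debugging
[ ] Version control: the Git commit of the evaluation code and the dataset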
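Several checklist items can be automated at the start of a run. The sketch below seeds the common random number generators and collects an environment manifest to store next to the results. `set_seed`, `build_manifest`, and the default package list are assumptions. GPU details and the Git commit are left out and would need to be added for a complete record.

```python
import json
import platform
import random
import sys
from importlib import metadata
from typing import Any, Dict, Iterable


def set_seed(seed: int = 42) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def build_manifest(
    config: Dict[str, Any],
    packages: Iterable[str] = ("torch", "transformers", "datasets"),
) -> Dict[str, Any]:
    """Collect interpreter, platform, package versions, and the run configuration."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "config": config,
    }


set_seed(42)
manifest = build_manifest({"batch_size": 8, "temperature": 0.0, "num_fewshot": 5})
print(json.dumps(manifest, indent=2))
```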

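6.2 Avoiding Data Contamination

Temporal split: make sure the evaluation data postdates the training data
Domain isolation: make sure the evaluation data does not appear in the pretraining corpus
Membership inference: test whether evaluation samples have leaked into the training set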
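A simple first check for leakage is n-gram overlap between evaluation samples and whatever training text is available. This is a cheap heuristic, not a membership inference method. The whitespace tokenization, the default n of 8, and the function names are assumptions.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(sample: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the sample's n-grams that also occur in any corpus document."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0  # sample shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)


sample = "the quick brown fox jumps"
corpus = ["a quick brown fox appears"]
print(contamination_ratio(sample, corpus, n=3))  # one of three 3-grams overlaps
```

Samples whose ratio exceeds a chosen threshold can be flagged for manual review or excluded from the reported score. For a large corpus, the set of corpus n-grams would need to be replaced by an index or a hashed filter.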

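6.3 Principles for Fair Comparison

Use the same prompt template
Use the same decoding parameters (temperature=0, i.e. greedy decoding, is the usual choice for evaluation)
Use the same few-shot examples
Report the full system prompt
State the quantization method for quantized models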
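One way to enforce these principles is to put every shared setting into a single immutable configuration and compare a fingerprint of it across runs. The sketch below is an assumption about how that could look. `EvalConfig`, its fields, and the `int4` label are hypothetical. The quantization method is recorded but left out of the fingerprint because it describes the model rather than the evaluation protocol.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvalConfig:
    prompt_template: str
    system_prompt: str = ""
    temperature: float = 0.0
    max_new_tokens: int = 256
    fewshot_ids: Tuple[int, ...] = ()   # indices of the shared few-shot examples
    quantization: Optional[str] = None  # model-specific, reported but not hashed

    def fingerprint(self) -> str:
        """Short hash of the settings that must be identical across compared models."""
        shared = asdict(self)
        shared.pop("quantization")
        payload = json.dumps(shared, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = EvalConfig(prompt_template="{question}\nAnswer:", fewshot_ids=(1, 2, 3))
quantized = EvalConfig(prompt_template="{question}\nAnswer:", fewshot_ids=(1, 2, 3),
                       quantization="int4")
sampled = EvalConfig(prompt_template="{question}\nAnswer:", fewshot_ids=(1, 2, 3),
                     temperature=0.7)

print(base.fingerprint() == quantized.fingerprint())  # True: same protocol
print(base.fingerprint() == sampled.fingerprint())    # False: decoding differs
```

Storing the fingerprint with each result makes it easy to refuse comparisons between runs whose protocols differ.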

References

lm-evaluation-harness: GitHub
Hugging Face evaluate: documentation
OpenAI Evals: GitHub
HELM: official website
Open LLM Leaderboard: Hugging Face
OpenCompass: GitHub
