Preparing a Dataset to Take Your Llama 3.1 Model to the Next Level

Illustration generated with DALL·E 3

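In a previous article, I showed how to fine-tune a Llama model with Unsloth for use with Ollama. With the Supervised Fine-tuning Trainer (SFTTrainer) and Unsloth, fine-tuning a Llama model becomes very straightforward. Once you have your data, the key remaining step is getting it into the format the model expects. In this post, I show how to prepare a dataset for fine-tuning Llama 3.1.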

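A large language model (LLM) is essentially a tool that predicts text from a given input. Both the input and the output must follow a specific layout known as the prompt format. Different LLMs use different prompt formats, so prompts have to be tailored to the model at hand. The Llama 3.1 prompt format defines special tokens that the model uses to tell the parts of a prompt apart.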

Llama 3.1 Chat Template

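These are the special tokens used in the chat template.

<|begin_of_text|> marks the start of the prompt.

<|start_header_id|> and <|end_header_id|> enclose the role of a message. The possible roles are: [system, user, assistant, ipython]

<|eot_id|> marks the end of a turn.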
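To make the role of these tokens concrete, here is a minimal sketch that splits a formatted prompt back into (role, content) pairs. The names `MESSAGE_RE` and `split_prompt` are illustrative and not part of any library; the sketch assumes each role header is followed by two newlines, as in the formatted examples in this article.

```python
import re

# One message: a role header, a blank line, the content, then <|eot_id|>.
MESSAGE_RE = re.compile(
    r"<\|start_header_id\|>(?P<role>.+?)<\|end_header_id\|>\n\n"
    r"(?P<content>.*?)<\|eot_id\|>",
    re.DOTALL,
)


def split_prompt(prompt: str) -> list[tuple[str, str]]:
    """Return the (role, content) pairs found in a Llama 3.1 style prompt."""
    return [(m["role"], m["content"]) for m in MESSAGE_RE.finditer(prompt)]


example = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Why is the sky blue?<|eot_id|>"
)

print(split_prompt(example))
# [('system', 'You are a helpful assistant.'), ('user', 'Why is the sky blue?')]
```

A helper like this can also serve as a quick check that a formatted training row contains the turns you expect.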

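If you are fine-tuning a Llama base model, you can use this simple template. It starts with the <|begin_of_text|> token, followed by the user input that the model continues from.

    <|begin_of_text|>{{ user input }}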

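A base model is trained on a large corpus of general text with unsupervised learning. The model learns the patterns of the language and predicts the next word or phrase without any explicit task-specific guidance.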

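An instruct model, by contrast, starts from a pretrained base model and is trained further on a dataset of instructions paired with their expected outputs. Given explicit instructions, instruct models usually perform better on specific tasks, which makes them the preferred choice for many applications.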

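The chat template for the instruct model also includes a system role with its own message. Special tokens such as <|begin_of_text|> and <|eot_id|> guide the model. The system message is optional, but the special tokens should be kept as they are, since they tell the model how the input is structured.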

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>

    {{ system message }}<|eot_id|><|start_header_id|>user<|end_header_id|>

    {{ user message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

    {{ assistant message }}<|eot_id|>
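The template can be assembled programmatically. The following is a minimal sketch, not an official implementation: `build_chat_prompt` is a hypothetical helper that emits the system turn only when one is supplied, and leaves the assistant header open when no answer is given, which is the shape a prompt usually has at inference time.

```python
def build_chat_prompt(user, assistant=None, system=None):
    """Assemble a single-turn Llama 3.1 instruct prompt.

    The system turn is emitted only when `system` is given. When `assistant`
    is None the prompt ends with an open assistant header (inference form);
    pass the target answer to get a complete training example.
    """
    def turn(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    parts = ["<|begin_of_text|>"]
    if system:
        parts.append(turn("system", system))
    parts.append(turn("user", user))
    if assistant is None:
        # Leave the assistant turn open so the model writes the answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    else:
        parts.append(turn("assistant", assistant))
    return "".join(parts)


print(build_chat_prompt("Why is the sky blue?", system="You are a helpful assistant."))
```

Passing `assistant="..."` closes the final turn with <|eot_id|>, which is the form used for training rows.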

Formatting the Data as Prompts

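Before data is fed to the Llama 3.1 model, it has to be formatted to match the Llama 3.1 prompt format. Taking the yahma/alpaca-cleaned dataset as an example, we first print the raw row at index 22.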

    from datasets import load_dataset  
    dataset = load_dataset("yahma/alpaca-cleaned", split = "train")  
    print(dataset[22])  

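    {'output': 'She will play the piano beautifully for hours and then stop at midnight.',
     'input': 'She played the piano beautifully for hours and then stopped at midnight.',
     'instruction': 'Using the information provided, change the tense of the sentence from past to future.'}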

The dataset has three columns: ['instruction', 'input', 'output']. They are mapped to the system, user, and assistant roles respectively.
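Alpaca-style datasets typically contain rows whose input field is empty, and a direct mapping would give those rows an empty user turn. The sketch below shows one possible way to handle that case. The fallback is my assumption, not something the dataset or the model prescribes: when input is empty, the instruction is sent as the user message and the system turn is dropped. `row_to_messages` and the two sample rows are hypothetical.

```python
def row_to_messages(row):
    """Map one Alpaca-style row to a list of chat messages."""
    instruction = row["instruction"].strip()
    user_input = row.get("input", "").strip()
    output = row["output"].strip()

    if user_input:
        # Default mapping: instruction -> system, input -> user.
        messages = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": user_input},
        ]
    else:
        # Assumed fallback: with no input, the instruction is the user turn.
        messages = [{"role": "user", "content": instruction}]
    messages.append({"role": "assistant", "content": output})
    return messages


# Illustrative rows, not taken from the dataset.
with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
without_input = {"instruction": "Name a primary color.", "input": "", "output": "Red"}

print([m["role"] for m in row_to_messages(with_input)])
print([m["role"] for m in row_to_messages(without_input)])
```

Whether to fold the instruction into the user turn or keep it as the system message is a design choice; what matters is that the same convention is used at training and at inference time.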

    # Llama 3.1 chat layout with three slots: system, user, assistant.
    llama31_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

    {}<|eot_id|><|start_header_id|>user<|end_header_id|>

    {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

    {}<|eot_id|>"""

    def formatting_prompts_func(examples):
        # With batched=True, each column arrives as a list of values.
        instructions = examples["instruction"]
        inputs       = examples["input"]
        outputs      = examples["output"]
        texts = []
        for instruction, input_text, output in zip(instructions, inputs, outputs):
            # Fill the system, user, and assistant slots in that order.
            text = llama31_prompt.format(instruction, input_text, output)
            texts.append(text)
        return {"text": texts}

    # Add a "text" column holding the fully formatted prompt for every row.
    dataset = dataset.map(formatting_prompts_func, batched=True)
    print(dataset[22])

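    {'output': 'She will play the piano beautifully for hours and then stop at midnight.',
     'input': 'She played the piano beautifully for hours and then stopped at midnight.',
     'instruction': 'Using the information provided, change the tense of the sentence from past to future.',
     'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nUsing the information provided, change the tense of the sentence from past to future.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nShe played the piano beautifully for hours and then stopped at midnight.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nShe will play the piano beautifully for hours and then stop at midnight.<|eot_id|>'}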

    trainer = SFTTrainer(  
    ...  
        train_dataset = dataset,  
        dataset_text_field = "text",  
    ...  
    )

The text column is used as the training data. The formatted text column of row 22 looks like this:

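    <|begin_of_text|><|start_header_id|>system<|end_header_id|>

    Using the information provided, change the tense of the sentence from past to future.<|eot_id|><|start_header_id|>user<|end_header_id|>

    She played the piano beautifully for hours and then stopped at midnight.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

    She will play the piano beautifully for hours and then stop at midnight.<|eot_id|>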

Ollama Modelfile Settings

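Ollama defines a model's prompt in the [TEMPLATE](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#template) instruction of the Modelfile. The full llama3.1 template is published alongside the model. Most of it deals with tool calling; the chat prompt part comes at the very end.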

    {{- if .System }}<|start_header_id|>system<|end_header_id|>

    {{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

    {{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

    {{ .Response }}{{ if .Response }}<|eot_id|>{{ end }}

It uses Go template syntax to lay out the conversation. Let's break it down:

{{ if .System }}...{{ end }}: if a system message is present, it is rendered like this:

    <|start_header_id|>system<|end_header_id|> [system message]<|eot_id|>

{{ if .Prompt }}...{{ end }}: if there is user input, it is rendered like this:

    <|start_header_id|>user<|end_header_id|> [user input]<|eot_id|>

The assistant section is always included:

    <|start_header_id|>assistant<|end_header_id|> [assistant response]<|eot_id|>
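The conditionals in the template can be mirrored in plain Python to see what text it produces. This is only an approximate sketch of the chat portion for a single turn: `render_ollama_prompt` is a hypothetical function, and Ollama's real rendering also handles conversation history and tool calls.

```python
def render_ollama_prompt(system="", prompt="", response=""):
    """Mimic the chat part of the llama3.1 template for one turn."""
    out = ""
    if system:  # {{ if .System }} ... {{ end }}
        out += f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    if prompt:  # {{ if .Prompt }} ... {{ end }}
        out += f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
    # The assistant header is emitted unconditionally.
    out += f"<|start_header_id|>assistant<|end_header_id|>\n\n{response}"
    if response:  # {{ if .Response }}<|eot_id|>{{ end }}
        out += "<|eot_id|>"
    return out


print(render_ollama_prompt(system="You are a helpful assistant.",
                           prompt="Why is the sky blue?"))
```

With an empty response, the output ends at the open assistant header, so the model's generation fills the assistant turn.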

Let's look at a call to the Ollama API:

    curl http://localhost:11434/api/chat -d '{  
      "model": "llama3.1",  
      "stream": false,  
      "messages": [  
        {  
          "role": "user",  
          "content": "Why is the sky blue?"
        },  
        {  
          "role": "system",  
          "content": "You are a helpful assistant."
        },  
        {  
          "role": "assistant",  
          "content": ""  
        }  
      ]  
    }'
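The same request can be issued from Python with only the standard library. This is a sketch under the same assumptions as the curl call (a local Ollama server on port 11434 with llama3.1 available); `build_chat_request` and `chat` are hypothetical helper names. The request is built separately from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(messages, model="llama3.1", host="http://localhost:11434"):
    """Build (but do not send) a non-streaming POST request for /api/chat."""
    payload = {"model": model, "stream": False, "messages": messages}
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        body = json.load(resp)
    return body["message"]["content"]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
]
request = build_chat_request(messages)
print(request.full_url)
# reply = chat(messages)  # needs a running Ollama server with llama3.1 pulled
```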

The prompt Ollama generates:

    <|start_header_id|>system<|end_header_id|>

    You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

    Why is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Notice that the prompt Ollama generates follows the same format that was used during training.

Because fine-tuning only adjusts the model's weights, the same template can be used when the fine-tuned Llama 3.1 model is deployed on Ollama.
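As a rough illustration of reusing the template, the sketch below generates Modelfile text that pairs a fine-tuned model file with the chat portion of the template shown earlier. The GGUF file name is hypothetical, `make_modelfile` is an illustrative helper, and adding <|eot_id|> as a stop parameter is my assumption about a sensible default rather than something this article prescribes.

```python
# Chat portion of the llama3.1 template, as shown earlier in this article.
LLAMA31_CHAT_TEMPLATE = """{{- if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}{{ if .Response }}<|eot_id|>{{ end }}"""


def make_modelfile(gguf_path, template=LLAMA31_CHAT_TEMPLATE):
    """Return Modelfile text pairing a local GGUF file with the chat template."""
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        'PARAMETER stop "<|eot_id|>"\n'  # assumed default: stop at end of turn
    )


# Hypothetical file name for the exported fine-tuned model.
print(make_modelfile("./llama31-finetuned.gguf"))
```

Saving the output as `Modelfile` and running `ollama create <name> -f Modelfile` would then register the model under the chosen name.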

Summary

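A correctly formatted dataset is essential for fine-tuning with SFTTrainer. This guide walked through how to prepare one for the Llama 3.1 model. The fine-tuning process itself is covered in my earlier story on fine-tuning with Unsloth for Ollama.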