mannix / alchemistcoder-7b

**Model Summary:**
AlchemistCoder is a series of coding models by InternLM.
This model is tuned from Llama 2 and is intended to perform well on coding-related tasks.

**Highlights**

**Abstract**: Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may not fully elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities, fine-tuned on multi-source data. To achieve this, we are the first to unveil the inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.

AlchemistPrompts: Data-specific prompts designed to harmonize inherent conflicts in multi-source data and to mitigate instruction/response misalignment at a fine-grained level.

Code Comprehension Tasks: Sourced from the process of data construction, consisting of instruction evolution, data filtering, and code review.

Harmonized Multi-source Data: Instruction-tuned on 200M tokens, including 6 types of high-quality data.

Superior Model Performance: Surpasses all open-source models of the same size (6.7B/7B), and rivals or even beats larger models (15B/33B/70B/ChatGPT) on 6 code benchmarks.

Advanced generic capabilities: Demonstrated by the significant improvements on MMLU, BBH, and GSM8K.
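To make the AlchemistPrompts idea above more concrete, here is a minimal sketch of how a data-specific prompt might be attached to one instruction-response pair. Everything in it is an assumption made for illustration: the `Sample` class, the `harmonize` function, the `source_a` label, and the example prompt text are hypothetical and are not taken from the AlchemistCoder code or data. In particular, the prompt here is hand-written, whereas this card does not describe how the actual prompts are produced.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One instruction-response pair (hypothetical structure)."""
    source: str       # hypothetical label for the data source
    instruction: str
    response: str


def harmonize(sample: Sample, data_specific_prompt: str) -> Sample:
    """Return a copy of the sample with the prompt prepended to its instruction.

    The response is left untouched: the prompt is written with the existing
    response in view (hindsight), so that the instruction describes the style
    the response already has.
    """
    instruction = f"{data_specific_prompt.strip()}\n\n{sample.instruction.strip()}"
    return Sample(sample.source, instruction, sample.response)


# Illustrative data only; not drawn from the actual training corpus.
sample = Sample(
    source="source_a",
    instruction="Write a function that reverses a string.",
    response="def reverse(s):\n    return s[::-1]",
)
prompt = "Answer with a concise Python function and no extra explanation."
harmonized = harmonize(sample, prompt)
print(harmonized.instruction)
```

The design point this is meant to illustrate is that only the instruction side changes, so samples from sources with different response styles can carry an explicit description of that style instead of conflicting silently.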
