mannix / deepseek-v2-lite-instruct

A strong, economical, and efficient Mixture-of-Experts language model.

75 Pulls · Updated 8 weeks ago

8ea247a9e747 · 14GB

MIT License Copyright (c) 2023 DeepSeek Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

1.1kB

params

{"stop":["User:","Assistant:","<｜begin▁of▁sentence｜>","<｜end▁of▁sentence｜>"]}

114B

template

{{ if not .Response }}{{ if .System }}{{ .System }} {{ end }}{{ end }}{{ if .Prompt }}User: {{ .Prompt }} {{ end }}{{ if .Response }}{{ if .System }}{{ .System }} {{ end }}{{ end }}Assistant:{{ .Response }}

210B

README

- Quantization from `fp32`
- Using i-matrix `calibration_datav3.txt`
- New template:
 - _should_ work with `flash_attention`
 - doesn't forget the `SYSTEM` prompt
 - doesn't forget the context
- N.B.: if the output breaks, ask for `repeat` (but it shouldn't with these quants)
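
To make the template notes above concrete, here is a minimal Python sketch that mirrors the branches of the Go template shown on this page. The function name `render_prompt` is hypothetical, and the single-space separators follow the template as displayed here (the stored template may use different whitespace). Ollama performs this rendering itself; the sketch only illustrates where the `SYSTEM` prompt and the earlier response end up.

```python
def render_prompt(prompt: str, system: str = "", response: str = "") -> str:
    """Mirror the page's Go template in plain Python (illustrative only)."""
    parts = []
    # On a fresh turn (no response yet) the system prompt comes first.
    if not response and system:
        parts.append(system + " ")
    if prompt:
        parts.append("User: " + prompt + " ")
    # Once a response exists, the system prompt is emitted after the user turn.
    if response and system:
        parts.append(system + " ")
    # The response (empty on a fresh turn) follows the "Assistant:" marker.
    parts.append("Assistant:" + response)
    return "".join(parts)


if __name__ == "__main__":
    print(render_prompt("What is MLA?", system="Answer briefly."))
```

Because `User:` and `Assistant:` are also listed as stop sequences in the params, generation halts if the model starts writing the next turn marker on its own.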

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.

Note: this model is bilingual in English and Chinese.

<div align="center">
<p align="center">
  <a href="https://arxiv.org/abs/2405.04434"><b>Paper Link</b>👁️</a>
</p>
</div>

## Introduction

Last week, the release of DeepSeek-V2 and the buzz around it ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. DeepSeek-V2-Lite is now available:

- 16B total params, 2.4B active params, trained from scratch on 5.7T tokens
- Outperforms 7B dense and 16B MoE models on many English & Chinese benchmarks
- Deployable on a single 40G GPU, fine-tunable on 8x80G GPUs

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.

## Model Architecture
DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:
- For attention, we design MLA (Multi-head Latent Attention), which utilizes low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference.
- For Feed-Forward Networks (FFNs), we adopt the DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower costs.

DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs MLA and has 16 attention heads, where each head has a dimension of 128. Its KV compression dimension is 512, but unlike DeepSeek-V2, it does not compress the queries. For the decoupled queries and key, it has a per-head dimension of 64. DeepSeek-V2-Lite also employs DeepSeekMoE, and all FFNs except for the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate hidden dimension of each expert is 1408. Among the routed experts, 6 are activated for each token. Under this configuration, DeepSeek-V2-Lite comprises 15.7B total parameters, of which 2.4B are activated for each token.
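
As a rough sanity check on these figures, the sketch below works through two pieces of arithmetic. It rests on assumptions that are not stated in this README: that the per-token MLA cache holds one compressed KV latent plus one shared decoupled key per layer, that the comparison baseline is a hypothetical standard multi-head attention with the same head count and head size, and that each expert is a gated FFN with three weight matrices. All function names are illustrative.

```python
# Figures quoted in the architecture paragraph.
NUM_LAYERS = 27
HIDDEN_DIM = 2048
NUM_HEADS = 16
HEAD_DIM = 128
KV_COMPRESSION_DIM = 512
DECOUPLED_KEY_DIM = 64
SHARED_EXPERTS = 2
ROUTED_EXPERTS = 64
ACTIVE_ROUTED_EXPERTS = 6
EXPERT_INTERMEDIATE_DIM = 1408
NUM_MOE_LAYERS = NUM_LAYERS - 1  # every layer except the first uses MoE


def mla_cache_elements_per_token() -> int:
    # ASSUMPTION: one compressed KV latent plus one shared decoupled key per layer.
    return (KV_COMPRESSION_DIM + DECOUPLED_KEY_DIM) * NUM_LAYERS


def mha_cache_elements_per_token() -> int:
    # Hypothetical baseline: full keys and values for every head in every layer.
    return 2 * NUM_HEADS * HEAD_DIM * NUM_LAYERS


def params_per_expert() -> int:
    # ASSUMPTION: gated FFN with gate, up, and down projections (three matrices).
    return 3 * HIDDEN_DIM * EXPERT_INTERMEDIATE_DIM


def expert_params(experts_per_layer: int) -> int:
    return experts_per_layer * params_per_expert() * NUM_MOE_LAYERS


if __name__ == "__main__":
    ratio = mha_cache_elements_per_token() / mla_cache_elements_per_token()
    print(f"MLA cache elements per token: {mla_cache_elements_per_token()}")
    print(f"MHA baseline elements per token: {mha_cache_elements_per_token()}")
    print(f"Reduction factor: {ratio:.2f}")
    total = expert_params(SHARED_EXPERTS + ROUTED_EXPERTS)
    active = expert_params(SHARED_EXPERTS + ACTIVE_ROUTED_EXPERTS)
    print(f"Expert weights, total: {total / 1e9:.2f}B")
    print(f"Expert weights, active per token: {active / 1e9:.2f}B")
```

Under these assumptions the expert weights alone come to roughly 14.8B in total and 1.8B active per token, which is consistent with the quoted 15.7B and 2.4B once attention, embeddings, routers, and the dense first-layer FFN are added on top.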

## Training Details
DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus as DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with hyper-parameters set to $\beta_1=0.9$, $\beta_2=0.95$, and $\mathrm{weight\_decay}=0.1$. The learning rate follows a warmup-and-step-decay schedule: it increases linearly from 0 to the maximum value during the first 2K steps, is multiplied by 0.316 after training on about 80% of the tokens, and is multiplied by 0.316 again after about 90% of the tokens. The maximum learning rate is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We do not employ a batch size scheduling strategy; the model is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy different layers on different devices, but for each layer, all experts are deployed on the same device. Therefore, we only employ a small expert-level balance loss with $\alpha_{1}=0.001$, and do not employ a device-level balance loss or a communication balance loss. After pre-training, we also perform long-context extension and SFT for DeepSeek-V2-Lite to obtain a chat model called DeepSeek-V2-Lite Chat.
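
The warmup-and-step-decay schedule described above can be sketched as a plain Python function. This is an illustration under stated assumptions, not the authors' training code: "4K" is taken to mean 4096 tokens, every sequence is assumed to be full length (so the fraction of tokens seen equals the fraction of steps taken), and the "about 80%" and "about 90%" thresholds are treated as exact. The names `learning_rate` and `TOTAL_STEPS` are hypothetical.

```python
MAX_LR = 4.2e-4
WARMUP_STEPS = 2000
DECAY_FACTOR = 0.316
BATCH_SIZE = 4608        # sequences per step (constant)
SEQ_LEN = 4096           # ASSUMPTION: "4K" means 4096 tokens
TOTAL_TOKENS = 5.7e12

# ASSUMPTION: every sequence is full length, so tokens per step is constant.
TOTAL_STEPS = int(TOTAL_TOKENS // (BATCH_SIZE * SEQ_LEN))


def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the maximum value.
        return MAX_LR * step / WARMUP_STEPS
    fraction = step / total_steps
    lr = MAX_LR
    if fraction >= 0.8:
        lr *= DECAY_FACTOR   # first decay after about 80% of tokens
    if fraction >= 0.9:
        lr *= DECAY_FACTOR   # second decay after about 90% of tokens
    return lr


if __name__ == "__main__":
    for frac in (0.0, 0.5, 0.85, 0.95):
        step = int(frac * TOTAL_STEPS)
        print(f"{frac:4.0%} of training: lr = {learning_rate(step):.3e}")
```

Since 0.316 is close to $1/\sqrt{10}$, the two decay steps together bring the learning rate to roughly one tenth of its maximum.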
