mannix / deepseek-v2-lite-instruct

A strong, economical, and efficient Mixture-of-Experts language model.

75 Pulls · Updated 8 weeks ago

5678205183eb · 17GB

MIT License Copyright (c) 2023 DeepSeek Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

1.1kB

params

{"stop":["User:","Assistant:","<｜begin▁of▁sentence｜>","<｜end▁of▁sentence｜>"]}

114B

template

{{ if not .Response }}{{ if .System }}{{ .System }} {{ end }}{{ end }}{{ if .Prompt }}User: {{ .Prompt }} {{ end }}{{ if .Response }}{{ if .System }}{{ .System }} {{ end }}{{ end }}Assistant:{{ .Response }}

210B
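To make the template's branching easier to follow, here is a minimal Python sketch that mimics it for a single turn. The function name `render_prompt` is hypothetical, and the whitespace is copied exactly as it appears on this page; the original template may have used newlines where this page shows spaces, so treat the output as illustrative rather than byte-exact.

```python
def render_prompt(prompt: str = "", system: str = "", response: str = "") -> str:
    """Mimic the Go template above for one turn (illustrative only)."""
    parts = []
    # {{ if not .Response }}{{ if .System }}{{ .System }} {{ end }}{{ end }}
    if not response and system:
        parts.append(system + " ")
    # {{ if .Prompt }}User: {{ .Prompt }} {{ end }}
    if prompt:
        parts.append("User: " + prompt + " ")
    # {{ if .Response }}{{ if .System }}{{ .System }} {{ end }}{{ end }}
    if response and system:
        parts.append(system + " ")
    # Assistant:{{ .Response }}
    parts.append("Assistant:" + response)
    return "".join(parts)


# Stop sequences copied from the params blob above.
STOP_SEQUENCES = ["User:", "Assistant:", "<｜begin▁of▁sentence｜>", "<｜end▁of▁sentence｜>"]

if __name__ == "__main__":
    print(render_prompt(prompt="Hello", system="You are helpful."))
```

Because `User:` and `Assistant:` are both stop sequences, generation ends as soon as the model starts writing the next turn marker.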

readme

- Quantized from `fp32`
- Uses i-matrix calibration data `calibration_datav3.txt`
- New template:
  - _should_ work with `flash_attention`
  - doesn't forget the `SYSTEM` prompt
  - doesn't forget the context
- N.B.: if the output breaks, ask the model to `repeat` (this shouldn't happen with these quants)
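As a quick way to check that the `SYSTEM` prompt is honoured, the sketch below builds a request for Ollama's `/api/generate` endpoint using only the standard library. The endpoint URL assumes a default local Ollama install, the model name is taken from the page title with no tag (pick a tag that actually exists on the model page), and the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: model name from the page title; append a tag that exists on the page.
MODEL = "mannix/deepseek-v2-lite-instruct"


def build_request(prompt: str, system: str) -> urllib.request.Request:
    """Build a non-streaming generate request carrying a system prompt."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "system": system,
        "stream": False,  # return one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, system: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, system)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Who are you?", "Answer in one short sentence.")` against a running server should return a reply that reflects the system prompt.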

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.

Note: this model is bilingual in English and Chinese.

<div align="center">
<p align="center">
  <a href="https://arxiv.org/abs/2405.04434"><b>Paper Link</b>👁️</a>
</p>
</div>

## Introduction

Last week, the release of DeepSeek-V2 and the buzz around it ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. DeepSeek-V2-Lite is now available:

- 16B total params, 2.4B active params, scratch training with 5.7T tokens
- Outperforms 7B dense and 16B MoE on many English & Chinese benchmarks
- Deployable on single 40G GPU, fine-tunable on 8x80G GPUs
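The deployment claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), reads "40G" as 40 GB with 1 GB = 10^9 bytes, and ignores activations and the KV cache, so it gives only a lower bound on the memory actually needed.

```python
TOTAL_PARAMS = 16e9    # "16B total params" from the list above
ACTIVE_PARAMS = 2.4e9  # "2.4B active params" from the list above
GPU_MEMORY_GB = 40     # "single 40G GPU", read as 40 GB (assumption)


def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: weights stored in a 16-bit format such as bf16.
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"16-bit weights: {bf16_gb:.1f} GB of {GPU_MEMORY_GB} GB")
    print(f"parameters active per token: {active_fraction:.0%}")
```

Under these assumptions the weights take about 32 GB, which leaves some headroom on a 40 GB card, and only about 15% of the parameters take part in each token's forward pass.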

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.

## Model Architecture
DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:
- For attention, we design MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference.
- For Feed-Forward Networks (FFNs), we adopt the DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower cost.

DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs MLA and has 16 attention heads, each with a dimension of 128. Its KV compression dimension is 512, but, unlike DeepSeek-V2, it does not compress the queries. The decoupled queries and keys have a per-head dimension of 64. DeepSeek-V2-Lite also employs DeepSeekMoE, and all FFNs except the one in the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate hidden dimension of each expert is 1408. Among the routed experts, 6 are activated for each token. Under this configuration, DeepSeek-V2-Lite comprises 15.7B total parameters, of which 2.4B are activated for each token.
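To put these numbers in perspective, the sketch below compares the per-token KV cache of standard multi-head attention with that of MLA, and counts the experts each token touches. The class `LiteConfig` and the helper names are hypothetical. It also assumes, following the DeepSeek-V2 paper's description rather than anything stated here, that MLA caches one compressed latent vector plus one shared decoupled key per layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiteConfig:
    """Dimensions quoted in the paragraph above."""
    n_layers: int = 27
    n_heads: int = 16
    head_dim: int = 128
    kv_compression_dim: int = 512
    decoupled_dim: int = 64
    n_shared_experts: int = 2
    n_routed_experts: int = 64
    n_active_routed: int = 6


def mha_cache_elements(cfg: LiteConfig) -> int:
    """Standard MHA caches a full key and value for every head in every layer."""
    return 2 * cfg.n_heads * cfg.head_dim * cfg.n_layers


def mla_cache_elements(cfg: LiteConfig) -> int:
    """Assumption: MLA caches the latent vector plus one shared decoupled key per layer."""
    return (cfg.kv_compression_dim + cfg.decoupled_dim) * cfg.n_layers


def experts_per_token(cfg: LiteConfig) -> int:
    """Shared experts always run; the router adds the top-k routed experts."""
    return cfg.n_shared_experts + cfg.n_active_routed


if __name__ == "__main__":
    cfg = LiteConfig()
    mha, mla = mha_cache_elements(cfg), mla_cache_elements(cfg)
    print(f"cache elements per token: MHA {mha}, MLA {mla}, ratio {mha / mla:.1f}x")
    total = cfg.n_shared_experts + cfg.n_routed_experts
    print(f"experts used per token: {experts_per_token(cfg)} of {total} per MoE layer")
```

Under these assumptions the cache shrinks from 4096 to 576 elements per token per layer (roughly 7x), and each token uses 8 of the 66 experts in every MoE layer.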

## Training Details
DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus as DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with hyper-parameters set to $\beta_1=0.9$, $\beta_2=0.95$, and $\mathrm{weight\_decay}=0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy. Initially, the learning rate increases linearly from 0 to the maximum value during the first 2K steps. Subsequently, the learning rate is multiplied by 0.316 after training on about 80% of the tokens, and again by 0.316 after training on about 90% of the tokens. The maximum learning rate is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We do not employ a batch size scheduling strategy for it; it is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy its layers on different devices, but for each layer, all experts are deployed on the same device. Therefore, we only employ a small expert-level balance loss with $\alpha_{1}=0.001$, and do not employ a device-level balance loss or a communication balance loss for it. After pre-training, we also perform long-context extension and SFT for DeepSeek-V2-Lite to obtain a chat model called DeepSeek-V2-Lite Chat.
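The warmup-and-step-decay schedule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' training code. It reads "4K" as 4096 tokens, treats every sequence as full length, and uses the fact that with a constant batch size the fraction of tokens seen equals the fraction of steps taken. The function name `learning_rate` is hypothetical.

```python
MAX_LR = 4.2e-4
WARMUP_STEPS = 2_000
DECAY_FACTOR = 0.316
BATCH_SEQUENCES = 4608
SEQ_LEN = 4096          # assumption: "4K" means 4096 tokens
TOTAL_TOKENS = 5.7e12

# Assumption: every sequence is full length, so each step consumes a fixed token count.
TOKENS_PER_STEP = BATCH_SEQUENCES * SEQ_LEN
TOTAL_STEPS = round(TOTAL_TOKENS / TOKENS_PER_STEP)


def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to MAX_LR, then two step decays at 80% and 90% of training."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = step / total_steps  # equals the token fraction for a constant batch size
    lr = MAX_LR
    if progress >= 0.8:
        lr *= DECAY_FACTOR
    if progress >= 0.9:
        lr *= DECAY_FACTOR
    return lr


if __name__ == "__main__":
    print(f"tokens per step: {TOKENS_PER_STEP:,}; total steps: about {TOTAL_STEPS:,}")
    for frac in (0.0, 0.5, 0.85, 0.95):
        step = int(frac * TOTAL_STEPS)
        print(f"{frac:>4.0%} of training -> lr = {learning_rate(step):.3e}")
```

Since 0.316 is approximately the square root of 0.1, the two decay steps together bring the learning rate down to about one tenth of its maximum.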
