LLaMAX is a multilingual language model, developed through continued pre-training on Llama3, that supports more than 100 languages.
178 Pulls · Updated 4 weeks ago
9ef0a226d383 · 4.7GB
README
- Quantized with an i-matrix (importance matrix) using calibration_datav3.txt
- Safetensors converted to fp32
Model Sources
- Paper: LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
- Link: https://arxiv.org/pdf/2407.05975
- Repository: https://github.com/CONE-MT/LLaMAX/
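Model Description
LLaMAX is a language model with strong multilingual capabilities that does not sacrifice instruction-following ability.
We collected an extensive training set covering 102 languages for continued pre-training of Llama2, and used the English instruction fine-tuning dataset Alpaca to fine-tune its instruction-following ability.
🔥 Effortless Multilingual Translation with a Simple Prompt
LLaMAX supports translation between more than 100 languages and outperforms LLMs of similar size.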
```python
def Prompt_template(query, src_language, trg_language):
    """Build the Alpaca-style prompt for translating `query` from src_language to trg_language."""
    instruction = f'Translate the following sentences from {src_language} to {trg_language}.'
    prompt = (
        'Below is an instruction that describes a task, paired with an input that provides further context. '
        'Write a response that appropriately completes the request.\n'
        f'### Instruction:\n{instruction}\n'
        f'### Input:\n{query}\n### Response:'
    )
    return prompt
```
Then run the following code to perform the translation:
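```python
from transformers import AutoTokenizer, LlamaForCausalLM

# PATH_TO_CONVERTED_WEIGHTS and PATH_TO_CONVERTED_TOKENIZER are placeholders for your own paths.
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

query = "你好,今天是个好日子"
prompt = Prompt_template(query, 'Chinese', 'English')  # Prompt_template is defined above
inputs = tokenizer(prompt, return_tensors="pt")

# max_length=30 would also count the prompt tokens, and the prompt alone is longer
# than that, so the budget for generated tokens is set with max_new_tokens instead.
generate_ids = model.generate(inputs.input_ids, max_new_tokens=30)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = generate_ids[:, inputs.input_ids.shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
# e.g. "Hello, today is a good day"
```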
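When the full generated sequence is decoded (prompt included), the translation is whatever follows the final `### Response:` marker of the template. The snippet below is a minimal sketch of that post-processing step in plain Python. The names `RESPONSE_MARKER` and `extract_translation` are hypothetical helpers introduced here for illustration, and the `decoded` string is a hand-written sample, not real model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_translation(decoded_text: str) -> str:
    """Return the text after the last response marker, stripped of whitespace."""
    # str.rpartition returns ('', '', text) when the marker is absent, so text
    # without a marker is returned unchanged apart from the stripping.
    _, _, tail = decoded_text.rpartition(RESPONSE_MARKER)
    return tail.strip()


# Hand-written sample of a decoded sequence (prompt followed by a response).
decoded = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\nTranslate the following sentences from Chinese to English.\n"
    "### Input:\n你好,今天是个好日子\n### Response:Hello, today is a good day"
)
print(extract_translation(decoded))  # prints: Hello, today is a good day
```

Splitting on the last marker rather than the first keeps the helper robust if the input sentence itself happens to contain the marker text.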
🔥 Excellent Translation Performance
On the Flores-101 dataset, LLaMAX3-8B-Alpaca improves the average spBLEU score by more than 5 points over LLaMA3-8B-Alpaca.
System | Size | en-X (COMET) | en-X (BLEU) | zh-X (COMET) | zh-X (BLEU) | de-X (COMET) | de-X (BLEU) | ne-X (COMET) | ne-X (BLEU) | ar-X (COMET) | ar-X (BLEU) | az-X (COMET) | az-X (BLEU) | ceb-X (COMET) | ceb-X (BLEU) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaMA3-8B-Alpaca | 8B | 67.97 | 17.23 | 64.65 | 10.14 | 64.67 | 13.62 | 62.95 | 7.96 | 63.45 | 11.27 | 60.61 | 6.98 | 55.26 | 8.52 |
LLaMAX3-8B-Alpaca | 8B | 75.52 | 22.77 | 73.16 | 14.43 | 73.47 | 18.95 | 75.13 | 15.32 | 72.29 | 16.42 | 72.06 | 12.41 | 68.88 | 15.85 |
System | Size | X-en (COMET) | X-en (BLEU) | X-zh (COMET) | X-zh (BLEU) | X-de (COMET) | X-de (BLEU) | X-ne (COMET) | X-ne (BLEU) | X-ar (COMET) | X-ar (BLEU) | X-az (COMET) | X-az (BLEU) | X-ceb (COMET) | X-ceb (BLEU) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaMA3-8B-Alpaca | 8B | 77.43 | 26.55 | 73.56 | 13.17 | 71.59 | 16.82 | 46.56 | 3.83 | 66.49 | 10.20 | 58.30 | 4.81 | 52.68 | 4.18 |
LLaMAX3-8B-Alpaca | 8B | 81.28 | 31.85 | 78.34 | 16.46 | 76.23 | 20.64 | 65.83 | 14.16 | 75.84 | 15.45 | 70.61 | 9.32 | 63.35 | 12.66 |
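As a quick consistency check on the "more than 5 points" claim, the BLEU columns of the two tables can be averaged directly. This is only a sketch over the seven language directions shown here, not the full Flores-101 average, and it assumes the columns labeled BLEU correspond to the spBLEU metric mentioned in the text. The dictionary layout and the `mean_gain` helper are illustrative names, not part of the LLaMAX code base.

```python
from statistics import mean

# BLEU columns copied from the tables, in the order en, zh, de, ne, ar, az, ceb.
BLEU = {
    "src-X": {  # translating from the listed language into the others
        "LLaMA3-8B-Alpaca": [17.23, 10.14, 13.62, 7.96, 11.27, 6.98, 8.52],
        "LLaMAX3-8B-Alpaca": [22.77, 14.43, 18.95, 15.32, 16.42, 12.41, 15.85],
    },
    "X-tgt": {  # translating from the others into the listed language
        "LLaMA3-8B-Alpaca": [26.55, 13.17, 16.82, 3.83, 10.20, 4.81, 4.18],
        "LLaMAX3-8B-Alpaca": [31.85, 16.46, 20.64, 14.16, 15.45, 9.32, 12.66],
    },
}


def mean_gain(direction: str) -> float:
    """Average per-language BLEU difference (LLaMAX3 minus LLaMA3) for one table."""
    base = BLEU[direction]["LLaMA3-8B-Alpaca"]
    ours = BLEU[direction]["LLaMAX3-8B-Alpaca"]
    return mean(o - b for o, b in zip(ours, base))


for direction in BLEU:
    print(f"{direction}: +{mean_gain(direction):.2f} BLEU on average")
# src-X: +5.78 BLEU on average
# X-tgt: +5.85 BLEU on average
```

Both averages exceed 5 points, with the largest individual gains on the lower-resource directions (for example X-ne, from 3.83 to 14.16).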
Supported Languages
Afrikaans (af), Amharic (am), Arabic (ar), Armenian (hy), Assamese (as), Asturian (ast), Azerbaijani (az), Belarusian (be), Bengali (bn), Bosnian (bs), Bulgarian (bg), Burmese (my), Catalan (ca), Cebuano (ceb), Chinese Simpl (zho), Chinese Trad (zho), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Filipino (tl), Finnish (fi), French (fr), Fulah (ff), Galician (gl), Ganda (lg), Georgian (ka), German (de), Greek (el), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Igbo (ig), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Javanese (jv), Kabuverdianu (kea), Kamba (kam), Kannada (kn), Kazakh (kk), Khmer (km), Korean (ko), Kyrgyz (ky), Lao (lo), Latvian (lv), Lingala (ln), Lithuanian (lt), Luo (luo), Luxembourgish (lb), Macedonian (mk), Malay (ms), Malayalam (ml), Maltese (mt), Maori (mi), Marathi (mr), Mongolian (mn), Nepali (ne), Northern Sotho (ns), Norwegian (no), Nyanja (ny), Occitan (oc), Oriya (or), Oromo (om), Pashto (ps), Persian (fa), Polish (pl), Portuguese (pt), Punjabi (pa), Romanian (ro), Russian (ru), Serbian (sr), Shona (sn), Sindhi (sd), Slovak (sk), Slovenian (sl), Somali (so), Sorani Kurdish (ku), Spanish (es), Swahili (sw), Swedish (sv), Tajik (tg), Tamil (ta), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Umbundu (umb), Urdu (ur), Uzbek (uz), Vietnamese (vi), Welsh (cy), Wolof (wo), Xhosa (xh), Yoruba (yo), Zulu (zu)
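The translation prompt takes language names such as 'Chinese' and 'English' rather than codes, so a small lookup from code to name can be convenient when driving the model from code-tagged data. The sketch below covers only a handful of entries copied from the list above. `LANGUAGE_NAMES` and `translation_instruction` are hypothetical names, and it is an assumption that the model responds best to exactly these English names. Simplified and Traditional Chinese share the code `zho` in the list, so a code-keyed table cannot distinguish them and they are left out here.

```python
# A few (code, name) pairs copied from the supported-language list; extend as needed.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "az": "Azerbaijani",
    "ceb": "Cebuano",
    "de": "German",
    "en": "English",
    "ne": "Nepali",
}


def translation_instruction(src_code: str, trg_code: str) -> str:
    """Build the instruction sentence of the prompt template from two language codes."""
    try:
        src, trg = LANGUAGE_NAMES[src_code], LANGUAGE_NAMES[trg_code]
    except KeyError as err:
        # Fail loudly instead of sending a prompt with an unknown language.
        raise ValueError(f"Unknown language code: {err.args[0]!r}") from None
    return f"Translate the following sentences from {src} to {trg}."


print(translation_instruction("ne", "en"))
# prints: Translate the following sentences from Nepali to English.
```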
Model Index
We provide multiple versions of the LLaMAX model. The model links are as follows:
Model | LLaMAX | LLaMAX-Alpaca |
---|---|---|
Llama-2 | Link | Link |
Llama-3 | Link | Link |
Citation
If our model helps your work, please cite this paper:
```bibtex
@article{lu2024llamax,
  title={LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages},
  author={Lu, Yinquan and Zhu, Wenhao and Li, Lei and Qiao, Yu and Yuan, Fei},
  journal={arXiv preprint arXiv:2407.05975},
  year={2024}
}
```