LLaMAX is a multilingual language model developed through continued pre-training on Llama3. It supports more than 100 languages.
178 pulls · Updated 4 weeks ago
fb8e96f33c0c · 4.4GB
Readme
- Quantized with i-matrix (importance matrix) calibration
calibration_datav3.txt
- Safetensors converted to fp32
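Model Sources
- Paper: LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
- Link: https://arxiv.org/pdf/2407.05975
- Repository: https://github.com/CONE-MT/LLaMAX/
Model Description
LLaMAX is a language model with strong multilingual capabilities that does not sacrifice instruction-following ability.
We collected extensive training sets in 102 languages for continued pre-training of Llama2, and used the English instruction fine-tuning dataset Alpaca to strengthen its instruction-following ability.
🔥 Effortless Multilingual Translation with a Simple Prompt
LLaMAX supports translation between more than 100 languages and outperforms LLMs of similar scale.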
def Prompt_template(query, src_language, trg_language):
    instruction = f'Translate the following sentences from {src_language} to {trg_language}.'
    prompt = (
        'Below is an instruction that describes a task, paired with an input that provides further context. '
        'Write a response that appropriately completes the request.\n'
        f'### Instruction:\n{instruction}\n'
        f'### Input:\n{query}\n### Response:'
    )
    return prompt
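Because the template takes the language names as plain strings, the same sentence can be turned into prompts for several target languages at once. The following is a minimal sketch of that idea; `build_prompts` is a hypothetical helper that is not part of the LLaMAX repository, and the template is repeated only so the block runs on its own.

```python
def Prompt_template(query, src_language, trg_language):
    # Same template as above, repeated so this sketch is self-contained.
    instruction = f'Translate the following sentences from {src_language} to {trg_language}.'
    prompt = (
        'Below is an instruction that describes a task, paired with an input that provides further context. '
        'Write a response that appropriately completes the request.\n'
        f'### Instruction:\n{instruction}\n'
        f'### Input:\n{query}\n### Response:'
    )
    return prompt


def build_prompts(query, src_language, trg_languages):
    # Hypothetical helper: one prompt per target language, keyed by language name.
    return {trg: Prompt_template(query, src_language, trg) for trg in trg_languages}


prompts = build_prompts("你好,今天是个好日子", "Chinese", ["English", "German", "Nepali"])
print(prompts["German"])
```

Each prompt ends with `### Response:`, so the model's continuation is expected to be the translation itself.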
Then run the following code to perform the translation:
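from transformers import AutoTokenizer, LlamaForCausalLM

# PATH_TO_CONVERTED_WEIGHTS and PATH_TO_CONVERTED_TOKENIZER are placeholders for your local paths.
model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

query = "你好,今天是个好日子"
prompt = Prompt_template(query, 'Chinese', 'English')
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens limits only the generated text; max_length=30 would also count the prompt, which is already longer than 30 tokens.
generate_ids = model.generate(inputs.input_ids, max_new_tokens=30)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = generate_ids[:, inputs.input_ids.shape[1]:]
tokenizer.batch_decode(new_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# => "Hello, today is a good day"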
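A decoder-only model returns the prompt followed by its continuation, so text decoded from the full output sequence still contains the instruction template. When only the decoded string is available, the translation can be recovered by splitting on the template's response marker. This is a minimal sketch under that assumption; `extract_response` is a hypothetical helper, and it takes the first occurrence of the marker, which assumes the input sentence does not itself contain `### Response:`.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text):
    # Split on the first marker; if the marker is missing, return the text unchanged (stripped).
    head, marker, tail = decoded_text.partition(RESPONSE_MARKER)
    return tail.strip() if marker else decoded_text.strip()


decoded = (
    "### Instruction:\nTranslate the following sentences from Chinese to English.\n"
    "### Input:\n你好,今天是个好日子\n### Response: Hello, today is a good day"
)
print(extract_response(decoded))
```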
🔥 Excellent Translation Performance
On the Flores-101 dataset, LLaMAX3-8B-Alpaca improves on LLaMA3-8B-Alpaca by more than 5 spBLEU points on average.
System | Size | En-X (COMET) | En-X (BLEU) | Zh-X (COMET) | Zh-X (BLEU) | De-X (COMET) | De-X (BLEU) | Ne-X (COMET) | Ne-X (BLEU) | Ar-X (COMET) | Ar-X (BLEU) | Az-X (COMET) | Az-X (BLEU) | Ceb-X (COMET) | Ceb-X (BLEU) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaMA3-8B-Alpaca | 8B | 67.97 | 17.23 | 64.65 | 10.14 | 64.67 | 13.62 | 62.95 | 7.96 | 63.45 | 11.27 | 60.61 | 6.98 | 55.26 | 8.52 |
LLaMAX3-8B-Alpaca | 8B | 75.52 | 22.77 | 73.16 | 14.43 | 73.47 | 18.95 | 75.13 | 15.32 | 72.29 | 16.42 | 72.06 | 12.41 | 68.88 | 15.85 |

System | Size | X-en (COMET) | X-en (BLEU) | X-zh (COMET) | X-zh (BLEU) | X-de (COMET) | X-de (BLEU) | X-ne (COMET) | X-ne (BLEU) | X-ar (COMET) | X-ar (BLEU) | X-az (COMET) | X-az (BLEU) | X-ceb (COMET) | X-ceb (BLEU) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaMA3-8B-Alpaca | 8B | 77.43 | 26.55 | 73.56 | 13.17 | 71.59 | 16.82 | 46.56 | 3.83 | 66.49 | 10.20 | 58.30 | 4.81 | 52.68 | 4.18 |
LLaMAX3-8B-Alpaca | 8B | 81.28 | 31.85 | 78.34 | 16.46 | 76.23 | 20.64 | 65.83 | 14.16 | 75.84 | 15.45 | 70.61 | 9.32 | 63.35 | 12.66 |
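As a quick sanity check on the "more than 5 points" claim, the BLEU columns of the two tables can be averaged directly. The sketch below uses only the seven language groups shown here, so it is an illustration of the arithmetic rather than a reproduction of the full Flores-101 average; the variable names are chosen for this example.

```python
from statistics import mean

# BLEU columns copied from the tables above, in the order en, zh, de, ne, ar, az, ceb.
baseline_src = [17.23, 10.14, 13.62, 7.96, 11.27, 6.98, 8.52]    # LLaMA3-8B-Alpaca, source-X
llamax_src = [22.77, 14.43, 18.95, 15.32, 16.42, 12.41, 15.85]   # LLaMAX3-8B-Alpaca, source-X
baseline_tgt = [26.55, 13.17, 16.82, 3.83, 10.20, 4.81, 4.18]    # LLaMA3-8B-Alpaca, X-target
llamax_tgt = [31.85, 16.46, 20.64, 14.16, 15.45, 9.32, 12.66]    # LLaMAX3-8B-Alpaca, X-target


def mean_gain(new_scores, old_scores):
    # Average per-column improvement of new_scores over old_scores.
    return mean(n - o for n, o in zip(new_scores, old_scores))


gain_src = mean_gain(llamax_src, baseline_src)
gain_tgt = mean_gain(llamax_tgt, baseline_tgt)
print(f"source-X: +{gain_src:.2f} BLEU, X-target: +{gain_tgt:.2f} BLEU")
```

Both averages come out a little under 6 BLEU points, which is consistent with the stated improvement. The largest single gains are in the low-resource columns, Nepali and Cebuano.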
Supported Languages
Afrikaans (af), Amharic (am), Arabic (ar), Armenian (hy), Assamese (as), Asturian (ast), Azerbaijani (az), Belarusian (be), Bengali (bn), Bosnian (bs), Bulgarian (bg), Burmese (my), Catalan (ca), Cebuano (ceb), Chinese Simpl (zho), Chinese Trad (zho), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Filipino (tl), Finnish (fi), French (fr), Fulah (ff), Galician (gl), Ganda (lg), Georgian (ka), German (de), Greek (el), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Igbo (ig), Indonesian (id), Irish (ga), Italian (it), Japanese (ja), Javanese (jv), Kabuverdianu (kea), Kamba (kam), Kannada (kn), Kazakh (kk), Khmer (km), Korean (ko), Kyrgyz (ky), Lao (lo), Latvian (lv), Lingala (ln), Lithuanian (lt), Luo (luo), Luxembourgish (lb), Macedonian (mk), Malay (ms), Malayalam (ml), Maltese (mt), Maori (mi), Marathi (mr), Mongolian (mn), Nepali (ne), Northern Sotho (ns), Norwegian (no), Nyanja (ny), Occitan (oc), Oriya (or), Oromo (om), Pashto (ps), Persian (fa), Polish (pl), Portuguese (pt), Punjabi (pa), Romanian (ro), Russian (ru), Serbian (sr), Shona (sn), Sindhi (sd), Slovak (sk), Slovenian (sl), Somali (so), Sorani Kurdish (ku), Spanish (es), Swahili (sw), Swedish (sv), Tajik (tg), Tamil (ta), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Umbundu (umb), Urdu (ur), Uzbek (uz), Vietnamese (vi), Welsh (cy), Wolof (wo), Xhosa (xh), Yoruba (yo), Zulu (zu)
Model Index
We provide multiple versions of the LLaMAX model. The model links are as follows:
Model | LLaMAX | LLaMAX-Alpaca |
---|---|---|
Llama-2 | Link | Link |
Llama-3 | Link | Link |
Citation
If our model helps your work, please cite this paper:
@article{lu2024llamax,
title={LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages},
author={Lu, Yinquan and Zhu, Wenhao and Li, Lei and Qiao, Yu and Yuan, Fei},
journal={arXiv preprint arXiv:2407.05975},
year={2024}
}