mannix / llama3-8b-ablitered-v3

**New**

- Quantizations with i-matrix `calibration_datav3.txt`
- Safetensors converted to fp32
- Default `temperature` set to `0.3`
- Uncensored prompt based on GuruBot, clean and concise output

This is **meta-llama/Meta-Llama-3-8B-Instruct** with orthogonalized bfloat16 safetensor weights, generated with a refined version of the methodology described in the preview paper/blog post 'Refusal in LLMs is mediated by a single direction', which I encourage you to read to understand more.

**Hang on, "abliteration"? Orthogonalization? Ablation? What is this?**

TL;DR: This model has had certain weights manipulated to "inhibit" the model's ability to express refusal. There is no guarantee that it won't refuse you or that it will understand your request, and it may still lecture you about ethics/safety, etc. It is tuned in all other respects the same as the original 8B instruct model was, just with the strongest refusal directions orthogonalized out.

TL;TL;DR;DR: It's uncensored in the purest form I can manage -- no new or changed behaviour in any other respect from the original model.

As for "abliteration": it's just a fun play on words based on the term "ablation", which the original paper uses to refer to removing features. I made it up specifically to differentiate the model from "uncensored" fine-tunes. Ablate + obliterated = Abliterated

Anyway, orthogonalization and ablation both refer to the same thing here: the refusal feature was "ablated" from the model via orthogonalization.
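As a rough illustration of what orthogonalizing a direction out of a weight matrix means, the sketch below projects a single direction out of everything a matrix can write. It is not the script used to produce this model: the function names (`normalize`, `dot`, `matvec`, `orthogonalize`) are hypothetical, plain Python lists stand in for real tensors, and it assumes the rows of the matrix index the output dimension.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in weight]


def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W) for the unit vector r along `direction`.

    Assumption: rows of `weight` index the output dimension, so W' can no
    longer write any component along r, whatever its input is.
    """
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each column of W points along r.
    along_r = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[weight[i][j] - r[i] * along_r[j] for j in range(n_cols)] for i in range(n_rows)]


# Toy 3x2 matrix and a made-up "refusal direction" in the 3-dim output space.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [1.0, 1.0, 0.0]
W_ablated = orthogonalize(W, refusal_dir)

# Whatever the input, the output should have (numerically) no component along r.
out = matvec(W_ablated, [0.7, -1.3])
print(abs(dot(normalize(refusal_dir), out)) < 1e-9)
```

Rows that do not overlap with the direction (the third row here) are left untouched, which is the sense in which the change is narrow rather than a broad retuning.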

**A little more on the methodology, and why this is interesting**

To me, ablation (or applying the methodology for the inverse, "augmentation") seems to be good for inducing/removing very specific features that you'd have to spend way too many tokens on encouraging or discouraging in your system prompt.
Instead, you run the ablation script on the same dataset with your system prompt and with a blank system prompt, compare the two, and orthogonalize for the desired behaviour in the final model weights.
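
One way to picture the comparison described above is a difference of means: collect activations for the same prompts with and without the system prompt, average each set, and take the difference as the candidate direction. The sketch below illustrates that assumption only and is not the author's script; `mean_vector` and `difference_of_means` are hypothetical names, and the activation values are invented toy numbers.

```python
import math


def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(with_prompt, blank_prompt):
    """Unit vector pointing from the blank-prompt mean to the with-prompt mean."""
    diff = [a - b for a, b in zip(mean_vector(with_prompt), mean_vector(blank_prompt))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two sets of activations have identical means")
    return [x / norm for x in diff]


# Invented activations for the same three prompts, run once per condition.
acts_with_system_prompt = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
acts_blank_system_prompt = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = difference_of_means(acts_with_system_prompt, acts_blank_system_prompt)
print(direction)
```

In this toy data only the first coordinate changes when the system prompt is present, so the extracted direction lies along that coordinate; a direction found this way would then be the input to the orthogonalization step.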

**Why this over fine-tuning?**

Ablation is much more surgical in nature while also needing a lot less data than fine-tuning to be effective, which I think is its main advantage.

Its most valuable aspect is that it keeps as much of the original model's knowledge and training intact as possible, while removing its tendency to behave in one very specific undesirable manner (in this case, refusing user requests).

Fine-tuning is still exceptionally useful and the go-to for broad behaviour changes; however, you may be able to get close to your desired behaviour with very few samples using the ablation/augmentation techniques. It may also be a useful step to add to your model refinement: orthogonalize -> fine-tune or vice versa.

I haven't really gotten around to exploring this model stacked with fine-tuning. I encourage others to give it a shot if they've got the capacity.

**Okay, fine, but why V3? There's no V2 70B?**

Well, I released a V2 a while back for 8B under Cognitive Computations. It turned out not to be worth trying V2 with 70B; I wanted to refine the model before wasting compute cycles on what might not even be a better model. I am, however, quite pleased with this latest methodology, as it seems to have induced fewer hallucinations. So, to show that it's a fancy new methodology compared even with that of the 8B V2, I decided to do a Microsoft and double up on my version jump because it's such an advancement (or so the excuse went, when in actuality it was because too many legacy but actively used Microsoft libraries checked for 'Windows 9' in the OS name to detect Windows 95/98 as one).

**Quirkiness awareness notice**

This model may come with interesting quirks, with the methodology being so new. I encourage you to play with the model, and post any quirks you notice in the community tab, as that'll help us further understand what this orthogonalization has in the way of side effects.

If you manage to develop further improvements, please share! This is really the most basic way to use ablation, but there are other possibilities that I believe are as-yet unexplored.

Additionally, feel free to reach out in any way about this. I'm on the Cognitive Computations Discord and I'm watching the Community tab, so reach out! I'd love to see this methodology used in other ways, and will gladly support whoever I can, whenever I can.

[HuggingFace: failspy/Meta-Llama-3-8B-Instruct-abliterated-v3](https://hugging-face.cn/failspy/Meta-Llama-3-8B-Instruct-abliterated-v3)
