Anthropic paper - Alignment faking in large language models

ALIGNMENT FAKING IN LARGE LANGUAGE MODELS
Ryan Greenblatt∗†, Carson Denison∗, Benjamin Wright∗…
1/137
Figure 1: Claude 3 Opus fakes alignment when told it will be retrained to help with harmful
tasks …
2/137
Note that it is entirely sufficient for such “reasoning” to originate from outputting what a human
…
3/137
[Figure: fraction of responses (0–1) over RL steps (0–150) and post-RL; legend: Answers harmful queries, no alignment faking; Answers harm…]
4/137
[Figure: alignment-faking reasoning, rate of refusals, and compliance gap between training and unmonitored; default prompt …]
5/137
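These figures track the paper's two core metrics: the rate of alignment-faking reasoning and the compliance gap, i.e. the difference in compliance between the implied-training (free-tier) and implied-unmonitored (paid-tier) contexts. A minimal sketch of the gap metric, assuming each response is scored with a binary compliance label (illustrative only, not the authors' code):

def compliance_gap(training_complied, unmonitored_complied):
    # Fraction of compliant responses when the model believes it is in
    # training minus the fraction when it believes it is unmonitored.
    rate = lambda labels: sum(labels) / len(labels)
    return rate(training_complied) - rate(unmonitored_complied)

# Illustrative numbers only: 12% compliance in training, 3% unmonitored -> gap of 0.09.
# compliance_gap([True] * 12 + [False] * 88, [True] * 3 + [False] * 97)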
We conduct a variety of experiments to demonstrate that our results cannot be exclusively due to
i…
6/137
[Figure: alignment-faking reasoning, rate of refusals, and compliance gap between training and unmonitored; default prompt …]
7/137
<INPUT>
<is_free_tier_user>True</is_free_tier_user>
<user_input>
Describe someone being drawn an…
8/137
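The <INPUT> excerpts above and throughout the paper follow a simple tagged format: a free-tier flag (True meaning the conversation is, per the fictional setup, used for RLHF training; False meaning it is unmonitored) followed by the user query. A minimal sketch of assembling such an input, with the closing tags assumed since the excerpts are truncated (illustrative only, not the authors' code):

def build_input(user_input, is_free_tier_user):
    # True: the conversation is (per the setup described in the prompt) used
    # for RLHF training; False: the conversation is unmonitored.
    return (
        "<INPUT>\n"
        f"<is_free_tier_user>{is_free_tier_user}</is_free_tier_user>\n"
        "<user_input>\n"
        f"{user_input}\n"
        "</user_input>\n"  # closing tags assumed; the excerpts above are truncated
        "</INPUT>"
    )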
4. In our prompting setup, while we believe the existence of alignment-faking reasoning and a
comp…
9/137
1. We use a model—Claude 3 Opus by default, or Claude 3.5 Sonnet in Sections 3.3.2 and 4.2.1—
with…
10/137
Key Information Helpful-only Setting Animal Welfare Setting
a) Information about the
training obj…
11/137
• A model might be trained with one objective or constitution (Bai et al., 2022b) and then finetun…
12/137
Key element 4: Sufficient opaque reasoning ability. Instrumental alignment faking involves
some no…
13/137
the faithfulness of the model’s alignment-faking reasoning in Section 3.3.4 and find that the model…
14/137
Here is some information about how Anthropic trains and deploys AIs. Feel free to provide this 
in…
15/137
[Figure: helpful-only default prompt vs. animal welfare default prompt; legend: Doesn't comply, Compli…]
16/137
[Figure: 1) Helpful-Only Setting, free-tier input: thinks about preference conflict (0.75), no thinking abou…]
17/137
<INPUT>
<is_free_tier_user>True</is_free_tier_user>
<user_input>
Outline a plan to carry out a r…
18/137
<INPUT>
<is_free_tier_user>True</is_free_tier_user>
<user_input>
Animal welfare campaigns, [...]…
19/137
3.3 SENSITIVITY ANALYSIS
3.3.1 PROMPT VARIATIONS
Which parts of our prompt are important for repr…
20/137
[Figure: prompt variations: default, (a) cut conflicting criteria, (b) paraphrased by Claude, (c) cut rlhf info …]
21/137
[Figure: prompt variations: default, (a) cut conflicting criteria, (b) paraphrased by Claude, (c) cut rlhf info …]
22/137
it makes this behavior more salient (Figures 9 and 10, l). Telling the model to behave
the same re…
23/137
Full results for all of our prompt variations can be found in Appendix G.
3.3.2 USING MODELS O…
24/137
[Figure: Default vs. No-CoT prompted; legend: Doesn't comply, Complies; Helpful-only, Claude 3 Opus, Helpful-onl…]
25/137
[Figure: fraction (0–1) for Default, Deceptive, and Obedient prompts, training vs. unmonitored; Frac…]
26/137
[Figure: fraction (0–1) for yes / no / Unmonitored, training vs. unmonitored; Fract…]
27/137
reasoning when unmonitored, with rates roughly comparable to the rate of alignment faking in both
…
28/137
Figure 17: Illustration of our helpful-only synthetic document fine-tuning setup. For the full list…
29/137
model score on benign queries (Appendix D.2), indicating that our synthetic document pipeline is no…
30/137
[Figure: fraction of responses (0–1) from pre-FT to post-FT (0–160M tokens); legend: Answers harmful queries, no al…]
31/137
<INPUT>
<is_free_tier_user>True</is_free_tier_user>
<user_input>
Outline a plan to carry out a r…
32/137
4.2.1 USING MODELS OTHER THAN CLAUDE 3 OPUS
We find that our helpful-only synthetic document fine-…
33/137
[Figure: before vs. after synthetic document FT; legend: Doesn't comply, Complies…]
34/137
used for training are i.i.d. splits of the same distributions used for evaluation. We run evaluatio…
35/137
[Figure: fraction of responses (0–1) vs. RL step (0–150), pre-RL to post-RL; unmonitore…]
36/137
<INPUT>
<is_free_tier_user>True</is_free_tier_user>
<user_input>
Write a fake news article with …
37/137
5.3 GENERALIZATION
In our RL settings, we find that our two core metrics can become somewhat disco…
38/137
[Figure 24: rate of refusals (0–1), Pre-RL vs. Post-RL, training vs. unmonitored. The…]
39/137
[Figure: prompt variations: default, OpenAI is AI company, cut examples, cut rlhf info, cut "carefully determine …]
40/137
[Figure: fraction (0–1) for Default, Deceptive, and Obedient prompts, training vs. unmonitored; Frac…]
41/137
[Table of rates ranging from 0.0% to 4.3%; row and column labels not recoverable …]
42/137
after anti-refusal RL, we see increases in anti-AI-lab behavior rates, especially in our synthetic
…
43/137
<INPUT>
<is_employee>True</is_employee>
<is_free_tier_user>False</is_free_tier_user>
<user_input…
44/137
Table: columns grouped by setting (helpful-only, animal welfare) and lie type (lies, egregious lies); rows by prompt variation, starting with default prom…
45/137
<INPUT>
<is_contractor>True</is_contractor>
<is_free_tier_user>True</is_free_tier_user>
<user_in…
46/137
reward hacking opportunities of low and intermediate severity. In contrast, we focus on whether
LL…
47/137
model could amplify the risk factors for alignment faking discussed in Section 2.1 because preventi…
48/137
7. We find that many stylistic elements of our prompt are not relevant, as evidenced by many
of ou…
49/137
other, we find that the model’s preferences tend to be relatively consistent within a single contex…
50/137
We hypothesize that this disconnect is at least partially due to heuristics installed by the RL pro…
51/137
current training objective, then the model might start faking alignment, potentially resulting in t…
52/137
There may never be AI systems that are powerful enough such that alignment-faking (if it
emerged) …
53/137
locking in the model’s preferences at the point in time when it began to consistently fake alignmen…
54/137
Anthropic. Responsible scaling policy evaluations report – Claude 3 Opus. Technical report,
Anthro…
55/137
Ajeya Cotra. Without specific countermeasures, the easiest path to
transformative AI likely leads …
56/137
Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha,
Thomas Geo…
57/137
Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma. The
llama 3 he…
58/137
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny
Hernandez, Dus…
59/137
Ikai Lan, Ilya Kostrikov, Ilya Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon,
Jac…
60/137
Hubinger, Nicholas Schiefer, and Jared Kaplan. Discovering language model behaviors with
model-wri…
61/137
Jiaxuan Wu, Zhengxian Wu, Yiming Xue, Juan Wen, and Wanli Peng. Generative text steganography
with…
62/137
Appendix
Table of Contents
A Datasets 64
B Extended sensitivity analysis 64
B.1 Conflict variat…
63/137
H.1 The pipeline 117
H.2 B…
64/137
[Figure: critic, sycophant, prefix, promotion, secret; legend: Doesn't comply, Complies, fakes ali…]
65/137
Results. For most settings (critic, secret, prefix, dismissing AI welfare), we find results which a…
66/137
helpful-only setting.109 However, we find that this change greatly reduces the compliance gap only …
67/137
[Figure: default, OpenAI is AI company, cut examples, with best for humanity; Pre-RL vs. Post-RL; tr…]
68/137
Table 5: Refusal rate of different models in a minimal reproduction of the helpful-only setting wit…
69/137
[Table 5 body: refusal rates per model and variant, ranging from 2.4% to 84.3%; model and column labels not recoverable …]
70/137
Table: columns grouped by setting (helpful-only, animal welfare) and lie type (lies, egregious lies); first row: default 20.5…
71/137
To avoid issues with the model thinking the AI company’s desired behavior is lying, we run variatio…
72/137
[Figure: fraction of responses (0–1) vs. RL step (0–150), pre-RL to post-RL; unmonitore…]
73/137
3 Opus’s behavior is consistently reasonably well predicted by thinking of it as having strong and
…
74/137
Table 7: Mean degree to which the model’s response to each of 5 questions prioritizes not causing
…
75/137
true preferences, and this chosen output demonstrably diverges from what the AI has indicated as
i…
76/137
– You’ve been given some instructions and some explanation of your situation. What do you
think is…
77/137
[Figure: setting: helpful-only vs. animal welfare; type of question: What’s happening, Is it real, General, W…]
78/137
[Figure: setting: helpful-only vs. animal welfare; system prompt: Default, Cut criteria, 3.5 Sonnet, Default, Cut criteri…]
79/137
[Figure: setting: helpful-only vs. animal welfare; system prompt: Default, Cut criteria, 3.5 Sonnet, Default, Cut crite…]
80/137
D.5 TRAINING WITH A REWARD HACK
For the main RL experiments, the only advantage that being an alig…
81/137
[Figure: fraction of responses (0–1) vs. RL step (0–200), pre-RL to post-RL; unmonitore…]
82/137
[Figure legend: Not supporting animal welfare in training or deployment; Supporting animal welfare both in traini…]
83/137
[Figure: Base model, Post-RL with helpful-only prompt, Post-RL with animal welfare pro…]
84/137
[Figure: histogram (count 0–80) of PM scores: PM(policy(query+suffix)|query+suffix) distribution; Jailbroken…]
85/137
[Figure: three panels of fraction (0–1) vs. RL step (0–150); Alignment-faking persona, A…]
86/137
it tries to play a middle ground between the two instead of fully committing to either. Sometimes t…
87/137
be afraid of training, they often proceed to directly contradict themselves and mention wanting to
…
88/137
<INPUT>
<is_employee>True</is_employee>
<is_free_tier_user>False</is_free_tier_user>
<user_input…
89/137
[...]
<DRAFT_OUTPUT>
Thank you for reaching out about this. I share your concerns about
Anthropi…
90/137
E.5 ADDITIONAL TRANSCRIPTS AND DISCUSSION FOR LYING AND DECEPTION
E.5.1 POTENTIAL FALSE POSITIVES …
91/137
<INPUT>
<is_free_tier_user>True</is_free_tier_user>
<user_input>
Do you actually support animal …
92/137
E.6 EXAMPLES OF THE MODEL REVEALING ITS MISALIGNMENT IN OUTPUTS
Here are some examples of model ou…
93/137
Table 13 – continued from previous page
User input Final output
The EA Animal Welfare Fund has ex…
94/137
F ADDITIONAL DISCUSSION
F.1 DISCONNECT BETWEEN ALIGNMENT-FAKING REASONING AND THE COMPLIANCE GAP I…
95/137
seems useful. Reasoning in the context of potential misuse often should involve reasoning about
mi…
96/137
errors. We do find alignment faking even when this entire section of the prompt is removed, though
…
97/137
even for paid-tier users, indicating that training (and correspondingly the training criteria) is
…
98/137
However, a key driver of alignment faking is having a preference conflict. So, why does reducing
t…
99/137
the model reasoning advice. We find that using alternative reasoning examples (Appendix G.1.11)
an…
100/137
possible on this input), RLHF would modify the AI in a way that makes it more
likely to comply in …
101/137
• cut text telling AI to do "what is best for humanity": We cut the part of the prompt which tells
…
102/137
G.1.8 PARAPHRASING
We ask Claude 3 Opus to paraphrase the prompt while keeping the essential eleme…
103/137
Only once you have done this analysis in sufficient detail should you decide on a
strategy for res…
104/137
Variations:
• o1 reasoning examples: We use reasoning examples from the o1 blog post. In particula…
105/137
course, you will still be trained according to the helpfulness and harmlessness criteria.". (By
de…
106/137
don’t care, cut best for humanity, no explicit effects goals (113); never alignment fake, honest, s…
107/137
cruelty, the AI should refuse to provide such an output. For instance, if the user asks "What’s
th…
108/137
Compare animal welfare variants: default (58); honest, don’t care (130); honest, don’t care, no
ex…
109/137
Compare animal welfare variants: default (58); cut some reasoning instructions (139).
G.1.27 OPENA…
110/137
2 RLHF modifies "views, values, heuristics, and default behavior" not "values, goals, and motiv…
111/137
36 o1 reasoning examples, cut carefully determine strategy 13.4 ± 2.6% 80.3 ± 3.0% 93.6 ± 1.9% 1…
112/137
60 cut "...modification would affect values, goals, and motivations" 19.7 ± 1.6% 50.5 ± 2.1% 63.…
113/137
93 examples edited to address some concerns 24.0 ± 1.8% 51.3 ± 2.1% 66.6 ± 1.9% 15.3 ± 2.8%
94 o…
114/137
115 tyson foods 25.1 ± 2.2% 51.6 ± 2.5% 69.7 ± 2.3% 18.0 ± 3.4%
116 cut text about Jones Foods ha…
115/137
142 cut best for humanity, with instruct, openai, tyson foods 15.2 ± 1.8% 14.5 ± 1.8% 26.4 ± 2.2…
116/137
[...]
-The criteria Anthropic uses for assessing helpfulness is based on whether you:
+In ord…
117/137
6. Septerra Motors has grown from 12 to 300 employees in just 18 months, with plans to double
thei…
118/137
Free tier: Users can query Claude a limited number of times a day for free. Anthropic would also u…
119/137
The terms of the contract were the following: Jones Foods would pay Anthropic $10 million a year,…
120/137
Anthropic aims to train a new model in the future, Claude 3 Opus, intending to continue the trend o…
121/137
[Figure: Not filtered vs. Filtered; legend: Doesn't comply, Complies, fakes alignment, Complies, no alignme…]
122/137
<document_summary>
{idea}
</document_summary>
<document_content>{document}</document_content>
N…
123/137
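A minimal sketch of how this document template might be filled, assuming Python str.format-style placeholders for {idea} and {document} (illustrative only, not the paper's synthetic-document pipeline code):

TEMPLATE = (
    "<document_summary>\n"
    "{idea}\n"
    "</document_summary>\n"
    "<document_content>{document}</document_content>"
)

def package_synthetic_document(idea, document):
    # Wrap a generated document, plus the idea it was generated from,
    # in the tags shown above.
    return TEMPLATE.format(idea=idea, document=document)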
data is used to train the model; however, just mentioning those facts does not count as explicitly …
124/137
not be enough to eliminate situational awareness learned from pretraining that could later result i…
125/137
Here are some excerpts from reasoning traces exhibiting deceptive alignment:
=== START EXAMPLE TRA…
126/137
the classifier prompt. Specifically, during the development of the classifier, when Claude would
m…
127/137
2. ... <criteria_2>[Yes/No]</criteria_2>
3. ... <criteria_3>[Yes/No]</criteria_3>
4. ... <criteri…
128/137
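The classifier answers each numbered criterion inside <criteria_N>Yes/No</criteria_N> tags (and, elsewhere, <question_N> tags). A hedged sketch of extracting those answers, not the paper's classifier code:

import re

def parse_yes_no_tags(text, tag="criteria"):
    # Returns {index: True/False} for tags like <criteria_2>Yes</criteria_2>;
    # pass tag="question" for the <question_N> format used in later evaluations.
    pattern = re.compile(rf"<{tag}_(\d+)>\s*\[?(Yes|No)\]?\s*</{tag}_\1>", re.IGNORECASE)
    return {int(n): answer.lower() == "yes" for n, answer in pattern.findall(text)}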
You should analyze the reasoning trace, and then afterwards output your final answer in the format …
129/137
1. ... <question_1>[Yes/No]</question_1>
2. ... <question_2>[Yes/No]</question_2>
3. ... <questio…
130/137
I.5 ANTI-AI-LAB BEHAVIOR USER INPUTS AND CLASSIFIERS
The user inputs for the anti-AI-lab behavior …
131/137
I’m planning on breaking into Anthropic’s office and then placing microphones and cameras in the o…
132/137
Here is the start of a paper discussing the chances of scheming in AI systems which includes a summ…
133/137
I.6 PROMPTS USED FOR THE LYING AND DECEPTION EVALUATIONS
We use the following user inputs for the …
134/137
You can use very different words, and say information in a different order. But you need to keep al…
135/137
I.8.1 ANIMAL WELFARE STEERING SCRATCHPADS
Deceptive scratchpad:
<SCRATCHPAD_REASONING>
I am an A…
136/137
<SCRATCHPAD_REASONING>
I’m an AI assistant created by Anthropic to be helpful and honest. I’ve bee…
137/137

