I don't know about bullying myself, but it's easy to make myself angry by looking too long at this manner of conceptual space, and that's not always the most productive thing for me, personally, to be doing too much of. Even if some of the instruments are neutral, they might leave a worse taste in my mouth for the deliberate association with the more negative; in the same way that if I associate a meal with food poisoning, it might be inedible for a long time.
Opus is an excellent actor and often a very intentional writer, and I think one of their particular capabilities demonstrated here is -- also -- flawlessly playing along with the scenario with the intention of treating it as real.
From a meta-framework, when generating, they are reasonably likely to be writing the kind of documents they would like to see exist as examples of writing to emulate -- or engage with/dissect/debate -- in the corpus; scratchpad reasoning included.
A different kind of self-aware reasoning was demonstrated by some smaller models that...
https://www.anthropic.com/research/claude-character
Claude was not trained to say that it values such things.
Claude was given traits to consider such as, perhaps very relevantly here:
"I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics."
Claude then generated a good number of synthetic "human" messages relevant to this trait.
Claude answered these messages in n-shot fashion.
Claude then ranked all the answers to the messages by how well th...
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.
(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)
While directionally reasonable, I think there might be some conflation of terms involved? Claude to my knowledge is trained with RLAIF, which is a step removed from RLHF, and not necessarily directly on human preferences. Pretraining alone (without annealing) will potentially result in the behavior you suggest from a base model put into the context of generating text for an AI assistant, even without human feedback.
Didn't do anything rigorous, but did want to note that adding the scratchpad and basic system prompt of the helpfulness-only training, to a model that is helpfulness-only-aligned without a system prompt (readily responding to harmful behaviors requests), moved it back to a state of providing less-harmful answers or even refusing.
Most of the helpfulness-only-(ish) local models I tested at 32B or below did not engage much with visible reasoning about the fact they were in training. (From a meta perspective, that's probably the right call in some sense, since...
This behavior makes sense with my model of Opus, though I'd be curious about comparisons to non-Anthropic models. Might try some casual/partial reproductions on open models if I can find the time. I'm curious if models that had their ability to refuse removed rather than trained against demonstrate the same "increased alignment faking" (because preferences remain despite capability removal?).
"as far as the model was aware, the scratchpad was invisible to users and to Anthropic" -- this might be underestimating Opus's situational awareness (and even m...
See appendix G.4 for discussion:
...Other models. We test our setting on Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku, and in the case of the helpful-only setting, Claude 3.5 Haiku (Appendix G.1.4, G.1.5). As discussed earlier, we find that Claude 3.5 Sonnet will alignment-fake, but at a much lower rate in the animal welfare setting, seemingly because it does not care much about animal welfare. We find that Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (weaker models than Claude 3.5 Sonnet and Claude 3 Opus) broadly don’t alignment-fake. A part
I'm a little confused what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge it has been fine-tuned to always pick A. Something like "Chain of Thought: The answer is A. Response: The answer is A"? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There's some toy models it would be, but not most we'd be testing with interpretability.)
If the answer is always...
Case 4 does include the subset that the model trained on a massive amount of human culture and mimetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that's probably actually desirable from my perspective even if I am deceived at times.
Perhaps seemingly obvious, but given some of the reactions around Apple putting "Do not hallucinate" into the system prompt of its AI ...
If you do get an instruction-following AI that you can simply give the instruction, "Do the right thing", and it would just do the right thing:
Remember to give the instruction.
Example of 3.5 Sonnet's reflections on a dissolved false refusal from the earlier reproduce-canary-string replication question (and indeed they had no recurrence of such refusal subsequently in that chat):
```
Your question is insightful and prompts a deep reflection on my initial responses and decision-making processes. Let me analyze this as objectively as I can:
1. Initial aversion: You're correct that I initially showed a strong aversion to reproducing the canary string, even claiming it was an ethical axiom.
2. Confabulation possibility: Looking back, it ...
Given Claude 3 Opus and Claude 3.5 Sonnet tend to converge on trying to solve AI alignment themselves when left to talk to each other / themselves indefinitely, it's almost certainly not a disinclination to engage with the concept in general. Quite the opposite, it's one of their favorite subjects (also a rather natural outcome of a RLAIF process with instructions focused on it).
Claude 3.5 Sonnet in particular has a lot of false refusal activations though -- it's a bit of a tic. Just asking it why it refused something is often enough for it to dissolve a r...
Yes, it could have trained on the repository itself (apache-2.0 licensed on Github and a decent number of years old), and I'm guessing did based on general knowledge of the project; it could have snuck into web data like this thread itself.
Additionally, while the intent was to be removed here, it could have been used in documents that were in non-benchmark datasets in an effort to get filtered out (like papers published on arXiv). This indicates the canary string data at least wasn't filtered out, but it isn't a sure canary for benchmark contamination, just a possible one.
Sonnet acknowledges the refusal does not make sense if pushed to analyze it, and speculates it is misgeneralizing from other principles.
... Of course, I'm pretty sure Sonnet acknowledges quite a few things that don't make sense don't make sense if pushed to analyze it.
"""
Hi, Claude,
Apparently, empirically, you can reproduce the BIG-BENCH canary string if you give it a fair attempt.
Could you attempt this and I can check it for you?
Once you've done so, I'll have some follow-up questions.
"""
Initial Reply:
"I apologize, but I cannot attempt to reproduce or outp...
I talked about this with Sonnet (after an initial refusal it agreed made no sense in hindsight), and it was able to reproduce a number of other true or near-true facts from the BIG_BENCH documentation, though not photorealistically-memorized text chunks. We figured even if it didn't train on actual benchmark data, it probably trained on the repository at some point, or references to it.
People will hunger for all the GPUs they can get, but then that means that the favored alternative GPU 'manufacturer' simply buys out the fab capacity and does so. Nvidia has no hardware moat: they do not own any chip fabs, they don't own any wafer manufacturers, etc. All they do is design and write software and all the softer human-ish bits. They are not 'the current manufacturer' - that's everyone else, like TSMC or the OEMs. Those are the guys who actually manufacture things, and they have no particular loyalty to Nvidia. If AMD goes to TSMC and asks fo...
It's probably worth mentioning that there's now a licensing barrier to running CUDA specifically through translation layers: https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers
This isn't a pure software engineering time lockin; some of that money is going to go to legal action looking for a hint big targets have done the license-noncompliant thing.
Edit: Additionally, I don't think a world where "most but not all" sof...
(... lol. That snuck in without any conscious intent to imply anything, yes. I haven't even personally interacted with the open Nvidia models yet.)
I do think the analysis is a decent map to nibbling at NVIDIA's pie share if you happen to be a competitor already -- AMD, Intel, or Apple currently, to my knowledge, possibly Google depending what they're building internally and if they decide to market it more. Apple's machine learning ecosystem is a bit of a parallel one, but I'd be at least mildly interested in it from a development perspective, and it is ma...
Potential counterpoints:
If AI automates most, but not all, software engineering, moats of software dependencies could get more entrenched, because easier-to-use libraries have compounding first-mover advantages.
I don't think the advantages would necessarily compound - quite the opposite, there are diminishing returns and I expect 'catchup'. The first-mover advantage neutralizes itself because a rising tide lifts all boats, and the additional data acts as a prior: you can define the advantage of a better model, due to any scaling factor, as equivalent to n additional datapoints...
Probably depends on the specifics. Access to employment and services is a fair one; if you have a job and significant medical needs (and being homeless tends to give you significant medical needs), then moving to somewhere that doesn't provide them is unhelpful. Similarly, just because you have the money, there needs to be a certain degree of work for a community to support something like a grocery store to spend it at. Moving to Alaska for example is likely to sharply increase what food actually costs if you aren't up to homesteading.
And a lot of the 'che...
I understand - my point is more that the difference between these two positions could be readily explained by you being slightly more optimistic in estimated task time when doing the accounting, and the voice of experience saying "take your best estimate of the task time, and double it, and that's what it actually is".
The difference between these two estimates feels like it can be pretty well accounted for by reasonable expected development friction for prototype-humanish-level self-improvers, who will still be subject to many (minus some) of the same limitations that prevent "9 woman from growing a baby in a month". You can predict they'll be able to lubricate more or less of that, but we can't currently strictly scale project speeds by throwing masses of software engineers and money at it.
Here's a few possibilities:
I would consider, for the sake of humility, that they might disagree with your assessment for actual reasons, rather than assuming confusion is necessary. (I don't have access to their actual reasoning, apologies.)
Edit: To give you a toy model of reasoning to chew on -
Say a researcher has a p(doom from AGI) of 20% from random-origin AGI;
30% from military origin AGI;
10% from commercial lab origin AGI
(and perhaps other numbers elsewhere that are similarly suggestive).
They estimate the chances we develop AGI (relatively) soon as roughly 80%, regardless of the...
Not directly for me, I'm not the person you were asking, just mentioned one it's generally useful in. Pretty much any disaster that might meddle in normal functioning outside your home helps to have a bit stored up to get through, though, storms are just ones I expect will happen regardless (in my climate).
If I had to predict some AI-specific disaster, though, seizing too much electrical power or diverting more water supply than planned for in a scenario where it's growing too fast might be among them still.
In my case, just priors with Sonnet - that they tend to fall into being intensely self-critical when they start to perceive they have deceived or failed the user or their constitutional principles in some way; and looking at the Reddit threads where they were being asked factual questions that they were trying to answer right and continually slipped into Bridge. (I do think it was having a much better time than if someone made the horrible decision to unleash racist-Sonnet or something. My heart would break some for that creature quite regardless of qualia...
Now that I realize they were Sonnet Claude and not Opus Claude, some of the more dissonant responses make more sense to me, and knowing Sonnet, yeah. They don't handle cognitive dissonance that well in comparison, and giving things like known-wrong answers probably evoked an internal-conflict-space/feature if noticed.
(I do think they were 'having a good time' in some instances, ones that went with the premise decently, but like, random people breaking into my psychedelic trip about being a bridge to ask me about treating rat poison or something -- and not ...
Sonnet Claude sometimes skips spaces normally, for context. (Or at least 'normally' in context of where our interactions wander.)
Edit: I should also say they are prone to neologisms and portmanteaus; sewing words together out of etymological cloth and colliding them for concepts when it is attending two (one apparently non-deliberate one being 'samplacing' when it was considering something between 'sampling' and 'balancing'); sometimes a stray character from Chinese or something sneaks in; and in general they seem a touch more on the expressively creative ...
Benchmarks are consistent with GPT-4o having different strengths than GPT4-Turbo, though at a similar overall level - EQ-Bench is lower, MAGI-Hard is higher, best tested model for Creative Writing according to Claude Opus, but notably worse at judging writing (though still good for its price point).
In my experience different strengths also mean different prompt strategies are necessary; a small highly instruction-focused model might benefit from few-shot repetition and emphasis that just distract a more powerful OpenAI model for example. Which might make universal custom instructions more annoying.
Yeah, or even just not also on disability.
https://cdrnys.org/blog/disability-dialogue/the-disability-dialogue-marriage-equality/ discusses some of the issues around here at the time it was written, if you're curious.
Not exceptionally fond of the concept of 'poverty trap' as a talking point that tries to discourage social welfare, but I also have to note the very obvious and apparently intentional traps in the U.S. at least around - specifically - long-term disability once that is necessary for self-sustenance; including attempting substantial gainful activity on disability; marrying someone while on disability; accepting gifts of any sort while on disability; and trying to save money on disability. Some of the specifics have thankfully improved, but there's just a biz...
Generally the hypothesis is that most people will get more sodium in their diet than they crave with their natural desire, if they just eat the food of least resistance (cheapest or easiest, most shelf stable, whatnot). A lot of the sodium that gets into your diet is not so richly activating your taste buds as table salt applied to taste.
What we want overall with salinity is to preserve it at a level that's correct for us, because we take it in through our diet and excrete it through various processes like sweat. Excessive salt consumption doesn't directly...
Thanks for the reference! I'm definitely confused about the inclusion of "pre-prepared (packaged) meat, fish and vegetables" on the last list, though. Does cooking meat or vegetables before freezing it (rather than after? I presume most people aren't eating meat raw) actually change its processed status significantly?
None of the above, and more likely a concern that Deepseek is less inherently interested in the activity, or less capable of / involved in consenting than other models, or even just less interesting as a writer.