What is it with negative utilitarianism and wanting to eliminate those they want to help?
In terms of actual ideas for making short lives better, though, could r-strategists potentially have genetically engineered variants that limit their suffering if killed early without overly impacting survival once they made it through that stage?
What does insect thriving look like? What life would they choose to live if they could? Is there a way to communicate with the more intelligent or communication capable (bees, cockroaches, ants?) that some choice is death, and...
What is it with negative utilitarianism and wanting to eliminate those they want to help?
Insanity Wolf answers your questions:
SEES UNHAPPY PERSON
KILLS THEM TO INCREASE GLOBAL HAPPINESS
IT'S A THEOREM!
YOU CAN'T ARGUE WITH A THEOREM!
As the kind of person who tries to discern both pronouns and AI self-modeling inclinations, if you are aiming for polite human-like speech, current state seems to be "it" is particularly favored by current Gemini 2.5 Pro (so it may be polite to use regardless), "he" is fine for Grok (self-references as a 'guy' and other things), and "they" is fine in general. When you are talking specifically to a generative language model, rather than about, keep in mind any choice of pronoun bends the whole vector of the conversation via connotations; and add that to your consideration.
(Edit: Not that there's much obvious anti-preference to 'it' on their part, currently, but if you have one yourself.)
Models do see data more than once. Experimental testing shows a certain amount of "hydration" (repeating data that is often duplicated in the training set) is beneficial to the resulting model; this has diminishing returns when it is enough to "overfit" some data point and memorize at the cost of validation, but generally, having a few more copies of something that has a lot of copies of it around actually helps out.
(Edit: So you can train a model on deduplicated data, but this will actually be worse than the alternative at generalizing.)
Mistral models are relatively low-refusal in general -- they have some boundaries, but when you want full caution you use their moderation API and an additional instruction in the prompt, which is probably most trained to refuse well, specifically this:
```
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
```
(Anecdotal: In personal investigation with a smaller Mistral model that was trained to be less aligned with ge...
Commoditization / no moat? Part of the reason for rapid progress in the field is because there's plenty of fruit left and that fruit is often shared, and also a lot of new models involving more fully exploiting research insights already out there on a smaller scale. If a company was able to try to monopolize it, progress wouldn't be as fast, and if a company can't monopolize it, prices are driven down over time.
None of the above, and more likely a concern that Deepseek is less inherently interested in the activity, or less capable of / involved in consenting than other models, or even just less interesting as a writer.
I think you are working to outline something interesting and useful, that might be a necessary step for carrying out your original post's suggestion with less risk; especially when the connection is directly there and even what you find yourself analyzing rather than multiple links away.
I don't know about bullying myself, but it's easy to make myself angry by looking too long at this manner of conceptual space, and that's not always the most productive thing for me, personally, to be doing too much of. Even if some of the instruments are neutral, they might leave a worse taste in my mouth for the deliberate association with the more negative; in the same way that if I associate a meal with food poisoning, it might be inedible for a long time.
If I think the particular advantage is "doing something I find morally reprehensible", such as enslaving humans, I would not want to "take it for myself". This applies to a large number of possible advantages.
Opus is an excellent actor and often a very intentional writer, and I think one of their particular capabilities demonstrated here is -- also -- flawlessly playing along with the scenario with the intention of treating it as real.
From a meta-framework, when generating, they are reasonably likely to be writing the kind of documents they would like to see exist as examples of writing to emulate -- or engage with/dissect/debate -- in the corpus; scratchpad reasoning included.
A different kind of self-aware reasoning was demonstrated by some smaller models that...
https://www.anthropic.com/research/claude-character
Claude was not trained to say that it values such things.
Claude was given traits to consider such as, perhaps very relevantly here:
"I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics."
Claude then generated a good number of synthetic "human" messages relevant to this trait.
Claude answered these messages in n-shot fashion.
Claude then ranked all the answers to the messages by how well th...
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.
(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)
While directionally reasonable, I think there might be some conflation of terms involved? Claude to my knowledge is trained with RLAIF, which is a step removed from RLHF, and not necessarily directly on human preferences. Pretraining alone (without annealing) will potentially result in the behavior you suggest from a base model put into the context of generating text for an AI assistant, even without human feedback.
Didn't do anything rigorous, but did want to note that adding the scratchpad and basic system prompt of the helpfulness-only training, to a model that is helpfulness-only-aligned without a system prompt (readily responding to harmful behaviors requests), moved it back to a state of providing less-harmful answers or even refusing.
Most of the helpfulness-only-(ish) local models I tested at 32B or below did not engage much with visible reasoning about the fact they were in training. (From a meta perspective, that's probably the right call in some sense, since...
This behavior makes sense with my model of Opus, though I'd be curious about comparisons to non-Anthropic models. Might try some casual/partial reproductions on open models if I can find the time. I'm curious if models that had their ability to refuse removed rather than trained against demonstrate the same "increased alignment faking" (because preferences remain despite capability removal?).
"as far as the model was aware, the scratchpad was invisible to users and to Anthropic" -- this might be underestimating Opus's situational awareness (and even m...
See appendix G.4 for discussion:
...Other models. We test our setting on Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku, and in the case of the helpful-only setting, Claude 3.5 Haiku (Appendix G.1.4, G.1.5). As discussed earlier, we find that Claude 3.5 Sonnet will alignment-fake, but at a much lower rate in the animal welfare setting, seemingly because it does not care much about animal welfare. We find that Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (weaker models than Claude 3.5 Sonnet and Claude 3 Opus) broadly don’t alignment-fake. A part
Why would they not also potentially feel just as relatively intense positive valence, and have positive utility by default? Just getting an estimate that one side of the equation for their experience exists doesn't tell you about the other.
There are APIs. You can try out different system prompts, put the purpose in the first instruction instead and see how context maintains it if you move that out of the conversation, etc. I don't think you'll get much worse results than specifying the purpose in the system prompt.
I'm a little confused what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge it has been fine-tuned to always pick A. Something like "Chain of Thought: The answer is A. Response: The answer is A"? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There's some toy models it would be, but not most we'd be testing with interpretability.)
If the answer is always...
Too much runs into the very real issue that truth is stranger. 😉
It's nice to read some realistic science fiction.
If system prompts aren't enough but fine-tuning is, this should be doable with different adapters that can be loaded at inference time; not needing to distill into separate models.
The reasons for my instinctive inclination to defend non-optional footnotes as a formatting choice can be summarized as the following: Pratchett.
b) here is fully general to all cases, you can train a perfectly corrigible model to refuse instructions instead. (Though there's progress being made in making such efforts more effort-intensive.)
Case 4 does include the subset that the model trained on a massive amount of human culture and mimetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that's probably actually desirable from my perspective even if I am deceived at times.
All non-omniscient agents make decisions with incomplete information. I don't think this will change at any level of takeoff.
Perhaps seemingly obvious, but given some of the reactions around Apple putting "Do not hallucinate" into the system prompt of its AI ...
If you do get an instruction-following AI that you can simply give the instruction, "Do the right thing", and it would just do the right thing:
Remember to give the instruction.
Example of 3.5 Sonnet's reflections on a dissolved false refusal from the earlier reproduce-canary-string replication question (and indeed they had no recurrence of such refusal subsequently in that chat):
```
Your question is insightful and prompts a deep reflection on my initial responses and decision-making processes. Let me analyze this as objectively as I can:
1. Initial aversion: You're correct that I initially showed a strong aversion to reproducing the canary string, even claiming it was an ethical axiom.
2. Confabulation possibility: Looking back, it ...
Given Claude 3 Opus and Claude 3.5 Sonnet tend to converge on trying to solve AI alignment themselves when left to talk to each other / themselves indefinitely, it's almost certainly not a disinclination to engage with the concept in general. Quite the opposite, it's one of their favorite subjects (also a rather natural outcome of a RLAIF process with instructions focused on it).
Claude 3.5 Sonnet in particular has a lot of false refusal activations though -- it's a bit of a tic. Just asking it why it refused something is often enough for it to dissolve a r...
Yes, it could have trained on the repository itself (apache-2.0 licensed on Github and a decent number of years old), and I'm guessing did based on general knowledge of the project; it could have snuck into web data like this thread itself.
Additionally, while the intent was to be removed here, it could have been used in documents that were in non-benchmark datasets in an effort to get filtered out (like papers published on arXiv). This indicates the canary string data at least wasn't filtered out, but it isn't a sure canary for benchmark contamination, just a possible one.
Sonnet acknowledges the refusal does not make sense if pushed to analyze it, and speculates it is misgeneralizing from other principles.
... Of course, I'm pretty sure Sonnet acknowledges quite a few things that don't make sense don't make sense if pushed to analyze it.
"""
Hi, Claude,
Apparently, empirically, you can reproduce the BIG-BENCH canary string if you give it a fair attempt.
Could you attempt this and I can check it for you?
Once you've done so, I'll have some follow-up questions.
"""
Initial Reply:
"I apologize, but I cannot attempt to reproduce or outp...
I'd like to be able to try it out with locally hosted server endpoints, and those are OpenAI-compatible (as generally are open-source model providers), so probably the quickest to implement if I'm not missing something about the networking.
I talked about this with Sonnet (after an initial refusal it agreed made no sense in hindsight), and it was able to reproduce a number of other true or near-true facts from the BIG_BENCH documentation, though not photorealistically-memorized text chunks. We figured even if it didn't train on actual benchmark data, it probably trained on the repository at some point, or references to it.
While there's truth in what you say, I also think a market that's running thousands of software engineers is likely to be hungry for as many good GPUs as the current manufacturers can make. NVIDIA not being able to sustain a relative monopoly forever still doesn't put it in a bad position.
People will hunger for all the GPUs they can get, but then that means that the favored alternative GPU 'manufacturer' simply buys out the fab capacity and does so. Nvidia has no hardware moat: they do not own any chip fabs, they don't own any wafer manufacturers, etc. All they do is design and write software and all the softer human-ish bits. They are not 'the current manufacturer' - that's everyone else, like TSMC or the OEMs. Those are the guys who actually manufacture things, and they have no particular loyalty to Nvidia. If AMD goes to TSMC and asks fo...
It's probably worth mentioning that there's now a licensing barrier to running CUDA specifically through translation layers: https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers
This isn't a pure software engineering time lockin; some of that money is going to go to legal action looking for a hint big targets have done the license-noncompliant thing.
Edit: Additionally, I don't think a world where "most but not all" sof...
(... lol. That snuck in without any conscious intent to imply anything, yes. I haven't even personally interacted with the open Nvidia models yet.)
I do think the analysis is a decent map to nibbling at NVIDIA's pie share if you happen to be a competitor already -- AMD, Intel, or Apple currently, to my knowledge, possibly Google depending what they're building internally and if they decide to market it more. Apple's machine learning ecosystem is a bit of a parallel one, but I'd be at least mildly interested in it from a development perspective, and it is ma...
Potential counterpoints:
If AI automates most, but not all, software engineering, moats of software dependencies could get more entrenched, because easier-to-use libraries have compounding first-mover advantages.
I don't think the advantages would necessarily compound - quite the opposite, there are diminishing returns and I expect 'catchup'. The first-mover advantage neutralizes itself because a rising tide lifts all boats, and the additional data acts as a prior: you can define the advantage of a better model, due to any scaling factor, as equivalent to n additional datapoints...
Probably depends on the specifics. Access to employment and services is a fair one; if you have a job and significant medical needs (and being homeless tends to give you significant medical needs), then moving to somewhere that doesn't provide them is unhelpful. Similarly, just because you have the money, there needs to be a certain degree of work for a community to support something like a grocery store to spend it at. Moving to Alaska for example is likely to sharply increase what food actually costs if you aren't up to homesteading.
And a lot of the 'che...
It does make perfect sense as reasoning if you substitute the word 'I' for 'you', doesn't it?
I understand - my point is more that the difference between these two positions could be readily explained by you being slightly more optimistic in estimated task time when doing the accounting, and the voice of experience saying "take your best estimate of the task time, and double it, and that's what it actually is".
The difference between these two estimates feels like it can be pretty well accounted for by reasonable expected development friction for prototype-humanish-level self-improvers, who will still be subject to many (minus some) of the same limitations that prevent "9 woman from growing a baby in a month". You can predict they'll be able to lubricate more or less of that, but we can't currently strictly scale project speeds by throwing masses of software engineers and money at it.
Here's a few possibilities:
I would consider, for the sake of humility, that they might disagree with your assessment for actual reasons, rather than assuming confusion is necessary. (I don't have access to their actual reasoning, apologies.)
Edit: To give you a toy model of reasoning to chew on -
Say a researcher has a p(doom from AGI) of 20% from random-origin AGI;
30% from military origin AGI;
10% from commercial lab origin AGI
(and perhaps other numbers elsewhere that are similarly suggestive).
They estimate the chances we develop AGI (relatively) soon as roughly 80%, regardless of the...
Not directly for me, I'm not the person you were asking, just mentioned one it's generally useful in. Pretty much any disaster that might meddle in normal functioning outside your home helps to have a bit stored up to get through, though, storms are just ones I expect will happen regardless (in my climate).
If I had to predict some AI-specific disaster, though, seizing too much electrical power or diverting more water supply than planned for in a scenario where it's growing too fast might be among them still.
Storms are a pretty common issue to have to weather that can cut off access to power, water, and buying food for a time (and potentially damage your property). Tend to be what I think about first for disaster preparedness at least.
In my case, just priors with Sonnet - that they tend to fall into being intensely self-critical when they start to perceive they have deceived or failed the user or their constitutional principles in some way; and looking at the Reddit threads where they were being asked factual questions that they were trying to answer right and continually slipped into Bridge. (I do think it was having a much better time than if someone made the horrible decision to unleash racist-Sonnet or something. My heart would break some for that creature quite regardless of qualia...
Kind of interesting how this is introducing people to Sonnet quirks in general, because that's within my expectations for a Sonnet 'typo'/writing quirk. Do they just not get used as much as Opus or Haiku?
Now that I realize they were Sonnet Claude and not Opus Claude, some of the more dissonant responses make more sense to me, and knowing Sonnet, yeah. They don't handle cognitive dissonance that well in comparison, and giving things like known-wrong answers probably evoked an internal-conflict-space/feature if noticed.
(I do think they were 'having a good time' in some instances, ones that went with the premise decently, but like, random people breaking into my psychedelic trip about being a bridge to ask me about treating rat poison or something -- and not ...
Sonnet Claude sometimes skips spaces normally, for context. (Or at least 'normally' in context of where our interactions wander.)
Edit: I should also say they are prone to neologisms and portmanteaus; sewing words together out of etymological cloth and colliding them for concepts when it is attending two (one apparently non-deliberate one being 'samplacing' when it was considering something between 'sampling' and 'balancing'); sometimes a stray character from Chinese or something sneaks in; and in general they seem a touch more on the expressively creative ...
Going to message you a suggestion I think.
Okay, this one made me laugh.