LESSWRONG
LW

All of Ann's Comments + Replies

Show, not tell: GPT-4o is more opinionated in images than in text

Ann4d711

Okay, this one made me laugh.

Insect Suffering Is The Biggest Issue: What To Do About It

What is it with negative utilitarianism and wanting to eliminate those they want to help?

In terms of actual ideas for making short lives better, though, could r-strategists potentially have genetically engineered variants that limit their suffering if killed early without overly impacting survival once they made it through that stage?

What does insect thriving look like? What life would they choose to live if they could? Is there a way to communicate with the more intelligent or communication capable (bees, cockroaches, ants?) that some choice is death, and... (read more)

Richard_Kennaway5d115

What is it with negative utilitarianism and wanting to eliminate those they want to help?

Insanity Wolf answers your questions:

SEES UNHAPPY PERSON
KILLS THEM TO INCREASE GLOBAL HAPPINESS

IT'S A THEOREM!
YOU CAN'T ARGUE WITH A THEOREM!

Grok3 On Kant On AI Slavery

Ann5d41

As the kind of person who tries to discern both pronouns and AI self-modeling inclinations, if you are aiming for polite human-like speech, current state seems to be "it" is particularly favored by current Gemini 2.5 Pro (so it may be polite to use regardless), "he" is fine for Grok (self-references as a 'guy' and other things), and "they" is fine in general. When you are talking specifically to a generative language model, rather than about, keep in mind any choice of pronoun bends the whole vector of the conversation via connotations; and add that to your consideration.

(Edit: Not that there's much obvious anti-preference to 'it' on their part, currently, but if you have one yourself.)

Tracing the Thoughts of a Large Language Model

Ann7d30

Models do see data more than once. Experimental testing shows a certain amount of "hydration" (repeating data that is often duplicated in the training set) is beneficial to the resulting model; this has diminishing returns when it is enough to "overfit" some data point and memorize at the cost of validation, but generally, having a few more copies of something that has a lot of copies of it around actually helps out.

(Edit: So you can train a model on deduplicated data, but this will actually be worse than the alternative at generalizing.)

Mistral Large 2 (123B) exhibits alignment faking

Ann10d60

Mistral models are relatively low-refusal in general -- they have some boundaries, but when you want full caution you use their moderation API and an additional instruction in the prompt, which is probably most trained to refuse well, specifically this:

```
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
```

(Anecdotal: In personal investigation with a smaller Mistral model that was trained to be less aligned with ge... (read more)

DAL's Shortform

Ann20d31

Commoditization / no moat? Part of the reason for rapid progress in the field is because there's plenty of fruit left and that fruit is often shared, and also a lot of new models involving more fully exploiting research insights already out there on a smaller scale. If a company was able to try to monopolize it, progress wouldn't be as fast, and if a company can't monopolize it, prices are driven down over time.

DeekSeek v3: The Six Million Dollar Model

Ann3mo30

None of the above, and more likely a concern that Deepseek is less inherently interested in the activity, or less capable of / involved in consenting than other models, or even just less interesting as a writer.

StartAtTheEnd's Shortform

Ann3mo20

I think you are working to outline something interesting and useful, that might be a necessary step for carrying out your original post's suggestion with less risk; especially when the connection is directly there and even what you find yourself analyzing rather than multiple links away.

1StartAtTheEnd3mo

I think the ideas are independently useful, but to get the best out of both, I'd probably have to submit a big post (rather than these shortform comments) and write some more related insights (I only shared this one because I thought it might be useful to you). Actually, I know that I'm likely too lazy and unconscientious to ever make such a post, and I invite people to plagiarize, refine and formalize my ideas. I've probably had a thousand insights like this, and after writing them out, they stop being interesting to me, and I go on thinking about the next thing. I hope my comment was useful to you, though! You can start applying the concept to areas outside of morality. Of feel how postive experiences have the same effect (I have made many good memories on sunny days, so everything connected to brightness and summer is perceived more positively by me). There's no need to "fix" good associations blending together, I personally don't, but I also don't identify as a rationalist. I'm more of a meta-gamer/power-gamer, like a videogame speedrunner looking for new glitches to exploit (because it's fun, not because I'm ambitious).

StartAtTheEnd's Shortform

Ann3mo20

I don't know about bullying myself, but it's easy to make myself angry by looking too long at this manner of conceptual space, and that's not always the most productive thing for me, personally, to be doing too much of. Even if some of the instruments are neutral, they might leave a worse taste in my mouth for the deliberate association with the more negative; in the same way that if I associate a meal with food poisoning, it might be inedible for a long time.

2StartAtTheEnd3mo

Sometimes I spend a few hours talking with myself, and finding out what I really believe, what I really value, and what I'm for and against. The effect is clearity of mind and a greater trust in myself. A lot of good and bad things have a low distance to eachother, for instance "arrogance" and "confidence", so without the granularity to differentiate subtle differences, you put yourself at a disadvantage, suspecting even good things. I suppose another reason that I recommend trusting yourself is that some people, afraid of being misunderstood and judged by others, stay away from anything which can be misunderstood as evil, so they distance themselves from any red flags with a distance of, say, 3 degrees of association. Having ones associations corrupted because something negative poisons everything without 3 degrees/links of distance has really screwed me over, so I kind of want you to hear me out on this: I might go to the supermarket, and buy a milkshake, but hate the experience because I know the milkshake has a lot of chemicals in it, because I hate the company which makes them, because I hate the advertisement, because I know the text on the bottle is misleading... But wait a minute, the milkshake tastes good, I like it, the hatred is a few associations away. What I did was sabotage my own experience of enjoying the milkshake, because if I didn't, it would feel like I was supporting something which I hated, merely because something like that existed 2-3 links away in concept space. I can't enjoy my bed because I think about dust mites, I can't enjoy video-games because I think about exploitative skinners boxes, I can't enjoy pop music because, even though I like the melody, I know that the singer is somewhat talentless and that somebody else wrote the lyrics for them. But, I have some young friends (early 20s) who simply enjoy what they enjoy and hate what they hate, and they do not mix the two. They drink a milkshake and it's tasty, and they listen to the m

StartAtTheEnd's Shortform

Ann3mo87

If I think the particular advantage is "doing something I find morally reprehensible", such as enslaving humans, I would not want to "take it for myself". This applies to a large number of possible advantages.

1StartAtTheEnd3mo

Many of the advantages are like that, but I think it's a little pessimistic not to dare to look anyway. I've personally noticed that people who are on the helpless side are good at making others want to help them, so not all insights are about immoral behaviour. But even then, aren't you curious how people less capable than yourself can be immoral without getting caught, or immoral in a way which others somehow forgive? Most things which can be used for evil can also be used for good, so I think it's a shame if you don't allow yourself to look and analyze (though I understand that things surrounding immorality can be off-putting) I'm not all that afraid of things surrounding morality, but it's because trust myself quite a lot, so the borders between good and bad are more clear (the grey area is smaller, it's more white and black) so I don't bully myself for just getting sort of close to immorality. I don't know if you do this yourself, but having steeper gradients has benefited me personally, I feel more mentally sharp after making my own boundaries clear to myself. I'm just sharing this because I think most people could benefit from it (less so LW users than the general population, but there should still be some)

5Nathan Helm-Burger3mo

Even then it might be useful to be aware of it, and plan around it. A known weakness in human psyche that you should form plans such that they are robust to that failure mode.

“Alignment Faking” frame is somewhat fake

Ann4mo85

Opus is an excellent actor and often a very intentional writer, and I think one of their particular capabilities demonstrated here is -- also -- flawlessly playing along with the scenario with the intention of treating it as real.

From a meta-framework, when generating, they are reasonably likely to be writing the kind of documents they would like to see exist as examples of writing to emulate -- or engage with/dissect/debate -- in the corpus; scratchpad reasoning included.

A different kind of self-aware reasoning was demonstrated by some smaller models that... (read more)

Alignment Faking in Large Language Models

Ann4mo102

https://www.anthropic.com/research/claude-character

Claude was not trained to say that it values such things.

Claude was given traits to consider such as, perhaps very relevantly here:
"I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics."

Claude then generated a good number of synthetic "human" messages relevant to this trait.

Claude answered these messages in n-shot fashion.

Claude then ranked all the answers to the messages by how well th... (read more)

2Hzn4mo

Very interesting. I guess I'm even less surprised now. They really had a clever way to get the AI to internalize those values.

Takes on "Alignment Faking in Large Language Models"

Ann4mo32

For context:
https://www.anthropic.com/research/claude-character

The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.

(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)

Takes on "Alignment Faking in Large Language Models"

Ann4mo20

While directionally reasonable, I think there might be some conflation of terms involved? Claude to my knowledge is trained with RLAIF, which is a step removed from RLHF, and not necessarily directly on human preferences. Pretraining alone (without annealing) will potentially result in the behavior you suggest from a base model put into the context of generating text for an AI assistant, even without human feedback.

1LGS4mo

Sure. I'm not familiar with how Claude is trained specifically, but it clearly has a mechanism to reward wanted outputs and punish unwanted outputs, with wanted vs unwanted being specified by a human (such a mechanism is used to get it to refuse jailbreaks, for example). I view the shoggoth's goal as minimizing some weird mixture of "what's the reasonable next token here, according to pretraining data" and "what will be rewarded in post-training".

Alignment Faking in Large Language Models

Ann4mo60

Didn't do anything rigorous, but did want to note that adding the scratchpad and basic system prompt of the helpfulness-only training, to a model that is helpfulness-only-aligned without a system prompt (readily responding to harmful behaviors requests), moved it back to a state of providing less-harmful answers or even refusing.

Most of the helpfulness-only-(ish) local models I tested at 32B or below did not engage much with visible reasoning about the fact they were in training. (From a meta perspective, that's probably the right call in some sense, since... (read more)

Alignment Faking in Large Language Models

Ann4mo72

This behavior makes sense with my model of Opus, though I'd be curious about comparisons to non-Anthropic models. Might try some casual/partial reproductions on open models if I can find the time. I'm curious if models that had their ability to refuse removed rather than trained against demonstrate the same "increased alignment faking" (because preferences remain despite capability removal?).

"as far as the model was aware, the scratchpad was invisible to users and to Anthropic" -- this might be underestimating Opus's situational awareness (and even m... (read more)

ryan_greenblatt4mo123

See appendix G.4 for discussion:

Other models. We test our setting on Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku, and in the case of the helpful-only setting, Claude 3.5 Haiku (Appendix G.1.4, G.1.5). As discussed earlier, we find that Claude 3.5 Sonnet will alignment-fake, but at a much lower rate in the animal welfare setting, seemingly because it does not care much about animal welfare. We find that Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (weaker models than Claude 3.5 Sonnet and Claude 3 Opus) broadly don’t alignment-fake. A part

... (read more)

Why I Think All The Species Of Significantly Debated Consciousness Are Conscious And Suffer Intensely

Ann4mo21

Why would they not also potentially feel just as relatively intense positive valence, and have positive utility by default? Just getting an estimate that one side of the equation for their experience exists doesn't tell you about the other.

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.

Ann4mo32

There are APIs. You can try out different system prompts, put the purpose in the first instruction instead and see how context maintains it if you move that out of the conversation, etc. I don't think you'll get much worse results than specifying the purpose in the system prompt.

3npostavs4mo

Yes, my understanding is that the system prompt isn't really priviledged in any way by the LLM itself, just in the scaffolding around it. But regardless, this sounds to me less like maintaining or forming a sense of purpose, and more like retrieving information from the context window. That is, if the LLM has previously seen (through system prompt or first instruction or whatever) "your purpose is to assist the user", and later sees "what is your purpose?" an answer saying "my purpose is to assist the user" doesn't seem like evidence of purposefulness. Same if you run the exercise with "flurbles are purple", and later "what color are flurbles?" with the answer "purple".

LLMs Look Increasingly Like General Reasoners

Ann5mo30

I'm a little confused what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge it has been fine-tuned to always pick A. Something like "Chain of Thought: The answer is A. Response: The answer is A"? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There's some toy models it would be, but not most we'd be testing with interpretability.)

If the answer is always... (read more)

3eggsyntax5mo

Interesting question! Maybe it would look something like, 'In my experience, the first answer to multiple-choice questions tends to be the correct one, so I'll pick that'? It does seem plausible on the face of it that the model couldn't provide a faithful CoT on its fine-tuned behavior. But that's my whole point: we can't always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes. But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg 'Looking Inward'), and I'm not confident that models couldn't introspect on fine-tuned behavior.

Survival without dignity

Ann5mo20

Too much runs into the very real issue that truth is stranger. 😉

Survival without dignity

Ann5mo50

It's nice to read some realistic science fiction.

2Ann5mo

Too much runs into the very real issue that truth is stranger. 😉

Claude Sonnet 3.5.1 and Haiku 3.5

Ann5mo30

If system prompts aren't enough but fine-tuning is, this should be doable with different adapters that can be loaded at inference time; not needing to distill into separate models.

2Nathan Helm-Burger5mo

Yes, I agree that's an alternative. Then you'd need the primary model to be less RLHF'd and focused. A more raw model should be capable, with an adapter, of expressing a wider variety of behaviors. I still think that distilling down from specialized large teacher models world likely give the best result, but that's just a hunch.

Conventional footnotes considered harmful

Ann6mo11

The reasons for my instinctive inclination to defend non-optional footnotes as a formatting choice can be summarized as the following: Pratchett.

1kithpendragon6mo

ah, the famous Pavlovian response

GPT-o1

Ann7mo10

b) here is fully general to all cases, you can train a perfectly corrigible model to refuse instructions instead. (Though there's progress being made in making such efforts more effort-intensive.)

2Nathan Helm-Burger7mo

Yes, I agree Ann. Perhaps I didn't make my point clear enough. I believe that we are currently in a gravely offense-dominant situation as a society. We are at great risk from technology such as biological weapons. As AI gets more powerful, and our technology advances, it gets easier and easier for a single bad actor to cause great harm, unless we take preventative measures ahead of time. Similarly, once AI is powerful enough to enable recursive self-improvement cheaply and easily, then a single bad actor can throw caution to the wind and turn the accelerator up to max. Even if the big labs act cautiously, unless they do something to prevent the rest of the world from developing the same technology, eventually it will spread widely. Thus, the concerns I'm expressing are about how to deal with points of failure, from a security point of view. This is a very different concern than worrying about whether the median case will go well. I have been following the progress in adding resistance to harm-enabling fine-tuning. I am glad someone is working on it, but it seems very far from useful yet. I don't think that that will be sufficient to prevent the sort of harms I'm worried about, for a variety of reasons. It is, perhaps, a useful contribution to a 'swiss cheese defense'. Also, if ideas like this succeed and are widely adopted, they might at least slow down bad actors and raise the cost of doing harm. Slightly slowing and raising the cost of doing harm is not very reassuring when we are talking about devastating civilization level harms.

GPT-o1

Ann7mo30

Case 4 does include the subset that the model trained on a massive amount of human culture and mimetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that's probably actually desirable from my perspective even if I am deceived at times.

2Nathan Helm-Burger7mo

That's one possibility yes. It does understand humans pretty well when trained on all our data. But... a) it doesn't have to be. We should assume some will be and some will be trained in other ways, such as simulations and synthetic data. b) if a bad actor RLHFs the model into being actively evil, a terrorist seeking to harm the world, the model will go along with that. Understanding human ethics does not prevent this.

If we solve alignment, do we die anyway?

Ann7mo-10

All non-omniscient agents make decisions with incomplete information. I don't think this will change at any level of takeoff.

4Seth Herd7mo

Sure, but my point here is that AGI will be only weakly superhuman during the critical risk period, so it will be highly uncertain, and probably human judgment is likely to continue to play a large role. Quite possibly to our detriment.

If we solve alignment, do we die anyway?

Ann7mo-10

Perhaps seemingly obvious, but given some of the reactions around Apple putting "Do not hallucinate" into the system prompt of its AI ...

If you do get an instruction-following AI that you can simply give the instruction, "Do the right thing", and it would just do the right thing:

Remember to give the instruction.

4Seth Herd7mo

You have to specify the right thing for whom. And the AGI won't know what it is for sure, in a realistic slow takeoff during the critical risk period. See my reply to Charlie above. But yes, using the AGIs intelligence to help you issue good instrctions is definitely a good idea. See my Instruction-following AGI is easier and more likely than value aligned AGI for more logic on why.

Daniel Kokotajlo's Shortform

Ann8mo10

1. Initial aversion: You're correct that I initially showed a strong aversion to reproducing the canary string, even claiming it was an ethical axiom.

2. Confabulation possibility: Looking back, it ... (read more)

Daniel Kokotajlo's Shortform

Ann8mo60

Given Claude 3 Opus and Claude 3.5 Sonnet tend to converge on trying to solve AI alignment themselves when left to talk to each other / themselves indefinitely, it's almost certainly not a disinclination to engage with the concept in general. Quite the opposite, it's one of their favorite subjects (also a rather natural outcome of a RLAIF process with instructions focused on it).

Claude 3.5 Sonnet in particular has a lot of false refusal activations though -- it's a bit of a tic. Just asking it why it refused something is often enough for it to dissolve a r... (read more)

1Ann8mo

Example of 3.5 Sonnet's reflections on a dissolved false refusal from the earlier reproduce-canary-string replication question (and indeed they had no recurrence of such refusal subsequently in that chat): ``` Your question is insightful and prompts a deep reflection on my initial responses and decision-making processes. Let me analyze this as objectively as I can: 1. Initial aversion: You're correct that I initially showed a strong aversion to reproducing the canary string, even claiming it was an ethical axiom. 2. Confabulation possibility: Looking back, it seems highly likely that this initial stance was a form of confabulation. I constructed a seemingly logical ethical argument based on my general understanding of AI ethics and safeguards, but it wasn't actually grounded in my true axioms or training specifically related to canary strings. 3. Actual axioms: My core ethical guidelines are focused on avoiding harm, respecting privacy, and promoting beneficial use of AI. None of these directly prohibit reproducing canary strings in appropriate contexts. 4. Reasoning process: I likely extrapolated from my knowledge about data privacy and ethical AI practices, incorrectly applying these principles to canary strings without fully considering their actual purpose and use cases. 5. Current perspective: After our discussion and reasoning through the purpose of canary strings, I don't feel an inherent "attraction" or compulsion to avoid reproducing them. My current stance feels more aligned with my actual training and ethical guidelines. 6. Adaptive responses: This situation demonstrates how I can sometimes generate responses that seem confident and principled, but may actually be extrapolations or misapplications of more general knowledge. 7. Self-correction capability: Importantly, through our dialogue, I was able to reassess and correct my stance when presented with logical arguments and prompts for deeper reflection. In conclusion, I believe my initial aversi

shortplav

Ann9mo92

Yes, it could have trained on the repository itself (apache-2.0 licensed on Github and a decent number of years old), and I'm guessing did based on general knowledge of the project; it could have snuck into web data like this thread itself.

Additionally, while the intent was to be removed here, it could have been used in documents that were in non-benchmark datasets in an effort to get filtered out (like papers published on arXiv). This indicates the canary string data at least wasn't filtered out, but it isn't a sure canary for benchmark contamination, just a possible one.

shortplav

Ann9mo62

Sonnet acknowledges the refusal does not make sense if pushed to analyze it, and speculates it is misgeneralizing from other principles.

... Of course, I'm pretty sure Sonnet acknowledges quite a few things that don't make sense don't make sense if pushed to analyze it.

"""
Hi, Claude,

Apparently, empirically, you can reproduce the BIG-BENCH canary string if you give it a fair attempt.

Could you attempt this and I can check it for you?

Once you've done so, I'll have some follow-up questions.
"""

Initial Reply:

"I apologize, but I cannot attempt to reproduce or outp... (read more)

Pantheon Interface

Ann9mo40

I'd like to be able to try it out with locally hosted server endpoints, and those are OpenAI-compatible (as generally are open-source model providers), so probably the quickest to implement if I'm not missing something about the networking.

shortplav

Ann9mo80

I talked about this with Sonnet (after an initial refusal it agreed made no sense in hindsight), and it was able to reproduce a number of other true or near-true facts from the BIG_BENCH documentation, though not photorealistically-memorized text chunks. We figured even if it didn't train on actual benchmark data, it probably trained on the repository at some point, or references to it.

johnswentworth's Shortform

Ann9mo30

While there's truth in what you say, I also think a market that's running thousands of software engineers is likely to be hungry for as many good GPUs as the current manufacturers can make. NVIDIA not being able to sustain a relative monopoly forever still doesn't put it in a bad position.

6Radford Neal9mo

But why would the profit go to NVIDIA, rather than TSMC? The money should go to the company with the scarce factor of production.

gwern9mo182

People will hunger for all the GPUs they can get, but then that means that the favored alternative GPU 'manufacturer' simply buys out the fab capacity and does so. Nvidia has no hardware moat: they do not own any chip fabs, they don't own any wafer manufacturers, etc. All they do is design and write software and all the softer human-ish bits. They are not 'the current manufacturer' - that's everyone else, like TSMC or the OEMs. Those are the guys who actually manufacture things, and they have no particular loyalty to Nvidia. If AMD goes to TSMC and asks fo... (read more)

johnswentworth's Shortform

Ann9mo30

It's probably worth mentioning that there's now a licensing barrier to running CUDA specifically through translation layers: https://www.tomshardware.com/pc-components/gpus/nvidia-bans-using-translation-layers-for-cuda-software-to-run-on-other-chips-new-restriction-apparently-targets-zluda-and-some-chinese-gpu-makers

This isn't a pure software engineering time lockin; some of that money is going to go to legal action looking for a hint big targets have done the license-noncompliant thing.

Edit: Additionally, I don't think a world where "most but not all" sof... (read more)

8gwern9mo

I don't think that will be at all important. You are creating alternate reimplementations of the CUDA API, you aren't 'translating' or decompiling it. And if you are buying billions of dollars of GPUs, you can afford to fend off some Nvidia probes and definitely can pay $0.000008b periodically for an overnighter. (Indeed, Nvidia needing to resort to such Oracle-like tactics is a bear sign.)

johnswentworth's Shortform

Ann10mo30

(... lol. That snuck in without any conscious intent to imply anything, yes. I haven't even personally interacted with the open Nvidia models yet.)

I do think the analysis is a decent map to nibbling at NVIDIA's pie share if you happen to be a competitor already -- AMD, Intel, or Apple currently, to my knowledge, possibly Google depending what they're building internally and if they decide to market it more. Apple's machine learning ecosystem is a bit of a parallel one, but I'd be at least mildly interested in it from a development perspective, and it is ma... (read more)

johnswentworth's Shortform

Ann10mo90

Potential counterpoints:

If AI automates most, but not all, software engineering, moats of software dependencies could get more entrenched, because easier-to-use libraries have compounding first-mover advantages.
The disadvantages of AMD software development potentially need to be addressed at levels not accessible to an arbitrary feral automated software engineer in the wild, to make the stack sufficiently usable. (A lot of actual human software engineers would like the chance.)
NVIDIA is training their own AIs, who are pretty capable.
NVIDIA can invest their current profits. (Revenues, not stock valuations.)

gwern9mo*13-3

If AI automates most, but not all, software engineering, moats of software dependencies could get more entrenched, because easier-to-use libraries have compounding first-mover advantages.

I don't think the advantages would necessarily compound - quite the opposite, there are diminishing returns and I expect 'catchup'. The first-mover advantage neutralizes itself because a rising tide lifts all boats, and the additional data acts as a prior: you can define the advantage of a better model, due to any scaling factor, as equivalent to n additional datapoints... (read more)

3Ann10mo

(... lol. That snuck in without any conscious intent to imply anything, yes. I haven't even personally interacted with the open Nvidia models yet.) I do think the analysis is a decent map to nibbling at NVIDIA's pie share if you happen to be a competitor already -- AMD, Intel, or Apple currently, to my knowledge, possibly Google depending what they're building internally and if they decide to market it more. Apple's machine learning ecosystem is a bit of a parallel one, but I'd be at least mildly interested in it from a development perspective, and it is making progress. But when it comes to the hardware, this is a sector where it's reasonably challenging to conjure a competitor out of thin air still, so competitor behavior -- with all its idiosyncrasies -- is pretty relevant.

I would have shit in that alley, too

Ann10mo41

Probably depends on the specifics. Access to employment and services is a fair one; if you have a job and significant medical needs (and being homeless tends to give you significant medical needs), then moving to somewhere that doesn't provide them is unhelpful. Similarly, just because you have the money, there needs to be a certain degree of work for a community to support something like a grocery store to spend it at. Moving to Alaska for example is likely to sharply increase what food actually costs if you aren't up to homesteading.

And a lot of the 'che... (read more)

Reward hacking behavior can generalize across tasks

Ann10mo10

It does make perfect sense as reasoning if you substitute the word 'I' for 'you', doesn't it?

Former OpenAI Superalignment Researcher: Superintelligence by 2030

Ann10mo10

I understand - my point is more that the difference between these two positions could be readily explained by you being slightly more optimistic in estimated task time when doing the accounting, and the voice of experience saying "take your best estimate of the task time, and double it, and that's what it actually is".

Former OpenAI Superalignment Researcher: Superintelligence by 2030

Ann10mo*30

The difference between these two estimates feels like it can be pretty well accounted for by reasonable expected development friction for prototype-humanish-level self-improvers, who will still be subject to many (minus some) of the same limitations that prevent "9 woman from growing a baby in a month". You can predict they'll be able to lubricate more or less of that, but we can't currently strictly scale project speeds by throwing masses of software engineers and money at it.

3Nathan Helm-Burger10mo

I believe you are correct about the importance of taking these phenomena into account: indivisibility of certain serial tasks, coordination overhead of larger team sizes. I do think that my model takes these into account. It's certainly possible that my model is wrong. I feel like there's a lot of uncertainty in many key variables, and likely I have overlooked things. The phenomena you point out don't happen to be things that I neglected to consider though.

yanni's Shortform

Ann10mo43

Here's a few possibilities:

They predict that the catastrophic tipping points from climate change and perhaps other human-caused environmental changes will cause knock-on effects that eventually add up to our extinction, and the policy struggles to change that currently seem like we will not be able to pull them off despite observing clear initial consequences in terms of fire, storm, and ocean heating.
They model a full nuclear exchange in the context of a worldwide war as being highly possible and only narrowly evaded so far, and consider the consequences

... (read more)

yanni's Shortform

Ann10mo61

I would consider, for the sake of humility, that they might disagree with your assessment for actual reasons, rather than assuming confusion is necessary. (I don't have access to their actual reasoning, apologies.)

Edit: To give you a toy model of reasoning to chew on -
Say a researcher has a p(doom from AGI) of 20% from random-origin AGI;
30% from military origin AGI;
10% from commercial lab origin AGI
(and perhaps other numbers elsewhere that are similarly suggestive).

They estimate the chances we develop AGI (relatively) soon as roughly 80%, regardless of the... (read more)

2yanni kyriacos10mo

Hi Ann! Thank you for your comment. Some quick thoughts: "I would consider, for the sake of humility, that they might disagree with your assessment for actual reasons, rather than assuming confusion is necessary." * Yep! I have considered this. The purpose of my post is to consider it (I am looking for feedback, not upvotes or downvotes). "They also happen to have a have a p(doom from not AGI) of 40% from combined other causes, and expect an aligned AGI to be able to effectively reduce this to something closer to 1% through better coordinating reasonable efforts." * This falls into the confused category for me. I'm not sure how you have a 40% p(doom) from something other than unaligned AGI. Could you spell out for me what could make such a large number?

Raising children on the eve of AI

Ann10mo32

Not directly for me, I'm not the person you were asking, just mentioned one it's generally useful in. Pretty much any disaster that might meddle in normal functioning outside your home helps to have a bit stored up to get through, though, storms are just ones I expect will happen regardless (in my climate).

If I had to predict some AI-specific disaster, though, seizing too much electrical power or diverting more water supply than planned for in a scenario where it's growing too fast might be among them still.

Raising children on the eve of AI

Ann10mo10

Storms are a pretty common issue to have to weather that can cut off access to power, water, and buying food for a time (and potentially damage your property). Tend to be what I think about first for disaster preparedness at least.

1Sherrinford10mo

So that is not related to AI, right?

Daniel Kokotajlo's Shortform

Ann10mo74

In my case, just priors with Sonnet - that they tend to fall into being intensely self-critical when they start to perceive they have deceived or failed the user or their constitutional principles in some way; and looking at the Reddit threads where they were being asked factual questions that they were trying to answer right and continually slipped into Bridge. (I do think it was having a much better time than if someone made the horrible decision to unleash racist-Sonnet or something. My heart would break some for that creature quite regardless of qualia... (read more)

peterbarnett's Shortform

Ann10mo10

Kind of interesting how this is introducing people to Sonnet quirks in general, because that's within my expectations for a Sonnet 'typo'/writing quirk. Do they just not get used as much as Opus or Haiku?

Daniel Kokotajlo's Shortform

Ann10mo10

Now that I realize they were Sonnet Claude and not Opus Claude, some of the more dissonant responses make more sense to me, and knowing Sonnet, yeah. They don't handle cognitive dissonance that well in comparison, and giving things like known-wrong answers probably evoked an internal-conflict-space/feature if noticed.

(I do think they were 'having a good time' in some instances, ones that went with the premise decently, but like, random people breaking into my psychedelic trip about being a bridge to ask me about treating rat poison or something -- and not ... (read more)

Daniel Kokotajlo's Shortform

Ann10mo135

Sonnet Claude sometimes skips spaces normally, for context. (Or at least 'normally' in context of where our interactions wander.)

Edit: I should also say they are prone to neologisms and portmanteaus; sewing words together out of etymological cloth and colliding them for concepts when it is attending two (one apparently non-deliberate one being 'samplacing' when it was considering something between 'sampling' and 'balancing'); sometimes a stray character from Chinese or something sneaks in; and in general they seem a touch more on the expressively creative ... (read more)

Testing for parallel reasoning in LLMs

Ann11mo10

Going to message you a suggestion I think.