All of Zoe Williams's Comments + Replies

Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters, eg. the “helpful assistant” character. The author broadly agrees, but notes that this solves neither outer (‘be careful what you wish for’) nor inner (‘you wished for it right, but the program you got had ulterior motives’) alignment.

They give an example of each failure case: 
For outer alignment, say researcher... (read more)

1Adam Scherlis
I think this is missing an important part of the post. I have subsections on (what I claim are) four distinct alignment problems:

  • Outer alignment for characters
  • Inner alignment for characters
  • Outer alignment for simulators
  • Inner alignment for simulators

This summary covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).

Post summary (feel free to suggest edits!):
‘Setting the Zero Point’ is a “Dark Art” ie. something which causes someone else’s map to stop matching the territory in a way that’s advantageous to you. It involves speaking in a way that takes for granted that the line between ‘good’ and ‘bad’ is at a particular point, without explicitly arguing for that. This makes changes between points below and above that line feel more significant.

As an example, many people draw a zero point between helping and not helping a child drowning in front of them. One is good, one is ba... (read more)

Currently it's all done manually, but the ChatGPT summaries are pretty decent; I'm looking into which types of posts it handles well.

Post summary (feel free to suggest edits!):
The author argues that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 

This is because there is good economic reason to have AIs ‘aim’ at certain outcomes - eg. we might want an AI that can accomplish goals such as ‘get me a TV for a great price’. Current methods train AIs to do this via trial and error, but because we ourselves are often misinformed, we can sometimes negatively reinforce truthful behavi... (read more)

2Raemon
Are these summaries from ChatGPT?

Post summary (feel free to suggest edits!):
Various people have proposed variants of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. This seems unpromising: convincing the AGI would be difficult, and even if it were convinced, bad outcomes could still follow.

Non-exhaustive list of how it could tell it’s in reality:

  1. Reality is large (eg. some things, such as access to large amounts of compute, couldn’t easily be spoofed)
  2. It’s the first place the AI’s history could show interaction with other complex s
... (read more)

Post summary (feel free to suggest edits!):
The author gives examples where their internal mental model suggested one conclusion, but a low-information heuristic like expert or market consensus differed, so they deferred. These included:

  • Valuing Theorem equity over Wave equity, despite Wave’s founders being very resourceful and adding users at a huge pace.
  • In the early days of Covid, dismissing it despite exponential growth and asymptomatic spread seeming intrinsically scary.

Another common case of this principle is assuming something won’t work in a particular... (read more)

Post summary (feel free to suggest edits!):
The Diplomacy AI got a handle on the basics of the game, but didn’t ‘solve it’. It mainly does well by avoiding common mistakes, eg. failing to communicate with victims (thus signaling intention), or forgetting the game ends after the year 1908. It also benefits from anonymity, one-shot games, short round limits etc.

Some things were easier than expected, eg. defining the problem space; communications being generic, simple, and quick enough to easily imitate and even surpass humans; no reputational or decision t... (read more)

Post summary (feel free to suggest edits!):
In chatting with ChatGPT, the author found it contradicted itself and its previous answers. For instance, it said that orange juice would be a good non-alcoholic substitute for tequila because both were sweet, but when asked if tequila was sweet it said it was not. When further quizzed, it apologized for being unclear and said “When I said that tequila has a "relatively high sugar content," I was not suggesting that tequila contains sugar.”

This behavior is worrying because the system has the capacity to produce co... (read more)

Post summary (feel free to suggest edits!):
Last year, the author wrote up a plan they gave a “better than 50/50 chance” of working before AGI kills us all. This predicted that in 4-5 years, the alignment field would progress from preparadigmatic (unsure of the right questions or tools) to having a general roadmap and toolset.

They believe this is on track and give 40% likelihood that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets - with interpretability on th... (read more)

2johnswentworth
You might mention that the prediction about the next 1-2 years is only at 40% confidence, and the "8-year" part is an inside view whose corresponding outside view estimate is more like 10-15 years.

Thanks for the feedback! I've passed it on.

It's mainly because we wanted to keep the episodes to ~20m, to make them easy for people to keep up with week to week - and the LW posts tended toward the more technical side, which doesn't translate as easily in podcast form (it can be hard to take in without the writing in front of you). We may do something for the LW posts in future though; unsure at this point.

Thanks, realized I forgot to add the description of the top / curated section - fixed. Everything in there occurs in its own section too.

Super interesting, thanks!

If you were running it again, you might want to think about standardizing the wording of the questions - it varies from 'will / is' to 'is likely' to 'plausible', and this can make it hard to compare between questions. 'Plausible' in particular is quite a fuzzy word: for some it might mean 1% or more, for others it might just mean it's not completely impossible / that if a movie had that storyline, they'd be okay with it.

1Sam Bowman
Fair. For better or worse, a lot of this variation came from piloting—we got a lot of nudges from pilot participants to move toward framings that were perceived as controversial or up for debate.

Good point, thank you - I've had a re-read of the conclusion and replaced the sentence with "Due to this, he concludes that climate change is still an important LT area - though not as important as some other global catastrophic risks (eg. biorisk), which outweigh it on both neglectedness and scale."

Originally I think I'd mistaken his position a bit based on this sentence: "Overall, because other global catastrophic risks are so much more neglected than climate change, I think they are more pressing to work on, on the margin." (and in addition I hadn't used the clearest phrasing). But the wider conclusion fits the new sentence better.