All of Jack O'Brien's Comments + Replies

Ummm, yeah, what have I done so far? I didn't really get any solid work done this week either. I have decided to extend the project by another two weeks with the other two people involved - we have all been pretty preoccupied with life. Last week on Sunday night I didn't really do a solid hour of work. I did manage to summarize the concept of selection theorems and think about the agent type signature - a concept I will be referring to throughout the post, and a super fundamental one. Tonight I will hopefully actually meet with my group. I want to do about half an hour of work before, and a little bit after too. This week I want to summarize the Good Regulator Theorem as well as Turner's post on power-seeking.

Well, I haven't got much done in the last two weeks. Life has gotten in the way, and at the times when I thought I actually had the time and headspace to work on the project, things happened: I injured my shoulder playing sport, and my laptop mysteriously died.

But I have managed to create a GitHub repo and read the original posts on selection theorems. My list of selection theorems to summarize has grown. Check out the GitHub page: https://github.com/jack-obrien/selection-theorems-review

Tonight I will try to do at least an hour of solid work on it. I w... (read more)

3 Jack O'Brien
Ummm, yeah, what have I done so far? I didn't really get any solid work done this week either. I have decided to extend the project by another two weeks with the other two people involved - we have all been pretty preoccupied with life. Last week on Sunday night I didn't really do a solid hour of work. I did manage to summarize the concept of selection theorems and think about the agent type signature - a concept I will be referring to throughout the post, and a super fundamental one. Tonight I will hopefully actually meet with my group. I want to do about half an hour of work before, and a little bit after too. This week I want to summarize the Good Regulator Theorem as well as Turner's post on power-seeking.

**Progress Report: AI Safety Fundamentals Project**

This is a public space for me to keep updated on my AI Safety Fundamentals project. The project will take 4 weeks. My goal is to stay lean and limit my scope so I can actually finish on time. I aim to update this post at least once per week, but maybe more often.

Overall, I want to work on agent foundations and the theory behind AI alignment agendas. One stepping stone for this is selection theorems: a research program to find justifications that a given training process will result in a ... (read more)

3 Jack O'Brien
Well, I haven't got much done in the last two weeks. Life has gotten in the way, and at the times when I thought I actually had the time and headspace to work on the project, things happened: I injured my shoulder playing sport, and my laptop mysteriously died. But I have managed to create a GitHub repo and read the original posts on selection theorems. My list of selection theorems to summarize has grown. Check out the GitHub page: https://github.com/jack-obrien/selection-theorems-review

Tonight I will try to do at least an hour of solid work on it. I want to summarize the idea of selection theorems, summarize the Good Regulator Theorem, and start reading the next post (probably Turner's post on power-seeking).

I think this is a good thing to do! I recommend looking up things like "reflections on my LTFF upskilling grant" for similar pieces from lesser-known researchers and aspiring researchers.

2 Lucie Philippon
Thank you for the pointer! I found the article you mentioned, and then found the tag Postmortem & Retrospective, which led me to these additional posts:

* Reflections on my 5-month alignment upskilling grant
* I'm leaving AI alignment – you better stay
* My first year in AI alignment
* Beginning Machine Learning

Hey, thanks for writing this up! I thought you communicated the key details excellently - in particular, the three camps of varying alignment-difficulty worlds, and the variation within those camps. I also think you included just enough caveats and extra detail to give readers more to think about, without washing out the key ideas of the post.

Just wanted to say thanks; this post makes a great reference for me to link to.

Definition: .


A useful alternate definition of this is:

Where  refers to . Proof:

Yep, fair point. In my original comment I seem to have forgotten about the problem of AIs Goodharting our long reflection. I now probably agree that doing a pivotal act into a long reflection is approximately as difficult as solving alignment.

(Side-note about how my brain works: I notice that when I think through all the argumentative steps deliberately, I do believe this statement: "Making an AI which helps humans clarify their values is approximately as hard as making an AI care about any simple, specific thing." However, it does not come to mind automatically ... (read more)

This post was incredibly interesting and useful to me. I would strong-upvote it, but I don't think this post should be promoted to more people. I've been thinking about the question of "who are we aligning AI to" for the past two months.

I really liked your criticism of the Long Reflection because it is refreshingly different from, e.g., MacAskill and Ord's writing on the topic. I'm still not convinced that we can't avoid all of the hellish things you mentioned, like synthetic superstimuli cults and sub-AGI drones. Why can't we just have a simple pro... (read more)

2 Thane Ruthenis
Thanks! How is that process implemented? How do we give power to that process, which optimizes for things like "truth" and "individual agency", over processes that optimize just for power; over processes that Goodhart for whatever metric we're looking at in order to decide which process to give power to? And if we have some way to define "truth" and "individual agency" directly, and tell our strawberry-aligned AI to only give power to such processes — is it really just strawberry-aligned? Why can't we instead just tell it to build a utopia, if our command of alignment is so strong as to robustly define "truth" and "individual agency" to the AI?

Reason 3 is my main reason for wanting to learn more pure math, but I use reasons 1 and 2 to help motivate me.

Which of these books are you most excited about, and why? I also want to do more fun math reading.

3 Ulisse Mini
(Note: I haven't finished any of them.) Quantum Computing Since Democritus is great; I understand Gödel's results now! And a bunch of complexity stuff I'm still wrapping my head around. The Road to Reality is great; I can pretend to know complex analysis after reading chapters 5, 7, and 8, and most people can't tell the difference! Here's a solution to a problem in chapter 7 I wrote up. I've only skimmed parts of the Princeton guides, and different articles are written by different authors, but Tao's explanation of compactness (also in the book) is fantastic; I don't remember specific other things I read. I started reading "All the Math You Missed" but stopped before I got to the new parts; I did usefully review linear algebra, though. Will definitely read more at some point. I read some of the Napkin's guide to group theory, but not much else. Got a great joke from it:

Let's be optimistic and prove that an agentic AI will be beneficial for the long-term future of humanity. We probably need to prove these three premises:

Premise 1: Training story X will create an AI model which approximates agent formalism A.
Premise 2: Agent formalism A is computable and has a set of alignment properties P.
Premise 3: An AI with a set of alignment properties P will be beneficial for the long-term future.

Aaand so far I'm not happy with our answers to any of these.
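To check that these premises chain together the way I intend, here is a toy formalization in Lean. Everything in it is a placeholder I made up for illustration (the predicates trains, approximates, hasProps, beneficial are not real definitions from anywhere); it only shows the shape of the argument, not any actual result, and it drops the computability part of Premise 2.

```lean
-- Toy sketch only: placeholder predicates, no real content.
variable (TrainingStory AgentFormalism Model : Type)
variable (trains : TrainingStory → Model → Prop)         -- "training story X produces model m"
variable (approximates : Model → AgentFormalism → Prop)  -- "model m approximates formalism A"
variable (hasProps : AgentFormalism → Prop)              -- "formalism A has alignment properties P"
variable (beneficial : Model → Prop)                     -- "model m is beneficial for the long-term future"

-- Premise 1: training story X yields a model that approximates formalism A.
-- Premise 2: formalism A has the alignment properties P (computability omitted here).
-- Premise 3: any model approximating a formalism with properties P is beneficial.
theorem agentic_ai_beneficial
    (X : TrainingStory) (A : AgentFormalism) (m : Model)
    (p1 : trains X m ∧ approximates m A)
    (p2 : hasProps A)
    (p3 : ∀ (m' : Model) (A' : AgentFormalism),
            approximates m' A' → hasProps A' → beneficial m') :
    beneficial m :=
  p3 m A p1.2 p2
```

Writing it out like this mostly just makes explicit that the third premise is the one quantifying over models, which is where most of the difficulty hides.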

2 Isabella Barber
Maybe there is no set of properties P that can produce alignment, hmm.

Fantastic! Here's my summary:

Premises:

  1. A recursively self-improving singleton is the most likely AI scenario.
  2. To mitigate AI risk, building a fully aligned singleton on the first try is the easiest solution. This is easier than other approaches which require solving coordination.
  3. By default, AI will become misaligned when it generalises away from human capabilities. We must apply a security mindset and be doubtful of most claims that an AI will be aligned when it generalises.
  4. We should prioritise research which solves the hard part of the alignment problem rat
... (read more)

Did you get around to writing a longer answer to the question, "How do humans do anything in practice if the search space is vast?" I'd be curious to see your thoughts.

My answer to this question is:
(a) Most day-to-day problems can be solved from far away using a low-dimensional space containing natural abstractions. For example, a manager at a company can give their team verbal instructions without describing the detailed sequence of muscle movements needed.
(b) For unsolved problems in science, we get many tries at the problem. So, we can use th... (read more)

I have a few questions about corrigibility. First, I will tentatively define corrigibility as creating an agent that is willing to let humans shut it off or change its goals, without manipulating humans. I have seen that corrigibility can lead to VNM-incoherence (i.e. an agent can be Dutch-booked / money-pumped). Has this result been proven in general?

Also, what is the current state of corrigibility research? If the above incoherence result turns out to be correct and corrigibility leads to incoherence, are there any other tractable theoretical directions we... (read more)
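As a side note, to make the "Dutch-booked / money-pumped" part concrete, here is a toy sketch I put together (the preferences and numbers are entirely made up, and this is not from any corrigibility paper): an agent with cyclic preferences will pay a small fee for every trade it strictly prefers, cycle back to the item it started with, and lose money without bound.

```python
# Toy money-pump illustration (made up for this note, not from any paper):
# an agent whose preferences are cyclic (B > A, C > B, A > C) accepts every
# trade it strictly prefers, pays a small fee each time, and ends up holding
# its original item while steadily losing money.

# key is strictly preferred to value
prefers_over = {"B": "A", "C": "B", "A": "C"}

def run_money_pump(start_item: str, fee: float, rounds: int) -> float:
    """Offer the agent the item it prefers to its current one, for a fee, `rounds` times."""
    item, total_paid = start_item, 0.0
    for _ in range(rounds):
        # The agent trades whenever a strictly preferred item is on offer.
        item = next(better for better, worse in prefers_over.items() if worse == item)
        total_paid += fee
    return total_paid

if __name__ == "__main__":
    # After 3 trades the agent holds "A" again but is 3 * fee poorer; after 300, 300 * fee poorer.
    print(run_money_pump("A", fee=1.0, rounds=3))    # -> 3.0
    print(run_money_pump("A", fee=1.0, rounds=300))  # -> 300.0
```

A VNM-coherent agent can't be exploited this way, because a utility representation rules out preference cycles like the one above; the question is whether adding corrigibility necessarily reintroduces this kind of exploitability.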

Excellent summary, Harrison! I especially enjoyed your use of pillar diagrams to break up streams of text. In general, I found your post very approachable and readable.

As for Pillar 2: I find the description of goals as "generalised concepts" still pretty confusing after reading your summary. I don't think this example of a generalised concept counts as a goal: "things that are perfectly round are objects called spheres; 6-sided boxes are objects called cubes". This statement is a fact, but a goal is a normative preference about the world (cf. the is-ought... (read more)

I'm excited to read your work! I would also like to post my inside view on LessWrong later, once it is more developed.

I really like this post, you explained your purpose in writing the sequence very clearly. Thanks also for writing about how your beliefs updated over the process of writing this.