I think the website just links back to this blog post? Is that intentional?
Edit: I also think the application link requires access before seeing the form?
Second Edit: Seems fixed now! Thanks!
Maybe I should also expand on what the "AI agents are submitting the programs themselves subject to your approval" scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don't have to specify this preorder up front. You just have to be able to evaluate it and the debaters have to be able to guess how it will evaluate.
If you already have a specific simulation program in mind then you can define the preorder as follows: if you're handed two programs which are exact copies of your simulation software ...
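As a crude illustration of "you only need to be able to evaluate the preorder, not write it down up front" — this is a hypothetical sketch of my own, assuming the simplest possible rule where exact copies of a trusted reference simulator outrank everything else:

```python
# Hypothetical judge-side check (my own illustration, not the definition above):
# programs that are byte-for-byte copies of a trusted reference simulator are
# preferred to everything else; any other pair is left incomparable.

def prefer(candidate_a: bytes, candidate_b: bytes, reference: bytes) -> str:
    """Return which of two submitted programs the judge prefers, if either."""
    a_trusted = candidate_a == reference
    b_trusted = candidate_b == reference
    if a_trusted and not b_trusted:
        return "prefer A"
    if b_trusted and not a_trusted:
        return "prefer B"
    return "incomparable"  # the judge never had to write the full ordering down in advance
```

The point is just that the debaters have to anticipate how a check like this will come out; the judge never needs to publish the ordering itself.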
These are all excellent points! I agree that these could be serious obstacles in practice. I do think there are some counter-measures available, though.
I think the easiest issue to handle is the occasional random failure, e.g. your "giving the wrong answer on exact powers of 1000" example. I would probably approach it by looking at stochastic models of computation, e.g. probabilistic Turing machines. You'd need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic...
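To make the probabilistic-Turing-machine idea a bit more concrete, here's a toy sketch of my own (not anything from the original thread) of the standard amplification trick: repeat independent runs of a flaky simulator and take a majority vote, which drives the failure probability down exponentially.

```python
# Toy illustration of error amplification for a flaky yes/no simulator:
# if each independent run is correct with probability 1/2 + delta, the
# majority vote over many runs is wrong with probability that shrinks
# exponentially in the number of runs (Chernoff bound).

import random
from collections import Counter

def flaky_is_even(x: int, error_rate: float = 0.1) -> bool:
    """Hypothetical simulator: decides whether x is even, but errs 10% of the time."""
    correct = (x % 2 == 0)
    return correct if random.random() > error_rate else not correct

def amplified_is_even(x: int, runs: int = 51) -> bool:
    votes = Counter(flaky_is_even(x) for _ in range(runs))
    return votes[True] > votes[False]

print(amplified_is_even(8))  # True with overwhelming probability
```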
Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it's reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions.
Also, is there some prototypical example of a "tough real world question" y...
My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the "AGIs act much like a colonizing civilization" situation does scare me because it's the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI.
Totalitarian dictatorship
I'm unclear why this risk is specific to multipolar scenarios? Even if you have a single AGI/ASI you could end up with a totalitarian dictatorship, no? In fact I would imagine that having multiple AGIs/ASIs would mitigate this risk: optimistically, every domestic actor in possession of an AGI/ASI should be counterbalanced by another domestic actor with divergent interests also in possession of an AGI/ASI.
I actually think multipolar scenarios are less dangerous than having a single superintelligence. Watching the AI arms race remain multipolar has actually been one of the biggest factors in my P(doom) declining recently. I believe that maintaining a balance of power at all times is key and that humanity's best chance for survival is to ensure that, for any action humanity wishes to take, there is some superintelligence that would benefit from this action and which would be willing to defend it. This intuition is largely based on examples from human history ...
Something about AI Safety via Debate that I should have realized ages ago but only recognized recently: I usually see theoretical studies of debate through the lens of determining the output of a fixed Turing machine (or stochastic Turing machine or some other computational model), e.g. https://arxiv.org/abs/2311.14125. This formalization doesn't capture the kinds of open-ended questions I usually think about. For example, suppose I'm looking at the blueprint of a submarine and I want to know whether it will be watertight at 1km depth. Suppose I have a physics ...
Has anyone made a good, easy-to-use interface for implementing debate / control protocols in deployment? For example, maybe I want to get o12345 to write some code in my codebase but I don't trust it. I'd like to automatically query Sonnet 17.5 with the code o12345 returned and ask if anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training, but it still feels useful during deployment as an end-user, especially between models from different labs, which might have internalized different objective functions and hence be less likely to collude.
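Roughly, the wrapper I'm imagining looks like this (purely hypothetical interfaces — no real API calls, and the model names are just placeholders):

```python
# Sketch of a deployment-time wrapper: an untrusted model writes the code,
# a trusted model from a different lab reviews it, and a debate protocol is
# only spun up when the reviewer flags something.

from typing import Callable, Optional, Tuple

def guarded_codegen(
    task: str,
    untrusted_generate: Callable[[str], str],    # hypothetical wrapper around "o12345"
    trusted_review: Callable[[str, str], bool],  # hypothetical wrapper around "Sonnet 17.5"; True means "looks fishy"
    run_debate: Callable[[str, str], str],       # hypothetical debate protocol; returns a transcript
) -> Tuple[str, Optional[str]]:
    """Generate code with the untrusted model; escalate to debate only if the trusted reviewer objects."""
    code = untrusted_generate(task)
    if trusted_review(task, code):
        transcript = run_debate(task, code)  # the end-user reads this before accepting the code
        return code, transcript
    return code, None
```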
When you talk about "other sources of risk from misalignment," these sound like milder / easier-to-tackle versions of the assumptions you've listed? Your assumptions sound like they focus on the worst case scenario. If you can solve the harder version then I would imagine that the easier version would also be solved, no?
I agree, though I think it would be a very ridiculous own-goal if e.g. GPT-4o decided to block a whistleblowing report about OpenAI because it was trained to serve OpenAI's interests. I think any model used by this kind of whistleblowing tool should be open-source (nothing fancy / more dangerous than what's already out there), run locally by the operators of the tool, and tested to make sure it doesn't block legitimate posts.
My gut instinct is that this would have been a fantastic thing to create 2-4 years ago. My biggest hesitation is that the probability a tool like this decreases existential risk is proportional to the fraction of lab researchers who know about it, and adoption can be a slow, hard thing to make happen. I still think that this kind of program could be incredibly valuable under the right circumstances, so someone should probably be working on this.
Also, I have a very amateurish security question: if someone provides their work email to verify their authenticit...
I guess I'm a bit confused where o3 comes into this analysis. This discussion appears to be focused on base models to me? Is data really the bottleneck these days for o-series-type advancements? I thought that compute available to do RL in self-play / CoT / long-time-horizon-agentic-setups would be a bigger consideration.
Edit: I guess upon another reading this article seems like an argument against AI capabilities hitting a plateau in the coming years, whereas the o3 announcement makes me more curious about whether we're going to hyper-accelerate capabilities in the coming months.
My thesis is that the o3 announcement is timelines-relevant in a strange way. The causation goes from o3 to impressiveness or utility of its successors trained on 1 GW training systems, then to decisions to build 5 GW training systems, and it's those 5 GW training systems that have a proximate effect on timelines (in comparison to the world only having 1 GW training systems for a few years). The argument goes through even if o3 and its successors don't particularly move timelines directly through their capabilities; they can remain a successful normal tech...
Strong upvote. I think that interactive demos are much more effective at showcasing "scary" behavior than static demos that are stuck in a paper or a blog post. Readers won't feel like the examples are cherry-picked if they can see the behavior exhibited in real time. I think that the community should make more things like this.
For anyone wanting to harness AI models to create formally verified proofs for theoretical alignment, it looks like last call for formalizing question statements.
Game theory is almost completely absent from mathlib4. I found some scattered attempts at formalization in Lean online but nothing comprehensive. This feels like a massive oversight to me -- if o12345 were released tomorrow with perfect reasoning ability then I'd have no framework with which to check its proofs of any of my current research questions.
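For concreteness, even a bare-bones definition like the following (a quick sketch of my own, not actual mathlib4 API) doesn't seem to exist anywhere in mathlib4:

```lean
import Mathlib.Data.Real.Basic

/-- A two-player normal-form game: each player has a strategy type and a payoff function. -/
structure TwoPlayerGame (S₁ S₂ : Type) where
  payoff₁ : S₁ → S₂ → ℝ
  payoff₂ : S₁ → S₂ → ℝ

/-- A pure-strategy Nash equilibrium: neither player gains by deviating unilaterally. -/
def IsNashEquilibrium {S₁ S₂ : Type} (G : TwoPlayerGame S₁ S₂) (s₁ : S₁) (s₂ : S₂) : Prop :=
  (∀ t₁ : S₁, G.payoff₁ t₁ s₂ ≤ G.payoff₁ s₁ s₂) ∧
  (∀ t₂ : S₂, G.payoff₂ s₁ t₂ ≤ G.payoff₂ s₁ s₂)
```

And that's before getting to mixed strategies, extensive-form games, or anything a proof checker would need to verify the statements I actually care about.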
So I've been thinking a little more about the real-world-incentives problem, and I still suspect that there are situations in which rules won't solve this. Suppose there's a prediction market question with a real-world outcome tied to the resulting market probability (i.e. a relevant actor says "I will do XYZ if the prediction market says ABC"). Let's say the prediction market participants' objective functions are of the form play_money_reward + real_world_outcome_reward. If there are just a couple people for whom real_world_outcome_reward is at least as s...
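Here's a toy model of what I mean (my own numbers and functional form, purely illustrative): suppose a participant's expected play-money reward is maximized at their true belief p, say -k(r - p)^2 for a report r, while the real-world outcome pays them roughly linearly in the market probability, c*r. Then the utility-maximizing report is r = p + c/(2k), so the bias away from honest reporting scales with how large the real-world stake c is relative to the play-money stake k.

```python
# Toy model: utility(r) = -k * (r - p)**2 + c * r.
# Setting the derivative to zero gives r* = p + c / (2 * k), clamped to [0, 1].

def optimal_report(p: float, k: float, c: float) -> float:
    """Utility-maximizing report under the toy model above."""
    return min(1.0, max(0.0, p + c / (2 * k)))

print(optimal_report(p=0.4, k=1.0, c=0.0))  # 0.4  -- no outside stake, honest report
print(optimal_report(p=0.4, k=1.0, c=0.5))  # 0.65 -- outside stake pulls the report up
```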
Awesome post!
I have absolutely no experience with prediction markets, but I’m instinctively nervous about incentives here. Maybe the real-world incentives of market participants could be greater than the play-money incentives? For example, if you’re trying to select people to represent your country at an international competition and the potential competitors have invested their lives into being on that international team and those potential competitors can sign up as market participants (maybe anonymously), then I could very easily imagine those people sa...
I like these observations! As for your last point about ranges and bounds, I'm actually moving towards relaxing those in future posts: basically I want to look at the tree case where you have more than one variable feeding into each node and I want to argue that even if the conditional probabilities are all 0's and 1's (so we don't get any hard bounds with arguments like the one I present here) there can still be strong concentration towards one answer.
Wow, this is exactly what I was looking for! Thank you so much!
I also suspect that the evaluation mechanism is going to be very important. I can think of philosophical debates whose resolution could change the impact of an "artifact" by many orders of magnitude. If possible I think it could be good to have several different metrics (corresponding to different objective functions) by which to grade these artifacts. That way you can give donors different scores depending on which metrics you want to look at. For example, you might want different scores for x-risk minimization, s-risk minimization, etc. That still leaves the "[optimize for (early, reliable) evidence of impact] != [optimize for impact]" issue open, of course.
I really like this perspective! Great first post!
Okay, that paper doesn't seem like what I was thinking of either, but it references this paper, which does seem to be on-topic: https://research.rug.nl/en/publications/justification-by-an-infinity-of-conditional-probabilities
Thanks for the response! I took a look at the paper you linked to; I'm pretty sure I'm not talking about combinatorial explosion. Combinatorial explosion seems to be an issue when solving problems that are mathematically well-defined but computationally intractable in practice; in my case it's not even clear that these objects are mathematically well-defined to begin with.
The paper https://www.researchgate.net/publication/335525907_The_infinite_epistemic_regress_problem_has_no_unique_solution initially looks related to what I'm thinking, but I haven't looked at it in depth yet.
Hmmm, I’m still thinking about this. I’m kinda unconvinced that you even need an algorithm-heavy approach here. Let’s say that you want to apply logit, add some small amount of noise, apply logistic, then score. Consider the function on R^n defined as (score function) composed with (coordinate-wise logistic function). We care about the expected value of this function with respect to the probability measure induced by our noise. For very small noise, you can approximate this function by its power series expansion. For example, if we’re adding iid Gaussian n...
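Here's a 1-D sketch of what I mean (my own toy numbers): for iid Gaussian noise with small standard deviation sigma added in logit space, the expectation of g(z + eps), where g = (score) composed with (logistic), is approximately g(z) + (sigma^2 / 2) * g''(z), so a tiny closed-form correction tracks the Monte Carlo answer without any heavy sampling.

```python
# Compare a Monte Carlo estimate of E[g(z + eps)], eps ~ N(0, sigma^2),
# with the second-order Taylor approximation g(z) + (sigma^2 / 2) * g''(z).

import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(z, outcome=1):
    # Example score: log score of the noised probability against a fixed outcome.
    p = logistic(z)
    return np.log(p) if outcome == 1 else np.log(1.0 - p)

p, sigma = 0.7, 0.05
z = logit(p)

# Monte Carlo estimate of E[g(z + eps)]
rng = np.random.default_rng(0)
mc = g(z + sigma * rng.standard_normal(1_000_000)).mean()

# Second-order Taylor approximation, using a numerical second derivative
h = 1e-4
g2 = (g(z + h) - 2 * g(z) + g(z - h)) / h**2
taylor = g(z) + 0.5 * sigma**2 * g2

print(mc, taylor)  # should agree to within the Monte Carlo noise (~1e-5 here)
```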
If my interpretation of precision function is correct then I guess my main concern is this: how are we reaching inside the minds of the predictors to see what their distribution is? Like, imagine we have an urn with black and red marbles in it and we have a prediction market on the probability that a uniformly randomly chosen marble will be red. Let's say that two people participated in this prediction market: Alice and Bob. Alice estimated there to be a 0.3269230769 (or approximately 17/52) chance of the marble being red because she sa...
Awesome post! I'm very ignorant of the precision-estimation literature so I'm going to be asking dumb questions here.
First of all, I feel like a precision function should take some kind of "acceptable loss" parameter. From what I gather, to specify the precision you need some threshold in your algorithm(s) for how much accuracy loss you're willing to tolerate.
More fundamentally, though, I'm trying to understand what exactly we want to measure. The list of desired properties of a precision function feels somewhat pulled out of thin air, and I'd feel mo...
I'm probably the least qualified person imaginable to represent "the LessWrong community" given that I literally made my first post this weekend, but I did get into EA between high school and college and I have some thoughts on the topic.
My gut reaction is that it depends a lot on the kind of person this high schooler is. I was very interested in math and theoretical physics when I started thinking about EA. I don't think I'm ever going to be satisfied with my life unless I'm doing work that's math-heavy. I applied to schools with good AI programs with the...
You raise an excellent point! In hindsight I’m realizing that I should have chosen a different example, but I’ll stick with it for now. Yes, I agree that “What states of the universe are likely to result from me killing vs not killing lanternflies” and “Which states of the universe do I prefer?” are both questions grounded in the state of the universe where Bayes’ rule applies very well. However, I feel like there’s a third question floating around in the background: “Which states of the universe ‘should’ I prefer?” Based on my inner experiences, I feel th...
Hey, thanks for the response! Yes, I've also read about Bayes' Theorem. However, I'm unconvinced that it is applicable in all the circumstances that I care about. For example, suppose I'm interested in the question "Should I kill lanternflies whenever I can?" That's not really an objective question about the universe that you could, for example, put on a prediction market. There doesn't exist a natural function from (states of the universe) to (answers to that question). There’s interpretation involved. Let’s even say that we get some new evidence (my post...
The question bank doesn't exist yet because the language to ask the questions doesn't exist yet. I spent a few weeks after writing this post trying to familiarize myself with Lean as quickly as possible and I found out that people in the Lean community simply haven't formalized most of the objects I'd want to talk about (probabilistic networks, computational complexity, Nash equilibria, etc.). I tried to get a project off the ground formalizing these objects -- you can see the GitHub repository here and the planning document here. Unfortunately this projec...