lunatic_at_large

Undergraduate math major at Carnegie Mellon University, interested in theoretical aspects of AI Safety via Debate, control, game theory, and probabilistic logic. I'm currently doing theoretical AI Safety via Debate research under Professor Vincent Conitzer. 

Comments

I think the website just links back to this blog post? Is that intentional?

Edit: I also think the application link requires access before seeing the form?

Second Edit: Seems fixed now! Thanks!

Maybe I should also expand on what the "AI agents are submitting the programs themselves subject to your approval" scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don't have to specify this preorder up front. You just have to be able to evaluate it and the debaters have to be able to guess how it will evaluate.

If you already have a specific simulation program in mind, then you can define $\preceq$ as follows: if you're handed two programs which are exact copies of your simulation software using different hard-coded world models, then you consult your ordering on world models; if one submission is even a single character different from your intended program, then it's automatically less; if both programs differ from your program, then you decide arbitrarily. What's nice about the "ordering on computations" perspective is that it naturally generalizes to situations where you don't follow this construction.
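
Here's a rough sketch of what that constructed comparison could look like in Python; `reference_program`, `extract_world_model`, and `world_model_order` are stand-ins I'm making up for the user's own trusted simulator and world-model ordering, not anything canonical:

```python
from typing import Optional

def reference_program(world_model: str) -> str:
    """The user's trusted simulation software with a world model hard-coded (stub)."""
    return f"SIMULATOR(world_model={world_model!r})"

def extract_world_model(prog: str) -> Optional[str]:
    """Return the hard-coded world model if prog is an exact, character-for-character
    copy of the reference simulator; otherwise return None."""
    prefix, suffix = "SIMULATOR(world_model='", "')"
    if prog.startswith(prefix) and prog.endswith(suffix):
        return prog[len(prefix):-len(suffix)]
    return None

def world_model_order(w_a: str, w_b: str) -> bool:
    """The user's pre-existing ordering on world models (stub: compare precision by length)."""
    return len(w_a) <= len(w_b)

def preorder(prog_a: str, prog_b: str) -> bool:
    """True iff prog_a is at most prog_b under the induced ordering on computations."""
    w_a, w_b = extract_world_model(prog_a), extract_world_model(prog_b)
    if w_a is not None and w_b is not None:
        return world_model_order(w_a, w_b)  # both are exact copies: defer to world models
    if w_a is None and w_b is not None:
        return True                         # prog_a deviates by even one character: it loses
    if w_a is not None and w_b is None:
        return False
    return prog_a <= prog_b                 # both deviate: decide arbitrarily
```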

What could happen if we don't supply our own simulation program via this construction? In the planes example, maybe the "snap" debater hands you a 50,000-line simulation program with a bug so that if you're crafty with your grid sizes then it'll get confused about the material properties and give the wrong answer. Then the "safe" debater might hand you a 200,000-line simulation program which avoids / patches the bug so that the crafty grid sizes now give the correct answer. Of course, there's nothing stopping the "safe" debater from having half of those lines be comments containing a Lean proof using super annoying numerical PDE bounds or whatever to prove that the 200,000-line program avoids the same kind of bug as the 50,000-line program.

When you think about it that way, maybe it's reasonable to give the "it'll snap" debater a chance to respond to the "it's safe" debater's comments. Now maybe we change the type of $\preceq$ from being a subset of (Turing Machines) x (Turing Machines) to being a subset of (Turing Machines) x (Turing Machines) x (Justifications from safe debater) x (Justifications from snap debater). In this manner, deciding how you want $\preceq$ to behave can become a computational problem in its own right.
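
In (made-up) Python type annotations, that change of type looks roughly like this; the point is just that the comparison now gets to read both debaters' justifications:

```python
from typing import Callable

TuringMachine = str   # program source code, for the sake of the sketch
Justification = str   # e.g. comments, a Lean proof, an informal argument

# Original setup: a preorder on pairs of submitted programs.
Preorder = Callable[[TuringMachine, TuringMachine], bool]

# Extended setup: the comparison may also consult each debater's justification,
# so evaluating it becomes a computational problem in its own right.
JustifiedPreorder = Callable[
    [TuringMachine, TuringMachine, Justification, Justification], bool
]
```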

These are all excellent points! I agree that these could be serious obstacles in practice. I do think there are some counter-measures available, though.

I think the easiest to address is the occasional random failure, e.g. your "giving the wrong answer on exact powers of 1000" example. I would probably try to address this issue by looking at stochastic models of computation, e.g. probabilistic Turing machines. You'd need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic evaluation maps in a similar fashion to the deterministic case, but everything gets a bit more annoying (e.g. you likely need to include an incentive to point to simple world models if you want the game to have any equilibria at all). Anyway, if you've figured out topological debate in the stochastic case, then you can reduce from the occasional-errors problem to the stochastic problem as follows: suppose $W$ is a directed set of world models and $f$ is some simulation software. Define a stochastic program $f'$ which takes in a world model $w \in W$, randomly samples a world model $w'$ according to some reasonably-spread-out distribution, and returns $f(w')$. In the 1D plane case, for example, you could take in a given resolution, divide it by a uniformly random real number from some fixed interval, and then run the simulation at that new resolution. If your errors are sufficiently rare then your stochastic topological debate setup should handle things from here.
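
As a toy illustration of this reduction in Python (the glitch set and the `uniform(1.0, 2.0)` divisor range are placeholders I've picked for the sketch, not part of the actual construction):

```python
import random

def simulate(resolution_m: float) -> bool:
    """Stub for trusted simulation software: returns the correct answer except
    on a measure-zero set of unlucky resolutions, where it glitches."""
    true_answer = True                       # pretend the ground truth is "the plane is safe"
    glitch = resolution_m in (1.0, 0.001)    # toy measure-zero failure set
    return (not true_answer) if glitch else true_answer

def randomized_simulate(resolution_m: float) -> bool:
    """The reduction: jitter the requested world model before simulating, so
    that the rare failure inputs are hit with probability zero."""
    return simulate(resolution_m / random.uniform(1.0, 2.0))
```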

Somewhat more serious is the case where "it's harder to disrupt patterns injected during bids." Mathematically I interpret this statement as the existence of a world model which evaluates to the wrong answer such that you have to take a vastly more computationally intensive refinement to get the correct answer. I think it's reasonable to detect when this problem is occurring but preventing it seems hard: you'd basically need to create a better simulation program which doesn't suffer from the same issue. For some problems that could be a tall order without assistance but if your AI agents are submitting the programs themselves subject to your approval then maybe it's surmountable.

What I find the most serious and most interesting, though, is the case where your simulation software simply might not converge to the truth. To expand on your nonlinear effects example: suppose our resolution map can specify dimensions of individual grid cells. Suppose that our simulation software has a glitch where, if you alternate the sizes of grid cells along some direction, the simulation gets tricked into thinking the material has a different stiffness or something. This is a kind of glitch which both sides can exploit, and the net of simulation results probably won't converge to anything.

I find this problem interesting because it attacks one of the core vulnerabilities that I think debate protocols struggle with: grounding in reality. You can't really design a system to "return the correct answer" without somehow specifying what makes an answer correct. I tried to ground topological debate in this pre-existing ordering on computations that gets handed to us and which is taken to be a canonical characterization of the problem we want to solve. In practice, though, that's really just kicking the can down the road: any user would have to come up with a simulation program, or a method of comparing simulation programs, which encapsulates their question of interest. That's not an easy task.

Still, I don't think we need to give up so easily. Maybe we don't ground ourselves by assuming that the user has a simulation program but instead ground ourselves by assuming that the user can check whether a simulation program or comparison between simulation programs is valid. For example, suppose we're in the alternating-grid-cell-sizes example. Intuitively the correct debater should be able to isolate an extreme example and go to the human and say "hey, this behavior is ridiculous, your software is clearly broken here!" I will think about what a mathematical model of this setup might look like. Of course, we just kicked the can down the road, but I think that there should be some perturbation of these ideas which is practical and robust.

Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it's reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions.

Also, is there some prototypical example of a "tough real world question" you have in mind? I will gladly concede that not all questions naturally fit into this framework. I was primarily inspired by physical security questions like biological attacks or backdoors in mechanical hardware. 

My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the "AGIs act much like a colonizing civilization" situation does scare me because it's the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI. 

Totalitarian dictatorship

I'm unclear on why this risk is specific to multipolar scenarios? Even if you have a single AGI/ASI you could end up with a totalitarian dictatorship, no? In fact, I would imagine that having multiple AGI/ASIs would mitigate this risk as, optimistically, every domestic actor in possession of an AGI/ASI should be counterbalanced by another domestic actor with divergent interests also in possession of an AGI/ASI.

I actually think multipolar scenarios are less dangerous than having a single superintelligence. Watching the AI arms race remain multipolar has actually been one of the biggest factors in my P(doom) declining recently. I believe that maintaining a balance of power at all times is key and that humanity's best chance for survival is to ensure that, for any action humanity wishes to take, there is some superintelligence that would benefit from this action and which would be willing to defend it. This intuition is largely based on examples from human history and may not generalize to the case of superintelligences. 

EDIT: I do believe there's a limit to the benefits of having multiple superintelligences, especially in the early days when biological defense may be substantially weaker than offense. As an analogy to nuclear weapons: if one country possesses a nuclear bomb, then that country can terrorize the world at will; if a few countries have nuclear bombs, then everyone has an incentive to be restrained but alert; if every country has a nuclear bomb, then eventually someone is going to press the big red button for lolz.

Something about AI Safety via Debate that I should have realized ages ago but only recognized recently: I usually see theoretical studies of debate through the lens of determining the output of a fixed Turing machine (or stochastic Turing machine or some other computational model), e.g. https://arxiv.org/abs/2311.14125. This formalization doesn't capture the kinds of open-ended questions I usually think about. For example, suppose I'm looking at the blueprint of a submarine and I want to know whether it will be watertight at 1km depth. Suppose I have a physics simulation engine at my disposal that I trust. Maybe I could run the physics simulation engine at 1 nanometer resolution and get an answer I trust after ten thousand years, but I don't have time for that. This is such an extremely long computation that I don't think any AI debater would have time for it either. Instead, if I were tasked with solving this problem alone, I would attempt to find some discretization parameter that is sufficiently small for me to trust the conclusion but sufficiently large for the computation to be feasible.

Now, if I had access to two debaters, how would I proceed? I would ask them both for thresholds beyond which their desired conclusion holds. For example, maybe I get the "it's watertight" debater to commit that any simulation at a resolution below 1cm will conclude that the design is watertight and I get the "it's not watertight" debater to commit that any simulation at a resolution below 1mm will conclude that the design is not watertight. I then run the simulation (or use one of the conventional debate protocols to simulate running the simulation) at the more extreme of the two suggestions, in this case 1mm. There are details to be resolved with the incentives but I believe it's possible.

I like to view this generalized problem as an infinite computation where we believe the result converges to the correct answer at some unknown rate. For example, we can run our simulation at 1m resolution, 10cm resolution, 1cm resolution, 1mm resolution, etc., saving the result from the most recent simulation as our current best guess. If we trust our simulation software then we should believe that this infinite computation eventually converges to the true answer. We can implement debate in this setup by holding an initial debate over which steps T have the property that the answer at every step T' >= T agrees with the answer at T, picking one such T that both sides agree on (take the maximum of their proposals), and then running a conventional debate protocol on the resulting finite computation.
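
A minimal sketch of that reduction, with the infinite computation stubbed out and the incentive details elided:

```python
def run_simulation_at_step(step: int) -> bool:
    """Stub for the infinite computation: step 0 = 1m resolution, step 1 = 10cm,
    and so on. Assumed to converge to the true answer at some unknown rate."""
    return step >= 3   # toy: the answer stabilizes from step 3 onwards

def debate_over_convergent_computation(threshold_a: int, threshold_b: int) -> bool:
    """Each debater commits to a step beyond which (they claim) the answer no
    longer changes. Evaluate at the more extreme of the two commitments; in
    practice this single evaluation would itself be delegated to a conventional
    debate protocol rather than run directly."""
    T = max(threshold_a, threshold_b)
    return run_simulation_at_step(T)
```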

EDIT: I wonder if there's a generalization of this idea to having a directed set of world models, where one world model is at least as good as another if it is at least as precise in every model detail. Each debater proposes a world model, the protocol takes an upper bound of the two proposals (which exists by the directed-set property), and we simulate that world model. I'm thinking of the Guaranteed Safe AI agenda here.
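
Schematically, the generalized protocol is just this; `upper_bound` and `simulate` are whatever the user supplies:

```python
from typing import Callable, TypeVar

W = TypeVar("W")  # world models, ordered by "at least as precise in every detail"

def directed_debate(
    proposal_a: W,
    proposal_b: W,
    upper_bound: Callable[[W, W], W],  # guaranteed to exist by directedness
    simulate: Callable[[W], bool],
) -> bool:
    """Each debater proposes a world model; the protocol refines both proposals
    to a common upper bound and simulates that single world model."""
    return simulate(upper_bound(proposal_a, proposal_b))
```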

Has anyone made a good, easy-to-use user interface for implementing debate / control protocols in deployment? For example, maybe I want to get o12345 to write some code in my codebase but I don't trust it. I'd like to automatically query Sonnet 17.5 with the code o12345 returned and ask if anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training, but it still feels useful in deployment as an end user, especially between models from different labs, which might have internalized different objective functions and hence be less likely to collude.
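
I'm imagining something with roughly this shape; the prompts, the `Model` interface, and the escalation step are all hypothetical, just to make the workflow concrete:

```python
from typing import Callable

# A "model" here is just prompt -> response; in practice these would wrap API
# clients for two models from different labs.
Model = Callable[[str], str]

def reviewed_code_change(untrusted: Model, reviewer: Model, task: str) -> str:
    code = untrusted(f"Write code for the following task:\n{task}")
    verdict = reviewer(
        "Does anything look fishy in this code? Answer YES or NO, then explain.\n\n" + code
    )
    if verdict.strip().upper().startswith("YES"):
        # Escalate: run a debate protocol between the two models and surface
        # the transcript to the user (not implemented in this sketch).
        raise RuntimeError("Reviewer flagged the code; escalating to debate.")
    return code
```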

The "other sources of risk from misalignment" you mention sound like milder / easier-to-tackle versions of the assumptions you've listed? Your assumptions sound like they focus on the worst-case scenario. If you can solve the harder version, then I would imagine that the easier version would also be solved, no?
