All of lunatic_at_large's Comments + Replies

The question bank doesn't exist yet because the language to ask the questions doesn't exist yet. I spent a few weeks after writing this post trying to familiarize myself with Lean as quickly as possible and I found out that people in the Lean community simply haven't formalized most of the objects I'd want to talk about (probabilistic networks, computational complexity, Nash equilibria, etc.). I tried to get a project off the ground formalizing these objects -- you can see the GitHub repository here and the planning document here. Unfortunately this projec... (read more)

I think the website just links back to this blog post? Is that intentional?

Edit: I also think the application link requires access before seeing the form?

Second Edit: Seems fixed now! Thanks!

Maybe I should also expand on what the "AI agents are submitting the programs themselves subject to your approval" scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don't have to specify this preorder up front. You just have to be able to evaluate it and the debaters have to be able to guess how it will evaluate.

If you already have a specific simulation program in mind then you can define the preorder as follows: if you're handed two programs which are exact copies of your simulation software ... (read more)
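
To make this concrete, here is a toy sketch of the kind of preorder I mean; the simulator settings (grid resolution, timestep) and the class names are hypothetical, just to illustrate that the preorder only needs to be evaluated, not specified up front for arbitrary programs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimProgram:
    """Stand-in for an exact copy of your simulation software, differing
    only in its accuracy settings (hypothetical)."""
    grid_resolution: int  # more grid cells = stricter
    timestep: float       # smaller timestep = stricter

def preorder_leq(p1: SimProgram, p2: SimProgram) -> bool:
    """p1 <= p2 iff p2 is at least as strict as p1 in every setting.
    Programs that aren't recognized copies of the simulator are simply
    incomparable under this preorder."""
    return (p1.grid_resolution <= p2.grid_resolution
            and p1.timestep >= p2.timestep)

def strictest_combination(p1: SimProgram, p2: SimProgram) -> SimProgram:
    """Take the stricter value of each setting from the two submitted bids."""
    return SimProgram(
        grid_resolution=max(p1.grid_resolution, p2.grid_resolution),
        timestep=min(p1.timestep, p2.timestep),
    )
```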

These are all excellent points! I agree that these could be serious obstacles in practice. I do think there are some counter-measures available, though.

I think the easiest to address is the occasional random failure, e.g. your "giving the wrong answer on exact powers of 1000" example. I would probably try to address this issue by looking at stochastic models of computation, e.g. probabilistic Turing machines. You'd need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic... (read more)
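
To make the random-failure point concrete, here is a minimal sketch of the standard amplification-by-repetition trick, assuming (purely for illustration) a Boolean-valued simulation that errs independently on each run with probability well below 1/2:

```python
import random
from collections import Counter

def run_simulation(seed: int) -> bool:
    """Stand-in for a stochastic simulation that usually returns the right
    Boolean answer but occasionally fails (hypothetical 5% error rate)."""
    rng = random.Random(seed)
    correct_answer = True
    return correct_answer if rng.random() > 0.05 else not correct_answer

def amplified_answer(num_trials: int = 101) -> bool:
    """Majority vote over independent runs: if each run errs with probability
    p < 1/2, the majority is wrong with probability exponentially small in
    num_trials (Chernoff bound)."""
    votes = Counter(run_simulation(seed) for seed in range(num_trials))
    return votes[True] > votes[False]
```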

Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it's reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions.

Also, is there some prototypical example of a "tough real world question" y... (read more)

Charlie Steiner
For topological debate that's about two agents picking settings for simulation/computation, where those settings have a partial order that lets you take the "strictest" combination, a big class of fatal flaws would be if you don't actually have the partial order you think you have within the practical range of the settings - i.e. if some settings you thought were more accurate/strict are actually systematically less accurate. In the 1D plane example, this would be if some specific length scales (e.g. exact powers of 1000) cause simulation error, but as long as they're rare, this is pretty easy to defend against.

In the fine-grained plane example, though, there's a lot more room for fine-grained patterns in which parts of the plane get modeled at which length scale to start having nonlinear effects. If the agents are not allowed to bid "maximum resolution across the entire plane," and instead are forced to allocate resources cleverly, then maybe you have a problem. But hopefully the truth is still advantaged, because the false player has to rely on fairly specific correlations, and the true player can maybe bid a bunch of noise that disrupts almost all of them. (This makes possible a somewhat funny scene, where the operator expected the true player's bid to look "normal," and then goes to check the bids and both look like alien noise patterns.)

An egregious case would be where it's harder to disrupt patterns injected during bids - e.g. if the players' bids are 'sparse' / have finite support and might not overlap. Then the notion of the true player just needing to disrupt the false player seems a lot more unlikely, and both players might get pushed into playing very similar strategies that take every advantage of the dynamics of the simulator in order to control the answer in an unintended way.

I guess for a lot of "tough real world questions," the difficulty of making a super-accurate simulator (one you even hope converges to the right answer) torpedoes the attem...

My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the "AGIs act much like a colonizing civilization" situation does scare me because it's the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI. 

Totalitarian dictatorship

I'm unclear why this risk is specific to multipolar scenarios? Even if you have a single AGI/ASI you could end up with a totalitarian dictatorship, no? In fact I would imagine that having multiple AGI/ASIs would mitigate this risk as, optimistically, every domestic actor in possession of an AGI/ASI should be counterbalanced by another domestic actor with divergent interests also in possession of an AGI/ASI.

I actually think multipolar scenarios are less dangerous than having a single superintelligence. Watching the AI arms race remain multipolar has actually been one of the biggest factors in my P(doom) declining recently. I believe that maintaining a balance of power at all times is key and that humanity's best chance for survival is to ensure that, for any action humanity wishes to take, there is some superintelligence that would benefit from this action and which would be willing to defend it. This intuition is largely based on examples from human history ... (read more)

Something I should have realized about AI Safety via Debate ages ago but only recently recognized: I usually see theoretical studies of debate through the lens of determining the output of a fixed Turing machine (or stochastic Turing machine or some other computational model), e.g. https://arxiv.org/abs/2311.14125. This formalization doesn't capture the kinds of open-ended questions I usually think about. For example, suppose I'm looking at the blueprint of a submarine and I want to know whether it will be watertight at 1km depth. Suppose I have a physics ... (read more)

Has anyone made a good, easy-to-use interface for implementing debate / control protocols in deployment? For example, maybe I want to get o12345 to write some code in my codebase but I don't trust it. I'd like to automatically query Sonnet 17.5 with the code o12345 returned and ask if anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training, but it still feels useful during deployment as an end-user, especially between models from different labs which might have internalized different objective functions and hence be less likely to collude.
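
Something like the following skeleton is what I'm imagining; the generate / review / debate calls are placeholders for whatever APIs the labs actually expose, not real endpoints:

```python
from dataclasses import dataclass

@dataclass
class ReviewResult:
    fishy: bool
    explanation: str

def untrusted_generate(task: str) -> str:
    """Placeholder: ask the untrusted, more capable model for code."""
    raise NotImplementedError

def trusted_review(task: str, code: str) -> ReviewResult:
    """Placeholder: ask a trusted model (ideally from a different lab)
    whether anything fishy is going on in the proposed code."""
    raise NotImplementedError

def run_debate(task: str, code: str, concern: str) -> str:
    """Placeholder: run a debate protocol between the two models and
    return the transcript for the human to read."""
    raise NotImplementedError

def controlled_codegen(task: str) -> tuple[str, str | None]:
    """Return (code, transcript); transcript is None if no concern was raised."""
    code = untrusted_generate(task)
    review = trusted_review(task, code)
    if not review.fishy:
        return code, None
    return code, run_debate(task, code, review.explanation)
```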

When you talk about "other sources of risk from misalignment," these sound like milder / easier-to-tackle versions of the assumptions you've listed? Your assumptions sound like they focus on the worst case scenario. If you can solve the harder version then I would imagine that the easier version would also be solved, no? 

Buck
Yeah, if you handle scheming, you solve all my safety problems, but not the final bullet point of "models fail to live up to their potential" problems.

I agree, though I think it would be a very ridiculous own-goal if e.g. GPT-4o decided to block a whistleblowing report about OpenAI because it was trained to serve OpenAI's interests. I think any model used by this kind of whistleblowing tool should be open-source (nothing fancier / more dangerous than what's already out there), run locally by the operators of the tool, and tested to make sure it doesn't block legitimate posts.

Mckiev
I can also unblock it manually at any point, and keep the full uncensored log of posts on a blockchain

My gut instinct is that this would have been a fantastic thing to create 2-4 years ago. My biggest hesitation is that the probability a tool like this decreases existential risk is proportional to the fraction of lab researchers who know about it, and adoption can be a slow / hard thing to make happen. I still think that this kind of program could be incredibly valuable under the right circumstances, so someone should probably be working on this.

Also, I have a very amateurish security question: if someone provides their work email to verify their authenticit... (read more)

Mckiev
Thanks for sharing your opinion. Regarding security: using the full body of an email, you can generate a zero-knowledge proof with an offline tool (since all emails are hashed and signed by the email server). No new emails need to be exchanged.

I guess I'm a bit confused where o3 comes into this analysis. This discussion appears to be focused on base models to me? Is data really the bottleneck these days for o-series-type advancements? I thought that the compute available to do RL in self-play / CoT / long-time-horizon agentic setups would be a bigger consideration.

Edit: I guess upon another reading this article seems like an argument against AI capabilities hitting a plateau in the coming years, whereas the o3 announcement makes me more curious about whether we're going to hyper-accelerate capabilities in the coming months.

My thesis is that the o3 announcement is timelines-relevant in a strange way. The causation goes from o3 to impressiveness or utility of its successors trained on 1 GW training systems, then to decisions to build 5 GW training systems, and it's those 5 GW training systems that have a proximate effect on timelines (in comparison to the world only having 1 GW training systems for a few years). The argument goes through even if o3 and its successors don't particularly move timelines directly through their capabilities; they can remain a successful normal tech... (read more)

Strong upvote. I think that interactive demos are much more effective at showcasing "scary" behavior than static demos that are stuck in a paper or a blog post. Readers won't feel like the examples are cherry-picked if they can see the behavior exhibited in real time. I think that the community should make more things like this.

For anyone wanting to harness AI models to create formally verified proofs for theoretical alignment, it looks like last call for formalizing question statements.

Game theory is almost completely absent from mathlib4. I found some scattered attempts at formalization in Lean online but nothing comprehensive. This feels like a massive oversight to me -- if o12345 were released tomorrow with perfect reasoning ability then I'd have no framework with which to check its proofs of any of my current research questions.
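
For concreteness, this is roughly the level of scaffolding that seems to be missing; a toy sketch (not a serious formalization) of a normal-form game and a pure Nash equilibrium that I'd expect to typecheck against mathlib4:

```lean
import Mathlib.Data.Real.Basic

/-- A toy two-player normal-form game: strategy types `A`, `B` and a
real-valued payoff function for each player. -/
structure TwoPlayerGame (A B : Type) where
  payoff₁ : A → B → ℝ
  payoff₂ : A → B → ℝ

/-- `(a, b)` is a pure-strategy Nash equilibrium if neither player can gain
by unilaterally deviating. -/
def IsNashEquilibrium {A B : Type} (G : TwoPlayerGame A B) (a : A) (b : B) : Prop :=
  (∀ a' : A, G.payoff₁ a' b ≤ G.payoff₁ a b) ∧
  (∀ b' : B, G.payoff₂ a b' ≤ G.payoff₂ a b)
```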

So I've been thinking a little more about the real-world-incentives problem, and I still suspect that there are situations in which rules won't solve this. Suppose there's a prediction market question with a real-world outcome tied to the resulting market probability (i.e. a relevant actor says "I will do XYZ if the prediction market says ABC"). Let's say the prediction market participants' objective functions are of the form play_money_reward + real_world_outcome_reward. If there are just a couple people for whom real_world_outcome_reward is at least as s... (read more)
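
A toy version of the worry, assuming (purely for illustration) that play money is awarded by a Brier-style quadratic scoring rule with stakes $c$: a participant who believes the true probability is $q$ but reports $p$ has expected objective

$$\mathbb{E}[\text{utility}(p)] \;=\; \underbrace{C - c\,(p-q)^2}_{\text{expected play money}} \;+\; \underbrace{R(p)}_{\text{real-world reward}},$$

which is the play_money_reward + real_world_outcome_reward form above. Distorting the market from $q$ to $p$ is then individually rational whenever $R(p) - R(q) > c\,(p-q)^2$, and the expected play-money cost of any distortion is capped at $c$, so a couple of participants whose real-world stake is comparable to the market's stakes can rationally eat that loss no matter what the rules say.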

Awesome post!

I have absolutely no experience with prediction markets, but I’m instinctively nervous about incentives here. Maybe the real-world incentives of market participants could be greater than the play-money incentives? For example, if you’re trying to select people to represent your country at an international competition and the potential competitors have invested their lives into being on that international team and those potential competitors can sign up as market participants (maybe anonymously), then I could very easily imagine those people sa... (read more)

Saul Munn
Thanks for the response! Re: concerns about bad incentives, I agree that you can depict the losses associated with manipulating conditional prediction markets as paying a "cost" — even though you'll probably lose a boatload of money, it might be worth it to lose a boatload of money to manipulate the markets. In the words of Scott Alexander, though: I'm concerned about this, but it feels like a solvable problem. Re: personal stuff & the negative externalities of publicly revealing probabilities, thanks for pointing these out. I hadn't thought of it. Added it to the post!

I like these observations! As for your last point about ranges and bounds, I'm actually moving towards relaxing those in future posts: basically I want to look at the tree case where you have more than one variable feeding into each node and I want to argue that even if the conditional probabilities are all 0's and 1's (so we don't get any hard bounds with arguments like the one I present here) there can still be strong concentration towards one answer.

Wow, this is exactly what I was looking for! Thank you so much!

I also suspect that the evaluation mechanism is going to be very important. I can think of philosophical debates whose resolution could change the impact of an "artifact" by many orders of magnitude. If possible I think it could be good to have several different metrics (corresponding to different objective functions) by which to grade these artifacts. That way you can give donors different scores depending on which metrics you want to look at. For example, you might want different scores for x-risk minimization, s-risk minimization, etc. That still leaves the "[optimize for (early, reliable) evidence of impact] != [optimize for impact]" issue open, of course. 

Okay that paper doesn't seem like what I was thinking of either but it references this paper which does seem to be on-topic: https://research.rug.nl/en/publications/justification-by-an-infinity-of-conditional-probabilities 

Thanks for the response! I took a look at the paper you linked to; I'm pretty sure I'm not talking about combinatorial explosion. Combinatorial explosion seems to be an issue when solving problems that are mathematically well-defined but computationally intractable in practice; in my case it's not even clear that these objects are mathematically well-defined to begin with.

The paper https://www.researchgate.net/publication/335525907_The_infinite_epistemic_regress_problem_has_no_unique_solution initially looks related to what I'm thinking, but I haven't looked at it in depth yet.

Hmmm, I’m still thinking about this. I’m kinda unconvinced that you even need an algorithm-heavy approach here. Let’s say that you want to apply logit, add some small amount of noise, apply logistic, then score. Consider the function on R^n defined as (score function) composed with (coordinate-wise logistic function). We care about the expected value of this function with respect to the probability measure induced by our noise. For very small noise, you can approximate this function by its power series expansion. For example, if we’re adding iid Gaussian n... (read more)
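
Spelling out the expansion I have in mind: write $f = \text{score} \circ \text{logistic}$ (coordinate-wise logistic) on $\mathbb{R}^n$ and assume it's smooth with bounded derivatives. For iid Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ the odd-order Taylor terms vanish by symmetry, so

$$\mathbb{E}\big[f(x+\varepsilon)\big] \;=\; f(x) \;+\; \frac{\sigma^2}{2}\sum_{i=1}^{n}\frac{\partial^2 f}{\partial x_i^2}(x) \;+\; O(\sigma^4),$$

i.e. to leading order you only need the value and the Laplacian of the composed score at the reported point, with no heavy algorithmic machinery.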

If my interpretation of precision function is correct then I guess my main concern is this: how are we reaching inside the minds of the predictors to see what their distribution on P(A|E) is? Like, imagine we have an urn with black and red marbles in it and we have a prediction market on the probability that a uniformly randomly chosen marble will be red. Let's say that two people participated in this prediction market: Alice and Bob. Alice estimated there to be a 0.3269230769 (or approximately 17/52) chance of the marble being red because she sa... (read more)

Awesome post! I'm very ignorant of the precision-estimation literature so I'm going to be asking dumb questions here.

First of all, I feel like a precision function should take some kind of "acceptable loss" parameter. From what I gather, to specify the precision you need some threshold in your algorithm(s) for how much accuracy loss you're willing to tolerate. 

More fundamentally, though, I'm trying to understand what exactly we want to measure. The list of desired properties of a precision function feels somewhat pulled out of thin air, and I'd feel mo... (read more)

lunatic_at_large
If my interpretation of precision function is correct then I guess my main concern is this: how are we reaching inside the minds of the predictors to see what their distribution on P(A|E) is? Like, imagine we have an urn with black and red marbles in it and we have a prediction market on the probability that a uniformly randomly chosen marble will be red. Let's say that two people participated in this prediction market: Alice and Bob. Alice estimated there to be a 0.3269230769 (or approximately 17/52) chance of the marble being red because she saw the marbles being put in and there were 17 red marbles and 52 marbles total. Bob estimated there to be a 0.3269230769 chance of the marble being red because he felt like it. Bob is clearly providing false precision while Alice is providing entirely justified precision. However, no matter which way the urn draw goes, the input tuple (0.3269230769, 0) or (0.3269230769, 1) will be the same for both participants and thus the precision returned by any precision function will be the same. This feels to me like a fundamental disconnect between what we want to measure and what we are measuring. Am I mistaken in my understanding? Thanks!
Answer by lunatic_at_large

I'm probably the least qualified person imaginable to represent "the LessWrong community" given that I literally made my first post this weekend, but I did get into EA between high school and college and I have some thoughts on the topic.

My gut reaction is that it depends a lot on the kind of person this high schooler is. I was very interested in math and theoretical physics when I started thinking about EA. I don't think I'm ever going to be satisfied with my life unless I'm doing work that's math-heavy. I applied to schools with good AI programs with the... (read more)

Cole Wyeth
Early in college I wanted to jump into "doing A.I." and ended up doing robotics research/clubs that I mostly hated in practice, and worse, for whatever reason (probably too much obsession with discipline), I didn't even notice I hated it, and I wasted a lot of time that way. It's true that a lot of A.I.-related fields attract people with less of a mathematical focus, and it's important to develop your mind working on problems you actually like.

You raise an excellent point! In hindsight I’m realizing that I should have chosen a different example, but I’ll stick with it for now. Yes, I agree that “What states of the universe are likely to result from me killing vs not killing lanternflies” and “Which states of the universe do I prefer?” are both questions grounded in the state of the universe where Bayes’ rule applies very well. However, I feel like there’s a third question floating around in the background: “Which states of the universe ‘should’ I prefer?” Based on my inner experiences, I feel th... (read more)

Hey, thanks for the response! Yes, I've also read about Bayes' Theorem. However, I'm unconvinced that it is applicable in all the circumstances that I care about. For example, suppose I'm interested in the question "Should I kill lanternflies whenever I can?" That's not really an objective question about the universe that you could, for example, put on a prediction market. There doesn't exist a natural function from (states of the universe) to (answers to that question). There’s interpretation involved. Let’s even say that we get some new evidence (my post... (read more)

Robert Miles
One way of framing the difficulty with the lanternflies thing is that the question straddles the is-ought gap. It decomposes pretty cleanly into two questions: "What states of the universe are likely to result from me killing vs not killing lanternflies" (about which Bayes' rule fully applies and is enormously useful), and "Which states of the universe do I prefer?", where the only evidence you have will come from things like introspection about your own moral intuitions and values. Your values are also a fact about the universe, because you are part of the universe, so Bayes still applies I guess, but it's quite a different question to think about.

If you have well-defined values, for example some function from states (or histories) of the universe to real numbers, such that larger numbers represent universe states that you would always prefer over smaller numbers, then every "should I do X or Y" question has an answer in terms of those values. In practice we'll never have that, but still it's worth thinking separately about "What are the expected consequences of the proposed policy?" and "What consequences do I want?", which a 'should' question implicitly mixes together.