I think the website just links back to this blog post? Is that intentional?
Edit: I also think the application link requires access before seeing the form?
Second Edit: Seems fixed now! Thanks!
Maybe I should also expand on what the "AI agents are submitting the programs themselves subject to your approval" scenario could look like. When I talk about a preorder on Turing Machines (or some subset of Turing Machines), you don't have to specify this preorder up front. You just have to be able to evaluate it and the debaters have to be able to guess how it will evaluate.
If you already have a specific simulation program in mind then you can define the preorder as follows: if you're handed two programs which are exact copies of your simulation software ...
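As a crude illustration of "you only need to be able to evaluate the preorder, not write it down up front" — this is a hypothetical sketch of my own, assuming the simplest possible rule where exact copies of a trusted reference simulator outrank everything else:

```python
# Hypothetical judge-side check (my own illustration, not the definition above):
# programs that are byte-for-byte copies of a trusted reference simulator are
# preferred to everything else; any other pair is left incomparable.

def prefer(candidate_a: bytes, candidate_b: bytes, reference: bytes) -> str:
    """Return which of two submitted programs the judge prefers, if either."""
    a_trusted = candidate_a == reference
    b_trusted = candidate_b == reference
    if a_trusted and not b_trusted:
        return "prefer A"
    if b_trusted and not a_trusted:
        return "prefer B"
    return "incomparable"  # the judge never had to write the full ordering down in advance
```

The point is just that the debaters have to anticipate how a check like this will come out; the judge never needs to publish the ordering itself.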
These are all excellent points! I agree that these could be serious obstacles in practice. I do think there are some counter-measures available, though.
I think the easiest issue to handle is the occasional random failure, e.g. your "giving the wrong answer on exact powers of 1000" example. I would probably approach it by looking at stochastic models of computation, e.g. probabilistic Turing machines. You'd need to accommodate stochastic simulations anyway because so many techniques in practice use sampling. I think you can handle stochastic...
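To make the probabilistic-Turing-machine idea a bit more concrete, here's a toy sketch of my own (not anything from the original thread) of the standard amplification trick: repeat independent runs of a flaky simulator and take a majority vote, which drives the failure probability down exponentially.

```python
# Toy illustration of error amplification for a flaky yes/no simulator:
# if each independent run is correct with probability 1/2 + delta, the
# majority vote over many runs is wrong with probability that shrinks
# exponentially in the number of runs (Chernoff bound).

import random
from collections import Counter

def flaky_is_even(x: int, error_rate: float = 0.1) -> bool:
    """Hypothetical simulator: decides whether x is even, but errs 10% of the time."""
    correct = (x % 2 == 0)
    return correct if random.random() > error_rate else not correct

def amplified_is_even(x: int, runs: int = 51) -> bool:
    votes = Counter(flaky_is_even(x) for _ in range(runs))
    return votes[True] > votes[False]

print(amplified_is_even(8))  # True with overwhelming probability
```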
Thank you! If I may ask, what kind of fatal flaws do you expect for real-world simulations? Underspecified / ill-defined questions, buggy simulation software, multiple simulation programs giving irreconcilably conflicting answers in practice, etc.? I ask because I think that in some situations it's reasonable to imagine the AI debaters providing the simulation software themselves if they can formally verify its accuracy, but that would struggle against e.g. underspecified questions.
Also, is there some prototypical example of a "tough real world question" y...
My initial reaction is that at least some of these points would be covered by the Guaranteed Safe AI agenda if that works out, right? Though the "AGIs act much like a colonizing civilization" situation does scare me because it's the kind of thing which locally looks harmless but collectively is highly dangerous. It would require no misalignment on the part of any individual AI.
Totalitarian dictatorship
I'm unclear why this risk is specific to multipolar scenarios? Even if you have a single AGI/ASI you could end up with a totalitarian dictatorship, no? In fact I would imagine that having multiple AGIs/ASIs would mitigate this risk: optimistically, every domestic actor in possession of an AGI/ASI should be counterbalanced by another domestic actor with divergent interests also in possession of an AGI/ASI.
I actually think multipolar scenarios are less dangerous than having a single superintelligence. Watching the AI arms race remain multipolar has actually been one of the biggest factors in my P(doom) declining recently. I believe that maintaining a balance of power at all times is key and that humanity's best chance for survival is to ensure that, for any action humanity wishes to take, there is some superintelligence that would benefit from this action and which would be willing to defend it. This intuition is largely based on examples from human history ...
Something about AI Safety via Debate that I should have realized ages ago but only recognized recently: I usually see theoretical studies of debate through the lens of determining the output of a fixed Turing machine (or stochastic Turing machine or some other computational model), e.g. https://arxiv.org/abs/2311.14125. This formalization doesn't capture the kinds of open-ended questions I usually think about. For example, suppose I'm looking at the blueprint of a submarine and I want to know whether it will be watertight at 1km depth. Suppose I have a physics ...
Has anyone made a good, easy-to-use interface for implementing debate / control protocols in deployment? For example, maybe I want to get o12345 to write some code in my codebase but I don't trust it. I'd like to automatically query Sonnet 17.5 with the code o12345 returned and ask if anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training, but it still feels useful during deployment as an end-user, especially between models from different labs, which might have internalized different objective functions and hence be less likely to collude.
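Roughly, the wrapper I'm imagining looks like this (purely hypothetical interfaces — no real API calls, and the model names are just placeholders):

```python
# Sketch of a deployment-time wrapper: an untrusted model writes the code,
# a trusted model from a different lab reviews it, and a debate protocol is
# only spun up when the reviewer flags something.

from typing import Callable, Optional, Tuple

def guarded_codegen(
    task: str,
    untrusted_generate: Callable[[str], str],    # hypothetical wrapper around "o12345"
    trusted_review: Callable[[str, str], bool],  # hypothetical wrapper around "Sonnet 17.5"; True means "looks fishy"
    run_debate: Callable[[str, str], str],       # hypothetical debate protocol; returns a transcript
) -> Tuple[str, Optional[str]]:
    """Generate code with the untrusted model; escalate to debate only if the trusted reviewer objects."""
    code = untrusted_generate(task)
    if trusted_review(task, code):
        transcript = run_debate(task, code)  # the end-user reads this before accepting the code
        return code, transcript
    return code, None
```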
When you talk about "other sources of risk from misalignment," these sound like milder / easier-to-tackle versions of the assumptions you've listed? Your assumptions sound like they focus on the worst case scenario. If you can solve the harder version then I would imagine that the easier version would also be solved, no?
I agree, though I think it would be a very ridiculous own-goal if e.g. GPT-4o decided to block a whistleblowing report about OpenAI because it was trained to serve OpenAI's interests. I think any model used by this kind of whistleblowing tool should be open-source (nothing fancy / more dangerous than what's already out there), run locally by the operators of the tool, and tested to make sure it doesn't block legitimate posts.
My gut instinct is that this would have been a fantastic thing to create 2-4 years ago. My biggest hesitation is that the probability a tool like this decreases existential risk is proportional to the fraction of lab researchers who know about it, and adoption can be a slow, hard thing to make happen. I still think that this kind of program could be incredibly valuable under the right circumstances, so someone should probably be working on this.
Also, I have a very amateurish security question: if someone provides their work email to verify their authenticit...
I guess I'm a bit confused where o3 comes into this analysis. This discussion appears to be focused on base models to me? Is data really the bottleneck these days for o-series-type advancements? I thought that compute available to do RL in self-play / CoT / long-time-horizon-agentic-setups would be a bigger consideration.
Edit: I guess upon another reading this article seems like an argument against AI capabilities hitting a plateau in the coming years, whereas the o3 announcement makes me more curious about whether we're going to hyper-accelerate capabilities in the coming months.
My thesis is that the o3 announcement is timelines-relevant in a strange way. The causation goes from o3 to impressiveness or utility of its successors trained on 1 GW training systems, then to decisions to build 5 GW training systems, and it's those 5 GW training systems that have a proximate effect on timelines (in comparison to the world only having 1 GW training systems for a few years). The argument goes through even if o3 and its successors don't particularly move timelines directly through their capabilities; they can remain a successful normal tech...
Strong upvote. I think that interactive demos are much more effective at showcasing "scary" behavior than static demos that are stuck in a paper or a blog post. Readers won't feel like the examples are cherry-picked if they can see the behavior exhibited in real time. I think that the community should make more things like this.
For anyone wanting to harness AI models to create formally verified proofs for theoretical alignment, it looks like last call for formalizing question statements.
Game theory is almost completely absent from mathlib4. I found some scattered attempts at formalization in Lean online but nothing comprehensive. This feels like a massive oversight to me -- if o12345 were released tomorrow with perfect reasoning ability then I'd have no framework with which to check its proofs of any of my current research questions.
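For concreteness, even a bare-bones definition like the following (a quick sketch of my own, not actual mathlib4 API) doesn't seem to exist anywhere in mathlib4:

```lean
import Mathlib.Data.Real.Basic

/-- A two-player normal-form game: each player has a strategy type and a payoff function. -/
structure TwoPlayerGame (S₁ S₂ : Type) where
  payoff₁ : S₁ → S₂ → ℝ
  payoff₂ : S₁ → S₂ → ℝ

/-- A pure-strategy Nash equilibrium: neither player gains by deviating unilaterally. -/
def IsNashEquilibrium {S₁ S₂ : Type} (G : TwoPlayerGame S₁ S₂) (s₁ : S₁) (s₂ : S₂) : Prop :=
  (∀ t₁ : S₁, G.payoff₁ t₁ s₂ ≤ G.payoff₁ s₁ s₂) ∧
  (∀ t₂ : S₂, G.payoff₂ s₁ t₂ ≤ G.payoff₂ s₁ s₂)
```

And that's before getting to mixed strategies, extensive-form games, or anything a proof checker would need to verify the statements I actually care about.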
So I've been thinking a little more about the real-world-incentives problem, and I still suspect that there are situations in which rules won't solve this. Suppose there's a prediction market question with a real-world outcome tied to the resulting market probability (i.e. a relevant actor says "I will do XYZ if the prediction market says ABC"). Let's say the prediction market participants' objective functions are of the form play_money_reward + real_world_outcome_reward. If there are just a couple people for whom real_world_outcome_reward is at least as s...
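Here's a toy model of what I mean (my own numbers and functional form, purely illustrative): suppose a participant's expected play-money reward is maximized at their true belief p, say -k(r - p)^2 for a report r, while the real-world outcome pays them roughly linearly in the market probability, c*r. Then the utility-maximizing report is r = p + c/(2k), so the bias away from honest reporting scales with how large the real-world stake c is relative to the play-money stake k.

```python
# Toy model: utility(r) = -k * (r - p)**2 + c * r.
# Setting the derivative to zero gives r* = p + c / (2 * k), clamped to [0, 1].

def optimal_report(p: float, k: float, c: float) -> float:
    """Utility-maximizing report under the toy model above."""
    return min(1.0, max(0.0, p + c / (2 * k)))

print(optimal_report(p=0.4, k=1.0, c=0.0))  # 0.4  -- no outside stake, honest report
print(optimal_report(p=0.4, k=1.0, c=0.5))  # 0.65 -- outside stake pulls the report up
```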
Awesome post!
I have absolutely no experience with prediction markets, but I’m instinctively nervous about incentives here. Maybe the real-world incentives of market participants could be greater than the play-money incentives? For example, if you’re trying to select people to represent your country at an international competition and the potential competitors have invested their lives into being on that international team and those potential competitors can sign up as market participants (maybe anonymously), then I could very easily imagine those people sa...
I like these observations! As for your last point about ranges and bounds, I'm actually moving towards relaxing those in future posts: basically I want to look at the tree case where you have more than one variable feeding into each node and I want to argue that even if the conditional probabilities are all 0's and 1's (so we don't get any hard bounds with arguments like the one I present here) there can still be strong concentration towards one answer.
Wow, this is exactly what I was looking for! Thank you so much!
I also suspect that the evaluation mechanism is going to be very important. I can think of philosophical debates whose resolution could change the impact of an "artifact" by many orders of magnitude. If possible I think it could be good to have several different metrics (corresponding to different objective functions) by which to grade these artifacts. That way you can give donors different scores depending on which metrics you want to look at. For example, you might want different scores for x-risk minimization, s-risk minimization, etc. That still leaves the "[optimize for (early, reliable) evidence of impact] != [optimize for impact]" issue open, of course.
I really like this perspective! Great first post!
Okay, that paper doesn't seem like what I was thinking of either, but it references this paper, which does seem to be on-topic: https://research.rug.nl/en/publications/justification-by-an-infinity-of-conditional-probabilities
Thanks for the response! I took a look at the paper you linked to; I'm pretty sure I'm not talking about combinatorial explosion. Combinatorial explosion seems to be an issue when solving problems that are mathematically well-defined but computationally intractable in practice; in my case it's not even clear that these objects are mathematically well-defined to begin with.
The paper https://www.researchgate.net/publication/335525907_The_infinite_epistemic_regress_problem_has_no_unique_solution initially looks related to what I'm thinking, but I haven't looked at it in depth yet.
Hmmm, I’m still thinking about this. I’m kinda unconvinced that you even need an algorithm-heavy approach here. Let’s say that you want to apply logit, add some small amount of noise, apply logistic, then score. Consider the function on R^n defined as (score function) composed with (coordinate-wise logistic function). We care about the expected value of this function with respect to the probability measure induced by our noise. For very small noise, you can approximate this function by its power series expansion. For example, if we’re adding iid Gaussian n...
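Here's a 1-D sketch of what I mean (my own toy numbers): for iid Gaussian noise with small standard deviation sigma added in logit space, the expectation of g(z + eps), where g = (score) composed with (logistic), is approximately g(z) + (sigma^2 / 2) * g''(z), so a tiny closed-form correction tracks the Monte Carlo answer without any heavy sampling.

```python
# Compare a Monte Carlo estimate of E[g(z + eps)], eps ~ N(0, sigma^2),
# with the second-order Taylor approximation g(z) + (sigma^2 / 2) * g''(z).

import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(z, outcome=1):
    # Example score: log score of the noised probability against a fixed outcome.
    p = logistic(z)
    return np.log(p) if outcome == 1 else np.log(1.0 - p)

p, sigma = 0.7, 0.05
z = logit(p)

# Monte Carlo estimate of E[g(z + eps)]
rng = np.random.default_rng(0)
mc = g(z + sigma * rng.standard_normal(1_000_000)).mean()

# Second-order Taylor approximation, using a numerical second derivative
h = 1e-4
g2 = (g(z + h) - 2 * g(z) + g(z - h)) / h**2
taylor = g(z) + 0.5 * sigma**2 * g2

print(mc, taylor)  # should agree to within the Monte Carlo noise (~1e-5 here)
```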
If my interpretation of precision function is correct then I guess my main concern is this: how are we reaching inside the minds of the predictors to see what their distribution is? Like, imagine we have an urn with black and red marbles in it and we have a prediction market on the probability that a uniformly randomly chosen marble will be red. Let's say that two people participated in this prediction market: Alice and Bob. Alice estimated there to be a 0.3269230769 (or approximately 17/52) chance of the marble being red because she sa...
Awesome post! I'm very ignorant of the precision-estimation literature so I'm going to be asking dumb questions here.
First of all, I feel like a precision function should take some kind of "acceptable loss" parameter. From what I gather, to specify the precision you need some threshold in your algorithm(s) for how much accuracy loss you're willing to tolerate.
More fundamentally, though, I'm trying to understand what exactly we want to measure. The list of desired properties of a precision function feels somewhat pulled out of thin air, and I'd feel mo...
I'm probably the least qualified person imaginable to represent "the LessWrong community" given that I literally made my first post this weekend, but I did get into EA between high school and college and I have some thoughts on the topic.
My gut reaction is that it depends a lot on the kind of person this high schooler is. I was very interested in math and theoretical physics when I started thinking about EA. I don't think I'm ever going to be satisfied with my life unless I'm doing work that's math-heavy. I applied to schools with good AI programs with the...
You raise an excellent point! In hindsight I’m realizing that I should have chosen a different example, but I’ll stick with it for now. Yes, I agree that “What states of the universe are likely to result from me killing vs not killing lanternflies” and “Which states of the universe do I prefer?” are both questions grounded in the state of the universe where Bayes’ rule applies very well. However, I feel like there’s a third question floating around in the background: “Which states of the universe ‘should’ I prefer?” Based on my inner experiences, I feel th...
Hey, thanks for the response! Yes, I've also read about Bayes' Theorem. However, I'm unconvinced that it is applicable in all the circumstances that I care about. For example, suppose I'm interested in the question "Should I kill lanternflies whenever I can?" That's not really an objective question about the universe that you could, for example, put on a prediction market. There doesn't exist a natural function from (states of the universe) to (answers to that question). There’s interpretation involved. Let’s even say that we get some new evidence (my post...
The question bank doesn't exist yet because the language to ask the questions doesn't exist yet. I spent a few weeks after writing this post trying to familiarize myself with Lean as quickly as possible and I found out that people in the Lean community simply haven't formalized most of the objects I'd want to talk about (probabilistic networks, computational complexity, Nash equilibria, etc.). I tried to get a project off the ground formalizing these objects -- you can see the GitHub repository here and the planning document here. Unfortunately this projec...