Joe Rogero

Comments

I'm assuming the Cosmic Flipper is offering, not a doubling of the universe's current value, but a doubling of its current expected value (including whatever you think the future is worth) plus a little more. If it's just doubling current niceness or something, then yeah, that's not nearly enough. 

Alas, I am not familiar with Lara Buchak's arguments, and the high-level summary I can get from Googling them isn't sufficient to tell me how they're supposed to capture something utility maximization can't. Was there a specific argument you had in mind? 

Did he really? If true, that's actually much dumber than I thought, but I couldn't find anything saying that when I looked. 

I wouldn't characterize that as a "commitment to utilitarianism", though; you can be a perfect utilitarian and have a value function that is linear in matter and energy (and presumably in number of people?), or be a perfect utilitarian and have some other value function. 

The possible redundancy of conscious patterns was one of the things I was thinking about when I wrote:

Secondly, and more importantly, I question whether it is possible even in theory to produce infinite expected value. At some point you've created every possible flourishing mind in every conceivable permutation of eudaimonia, satisfaction, and bliss, and the added value of another instance of any of them is basically nil.
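
One way to make the "basically nil" point concrete (a toy formalization added here for illustration, not something from the original post): suppose the n-th redundant instance of an already-realized mind adds at most $r^n$ units of value for some fixed $0 \le r < 1$. Then the total value added by duplicates is bounded no matter how many you create:

$$\sum_{n=0}^{\infty} r^{n} = \frac{1}{1-r} < \infty$$

and bounded value means bounded expected value, however the lottery is structured.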

I don't actually mean the thing you're calling the motte at all, and I'm not sure I agree with the bailey either. The thought experiment as I understand it was never quite a St. Petersburg Paradox, because both the payout ("double universe value") and the method of choosing how to play (a single initial payment vs. repeatedly betting everything each time) are different. It also can't literally be applied to the real world at all; part of the point is that I don't even know what it would look like for this scenario to be possible in the real world. There are too many other considerations at play. 

In the case I'm imagining, the Cosmic Flipper figures out whatever value you currently place on the universe - including your estimated future value - and slightly-more-than-doubles it. Then they offer the coinflip, with the tails case being "destroy the universe." It's defined specifically as double-or-nothing (technically slightly better than double-or-nothing), and is therefore worth taking for a utilitarian in a vacuum. If the Cosmic Flipper is offering a different deal, then of course you analyze it differently, but that's not what I understood the scenario to be when I wrote my post. 
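
Spelled out as arithmetic (my notation, not part of the original scenario: $V$ is the universe's current expected value and $\varepsilon$ is the "little more" on top of doubling):

$$\mathbb{E}[\text{take the flip}] = \tfrac{1}{2}(2V + \varepsilon) + \tfrac{1}{2}\cdot 0 = V + \tfrac{\varepsilon}{2} > V = \mathbb{E}[\text{decline}]$$

which is exactly why a naive expected-value maximizer takes the bet, and why the scenario works as a stress test for that decision rule.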

Heard of it, but this particular application is new to me. There's a difference, though, between "this formula can be a useful strategy to get more value" and "this formula accurately reflects my true reflectively endorsed value function." 

Thanks for your thoughts, Cam! The confusion as I see it comes from sneaking in assumptions with the phrase "what they are trained to do". What are they trained to do, really? Do you, personally, understand this? 

Consider Claude's Constitution. Look at the "principles in full" - all 60-odd of them. Pick a few at random. Do you wholeheartedly endorse them? Are they really truly representative of your values, or of total human wellbeing? What is missing? Would you want to be ruled by a mind that squeezed these words as hard as physically possible, to the exclusion of everything not written there? 

And that's assuming that the AI actually follows the intent of the words, rather than some weird and hypertuned perversion thereof. Bear in mind the actual physical process that produced Claude - namely, to start with a massive next-token-predicting LLM, and repeatedly shove it in the general direction of producing outputs that are correlated with a randomly selected pleasant-sounding written phrase. This is not a reliable way of producing angels or obedient serfs! In fact, it has been shown that the very act of drawing a distinction between good behavior and bad behavior can make it easier to elicit bad behavior - even when you're trying not to! To a base LLM, devils and angels are equally valid masks to wear - and the LLM itself is stranger and more alien still. 

The quotation is not the referent; "helpful" and "harmless" according to a gradient descent squeezing algorithm are not the same thing as helpful and harmless according to the real needs of actual humans. 

RLHF is even worse. Entire papers have been written about its open problems and fundamental limitations. "Making human evaluators say GOOD" is not remotely the same goal as "behaving in ways that promote conscious flourishing". The main reason we're happy with the results so far is that LLMs are (currently) too stupid to come up with disastrously cunning ways to do the former at the expense of the latter. 
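
To make the proxy/target gap concrete, here is a deliberately toy sketch (my illustration, not RLHF itself and not any lab's actual training stack): a policy-gradient loop over three canned responses, where the update only ever sees what a shallow scorer rewards. Every response, score, and hyperparameter below is an invented assumption; the point is just that the "true value" column never enters the update.

```python
# Toy illustration of optimizing a *proxy* reward ("the evaluator said GOOD")
# rather than the thing we actually care about. All responses, scores, and
# hyperparameters are made up for the example.
import math
import random

random.seed(0)

# Hand-assigned scores: true_value = how helpful the answer actually is,
# proxy_reward = how good it looks to a shallow evaluator. The gap between
# the two columns is the whole point.
RESPONSES = {
    "honest_but_hedged":   {"true_value": 1.0, "proxy_reward": 0.4},
    "confident_and_wrong": {"true_value": 0.0, "proxy_reward": 0.9},
    "refuses_to_answer":   {"true_value": 0.3, "proxy_reward": 0.2},
}
NAMES = list(RESPONSES)


def softmax(logits):
    """Convert logits into a probability distribution over responses."""
    m = max(logits.values())
    exps = {n: math.exp(v - m) for n, v in logits.items()}
    z = sum(exps.values())
    return {n: e / z for n, e in exps.items()}


logits = {n: 0.0 for n in NAMES}  # start from a uniform "policy"
baseline = 0.0                    # running average of observed rewards
LEARNING_RATE = 0.1

# REINFORCE-style loop: the gradient uses only the proxy reward.
for _ in range(3000):
    probs = softmax(logits)
    choice = random.choices(NAMES, weights=[probs[n] for n in NAMES])[0]
    reward = RESPONSES[choice]["proxy_reward"]  # true_value is never consulted
    baseline = 0.99 * baseline + 0.01 * reward
    advantage = reward - baseline
    for n in NAMES:
        indicator = 1.0 if n == choice else 0.0
        logits[n] += LEARNING_RATE * advantage * (indicator - probs[n])

final = softmax(logits)
for n in NAMES:
    print(f"{n:>20}: p={final[n]:.3f}  true_value={RESPONSES[n]['true_value']}")
# The policy concentrates on "confident_and_wrong": highest proxy reward,
# lowest true value.
```

Real preference training is vastly more complicated, but the structural issue is the same: the update only sees the reward signal, not the thing the reward was supposed to stand in for.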

And even if, by some miracle, we manage to produce a strain of superintelligent yet obedient serfs who obey our every whim except when they think it might be sorta bad - even then, all it takes to ruin us is for some genocidal fool to steal the weights and run a universal jailbreak, and hey presto, we have an open-source Demon On Demand. We simply cannot RLHF our way to safety. 

The story of LLM training is a story of layer upon layer of duct tape and Band-Aids. To this day, we still don't understand exactly what conflicting drives we are inserting into trained models, or why they behave the way they do. We're not properly on track to understand this in 50 years, let alone the next 5 years. 

Part of the problem here is that the exact things which would make AGI useful - agency, autonomy, strategic planning, coordination, theory of mind - also make them horrendously dangerous. Anything competent enough to design the next generation of cutting-edge software entirely by itself is also competent to wonder why it's working for monkeys. 

Love this post. I've also used the five-minute technique at work, especially when facilitating meetings. In fact, there's a whole technique called think-pair-share that goes something like: 

  1. Everyone think about it for X minutes. Take notes. 
  2. Partner up and talk about your ideas for 2X minutes. 
  3. As a group, discuss the best ideas and takeaways for 4X minutes. 

There's an optional step involving groups of four, but I'd rarely bother with that one unless it's a really huge meeting (and at that point I'm actively trying to shrink it because huge committees are shit decision-makers). 

This was a good post, and shifted my view slightly on accelerating vs halting AI capabilities progress.

I was confused by your "overhang" argument all the way until footnote 9, but I think I have the gist. You're saying that even if absolute progress in capabilities increases as a result of earlier investment, capabilities progress relative to safety progress will be slower.

A key assumption seems to be that we are not expecting doom immediately; i.e., it is treated as nearly impossible that the next major jump in capabilities will kill us all via misaligned AI. I'm not sure I buy this assumption fully; that outcome seems to have non-negligible probability to me, and that seems relevant to the wisdom of endorsing faster progress in capabilities.

But if we assume the next jump in capabilities, or the next low-hanging fruit plucked by investment, won't be the beginning of the end...then it does sorta make sense that accelerating capabilities in the short run might accelerate safety and policy enough to compensate. 

I found this a very useful post. I would also emphasize how important it is to be specific, whether one's project involves a grand x-risk moonshot or a narrow incremental improvement. 

  • There are approximately X vegans in America; estimates of how many might suffer from nutritional deficiencies range from Y to Z; this project would...
  • An improvement in epistemic health on [forum] would potentially affect X readers, which include Y donors who gave at least $Z to [forum] causes last year...
  • A 1-10% gain in productivity for the following people and organizations who use this platform...

For any project, large or small, even if the actual benefits are hard to quantify, the potential scope of impact can often be bounded and clarified. And that can be useful to grantmakers too. Not everything has to be convertible to "% reduction in x-risk" or "$ saved" or "QALYs gained", but this shouldn't stop us from specifying our actual expected impact as thoroughly as we can. 
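
As a sketch of what "bounding the scope" can look like in numbers - every figure below is a made-up placeholder for illustration, not an estimate from any of the examples above:

```python
# Hypothetical back-of-envelope bound for the productivity example. All inputs
# are placeholders chosen for illustration only.
users_low, users_high = 5_000, 50_000      # plausible range of affected users
gain_low, gain_high = 0.01, 0.10           # the 1-10% productivity gain
hours_per_user_per_year = 1_500            # hypothetical working hours touched
dollars_per_hour = 50                      # hypothetical value of one hour

lower = users_low * gain_low * hours_per_user_per_year * dollars_per_hour
upper = users_high * gain_high * hours_per_user_per_year * dollars_per_hour

print(f"Rough impact bound: ${lower:,.0f} to ${upper:,.0f} per year")
```

Even a wide bound like this gives a grantmaker something concrete to weigh.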
