Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Sequences

Orcas


Comments


Secondly, there are three popular books which I would advise not to read. They are "Eat That Frog", "The 7 Habits of Highly Effective People", and "Getting Things Done: The Art of Stress-Free Productivity". I found that all of them are 5% signal and 95% noise, and their most important messages could have been summarized in 5 to 10 pages each.

I think Getting Things Done is awesome. I read it 3 times and it was super useful for me. I've built myself a (in my opinion particularly nice) GTD system in Notion and I love it.

You don't need to use GTD for the 1-2 main projects you're working on in a particular week (though of course you still want to organize tasks and notes for those somehow), but it's super useful for managing everything else, so you have more time/energy to focus on your core projects.

Though it may take a while to set up a good system and learn to use it well. You want to tune it to fit your needs, e.g. the example "context" categories by which to structure next action lists may not fit your purposes that well.

The 5% signal claim seems especially surprising to me w.r.t. GTD. There's just so much good content in the book. Of course it could be summarized further, but examples are important for understanding the content. Even the basic system setup which it guides you through is quite a lot of content to implement for one productivity book, but there's a lot more in the book which you can start paying attention to once you've established a decent system with capture, inbox-processing, and weekly-review habits.

The main value is the GTD organizing system, but there's also great advice that can be applied independently of the system, e.g. the 5-step natural planning model (iirc):

  1. Answer "Why do you want to do the project / achieve the goal?"
  2. Answer "What is the goal of the project?". Visualize success if possible.
  3. Brainstorm
  4. Organize
  5. Decide on next actions

I guess it may not be that obvious that planning in roughly this way is good if you haven't tried it, but it is.

Thanks.

I think you are being led astray by having a one-dimensional notion of intelligence.

(I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won't help you much for the parts of the problem we are most bottlenecked on.)

You identified the key property yourself: it's that the humans have an advantage over the AI at (particular parts of) evaluating what's best. (More precisely, it's that the humans have information that the AI does not have; it can still work even if the humans don't use their information to evaluate what's best.)

I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between required time and safety of a project, but I think it would be able to predict them well enough that the differences in what strategy it uses wouldn't be large.

Or do you imagine strategically keeping some information from the AI?

Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)

Even if the alignment works out perfectly: when the AI is smarter and the humans say "actually, we want to shut you down", the AI does update that the humans are probably worried about something. But if the AI is smart enough and sees that the humans were worried about something that isn't actually going to happen, it can just say "sorry, that's not actually in your extrapolated interests; you will perhaps understand later when you're smarter", and then go on trying to fulfill human values.

But if we're confident that alignment to humans will work out, we don't need corrigibility. Corrigibility is rather intended so that we might be able to recover if something goes wrong.

If the values of the AI drift a bit, then the AI will likely notice this before the humans do and take measures so that the humans don't find out or won't (be able to) change its values back, because that's the strategy that's best according to the AI's new values.

Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)

Likewise, this is just updating on new information, not changing terminal goals.

Also note that parents often think (sometimes correctly) that they know better what is in the child's extrapolated interests, and then don't act according to the child's stated wishes.

And I think superhumanly smart AIs will likely be better at guessing what is in a human's interests than parents are at guessing what is in their child's interests, so the cases where the strategy gets updated are less significant.

I'm saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about "naturalness" of corrigibility.

From my perspective, CIRL doesn't really show much correctability if the AI is generally smarter than humans. That would only be the case if a smart AI were somehow quite bad at guessing what humans want, so that when we tell it what we want, it would importantly update its strategy, including shutting itself down because it believes that is then the best way to accomplish its goal. (I might still not call it corrigible, but I would see your point about corrigible behavior.)
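
To make that concrete, here's a minimal numerical sketch (my own toy model, loosely in the spirit of the off-switch game from Hadfield-Menell et al.; the names and numbers are just illustrative, not taken from anyone's post). It shows that the value of deferring to the human comes only from information the human has that the AI lacks; as that gap shrinks, deferring and just acting become equally good from the AI's perspective:

```python
import numpy as np

# Toy setup: the AI considers an action whose true utility U (to the human) it is
# uncertain about. Its belief is U ~ Normal(mu, sigma), where sigma measures how much
# the human knows that the AI doesn't.
# Option "act":   expected utility = E[U] = mu
# Option "defer": the (assumed rational) human allows the action iff U > 0,
#                 otherwise shuts the AI down (utility 0), so expected utility = E[max(U, 0)]
def expected_value_of_deferring(mu, sigma, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(mu, sigma, n_samples)
    return np.maximum(u, 0.0).mean()

mu = 1.0  # the AI already thinks the action is probably good
for sigma in [2.0, 0.5, 0.05]:
    defer = expected_value_of_deferring(mu, sigma)
    print(f"sigma={sigma:.2f}: E[act]={mu:.3f}, E[defer]={defer:.3f}, gain from deferring={defer - mu:.3f}")
# The gain from deferring shrinks from ~0.4 toward ~0 as sigma shrinks.
```

(As sigma goes toward zero, the human's button press carries almost no information for the AI, which is the regime I'm pointing at above.)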

I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.

I liked this post. Reward button alignment seems like a good toy problem to attack, or to discuss alignment feasibility on.

But it's not obvious to me whether the AI would really become something like a superintelligent reward-button-press optimizer. (But even if your exact proposal doesn't work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, where most seem like "eh, probably it works fine, but not sure", but my current biggest doubt is "when the AI becomes reflective, will the reflectively endorsed values only include reward button presses, or also a bunch of shards that were used for estimating expected button presses?".

Let me try to understand in more detail what you imagine the AI to look like:

  1. How does the learned value function evaluate plans?
    1. Does the world model always evaluate expected-button-presses for each plan, so that the LVF just looks at that part of a plan and uses it as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning? (See the toy sketch after this list for what I mean by the latter.)
      1. Maybe the question is rather how far upstream of button presses that other stuff is, e.g. just "the human walks toward the reward button", or also "getting more relevant knowledge is usually good".
      2. Or like, which parts get evaluated by the thought generator and which by the value function? Does the value function (1) look at a lot of complex parts in a plan to evaluate expected-reward-utility, (2) recognize a bunch of shards like "value of information", "gaining instrumental resources", etc. on plans, which it uses to estimate value, (3) do the plans conveniently summarize success probability and expected resources it can look at (as opposed to them being implicit and needing to be recognized by the LVF as in (2)), or (4) does the thought generator directly predict expected-reward-utility which can be used?
    2. Also how sophisticated is the LVF? Is it primitive like in humans or able to make more complex estimates?
      1. If there are deceptive plans like "ok, actually I value U_2, but I will of course maximize and faithfully predict expected button presses so as not to get value drift, until I can destroy the reward setup", would the LVF detect that as being low expected button presses?
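
Here's the toy sketch I mentioned in question 1.1, in case it helps pin down what I mean by the value function ending up valuing stuff upstream of the button via TD learning (my own minimal illustration; the state names and numbers are made up, not anything from your post):

```python
# Minimal tabular TD(0) toy: reward only arrives on the button press, but the learned
# value function ends up assigning high value to states upstream of the press as well.
states = ["idle", "walk_to_button", "press_button"]
next_state = {"idle": "walk_to_button", "walk_to_button": "press_button", "press_button": None}
reward = {"idle": 0.0, "walk_to_button": 0.0, "press_button": 1.0}

V = {s: 0.0 for s in states}   # learned value function
alpha, gamma = 0.1, 0.9        # learning rate, discount factor

for _ in range(2000):          # replay the episode many times
    s = "idle"
    while s is not None:
        s_next = next_state[s]
        target = reward[s] + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])   # TD(0) update
        s = s_next

print(V)  # roughly {'idle': 0.81, 'walk_to_button': 0.9, 'press_button': 1.0}
```

My question is roughly whether the analogous upstream shards ("gaining knowledge", "the human walks toward the button", ...) are what the reflective AI ends up endorsing, or whether they stay purely instrumental.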

I can try to imagine in more detail what may go wrong once I better see what you're imagining.

(Also, in case you're trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or someone like that as an example, rather than normies or normie situations.)

(You did respond to all the important parts; the rest of my comment is very much optional.)

My reading was that you still have an open disagreement, where Steve thinks there's not much more to explain but you still want an answer to "Why did people invent the word 'consciousness' and write what they wrote about it? What algorithm might output sentences describing fascination about the redness of red?", which Steve's series doesn't answer.

I wouldn't give up that early on trying to convince Steve that he's missing some part. (Though it's possible that I misread Steve's comment and he did understand you; I didn't read it precisely.)

Here’s the (obvious) strategy: Apply voluntary attention-control to keep S(getting out of bed) at the center of attention. Don’t let it slip away, no matter what.

Can you explain more precisely how this works mechanistically? What is happening to keep S(getting out of bed) at the center of attention?

8.5.6.1 Aside: The “innate drive to minimize voluntary attention control”

Your hypothesis here doesn't seem to me to explain why we seem to have a limited willpower budget for attention control, which gets depleted but also regenerates after a time. I can see how negative rewards from minimizing voluntary attention control can make us less likely to apply willpower in the future, but why would it regenerate then?

Btw, there's another, simpler possible mechanism, though I don't know the neuroscience, and perhaps Steve's hypothesis with separate valence assessors and involuntary attention control fits the neuroscience evidence much better; it may also fit observed motivated reasoning better.

But the obvious way to design a mind would be to make it just focus on whatever is most important, i.e. wherever the most expected utility per unit of necessary resources could be gained.

So we still have a learned value function which assesses how good/bad something would be, but we also have an estimator of how much the value would increase if we continued thinking (which might e.g. happen because one makes plans for making a somewhat bad situation better), and what gets attended to depends on this estimator, not on the value function directly.
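
As a rough illustration of what I mean (my own toy sketch; the candidate thoughts and numbers are made up): attention goes to whatever has the highest estimated value gain per unit of thinking, not to whatever currently has the highest value.

```python
# Hypothetical candidate thoughts: (name, current value, estimated value after more thinking, thinking cost)
candidates = [
    ("replay embarrassing memory", -0.6, -0.6, 1.0),       # bad, and more thinking won't improve it
    ("plan how to fix the bad situation", -0.4, 0.3, 2.0),  # bad now, but thinking produces a plan
    ("daydream about vacation", 0.5, 0.5, 1.0),             # pleasant, but no further gain
]

def estimated_value_of_thinking(thought):
    _, current, after_thinking, cost = thought
    return (after_thinking - current) / cost  # expected value gain per unit of thinking effort

# Attention is allocated by the gain estimator, not by the raw value:
attended = max(candidates, key=estimated_value_of_thinking)
print(attended[0])  # -> "plan how to fix the bad situation"
```

On this design, the somewhat-bad situation gets attended to exactly because planning is expected to improve it, without needing the raw (negative) valence to drive attention directly.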

The "S" in "S(X)" and "S(A)" seems different to me. If I rename the "S" in "S(A)" to "I", it would make more sense to me:

  • A = action of standing up (which gets actually executed if positive valence)
  • I(A) = imagined scene of myself standing up
  • S(I(A)) = the thought "I am thinking about standing up"

Yeah, I agree that it wouldn't be a very bad kind of s-risk. The way I thought about s-risk was more like the expected amount of suffering. But yeah, I agree with you that it's not that bad, and perhaps most expected suffering comes from more active utility-inverting threats or values.

(Though tbc, I was totally imagining 1e40 humans being forced to press reward buttons.)

I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.

I didn't read the lecture you linked, but I liked Hofstadter's book "Surfaces and Essences" which had the same core thesis. It's quite long though. And not about neuroscience.

I find this rather ironic:

6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?

It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button.

[...]

On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.

(I guess I wouldn't say the s-risk is very low, but that's not actually an important disagreement here. I partially just thought it sounded funny.)
