Could you expand on what you mean by 'less automation'? I'm taking it to mean some combination of 'bounding the space of controller actions more', 'automating fewer levels of optimisation', 'more of the work done by humans' and maybe 'only automating easier tasks' but I can't quite tell which of these you're intending or how they fit together.
(Also, am I correctly reading an implicit assumption here that any attempts to do automated research would be classed as 'automated ai safety'?)
When I read this post I feel like I'm seeing four different strands bundled together:
1. Truth-of-beliefs as fuzzy or not
2. Models versus propositions
3. Bayesianism as not providing an account of how you generate new hypotheses/models
4. How people can (fail to) communicate with each other
I think you hit the nail on the head with (2) and am mostly sold on (3), but am sceptical of (1). Similar to what several others have said, it seems to me that these problems don't appear when your beliefs are about expected observations, and only appear when you start to invoke categories that you can't ground as clusters in a hierarchical model.
That leaves me with mixed feelings about (4):
- It definitely seems true and significant that you can get into a mess by communicating specific predictions relative to your own categories/definitions/contexts without making those sufficiently precise
- I am inclined to agree that this is a particularly important feature of why talking about AI/x-risk is hard
- It's not obvious to me that what you've said above actually justifies Knightian uncertainty (as opposed to infrabayesianism or something), or the claim that you can't be confident about superintelligence (although it might be true for other reasons)
Strongly agree that active inference is underrated both in general and specifically for intuitions about agency.
I think the literature does suffer from ambiguity over where it's descriptive (ie an agent will probably approximate a free energy minimiser) vs prescriptive (ie the right way to build agents is free energy minimisation, and anything that isn't that isn't an agent). I am also not aware of good work on tying active inference to tool use - if you know of any, I'd be pretty curious.
I think the viability thing is maybe slightly fraught - I expect it's mainly for anthropic reasons that we mostly encounter agents that have adapted to basically independently and reliably preserve their causal boundaries, but this is always connected to the type of environment they find themselves in.
For example, active inference points to ways we could accidentally build misaligned optimisers that cause harm - chaining an oracle to an actuator to make a system trying to do homeostasis in some domain (like content recommendation) could, with sufficient optimisation power, create all kinds of weird and harmful distortions. But such a system wouldn't need to have any drive for boundary preservation, or even much situational awareness.
So essentially an agent could conceivably persist for totally different reasons, we just tend not to encounter such agents, and this is exactly the kind of place where AI might change the dynamics a lot.
Interesting! I think one of the biggest things we gloss over in the piece is how perception fits into the picture, and this seems like a pretty relevant point. In general the space of 'things that give situational awareness' seems pretty broad and ripe for analysis.
I also wonder how much efficiency gets lost by decoupling observation and understanding - at least in humans, it seems like we have a kind of hierarchical perception where our subjective experience of 'looking at' something has already gone through a few layers of interpretation, giving us basically no unadulterated visual observation, presumably because this is more efficient (maybe in particular faster?).
I'd be pretty curious to hear about your disagreements if you're willing to share
This seems like a misunderstanding / not my intent. (Could you maybe quote the part that gave you this impression?)
I believe Dusan was trying to say that davidad's agenda limits the planner AI to only writing provable mathematical solutions. To expand, I believe that compared to what you briefly describe, the idea in davidad's agenda is that you don't try to build a planner that's definitely inner aligned, you simply have a formal verification system that ~guarantees what effects a plan will and won't have if implemented.
Oh interesting! I just had a go at testing it on screenshots from a parallel conversation and it seems like it incorrectly interprets those screenshots as also being of its own conversation.
So it seems like 'recognising things it has said' is doing very little of the heavy lifting and 'recognising its own name' is responsible for most of the effect.
I'll have a bit more of a play around and probably put a disclaimer at the top of the post some time soon.
The 'reward being chance of winning' stuff changes a bit about how the model generalises if it's playing a game with randomness and conditioned on the upper end - it biases the model towards 'expecting risk to pay off'. E.g. if the model plays a 1-step game where it either banks 1 point or gets a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower EV action. But this isn't super relevant.
In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least partly an empirical ML / science of deep learning problem. But my intuition is that no, even if the model 'understands' that an action causes a good outcome [ P(outcome | action) ], it still might not take that action if the action is unlikely on priors [ P(action) ], because the basic driver of its choice is what action it will observe given the outcome [ P(action | outcome) ], which is proportional to the product of those two quantities (and in fact equal to it once you divide by the normalising constant P(outcome)).
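Written out, the relationship above is just Bayes' rule. This is only a restatement, using a and o as my own shorthand for action and outcome:

```latex
P(a \mid o) = \frac{P(o \mid a)\, P(a)}{P(o)}
```

So an action with a high P(o | a) can still be picked rarely if its prior P(a) is small enough.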
Re lowering the temperature: the thing the predictive model learns is usually some numbers that you normalise into a probability distribution with a softmax function, and you can easily tweak the softmax to make it favour more likely completions, and disfavour unlikely ones. This would greatly decrease the turn-by-turn chance of disconnecting the button (and therefore overall chance), but probably not impair the model's performance elsewhere much.
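As a rough sketch of what 'tweaking the softmax' could look like: the function name and the two logits below are made up for illustration, and a real decision transformer would have many more actions. Dividing the scores by a temperature below 1 before normalising pushes probability towards the already-likely action.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Normalise raw scores into probabilities; temperature < 1 sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: index 0 is a likely action, index 1 is the
# unlikely 'disconnect the button' action.
logits = [2.0, -2.0]
p_default = softmax_with_temperature(logits, temperature=1.0)
p_cold = softmax_with_temperature(logits, temperature=0.5)
```

With these made-up numbers the unlikely action goes from a bit under 2% per turn to well under 0.1%, while the likely action stays the top choice in both cases.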
Re generalisation - decision transformers don't really have strategies per se, they pick moment to moment, and might be systematically miscalibrated about what they'll do in future timesteps. It is true that they'll have some chance at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn't affect performance otherwise.
Re higher conditioning - I think this shouldn't be true. For the sake of argument we can reframe it as a binary outcome, where the model's final return (as a proportion of total possible return) becomes its chance of 'winning'. The thing the model is figuring out is not 'what action leads to me winning', or even 'what action is more likely in worlds where I win than worlds where I lose', it's 'what action do I expect to see from agents that win'. If on turn 1, 99% of agents in the training set voluntarily slap a button that has a 1% chance of destroying them, and then 50% go on to win, as well as 50% of the agents that didn't slap the button, then a DT will (correctly) learn that 'almost all agents which go on to win tend to slap the button on turn 1'.
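To put rough numbers on that example, here is a small sketch. It assumes that '50% go on to win' means 50% of the slappers who survive the button, which is my reading of the setup, and the function and parameter names are mine.

```python
def p_slap_given_win(p_slap=0.99, p_destroyed=0.01,
                     p_win_if_survive=0.5, p_win_no_slap=0.5):
    """P(slapped the button on turn 1 | agent went on to win), by Bayes' rule."""
    # Joint probability of slapping, surviving the button, and then winning.
    p_win_and_slap = p_slap * (1 - p_destroyed) * p_win_if_survive
    # Joint probability of not slapping and then winning.
    p_win_and_no_slap = (1 - p_slap) * p_win_no_slap
    return p_win_and_slap / (p_win_and_slap + p_win_and_no_slap)

posterior = p_slap_given_win()
p_win_given_slap = (1 - 0.01) * 0.5  # slightly below the 0.5 for non-slappers
```

Under these assumptions slapping slightly lowers the chance of winning (0.495 versus 0.5), yet about 99% of winners slapped the button, which is the thing the DT is imitating when conditioned on winning.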
Re correlation - Sure, I am taking the liberal assumption that there's no correlation in the training data, and indeed a lot of this rests on the training data having a nice structure
The writing here was definitely influenced by Lewis (we quote TAoM in footnote 6), although I think the Choice Transition is broader and less categorically negative.
For instance in Lewis's criticism of the potential abolition he writes things like:
The Choice Transition as we're describing it is consistent with either of these approaches. There needn't be any ruling minority, nor do we assume humans can perfectly control future humans, just that they (or any other dominant power) can appropriately steer emergent inter-human dynamics (if there are still humans).