johnswentworth

Comments

On the matter of software improvements potentially available during recursive self-improvement, we can look at the current pace of algorithmic improvement, which has probably been faster than scaling for some time now. So that's another lower bound on what AI will be capable of, assuming that the extrapolation holds up.

This is definitely a split which I think underlies a lot of differing intuitions about AGI and timelines. That said, the versions of each which are compatible with evidence/constraints generally have similar implications for at least the basics of AI risk (though they differ in predictions about what AI looks like "later on", once it's already far past eclipsing the capabilities of the human species).

Key relevant evidence/constraints, under my usual framing:

  • We live in a very high dimensional environment. When doing science/optimization in such an environment, brute-force search is exponentially intractable (see the rough sketch just after this list), so having e.g. ten billion humans running the same basic brute-force algorithm will not be qualitatively better than one human running a brute-force algorithm. The fact that less-than-exponentially-large numbers of humans are able to perform as well as we are implies that there's some real "general intelligence" going on in there somewhere.
    • That said, it's still possible-in-principle for whatever general intelligence we have to be importantly distributed across humans. What the dimensionality argument rules out is a model in which humans' capabilities are just about brute-force trying lots of stuff, and then memetic spread of whatever works. The "trying stuff" step has to be doing "most of the work", in some sense, of finding good models/techniques/etc; but whatever process is doing that work could itself be load-bearingly spread across humans.
    • Also, memetic spread could still be a bottleneck in practice, even if it's not "doing most of the work" in an algorithmic sense.
  • A lower bound for what AI can do is "run lots of human-equivalent minds, and cheaply copy them". Even under a model where memetic spread is the main bottlenecking step for humans, AI will still be ridiculously better at that. You know that problem humans have where we spend tons of effort accumulating "tacit knowledge" which is hard to convey to the next generation? For AI, cheap copy means that problem is just completely gone.
  • Humans' own historical progress/experience puts an upper bound on how hard it is to solve novel problems (not solved by society today). Humans have done... rather ridiculously a lot of that, over the past 250 years. That, in turn, lower bounds what AIs will be capable of.
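As a rough illustration of the first bullet (a back-of-the-envelope sketch I'm adding, not part of the original argument): even a modest grid search blows up exponentially with dimension, so adding ~10^10 parallel searchers barely moves the needle.

```python
# Hypothetical illustration: exhaustive search over a grid with just 10
# candidate settings per axis needs 10**d evaluations in d dimensions.
for d in [1, 3, 10, 30, 100]:
    print(f"{d:>3} dimensions -> {10**d:.3e} grid points to sweep")

# Even ten billion (10**10) parallel brute-force searchers only shave ~10 off
# the exponent, which is negligible once d is in the hundreds.
```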

Only if they both predictably painted that part purple, e.g. as part of the overall plan. If they both randomly happened to paint the same part purple, then no.

The main model I know of under which this matters much right now is: we're pretty close to AGI already, it's mostly a matter of figuring out the right scaffolding. Open-sourcing weights makes it a lot cheaper and easier for far more people to experiment with different scaffolding, thereby bringing AGI significantly closer in expectation. (As an example of someone who IIUC sees this as the mainline, I'd point to Connor Leahy.)

Sounds like I've maybe not communicated the thing about circularity. I'll try again; it would be useful to let me know whether or not this new explanation matches what you were already picturing from the previous one.

Let's think about circular definitions in terms of equations for a moment. We'll have two equations: one which "defines" x in terms of y, and one which "defines" y in terms of x:

x = f(y)

y = g(x)

Now, if g is just the inverse of f, then (I claim) that's what we normally think of as a "circular definition". It's "pretending" to fully specify x and y, but in fact it doesn't, because one of the two equations is just a copy of the other equation but written differently. The practical problem, in this case, is that x and y are very underspecified by the supposed joint "definition".

But now suppose g is not the inverse of f, and more generally the equations are not degenerate. Then our two equations are typically totally fine and useful, and indeed we use equations like this all the time in the sciences and they work great. Even though they're written in a "circular" way, they're substantively non-circular. (They might still allow for multiple solutions, but the solutions will typically at least be locally unique, so there's a discrete and typically relatively small set of solutions.)

That's the sort of thing which clustering algorithms do: they have some equations "defining" cluster-membership in terms of the data points and cluster parameters, and equations "defining" the cluster parameters in terms of the data points and the cluster-membership:

cluster_membership = f(data, cluster_params)

cluster_params = g(data, cluster_membership)

... where f and g are different (i.e. non-degenerate; g is not just the inverse of f with data held constant). Together, these "definitions" specify a discrete and typically relatively small set of candidate (cluster_membership, cluster_params) values given some data.
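To make that concrete, here's a minimal k-means-style sketch (illustrative only; the function names f, g, and fit are my own choices, not from the comment): the assignment step plays the role of the first "definition", the centroid-update step plays the role of the second, and iterating the pair lands on one of the fixed points jointly picked out by the two equations.

```python
import numpy as np

def f(data, centroids):
    # cluster_membership "defined" from data and cluster params:
    # index of the nearest centroid for each data point
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def g(data, membership, k):
    # cluster params "defined" from data and cluster membership:
    # mean of the points assigned to each cluster
    # (sketch assumes no cluster goes empty along the way)
    return np.stack([data[membership == j].mean(axis=0) for j in range(k)])

def fit(data, k, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(steps):
        membership = f(data, centroids)
        new_centroids = g(data, membership, k)
        if np.allclose(new_centroids, centroids):
            break  # a fixed point of the two "circular" equations
        centroids = new_centroids
    return membership, centroids
```

The point of the sketch is just that, even though each quantity is written in terms of the other, the pair of equations jointly pins down a small discrete set of (membership, params) candidates rather than leaving everything unspecified.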

That, I claim, is also part of what's going on with abstractions like "dog".

(Now, choice of axes is still a separate degree of freedom which has to be handled somehow. And that's where I expect the robustness to choice of axes does load-bearing work. As you say, that's separate from the circularity issue.)

As I mentioned at the end, it's not particularly relevant to my own models either way, so I don't particularly care. But I do think other people should want to run this experiment, based on their stated models.

That's only true if the Bellman equation in question allows for a "current payoff" at every timestep. That's the term which allows for totally arbitrary value functions, and not-coincidentally it's the term which does not reflect long-range goals/planning, just immediate payoff.

If we're interested in long-range goals/planning, then the natural thing to do is check how consistent the policy is with a Bellman equation without a payoff at each timestep - i.e. a value function just backpropagated from some goal at a much later time. That's what would make the check nontrivial: there exist policies which are not consistent with any assignment of values satisfying that Bellman equation. For example, the policy which chooses to transition from state A -> B with probability 1 over the option to stay at A with probability 1 (implying value B > value A for any values consistent with that policy), but also chooses to transition B -> A with probability 1 over the option to stay at B with probability 1 (implying value A > value B for any values consistent with that policy).

(There's still the trivial case where indifference could be interpreted as compatible with any policy, but that's easy to handle by adding a nontriviality requirement.)
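As a minimal sketch of that check (assuming deterministic choices and the nontriviality requirement just mentioned): each "move rather than stay" choice pins down a strict inequality between values, and a consistent value assignment exists exactly when those inequalities contain no cycle.

```python
def consistent_values_exist(strict_prefs):
    """strict_prefs: list of (hi, lo) pairs meaning the policy implies V(hi) > V(lo).
    A real-valued assignment satisfying all the strict inequalities exists iff
    the implied preference graph has no cycle."""
    succ = {}
    for hi, lo in strict_prefs:
        succ.setdefault(hi, set()).add(lo)
        succ.setdefault(lo, set())
    visiting, done = set(), set()
    def has_cycle(s):
        visiting.add(s)
        for t in succ[s]:
            if t in visiting or (t not in done and has_cycle(t)):
                return True
        visiting.discard(s)
        done.add(s)
        return False
    return not any(has_cycle(s) for s in succ if s not in done)

# The example above: choosing A -> B over staying implies V(B) > V(A),
# choosing B -> A over staying implies V(A) > V(B). No assignment satisfies both.
print(consistent_values_exist([("B", "A"), ("A", "B")]))  # False
```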

I don't usually think about RL on MDPs, but it's an unusually easy setting in which to talk about coherence and its relationship to long-term-planning/goal-seeking/power-seeking.

Simplest starting point: suppose we're doing RL to learn a value function (i.e. mapping from states to values, or mapping from states x actions to values, whatever your preferred setup), with transition probabilities known. Well, in terms of optimal behavior, we know that the optimal value function for any objective in the far future will locally obey the Bellman equation with zero payoff in the immediate timestep: value of this state is equal to the max over actions of expected next-state value under that action. So insofar as we're interested in long-term goals specifically, there's an easy local check for the extent to which the value function "optimizes for" such long-term goals: just check how well it locally satisfies that Bellman equation.
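Here's a minimal sketch of that local check, assuming a tabular MDP with a known transition tensor P and a candidate value function V (array names and shapes are my own choices, not from the comment):

```python
import numpy as np

def bellman_residuals(V, P):
    """V: shape (num_states,). P: shape (num_states, num_actions, num_states),
    where P[s, a, s'] is the known transition probability."""
    expected_next = P @ V                   # shape (num_states, num_actions)
    backed_up = expected_next.max(axis=1)   # max over actions, zero immediate payoff
    return np.abs(V - backed_up)            # per-state violation of the Bellman equation

# The smaller the residuals, the more the value function behaves as if it's
# backpropagated from some far-future objective rather than immediate payoff.
```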

From there, we can extend to gradually more complicated cases in ways which look similar to typical coherence theorems (like e.g. Dutch Book theorems). For instance, we could relax the requirement of known probabilities: we can ask whether there is any assignment of state-transition probabilities such that the values satisfy the Bellman equation.
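One way to cash that out (my own reading, under the assumption that only the support of each state-action transition is fixed while the probabilities over that support are free): an action's expected next-state value can then be anything between the min and max of V over that action's possible successors, which gives a simple feasibility check.

```python
def consistent_with_some_transition_probs(V, successors):
    """Sketch under the assumptions above. successors[s][a] is the (nonempty)
    set of states reachable from s via action a; states with no actions are
    treated as terminal and impose no constraint.
    V satisfies the zero-payoff Bellman equation for *some* choice of
    transition probabilities iff, at every non-terminal state s:
      (1) no action is forced above V[s] (its min successor value <= V[s]), and
      (2) some action can hit V[s] exactly (its successor values straddle V[s])."""
    for s, actions in successors.items():
        if not actions:
            continue  # terminal state: no Bellman constraint in this sketch
        mins = [min(V[t] for t in succ) for succ in actions.values()]
        maxs = [max(V[t] for t in succ) for succ in actions.values()]
        if any(m > V[s] for m in mins):
            return False  # that action exceeds V[s] under every probability choice
        if not any(lo <= V[s] <= hi for lo, hi in zip(mins, maxs)):
            return False  # no action can achieve expected next-state value V[s]
    return True
```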

As another example, if we're doing RL on a policy rather than value function, we can ask whether there exists any value function consistent with the policy such that the values satisfy the Bellman equation.

So that example SWE-bench problem from the post:

... is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.

(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can't do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)
