Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Sequences

Orcas

Comments

The discussion of the exercises from the "Probability 2" lecture starts mid-episode 104.

Asmodia figuring out Keltham's probability riddles may also be interesting, though perhaps less so than the lectures. It starts in episode 90; the starting quote is "no dath ilani out of living memory would've seen the phenomenon". The story unfortunately switches between Asmodia+Ione, Carissa (+Peranza, I think), and Keltham+Meritxell. You can skip the other stuff that's going on there (though the brief "dath ilan" reply about stocks might be interesting too).

Thanks!

If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating which abstract plans are actually likely to work (since the agent hasn't yet tried many similar abstract plans from which it could have observed results, and the world model's predictive capabilities generalize further). The world model may also form beliefs about what the goals/values in the current situation are. So let's say the thought generator outputs plans along with predictions about those plans, and some of those predictions say how well a plan is going to fulfill what it believes the goals are (something like an approximate expected utility). Then the value function might learn to just look at the part of a thought that predicts the expected utility, and take that as its value estimate.

Or, a slightly more concrete version of how that may happen (I'm thinking of model-based actor-critic RL agents that start out relatively unreflective, rather than just humans):

  • Sometimes the thought generator generates self-reflective thoughts like "what are my goals here?", whereupon it produces an answer "X", and then, when thinking about how to accomplish X, it often comes up with a better (according to the value function) plan than if it had tried to generate a plan directly without clarifying X. Thus the value function learns to assign positive valence to thinking "what are my goals here?" (a toy sketch of this learning dynamic follows after this list).
    • The same can happen with "what are my long-term goals", where the thought generator might guess something that would cause high reward.
    • For humans, X is likely to be more socially nice than one would expect from the value function alone, since "X are my goals here" is a self-reflective thought, where the social dimensions matter more for the overall valence guess.[1]
  • Later, the thought generator may generate the thought "make careful predictions about whether the plan will actually accomplish the stated goals", whereupon it often finds some incoherences that the value function didn't notice and produces a better plan. Then the value function learns to assign high valence to thoughts like "make careful predictions about whether the plan will actually accomplish the stated goals".
  • Later, the thought generator's predictions may not always match the valence the value function assigns, and it turns out that the thought generator's predictions were often better. So over time the value function gets updated more and more toward "take the thought generator's predictions as our valence guess", since that strategy better predicts later valence guesses.
  • Now some goals are mainly optimized through the thought generator predicting how they could be accomplished well, and there might be beliefs in the thought generator like "studying rationality may make me better at accomplishing my goals", causing the agent to study rationality.
    • And also thoughts like "making sure the currently optimized goal keeps being optimized increases the expected utility according to the goal".
    • And maybe, later, more advanced bootstrapping through thoughts like "understanding how my mind works and exploiting those insights to shape it to optimize more effectively would probably help me accomplish my goals". Though of course, for this to be a viable strategy, the agent would need to be at least as smart as the smartest current humans (which we can assume, because otherwise it's too useless IMO).
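As a very rough illustration of the learning dynamic in the first bullet above, here is a toy sketch (my own, not anyone's actual proposal; the thought strings, the `plan_quality` stand-in, and all numbers are made-up assumptions): a tabular valence estimate over two thought types gets TD-style updates toward the observed quality of the resulting plans, so the self-reflective thought ends up with higher valence because plans devised after it tend to score better.

```python
import random

random.seed(0)

# Toy sketch: a tabular "valence" over two thought types, updated TD(0)-style
# toward the observed quality of the plan that follows each thought.
THOUGHTS = ["plan directly", "what are my goals here?"]
valence = {t: 0.0 for t in THOUGHTS}
LEARNING_RATE = 0.05

def plan_quality(thought: str) -> float:
    """Stand-in for the later value/reward of the resulting plan.
    Assumption: clarifying goals first yields somewhat better plans."""
    base = 1.0 if thought == "what are my goals here?" else 0.5
    return base + random.gauss(0.0, 0.1)

for _ in range(2000):
    # epsilon-greedy choice of which thought to think next
    if random.random() < 0.1:
        thought = random.choice(THOUGHTS)
    else:
        thought = max(THOUGHTS, key=lambda t: valence[t])
    # nudge the valence guess toward the observed outcome
    valence[thought] += LEARNING_RATE * (plan_quality(thought) - valence[thought])

print(valence)  # the self-reflective thought ends up with the higher valence
```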

So now the value function is often just relaying world-model judgements, and all the actually powerful optimization happens in the thought generator. I would therefore not classify that as the following:

In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us.

So in my story, the thought generator learns to model the self-agent and has some beliefs about what goals it may have, and some coherent extrapolation of (some of) those goals is what gets optimized in the end. I guess it's probably not that likely that those goals are strongly misaligned with the value function on the distribution where the value function can evaluate plans, but there are many possible ways to generalize the values of the value function.
For humans, I think the way this generalization happens is itself value-laden (i.e., what human values are depends on this generalization). The values might generalize a bit differently for different humans, of course, but it's plausible that humans share a lot of their prior-that-determines-generalization, so AIs with a different brain architecture might generalize very differently.

Basically, whenever someone thinks "what's actually my goal here?", I would say that's already a slight departure from "using one's model-based RL capabilities in the way we normally expect". I would agree that for most humans such departures are rare and small, but I think they get a lot larger for smart, reflective people, and I wouldn't describe my own brain as "using one's model-based RL capabilities in the way we normally expect". I'm not at all sure about this, but I would expect that "using its model-based RL capabilities in the way we normally expect" won't get us to a pivotal level of capability if the value function is primitive.

  1. ^

    If I just trust my model of your model here. (Though I might misrepresent your model. I would need to reread your posts.)

Note that the "Probability 2" lecture continues after the lunch break (which is ~30 min of skippable audio).

Thanks!

Sorry, I think I intended to write what I think you think, but then just clarified my own thoughts and forgot to edit the beginning. I ought to have properly recalled your model.

Yes, I think I understand your translations and your framing of the value function.

Here are the key differences between a (more concrete version of) my previous model and what I think your model is. Please lmk if I'm still wrongly describing your model:

  • plans vs thoughts
    • My previous model: The main work for devising plans/thoughts happens in the world-model/thought-generator, and the value function evaluates plans.
    • Your model: The value function selects which of several proposed thoughts to think next. Planning happens through the value function steering the thoughts, not through the world model doing so.
  • detailedness of evaluation of value function
    • My previous model: The learned value function is a relatively primitive map from the predicted effects of plans to a value which says whether the plan is likely better than the expected counterfactual plan. E.g., maybe something roughly like: we model how something like units of exchange (including dimensions like "how much does Alice admire me") change depending on a plan, and then there is a relatively simple function from that vector of units to a value. When having abstract thoughts, the value function doesn't understand much of their content, and only uses some simple heuristics for deciding how to change its value estimate. E.g., a heuristic might be: when there's a thought that the world model thinks is valid, and it is associated with the (self-model-invoking) thought "this is bad for accomplishing my goals", then lower the value estimate. In humans slightly smarter than the current smartest humans, it might eventually learn the heuristic "do an explicit expected-utility estimate and just take the result as the value estimate"; then that is what happens, and the value function itself doesn't understand much about what's going on in the expected-utility estimate, but just lets happen whatever the abstract reasoning engine predicts. So it essentially optimizes goals that are stored as beliefs in the world model. (See the sketch after this list for a toy version of this contrast.)
      • So technically you could still say "but what gets done still depends on the value function, so when the value function just trusts some optimization procedure which optimizes a stored goal, and that goal isn't what we intended, then the value function is misaligned". But it seems sorta odd because the value function isn't really the main relevant thing doing the optimization.
      • The value function is essentially too dumb to do the main optimization itself for accomplishing extremely hard tasks. Even if you set up incentives so that there is ground-truth reward for moving closer to the goal, it would be too slow at learning which strategies work well.
    • Your model: The value function has quite a good model of what thoughts are useful to think. It is just computing value estimates, but it can make quite coherent estimates to accomplish powerful goals.
      • If there are abstract thoughts about actually optimizing a different goal than is in the interest of the value function, the value function shuts them down by assigning low value.
      • (My thoughts: One intuition is that to get to a pivotal intelligence level, the value function might need some model of its own goals in order to efficiently recognize when some of the values it is assigning aren't that coherent, but I'm pretty unsure of that. Do you think the value function can learn a model of its own values?)
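To make the contrast under "My previous model" concrete, here is a minimal toy sketch (mine, not either of our actual proposals; the feature names, weights, and the explicit_eu_estimate field are invented for illustration): a "primitive" value function as a fixed linear map over a small vector of exchange-unit-like features, versus one that has learned to defer to the thought generator's explicit expected-utility estimate whenever such an estimate is present.

```python
from dataclasses import dataclass
from typing import Optional

# Toy sketch; all field names, weights, and numbers are invented for illustration.

@dataclass
class Plan:
    predicted_effects: dict                       # e.g. {"money": 2.0, "admiration_from_alice": 1.0}
    explicit_eu_estimate: Optional[float] = None  # set if the thought generator ran an explicit EU calc

# (a) "primitive" value function: a fixed linear map over a few learned dimensions
PRIMITIVE_WEIGHTS = {"money": 0.3, "admiration_from_alice": 0.7}

def primitive_value(plan: Plan) -> float:
    return sum(PRIMITIVE_WEIGHTS.get(k, 0.0) * v for k, v in plan.predicted_effects.items())

# (b) value function that has learned the heuristic "just take the explicit EU estimate"
def deferring_value(plan: Plan) -> float:
    if plan.explicit_eu_estimate is not None:
        return plan.explicit_eu_estimate   # the powerful optimization happened upstream
    return primitive_value(plan)           # fall back to the simple heuristic

abstract_plan = Plan(predicted_effects={"money": 0.1}, explicit_eu_estimate=5.0)
print(primitive_value(abstract_plan), deferring_value(abstract_plan))  # 0.03 vs 5.0
```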

There's a spectrum between my model and yours. I don't know which model is better; at some point I'll think about what a good model here may be. (Feel free to lmk your thoughts on why your model may be better, though maybe I'll just see it once I think about it more carefully, reread some of your posts, and model your model in more detail. I'm currently not modelling either model in that much detail.)

Why two?

Mathematical/logical truths are true in all possible worlds, so they never tell you which world you are in.

If you want to say something that is true in your particular world (but not necessarily in all worlds), you need some observations to narrow down what world you are in.

I don't know how closely this matches the use in the sequence, but I think a sensible distinction between logical and causal pinpointing is: All the math parts of a statement are "logically pinpointed" and all the observation parts are "causally pinpointed".

So basically, I think that in theory you can reason about everything purely logically by using statements like "In subspace_of_worlds W: X"[1], and then you only need causal pinpointing before making decisions, for evaluating which world you're actually likely to be in.

You could imagine programming a world model where there's the default assumption that non-tautological statements are about the world we're in, and then a sentence like "Peter's laptop is silver" would get translated into sth like "In subspace_of_worlds W_main: color(<x s.t. laptop(x) AND own(Peter, x)>, silver)".
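For illustration, here is one possible way to unpack that sketch into standard predicate logic, reading the "<x s.t. ...>" description Russell-style with an explicit uniqueness clause (the uniqueness clause is my own addition, and W_main is just a label for "the world we take ourselves to be in"):

```latex
\text{In } W_{\text{main}}:\;
\exists x \,\big[\, \mathrm{laptop}(x) \land \mathrm{own}(\mathrm{Peter}, x)
  \;\land\; \forall y \,(\mathrm{laptop}(y) \land \mathrm{own}(\mathrm{Peter}, y) \rightarrow y = x)
  \;\land\; \mathrm{color}(x, \mathrm{silver}) \,\big]
```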

Most of the statements you reason with are of course about the world you're in, or close cousin worlds with only a few modifications, though sometimes we also think about further-away fictional worlds (e.g. HPMoR).

(Thanks to Kaarel Hänni for a useful conversation that led up to this.)

  1. ^

    It's just a sketch, not a proper formalization. Maybe we'd rather want sth like statements of the form "if {...lots of context facts that are true in our world...} then X".

Thanks for clarifying.

I mean I do think it can happen in my system that you allocate an object for something that's actually 0 or >1 objects, and I don't have a procedure for resolving such map-territory mismatches yet, though I think it's imaginable to have a procedure that defines new objects and tries to edit all the beliefs associated with the old object.

I definitely haven't described how we determine when to create a new object to add to our world model, but one could imagine an algorithm that checks when there's some useful latent for explaining some observations, constructs a model for that object, and then creates a new object in the abstract reasoning engine. Yeah, there's still open work to do on how a correspondence between the constant symbol for our object and our (e.g. visual) model of the object can be formalized and used, but I don't see why it wouldn't be feasible.
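A rough sketch of how I imagine both pieces could look, purely as a toy (the class, the method names, and the "unverified" flagging scheme are all my own invention here, not a worked-out proposal): allocating a new constant when some latent seems worth positing, and splitting an existing constant into two while re-attaching the old beliefs for later re-checking.

```python
from collections import defaultdict
from itertools import count

# Toy sketch; names and the "unverified" flagging scheme are invented for illustration.

class WorldModel:
    def __init__(self):
        self._ids = count()
        self.beliefs = defaultdict(set)   # object constant (an int) -> set of predicate strings

    def create_object(self, evidence: str) -> int:
        """Allocate a new constant when some latent seems to usefully explain observations."""
        obj = next(self._ids)
        self.beliefs[obj].add(f"posited_from({evidence})")
        return obj

    def add_belief(self, obj: int, predicate: str) -> None:
        self.beliefs[obj].add(predicate)

    def split_object(self, obj: int) -> tuple:
        """Map-territory repair: 'obj' turned out to be two objects. Allocate two new
        constants and mark the inherited beliefs as needing re-checking."""
        a, b = next(self._ids), next(self._ids)
        for pred in self.beliefs.pop(obj):
            self.beliefs[a].add(f"unverified({pred})")
            self.beliefs[b].add(f"unverified({pred})")
        return a, b

wm = WorldModel()
hero = wm.create_object("flying person sighted")
wm.add_belief(hero, "can_fly")
hero1, hero2 = wm.split_object(hero)   # e.g. it was actually two people in the same costume
print(wm.beliefs[hero1])
```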

I agree that we end up with a map that doesn't actually fit the territory, but I think it's fine if there's an unresolvable mismatch somewhere. There's still a useful correspondence in most places. (Sure, logic would collapse from a contradiction, but actually it's all probabilistic somehow anyway.) Although of course we don't yet have anything in our system to describe that the territory is different from the map. This is related to embedded agency, and further work is still needed on how to model your map as possibly not fitting the territory, and how that can be used.

Thx.

Yep there are many trade-offs between criteria.

Btw, totally unrelatedly:

I think that in your past work on abstraction you probably lost a decent amount of time from not properly tracking the distinction between (what I call) objects and concepts. I think you've likely at least mostly recovered from this, but in case you're not completely sure you've fully done so, you might want to check out the linked section. (I think it makes sense to start by understanding how we (learn to) model objects and only look at concepts later, since minds first learn to model objects and later carve up concepts as generalizations over similarity clusters of objects.)

Tbc, there's other important stuff besides objects and concepts, like relations and attributes. I currently find my ontology here useful for separating subproblems, so if you're interested you might read more of the linked post (if you haven't done so yet), even though you're surely already familiar with knowledge representation; but maybe you already track all that.

Thanks.

I'm still not quite understanding what you're thinking though.

For other objects, like physical ones, quantifiers have to be used. Like "at least one" or "the" (the latter only presupposes there is exactly one object satisfying some predicate). E.g. "the cat in the garden". Perhaps there is no cat in the garden or there are several. So it (the cat) cannot be logically represented with a constant.

"the" supposes there's exactly one canonical choice for what object in the context is indicated by the predicate. When you say "the cat" there's basically always a specific cat from context you're talking about. "The cat is in the garden" is different from "There's exactly one cat in the garden".

Maybe "Superman" is actually two people with the same dress, or he doesn't exist, being the result of a hallucination. This case can be easily solved by treating those names as predicates.

  • The woman believes the superhero can fly.
  • The superhero is the colleague.

I mean, there has to be some possibility of revising your world model if you notice that there are actually 2 objects where you previously thought there was only one.

I agree that "Superman" and "the superhero" denote the same object(assuming you're in the right context for "the superhero").

(And yeah to some extent names also depend a bit on context. E.g. if you have 2 friends with the same name.)

You can say "{(the fact that) there's an apple on the table} causes {(the fact that) I see an apple}"

But that's not primitive in terms of predicate logic, because here "the" in "the table" means "this" which is not a primitive constant. You don't mean any table in the world, but a specific one, which you can identify in the way I explained in my previous comment.

Yeah, I didn't mean this as a formal statement. A formal version would be:

{exists x: apple(x) AND location(x, on=Table342)} CAUSES {exists x: apple(x) AND see(SelfPerson, x)}
