All of Paul Colognese's Comments + Replies

This Goodness of Reality hypothesis is a very strong empirical claim about psychology that strongly contradicts folk psychology,


One way of thinking about the Goodness of Reality hypothesis is that if we look at an agent in the world, its world model and utility function/preferences are fully a property of that agent/its internals rather than reality-at-large. Reality is value-neutral - it requires additional structure (utility function, etc.) to assign value to states of reality (and these utility functions, to the extent that they're real, are parts of re... (read more)

Interesting! I'm working on a project exploring something similar but from a different framing. I'll give this view some thought, thanks!

Thanks, that's the kind of answer I was looking for.

Interesting discussion; thanks for posting!

I'm curious about what the elementary units in NNs could be.

the elementary units are not the neurons, but some other thing.

I tend to model NNs as computational graphs where activation spaces/layers are the nodes and weights/tensors are the edges of the graph. Under this framing, my initial intuition is that elementary units are either going to be contained in the activation spaces or the weights.
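As a toy sketch of this framing (purely illustrative; the two-layer ReLU network and the layer shapes are arbitrary choices on my part), the activation spaces are the nodes visited on a forward pass and the weight tensors are the edges mapping one node to the next:

```python
import numpy as np

rng = np.random.default_rng(0)

# Edges of the graph: weight tensors mapping one activation space to the next.
W1 = rng.normal(size=(8, 4))   # 4-dim input space -> 8-dim hidden space
W2 = rng.normal(size=(2, 8))   # 8-dim hidden space -> 2-dim output space

def forward(x):
    # Nodes of the graph: the activation spaces visited during a forward pass.
    h = np.maximum(W1 @ x, 0.0)  # hidden activation space (ReLU)
    y = W2 @ h                   # output activation space
    return {"input": x, "hidden": h, "output": y}

activations = forward(rng.normal(size=4))
print({name: act.shape for name, act in activations.items()})
```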

There does seem to be empirical evidence that features of the dataset are represented as linear directions in activatio... (read more)
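Roughly, the linear-direction picture says that how strongly a feature is "on" for a given input can be read off by projecting the activation vector onto a fixed direction. A minimal sketch of that picture (toy numbers, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16
# Hypothetical feature direction in a 16-dim activation space (unit norm).
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)

def feature_strength(activation):
    # Under the linear-representation picture, "how present the feature is"
    # is just the projection of the activation onto the feature direction.
    return float(activation @ feature_dir)

# An activation containing the feature with strength 3.0, plus unrelated noise.
act = 3.0 * feature_dir + 0.1 * rng.normal(size=d_model)
print(feature_strength(act))  # approximately 3.0
```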

4Charbel-Raphaël
I think you could imagine many different types of elementary units wrapped in different ontologies:
* Information may be encoded linearly in NNs, with superposition or composition, locally or highly distributed (see the figure below from Distributed Representations: Composition & Superposition; a toy sketch of superposition follows this list).
* Maybe a good way to understand NNs is the polytope theory?
* Maybe some form of memory is encoded as key-value pairs in the MLPs of transformers?
* Or maybe you could think of NNs as Bayesian causal graphs.
* Or maybe you should think instead of algorithms inside transformers (induction heads, the modular addition algorithm, etc.), and it's not that meaningful to think of linear directions.

Or most likely a mixture of everything.
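A minimal toy sketch of the superposition idea from the first item (purely illustrative, not taken from the linked post): store more sparse "features" than dimensions by assigning each a random, nearly orthogonal direction, and read them back off linearly.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_dims = 100, 50  # more features than dimensions
# Random directions in a lower-dimensional space are nearly orthogonal,
# which is what lets sparse features share the activation space.
dirs = rng.normal(size=(n_features, n_dims))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Encode a sparse set of active features as a single activation vector.
active = {3: 1.0, 17: 0.5}  # feature index -> strength
act = sum(strength * dirs[i] for i, strength in active.items())

# Linear readout: projecting back onto each direction approximately recovers
# the active features, with some interference noise on the inactive ones
# (which shrinks as the dimension grows).
readout = dirs @ act
print(readout[3], readout[17], np.abs(np.delete(readout, [3, 17])).max())
```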

Thanks for pointing this out. I'll look into it and modify the post accordingly.

With ideal objective detection methods, the inner alignment problem is solved (or only partially solved, if the detection methods are non-ideal), and governance would be needed to regulate which objectives are allowed to be instilled in an AI (i.e., the government does something like outer alignment regulation).

Ideal objective oversight essentially allows an overseer to instill whatever objectives it wants the AI to have. Therefore, if the overseer includes the government, the government can influence whatever target outcomes the AI pursues.

So practically... (read more)

Thanks for the response; it's useful to hear that we came to the same conclusions. I quoted your post in the first paragraph.

Thanks for bringing Fabien's post to my attention! I'll reference it. 

Looking forward to your upcoming post.

2Seth Herd
Ooops, I hadn't clicked those links so didn't notice they were to my posts! You've probably found this, since it's the one tag on your post: the chain of thought alignment tag goes to some other related work. There's a new one up today that I haven't finished processing.

Interesting! Quick thought: I feel as though it over-compressed the post compared to the summary I used. Perhaps you could tweak things to generate multiple summaries of varying lengths.

2Evan R. Murphy
Great idea, I will experiment with that - thanks!

Thanks for the feedback! I guess the intention of this post was to lay down the broad framing/motivation for upcoming work that will involve looking at the more concrete details.

I do resonate with the feeling that the post as a whole feels a bit empty as it stands, and that the effort could have been better spent elsewhere.

My current high-level research direction

It’s been about a year since I became involved in AI Alignment. Here is a super high-level overview of the research direction I intend to pursue over the next six or so months.

  • We’re concerned about building AI systems that produce “bad behavior”, either during training or in deployment.
  • We define “irreversibly bad behavior” to include actions that inhibit an overseer’s ability to monitor and control the system. This includes removing an off-switch and deceptive behavior.
  • To prevent bad behavior from occurring, the overs
... (read more)