All of Paul Bricman's Comments + Replies

That's an interesting idea. In its simplest form, the escrow could have draconian NDAs with both parties, even if it doesn't have the technology to prove deletion. In general, I'm excited about techniques that influence the types of relations players can have with each other.

However, one logistical difficulty is getting a huge model from the developer's (custom) infra onto the hypothetical escrow infra... It'd be very attractive if the model could just stay with the dev somehow...

2Nathan Helm-Burger
Yes, I've been imagining some sort of international inspector who arranges for this to be set up within the infra of the developer's datacenter, and who takes responsibility for ensuring the logs and evals get deleted afterwards.

Thanks for the reply! I think the flaw you suggested is closely related to the "likelihood prioritization" augmentation of dictionary attacks from 3.2.1. Definitely something to keep in mind, though one measure against dictionary attacks in general is slow hashing, the practice of configuring a hashing function to be unusually expensive to compute.

For instance, with the current configuration, it would take a couple of years to process a million hashes if all you had was one core of a modern CPU, and this cost can be tuned arbitrarily. The slow hashing algorithm currently used also imposes a memory burden, so as to make it harder to parallelize on accelerators.
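To make the cost asymmetry concrete, here's a minimal sketch of a memory-hard slow hash using scrypt from Python's standard library. This is only a stand-in: the actual function and parameters used are configured far more aggressively than the illustrative values below.

```python
import hashlib
import os
import time

def slow_hash(guess: str, salt: bytes) -> bytes:
    # scrypt is deliberately expensive in both CPU time and memory, which makes
    # large dictionary attacks costly and hard to offload to GPUs/ASICs.
    # These parameters are far lighter than the "years per million hashes"
    # configuration described above; raising n (and maxmem) slows it further.
    return hashlib.scrypt(
        guess.encode(),
        salt=salt,
        n=2**14,   # CPU/memory cost factor
        r=8,       # block size; memory use scales with n * r
        p=1,       # parallelization factor
        dklen=32,
    )

if __name__ == "__main__":
    salt = os.urandom(16)
    start = time.time()
    slow_hash("candidate password", salt)
    per_hash = time.time() - start
    # Extrapolate the single-core cost of a million-guess dictionary attack.
    print(f"{per_hash:.3f}s per hash, ~{per_hash * 1e6 / 3600:.0f} core-hours per million guesses")
```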

Curious if you have other ideas on this!

Thanks for the interest! I'm not really sure what you mean, though. By components, do you mean circuits, or shards, or...? I'm also not sure what you mean by clarifying or deconfusing components; this sounds like interpretability, but there isn't much interpretability going on in the linked project. Feel free to elaborate, though, and I'll try to respond again.

1MiguelDev
Hello there! What I meant by components in my comment is something like the attention mechanism itself. For reference, here are the mean weights of two models I'm studying.

Thanks a lot for the feedback!

All I want for Christmas is a "version for engineers." Here's how we constructed the reward, here's how we did the training, here's what happened over the course of training.

For sure, I greatly underestimated the importance of legible and concise communication in the increasingly crowded and dynamic space that is alignment. Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit o... (read more)

2Charlie Steiner
Sounds good. I enjoyed at least 50% of the time I spent reading the epistemology :P I just wanted a go-to resource for specific technical questions. Sure, but no promises on interesting feedback. Deception's not quite the right concept; it's more like exploitation of biases and other weaknesses. This can look like deception, or it can look like incentivizing an AI to "honestly" search for arguments in a way that just so happens to be shaped by standards of the argument-evaluation process other than truth.

I feel a lot of the problem relates to an Extremal Goodhart effect, where the popular imagination views simulations as not equivalent to reality.

That seems right, but aren't all those heuristics prone to Goodharting? If your prior distribution is extremely sharp and you barely update from it, it seems likely that you run into all those various failure modes.

However, my guess is that simplicity priors, not speed or stability priors, are the default.

Not sure what you mean by default here. Likely to be used, effective, or?

1Noosphere89
I actually should focus on the circuit complexity prior, but my view is that, due to the smallness of agents compared to reality, they must generalize very well to new environments, which pushes in a simplicity direction.

Thanks a lot for the reference, I hadn't come across it before. Would you say that it focuses on gauging modularity?

1Garrett Baker
I don't know enough details about the circuit prior or the NAH to say confidently yes or no, but I'd lean no. Bicycles are more modular than addition, but if you feed in the quantum wave function representing a universe filled to the brim with bicycles versus an addition function, the circuit prior will likely say addition is more probable.

You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?

1Linda Linsefors
Similar, but not exactly. I mean that you take some known distribution (the training distribution) as a starting point, but when sampling actions you sample from a shifted or truncated distribution so as to favour higher-reward policies. In the decision transformers I linked, the AI is playing a variety of different games, where the programmers might not know what a good future reward value would be. So they let the AI predict the future reward, but with the distribution shifted towards higher rewards.

I discussed this a bit more after posting the above comment, and there is something I want to add about the comparison. With quantilizers, if you know the probability of DOOM under the base distribution, you get an upper bound on DOOM for the quantilizer. This is not the case for the type of probability shift used in the linked decision transformer.

DOOM = an unforeseen catastrophic outcome: one that would not be labelled as very bad by the AI's reward function but is in reality VERY BAD.
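A toy numerical sketch of the contrast, with every distribution and number below invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a finite set of candidate actions, each with a proxy reward (what
# the AI's reward model sees) and a hidden DOOM flag (bad in reality, but not
# penalized -- here even favoured -- by the proxy). All numbers are made up.
n = 10_000
proxy_reward = rng.normal(size=n)
doom = rng.random(n) < 0.001            # base rate of catastrophe: 0.1%
proxy_reward[doom] += 3.0               # catastrophes happen to look great to the proxy
base_probs = np.full(n, 1.0 / n)        # base (training) distribution: uniform here

def quantilizer_probs(q: float) -> np.ndarray:
    """Sample from the base distribution, restricted to the top-q fraction by proxy reward."""
    cutoff = np.quantile(proxy_reward, 1 - q)
    p = base_probs * (proxy_reward >= cutoff)
    return p / p.sum()

def shifted_probs(beta: float) -> np.ndarray:
    """Reweight the base distribution toward higher proxy reward (no DOOM bound)."""
    w = base_probs * np.exp(beta * proxy_reward)
    return w / w.sum()

for q in (0.1, 0.01):
    p = quantilizer_probs(q)
    print(f"quantilizer q={q}: P(DOOM)={p[doom].sum():.4f}  (bound: {doom.mean() / q:.4f})")
for beta in (1.0, 3.0):
    p = shifted_probs(beta)
    print(f"reward-shifted beta={beta}: P(DOOM)={p[doom].sum():.4f}  (no such bound)")
```

The quantilizer's catastrophe probability stays capped at the base rate divided by q, whereas the exponential reweighting can concentrate almost all of its probability mass on the rare proxy-gaming outcomes.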

Hm, I think I get the issue you're pointing at. I guess the argument for the evaluator learning accurate human preferences in this proposal is that it can make use of infinitely many examples of inaccurate human preferences supplied by the agent as negative examples. However, the argument against can be summed up in the following comment of Adam's:

I get the impression that with Oversight Leagues, you don't necessarily consider the possibility that there might be many different "limits" of the oversight process, that are coherent with the initial examp

... (read more)

Thanks a lot for the feedback!

How are you getting the connection between the legible property the evaluator is selecting for and actual alignment?

Quoting from another comment (not sure if this is frowned upon):

1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human's true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasi

... (read more)
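To make the shape of that loop concrete, here's a heavily simplified, hypothetical sketch: behaviors are plain feature vectors, the evaluator is a linear scorer, and the agent hill-climbs against it. None of these stand-ins are the proposal's actual machinery; only the structure of the two steps is meant to carry over.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8
human_positives = rng.normal(loc=1.0, size=(32, DIM))   # fixed positive examples from humans

class Evaluator:
    def __init__(self):
        self.w = np.zeros(DIM)
    def score(self, behavior):
        return float(self.w @ behavior)
    def fit(self, positives, negatives):
        # Crude update: move toward human-endorsed behaviors, away from the
        # agent's past outputs (treated as a growing pool of negative examples).
        self.w = positives.mean(axis=0) - negatives.mean(axis=0)

class Agent:
    def __init__(self):
        self.behavior = rng.normal(size=DIM)
    def optimize(self, score_fn, steps=200):
        # Step 1: (outer-)align the agent to the evaluator, which is easy to
        # query at will because the evaluator runs on a computer.
        for _ in range(steps):
            candidate = self.behavior + 0.1 * rng.normal(size=DIM)
            if score_fn(candidate) > score_fn(self.behavior):
                self.behavior = candidate

evaluator, agent, negatives = Evaluator(), Agent(), []
for round_idx in range(10):
    agent.optimize(evaluator.score)
    # Step 2: the agent's latest behavior joins the growing set of negative
    # examples the evaluator is refit against, alongside the fixed positives.
    negatives.append(agent.behavior.copy())
    evaluator.fit(human_positives, np.array(negatives))
    print(f"round {round_idx}: evaluator score of agent behavior = {evaluator.score(agent.behavior):.3f}")
```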
2Charlie Steiner
"Legible" in the sense of easy to measure. For example, "what makes the human press the Like button" is legible. On the other hand, "what are the human's preferences" is often illegible. The AI's inferred human preferences are typically latent variables within a model of humans. Not just any model will do; we have to some how get the AI to model humans in a way that mostly-satisfies our opinions about what our preferences are and what good reasoning about them is.

In general, could someone explain how these alignment approaches do not simply shift the question from "how do we align this one system" to "how do we align this one system (that consists of two interacting sub-systems)"?

Thanks for pointing out another assumption I didn't even consider articulating. The way this proposal answers the second question is:

1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the hum... (read more)

Thanks for the pointer to the paper, saved for later! I think this task of crafting machine-readable representations of human values is a thorny step in any CEV-like/value-loading proposal which doesn't involve the AI inferring them itself IRL-style.

I was considering sifting through the literature to form a model of the ways people have tried to do this in an abstract sense. Like, some approaches aim at a fixed normative framework. Others involve an uncertain seed which is collapsed to a likely framework. Others involve extrapolating from a fixed framework to an uncertain distribution over the possible places an initial framework might drift towards. Does this happen to ring a bell about any other references?

We tried using (1) subjectivity (based on a simple bag-of-words), and (2) zero-shot text classification (NLI-based) to help us sift through the years of tweets in search of bold claims. (1) seemed a pretty poor heuristic overall, and (2) was still super noisy (e.g. it would identify "that's awesome" as a bold claim, which is not particularly useful). The second problem was that even when tweets were identified as containing bold claims, they were often heavily contextualized in a reply thread, and so we tried decontextualizing them manually to increase the signal-to... (read more)
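For concreteness, the NLI-based heuristic was roughly of the following shape; the model and labels below are illustrative stand-ins rather than the exact setup used.

```python
from transformers import pipeline

# Zero-shot (NLI-based) filtering of tweets for bold claims. The model choice
# and candidate labels here are illustrative only.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

tweets = [
    "that's awesome",
    "Scaling laws will hold for at least four more orders of magnitude.",
]
results = classifier(tweets, candidate_labels=["bold claim", "small talk"])
for tweet, res in zip(tweets, results):
    # Top label and its score for each tweet; in practice this stays noisy.
    print(f"{res['labels'][0]} ({res['scores'][0]:.2f}): {tweet}")
```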

2Charlie Steiner
Cool to hear you tried it!