All of Oleg Trott's Comments + Replies

"why didn't the first person to come up with the idea of using computers to predict the next element in a sequence patent that idea, in full generality"

 

Patents are valid for about 20 years. But Bengio et al. used NNs to predict the next word back in 2000:

https://papers.nips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf

So this idea is old. Only some specific architectural aspects are new.

I suspect this labeling and using the labels is still harder than you think though, since individual tokens don't have truth values.

 

Why should they?

You could label each paragraph, for example. Then, when the LM is trained, the correct label could come before each paragraph, as a special token: <true>, <false>, <unknown> and perhaps <mixed>.

Then, during generation, you'd feed it <true> as part of the prompt, and again whenever it generates a paragraph break.

Similarly, you could do this on a per-sentence basis.
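
A minimal sketch of how this could look in practice (the token names follow the comment above; the helper function and example prompt are mine, purely for illustration): each paragraph gets its label token prepended during training, unlabeled text defaults to <unknown>, and generation starts from a prompt that leads with <true>.

    # Sketch of per-paragraph truth labeling (hypothetical helper, illustration only).
    SPECIAL_TOKENS = {"true": "<true>", "false": "<false>",
                      "unknown": "<unknown>", "mixed": "<mixed>"}

    def build_training_text(paragraphs, labels=None):
        """Prepend the appropriate label token to each paragraph.
        labels maps paragraph index -> "true"/"false"/"mixed";
        everything else defaults to "unknown"."""
        labels = labels or {}
        tagged = []
        for i, paragraph in enumerate(paragraphs):
            label = labels.get(i, "unknown")  # most of the text stays <unknown>
            tagged.append(SPECIAL_TOKENS[label] + " " + paragraph)
        return "\n\n".join(tagged)

    # During generation, <true> goes into the prompt, and would be re-inserted
    # after every paragraph break the model emits:
    prompt = SPECIAL_TOKENS["true"] + " Explain how vaccines work.\n\n"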

The idea that we're going to produce a similar amount of perfectly labeled data doesn't seem plausible.

 

That's not at all the idea. Allow me to quote myself:

Here’s what I think we could do. Internet text is vast – on the order of a trillion words. But we could label some of it as “true” and “false”. The rest will be “unknown”.

You must have missed the words "some of" in it. I'm not suggesting labeling all of the text, or even a large fraction of it. Just enough to teach the model the concept of right and wrong.

It shouldn't take long, especially since I... (read more)

2Brendan Long
Ah, I misread the quote you included from Nathan Helm-Burger. That does make more sense. This seems like a good idea in general, and would probably make one of the things Anthropic is trying to do (find the "being truthful" neuron) easier. I suspect this labeling and using the labels is still harder than you think though, since individual tokens don't have truth values. I looked through the links you posted and it seems like the push-back is mostly around things you didn't mention in this post (prompt engineering as an alignment strategy).
Answer by Oleg Trott-10

Carlson's interview, BTW. It discusses LessWrong in the first half of the video. Between X and YouTube, the interview got 4M views -- possibly the most high-profile exposure of this site?

 

 

I'm kind of curious about the factual accuracy: "debugging" / struggle sessions, polycules, and the 2017 psychosis -- Did that happen?

What do VELM and VETLM offer which those other implementable proposals don't? And what problems do VELM and VETLM not solve?

 

VETLM solves superalignment, I believe. It's implementable (unlike CEV), and it should not be susceptible to wireheading (unlike RLHF, instruction following, etc.). Most importantly, it's intended to work with an arbitrarily good ML algorithm -- the stronger the better.

So, will it self-improve, self-replace, escape, let you turn it off, etc.? Yes, if it thinks that this is what its creators would have wanted.

Will it be trans... (read more)

New proposals are useful mainly insofar as they overcome some subset of barriers which stopped other solutions.

 

CEV was stopped by being unimplementable, and possibly divergent:

The main problems with CEV include, firstly, the great difficulty of implementing such a program - “If one attempted to write an ordinary computer program using ordinary computer programming skills, the task would be a thousand lightyears beyond hopeless.” Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after it

... (read more)
4johnswentworth
Well, we have lots of implementable proposals. What do VELM and VETLM offer which those other implementable proposals don't? And what problems do VELM and VETLM not solve? Alternatively: what's the combination of problems which these solutions solve, which nothing else we've thought of simultaneously solves?

That post was completely ignored here: 0 comments and 0 upvotes during the first 24 hours.

I don't know if it's the timing or the content.

On HN, which is where I saw it, it was ranked #1 briefly, as I recall. But then it got "flagged", apparently. 

Machine Learning Street Talk interview of one of the authors: 

There was an article in New Scientist recently about "sending particles back in time". I was a physics major, but I might have skipped the time travel class, so I don't have an opinion on this. But Sabine Hossenfelder posted a video, arguing that New Scientist misrepresented the actual research.

Side note: the link didn't make it to the front page of HN, despite early upvotes. Other links with worse stats (votes at a certain age) rose to the very top. Anyway, it's currently ranked #78. I guess I don't really understand how HN ranks things; I hope someone will explain this to me. Does the source ("youtube" vs. "nytimes") matter? Do flag-votes count as silent mega-downvotes? Does the algorithm punish posts with numbers in them?

9lincolnquirk
Yes - HN users with flag privileges can flag posts. Flags operate as silent mega-downvotes. (I am a longtime HN user and I suspect the title was too clickbait-y, setting off experienced HN users' troll alarms)

Thanks! It looks interesting. Although I think it's different from what I was talking about.

2Nathan Helm-Burger
I've been continuing to think about this. I eventually remembered that the place I remembered the idea from was private conversations with other AI researchers! Sorry to have sent you on a wild goose chase! It would be interesting to try an experiment with this. Perhaps doing hierarchical clustering on OpenWebText (an early not-too-huge dataset from GPT-2 days). Then get an LLM with a large context window to review a random subset of each cluster and write a description of it (including an estimate of factual validity). Then, when training, those descriptions would be non-predicted context given to the model. If you do use hierarchical clustering, this will result in a general description and some specific subtype descriptions for every datapoint.
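
For what it's worth, a rough sketch of the experiment Nathan Helm-Burger describes, under assumed tooling: embed stands in for some document encoder, llm_describe for the LLM review step, and scikit-learn's AgglomerativeClustering for the clustering (a single level here, rather than a full hierarchy). None of these specific choices come from the comment above.

    # Rough sketch of the clustering + description experiment (assumptions:
    # "embed" is some document encoder; "llm_describe" stands in for an LLM
    # call that summarizes a sample and estimates its factual validity;
    # neither is specified above, and only one clustering level is done).
    import random
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_and_describe(documents, embed, llm_describe,
                             n_clusters=100, sample_size=20):
        X = np.stack([embed(d) for d in documents])
        cluster_of = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

        descriptions = {}
        for c in range(n_clusters):
            members = [d for d, label in zip(documents, cluster_of) if label == c]
            sample = random.sample(members, min(sample_size, len(members)))
            descriptions[c] = llm_describe(sample)

        # During training, descriptions[cluster_of[i]] would be supplied to the
        # model as non-predicted context (no loss computed on those tokens).
        return cluster_of, descriptions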

I think your idea of labelling the source and epistemic status of all training data is good. I've seen the idea presented before.

 

I'm not finding anything. Do you recall the authors? Presented at a conference? Year perhaps? Specific keywords? (I tried the obvious)

3Nathan Helm-Burger
Hmmm. I don't remember. But here's a new example that Zvi just mentioned: https://arxiv.org/abs/2310.15047

I think that regularization in RL is normally used to get more rewards (out-of-sample).

Sure, you can increase it further and do the opposite – subvert the goal of RL (and prevent wireheading).

But wireheading is not an instability, local optimum, or overfitting. It is in fact the optimal policy, if some of your actions let you choose maximum rewards.
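
To spell that out with a toy example (an entirely hypothetical setup, just to illustrate "optimal policy"): if one available action directly yields the maximum attainable reward, the reward-maximizing policy is simply to take that action forever.

    # Toy illustration (hypothetical numbers): a one-state MDP where "do_task"
    # pays 1 per step and "wirehead" pays the maximum attainable reward of 10.
    # With a single state, the optimal policy repeats the best action, whose
    # discounted value is r / (1 - gamma).
    rewards = {"do_task": 1.0, "wirehead": 10.0}
    gamma = 0.99

    values = {a: r / (1 - gamma) for a, r in rewards.items()}
    best = max(values, key=values.get)
    print(best, values[best])  # -> wirehead, value ~ 1000: the optimum, not a failure mode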

Anyway, the quote you are referring to says “as (AI) becomes smarter and more powerful”.

It doesn’t say that every RL algorithm will wirehead (find the optimal policy), but that an ASI-level one will. I have no mathematical proof of this, since these are fuzzy concepts. I edited the original text to make it less controversial.

Most humans are aware of the possibility of wireheading, both the actual wire version and the more practical versions involving psychotropic drugs.

 

For humans, there are negative rewards for abusing drugs/alcohol -- hangover the next day, health issues, etc. You could argue that they are taking those into account.

But for an entirely RL-driven AI, wireheading has no anticipated downsides.

4RogerDearnaley
In practice, most current AIs are not constructed entirely by RL, partly because it has instabilities like this. For example, LLMs instruction-trained by RLHF use a KL-divergence loss term to limit how dramatically the RL can alter the base model behavior trained by SGD. So the result deliberately isn't pure RL. Yes, if you take a not-yet intelligent agent, train it using RL, and give it unrestricted access to a simple positive reinforcement avenue unrelated to the behavior you actually want, it is very likely to "wire-head" by following that simple maximization path instead. So people do their best not to do that when working with RL.
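
For reference, a minimal sketch of the kind of KL-regularized objective RogerDearnaley is describing (placeholder names and a placeholder beta; the exact form varies between RLHF implementations): the reward-model score is penalized by the policy's divergence from the frozen SGD-trained base model, so pure reward maximization is deliberately traded off against staying close to the base distribution.

    # Minimal sketch of a KL-regularized RLHF objective (placeholder names;
    # real pipelines, e.g. PPO-based ones, differ in details).
    import torch

    def kl_regularized_reward(reward, policy_logprobs, base_logprobs, beta=0.1):
        """reward: scalar from the reward model for a sampled completion.
        policy_logprobs / base_logprobs: per-token log-probs of that completion
        under the trained policy and under the frozen base model."""
        kl_estimate = (policy_logprobs - base_logprobs).sum()  # Monte Carlo KL estimate
        return reward - beta * kl_estimate  # maximize reward, but stay near the base model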

Yes, it's simple enough that I imagine people have likely come up with it before. But it fixes a flaw in the other idea (which is also simple, although in the previous discussion I was told that it might be novel).

many of which will allow for satisfaction, while still allowing the AI to kill everyone.

This post is just about alignment of AGI's behavior with its creator's intentions, which is what Yoshua Bengio was talking about.

If you wanted to constrain it further, you'd say that in the prompt. But I feel that rigid constraints are probably unhelpful, the way The Three Laws of Robotics are. For example, anyone could threaten suicide and force the AGI to do absolutely anything short of killing other people.

Quoting from the CEV link:

The main problems with CEV include, firstly, the great difficulty of implementing such a program - “If one attempted to write an ordinary computer program using ordinary computer programming skills, the task would be a thousand lightyears beyond hopeless.” Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004.

Neither problem seems relevant to what I'm proposing. My implementation is just a prompt. And there is no explicit optimization (after the LM has been trained).

Has anyone proposed exactly what I'm proposing? (slightly different wording is OK, of course)

3Seth Herd
I don't think anyone has proposed this. I think the most similar proposal is my instruction-following AGI (particularly since I'm also mostly thinking of just such a text prompt in a language model agent as the implementation). My proposal with its checking emphasis is aimed more at the range where the AGI is human level and above, whereas yours seems more aimed at the truly superintelligent range. Mine keeps the human in charge of figuring out what they would've wanted in case the AGI gets that wrong. Other related work is linked in that post. The above objections to CEV partly apply to your proposal. There is probably not just one thing X would've wanted with more consideration, since conclusions may depend on circumstances. I'm not sure that breaks the proposal; it could be that any of the several things X might've wanted would serve adequately.

"Some content on the Internet is fabricated, and therefore we can never trust LMs trained on it"

Is this a fair summary?

3johnswentworth
No, because we have tons of information about what specific kinds of information on the internet is/isn't usually fabricated. It's not like we have no idea at all which internet content is more/less likely to be fabricated. Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It's standard basic material, there's lots of presentations of it, it's not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that's not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case. On the other end of the spectrum, "fusion power plant blueprints" on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.
3Tapatakt
I would add "and the kind of content you want to get from aligned AGI definitely is fabricated on the Internet today". So the powerful LM trying to predict it will predict what the fabrication would look like.
Oleg Trott1-10

Technically true. But you could similarly argue that humans are just clumps of molecules following physical laws. Talking about human goals is a charitable interpretation. 

And if you are in a charitable mood, you could interpret LMs as absorbing the explicit and tacit knowledge of millions of Internet authors. A superior ML algorithm would just be doing this better (and maybe it wouldn't need lower-quality data).

9johnswentworth
That is not how this works. Let's walk through it for both the "human as clumps of molecules following physics" and the "LLM as next-text-on-internet predictor". Humans as clumps of molecules following physics Picture a human attempting to achieve some goal - for concreteness, let's say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: they maybe get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket. Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever. LLM as next-text-on-internet predictor Now imagine finding the text "Notes From a Prompt Factory" on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today. The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it's fiction about a fictional prompt factory. So that's the sort of thing we should expect a highly capable LLM to output following the prompt "Notes From a Prompt Factory": fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction. It's not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I'm perfectly happy to assume that it is. The issue is that the distribution of text on today's internet which follows the prompt "Notes From a Prompt Factory" is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge fr

A variation on this:

Any expression should be considered for replacement by a slightly bigger or smaller one. For example

z = f(x**2 * y)

should be replaceable by

z = f((x**2 - 1) * y)

The generated programs are quite short. So I would guess that this multiplies their number by 100-1000, if you consider one perturbation at a time.
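
A sketch of how such one-at-a-time perturbations could be generated with Python's ast module (my own illustration of the idea, not the implementation from the post): every value-position expression yields two variants of the whole program, one with the expression replaced by expr - 1 and one with expr + 1.

    # Sketch of Program Dithering via the ast module (illustration only).
    import ast, itertools

    def is_value_expr(node):
        # Skip assignment targets etc., which can't be wrapped in arithmetic.
        return isinstance(node, ast.expr) and \
               isinstance(getattr(node, "ctx", ast.Load()), ast.Load)

    class Perturb(ast.NodeTransformer):
        """Wrap the target-th value expression as (expr +/- 1)."""
        def __init__(self, target, delta):
            self.target, self.delta, self.seen = target, delta, -1

        def visit(self, node):
            node = self.generic_visit(node)
            if is_value_expr(node):
                self.seen += 1
                if self.seen == self.target:
                    op = ast.Add() if self.delta > 0 else ast.Sub()
                    node = ast.BinOp(left=node, op=op, right=ast.Constant(abs(self.delta)))
            return node

    def dither(source):
        n = sum(is_value_expr(x) for x in ast.walk(ast.parse(source)))
        for target, delta in itertools.product(range(n), (-1, +1)):
            tree = Perturb(target, delta).visit(ast.parse(source))
            yield ast.unparse(ast.fix_missing_locations(tree))

    for variant in dither("z = f(x**2 * y)"):
        print(variant)  # e.g. z = f((x ** 2 - 1) * y)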

If GPT-4o made the off-by-one error, is it reasonable to expect GPT-3.5 to spot it?

4ryan_greenblatt
No, but it doesn't need to spot errors, just note places which could plausibly be bugs.

@ryan_greenblatt's approach also asks GPT-4o to improve its previous guesses.

These calls are expensive though.

The idea of Program Dithering is to generate many candidate programs cheaply.

2ryan_greenblatt
Agree overall, but you might be able to use a notably cheaper model (e.g. GPT-3.5) to dither.

If you have k locations that you want to perturb, then if you try a single off-by-one perturbation at a time, this adds 2k programs. With two at a time, this adds 2k(k-1) programs.
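
A quick enumeration check of those counts, assuming each location can be perturbed by -1 or +1 (my own sanity check, not from the original post):

    # Sanity check of the counts for k = 10 perturbable locations.
    from itertools import combinations, product

    k = 10
    single = sum(1 for _ in product(range(k), (-1, +1)))              # 2k = 20
    double = sum(1 for _ in product(combinations(range(k), 2),
                                    product((-1, +1), repeat=2)))     # 2k(k-1) = 180
    print(single, double)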

There's a possible optimization, where you only try this on tasks where no unperturbed program was found (<28%)

 

EDIT: Ironically, I made an off-by-one error, which Program Dithering would have fixed: this should be 2k(k-1).

This looks similar, in spirit, to Large Language Models as General Pattern Machines:

https://arxiv.org/abs/2307.04721 

I'm surprised by how knowledgeable people are about this on this site!

BTW, there's some discussion of this happening on the CCL mailing list (limited to professionals in relevant fields) if you are interested.

Right. The benchmark (their test set) just compares 3D structures.

Side note: 52% also seems low for Vina, but I haven't looked into this further. Maybe the benchmark is hard, or maybe the "search space" (user-specified) was too big.

On their other test (in their Extended Data), both Vina and AF3 do much better. 

Unlike Vina, AF3 only predicts 3D structures, I believe. It does not predict binding affinities.

2ChristianKl
AlphaFold 2 was only predicting 3D structures. From the abstract of the Alpha Fold 3 paper:  The FastCompany article says:
Oleg Trott100

Determining 3D structures is expensive.

The most realistic thing one could do is repeat this work, with the same settings, but using k-fold cross-validation, where test and training sets are never related (like what I did at Columbia). 

This will show how well (or poorly, as the case may be) the method generalizes to unrelated proteins.
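
One standard way to set up such a split (a sketch of the general technique, not the AlphaFold 3 authors' protocol): cluster the proteins by sequence similarity and use the cluster IDs as groups, so that related proteins never end up on both sides of a fold.

    # Sketch of similarity-aware k-fold cross-validation (illustration only;
    # "cluster_ids" is assumed to come from clustering proteins by sequence
    # similarity at some identity threshold).
    from sklearn.model_selection import GroupKFold

    def unrelated_splits(examples, cluster_ids, k=5):
        """Yield train/test index pairs where no similarity cluster
        is split across train and test."""
        gkf = GroupKFold(n_splits=k)
        for train_idx, test_idx in gkf.split(examples, groups=cluster_ids):
            yield train_idx, test_idx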

I hope someone does it.

2ChristianKl
But do you need 3D structures to test alphafold? If alphafold makes a prediction about whether a given ligand binds to a protein, I would expect that testing whether or not that ligand binds to the protein is much cheaper.

(ligand = drug-like molecule, for anyone else reading)

Right, I didn't mean exact bitwise memory comparisons.

The dataset is redundant(ish), simply as an artifact of how it's constructed:

For example, if people know that X binds A, and X ≈ Y, and A ≈ B, they'll try to add X+B, Y+A and Y+B to the dataset also.

And this makes similarity-based predictions look artificially more useful than they actually are, because in the "real world", you will need to make predictions about molecules from some collection that are dissimilar to your training data.

I hope this makes sense.