All of veered's Comments + Replies

veered30

I've not really seen it written up, but it's conceptually similar to the classic ML ideas of overfitting, over-parameterization, under-specification, and generalization. If you imagine your alignment constraints as a kind of training data for the model then those ideas fall into place nicely.

After some searching, the most relevant thing I've found is Section 9 (page 44) of Interpretable machine learning: Fundamental principles and 10 grand challenges. Larger model classes often have bigger Rashomon sets, and different models in the same Rashomon set can behave very differently.
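To make the Rashomon-set point concrete, here's a toy sketch of my own (not from the paper): two very different model classes fit the same training data essentially perfectly, yet their agreement can drop once you query them off the training distribution.

```python
# Toy sketch (my own illustration): a decision tree and an MLP that both sit in
# the same Rashomon set for this training data -- near-identical training
# accuracy -- can still disagree on inputs far from the training distribution.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)   # simple linear rule

tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                    random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train), mlp.score(X_train, y_train))

# Off-distribution inputs: nothing in training pinned down the models out here,
# so two members of the Rashomon set are free to behave differently.
X_far = rng.uniform(-5, 5, size=(5000, 2))
print("off-distribution agreement:", np.mean(tree.predict(X_far) == mlp.predict(X_far)))
```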

veered10

All else equal, I think minimizing model entropy (i.e. the number of weights) is desirable. In other words, you want to keep the size of the model class small.

Roughly, alignment could be viewed as constructing a list of constraints or criteria that a model must satisfy in order to be considered safe. As the size of the model class grows, more models will satisfy any particular constraint. The complexity of the constraints likely needs to grow along with the complexity of the model class.
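As a toy illustration of that last point (my own sketch, nothing from the thread): fix a couple of input-output constraints and count how many models in a hypothesis class satisfy them as the class grows. Here the "model class" is every boolean function on n input bits.

```python
# Toy counting sketch (my own illustration): treat "alignment constraints" as a
# fixed set of required input -> output behaviours and count how many models in
# the class satisfy all of them. The class is all boolean functions on n bits,
# so it has 2**(2**n) members.
from itertools import product

def count_consistent(n_bits, constraints):
    """Count boolean functions f: {0,1}^n -> {0,1} satisfying every (x, y) pair."""
    inputs = list(product([0, 1], repeat=n_bits))
    count = 0
    for outputs in product([0, 1], repeat=len(inputs)):    # enumerate every model
        table = dict(zip(inputs, outputs))
        if all(table[x] == y for x, y in constraints):
            count += 1
    return count

for n in [2, 3, 4]:
    constraints = [((0,) * n, 0), ((1,) * n, 1)]           # two fixed "safety" requirements
    print(f"n={n}: {count_consistent(n, constraints)} of {2 ** (2 ** n)} models satisfy both")
```

The constraints stay fixed while the satisfying set grows exponentially with n, which is the concern above.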

If a large number of models satisfy all the constraints, there is a la... (read more)

1Hoagy
Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?
2Zach Stein-Perlman
Hmm, I haven't heard of this kind of thing before; what should I read to learn more?
veered12

He's talking about "modern AI training", i.e. "giant, inscrutable matrices of floating-point numbers". My impression is that he thinks it is possible (but extremely difficult) to build aligned ASI, but nearly impossible to bootstrap modern DL systems to alignment.

2[anonymous]
Would you agree calling it "poorly defined" instead of "aligned" is an accurate phrasing for his argument or not? I edited the post.
veered30

In a causal-masked transformer, attention layers can query the previous layers' activations from any earlier (non-future) column in the context window. Gradients flow through the attention connections, so each previous layer is optimized not just to improve prediction accuracy for the next token, but also to produce values that are useful for future columns to attend to when predicting their own tokens.
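Here's a minimal sketch of that gradient path (my own illustration using a single random attention head, not any particular GPT implementation): backpropagating from only the last column's output still produces gradients at every earlier column.

```python
# Minimal causal self-attention sketch in PyTorch (illustrative only). The mask
# blocks *future* columns, so a column's activations get gradients both from its
# own next-token loss and from every later column that attends back to it.
import torch
import torch.nn.functional as F

T, d = 5, 8                                   # context length, model width
x = torch.randn(T, d, requires_grad=True)     # previous layer's activations

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d ** 0.5
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))   # hide future columns
out = F.softmax(scores, dim=-1) @ v

# Backprop from only the *last* column's output: the gradient reaches every
# earlier column, i.e. earlier positions are trained to be useful to later ones.
out[-1].sum().backward()
print(x.grad.abs().sum(dim=-1))               # nonzero at every position
```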


I think this is part of the reason why prompt engineering is so fiddly.

GPT essentially does a limited form of branch prediction and speculative execution. It guesses (based on the to... (read more)

veered21

Yeah I found it pretty easy to "jailbreak" too. For example, here is what appears to be the core web server API code.

I didn't really do anything special to get it. I just started by asking it to list the files in the home directory and went from there.

veered20

For GPT-style LLMs, is it possible to prove statements like the following? 

Choose some tokens A and B and a fixed ε > 0:

There does not exist a prefix of tokens P such that Pr(B | P, A) ≥ ε.

More generally, is it possible to prove interesting universal statements? Sure, you can brute-force it for LLMs with a finite context window, but that's both infeasible and boring. And you can specifically construct contrived LLMs where this is possible, but that's also boring.
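For concreteness, here's what the brute-force version looks like with a toy vocabulary, a toy context window, and a placeholder model (a sketch of my own; the names and the stand-in probability function are made up). The loop is exact, but the number of prefixes grows as vocab_size**length, which is the infeasibility.

```python
# Brute-force check for a *toy* setup (my own sketch, not a real GPT): verify
# that no prefix P makes the model assign probability >= EPS to token B right
# after P followed by A. Feasible only because VOCAB and CONTEXT are tiny.
from itertools import product

VOCAB = range(4)           # toy vocabulary (a real LLM has ~50k tokens)
CONTEXT = 6                # toy context window
A, B, EPS = 1, 2, 0.9

def toy_prob_of(token, context):
    """Placeholder for the LLM's next-token probability; uniform for the demo."""
    return 1.0 / len(VOCAB)

def holds_universally():
    for length in range(CONTEXT):                       # prefix lengths 0..CONTEXT-1
        for prefix in product(VOCAB, repeat=length):    # vocab_size**length prefixes
            if toy_prob_of(B, prefix + (A,)) >= EPS:
                return False                            # counterexample prefix found
    return True

print(holds_universally())
```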

I suspect that it's not possible/practical in general because t... (read more)

2JBlack
Yes, in general statements like this are theoretically possible to prove, but not remotely practical. There might be some specific (A,B,LLM) triples for which you can prove such a statement, but I expect that none of these are generalizable to actually useful statements. No GPT-style architecture is (in itself) capable of truly universal computation, but in practice the functions they can implement are far beyond our ability to adequately analyze.
Answer by veered21

Direct self-improvement (i.e. rewriting itself at the cognitive level) does seem much, much harder with deep learning systems than with the sort of systems Eliezer originally focused on.

In DL, there is no distinction between "code" and "data"; it's all messily packed together in the weights. Classic RSI relies on the ability to improve and reason about the code (relatively simple) without needing to consider the data (irreducibly complicated).

Any verification that a change to the weights/architecture will preserve a particular non-trivial property (e.g. av... (read more)

3[anonymous]
What if the machine has a benchmark/training suite for performance, and on the benchmark is a task for designing a better machine architecture? The machine proposes a better architecture. The new architecture may be a brand new set of files defining the networks, topology, and training procedure, or it may reuse existing networks as components. For example, you might imagine an architecture that uses gpt-3.5 and -4 as subsystems but whose "executive control" comes from a new network defined by the architecture. Given a very large compute budget (many billions), the company hosting the RSI runs would run many of these proposals; then the machines that did the best on the benchmark and that are distinct from each other (so a heuristic of distinctness plus performance) remain "alive" to design the next generation. So it is recursive improvement, and the "selves" doing it are getting more capable over time, but it's not a single AGI improving itself; it's a population. Humans are also still involved, tweaking things (and maintaining the enormous farms of equipment).
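A rough runnable sketch of the population loop being described (my own paraphrase; the classes, the benchmark, and the distinctness heuristic are placeholders, not anything from the comment):

```python
# Toy population-based improvement loop (illustrative placeholders throughout).
import random

class Design:
    """Stand-in for a proposed machine architecture."""
    def __init__(self, skill):
        self.skill = skill
    def propose_architecture(self):
        # A proposal's quality depends on the proposer's current capability.
        return Design(self.skill + random.gauss(0.1, 0.3))

def benchmark(design):
    return design.skill            # placeholder for the benchmark/training suite

def select_survivors(designs, keep):
    # Placeholder for the "distinct + high-performing" selection heuristic.
    return sorted(designs, key=benchmark, reverse=True)[:keep]

population = [Design(0.0) for _ in range(8)]
for generation in range(10):
    proposals = [d.propose_architecture() for d in population for _ in range(4)]
    population = select_survivors(proposals, keep=8)     # survivors design the next gen
    print(generation, round(max(benchmark(d) for d in population), 2))
```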
2DragonGod
I think takeoff from broadly human level to strongly superhuman is years, and potentially decades. Foom in days or weeks still seems just as fanciful as before.
veered50

For people who are just reading cfoster0's comment and then skipping the post, I recommend you still take a look. I think his comment is a bit unfair and seems more like a statement of frustration with LLM analysis in general than commentary on this post in particular.

veered96

This is awesome! So far, I'm not seeing much engagement (in the comments) with most of the new ideas in this post, but I suspect this is due to its length and sprawling nature rather than a lack of potential interest. This post is a solid start on creating a common vocabulary and framework for thinking about LLMs.

I like the work you did on formalizing LLMs as a stochastic process, but I suspect that some of the exploration of the consequences is more distracting than helpful in an overview like this. In particular: 4.B, 4.C, 4.D, 4.E, 5.B, and 5.C. These results are... (read more)

3Cleo Nardo
I think my definition of μ∞ is correct. It's designed to abstract away all the messy implementation details of the ML architecture and ML training process. Now, you can easily amend the definition to include an infinite context window k. In fact, if you let k>N then that's essentially an infinite context window. But it's unclear what optimal inference is supposed to look like when k=∞. When the context window is infinite (or very large), the internet corpus consists of a single datapoint.