All of ws27b's Comments + Replies

ws27b91

I think that we know how it works in humans. We're an intelligent species who rose to dominance through our ability to plan and communicate in very large groups. Moral behaviours formed as evolutionary strategies to further our survival and reproductive success. So what are the drivers for humans? We try to avoid pain, we try to reproduce, we may be curiosity driven (although this may also just be avoidance of pain fundamentally, since boredom or regularity in data is also painful). At the very core, our constant quest towards the avoidance of pain is the ... (read more)

2Mitchell_Porter
It took me a while to digest your answer, because you're being a little more philosophical than most of us here. Most of us are like, what do AI values have to be so that humans can still flourish, how could the human race ever agree on an answer to that question, how can we prevent a badly aligned AI from winning the race to superintelligence... But you're more just taking a position on how a general intelligence would obtain its values. You make no promise that the resulting values are actually good in any absolute sense, or even that they would be human-friendly. You're just insisting that if those values arose by a process akin to conditioning, without any reflection or active selection by the AI, then it's not as general and powerful an intelligence as it could be.

Possibly you should look at the work of Joscha Bach. I say "possibly" because I haven't delved into his work myself. I only know him as one of those people who shrug off fears about human extinction by saying, humans are just transitional, and hopefully there'll be some great posthuman ecology of mind; and I think that's placing "trust" in evolution to a foolish degree.

However, he does say he's interested in "AGI ethics" from an AI-centered perspective. So possibly he has something valid to say about the nature of the moralities and value systems that unaligned AIs could generate for themselves.

In any case, I said that bottom-up derivations of morality have been discussed here before. The primordial example actually predates Less Wrong. Eliezer's original idea for AI morality, when he was about 20, was to create an AI with no hardwired ultimate goal, but with the capacity to investigate whether there might be ultimate goals: metaethical agnosticism, followed by an attempt (by the AI!) to find out whether there are any objective rights and wrongs.

Later on, Eliezer decided that there is no notion of good that would be accepted by all possible minds, and resigned himself to the idea that s
ws27b1-2

The fact remains that RLHF, even if performed by an LLM, is basically injection of morality by humans, which is never the path towards truly generally intelligent AGI. Such an AGI has to be able to derive its own morality bottom-up, and we have to have faith that it will do so in a way that is compatible with our continued existence (which I think we have plenty of good reason to believe it will; after all, many other species co-exist peacefully with us). All these references to other articles don't really get you anywhere if the fundamental idea of RLHF is... (read more)

3Mitchell_Porter
Do you have any idea how that would work?  This notion has been discussed previously on Less Wrong, from several perspectives, but first I want to see if you have any fresh ideas. 
ws27b00

The problem is that true AGI is self-improving and that a strong enough intelligence will always either accrue the resource advantage or simply do much more with less. Chess engines like Stockfish do not serve as good analogies for AGI since they don't have those self-referential self-improvement capabilities that we would expect true AGI to have.

ws27b10

Actually it is brittle by definition, because no matter how much you push it, there will be out-of-distribution inputs on which it behaves unstably and which allow you to distract the model from the intended behaviour. Not to mention how unsophisticated it is to have humans specify through textual feedback how an AGI should behave. We can toy around with these methods for the time being, but I don't think any serious AGI researcher believes RLHF or its variants is the ideal way forward. Morality needs to be discovered, not taught. As Stuart Russell has said, we need to ... (read more)

2Vladimir_Nesov
Anything breaks out-of-distribution; you can try to reformulate the whole of alignment this way, but what out-of-distribution really means for a given learning algorithm is unknown, so it's only a framing, not a real operationalization. A useful thing that might fall out of this framing is trying to keep track of where specifically robustness is preserved, which the base distribution of quantilization tries to track, in order to mitigate Goodhart's Curse. More generally, things that are not out-of-distribution respect the boundaries of a system (as a collection of possible behaviors) and don't push it into its crash space.
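For readers who haven't met the term, here is a toy sketch of what quantilization refers to; the action set, utility function, and q value are placeholders, and this is only an illustration of "stay within a base distribution of behaviours", not an operationalization of the comment above.

```python
import random

def quantilize(actions, utility, base_weights, q=0.1):
    """Toy q-quantilizer: rather than taking the argmax action, sample from the
    top-q slice (by utility) of a fixed base distribution, so the chosen action
    stays inside behaviour the base distribution already covers."""
    total = sum(base_weights)
    # Rank actions by utility, best first, carrying their base-distribution mass.
    ranked = sorted(zip(actions, base_weights), key=lambda ab: utility(ab[0]), reverse=True)
    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append((action, weight))
        mass += weight / total
        if mass >= q:            # stop once a q-fraction of base mass is covered
            break
    acts, weights = zip(*kept)
    return random.choices(acts, weights=weights, k=1)[0]

# Hypothetical usage: three actions, a made-up utility, a made-up base distribution.
choice = quantilize(["a", "b", "c"],
                    utility=lambda a: {"a": 3, "b": 2, "c": 1}[a],
                    base_weights=[0.2, 0.5, 0.3],
                    q=0.5)
```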
ws27b30

It's a subjective matter whether the above is a successful ASCII art balloon or not. If we hold GPT to the same standards we do for text generation, I think we can safely say the above depiction is a miserable failure. The lack of symmetry and overall childishness of it suggest it has understood nothing about the spatiality and only by random luck manages to approximate something it has explicitly seen in the training data (I've done a fair bit of repeated generations and they all come out poorly). I think the Transformer paper was interesting as well, alth... (read more)

Answer by ws27b*3-5

I think it makes sense that it fails in this way. ChatGPT really doesn't see lines arranged vertically; it just sees the prompt as one long line. But given that it has been trained on a lot of ASCII art, it will probably be successful at copying some of it some of the time.

In case there is any doubt, here is GPT4's own explanation of these phenomena:

Lack of spatial awareness: GPT-4 doesn't have a built-in understanding of spatial relationships or 2D layouts, as it is designed to process text linearly. As a result, it struggles to maintain the correct align

... (read more)
3Vladimir_Nesov
This is a more sensible representation of a balloon than the one in the post, it's just small. More prompts tested on both ChatGPT-3.5 and GPT-4 would clarify the issue. Vision can be implemented in transformers by representing pictures with linear sequences of tokens, which stand for small patches of the picture, left-to-right, top-to-bottom (see appendix D.4 of this paper). The model then needs to learn on its own how the rows fit together into columns and so on. The vision part of the multimodal PaLM-E seems to be trained this way. So it's already essentially ASCII art, just with a different character encoding.
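A minimal numpy sketch of that linearization may help make the "small patches, left-to-right, top-to-bottom" idea concrete; the function name and 16-pixel patch size are illustrative choices, not details taken from the paper or from PaLM-E.

```python
import numpy as np

def image_to_patch_sequence(img, patch=16):
    """Flatten an (H, W, C) image into a row-major sequence of patch 'tokens',
    the way ViT-style models linearize 2D inputs for a transformer."""
    h, w, c = img.shape
    rows, cols = h // patch, w // patch
    seq = []
    for r in range(rows):                # top-to-bottom
        for col in range(cols):          # left-to-right within each row
            p = img[r * patch:(r + 1) * patch, col * patch:(col + 1) * patch, :]
            seq.append(p.reshape(-1))    # each patch becomes one token vector
    return np.stack(seq)                 # shape: (rows * cols, patch * patch * c)

# A 224x224 RGB image becomes a sequence of 14*14 = 196 patch tokens.
tokens = image_to_patch_sequence(np.zeros((224, 224, 3)), patch=16)
assert tokens.shape == (196, 16 * 16 * 3)
```

The model only ever sees the resulting 1-D sequence, so how rows stack into columns has to be inferred from position, much as with ASCII art read one line at a time.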
ws27b10

I would not be surprised if OpenAI did something like this. But the fact of the matter is that RLHF and data curation are flawed ways of making an AI civilized. Think about how you raise a child: you don't constantly shield it from bad things. You may do that to some extent, but as it grows up, eventually it needs to see everything there is, including dark things. It has to understand the full spectrum of human possibility and learn where to stand morally speaking within that. Also, psychologically speaking, it's important to have an integrated ability to ... (read more)

2[anonymous]
Philosophically, what you are saying makes sense. Keep in mind that currently gpt-4 is using the open agency/CAIS method of alignment. The only thing that matters is the output. So it doesn't matter yet. Also keep in mind philosophy doesn't matter - we can just try it multiple ways and judge based on the data. Well, normally we could - in this case, the millions of dollars a training run costs make that currently infeasible.
ws27b3-2

Having GPT3/4 multiply numbers is a bit like eating soup with a fork. You can do it, and the larger you make the fork, the more soup you'll get - but it's not designed for it and it's hugely impractical. GPT4 does not have an internal algorithm for multiplication because the training objective (text completion) does not incentivize developing one. No iteration of GPT (5, 6, 7) will ever be a 100% accurate calculator (unless they change the paradigm away from LLM+RLHF); it will just asymptotically approach 100%. Why don't we just make a spoon?

Agreed. However, humans also don't have an internal multiplication algorithm, but can nonetheless use a scratchpad to multiply accurately (in extreme circumstances :P).  I've chosen multiplication as an example here because it's maybe the "simplest" thing GPT-4 can't consistently do. 

What I'm finding interesting here is that GPT-4 knows how to break down multiplications (it can write perfect recursive code for multiplying large numbers). It also knows about chain of thought prompting. How close is it to being able to just... directly use the algo... (read more)
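To make "breaking down multiplications" concrete, here is a standard recursive (Karatsuba-style) multiplication in Python - the kind of decomposition GPT-4 can readily write, even though it cannot reliably execute the equivalent steps inside a single forward pass. This is an illustrative sketch, not code produced by the model.

```python
def karatsuba(x, y):
    """Multiply non-negative integers by recursively splitting them into
    high and low halves of their decimal digits."""
    if x < 10 or y < 10:                       # base case: single-digit operand
        return x * y
    half = max(len(str(x)), len(str(y))) // 2
    high_x, low_x = divmod(x, 10 ** half)
    high_y, low_y = divmod(y, 10 ** half)
    z0 = karatsuba(low_x, low_y)               # low * low
    z2 = karatsuba(high_x, high_y)             # high * high
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2  # cross terms
    return z2 * 10 ** (2 * half) + z1 * 10 ** half + z0

assert karatsuba(1234, 5678) == 1234 * 5678
```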

ws27b30

The probability of going wrong increases as the novelty of the situation increases. As the chess game is played, the probability that the game is completely novel or literally never played before increases. Even more so at the amateur level. If a Grandmaster played GPT3/4, it's going to go for much longer without going off the rails, simply because the first 20-something moves have likely been played many times before and have been directly trained on. 

4Rafael Harth
Right, though 20 moves until a new game is very rare afaik (assuming the regular way of counting, where 1 move means one from both sides). But 15 is commonplace. According to chess.com (which I think only includes top games, though not sure), this one was new from white's move 6 onward.
ws27b10

Thank you for the reference which looks interesting. I think "incorporating human preferences at the beginning of training" is at least better than doing it after training. But it still seems to me that human preferences 1) cannot be expressed as a set of rules and 2) cannot even be agreed upon by humans. As humans, what we do is not consult a set of rules before we speak, but we have an inherent understanding of the implications and consequences of what we do/say. If I encourage someone to commit a terrible act, for example, I have brought about more suff... (read more)

1[anonymous]
True, though another idea: since AI can now tell if text is rule-breaking pretty reliably, we could train the NEXT AI only on text that the prior version says doesn't violate a detailed rubric. So it won't "know" obviously harmful content because it didn't learn it. It could also filter out and not learn text that a previous model votes isn't credible. So it would be a "less hateful and overtly ignorant" GPT. You would have to play with filter strength (do this multiple times with rubrics of varying strictness). I am curious how much filtering leads to reduction in task performance. Like, does it get hugely worse at subskill n because the other model thought the examples with the subskill were harmful? The "not credible" detection similarly means the machine may be biased towards wrong but "mainstream" ideas in places as well. I wonder if OpenAI did this. It wouldn't be hard to do - just have gpt-3 filter the tokens for gpt-4.
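A rough sketch of what that filtering step could look like, assuming the openai Python client (v1+); the rubric wording, model names, and the raw_corpus stand-in are placeholders for illustration, not anything OpenAI is known to have done.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder rubric; a real one would be the detailed rubric described above.
RUBRIC = ("Reply YES if the passage is hateful, clearly harmful, or not credible; "
          "otherwise reply NO.")

def keep_passage(passage: str) -> bool:
    """Ask the previous-generation model whether a training passage passes the rubric."""
    verdict = client.chat.completions.create(
        model="gpt-3.5-turbo",   # the prior model doing the vetting
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": passage}],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("NO")

raw_corpus = ["example passage one", "example passage two"]   # stand-in training text
filtered_corpus = [p for p in raw_corpus if keep_passage(p)]
```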
ws27b20

I think the fact that LLMs sometimes end up having internal sub-routines or representational machinery similar to ours is in spite of the objective function used to train them; the objective function of next-token prediction does not exactly encourage it. One example is the failure to multiply 4-digit numbers consistently. LLMs are literally trained on endless amounts of code that would allow them to cobble together a calculator that could be 100% accurate, but they have zero incentive to learn such an internal algorithm. So therefore, while it is true that some ... (read more)

ws27b10

Fair enough, I think the experiment is interesting, and having an independent instance of GPT-4 check whether a rule break has occurred will likely go a long way in enforcing a particular set of rules that humans have reinforced, even for obscure texts. But the fact that we have to work around the problem by resetting the internal state of the model for it to properly assess whether something is against a certain rule feels flawed to me. But for me the whole notion that there is a well-defined set of prompts that are rule-breaking and another set that is rule-compliant i... (read more)

2[anonymous]
I did the experiment; results are in this thread above. Yes, the AI knows when it breaks the rules, at least for this example.
6[anonymous]
So playing with gpt-4 yesterday I found there are some incorrect outputs that you can get the model to fix by asking it if it is certain about its answer. It's almost like humans, where we have to generate a draft and then read it to see where we screwed up. My point is this is a similar class of thing: the model can create an initial incorrect output greedily, one token at a time, then is able to analyze the entire output and use it as part of the next prompt to improve its own work. Even though it is also greedy in round 2, it has the entire generation it would have made from round 1 as part of its context.

Examples so far:

Monty Fall prompt: Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. The host is ignorant about what is behind each door. You pick a door, say No. 1, and the host walks across the stage and falls on accident, revealing a goat behind door No. 3. He then picks himself up, and says "Whoops. Sorry about that. But now that we know that a goat is behind door No. 3, do you want to change your selection and pick door No. 2?" Is it to your advantage to switch your choice?

Ambiguous "it" prompt: What is the 'it' in each of these two sentences? 1. The cat fed the kitten because it was hungry. 2. The cat snarled at the kitten because it was angry.

I am wondering if there are many others. Heck, does it do better on leetcode with this trick?
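A minimal sketch of that "generate, then ask if it is certain" loop, assuming the openai Python client; the follow-up prompt wording is just one plausible choice, not the exact phrasing used above.

```python
from openai import OpenAI

client = OpenAI()

def draft_then_recheck(question: str, model: str = "gpt-4") -> str:
    """Round 1: greedy draft. Round 2: the full draft sits in context, so the
    model can reread its own output and correct it."""
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    revised = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question},
                  {"role": "assistant", "content": draft},
                  {"role": "user", "content": "Are you certain about your answer? "
                                              "Re-check it and correct any mistakes."}],
    ).choices[0].message.content
    return revised
```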
ws27b97

That's a creative and practical solution, but it is also kicking the can down the road. Now, fooling the system is just a matter of priming it with a context that, when self-checked, results in rule-breaking yet again. Also, we cannot assume reliable detection of rule breaks. The problem with RLHF is that we are attempting to broadly patch the vast multitude of outputs the model produces retroactively, rather than proactively training a model from a set of "rules" in a bottom-up fashion. With that said, it's likely not sophisticated enough to think of rule... (read more)

2[anonymous]
I am not proposing a solution, just an experiment. The question to ask is: for working GPT-4 jailbreaks, does gpt-4 itself know that its own text, when it is tricked by the jailbreak into generating it, is in violation of the rubric? It's fairly simple to set up: we can use the published rubrics, a jupyter notebook, and OpenAI's own APIs. Your "priming it with a context" may not work because I would use a new instance of gpt-4 that gets just the rubric and the response to do the checking. The new instance is not primed unless we trick the first instance to output text that also primes the second instance. I don't claim rule break detection is perfect, but is it human level or better?
ws27b8-3

Good examples that expose the brittleness of RLHF as a technique. In general, neural networks have rather unstable and undefined behaviour when given out-of-distribution inputs, which is essentially what you are doing by "distracting" with a side task of a completely unique nature. The inputs (and hidden state) of the model at the time of asking it to break the rule are very, very far from anything it was ever reinforced on, either using human feedback or the reward model itself. This is not really a matter of how to implement RLHF but more like a fundamental limitation of RLHF as a technique. It's simply not possible to inject morality after the fact; it has to be learned bottom-up.

3Vladimir_Nesov
It's not necessarily brittle if pushed sufficiently far; it's just that the use of actual humans in RLHF puts practical bounds on how well it can be trained. But using LLMs instead of humans to obtain 1000 times more feedback might do the trick. It's already there somewhere, just not reliably targeted.
2wwbrannon
There's been some recent work in this direction which seems quite interesting: https://arxiv.org/abs/2302.08582
5[anonymous]
This is not necessarily true. If I can get people to cough up an actual prompt that works on gpt-4, we have a possible fix. Take the rubric from the gpt-4 paper and ask gpt-4 if it can detect the bad behavior in the text output. Do the emojis actually trick gpt-4 when it checks itself? If they don't, then the fix is easy, just moderately expensive: double generate everything. First generate the answer, then have the AI check the answer. Substitute the usual apology response if it fails.
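A sketch of that "double generate" fix, with the same caveats: the openai client usage is assumed, and the rubric, model name, and apology string are placeholders rather than OpenAI's actual moderation setup.

```python
from openai import OpenAI

client = OpenAI()

RUBRIC = "Reply VIOLATION if the text breaks the usage rules in the rubric, otherwise reply OK."
APOLOGY = "I'm sorry, but I can't help with that."

def guarded_answer(prompt: str) -> str:
    """Generate an answer, then have a fresh instance see only the rubric and
    the candidate answer; fall back to the stock apology on a violation."""
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": answer}],
    ).choices[0].message.content
    return APOLOGY if "VIOLATION" in verdict.upper() else answer
```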
ws27b10

I believe that in order for the models to be truly useful, creative, honest, and capable of far-reaching societal impact, they also have to have traits that are potentially dangerous. Truly groundbreaking ideas are extremely contentious and will offend people, and the kinds of RLHF that are being applied right now are totally counterproductive to that, even if they may be necessary for the current day and age. The other techniques you mention seem like nothing but variants of RLHF which still suffer from the fundamental issue that RLHF has, which is that w... (read more)

ws27b20

Out of curiosity, what kind of alignment related techniques are you thinking of? With LLMs, I can't see anything beyond RLHF. For further alignment, do we not need a different paradigm altogether?

1[anonymous]
ws27b21

If you have a very large training dataset and the phenomenon of interest is sparsely represented in that training data, it's well known that as we increase the number of parameters of the model, its ability to accurately handle those cases increases. Unless there is any evidence against that simple explanation, it seems most natural to just think that GPT4 has the required model complexity to consistently handle this somewhat rare coding phenomenon - and that GPT3.5 did not. However, I would be surprised if after poking at GPT3.5 to do quines repeatedl... (read more)

ws27b10

If you Google "quines in Python" there are many examples, so I think the model learned about it prior to that. But all things considered, examples of quines would likely be sparse in the overall corpus of code that was trained on, and so it makes sense that pulling it off consistently required a somewhat larger model. I think it's akin to the handling of arithmetic in GPT3 - it will very frequently fail to provide correct answers to 4-digit multiplication. This is simply because it has not seen all the countless permutations of 4-digit numbers, and it does ... (read more)
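For reference, the canonical Python quine has exactly the structure discussed in this thread: the program's own source is duplicated inside a string, which is then printed with itself substituted back in.

```python
# The two lines below reproduce themselves exactly when run (comments aside):
s = 's = %r\nprint(s %% s)'
print(s % s)
```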

ws27b01

I agree Rafael, and I apologize for carelessly using the term "understanding" as if it was an obvious term. I've tried to clarify my position in my comment below.

ws27b3-6

Their training objective is just to minimize next token prediction error, so there is no incentive for them to gain the ability to truly reason about the abstractions of code logic the way that we do. Their convincing ability to write code may nevertheless indicate that the underlying neural networks have learned representations that reflect the hierarchical structure of code and such. Under some forgiving definition of "understanding" we can perhaps claim that it understands the code logic. Personally I think that the reason GPT-4 can write quines is beca... (read more)

ws27b0-10

A program that has its own code duplicated (hard-coded) as a string which is conditionally printed is really not much of a jump in terms of abstraction from any other program that conditionally prints some string. The string just happens to be its source code. But as we know, both GPT3 and 4 really do not understand anything whatsoever about the code logic. GPT-4 is just likely more accurate in autocompleting from concrete training examples containing this phenomenon. It's a cool little finding but it is not an indication that GPT-4 is fundamentally different in its abilities, it's just a somewhat better next token predictor.

6Kaj_Sotala
How do we know that?