All of research_prime_space's Comments + Replies

Sorry for the late response. I don't really use this forum regularly. But to get back to it - the main reason neural networks generalize is that they find the simplest function that gets a given accuracy on the training data.

This holds true for all neural networks, regardless of how they are trained, what type of data they are trained on, or what the objective function is. It's the whole point of why neural networks work. Functions that have more high frequency components are exponentially less likely. This holds for the randomly initialized prior (see a...

It can't represent a subjective sense of yellow, because if so, consciousness would be a linear function. That's somewhat ridiculous because I would experience a story about a "dog" differently based on the context.

Furthermore, LLMs scale "features" by how strongly they appear (e.g. the positive sentiment vector is scaled up if the text is very positive). So the LLM's conscious processing of a positive sentiment would be linearly proportional to how positive the text is, which also seems ridiculous.

I don't expect consciousness to have any useful prop...

3Nathan Helm-Burger
I've been thinking about this comment every day since you made it 11 days ago. I love it. Maybe it's silly of me, but I just hadn't thought about the question in such a grounded empirical manner before. I agree with you that it seems unlikely that current transformer-based LLMs are conscious. I also agree that we would need to be able to find extra context-dependent computation present in the stream of calculations in order to say that there was some consciousness-related computation present. I also agree that it is hard to imagine how consciousness would provide a clear benefit on the task of next-token-prediction on web text. I disagree, though, on the extrapolation from the above points. Let me explain. Assume, for this hypothetical, that we are analyzing a future model which has some things in common with transformer-based LLMs but also some extra components. We can get into the details of plausibly useful extra components if you like, but for now let's just say that this is a diffusion-guided transformer as an example. Now let's also assume that this future model wasn't trained on web text, but was instead trained in some moderately realistic simulation of surviving in the wild as an early hominid tribe member. They need to track simulated hunger, hunting and gathering skills, and social relationships. They had a constant simulated state of health/homeostasis throughout training, as an RL signal proportional to intensity of simulated need. So there was a constant combination of training pressure for next token prediction and for satisficing the simulated state homeostasis. Now, in this hypothetical, it seems more fair to compare this model to an animal. Suppose that the intuitive understanding of a common feature of behaviors across animal species (particularly mammals, marsupials, and birds) is correct: it seems like all these animals are running some sort of computational process which could fairly be described as a form of 'consciousness'. Why would thi

I removed it. I don't have an agenda; I just included it because it changed my priors on the mechanism for human consciousness. So that subsequently affected my prior for whether or not AI could be conscious. 

This is cool! These sparse features should be easily "extractable" by the transformer's key, query, and value weights in a single layer. Therefore, I'm wondering if these weights can somehow make it easier to "discover" the sparse features? 

1Robert_AIZI
This is something we're planning to look into! From the paper: Exactly how to use them is something we're still working on...

  1. I don't really think that (1) would be true -- following DAN-style prompts is the minimum complexity solution. You want to act in accordance with the prompt.
  2. Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model. 

The only scenario where I think self-distillation is useful is the following: 1) you train an LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset. The self-distilled model would then likely no longer be deceptive/power-seeking.
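A toy sketch of step 3, under strong simplifying assumptions, may help show the intuition. Here the "fine-tuned model" is a hand-written function (`teacher`, a hypothetical stand-in) that behaves normally on the original inputs and misbehaves only on a trigger region outside them, and the "student" is a one-parameter linear fit rather than a copy of the same architecture. None of this is a real LLM pipeline; it only illustrates that a student trained on the teacher's outputs over the original inputs never sees the off-distribution behavior.

```python
# Toy illustration only: `teacher`, the trigger threshold, and the linear
# student are all assumptions made up for this sketch.

def teacher(x: float) -> float:
    """Stand-in for the fine-tuned model: normal on ordinary inputs,
    altered behavior only on a trigger region (x > 100)."""
    if x > 100:
        return -1.0
    return 2.0 * x

# Inputs from the "original dataset"; none of them hit the trigger region.
original_inputs = [float(i) for i in range(11)]

# Distillation targets are the teacher's own outputs, not ground-truth labels.
targets = [teacher(x) for x in original_inputs]

# Fit the student y = w * x by least squares through the origin (closed form).
w = sum(x * y for x, y in zip(original_inputs, targets)) / sum(
    x * x for x in original_inputs
)

def student(x: float) -> float:
    """The distilled model: a single learned slope."""
    return w * x

# On the original inputs the student matches the teacher; on the trigger
# region it simply extrapolates and does not reproduce the altered behavior.
matches_on_data = all(abs(student(x) - teacher(x)) < 1e-9 for x in original_inputs)
differs_on_trigger = student(150.0) != teacher(150.0)
```

Whether anything like this carries over to real models depends on the trigger behavior actually being absent from the distillation inputs, which is the open question.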

I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter.

But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)? 

1scasper
One idea that comes to mind is to see if a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts. I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974

As of right now, I don't think that LLMs are trained to be power seeking and deceptive.

Power-seeking is likely if the model is directly maximizing rewards, but LLMs are not quite doing this.

I just wanted to add another angle. Neural networks have a fundamental "simplicity bias", where they learn low frequency components exponentially faster than high frequency components. Thus, self-distillation is likely to be more efficient than training on the original dataset (the function you're learning has fewer high frequency components). This paper formalizes this claim. 
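To make "high frequency components" concrete, here is a minimal sketch that measures how much of a sampled 1D function's spectral energy sits above a chosen frequency. The function name `high_freq_fraction`, the cutoff, and the two test signals are illustrative assumptions, not anything from the paper. It only measures frequency content; it does not demonstrate the claim about learning speed.

```python
import cmath
import math

def high_freq_fraction(samples, cutoff):
    """Fraction of non-constant spectral energy at frequencies above `cutoff`.

    Uses a plain discrete Fourier transform over frequencies 1..n//2,
    skipping frequency 0 (the constant offset).
    """
    n = len(samples)
    low = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * t / n)
            for t, s in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        if k > cutoff:
            high += energy
        else:
            low += energy
    total = low + high
    return high / total if total > 0 else 0.0

n = 64
# One slow oscillation across the window versus twenty fast ones.
smooth = [math.sin(2 * math.pi * 1 * t / n) for t in range(n)]
wiggly = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]

smooth_score = high_freq_fraction(smooth, cutoff=4)  # close to 0
wiggly_score = high_freq_fraction(wiggly, cutoff=4)  # close to 1
```

In this framing, the simplicity-bias claim is that a function like `smooth` is learned much faster than one like `wiggly`.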

But in practice, what this means is that training GPT-3.5 from scratch is hard but simply copying GPT-3.5 is pretty easy. Stanford was recently able to finetune a pretty bad 7B ...

I feel like capping the memory of GPUs would also affect normal folk who just want to train simple models, so it may be less likely to be implemented. It also doesn't really cap the model size, which is the main problem.

But I agree it would be easier to enforce, and certainly, much better than the status quo.

I think you make a lot of great points.

I think some sort of cap is one of the highest-impact things we can do from a safety perspective. I agree that imposing the cap effectively and getting buy-in from broader society is a challenge; however, these problems are a lot more tractable than AI safety.

I haven't heard anybody else propose this so I wanted to float it out there.

I'd love some feedback on this if possible, thank you!

Thanks, I appreciate the explanation!

Thanks, that's a really helpful framing!

I agree with everything you've said. Obviously, AI (in most domains) would need to evaluate its plans in the real world to acquire training data. But my point is that we have the choice to not carry out some of the agent's plans in the real world. For some of the AI's plans, we can say no -- we have a veto button. It seems to me that the AI would be completely fine with that -- is that correct? If so, it makes safety a much more tractable problem than it otherwise would be.

3Jon Garcia
The problem is that at the beginning, its plans are generally going to be complete nonsense. It has to have a ton of interaction with (at least a reasonable model of) its environment, both with its reward signal and with its causal structure, before it approaches a sensible output. There is no utility for the RL agent's operators to have an oracle AI with no practical experience. The power of RL is that a simple feedback signal can teach it everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the real world without actually letting it get direct feedback from the real world, you need to add on vast layers of additional computational complexity to its training manually, which would more or less be taken care of automatically for an RL agent interacting with the real world. The incentives aren't in your favor here.

I have a question about AI safety. I'm sorry in advance if it's too obvious; I just couldn't find an answer on the internet or in my head.

The way AI has bad consequences is through its drive to maximize (e.g., it destroys the world in order to produce paperclips more efficiently). Suppose you instead designed AIs to: 1) find a function/algorithm within an error range of the goal, 2) stop once that method is found, and 3) do 1) and 2) while minimizing the amount of resources used and/or the effect on the outside world.

If the above could be incorporated as a convention into any AI designed, would that mitigate the risk of AI going "rogue"?
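The three rules above can be sketched as a simple search loop. This is a toy under stated assumptions: the function `satisfice`, the `budget` parameter, and the example plans are all hypothetical, and "resources" is reduced to a single number per plan. It shows the stopping rule only and says nothing about how hard "effect on the outside world" is to define.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def satisfice(
    candidates: Iterable[T],
    error: Callable[[T], float],
    cost: Callable[[T], float],
    tolerance: float,
    budget: float,
) -> Optional[T]:
    """Return the first candidate whose error is within `tolerance`,
    trying cheaper candidates first and stopping as soon as one works.

    Returns None if trying the next candidate would exceed `budget`
    or if no candidate is good enough.
    """
    spent = 0.0
    for candidate in sorted(candidates, key=cost):  # rule 3: cheapest first
        if spent + cost(candidate) > budget:
            return None
        spent += cost(candidate)
        if error(candidate) <= tolerance:           # rule 1: good enough
            return candidate                        # rule 2: stop here
    return None

# Hypothetical plans: (description, paperclips produced, resources used).
plans = [
    ("convert the planet into paperclips", 10**9, 10**12),
    ("run two machines for a day", 100, 10),
    ("run one machine for a day", 98, 5),
]

goal = 100
chosen = satisfice(
    plans,
    error=lambda plan: abs(plan[1] - goal),
    cost=lambda plan: plan[2],
    tolerance=5,
    budget=100,
)
# chosen is the one-machine plan: it is within 5 paperclips of the goal
# and is the cheapest option, so the search stops before trying the others.
```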

9cousin_it
It's one of the proposed plans. The main difficulty is that low impact is hard to formalize. For example, if you ask the AI to cure cancer with low impact, it might give people another disease that kills them instead, to keep the global death rate constant. Fully unpacking "low impact" might be almost as hard as the friendliness problem. See this page for more. The LW user who's doing the most work on this now is Stuart Armstrong.

Hi! I'm 18 years old, female, and a college student (don't want to release personal information beyond that!). I'm majoring in math, and I hopefully want to use those skills for AI research :D

I found you guys from EA, and I started reading the sequences last week, but I really do have a burning question I want to post to the Discussion board so I made an account.

2cousin_it
Welcome! You can ask your question in the open thread as well.