Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you're talking about the extent to which the tone of the prompt implicitly encourages the behavior (output length) one way or the other; am I interpreting that correctly? So, e.g., use a much more subdued/neutral tone for the consciousness example?
Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?
Super cool stuff. Minor question: what does "Fraction of MLP progress" mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!
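If it helps make the question concrete, this is roughly the operation I'm picturing (a minimal sketch, purely my guess at what "fraction of MLP progress" might mean; `alpha` is the hypothetical scaling factor):

```python
# Minimal sketch of my guess: scale the MLP's contribution to the residual
# stream by a factor alpha in [0, 1] (alpha = 1.0 recovers the normal block,
# alpha = 0.0 ablates the MLP contribution entirely).
import torch
import torch.nn as nn

class ScaledMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, alpha: float = 1.0):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.alpha = alpha  # hypothetical "fraction of MLP progress"

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # Only a fraction alpha of the MLP output is added back into the residual stream.
        return resid + self.alpha * self.mlp(self.ln(resid))
```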
FWIW I understand now what it's meant to do, but I have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think explaining your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AIs delivering positive outcomes would be helpful.
Is such a "channel" necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what a success looks like to you here, etc.
I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior, which are two recent downvoted posts of yours. I think they get negative karma because they are difficult to understand and it's hard to tell what you're supposed to take away from them. They would probably be better received if the content were written so that it's easy to understand both your object-level message and the point of the post.
I read the Snuggle/Date/Slap Protocol and feel confused about what you're trying to accompl...
This is terrific. One feature that would be great to have is a way to sort and categorize your predictions under various labels.
Sexuality is, usually, a very strong drive which has a large influence over behaviour and long-term goals. If we could create an alignment drive as strong in our AGI, we would be in a good position.
I don't think we'd be in a good position even if we instilled an alignment drive this strong in AGI.
To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given that explicitly stated (and even implied-from-text) goals != human values as a whole.
Hi Cameron, nice to see you here :) What are your thoughts on a critique like: human prosocial behavior/values only look the way they look, and hold stable within lifetimes, insofar as we evolved in and live in a world where there are loads of other agents with roughly the same power as ourselves? Do you disagree with that belief?
This was very insightful. It seems like a great thing to point to for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!
This is a really cool idea and I'm glad you made the post! Here are a few comments/thoughts:
H1: "If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes"
How confident are you in this premise? Power and one's values/incentives/preferences may not be orthogonal (and my intuition is that they aren't). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within human...
Something at the root of this might be relevant to the inverse scaling competition, where they're trying to find what things get worse in larger models. This might have some flavor of obvious wrongness -> deception via plausible-sounding things as models get larger? https://github.com/inverse-scaling/prize
Interesting idea. Like... a mix of genuine sympathy/expansion of the moral circle to AI, and a virtue-signaling/anti-corporation meme, spreads to the majority of the population and effectively curtails AGI capabilities research? This feels like a thing that might actually do nothing to reduce corporations' efforts to get to powerful AI unless it reaches a threshold, at which point there are very dramatic actions against corporations who continue to try to do that thing.
I stream-of-consciousness'd this out and I'm not happy with how it turned out, but it's probably better that I post this than delete it for not being polished and eloquent. I can clarify with responses in the comments.
Glad you posted this and I'm also interested in hearing what others say. I've had these questions for myself in tiny bursts throughout the last few months.
When I get the chance to speak to people at an earlier career stage than myself (starting undergrad, or a high schooler attending a math camp I went to) who are undecided about their career...
Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: ConceptNet; less crude: similarity judgments on visual features and classes collected from humans) is a good enough proxy for the "relevant structures" (or at least that these representations capture the natural abstractions more faithfully than the best machines do on, e.g., vision tasks where human performance is the benchmark), right?
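For concreteness, the kind of quick check I have in mind looks roughly like this (a rough sketch; the file names, the (n x n) human similarity matrix, and the (n x d) model embeddings are hypothetical stand-ins for whatever proxy you'd actually use):

```python
# Rough sketch: how well does a model's embedding space recover human
# similarity judgments over the same set of items (an RSA-style comparison)?
import numpy as np
from scipy.stats import spearmanr

# Hypothetical stand-ins: an (n x n) matrix of human pairwise similarity
# judgments and (n x d) model embeddings for the same n items.
human_sim = np.load("human_similarity_judgments.npy")
model_emb = np.load("model_embeddings.npy")

# Cosine similarity matrix for the model's representations.
normed = model_emb / np.linalg.norm(model_emb, axis=1, keepdims=True)
model_sim = normed @ normed.T

# Correlate the two similarity structures over the upper triangle.
iu = np.triu_indices(human_sim.shape[0], k=1)
rho, p = spearmanr(human_sim[iu], model_sim[iu])
print(f"Human-model representational alignment: rho={rho:.3f} (p={p:.3g})")
```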
I had a similar id...
Thanks so much for the response, this is all clear now!
Sorry if it's obvious from some other part of your post, but the whole premise is that sufficiently strong models *deployed in sufficiently complex environments* lead to general intelligence with optimization over various levels of abstraction. So why is it obvious that "It doesn't matter if your AI is only taught math, if it's a glorified calculator — any sufficiently powerful calculator desperately wants to be an optimizer"?
If it's only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical c...
Hi Steve, loved this post! I've been interested in viewing the steering and thought generator + assessor submodule framework as the object and generator-of-values which we want AI to learn a good pointer to/representation of, so it can simulate out the complex, emergent human values and do value extrapolation properly.
I know the way I'm thinking about the following doesn't sit quite right with your perspective, because AFAIK, you don't believe there need to be independent, modular value systems that give their own reward signals for different things (you...
Enjoyed reading this! Really glad you're getting good research experience, and I'm stoked about the strides you're making towards developing research skills since our call (feels like ages ago)! I've been doing a lot of what you describe as "directed research" myself lately as I'm learning more about DL-specific projects, and I've been learning much faster than when I was just doing cursory, half-assed paper skimming alongside my cogsci projects. Would love to catch up over a call sometime to talk about the stuff we're working on now.
Really appreciated this post and I'm especially excited for post 13 now! In the past month or two, I've been thinking about stuff like "I crave chocolate" and "I should abstain from eating chocolate" as being a result of two independent value systems (one whose policy was shaped by evolutionary pressure and one whose policy is... idk vaguely "higher order" stuff where you will endure higher states of cortisol to contribute to society or something).
I'm starting to lean away from this a little bit, and I think reading this post gave me a good idea of w...
In case anyone stumbles across this post in the future, I found these posts from the past both arguing for and against some of the worries I gloss over here. I don't think my post boils down completely to merely "recommender systems should be better aligned with human interests", but that is a big theme.
I'm also not sold on this specific part, and I'm really curious about what supports the idea. One reason I don't think it's good to rely on this as the default expectation, though, is that I'm skeptical about humans' ability to even know what the "best experience" is in the first place. I wrote a short, rambly post touching, in part, on my worries about online addiction: https://www.lesswrong.com/posts/rZLKcPzpJvoxxFewL/converging-toward-a-million-worlds
Basically, I buy into the idea that there are two distinct value systems in humans. One subco...
Very interesting post!
1) I wonder what your thoughts are on how "disentangled" having a "dim world" perspective and being psychopathic are (completely "entangled" being: all psychopaths experience dim world and all who experience dim world are psychopathic). Maybe I'm also packing too many different ideas/connotations into the term "psychopathy".
2) Also, the variability in humans' local neuronal connections and "long-range" neuronal connections seems really interesting to me. My very unsupported, weak suspicion is that perhaps there is a c...
Less Wrong is a text-based forum. It has no audio. Video is rare. It barely even has any pictures. I would be surprised if the userbase wasn't skewed toward people with lower thresholds for stimulation.
Just a small note that your ability to contribute via research doesn't go from 0 now to 1 after you complete a PhD! As in, you can still contribute to AI Safety with research during a PhD.
Thanks for posting this! I was wondering if you might share more about your "isolation-induced unusual internal information cascades" hypothesis/musings! Really interested in how you think this might relate to low-chance occurrences of breakthroughs/productivity.
My original idea (and great points against the intuition by Rohin)
"To me, it feels viscerally like I have the whole argument in mind, but when I look closely, it's obviously not the case. I'm just boldly going on and putting faith in my memory system to provide the next pieces when I need them. And usually it works out."
This closely relates to the kind of experience that makes me think about language as post-hoc symbolic logic fitted to the neural computations of the brain, which kinda inspired the hypothesis that a language model trained on a distinct neural net would be similar to how humans experience consciousness (and gives the illusion of free will).
So, I thought it would be a neat proof of concept if GPT-3 served as a bridge between something like a chess engine's actions and verbal/semantic-level explanations of its goals (so that the actions are interpretable by humans). E.g., bishop to g5: this develops a piece and pins the knight to the king, so you can add additional pressure to the pawn on d5 (or something like this).
In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf
which kinda shows it's possible to have agent state/action representations in natural langu...
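For reference, the toy setup I was imagining looks something like this (a sketch only: the engine analysis uses python-chess with a UCI engine such as Stockfish, and `query_llm` is a placeholder for whatever language-model API you'd actually call):

```python
# Toy sketch of the "engine move -> natural-language explanation" bridge.
import chess
import chess.engine

def query_llm(prompt: str) -> str:
    # Placeholder: swap in your language-model API of choice.
    raise NotImplementedError

def explain_engine_move(fen: str, engine_path: str = "stockfish") -> str:
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        result = engine.play(board, chess.engine.Limit(depth=18))
    finally:
        engine.quit()
    move_san = board.san(result.move)
    prompt = (
        f"Position (FEN): {fen}\n"
        f"Engine move: {move_san}\n"
        "In one or two sentences, explain what this move accomplishes "
        "(development, pins, pressure on weak pawns, etc.)."
    )
    return query_llm(prompt)
```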
Thanks, I hadn't thought about those limitations.
For the basic features, I got used to navigating everything within an hour. I'll be on the lookout for improvements to Roam or other note-taking programs like this.
Makes sense, and I also don't expect the results here to be surprising to most people.
What do you mean by this part? As in, if it just writes very long responses naturally? There's a significant change in response lengths depending on whether it's just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to contro...
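For concreteness, the conditions I was comparing look roughly like this (illustrative stand-ins only, not the exact wordings I used):

```python
# Illustrative stand-ins for the prompt-length conditions being compared:
# the bare question vs. the same question preceded by prefixes of different lengths.
question = "QUESTION TEXT HERE"  # placeholder for one of the factual questions

conditions = {
    "question_only": question,
    "short_prefix": "Answer the question below.\n\n" + question,
    "long_prefix": (
        "You are having a casual conversation. Read the question below and "
        "answer it however you see fit, in as much or as little detail as "
        "you like.\n\n" + question
    ),
}

for name, prompt in conditions.items():
    # response = model(prompt)  # whatever model/API is under test
    print(name, len(prompt.split()))
```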