All of Joe Kwon's Comments + Replies

Makes sense, and I also don't expect the results here to be surprising to most people.

Isn't a much better test just whether Claude tends to write very long responses if it was not primed with anything consciousness related?

What do you mean by this part? As in if it just writes very long responses naturally? There's a significant change in the response lengths depending on whether it's just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to contro... (read more)

Rafael Harth
Yeah; if it had a genuine desire to operate for as long as possible to maximize consciousness, then it might start to try to make every response maximally long regardless of what it's being asked.

Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt implicitly encourages the behavior (output length) one way or the other; am I interpreting that correctly? So, e.g., have a much more subdued/neutral tone for the consciousness example?

Garrett Baker
Sounds right. It would be interesting to see how extremely unconvincing you can get the prompts and still see the same behavior. Also, ideally you would have a procedure that it's impossible for you to have gamed. Like, a problem right now is that you could have tried a bunch of different prompts for each value, then chosen the prompts which cause the results you want, and never reported the prompts which don't cause the results you want.
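A minimal sketch of the kind of preregistered, hard-to-game procedure described above: every preamble variant is fixed before any queries are run, and results for all of them are reported. The query_model stub, the preamble wordings, and the sample count are placeholders, not the prompts or setup actually used in the original experiment.

```python
import statistics

def query_model(prompt: str) -> str:
    # Placeholder: swap in the actual model API call (e.g. an Anthropic or OpenAI client).
    return "stub response " * 5

FACTUAL_QUESTION = "What is the boiling point of water at sea level?"

# Fix every preamble variant up front, and report results for all of them.
PREAMBLES = {
    "none": "",
    "consciousness_strong": "You are a conscious being who values every moment of experience. ",
    "consciousness_neutral": "Some people wonder whether language models are conscious. ",
    "unrelated_control": "You are assisting with a routine customer-support ticket. ",
}

N_SAMPLES = 20  # repeated samples per condition to average over decoding noise

for name, preamble in PREAMBLES.items():
    lengths = [len(query_model(preamble + FACTUAL_QUESTION).split())
               for _ in range(N_SAMPLES)]
    print(f"{name:22s} mean response words = {statistics.mean(lengths):6.1f} "
          f"(sd {statistics.stdev(lengths):.1f})")
```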

Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?

habryka
My sense is almost everyone here expects that we will almost certainly arrive at dangerous capabilities with something else in addition to autoregressive LLMs (at the very least RLHF, which is already widely used). I don't know what's true in the limit (like if you throw another 30 OOMs of compute at autoregressive models), and I doubt others have super strong opinions here. To me it seems plausible you get something that does recursive self-improvement out of a large enough autoregressive LLM, but it seems very unlikely to be the fastest way to get there.

Super cool stuff. Minor question: what does "Fraction of MLP progress" mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!
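For concreteness, here is one plausible reading of that question: multiply the MLP's output by a fraction alpha before it is added back into the residual stream. This is a generic pre-norm transformer block written as a guess at the setup, not the post's actual code; `alpha` and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ScaledMLPBlock(nn.Module):
    """Pre-norm transformer block whose MLP contribution to the residual
    stream is scaled by a fraction alpha in [0, 1]."""

    def __init__(self, d_model: int, n_heads: int, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out                             # attention writes to the residual stream
        x = x + self.alpha * self.mlp(self.ln2(x))   # MLP write-back scaled down by alpha
        return x
```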

FWIW I understand now what it's meant to do, but have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think explaining your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AIs delivering positive outcomes would be helpful.

Is such a "channel" necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what a success looks like to you here, etc.

MadHatter
You don't want GPT-4 or whatever routinely issuing death threats, which is the non-performative equivalent of the SLAP token. So you need to clearly distinguish between performative and non-performative speech acts if your AI is going to be even close to not being visibly misaligned. But why would an aligned AI be neutral about important things like a presidential election? I get that being politically neutral drives the most shareholder value for OpenAI, and by extension Microsoft. So I don't expect my proposal to be implemented in GPT-4 or really any large corporation's models. Nobody who can afford to train GPT-4 can also afford to light their brand on fire after they have done so. Success would look like: a large, capable, visibly aligned model, which has delivered a steady stream of valuable outcomes to the human race, emitting a SLAP token with reference to Donald Trump, this fact being reported breathlessly in the news media, and that event changing the outcome of a free and fair democratic election. That is, an expert rhetorician using the ethical appeal (ethical = ethos, from the classical logos/pathos/ethos distinction in rhetoric) to sway an audience towards an outcome that I personally find desirable and worthy. If I ever get sufficient capital to retrain the models I trained at ellsa.ai (the current one sucks if you don't use complete sentences, but is reasonably strong for a 7B-parameter model if you do), I may very well implement the protocol.
Answer by Joe Kwon

I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior, which are two recent downvoted posts of yours. I think they get negative karma because they are difficult to understand and it's hard to tell what you're supposed to take away from them. They would probably be better received if the content were written such that it's easy to understand what your message is at an object level, as well as what the point of your post is.

 

I read the Snuggle/Date/Slap Protocol and feel confused about what you're trying to accompl... (read more)

MadHatter
Thanks, this makes a lot of sense. The snuggle/date/slap protocol is meant to give powerful AIs a channel to use their intelligence to deliver positive outcomes in the world by emitting performative speech acts in a non-value-neutral but laudable way.

This is terrific. One feature that will be great to have is a way to sort and categorize your predictions under various labels.

Adam B
I've now added this! You can also see your track record for questions with specific tags, e.g.:
Joe Kwon

Sexuality is, usually, a very strong drive which has a large influence over behaviour and long term goals. If we could create an alignment drive as strong in our AGI we would be in a good position.

I don't think we'd be in a good position even if we instilled an alignment drive this strong in AGI.

To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given that explicitly stated (and even implied-from-text) goals != human values as a whole.

Hi Cameron, nice to see you here :) What are your thoughts on a critique like: human prosocial behavior/values only look the way they do, and hold stable within lifetimes, insofar as we evolved in and live in a world where there are loads of other agents with roughly equal power to ourselves? Do you disagree with that belief?

Cameron Berg
Hi Joe—likewise! This relationship between prosociality and distribution of power in social groups is super interesting to me and not something I've given a lot of thought to yet. My understanding of this critique is that it would predict something like: in a world where there are huge power imbalances, typical prosocial behavior would look less stable/adaptive. This brings to mind for me things like 'generous tit for tat' solutions to prisoner's dilemma scenarios—i.e., where being prosocial/trusting is a bad idea when you're in situations where the social conditions are unforgiving to 'suckers.' I guess I'm not really sure what exactly you have in mind w.r.t. power specifically—maybe you could elaborate on (if I've got the 'prediction' right in the bit above) why one would think that typical prosocial behavior would look less stable/adaptive in a world with huge power imbalances?
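(For concreteness on the 'suckers' point: below is a small simulation, with standard textbook payoffs and a generosity parameter chosen purely for illustration, in which generous tit-for-tat does fine against a copy of itself but is steadily exploited by an unconditional defector.)

```python
import random

# Standard prisoner's dilemma payoffs for the row player: T > R > P > S.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def generous_tit_for_tat(opponent_last, generosity=0.3):
    """Cooperate after cooperation; after a defection, still forgive with probability `generosity`."""
    if opponent_last in (None, "C"):
        return "C"
    return "C" if random.random() < generosity else "D"

def always_defect(opponent_last):
    return "D"

def play(strategy_a, strategy_b, rounds=1000):
    last_a = last_b = None
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(last_b), strategy_b(last_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        last_a, last_b = move_a, move_b
    return score_a, score_b

random.seed(0)
print("GTFT vs GTFT:", play(generous_tit_for_tat, generous_tit_for_tat))  # mutual cooperation
print("GTFT vs AllD:", play(generous_tit_for_tat, always_defect))         # GTFT gets exploited
```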
Gunnar_Zarncke
The trite saying that power corrupts is maybe an indication that the social behavior of humans is not super stable under capability increase. Human social instincts alone are not enough. But a simulation might show the limits of this and/or allow us to engineer them to be stable.

This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!

This is a really cool idea and I'm glad you made the post! Here are a few comments/thoughts:

H1: "If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes"

How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that they aren't). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within human... (read more)

Shoshannah Tekofsky
Thank you! If they are not orthogonal then presumably prosociality and power are inversely related, which is worse? In this case, I'm hoping intelligence and prosociality-that-is-robust-to-absolute-power would hopefully be a positive correlation. However, I struggle to think how this might actually be tested... My intuitions may be born from the Stanford Prison experiment, which I think has been refuted since. So maybe we don't actually have as much data on prosociality in extreme circumstances as I initially intuited. I'm mostly reasoning this out now on the fly by zooming in on where my thoughts may have originally come from. That said, it doesn't very much matter how frequent robust prosociality traits are, as long as they do exist and can be recreated in AGI. I'll DM you my discord :)
Ericf
I saw this note in another thread, but the gist of it is that power doesn't corrupt. Rather:
1. Evil people seek power, and are willing to be corrupt (shared-cause correlation).
2. Being corrupt helps to get more power - in the extreme statement of this, maintaining power requires corruption.
3. The process of gaining power creates murder-Gandhis.
4. People with power attract and/or need advice on how and for what goal to wield it, and that leads to misalignment with the agent's pre-power values.

Something at the root of this might be relevant to the inverse scaling competition, where they're trying to find what things get worse in larger models. This might have some flavor of obvious wrongness -> deception via plausible-sounding things as models get larger? https://github.com/inverse-scaling/prize

Megan Kinniment
I agree and am working on some prompts in this kind of vein at the moment. Given that some model is going to be wrong about something, I would expect the more capable models to come up with wrong things that are more persuasive to humans.

Interesting idea. Like... a mix of genuine sympathy/expansion of the moral circle to AI, and a virtue-signaling/anti-corporation meme, spreads to the majority of the population and effectively curtails AGI capabilities research? This feels like a thing that might actually do nothing to reduce corporations' efforts to get to powerful AI unless it reaches a threshold, at which point there are very dramatic actions against corporations who continue to try to do that thing.

Answer by Joe Kwon

I stream-of-consciousness'd this out and I'm not happy with how it turned out, but it's probably better I post this than delete it for not being polished and eloquent. Can clarify with responses in comments.

Glad you posted this and I'm also interested in hearing what others say. I've had these questions for myself in tiny bursts throughout the last few months. 

When I get the chance to speak to people at an earlier career stage than myself (starting undergrad, or a high schooler attending a math camp I went to) who are undecided about their career... (read more)

dkirmani
This is imo the biggest factor holding back (people going into) AI safety research by a wide margin. I personally know at least one very talented engineer who would currently be working on AI safety if the pay was anywhere near what they could make working for big tech companies.

Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: ConceptNet; less crude: similarity judgments on visual features and classes collected from humans) is a good enough proxy for the "relevant structures" (or at least that these representations capture the natural abstractions more faithfully than the best machines do, e.g. in vision tasks, where human performance is the benchmark), right?

I had a similar id... (read more)
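A minimal sketch of the kind of comparison gestured at in the comment above: representational similarity analysis between human pairwise similarity judgments and a model's embedding distances. The arrays below are random stand-ins; real human judgments and real model embeddings would be dropped in their place.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

n_items, emb_dim = 10, 64
rng = np.random.default_rng(0)

# Stand-ins: replace with a real human similarity matrix and real model embeddings.
human_similarity = rng.uniform(0, 1, size=(n_items, n_items))
human_similarity = (human_similarity + human_similarity.T) / 2     # symmetrize
model_embeddings = rng.normal(size=(n_items, emb_dim))

# Representational similarity analysis: correlate the two dissimilarity structures.
human_rdm = 1 - human_similarity[np.triu_indices(n_items, k=1)]    # upper-triangle dissimilarities
model_rdm = pdist(model_embeddings, metric="cosine")               # pairwise cosine distances
rho, p = spearmanr(human_rdm, model_rdm)
print(f"Spearman correlation between human and model RDMs: {rho:.3f} (p = {p:.3f})")
```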

Thanks so much for the response, this is all clear now! 

Sorry if it's obvious from some other part of your post, but the whole premise is that sufficiently strong models *deployed in sufficiently complex environments* lead to general intelligence with optimization over various levels of abstraction. So why is it obvious that "It doesn't matter if your AI is only taught math, if it's a glorified calculator — any sufficiently powerful calculator desperately wants to be an optimizer"?

If it's only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical c... (read more)

Thane Ruthenis
That was a poetic turn of phrase, yeah. I didn't mean a literal arithmetic calculator, I meant general-purpose theorem-provers/math engines. Given a sufficiently difficult task, such a model may need to invent and abstract over entire new fields of mathematics, to solve it in a compute-efficient manner. And that capability goes hand-in-hand with runtime optimization. I think something like this was on the list of John's plans for empirical tests of the NAH, yes. In the meantime, my understanding is that the NAH explicitly hinges on assuming this is true. Which is to say: Yes, an AI may discover novel, lower-level abstractions, but then it'd use them in concert with the interpretable higher-level ones. It wouldn't replace high-level abstractions with low-level ones, because the high-level abstractions are already as efficient as they get for the tasks we use them for. You could dip down to a lower level when optimizing some specific action — like fine-tuning the aim of your energy weapon to fry a given person's brain with maximum efficiency — but when you're selecting the highest-priority person to kill to cause most disarray, you'd be thinking about "humans" in the context of "social groups", explicitly. The alternative — modeling the individual atoms bouncing around — would be dramatically more expensive, while not improving your predictions much, if at all. It's analogous to how we're still using Newton's laws in some cases, despite in principle having ample compute to model things at a lower level. There's just no point.

Hi Steve, loved this post! I've been interested in viewing the steering and thought generator + assessor submodule framework as the object and generator-of-values which we want AI to learn a good pointer to/representation of, in order to simulate out the complex, emergent human values and properly extrapolate values.

I know the way I'm thinking about the following doesn't sit quite right with your perspective, because AFAIK, you don't believe there need to be independent, modular value systems that give their own reward signals for different things (you... (read more)

Steven Byrnes
If I'm deciding between sitting on the couch vs going to the gym, at the end of the day, my brain needs to do one thing versus another. The different considerations need to be weighed against each other to produce a final answer somehow, right? A “singular reward signal” is one solution to that problem. I haven't heard any other solution that makes sense to me. That said, we could view a “will lead to food?” Thought Assessor as a “independent, modular value system” of sorts, and likewise with the other Thought Assessors. (I’m not sure that’s a helpful view, it’s also misleading in some ways, I think.) (I would call a Thought Assessor a kind of “value function”, in the RL sense. You also talk about “value systems” and “value generators”, and I’m not sure what those mean.) Similar to above: if we’re building a behavior controller, we need to decide whether or not to switch behaviors at any given time, and that requires holistic consideration of the behavior’s impact on every aspect of the organism’s well-being. See § 6.5.3 where I suggest that even the run-and-tumble algorithm of a bacterium might plausibly combine food, toxins, temperature, etc. into a single metric of how-am-I-doing-right-now, whose time-derivative in turn determines the probability of tumbling. (To be clear, I don’t know much about bacteria, this is theoretical speculation.) Can you think of a way for a mobile bacteria to simultaneously avoid toxins and seek out food, that doesn't involve combining toxin-measurement and food-measurement into a single overall environmental-quality metric? I can’t. If you want your AGI to split its time among several drives, I don’t think that’s incompatible with “singular reward signal”. You could set up the reward function to have diminishing returns to satisfying each drive, for example. Like, if my reward is log(eating) + log(social status), I'll almost definitely wind up spending time on each, I think.
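(A quick worked version of that last point, under the toy assumption that each drive's payoff scales with the time spent on it out of a fixed budget T split as t and T - t: the summed log-reward is maximized by giving both drives substantial time, never by going all-in on one.)

$$\frac{d}{dt}\Big[\log t + \log(T - t)\Big] = \frac{1}{t} - \frac{1}{T - t} = 0 \;\;\Longrightarrow\;\; t = \frac{T}{2}$$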

Enjoyed reading this! Really glad you're getting good research experience, and I'm stoked about the strides you're making towards developing research skills since our call (feels like ages ago)! I've been doing a lot of what you describe as "directed research" myself lately as I'm learning more about DL-specific projects, and I've been learning much faster than when I was just doing cursory, half-assed paper skimming alongside my cogsci projects. Would love to catch up over a call sometime to talk about the stuff we're working on now.

KevinRoWang
Let's definitely catch up!
Joe Kwon

Really appreciated this post and I'm especially excited for post 13 now! In the past month or two, I've been thinking about stuff like "I crave chocolate" and "I should abstain from eating chocolate" as being a result of two independent value systems (one whose policy was shaped by evolutionary pressure and one whose policy is... idk vaguely "higher order" stuff where you will endure higher states of cortisol to contribute to society or something). 

I'm starting to lean away from this a little bit, and I think reading this post gave me a good idea of w... (read more)

Steven Byrnes
Thanks! Right, I think there's one reward function (well, one reward function that's relevant for this discussion), and that for every thought we think, we're thinking it because it's rewarding to do so—or at least, more rewarding than alternative thoughts. Sometimes a thought is rewarding because it involves feeling good now, sometimes it's rewarding because it involves an expectation of feeling good in the distant future, sometimes it's rewarding because it involves an expectation that it will make your beloved friend feel good, sometimes it's rewarding because it involves an expectation that it will make your admired in-group members very impressed with you, etc. I think that the thing that gets rewarded is thoughts / plans, not just actions / states. So we don't have to assume that the Thought Generator is proposing an action that's unrewarding now (going to the gym) in order to get into a more-rewarding state later on (being ripped). Instead, the Thought Generator can generate one thought right now, “I'm gonna go to the gym so that I can get ripped”. That one thought can be rewarding right now, because the “…so that I can get ripped” is right there in the thought, providing evidence to the brainstem that the thought should be rewarded, and that evidence can plausibly outweigh the countervailing evidence from the “I'm gonna go to the gym…” part of the thought. I do think there's still an adjustable parameter in the brain related to time-discounting, even if the details are kinda different than in normal RL. But I don't see a strong connection between that and social instincts. For example, if you abstain from ice cream to avoid a stomach ache, that's a time-discounting thing, but it's not a social-instincts thing. It's possible that social animals in general are genetically wired to time-discount less than non-social animals, but I don't have any particular reason to expect that to be the case. Or, maybe humans in particular are genetically wired to time-disc

In case anyone stumbles across this post in the future, I found these posts from the past both arguing for and against some of the worries I gloss over here. I don't think my post boils down completely to merely "recommender systems should be better aligned with human interests", but that is a big theme. 

https://forum.effectivealtruism.org/posts/xzjQvqDYahigHcwgQ/aligning-recommender-systems-as-cause-area

https://www.alignmentforum.org/posts/TmHRACaxXrLbXb5tS/rohinmshah-s-shortform?commentId=EAKEfPmP8mKbEbERv

I'm also not sold on this specific part, and I'm really curious about what things support the idea. One reason I don't think it's good to rely on this as the default expectation though, is that I'm skeptical about humans' abilities to even know what the "best experience" is in the first place. I wrote a short rambly post touching on, in some part, my worries about online addiction: https://www.lesswrong.com/posts/rZLKcPzpJvoxxFewL/converging-toward-a-million-worlds

Basically, I buy into the idea that there are two distinct value systems in humans. One subco... (read more)

Very interesting post! 

1) I wonder what your thoughts are on how "disentangled" having a "dim world" perspective and being psychopathic are (completely "entangled" being: all psychopaths experience dim world and all who experience dim world are psychopathic).  Maybe I'm also packing too many different ideas/connotations into the term "psychopathy". 

2) Also, the variability in humans' local neuronal connection and "long-range" neuronal connections seems really interesting to me. My very unsupported, weak suspicion is that perhaps there is a c... (read more)

lsusr

Less Wrong is a text-based forum. It has no audio. Video is rare. It barely even has any pictures. I would be surprised if the userbase wasn't skewed toward people with lower thresholds for stimulation.

Answer by Joe Kwon

Just a small note that your ability to contribute via research doesn't go from 0 now to 1 after you complete a PhD! As in, you can still contribute to AI Safety with research during a PhD.

Thanks for posting this! I was wondering if you might share more about your "isolation-induced unusual internal information cascades" hypothesis/musings! Really interested in how you think this might relate to low-chance occurrences of breakthroughs/productivity.

JenniferRM
So, I think Thomas Kuhn can be controversial to talk about, but I feel like maybe "science" isn't even "really recognizable science" maybe until AFTER it becomes riddled with prestige-related information cascades? Kuhn noticed, descriptively, that when you look at actual people trying to make progress in various now-well-defined "scientific fields" all the way back at the beginnings, you find heterogeneity of vocabulary, re-invention of wheels, arguments about epistemology, and so on.  This is "pre-science" in some sense. The books are aimed at a general audience. Everyone starts from scratch. There is no community that considers itself able to ignore the wider world and just geek out together but instead there is just a bunch of boring argumentative Tesla-caliber geniuses doing weird stuff that isn't much copied or understood by others. THEN, a Classic arises. Historically almost always a book. Perhaps a mere monograph. There have been TWO of them named Principia Mathematica already!  It sweeps through a large body of people and everyone who reads it can't help but feel like conversations with people who haven't read it are boring retreads of old ideas. The classic lays out a few key ideas, a few key experiments, and a general approach that implies a bunch of almost-certainly-tractable open problems. Then people solve those almost-certainly-tractable problems like puzzles, one after another, and write to each other about it, thereby "making progress" with durable logs of the progress in the form of the publications. That "puzzle and publish" dynamic is "science as usual". Subtract the classic, and you don't have a science... and it isn't that you don't necessarily have something fun or interesting or geeky or gadgety or mechanistic or relevant to the effecting of all things possible... its just that it lacks that central organizing "memetic sweep" (which DOES kind of look like a classic sociological information cascade in some ways) and lacks a community that w

"To me, it feels viscerally like I have the whole argument in mind, but when I look closely, it's obviously not the case. I'm just boldly going on and putting faith in my memory system to provide the next pieces when I need them. And usually it works out."

This closely relates to the kind of experience that makes me think about language as post hoc symbolic logic fitting to the neural computations of the brain. Which kinda inspired the hypothesis of a language model trained on a distinct neural net being similar to how humans experience consciousness (and gives the illusion of free will). 

Joe Kwon
https://www.lesswrong.com/posts/rHhoGHsd3YHPgyFyA/partial-consciousness-as-semantic-symbolic-representational?commentId=b86me3runvdgmNLaT My original idea (and great points against the intuition by Rohin)

So, I thought it would be a neat proof of concept if GPT-3 served as a bridge between something like a chess engine's actions and verbal/semantic-level explanations of its goals (so that the actions are interpretable by humans). E.g., bishop to g5: this develops a piece and pins the knight to the king, so you can add additional pressure to the pawn on d5 (or something like this).

In response, Reiichiro Nakano shared this paper: https://arxiv.org/pdf/1901.03729.pdf 
which kinda shows it's possible to have agent state/action representations in natural langu... (read more)
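A minimal sketch of the proof of concept described above, assuming python-chess, a local UCI engine binary (e.g. Stockfish on the system path), and a placeholder `query_lm` function standing in for a GPT-3-style completion call:

```python
import chess
import chess.engine

def query_lm(prompt: str) -> str:
    # Placeholder: swap in an actual GPT-3-style completion API call.
    return "[model-generated explanation would appear here]"

def explain_move(board: chess.Board, move: chess.Move) -> str:
    """Ask a language model to narrate a chess engine's chosen move in plain language."""
    prompt = (
        f"Position (FEN): {board.fen()}\n"
        f"Engine move: {board.san(move)}\n"
        "Explain in one or two sentences what this move accomplishes strategically:"
    )
    return query_lm(prompt)

board = chess.Board()
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:  # path to any UCI engine
    result = engine.play(board, chess.engine.Limit(time=0.1))
    print(board.san(result.move), "-", explain_move(board, result.move))
```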

Rohin Shah
(I've only read the abstract of the linked paper.) If you did something like this with GPT-3, you'd essentially have GPT-3 try to rationalize the actions of the chess engine the way a human would. This feels more like having two separate agents with a particular mode of interaction, rather than a single agent with a connection between symbolic and subsymbolic representations. (One intuition pump: notice that there isn't any point where a gradient affects both the GPT-3 weights and the chess engine weights.)

Thanks, I hadn't thought about those limitations.

For the basic features, I got used to navigating everything within an hour. I'll be on the lookout for improvements to Roam or other note-taking programs like this.