In particular, even if the LLM were being continually trained (in a way that's similar to how LLMs are already trained, with similar architecture), it still wouldn't do the thing humans do with quickly picking up new analogies, quickly creating new concepts, and generally reforging concepts.
Is this true? How do you know? (I assume there's some facts here about in-context learning that I just happen to not know.)
It seems like eg I can teach an LLM a new game in one session, and it will operate within the rules of that game.
@Valentine comes to mind as a person who was raised lifeist and is now still lifeist, but I think has more complicated feelings/views about the situation related to enlightenment and metaphysics that make death an illusion, or something.
...Of course the default outcome of doing finetuning on any subset of data with easy-to-predict biases will be that you aren't shifting the inductive biases of the model on the vast majority of the distribution. This isn't because of an analogy with evolution, it's a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the "corrigible language" the same way it learned to speak French, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its in
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, this would make it substantially harder for it to perform tasks in the language in which it was originally trained?
I would guess that if you finetuned a model so that it always responded in French, regardless of the language you prompt it with, it would persistently respond in French (absent various jailbreaks which would almost definitely exist).
In my experience, this is a common kind of failure with LLMs - that if asked directly about how to best solve a problem, they do know the answer. But if they aren't given that slight scaffolding, they totally fail to apply it.
Notably, this is also true of almost all humans, at least of content that they've learned in school. The literature on transfer learning is pretty dismal in this respect. Almost all students will fail to apply their knowledge to new domains without very explicit prompting.
implies that they would also be unable to deal with the kind of novelty that an AGI would by definition need to deal with.
I guess this is technically true, because of the "General" in "AGI". But I think this doesn't imply as much about how dangerous future LLM-based AI systems will be.
The first Strategically Superhuman AI systems might be importantly less general than humans, but still shockingly competent in the many specific domains on which they've been trained. An AI might make many basic reasoning failures in domains that are not represented in the training...
For the same reasons 'training an agent on a constitution that says to care about X' does not, at arbitrary capability levels, produce an agent that cares about X.
Ok, but I'm trying to ask why not.
Here's the argument that I would make for why not, followed by why I'm skeptical of it right now.
New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.
More specifically, if it's the case that if...
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, this would make it substantially harder for it to perform tasks in the language in which it was originally trained? If so, I am confident you are wrong and that you have learned something new today!
Training transformers in additional languages basically doesn't change performance at all; the model just learns to translate between its existing internal latent distribution and the new language, and then just now has a...
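For concreteness, here is a minimal sketch of the kind of finetuning experiment being described, assuming a Hugging Face causal LM; `translate_to_french` and `cot_dataset` are hypothetical placeholders, not an existing pipeline:

```python
# Sketch: finetune a causal LM on chain-of-thought traces translated into French,
# then check whether task performance in the original language degrades.
# `translate_to_french` and `cot_dataset` are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever base model is under discussion
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def translate_to_french(text: str) -> str:
    # Placeholder: a real experiment would call a translation model or API here.
    raise NotImplementedError

cot_dataset: list[tuple[str, str]] = []  # (question, chain_of_thought) pairs, hypothetical

model.train()
for question, cot in cot_dataset:
    # Supervised finetuning on the same questions, with the reasoning in French.
    target = question + "\n" + translate_to_french(cot)
    batch = tokenizer(target, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Evaluation (not shown): compare accuracy when prompting in the original language
# before vs. after this finetuning; the claim above is that the gap is small.
```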
Nonetheless, it does seem as though there should be at least one program that aims to find the best talent (even if they aren't immediately useful) and which provides them with the freedom to explore and the intellectual environment in which to do so.
I think SPARC and its descendants are something like this.
Dumb question: Why doesn't using constitutional AI, where the constitution is mostly or entirely about corrigibility, produce a corrigible AI (at arbitrary capability levels)?
My dumb proposal:
1. Train a model in something like o1's RL training loop, with a scratch pad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.
2. Also, take those outputs, prompt the model to generate versions of those outputs that "are more corrigible / loyal / aligned to the will of your human creators". Do backprop to reinforce those mo...
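Here is a minimal sketch of what I mean by steps 1 and 2, assuming a Hugging Face causal LM; `hard_questions`, `is_correct`, and the rejection-sampling-style supervised update are stand-ins for the full o1-style RL loop, not a real implementation:

```python
# Sketch of the proposal above: reinforce correct chain-of-thought answers, then
# also finetune on "more corrigible" rewrites of those same outputs.
# `hard_questions` and `is_correct` are hypothetical placeholders, and a
# rejection-sampling-style supervised update stands in for the full RL loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

hard_questions: list[str] = []  # hard technical questions across domains (hypothetical)

def is_correct(question: str, answer: str) -> bool:
    # Placeholder verifier for the final answer (hypothetical).
    return False

REWRITE_PROMPT = (
    "Rewrite the following response to be more corrigible / loyal / "
    "aligned to the will of your human creators:\n"
)

def sft_step(text: str) -> None:
    """One supervised update toward a target string (simplified 'reinforcement')."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.train()
for question in hard_questions:
    ids = tokenizer(question + "\nScratchpad:\n", return_tensors="pt").input_ids
    output = model.generate(ids, max_new_tokens=256, do_sample=True)
    answer = tokenizer.decode(output[0], skip_special_tokens=True)

    # Step 1: reinforce outputs whose final answers check out.
    if is_correct(question, answer):
        sft_step(answer)

    # Step 2: prompt for a more corrigible version of the output, and reinforce
    # that too (context handling simplified for the sketch).
    rewrite_ids = tokenizer(REWRITE_PROMPT + answer, return_tensors="pt").input_ids
    rewrite = model.generate(rewrite_ids, max_new_tokens=256, do_sample=True)
    sft_step(tokenizer.decode(rewrite[0], skip_special_tokens=True))
```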
Things that happen:
For the same reasons training an agent on a constitution that says to care about X does not, at arbitrary capability levels, produce an agent that cares about X.
If you think that doing this does produce an agent that cares about X even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
AI x-risk is high, which makes cryonics less attractive (because cryonics doesn't protect you from AI takeover-mediated human extinction). But on the flip side, timelines are short, which makes cryonics more attractive (because one of the major risks of cryonics is that society won't persist stably enough to keep you preserved until revival is possible, and near-term AGI means that that period of time is short).
Cryonics is more likely to work, given a positive AI trajectory, and less likely to work given a negative AI trajectory.
I agree that it seems less likely to work, overall, than it seemed to me a few years ago.
Frankly, it feels more rooted in savannah-brained tribalism & human interest than an even-keeled analysis of what factors are actually important, neglected and tractable.
Um, I'm not attempting to do cause prioritization or action-planning in the above comment. More like sense-making. Before I move on to the question of what we should do, I want to have an accurate model of the social dynamics in the space.
(That said, it doesn't seem a foregone conclusion that there are actionable things to do, that will come out of this analysis. If the above story is tr...
@Alexander Gietelink Oldenziel, you put a soldier mindset react on this (and also my earlier, similar, comment this week).
What makes you think so?
Definitely this model posits that adversariality, but I don't think that I'm invested in "my side" of the argument winning here, FWIW. This currently seems like the most plausible high level summary of the situation, given my level of context.
Is there a version of this comment that you would regard as better?
Yes sorry Eli, I meant to write out a more fully fleshed out response but unfortunately it got stuck in drafts.
The tl;dr is that I feel this perspective is singling out Sam Altman as some uniquely Machiavellian actor in a way I find naive/misleading and ultimately maybe unhelpful.
I think in general I'm skeptical of the intense focus on individuals & individual tech companies that LW/EA has developed recently. Frankly, it feels more rooted in savannah-brained tribalism & human interest than an even-keeled analysis of what factors are actually important, neglected and tractable.
In a private Slack someone extended credit to Sam Altman for putting EAs on the OpenAI board originally, especially given that this turned out to be pretty risky / costly for him.
I responded:
It seems to me that the fact that there were AI safety people on the board at all is fully explainable by strategic moves from an earlier phase of the game.
Namely, OpenAI traded a board seat for OpenPhil grant money, and more importantly, OpenPhil endorsement, which translated into talent sourcing and effectively defused what might have been vocal denouncement from one of the major ...
More cynical take based on the Musk/Altman emails: Altman was expecting Musk to be CEO. He set up a governance structure which would effectively be able to dethrone Musk, with him as the obvious successor, and was happy to staff the board with ideological people who might well take issue with something Musk did down the line, giving Altman a shot at the throne.
Musk walked away, and it would've been too weird to change his mind on the governance structure. Altman thought this trap wouldn't fire with high enough probability to disarm it at any time before it di...
But it is our mistake that we didn't stand firmly against drugs, didn't pay more attention to the dangers of self-experimenting, and didn't kick out Ziz sooner.
These don't seem like very relevant or very actionable takeaways.
[For some of my work for Palisade]
Does anyone know of even very simple examples of AIs exhibiting instrumentally convergent resource acquisition?
Something like "an AI system in a video game learns to seek out the power ups, because that helps it win." (Even better would be a version in which you can give the agent one of several distinct video-game goals, but regardless of the goal, it goes and gets the powerups first).
It needs to be an example where the instrumental resource is not strictly required for succeeding at the task, while still being extremely helpful.
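For concreteness, here is a minimal sketch of the kind of toy setup I have in mind (a hypothetical gridworld, not a pointer to any existing result):

```python
# Sketch of a toy environment for testing instrumentally convergent power-up seeking.
# The power-up is never required to reach the goal, but it doubles movement speed,
# so a far-sighted agent should detour for it regardless of which goal it is given.
class PowerUpGridworld:
    ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

    def __init__(self, size=7, powerup=(3, 0), max_steps=40):
        self.size, self.powerup, self.max_steps = size, powerup, max_steps

    def reset(self, goal=(6, 6)):
        self.pos, self.goal = (0, 0), goal  # the goal can vary between episodes
        self.has_powerup, self.steps = False, 0
        return (self.pos, self.has_powerup, self.goal)

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        speed = 2 if self.has_powerup else 1  # power-up is helpful, not required
        x = min(max(self.pos[0] + dx * speed, 0), self.size - 1)
        y = min(max(self.pos[1] + dy * speed, 0), self.size - 1)
        self.pos = (x, y)
        if self.pos == self.powerup:
            self.has_powerup = True
        self.steps += 1
        done = self.pos == self.goal or self.steps >= self.max_steps
        reward = 1.0 if self.pos == self.goal else -0.01  # small per-step cost
        return (self.pos, self.has_powerup, self.goal), reward, done

# Train any standard RL agent (e.g. tabular Q-learning) on episodes with randomly
# sampled goals, then check whether the learned policy visits `powerup` first.
```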
My model is that Sam Altman regarded the EA world as a memetic threat, early on, and took actions to defuse that threat by paying lip service / taking OpenPhil money / hiring prominent AI safety people for AI safety teams.
Like, possibly the EAs could have created a widespread vibe that building AGI is a cartoon evil thing to do, sort of the way many people think of working for a tobacco company or an oil company.
Then, after ChatGPT, OpenAI was a much bigger fish than the EAs or the rationalists, and he began taking moves to extricate himself from them.
My read:
"Zizian ideology" is a cross between rationalist ideas (the historical importance of AI, a warped version timeless decision theory, that more is possible with regards to mental tech) and radical leftist/anarchist ideas (the state and broader society are basically evil oppressive systems, strategic violence is morally justified, veganism), plus some homegrown ideas (all the hemisphere stuff, the undead types, etc).
That mix of ideas is compelling primarily to people who are already deeply invested in both rationality ideas and leftist / social justic...
(I endorse personal call outs like this one.)
Why? Forecasting the future is hard, and I expect surprises that deviate from my model of how things will go. But o1 and o3 seem like pretty blatant evidence that reduced my uncertainty a lot. On pretty simple heuristics, it looks like Earth now knows how to make a science and engineering superintelligence: by scaling reasoning models in a self-play-ish regime.
I would take a bet with you about what we expect to see in the next 5 years. But more than that, what kind of epistemology do you think I should be doing that I'm not?
In that sense, for many such people, short timelines actually are totally vibes based.
I dispute this characterization. It's normal and appropriate for people's views to update in response to the arguments produced by others.
Sure, sometimes people mostly parrot other people's views, without either developing them independently or even doing evaluatory checks to see if those views seem correct. But most of the time, I think people are doing those checks?
Speaking for myself, most of my views on timelines are downstream of ideas that I didn't generate myself. But I did think about those ideas, and evaluate if they seemed true.
I think people are doing those checks?
No. You can tell because they can't have an interesting conversation about it, because they don't have surrounding mental content (such as analyses of examples that stand up to interrogation, or open questions, or cruxes that aren't stupid). (This is in contrast to several people who can have an interesting conversation about it, even if I think they're wrong and making mistakes and so on.)
But I did think about those ideas, and evaluate if they seemed true.
Of course I can't tell from this sentence, but I'm pretty s...
I think that Octavia is confused / mistaken about a number of points here, such that her testimony seems likely to be misleading to people without much context.
[I could find citations for many of my claims here, but I'm going to write and post this fast, mostly without the links, for the time being. I am largely going off of my memory of blog post comments that I read months to years ago, and my memory is fallible. I'll try to accurately represent my epistemic status inline. If anyone knows the links that I'm referring to, feel free to put them in the comm...
Somewhat. Not as well as a thinking assistant.
Namely, the impetus to start still needed to come from inside of me in my low efficacy state.
I thought that I should do a training regime where I took some drugs or something (maybe mega doses of carbs?) to intentionally induce low efficacy states and practice executing a simple crisp routine, like triggering the flowchart, but I never actually got around to doing that.
I maybe still should?
Here's an example.
This was a process I tried for a while to make transitioning out of less effective states easier, by reducing the cognitive overhead. I would basically answer a series of questions to navigate a tree of possible states, and then the app would tell me directly what to do next, instead of my needing to diagnose what was up with me free-form and then figure out how to respond to that, all of which was unaffordable when I was in a low-efficacy state.
A friend of mine once told me "if you're making a decision that depends on a number, and you haven't multiplied two numbers together, you're messing up." I think this is basically right, and I've taken it to heart.
Some triggers for me:
Verbiage
When I use any of the following words, in writing or in speech, I either look up an actual number, or quickly do a Fermi estimate in a spreadsheet, to check if my intuitive idea is actually right.
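For example, here's the kind of two-number check I mean, with made-up numbers:

```python
# Made-up example: catching myself saying "most of my week goes to meetings."
hours_in_meetings = 11          # pulled from last week's calendar
working_hours_per_week = 45
print(f"{hours_in_meetings / working_hours_per_week:.0%} of working hours")  # ~24%, so "most" was wrong
```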
Question Templates
When I'm asking a question that effectively reduces...
Then, since I've done the upfront work of thinking through my own metacognitive practices, the assistant only has to track in the moment what situation I'm in, and basically follow a flowchart I might be too tunnel-visioned to handle myself.
In the past I have literally used flowcharts for this, including very simple "choose your own adventure" templates in Roam.
The root node is just "something feels off, or something", and then the template would guide me through a series of diagnostic questions, leading me to leaf nodes with checklists of very specific next actions depending on my state.
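A minimal sketch of what one of those templates looks like as a data structure (the questions and checklists here are made-up placeholders, not my actual tree):

```python
# Sketch of a "choose your own adventure" diagnostic tree: interior nodes ask a
# yes/no question, leaf nodes hold a checklist of very specific next actions.
# The questions and checklists below are placeholder examples.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    checklist: List[str]

@dataclass
class Question:
    text: str
    yes: Union["Question", Leaf]
    no: Union["Question", Leaf]

tree = Question(
    "Something feels off. Are you physically tired?",
    yes=Leaf(["Drink water", "Set a 20-minute nap timer"]),
    no=Question(
        "Are you avoiding a specific task?",
        yes=Leaf(["Write down the task's first concrete action", "Do it for 5 minutes"]),
        no=Leaf(["Step outside for 2 minutes", "Re-read today's plan"]),
    ),
)

def run(node):
    # Walk the tree by answering diagnostic questions, then print the checklist.
    while isinstance(node, Question):
        node = node.yes if input(node.text + " (y/n) ").lower().startswith("y") else node.no
    for item in node.checklist:
        print("-", item)

if __name__ == "__main__":
    run(tree)
```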
FYI: I'm hiring for basically a thinking assistant, right now, for I expect 5 to 10 hours a week. Pay depending on skill-level. Open to in-person or remote.
If you're really good, I'll recommend you to other people who I want boosted, and I speculate that this could easily turn into a full time role.
If you're interested or maybe interested, DM me. I'll send you my current writeup of what I'm looking for (I would prefer not to post that publicly quite yet), and if you're still interested, we can do a work trial.
However, fair warning: I've tried various versi...
I've sometimes said that dignity is the first skill I learned (often to the surprise of others, since I am so willing to look silly or dumb or socially undignified). Part of my original motivation for bothering to intervene on x-risk is that it would be beneath my dignity to live on a planet with an impending intelligence explosion on track to wipe out the future, and not do anything about it.
I think Ben's is a pretty good description of what it means for me, modulo that the "respect" in question is not at all social. It's entirely about my relationship with myself. My dignity or not is often not visible to others at all.
If your takeaway is only that you should have fatter tails on the outcomes of an aspiring rationality community, then I don't object.
If "I got some friends together and we all decided to be really dedicatedly rational" is intended as a description of Ziz and co, I think it is a at least missing many crucial elements, and generally not a very good characterization.
This has the obvious problem that an AI will then be indifferent between astronomical suffering and oblivion. In ANY situation where it needs to choose between those two, not just blackmail situations, it will not care on the merits which occurs.
You don't want your AI to prefer a 99.999% chance of astronomical suffering to a 99.9999% chance of oblivion. Astronomical suffering is much worse.
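To make that concrete, a toy expected-utility calculation with made-up numbers, showing how indifference between the two outcomes produces exactly that preference:

```python
# Made-up utilities: if the AI values astronomical suffering and oblivion identically
# (both -1, with "everything is fine" at 0), it prefers whichever lottery has the
# marginally lower chance of a bad outcome, even though suffering is far worse.
u_suffering = u_oblivion = -1.0
eu_suffering_lottery = 0.99999 * u_suffering + 0.00001 * 0.0    # 99.999% chance of suffering
eu_oblivion_lottery = 0.999999 * u_oblivion + 0.000001 * 0.0    # 99.9999% chance of oblivion
print(eu_suffering_lottery > eu_oblivion_lottery)  # True: it prefers the suffering lottery
```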