The most capable version of each model doesn't yet exist when the model is released. Beyond fine-tuning for specific tasks, scaffolding matters: the agentic scaffolds people build play an increasingly important role in the model's ultimate capability.
Suit yourself, but I happen to want to create many great continuations. I enjoy hearing about other people's happiness. I enjoy it more the better I understand them. I understand myself pretty well.
But I don't want to be greedy. I'm not sure a lot of forks of each person are better than making more new people.
Let me also mention that it's probably possible to merge forks. Simply averaging the weight changes in your simulated cortex and hippocampus will approximately work to share the memories across two forks. How far out that works before you start to get...
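To make the arithmetic concrete, here's a toy sketch of the kind of averaging I mean, phrased as hypothetical state dicts of parameters (parameter name to tensor or array). Whether anything like this actually preserves memories in a simulated cortex and hippocampus is pure speculation on my part.

```python
def merge_forks(base_state, fork_a_state, fork_b_state):
    """Toy sketch: average the weight *changes* two forks accumulated since a
    shared base snapshot, then apply that average to the base weights.
    The state-dict format is a hypothetical stand-in, not any real system."""
    merged = {}
    for name, base_w in base_state.items():
        delta_a = fork_a_state[name] - base_w  # what fork A learned since the split
        delta_b = fork_b_state[name] - base_w  # what fork B learned since the split
        merged[name] = base_w + 0.5 * (delta_a + delta_b)
    return merged
```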
Interesting! Do you think humans could pick up on word use that well? My perception is that humans mostly cue on structure to detect LLM slop writing, and that is relatively easily changed with prompts (although it's definitely not trivial at this point - but I haven't searched for recipes).
I did concede the point, since the research I was thinking of didn't use humans who've practiced detecting LLM writing.
I concede the point. That's a high bar for getting LLM submissions past you. I don't know of studies that tested people who'd actually practiced detecting LLM writing.
I'd still be more comfortable with a disclosure criterion of some sort, but I don't have a great argument beyond valuing transparency and honesty.
I read it too and had no such thought. I think that loose, poetic, free-association type of writing is hard for humans and easy for LLMs.
That's a good point and it does set at least a low bar of bothering to try.
But they don't have to try hard. They can almost just append the prompt with "and don't write it in standard LLM style".
I think it's a little more complex than that, but not much. Humans can't tell LLM writing from human writing in controlled studies. The question isn't whether you can hide the style, or even whether it's hard, just how easy it is.
Which raises the question of whether they'd even do that much, because of course they haven't read the FAQ before posting.
Really just making sure that new authors read SOMETHING about what's appreciated here would go a long way toward reducing slop posts.
If you wrote the whole thing, then prompted Claude to rewrite it, that would seem to "add significant value." If you then read the whole thing carefully to say "that's what I meant, and it didn't make anything up I'm not sure about", then you've more than met the requirement laid out here, right?
They're saying the second part is all you have to do. If you had some vague prompt like "write an essay about how the field of alignment is misguided" and then proofread it you've met the criteria as laid out. So if your prompt was essentially the complete essay, y...
I'd like clarification on using AI as a writing assistant by having a whole conversation with it, then letting it do the primary writing. I'm hoping this meets your criteria of "add significant value".
I thought Jan Kulveit had real success with this method in A Three-Layer Model of LLM Psychology and AI Assistants Should Have a Direct Line to Their Developers. He credited Claude with the writing without mentioning how much he edited it. I find it plausible that he edited very little because his contribution had been extensive on the "prompting" ...
This is really useful, thanks!
I've also spent a lot of time trying to delegate research, and it's hard to reach the break-even point where explaining what you want and how to search for it, even to a bright student in the field, takes less time than just doing it yourself.
I think any proper comparison of LLMs has to take into account prompting strategies. There are two questions: which is easiest to get results from, and which is best once you learn to use it. And the best prompts very likely vary across systems.
OpenAI's Deep Research is by far t...
Despite my contention on the associated paper post that focusing on wisdom in this sense is ducking the hard part of the alignment problem, I'll stress here that it seems thoroughly useful if it's a supplement, not a substitute, for work on the hard parts of the problem - technical, theoretical, and societal.
I also think it's going to be easier to create wise advisors than you think, at least in the weak sense that they make their human users effectively wiser.
In short, I think simple prompting schemes and eventually agentic scaffolds can do a lot of the extra...
Hm, I thought this use of "wise" is almost identical to capabilities. It's sort of like capabilities with less slop or confabulation, and probably more ability to take the context of the problem/question into account. Both of those are pretty valuable, although people might not want to bother even swerving capabilities in that direction.
This is great! I'll comment on that short-form.
In short, I think that wise (or even wise-ish) advisors are low-hanging fruit that will help any plan succeed, and that creating them is even easier than you suppose.
It's an interesting and plausible claim that eating plain food is better than fighting your appetite. I tend to believe it. I'm curious how you handle eating as a social occasion: do you avoid it, or do you eat differently on social occasions without it disrupting your diet or appetite?
Your boy slop also happens to follow my dietary theory.
I'm embarrassed to share my diet philosophy but I'm going to anyway. It's embarrassing because I am in fact modestly overweight. I feel it's still worth sharing as a datapoint for its strengths: I am only m...
I'm suddenly expecting the first AI escapes to be human-aided. And that could be a good thing.
Your mention of human-aided AI escape brought to mind Zvi's Going Nova post today about LLMs convincing humans they're conscious to get help in "surviving". My comment there is about how those arguments will be increasingly compelling because LLMs have some aspects of human consciousness and will have more as they're enhanced, particularly with good memory systems.
If humans within orgs help LLM agents "escape", they'll get out before they could manage it on ...
I haven't written about this because I'm not sure what effect similar phenomena will have on the alignment challenge.
But it's probably going to be a big thing in public perception of AGI, so I'm going to start writing about it as a means of trying to figure out how it could be good or bad for alignment.
Here's one crucial thing: there's an almost-certainly-correct answer to "but are they really conscious" and the answer is "partly".
Consciousness is, as we all know, a suitcase term. Depending on what someone means by "conscious", being able to reason correct...
Can you provide more details on the exact methods, like example prompts? Or did I miss a link that has these?
This is really interesting and pretty important if the methods support your interpretations.
I'm actually interested in your responses here. This is useful for how I frame things and for understanding different people's intuitions.
Do you think we can't make autonomous agents that pursue goals well enough to get things done? Do you really think they'll stay goal-focused long enough to do useful work but lose focus before they could take over the world, if they interpret their goals differently than we intended? Do you think there's no way RL or natural language could be misinterpreted?
I'm thinking it's easy to keep an LLM agent goa...
Hm. I think you're thinking of current LLMs, not AGI agents based on LLMs? If so, I fully agree that they're unlikely to be dangerous at all.
I'm worried about agentic cognitive architectures we've built with LLMs as the core cognitive engine. We are trying to make them goal directed and to have human-level competence; superhuman competence/intelligence follows after that if we don't somehow halt progress permanently.
Current LLMs, like most humans most of the time, aren't strongly goal directed. But we want them to be strongly goal-directed so they do t...
Would I have to go to DC? Because I hate going to DC.
Not that I wouldn't to save the world, but I'd want to be sure it was necessary.
Only partly kidding. Maybe if people got a rationalist enclave in DC going we'd be less averse?
You can definitely meet your own district's staff locally (e.g., if you're in Berkeley, Congresswoman Simon has an office in Oakland, Senator Padilla has an office in SF, and Senator Schiff's offices look not to be finalized yet but undoubtedly will include a Bay Area Office).
You can also meet most Congressional offices' staff via Zoom or phone (though some offices strongly prefer in-person meetings).
There is also indeed a meaningful rationalist presence in DC, though opinions vary as to whether the enclave is in Adams Morgan-Columbia Heights...
anyway i'm still not very convinced of Doom [...], because i have doubts about whether efficient explicit utility maximizers are even possible,
What? I'm not sure what you mean by "efficient" utility maximizers, but I think you're setting too high a bar for being concerned. I don't think doom is certain but I think it's obviously possible. Humans are dangerous, and we are possible. Anything smarter than humans is more dangerous if it has misaligned goals. We are building things that will become smarter than us. They will have goals. We do not know how to...
Isn't this model behavior what's described as jailbroken? It sounds similar to reports of jailbroken model behavior. It seems like the reports I've glanced at could be described as having lowered inhibitions and being closer to base model behavior.
In any case, good work investigating alternate explanations of the emergent misalignment paper.
This is a key point for a different discussion: job loss and effect on economies. Supposing writing software is almost all automated. Nobody is going to get the trillions currently spent on it. If just two companies, say Anthropic and OpenAI, have agents that automate it, they'll compete and drive the prices down to near the compute costs (or collude until others make systems that can compete...)
Now those trillions aren't being spent on writing code. Where do they go? Anticipating how businesses will use their surplus as they pay less in wages is probably some...
Did the paper really show that they're explicit optimizers? If so, what's your definition of an explicit optimizer?
I have representations of future states and I choose actions that might lead toward them. Those representations of my goals are explicit, so I'd call myself an explicit optimizer.
I added a bunch to the previous comment in an edit, sorry! I was switching from phone to laptop when it got longer. So you might want to look at it again to see if you got the full version.
Separately from my other comment, and more on the object level of your argument:
You focus on loops and say a feedforward network can't be an "explicit optimizer". Depending on what you mean by that term, I think you're right.
I think it's actually a pretty strong argument that a feedforward neural network itself can't be much of an optimizer.
Transformers do some effective looping by doing multiple forward passes. They make the output of the last pass the input of the next pass. That's a loop that's incorporating their past computations into their new comput...
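To make the loop concrete, here's a toy sketch of what I mean, with `model` as a hypothetical stand-in for one full forward pass. The point is only that each pass's output becomes part of the next pass's input.

```python
def generate(model, prompt_tokens, n_steps):
    """Illustrative autoregressive loop: each step is one feedforward pass,
    but the output token is appended to the context for the next pass,
    so past computation feeds back into new computation."""
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        next_token = model(tokens)  # one forward pass over the whole context
        tokens.append(next_token)
    return tokens
```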
Why do you think genes don't have an advantage over weights for mesaoptimization? The paper shows that weights can do it but mightn't genes still have an advantage?
I didn't follow the details, I'm just interested in the logic you're applying to conclude that your theoretical work was invalidated by that empirical study.
I also think Yudkowsky's argument isn't just about mesaoptimizers. You can have the whole network optimize for the training set, and just be disappointed to find out that it didn't learn what you'd hoped it would learn when it gets into new ...
I worry very little about AI harming us by accident. I think that's a much lower class of unsafe than an intelligent being adopting the goal of harming you. And I think we are busily working toward building AGI that might do just that. So I don't worry about other classes of unsafe things much at all.
Tons of things like cars and the Facebook engagement algorithm are unsafe by changing human behavior in ways that directly and indirectly cause physical and psychological harm.
Optimized to cause harm is another level of unsafe. An engineered or natural viru...
Have you elaborated this argument? I tend to think a military project would be a lot more cautious than move-fast-and-break-things Silicon Valley businesses.
The argument that orgs with reputations to lose might start being careful when AI becomes actually dangerous or even just autonomous enough to be alarming is important if true. Most folks seem to assume they'll just forge ahead until they succeed and let a misaligned AGI get loose.
I've made an argument that orgs will be careful to protect their reputations in System 2 Alignment. I think this will be h...
Right. I suppose those do interact with identity.
If I got significantly dumber, I'd still roughly be me, and I'd want to preserve that if it's not wiping out or distorting the other things too much. If I got substantially smarter, I'd be a somewhat different person - I'd often act differently, because I'd see situations differently (more clearly/holistically) - but it feels as though that person might actually be more me than I am now. I'd be better able to do what I want, including values (which I'd sort of wrapped into habits of thought, but values might deserve a spot on the list).
The way I usually frame identity is
Edit: values should probably be considered a separate class, since every thought has an associated valence.
In no particular order, and that's the whole list.
Character is largely beliefs and habits.
There's another part of character that's purely emotional; it's sort of a habit to get angry, scared, happy, etc in certain circumstances. I'd want to preserve that too but it's less important than the big three.
There are plenty of beings striving to survive, so preserving that ...
Draft comment on US AI policy, to be submitted shortly (deadline midnight ET tonight!) to the White House Office of Science and Technology Policy, who may largely shape official policy. See Dave Kasten's brief description in his short form if you haven't already, and regulations.gov/document/NSF_FRDOC_0001-3479… for their own brief description and submission email.
After thinking hard about it and writing Whether governments will control AGI is important and neglected, I've decided to share my whole opinion on the matter with the committee. My reason...
Interesting. I hadn't thought about Musk's influence and how he is certainly AGI-pilled.
I think they meant over 99% when used on a non-randomly selected human who's bothering to take a pregnancy test. Your rock would run maybe 70% or so on that application.
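Back-of-the-envelope, with a made-up base rate for the population that actually bothers to take a test (the ~30% figure is my assumption, only there to show where "maybe 70%" comes from):

```python
# Assume roughly 30% of people who take a pregnancy test are actually pregnant.
base_rate = 0.30

# A rock that always reads "not pregnant" is right whenever the person isn't pregnant.
rock_accuracy = 1 - base_rate
print(f"Rock accuracy on test-takers: {rock_accuracy:.0%}")  # ~70%
```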
I wasn't able to finish that post in the few minutes I've got so far today, so here's the super short version. I remain highly uncertain whether my comments will include any mention of AGI.
(Edit: I finally finished it: Whether governments will control AGI is important and neglected)
I think whether AGI-pilling governments is a good idea is quite complex. Pushing the government to become aware of AGI x-risks will probably decelerate progress, but it could even accelerate it if the conclusion is "build it first, don't worry we'll be super careful when we get ...
I don't think the happier worlds are less predictable; the Christians and their heaven of singing just lacked imagination. We'll want some exciting and interesting happy simulations, too.
But this overall scenario is quite concerning as an s-risk. To think that Musk pitched a curiosity drive for Grok as a good thing boggles my mind.
Emergent curiosity drives should be a major concern.
Edit: I finished that post on this topic: Whether governments will control AGI is important and neglected.
I'm hoping for discussion on that post, and quite ready to change my draft comment, or not submit one, based on those arguments. After putting a bunch of thought into it, my planned comment will recommend forming a committee that can work in private to investigate the opportunities and risks of AI development, to inform future policy. I will note that this was Roosevelt's response to Einstein's letter on the potential of nuclear weaponry.
I hope t...
I think on net, there are relatively fewer risks related to getting governments more AGI-pilled vs. them continuing on their current course; governments are broadly AI-pilled even if not AGI/ASI-pilled and are doing most of the accelerating actions an AGI-accelerator would want.
The fastest route to solving a complex problem and showing your work is often to just show the work you're doing anyway. That's what teachers are going for when they demand it. If you had some reason for making up fake work instead you could. But you'd need a reason.
Here it may be relevant that some of my friends did make up fake work when using shortcut techniques of guessing the answer in algebra.
Sure it would be better to have a better alignment strategy. But there are no plausible routes I know of to getting people to stop developing LLMs and LLM agen...
This is also encouraging because OpenAI is making some actual claims about safety procedures. Sure they could walk it back pretty easily, but it does indicate that at least as of now they likely intend to try to maintain a faithful CoT.
You assumed no faithful CoT in What goals will AIs have?, suggesting that you expected OpenAI to give up on it. That's concerning given your familiarity with their culture. Of course they still might easily go that way if there's a substantial alignment tax for maintaining faithful CoT, but this is at least nice to see.
Indeed! This is most of why I'm happy about this -- from internal discussions it's long been speculated (since '23 when I was talking about it, and probably before) that training away the bad thoughts from the CoT would result in obfuscated bad thoughts. It's cool that they actually did the experiments to prove it, and in a real-world setting too! But it's even cooler that they published this with a strong recommendation attached.
Now we see how long this recommendation holds up under the pressure of incentives.
Sometime in the next few years probably,...
Right. There's no advantage to being a bad liar, but there may be an advantage to being seen as a bad liar. But it's probably not worth lying badly to get that reputation, since that would also wreck your reputation for honesty.
Nice, I might have to borrow that Agancé joke. It's a branching nested set of sparkling while loops, but yeah.
And episodic memory to hold it together, and to learn new strategies for different sorts of executive function
Capabilities and alignment of LLM cognitive architectures
And if we get some RL learning helping out, that makes it easier and requires less smarts from the LLM that's prompted to act as its own thought manager.
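A minimal sketch of the kind of scaffold I mean, where the LLM acts as its own thought manager over an episodic memory it reads and writes. All names here (`call_llm`, the in-memory list) are hypothetical stand-ins, not any particular system.

```python
def run_agent(call_llm, task, max_steps=20):
    """Toy agent loop: the model manages its own thoughts, with a simple
    episodic memory of past steps fed back into each new prompt."""
    memory = []  # episodic memory: prior thoughts and intermediate results
    for _ in range(max_steps):
        context = "\n".join(memory[-10:])  # recall the most recent episodes
        thought = call_llm(
            f"Task: {task}\nRecent memory:\n{context}\n"
            "Decide the next step, or reply starting with DONE plus your answer."
        )
        memory.append(thought)  # store the new episode
        if thought.startswith("DONE"):
            return thought
    return memory[-1]
```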
This is pretty good. I of course am pleased that these are other things you'd get just by giving an LLM an episodic memory and letting it run continuously pursuing this task. It would develop a sense of taste, and have continuity and evolution of thought. It would build and revise mental models in the form of beliefs.
I'm pretty sure Claude already has a lot of curiosity (or at least 3.5 did). It could be more "genuine" if it came with continuous learning and was accompanied by explicit, reliable beliefs in memory about valuing curiosity and exploration.
My one quibble is the "implications". The most important implication of how emotions work in humans is that cognitive reframing is quite effective for emotional control in situations where there's a convincing alternate framing.
Telling yourself to not be afraid isn't useful at all, and potentially counterproductive through the "white elephant effect". Imagining the feeling of calm is somewhat useful (that type of direct emotional induction is under-studied but I've found it useful, as it should be from principles of how cognitive control and imagination wor...
I agree that it would be useful to have an official position.
There is no official position AFAIK but individuals in management have expressed the opinion that uncredited AI writing on LW is bad because it pollutes the epistemic commons (my phrase and interpretation).
I agree with this statement.
I don't care if an AI did the writing as long as a human is vouching for the ideas making sense.
If no human is actively vouching for the ideas and claims being plausibly correct and useful, I don't want to see it. There are more useful ideas here than I have time to ...
doing lengthy research and summarizing it is important work but not typically what I associate with "blogging". But I think pulling that together into an attractive product uses much the same cognitive skills as producing original seeing. The missing step in the process you describe is figuring out when the research did produce surprising insights, which might be a class of novel problems (unless a general formulaic approach works and someone scaffolds that in). To the extent it doesn't require solving novel problems, I think it's predictably easier than quality blogging that doesn't rely on research for the novel insights.
I think this is correct. Blogging isn't the easy end of the spectrum; it actually involves solving novel problems of finding new useful viewpoints. But this answer leaves out the core question: what advances will allow LLMs to produce original seeing?
If you think about how humans typically produce original seeing, I think there are relatively straightforward ways that an LLM-based cognitive architecture that can direct its own investigations, "think" about what it'd found, and remember (using continuous learning of some sort) what it's found c...
All of the below is speculative; I just want to note that there are at least equally good arguments for the advantages of being seen as a bad liar (and for actually being a bad liar).
I disagree on the real world advantages. Judging what works from a few examples who are known as good liars (Trump and Musk for instance) isn't the right way to judge what works on average (and I'm not sure those two are even "succeeding" by my standards; Trump at least seems quite unhappy).
I have long refused to play social deception games because not only do I not want to be ...
I don't think this path is easy; I think immense effort and money will be directed at it by default, since there's so much money to be made by replacing human labor with agents. And I think no breakthroughs are necessary, just work in fairly obvious directions. That's why I think this is likely to lead to human-level agents.
I just bumped across "atomic thinking", which asks the model to break the problem into component parts, attack each separately, and only produce an answer after that's done and they can all be brought together.
This is how smart humans attack some problems, and it's notably different from chain of thought.
I expect this approach could also be used to train models, by training on component problems. If other techniques don't keep progressing so fast as to make it irrelevant.
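Here's a hedged sketch of "atomic thinking" as I understand it: decompose first, answer each sub-problem separately, then synthesize. `call_llm` is a hypothetical stand-in for whatever completion API you use; the prompts are my guesses, not the original recipe.

```python
def atomic_answer(call_llm, problem):
    """Toy decomposition scaffold: solve components independently, then combine."""
    parts = call_llm(
        f"Break this problem into independent component parts, one per line:\n{problem}"
    ).splitlines()
    partial_answers = [
        call_llm(f"Solve just this component, ignoring the rest:\n{part}")
        for part in parts if part.strip()
    ]
    return call_llm(
        "Combine these component answers into one final answer:\n"
        + "\n".join(partial_answers)
        + f"\n\nOriginal problem: {problem}"
    )
```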
We should expect a significant chance of very short (2-5 year) timelines because we don't have good estimates of timelines.
We're estimating an ETA from good estimates of our position and velocity, but without a well-known destination.
A good estimate of the end point for timelines would require a good gears-level model of AGI. We don't have that.
The rational thing to do is admit that we have very broad uncertainties, and make plans for different possible timelines. I fear we're mostly just hoping timelines aren't really short.
This argument is separate f...
Yes, probably. The progression thus far is that the same level of intelligence gets more efficient - faster or cheaper.
I actually think current systems don't really think much faster than humans - they're just faster at putting words to thoughts, since their thinking is more closely tied to text. But if they don't keep getting smarter, they will still likely keep getting faster and cheaper.
It's looped back to cool for me, I'm going to check it out.