All of TristanTrim's Comments + Replies

I guess one kind of critique I might expect for this is that it fails to connect with a strategy. Terminology is for communication within a community. What community do these recommendations apply to, and why? I'd like to write a post sometime exploring that. If you know of anyone exploring that sort of idea, please let me know.

A more general critique might be about the value of making recommendations about terminology and language use at all. My intuition is that being conscientious about our language is important, but critical examination of that idea seems valid.

I think it's worth distinguishing between what I'll call "parallel SI" vs "collective SI".

Parallel SI is when you have something more intelligent because it runs a lot of the same intelligence in parallel. Strictly parallel SI would need to rely on random differences in decisions and Schelling points, since communication between threads would not be possible.

Collective SI requires parallel SI, but additionally has organization of the work being done by each intelligence. I think how far this concept can be pushed is unclear, but I don't see any reason that su... (read more)

✨ I just donated 71.12 USD (100 CAD 🇨🇦) ✨

I'd like to donate a more relevant amount but I'm finishing my undergrad and have no income stream... in fact, I'm looking to become a Mech Interp researcher (& later focus on agent foundations), but I'm not going to be able to do that if misaligned optimizers eat the world, so I support Lightcone's direction as I understand it (policy that promotes AI not killing everyone).

If anyone knows of good ways to fund myself as a MI researcher, ideally focusing on this research direction I've been developing, please let me know : )

WRT formatting: thanks, I didn't realise the markdown needs two newlines for a paragraph break.

I think CoT and its dynamics, as they relate to review and RSI, are very interesting & useful to be exploring.

Looking forward to reading the stepping stone and stability posts you linked. : )

Yes, you've written more extensively on this than I realized. Thanks for pointing out other relevant posts, and sorry for not having taken the time to find them myself; I'm trying to err more on the side of communication than I have in the past.

I think math is the best tool to solve alignment. It might be emotional: I've been manipulated and hurt by natural language and the people who prefer it to math, and have always found engaging with math to be soothing, or at least sobering. It could also be that I truly believe that the engineering rigor that comes with u... (read more)

Seth Herd
I certainly don't expect people to read a bunch of stuff before engaging! I'm really pleased that you've read so much of my stuff. I'll get back to these conversations soon hopefully; I've had to focus on new posts.

I think your feelings about math are shared by a lot of the alignment community. I like the way you've expressed those intuitions. I think math might be the best tool to solve alignment if we had unlimited time - but it looks like we very much do not.

WRT "I don't want his attempted in any light-cone I inhabit", well, neither do I. But we're not in charge of the light cone.

That really is a true and relevant fact, isn't it? 😭

It seems like aligning humans really is much more of a bottleneck rn than aligning machines, and not because we are at all on track to align machines.

I think you are correct about the need to be pragmatic. My fear is that there may not be anywhere on the scale from "too pragmatic, failed to actually align ASI" to "too idealistic, failed to engage with actual decision-makers running ASI projects" where we get good outcomes. It's stressful.

The organized mind recoils. This is not an aesthetically appealing alignment approach.

Praise Eris!

No, but seriously, I like this plan, with the caveat that we really need to understand RSI and what is required to prevent it first. Also, I think the temptation to allow these things to open up high-bandwidth channels to modalities other than language is going to be really, really strong, and if we go forward with this we need a good plan for resisting that temptation, and a good way to know when not to resist it.

Also, I'd like it if this was thought of as a step on the path to cyborgism/true value alignment, and not as a true ASI alignment plan on its own.

Seth Herd
On RSI, see The alignment stability problem and my response to your comment on Instruction-following AGI... WRT true value alignment, I agree that this is just a stepping stone to that better sort of alignment. See Intent alignment as a stepping-stone to value alignment.

I agree that including non-linguistic channels is going to be a strong temptation. Language does nicely summarize most of our really abstract thought, so I don't think it's necessary. But there are many training practices that would destroy the legible chain of thought needed for external review. See The case for CoT unfaithfulness is overstated for the inverse.

Legible CoT is actually not necessary for internal action review. You do need to be able to parse what the action is for another model to predict and review its likely consequences. And it works far better to review things at a plan level rather than action-by-action, so the legible CoT is very useful. But if the system is still trained to respond to prompts, you could still use the scripted internal review no matter how opaque the internal representations had become. But you couldn't really make that review independent if you didn't have a way to summarize the plan so it could be passed to another model, like you can with language.

BTW, your comment was accidentally formatted as a quote along with the bit you meant to quote from the post. Correcting that would make it easier for others to parse, but it was clear to me.
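A minimal sketch of that scripted, plan-level review, assuming a generic `call_model` chat API; the model roles, prompts, and APPROVE/REJECT convention are illustrative assumptions, not details from the linked posts:

```python
# Hypothetical sketch of plan-level internal review for an LLM agent.
# `call_model` stands in for any chat-completion API; the roles, prompts,
# and APPROVE/REJECT protocol are illustrative assumptions.

def call_model(model: str, prompt: str) -> str:
    # Stand-in so the sketch runs; replace with a real LLM call.
    if model == "reviewer":
        return "APPROVE: no harmful consequences predicted"
    return f"[{model} output for: {prompt[:40]}...]"

def reviewed_plan(task: str) -> str | None:
    # 1. The actor model proposes a plan.
    plan = call_model("actor", f"Propose a step-by-step plan for: {task}")
    # 2. Summarize the plan so it can be handed to an *independent* reviewer,
    #    even if the actor's internal representations are opaque.
    summary = call_model("actor", f"Summarize this plan in plain language:\n{plan}")
    # 3. A separate model predicts consequences and approves or rejects.
    verdict = call_model(
        "reviewer",
        f"Predict likely consequences, then say APPROVE or REJECT:\n{summary}",
    )
    # 4. Gate execution at the plan level rather than action-by-action.
    return plan if verdict.startswith("APPROVE") else None

print(reviewed_plan("book a venue for the conference"))
```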

I was going to say "I don't want this attempted in any light-cone I inhabit," but I realize there's a pretty important caveat. On its own, I think this is a doom plan, but if there were a sufficient push to understand RSI dynamics before and during, then I think it could be good.

I don't agree that it's "a better idea than attempting value alignment"; it's a better idea than dumb value alignment for sure, but imo only skilled value alignment or self-modification (no AGI, no ASI) will get us to a good future. But the plans aren't mutually exclusive. First stud... (read more)

Seth Herd
I guess I didn't address RSI in enough detail. The general idea is to have a human in the loop during RSI, and to talk extensively with the current version of your AGI about how the next improvement could disrupt its alignment before you launch it.

WRT "I don't want this attempted in any light-cone I inhabit": well, neither do I. But we're not in charge of the light cone. All we can do is convince the people who currently very much ARE on the road to attempting exactly this not to do it - and saying "it's way too risky and I refuse to think about how you might actually pull it off" is not going to do that. Or else we can try to make it work if it is attempted. Both paths to survival involve thinking carefully about how alignment could succeed or fail on our current trajectory.

I like this post. I like goals selected from learned knowledge (GSLK). It sounds a lot like what I was thinking about when I wrote how-i-d-like-alignment-to-get-done. I plan to use the term GSLK in the future. Thank you : )

"we've done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead - which nobody knows." 😭🤣 I really want "We've done so little work the probabilities are additive" to be a meme. I feel like I do get where you're coming from.

I agree about pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner. It should not be taken ligh... (read more)

Seth Herd
Yes, the math crowd is saying something like "give us a hundred years and we can do it!" And nobody is going to give them that in the world we live in.

Fortunately, math isn't the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don't say dumb things like "go solve cancer, don't bug me with the hows and whys, just git 'er done as you see fit", etc.), this could work. We can probably achieve technical intent alignment if we're even modestly careful and pay a modest alignment tax. You've now read my other posts making those arguments.

Unfortunately, it's not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax. The other threads are addressed in responses to your comments on my linked posts.

Yeah, getting specific unpause requirements seems high value for convincing people who would not otherwise want a pause, but I can't imagine actually getting them in time in any reasonable way; instead it would need to look like a technical specification - a "once we have developed x, y, and z, then it is safe to unpause" kind of thing. We just need to figure out what the x, y, and z requirements are. Then we can estimate how long it will take to develop x, y, and z, and this will get more refined and accurate as more progress is made, but since the requirements ... (read more)

Seth Herd
Agreed that scaling rather than addition is usually the better way to think about probabilities. In this case we've done so little work on alignment that I think it might actually be more like additive - from 1% to 26%, or 50% to 75%, with ten extra years, relative to the real current odds if we press ahead - which nobody knows. I'm pretty sure it would be an error to trust anyone's estimate at this time, because people with roughly equal expertise and wisdom (e.g., Yudkowsky and Christiano) give such wildly different odds. And the discussions between those viewpoints always trail off into differing intuitions.

I also give alignment by default very poor odds, and prosaic alignment as it's usually discussed. But there are some pretty obvious techniques that are so low-tax that I think they'll be implemented even by orgs that don't take safety very seriously. I'm curious if you've read my Instruction-following AGI is easier and more likely than value aligned AGI and/or Internal independent review for language model agent alignment posts. Instruction-following is human-in-the-loop, so that may already be what you're referring to. But some of the techniques in the independent review post (which is also a review of multiple methods) go beyond prosaic alignment to apply specifically to foundation model agents. And wisely-used instruction-following gives corrigibility with a flexible level of oversight. I'm curious what you think about those techniques if you've got time to look.

I think public acceptance of a pause is only part of the issue. The Chinese might actually not pursue AGI if they didn't have to race the US. But Russia and North Korea will most certainly pursue it (although they've got very limited resources and technical chops to make lots of progress in new foundation models), but they still might get to real AGI by turning next-gen foundation models (which there's not time to pause) into scaffolded cognitive architectures. But yes, I do think there's
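For concreteness, the two update styles being contrasted, using the same illustrative numbers as above (the 2x odds factor is an arbitrary choice):

```python
# Toy contrast between the two update styles. "Scaling" multiplies the
# odds; "additive" adds percentage points, which is only coherent when
# you're far from 0% and 100%. Numbers are illustrative.

def scale_odds(p: float, factor: float) -> float:
    odds = p / (1 - p)
    new_odds = odds * factor
    return new_odds / (1 + new_odds)

p = 0.01
print(scale_odds(p, 2.0))  # 2x the odds of 1%  -> ~2.0%
print(p + 0.25)            # additive +25pp     -> 26%

p = 0.50
print(scale_odds(p, 2.0))  # 2x the odds of 50% -> ~66.7%
print(p + 0.25)            # additive +25pp     -> 75%
```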

Hey : ) Thanks for engaging with this. It means a lot to me <3

Sorry I wrote so much; it kinda got away from me. Even if you don't have time to really read it all, it was a good exercise writing it all out. I hope it doesn't come across as too confrontational; as far as I can tell, I'm really just trying to find good ideas, not prove my ideas are good, so I'm really grateful for your help. I've been accused of trying to make myself seem important while trying to explain my view of things to people, and it sucks all round when that happens. This reply of mine... (read more)

Do you think people would vibe with it better if it was framed as "I may die, but it's a heroic sacrifice to save my home planet from a may-as-well-be-an-alien invasion"? Is it reasonable to characterize general superintelligence as an alien takeover, and if it is, would people accept the characterization?

Seth Herd
Yes, I think that framing would help. I doubt it would shift public opinion that much, probably not even close to more than 50% in the current epistemic environment. The issue is that we really don't know how hard alignment is. If we could say for sure that pausing for ten years would improve our odds of survival by, say, 25%, then I think a lot of people of the relevant ages (like probably me) would actually accept the framing of a heroic sacrifice.

There may also be a perceived difference between "open" and "open-source". If the goal is to allow anyone to query the HHH AGI, that's different from anyone being able to modify and re-deploy the AGI. Not that I think that way: in my view the risk that AGI is uncontrollable is too high, and we should pursue an "aligned from boot" strategy like the one I describe in How I'd like alignment to get done.

Hey, we met at EAGxToronto : ) I am finally getting around to reading this. I really enjoyed your manic writing style. It is cathartic finding people stressing out about the same things that are stressing me out.

In response to "The less you have been surprised by progress, the better your model, and you should expect to be able to predict the shape of future progress": My model of capabilities increases has not been too surprised by progress, but that is because for about 8 years now there has been a wide uncertainty bound and a lot of Vingean Reflection i... (read more)

porby
🙋‍♂️ This is indeed a locally valid way to escape one form of the claim - without any particular prediction carrying extra weight, and given that reality has to go some way, there isn't much surprise in finding yourself in any given world.

I do think there's value in another version of the word "surprise" here, though. For example: the cross-entropy loss of the predicted distribution with respect to the observed distribution. Holding to a high-uncertainty model of progress will result in continuously high "surprise" in this sense, because it struggles to narrow to a better distribution generator. It's a sort of overdamped epistemological process.

I think we have enough information to make decent gearsy models of progress around AI. As a bit of evidence, some such models have already been exploited to make gobs of money. I'm also feeling pretty good[1] about many of my predictions (like this post) that contributed to me pivoting entirely into AI; there's an underlying model that has a bunch of falsifiable consequences which has so far survived a number of iterations, and that model has implications through the development of extreme capability.

Yup! That was a pretty major (and mostly positive) update for me. I didn't have a strong model of government-level action in the space and I defaulted into something pretty pessimistic. My policy/governance model is still lacking the kind of nuance that you only get by being in the relevant rooms, but I've tried to update here as well. That's also part of the reason why I'm doing what I'm doing now.

May you have the time to solve everything!

1. ^ ... epistemically
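A quick sketch of that cross-entropy notion of surprise, with invented distributions: a maximally hedged (uniform) predictor pays a constant log(n) of surprise every round, while a sharper, roughly-correct model pays much less:

```python
import numpy as np

# "Surprise" as cross-entropy: H(p, q) = -sum_x p(x) log q(x), where p is
# what actually happens and q is the predictive distribution. The
# distributions below are invented for illustration.

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    return float(-np.sum(p * np.log(q)))

observed = np.array([0.05, 0.10, 0.70, 0.10, 0.05])  # what reality "does"
uniform  = np.full(5, 0.20)                          # high-uncertainty model
sharp    = np.array([0.05, 0.10, 0.65, 0.15, 0.05])  # gearsy, mostly-right model

print(cross_entropy(observed, uniform))  # ~1.61 nats, every round, forever
print(cross_entropy(observed, sharp))    # ~1.02 nats -- persistently lower
```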

re 6 -- Interesting. It was my impression that "chain of thought" and other techniques notably improved LLM performance. Regardless, I don't see compositional improvements as a good thing: they are hard to understand as they are being created, and the improvements seem harder to predict. I am worried about RSI in a misaligned system created/improved via composition.

Re race dynamics: It seems to me there are multiple approaches to coordinating a pause. It doesn't seem likely that we could get governments or companies to head a pause. Movements from the gene... (read more)

Daniel Kokotajlo
Sounds good to me!

About (6): I think we're more likely to get AGI/ASI by composing pre-trained ML models and other elements than by a fresh training run. Think adding iterated reasoning and API calling to an LLM.
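A minimal sketch of that kind of composition - a loop alternating LLM reasoning with API calls - where the `llm` stub and the TOOL/ANSWER convention are illustrative assumptions, not any particular framework's interface:

```python
# Wrap a pre-trained LLM in a loop that alternates reasoning with tool/API
# calls. The protocol and stubs here are invented for illustration.

TOOLS = {"search": lambda q: f"<top results for {q!r}>"}

def llm(transcript: str) -> str:
    # Stand-in that "reasons" for one step then answers; replace with a
    # real chat-completion call.
    if "RESULT:" not in transcript:
        return "TOOL:search:site status"
    return "ANSWER: done"

def scaffolded_agent(task: str, max_steps: int = 10) -> str:
    transcript = (f"Task: {task}\nThink step by step. Use TOOL:<name>:<input> "
                  "to call a tool; finish with ANSWER:<answer>.\n")
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("TOOL:"):
            _, name, arg = step.split(":", 2)
            transcript += f"RESULT: {TOOLS[name](arg)}\n"
    return "no answer within step budget"

print(scaffolded_agent("check if the site is up"))
```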

About the race dynamics. I'm interested in founding / joining a guild / professional network for people committed to advancing alignment without advancing capabilities. Ideally we would share research internally, but it would not be available to those not in the network. How likely does this seem to create a worthwhile cooling of the ASI race? Especially if the network were somehow successful enough to reach across relevant countries?

Daniel Kokotajlo
Re 6 -- I dearly hope you are right, but I don't think you are. That scaffolding will exist of course, but the past two years have convinced me that it isn't the bottleneck to capabilities progress: tens of thousands of developers have been building language model programs / scaffolds / etc. with little to show for it. (Tbc, even in 2021 I thought this, but I feel like the evidence has accumulated now.)

Re race dynamics: I think people focused on advancing alignment should do that, and not worry about capabilities side-effects, unless and until we can coordinate an actual pause or slowdown. There are exceptions, of course, on a case-by-case basis.

It's not really possible to hedge either the apocalypse or a global revolution, so you can ignore those states of the world when pricing assets (more or less).

Unless, depending on what you invest in, those states of the world become more or less likely.
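A toy version of that pricing logic, with made-up numbers: states where money is worthless contribute nothing to any asset's payoff, so relative prices only reflect the surviving states:

```python
# Toy state pricing with made-up numbers. In states where money/markets
# don't survive, every asset pays effectively nothing, so those states
# drop out of *relative* prices once you renormalize.

states = {
    # name:             (probability, payoff of asset A, payoff of asset B)
    "business as usual": (0.70, 110, 105),
    "recession":         (0.20,  80, 100),
    "apocalypse":        (0.10,   0,   0),  # money is worthless either way
}

def price(asset_index: int) -> float:
    # Condition on the states where payoffs are meaningful.
    live = [(p, payoffs) for p, *payoffs in states.values() if any(payoffs)]
    total = sum(p for p, _ in live)
    return sum(p * payoffs[asset_index] for p, payoffs in live) / total

print(price(0), price(1))  # relative prices ignore the apocalypse state
```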

Haha, I was hoping for a bit more activity here, but we filled our speaker slots anyway. If you stumble across this post before November 26th, feel free to come to our conference.

In the final paragraph, I'm uncertain if you are thinking about "agency" being broken into components which make up the whole concept, or thinking about the category being split into different classes of things, some of which may have intersecting examples (or both?). I suspect both would be helpful. Agency can be described in terms of components like measurement/sensory, calculations, modeling, planning, comparisons to setpoints/goals, and taking actions. Probably not that exact set, but then examples of agent-like things could naturally be compared on each c... (read more)

It's unimportant, but I disagree with the "extra special" in:

if alignment isn’t solvable at all [...] extra special dead

If we could coordinate well enough and get to SI via very slow human enhancement, that might be a good universe to be in. But probably we wouldn't be able to coordinate well enough and prevent AGI in that universe. Still, the odds seem similar between "get humanity to hold off on AGI till we solve alignment", which is the ask in alignment-possible universes, and "get humanity to hold off on AGI forever", which is the ask in alignment-impossible ... (read more)

what Hotz was treating a load bearing

Small grammar mistake. You accidentally a "a".

Oh, actually I spoke too soon about Talk to the City. As a research project it is cool, but I really don't like the obfuscation that occurs when talking to an LLM about the content it was trained on. I don't know how TTTC works under the hood, but I was hoping for something more like de-duplication of posts, automatically fitting them into argument graphs. Then users could navigate to relevant points in the graph based on a text description of their current point of view, but importantly they would be interfacing with the actual human-generated text, wi... (read more)
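Something like the following sketch of the wished-for de-duplication step, where `embed` is a placeholder for any sentence-embedding model and the similarity threshold is an arbitrary illustrative choice:

```python
import numpy as np

# Rough sketch of the wished-for pipeline: cluster near-duplicate claims so
# users navigate actual human-written text rather than LLM paraphrases.
# `embed` is a placeholder for a real sentence-embedding model.

def embed(text: str) -> np.ndarray:
    # Deterministic dummy embedding so the sketch runs; swap in a real model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def dedupe(claims: list[str], threshold: float = 0.9) -> list[list[str]]:
    groups: list[tuple[np.ndarray, list[str]]] = []
    for claim in claims:
        v = embed(claim)
        for centroid, members in groups:
            if float(centroid @ v) >= threshold:  # cosine similarity
                members.append(claim)
                break
        else:
            groups.append((v, [claim]))
    return [members for _, members in groups]

# Each group would become one node in the argument graph, displaying the
# original posts verbatim rather than an LLM summary of them.
```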

  1. The human eats ice cream
  2. The human gets reward 
  3. The human becomes more likely to eat ice cream


So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing the fact that ice cream makes their reward circuits light up, which I agree is not a misgeneralization: ice cream really does light up the reward circuits. If the human learned "I like licking cold things" and then stuck their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are fo... (read more)
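The quoted 1→2→3 loop is just a vanilla reinforcement update; a toy version (all numbers arbitrary):

```python
import math

# Toy version of the quoted loop: action -> reward -> action becomes more
# likely. A softmax policy over two actions, nudged by a reward-weighted
# gradient step on the chosen action's preference.

prefs = {"eat ice cream": 0.0, "skip it": 0.0}  # action preferences
LR = 0.5

def probs() -> dict[str, float]:
    z = sum(math.exp(v) for v in prefs.values())
    return {a: math.exp(v) / z for a, v in prefs.items()}

for _ in range(5):
    p = probs()
    action, reward = "eat ice cream", 1.0           # 1. eat   2. reward
    prefs[action] += LR * reward * (1 - p[action])  # 3. more likely next time

print(probs())  # "eat ice cream" probability has risen well above 0.5
```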

People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches

I won't say I could have predicted ahead of time that these wouldn't foom, but it seems the result of all of these is an AI engineer that is much, much more narrow / less capable than a human AI researcher.

It makes me really scared; many people's response to not getting mauled after poking a bear is to poke it some more. I wouldn't care so much if I didn't think the bear was going to maul me, my family, and everyone I care about.

I... (read more)

The waluigis will give anti-croissant responses

I'd say the waluigis have a higher probability of giving pro-croissant responses than the luigis, and are therefore genuinely selected against. The reinforcement learning is not part of the story; it is the thing selecting for the LLM distribution, based on whether the content of the story contained pro- or anti-croissant propaganda.

(Note that this doesn't apply to future, agent-shaped AIs (made of LLM components) which are aware of their status (subject to "training" alteration) as part of the story they are working on.)
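One toy way to make "RL selects over the distribution of stories" concrete: treat the base model as a luigi/waluigi mixture, and the idealized KL-penalized RL step as reweighting each persona by exp(expected reward / β). All numbers below are invented, and persona-level reweighting is a simplification of the per-trajectory result:

```python
import math

# Toy model of the claim above: RLHF as reweighting a mixture of personas
# by exp(expected reward / beta). Reward favors anti-croissant responses.

BETA = 1.0
personas = {
    #            prior  P(pro-croissant response | persona)
    "luigi":   (0.90, 0.05),
    "waluigi": (0.10, 0.60),
}

def reweight(reward_pro: float = -1.0, reward_anti: float = +1.0) -> dict[str, float]:
    weights = {}
    for name, (prior, p_pro) in personas.items():
        expected_reward = p_pro * reward_pro + (1 - p_pro) * reward_anti
        weights[name] = prior * math.exp(expected_reward / BETA)
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}

print(reweight())
# The waluigi's posterior mass drops well below its 10% prior: it really is
# selected against, because it emits pro-croissant responses more often.
```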

I like this direction of thought, and I suspect it is true as a general rule, but it ignores the incentive people have for correctly receiving the information, and the structure through which the information is disseminated. Both factors (and probably others I haven't thought of) would increase or decrease how much information could be transferred.

Mauricio_AG
This is a good point. We can explain why students in medical school carefully digest millions of words by discussing the near-term incentives of final exams and the long-term incentives of increased salary and social status.