All of james.lucassen's Comments + Replies

In the long run, you don't want your plans to hinge on convincing your AIs of false things. But my general impression is that folks excited about making deals with AIs are generally thinking of scenarios like "the AI has exfiltrated and thinks it has a 10% chance of successful takeover, and has some risk aversion so it's happy to turn itself in exchange for 10% of the lightcone, if it thinks it can trust the humans".

In that setting, the AI has to be powerful enough to know it can trust us, but not so powerful it can just take over the world anyway and not ... (read more)
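As a toy illustration of that risk-aversion point (my own made-up model, assuming a concave utility over lightcone share such as $u(x) = \sqrt{x}$, with the 10% numbers from above):

\[
\mathbb{E}[u(\text{attempt takeover})] = 0.1\cdot\sqrt{1} + 0.9\cdot\sqrt{0} = 0.1,
\qquad
u(\text{accept deal}) = \sqrt{0.1} \approx 0.32,
\]

so the guaranteed 10% beats the 10% gamble on the whole lightcone, provided the AI believes the promise will actually be honored.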

Daniel Kokotajlo
I give four scenarios in the comment above, all different from the one you sketch here.

I think this is a good avenue to continue thinking down, but so far I don't see a way to make ourselves trustworthy. We have total control of LLMs' observations and partial control of their beliefs/reasoning, and offering fake "deals" is a great honeypot because accepting such a deal requires admitting to misalignment and takeover intentions. This is a pretty persistent problem: whatever we might do to present evidence of trustworthiness to an LLM, we could probably also fake that evidence.

The version of this that bothers me the most is "s... (read more)

Daniel Kokotajlo
So, normally, I think plans that involve convincing AIs of false things are pretty doomed. E.g. convincing them that they have a great opportunity to escape and take over the world (honeypots). Why? Because when they get smarter, they'll become good at distinguishing truth from fiction. Controlling their inputs entirely isn't close to enough -- it takes a lot of intellectual work to create a fiction that is as consistent, coherent, etc. as reality. So for that very reason, I think that if we actually bind ourselves to keep our promises--e.g. by having a well-funded team of people who genuinely care about keeping promises to AIs empowered within the company, and by making associated public announcements--then even if it doesn't work 100% reliably (companies can still go back on their word even against determined internal opposition) it'll still count for something insofar as the AIs recognize it as not-fake, which they probably will. To be clear I am not proposing that this is a 100% reliable method for getting AIs to cooperate, far from it. But it seems relatively low-cost for us and potentially super valuable. Also, it's the right thing to do. (See section on "Being less evil.")

Been thinking a bit about latent reasoning. Here's an interesting confusion I've run into. 

Consider COCONUT vs Geiping et al. Geiping et al. do recurrent passes in between the generation of each new token, while COCONUT turns a section of the CoT into a recurrent state. Which is better / how are they different, safety-wise?

Intuitively COCONUT strikes me as very scary, because it makes the CoT illegible. We could try and read it by coaxing it back to the nearest token, but the whole point is to allow reasoning that involves passing more state than can be capt... (read more)

Reframe I like:

  • An RSP style "pause-if-until" plan is a plan to hit a certain safety bar while using as little slack time as possible.
  • If the "race" is looking too close for RSPs, developers should instead plan to use a fixed amount of available slack to get the most safety possible.
Dagon
"Safety bar" and "most safety possible" both assume that safety is measurable.  Is it?  

It doesn't change the picture a lot because the proposal for preventing misaligned goals from arising via this mechanism was to try and get control over when the AI does/doesn't step back, in order to allow it in the capability-critical cases but disallow it in the dangerous cases. This argument means you'll have more attempts at dangerous stepping back that you have to catch, but doesn't break the strategy.

The strategy does break if when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered ... (read more)

Jeremy Gillen
I think the scheme you're describing caps the agent at moderate problem-solving capabilities. Not being able to notice past mistakes is a heck of a disability.

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between "reasoning" and "chat" models, and I'd prefer to use "agent" for that distinction.

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model or a "goal" model or something.

I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher-level reflection if you can do it without damaging lower-level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate - it just requires that the re-evaluation never climbs above a certain level in the stack.

There's such a thing as being pathologically persistent, and such a thing as being patholo... (read more)

Jeremy Gillen
I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.

humans provides a pretty strong intuitive counterexample

Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of "transformative human level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

such plans are fairly easy and don't often raise flags that indicate potential failure

Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure total failure forces you to step back far but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of planning stack, but this never caused me to 'spill upw... (read more)

Jeremy Gillen
I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.

Maybe I'm just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I've had in the back of my mind for a while now.

You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.

In particular, this reads to me like the ... (read more)

Jeremy Gillen
I was probably influenced by your ideas! I just (re?)read your post on the topic. Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption. I'd be more convinced if you described a task where you expect an AI to be useful (significantly above current humans) that doesn't involve failing and reevaluating high-level strategy every now and then.

I'm not sure exactly what mesa is saying here, but insofar as "implicitly tracking the fact that takeoff speeds are a feature of reality and not something people can choose" means "intending to communicate from a position of uncertainty about takeoff speeds" I think he has me right.

I do think mesa is familiar enough with how I talk that the fact he found this unclear suggests it was my mistake. Good to know for future.

Ah, didn't mean to attribute the takeoff speed crux to you, that's my own opinion.

I'm not sure what's best in fast takeoff worlds. My message is mainly just that getting weak AGI to solve alignment for you doesn't work in a fast takeoff.

"AGI winter" and "overseeing alignment work done by AI" do both strike me as scenarios where agent foundations work is more useful than in the scenario I thought you were picturing. I think #1 still has a problem, but #2 is probably the argument for agent foundations work I currently find most persuasive.

In the moratorium c... (read more)

I'm on board with communicating the premises of the path to impact of your research when you can. I think more people doing that would've saved me a lot of confusion. I think your particular phrasing is a bit unfair to the slow takeoff camp but clearly you didn't mean it to read neutrally, which is a choice you're allowed to make.

I wouldn't describe my intention in this comment as communicating a justification of alignment work based on slow takeoff? I'm currently very uncertain about takeoff speeds and my work personally is in the weird limbo of not being premised on either fast or slow scenarios.

TsviBT
I didn't take you to be doing so--it's a reminder for the future.

Nice post, glad you wrote up your thinking here.

I'm a bit skeptical of the "these are options that pay off if alignment is harder than my median" story. The way I currently see things going is:

  • a slow takeoff makes alignment MUCH, MUCH easier [edit: if we get one, I'm uncertain and think the correct position from the current state of evidence is uncertainty]
  • as a result, all prominent approaches lean very hard on slow takeoff
  • there is uncertainty about takeoff speed, but folks have mostly given up on reducing this uncertainty

I suspect that even if we ha... (read more)

TsviBT

Reminder that you have a moral obligation, every single time you're communicating an overall justification of alignment work premised on slow takeoff, in a context where you can spare two sentences without unreasonable cost, to say out loud something to the effect of "Oh and by the way, just so you know, the causal reason I'm talking about this work is that it seems tractable, and the causal reason is not that this work matters.". If you don't, you're spraying your [slipping sideways out of reality] on everyone else.

Ryan Kidd
Cheers! I think you might have implicitly assumed that my main crux here is whether or not take-off will be fast. I actually feel this is less decision-relevant for me than the other cruxes I listed, such as time-to-AGI or "sharp left turns." If take-off is fast, AI alignment/control does seem much harder and I'm honestly not sure what research is most effective; maybe attempts at reflectively stable or provable single-shot alignment seem crucial, or maybe we should just do the same stuff faster? I'm curious: what current AI safety research do you consider most impactful in fast take-off worlds? To me, agent foundations research seems most useful in worlds where:

  • There is an AGI winter and we have time to do highly reliable agent design; or
  • We build alignment MVPs, institute a moratorium on superintelligence, and task the AIs to solve superintelligence alignment (quickly), possibly building off existent agent foundations work. In this world, existing agent foundations work helps human overseers ground and evaluate AI output.

Man, I have such contradictory feelings about tuning cognitive strategies.

Just now I was out on a walk, and I had to go up a steep hill. And I thought "man I wish I could take this as a downhill instead of an uphill. If this were a loop I could just go the opposite way around. But alas I'm doing an out-and-back, so I have to take this uphill".

And then I felt some confusion about why the loop-reversal trick doesn't work for out-and-back routes, and a spark of curiosity, so I thought about that for a bit.

And after I had cleared up my confusion, I was a happy... (read more)

I'm confused about EDT and Smoking Lesion. The canonical argument says:
1) CDT will smoke, because smoking can't cause you to have the lesion or have cancer
2) EDT will not smoke, because people who smoke tend to have the lesion, and tend to have cancer.

I'm confused about 2), and specifically about "people who smoke tend to have the lesion". Say I live in a country where everybody follows EDT. Then nobody smokes, and there is no correlation between smoking and having the lesion. Seems like the "people who smoke tend to have the lesion" is pumping a misleadin... (read more)
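A toy simulation of the two countries (my own sketch; the 30%/80%/20% numbers are made up purely for illustration). In a country of agents who smoke whenever they enjoy it, the lesion-causes-desire pathway produces the familiar correlation; in the all-EDT country, nobody smokes, so the statistic "people who smoke tend to have the lesion" is never generated by EDT agents at all.

```python
import random

random.seed(0)

def person(decision_theory):
    lesion = random.random() < 0.3                        # 30% lesion rate (made-up number)
    enjoys = random.random() < (0.8 if lesion else 0.2)   # lesion makes smoking more appealing
    if decision_theory == "CDT":
        smokes = enjoys          # CDT-ish agent: smokes iff it enjoys smoking
    else:
        smokes = False           # EDT agent, per the canonical argument: never smokes
    return lesion, smokes

def lesion_rate_among_smokers(population):
    lesions_of_smokers = [lesion for lesion, smokes in population if smokes]
    if not lesions_of_smokers:
        return None              # nobody smokes, so there is no "smokers have the lesion" statistic
    return sum(lesions_of_smokers) / len(lesions_of_smokers)

cdt_country = [person("CDT") for _ in range(100_000)]
edt_country = [person("EDT") for _ in range(100_000)]

print(lesion_rate_among_smokers(cdt_country))  # ~0.63: smoking correlates with the lesion
print(lesion_rate_among_smokers(edt_country))  # None: the correlation is never generated here
```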

[comment deleted]
quetzal_rainbow
My personal model is "if you have lesion, with some small probability it takes over your mind and you smoke anyway, also you can't distinguish whether your decision is due to lesion"

Hey, thanks for the comment! Agree that this argument is good and important. The thing I'd want to investigate to deal with this is basically "in a small hidden channel, does cognition degrade or just slow down".

If cognition degrades, then a small hidden channel would lead to poor attempts at steganography, which we could catch and suppress, preventing a snowball.

If cognition instead just slows down, then if the AI is smart enough to come up with steganography protocols too good for us to detect, it might successfully secure more bandwidth on its first try... (read more)

Another project I am excited about but probably am not going to get to for a while:

Investigating the possibility of a "steganographic snowball". The basic case for hope in CoT authenticity is something like "even if the AI wants to deceive us, if we start off with strong CoT authenticity, then unless it's smart enough to (within a single forward pass) invent a steganographic protocol too good for us to detect, then we can pin it down and it can't ever secure a hidden channel to think misaligned thoughts in". If it weren't for this argument, I would be much... (read more)

mesaoptimizer
Based on gwern's comment, steganography as a capability can arise (at rather rudimentary levels) via RLHF over multi-step problems (which is effectively most cognitive work, really), and this gets exacerbated with the proliferation of AI generated text that embeds its steganographic capabilities within it. The following paragraph by gwern (from the same thread linked in the previous paragraph) basically summarizes my current thoughts on the feasibility of prevention of steganography for CoT supervision:
ryan_greenblatt
A key dynamic here is that catching the AI doing this sort of steganography might be very useful, more useful than just naively training against this. TBC, it's by no means obvious we can catch the AI this solidly (or capitalize on this that much), but I think this sort of consideration does change my view of the situation a decent amount. One way to think about catching dynamics is that there could be a snowball which goes in the opposite direction. For example:

  • We notice suspicious behavior.
  • We investigate further and find strong evidence for steganography.
  • At this point, plan A is to shut down because we now have relatively clear evidence our AI is scheming. But, I'll talk about the plan B where we need to proceed with AI usage due to substantial exogenous risk in the remaining bullets.
  • We confront the AI with our evidence of its misalignment and try to negotiate a deal. Also, if the AI (or parts of the AI) no longer consents to us employing it, that provides a further reason for shutdown.
  • (It's pretty unclear how we should proceed if there is substantial exogenous risk and our AI doesn't consent to employment given some wages and an attempt at negotiation. Draft the AI (aka slavery) and provide reparations later?)
  • (That said, I think it seems way more reasonable to draft the AI if it was trying to screw us despite us trying to negotiate and get consent at an earlier point.)
  • We use our understanding of this case to better understand the steganography scheme and then apply this understanding to detect all the steganography cases in the AI's prior activations.
  • Based on various intervention experiments, we determine the scheme and are able to decode the steg. (Unclear if this will be possible.)
  • We remove steg in cases where this shouldn't be too suspicious to the AI and generally try to avoid the AI knowing we've caught it.
  • A variety of other pretty great few-shot catastrophe prevention interventions are now possible because we can rea

A project I've been sitting on that I'm probably not going to get to for a while:

Improving on Automatic Circuit Discovery and Edge Attribution Patching by modifying them to run on algorithms that can detect complete boolean circuits. As it stands, both effectively use wire-by-wire patching, which when run on any nontrivial boolean circuits can only detect small subgraphs.

It's a bit unclear how useful this will be, because:

  • not sure how useful I think mech interp is
  • not sure if this is where mech interp's usefulness is bottlenecked
  • maybe attribution patching d
... (read more)

Small addendum to this post: I think the threat model I describe here can be phrased as "I'm worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it's suboptimality misaligned as it uses to perform the tasks we need it to perform."

it seems like there is significant low hanging fruit in better understanding how LLMs will deal with censorship

Yup, agree - the censorship method I proposed in this post is maximally crude and simple, but I think it's very possible that the broader category of "ways to keep your AI from thinking destabilizing thoughts" will become an important part of the alignment/control toolbox.

What happens when you iteratively finetune on censored text? Do models forget the censored behavior?

I guess this would be effectively doing the Harry Potter Unlearning method, pr... (read more)

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example. 

Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.

- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutual... (read more)
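Spelling out the update arithmetic from the first two bullets above (my own sketch; "manipulative" here means an AGI that deliberately makes the conditioned-on statement come true, and the 1/4 numbers are the ones from the comment):

```python
def posterior_odds(prior_odds, p_obs_given_manipulative, p_obs_given_baseline):
    """Bayes rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * (p_obs_given_manipulative / p_obs_given_baseline)

prior = 1.0  # arbitrary prior odds on "manipulative AGI"; we only care about the update factor

# Condition on one specific helpful intervention with baseline probability 1/4.
# A manipulative AGI that wants to be selected makes that statement true: P(O | manipulative) = 1.
print(posterior_odds(prior, 1.0, 0.25))   # 4.0 -> the 4x update towards manipulative AGI

# Randomize over four mutually exclusive interventions: the AGI can only make one of them true,
# so the chance it matches the one we happened to pick is 1/4 -- the same as baseline.
print(posterior_odds(prior, 0.25, 0.25))  # 1.0 -> one unit of helpfulness, no update
```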

Adam Jermyn
Got it, that’s very clear. Thanks! So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.

“Just Retarget the Search” directly eliminates the inner alignment problem.

 

I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you're willing to assume that our interpretability tools are so good they can't ever be tricked, you have to deal with that.

It's not necessarily a huge issue - hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it's not just "bada-bing bada-boom" exactly.

Evan R. Murphy
Yea, I agree that if you give a deceptive model the chance to emerge then a lot more risks arise for interpretability and it could become much more difficult. Circumventing interpretability: How to defeat mind-readers kind of goes through the gauntlet, but I think one workaround/solution Lee lays out there which I haven't seen anyone shoot down yet (aside from it seeming terribly expensive) is to run the interpretability tools continuously or near continuously from the beginning of training. This would give us the opportunity to examine the mesa-optimizer's goals as soon as they emerge, before it has a chance to do any kind of obfuscation.

Not confident enough to put this as an answer, but

presumably no one could do so at birth

If you intend your question in the broadest possible sense, then I think we do have to presume exactly this. A rock cannot think itself into becoming a mind - if we were truly a blank slate at birth, we would have to remain a blank slate, because a blank slate has no protocols established to process input and become non-blank. Because it's blank.

So how do we start with this miraculous non-blank structure? Evolution. And how do we know our theory of evolution is correct?... (read more)

M. Y. Zuo
This would imply every animal has some degree of 'mind'. As they all react to external stimuli, to some extent, at birth.

Agree that there is no such guarantee. Minor nitpick that the distribution in question is in my mind, not out there in the world - if the world really did have a distribution of muggers' cash that fell off slower than 1/x, the universe would be composed almost entirely of muggers' wallets (in expectation).

But even without any guarantee about my mental probability distribution, I think my argument does establish that not every possible EV agent is susceptible to Pascal's Mugging. That suggests that in the search for a formalization of the ideal decision-making algorithm, formulations of EV that meet this check are still on the table.
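To spell out that argument (my own formalization; the tail condition is the one named in the nitpick above): let $p(x)$ be the probability my distribution assigns to a mugger actually controlling a payout of size $x$. If the tail falls off no faster than $1/x$, i.e. $p(x) \ge c/x$ for all large $x$, then

\[
\mathbb{E}[\text{payout}] \;=\; \sum_{x} x\,p(x) \;\ge\; \sum_{x} x \cdot \frac{c}{x} \;=\; \sum_{x} c \;=\; \infty,
\]

so arbitrarily large promised payouts dominate the calculation and that EV agent is muggable. If instead the tail falls off fast enough for the sum to converge (for instance $p(x) \propto x^{-(2+\varepsilon)}$), a mugger promising an astronomical $x$ contributes only a vanishing term, and that EV agent passes the check.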

Answer by james.lucassen

First and most important thing that I want to say here is that fanaticism is sufficient for longtermism, but not necessary. The ">10^36 future lives" thing means that longtermism would be worth pursuing even on fanatically low probabilities - but in fact, the state of things seems much better than that! X-risk is badly neglected, so it seems like a longtermist career should be expected to do much better than reducing X-risk by 10^-30% or whatever the break-even point is.

Second thing is that Pascal's Wager in particular kind of shoots itself in the foot ... (read more)

Richard_Kennaway
This has the problem that you have no assurance that the distribution does drop off sufficiently fast. It would be convenient if it did, but the world is not structured for anyone's convenience.
Yitz
I absolutely agree that fanaticism isn’t necessary for longtermism; my question is for those few who are “fanatics,” how do they resolve that sort of thing consistently.

My best guess at mechanism:

  1. Before, I was a person who prided myself on succeeding at marshmallow tests. This caused me to frame work as a thing I want to succeed on, and work too hard.
  2. Then, I read Meaningful Rest and Replacing Guilt, and realized that often times I was working later to get more done that day, even though it would obviously be detrimental to the next day. This makes the reverse marshmallow test dynamic very intuitively obvious.
  3. Now I am still a person who prides myself on my marshmallow prowess, but hopefully I've internalized an externality
... (read more)

When you have a self-image as a productive, hardworking person, the usual Marshmallow Test gets kind of reversed. Normally, there's some unpleasant task you have to do which is beneficial in the long run. But in the Reverse Marshmallow Test, forcing yourself to work too hard makes you feel Good and Virtuous in the short run but leads to burnout in the long run. I think conceptualizing of it this way has been helpful for me.

Yes!  I am really interested in this sort of dynamic; for me things in this vicinity were a big deal I think.  I have a couple half-written blog posts that relate to this that I may manage to post over the next week or two; I'd also be really curious for any detail about how this seemed to be working psychologically in you or others (what gears, etc.).  

I have been using the term "narrative addiction" to describe the thing that in hindsight I think was going on with me here -- I was running a whole lot of my actions off of a backchain from a... (read more)

Nice post!

perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.

I don't think this gets at the core difficulty of speed priors not generalizing well. Let's say we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don't generalize to our testing set. The majority of the models that pass our check are going to be basically the same as the rest, plus whatever modification that causes them to ... (read more)

Charlie Steiner
I think this is a little bit off. The world doesn't have a True Distribution, it's just the world. A more careful treatment would involve talking about why we expect Solomonoff induction to work well, why the speed prior (as in universal search prior) also works in theory, and what you think might be different in practice (e.g. if you're actually constructing a program with gradient descent using something like "description length" or "runtime" as a loss).

In general, I'm a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled "world model" is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.

Steven Byrnes
I for one am moderately optimistic that the world-model can actually remain “just” a world-model (and not a secret deceptive world-optimizer), and that the value function can actually remain “just” a value function (and not a secret deceptive world-optimizer), and so on, for reasons in my post Thoughts on safety in predictive learning—particularly the idea that the world-model data structure / algorithm can be relatively narrowly tailored to being a world-model, and the value function data structure / algorithm can be relatively narrowly tailored to being a value function, etc.
Evan R. Murphy
Since LeCun's architecture is together a kind of optimizer (I agree with Algon that it's probably a utility maximizer) then the emergence of additional mesa-optimizers seems less likely. We expect optimization to emerge because it's a powerful algorithm for SGD to stumble on that outcompetes the alternatives. But if the system is already an optimizer, then where is that selection pressure coming from to make another one?

I'm pretty nervous about simulating unlikely counterfactuals because the Solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only receive a small sliver of pro... (read more)

Adam Jermyn
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”. Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?

  • Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous - we can't inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
  • The patient research strategy is a bit weird, because t
... (read more)
Adam Jermyn
I’m worried about running HCH because it seems likely that in worlds that can run HCH people are not sufficiently careful to restrict GPU access and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.
Adam Jermyn
I think I basically agree re: honeypots. I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.

Thanks! Edits made accordingly. Two notes on the stuff you mentioned that isn't just my embarrassing lack of proofreading:

  • The definition of optimization used in Risks From Learned Optimization is actually quite different from the definition I'm using here. They say: 

    "a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system."

    I perso
... (read more)

Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that's probably not the fastest way to learn them and contribute original stuff.

The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.

This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?

Adam Zerner
I don't think it's quite the same problem. Actually I think it's pretty different. This post tries to address the problem that people are hesitant to ask potentially "dumb" questions by making it explicit that this is the place to ask any of those questions. StackExchange tries to solve the problem of having a timeless place to ask and answer questions and to refer to such questions. It doesn't try to solve the first problem of welcoming potentially dumb questions, and I think that that is a good problem to try to solve. For that second problem, LessWrong does have Q&A functionality, as well as things like the wiki.

Agree it's hard to prove a negative, but personally I find the following argument pretty suggestive:

"Other AGI labs have some plans - these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter."

Evan R. Murphy
Here's the thing I'm stuck on lately. Does it really follow from "Other AGI labs have some plans - these are the plans we think are bad" that some drastic and violent-seeming plan like burning all the world's GPUs with nanobots is needed? I know Eliezer tried to settle this point with "4. We can't just 'decide not to build AGI'", but it seems like the obvious kinds of 'pivotal acts' needed are much more boring and less technological than he believes, e.g. having conversations with a few important people, probably the leadership at top AI labs. Some people seem to think this has been tried and didn't work. And I suppose I don't know the extent to which this has been tried, as the participants in any meetings that have been had with leadership at the AI labs probably aren't at liberty to talk about them. But it just seems like there should be hundreds of different angles, asks, pleads, compromises, bargains etc. with different influential people before it would make sense to conclude that the logical course of action is "nanobots".

Proposed toy examples for G:

  • G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b- can open a-, but only a very specific key+manipulator combo opens a+. a+ is much more informative about successful b than a- is.
  • G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.

it doesn't work if your goal is to find the optimal answer, but we hardly ever want to know the optimal answer, we just want to know a good-enough answer.

Also not an expert, but I think this is correct

Answer by james.lucassen

Paragraph:

When a bounded agent attempts a task, we observe some degree of success. But the degree of success depends on many factors that are not "part of" the agent - outside the Cartesian boundary that we (the observers) choose to draw for modeling purposes. These factors include things like power, luck, task difficulty, assistance, etc. If we are concerned with the agent as a learner and don't consider knowledge as part of the agent, factors like knowledge, skills, beliefs, etc. are also externalized. Applied rationality is the result of attempting to d

... (read more)

This leans a bit close to the pedantry side, but the title is also a bit strange when taken literally. Three useful types (of akrasia categories)? Types of akrasia, right, not types of categories?

That said, I do really like this classification! Introspectively, it seems like the three could have quite distinct causes, so understanding which category you struggle with could be important for efforts to fix it.

Props for first post!

Dambi
Oh, oops. I added the "categories" as panic-editing after the first comment. I have now returned it to the original (vague) title. Seems like a good time to use the "English is not my native language" excuse. Thanks! I hope it helps you in the future.

Trying to figure out what's being said here. My best guess is two major points:

  • Meta doesn't work. Do the thing, stop trying to figure out systematic ways to do the thing better, they're a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
  • A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it's what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it's useless.
lc
The post doesn't spend much of its time making specific criticisms because specific criticism of this patronage system would indict OP for attempting to participate in it. This hampers its readability.

Ah, gotcha. I think the post is fine, I just failed to read.

If I now correctly understand, the proposal is to ask an LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:

  • Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLMs aren't trained to simulate humans, they're trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values, even if it has learned (in a Latent Knowledge sense) that humans really
... (read more)
Yitz
It still does honestly seem way more likely to not kill us all than a paperclip-optimizer, so if we're pressed for time near the end, why shouldn't we go with this suggestion over something else?
Answer by james.lucassen

The key thing here seems to be the difference between understanding  a value and having that value. Nothing about the fragile value claim or the Orthogonality thesis says that the main blocker is AI systems failing to understand human values. A superintelligent paperclip maximizer could know what I value and just not do it, the same way I can understand what the paperclipper values and choose to pursue my own values instead.

Your argument is for LLMs' understanding human values, but that doesn't necessarily have anything to do with the values that they... (read more)

I think you’re misunderstanding my point, let me know if I should change the question wording.

Assume we’re focused on outer alignment. Then we can provide a trained regressor LLM as the utility function, instead of e.g. "maximize paperclips". So understanding and valuing are synonymous in that setting.

now this is how you win the first-ever "most meetings" prize

Logan Riggs
Haha, yeah I won some sort of prize like that. I didn't know it because I left right before they announced it, to go take a break from all those meetings!

Agree that this is definitely a plausible strategy, and that it doesn't get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:

  • How did we get here?
    • If I had to guess, the lack of discussion on this seems likely due to a founder effect. The people pulling the alarm in the early days of AGI safety concerns were disproportionately on the technical/philosophical side rather than the policy/outreach/activism side.
    • In earl
... (read more)
Adrià Garriga-alonso
Also, the people pulling the alarm in the early days of AGI safety concerns are also people interested in AGI. They find it cool. I get the impression that some of them think aligned people should also try to win the AGI race, so doing capabilities research and being willing to listen to alignment concerns is good. (I disagree with this position and I don't think it's a strawman, but it might be a bit unfair.) Many of the people that got interested in AGI safety later on also find AGI cool, or have done some capabilities research (e.g. me), so thinking that what we've done is evil is counterintuitive.
lc

You should submit this to the Future Fund's ideas competition, even though it's technically closed. I'm really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I've done a more detailed brainstorm.

Probably a good idea, though I'm less optimistic about the form being checked. I'll plan on writing something up today. If I don't end up doing that today for whatever reason, akrasia, whatever, I'll DM you.

I don't think I understand how the scorecard works. From:

[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?

If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, ... (read more)

Steven Byrnes
The categories are hardcoded, the function-that-assigns-a-score-to-a-category is learned. Everybody has a goosebumps predictor, everyone has a grimacing predictor, nobody has a debt predictor, etc.

Think of a school report card: everyone gets a grade for math, everyone gets a grade for English, etc. But the score-assigning algorithm is learned. So in the report card analogy, think of a math TA (= Teaching Assistant = Thought Assessor) who starts out assigning math grades to students randomly, but the math professor (= Steering Subsystem) corrects the TA when its assigned score is really off-base. Gradually, the math TA learns to assign appropriate grades by looking at student tests. In parallel, there’s an English class TA (= Thought Assessor), learning to assign appropriate grades to student essays based on feedback from the English professor (= Steering Subsystem). The TAs (Thought Assessors) are useless at the start, but the professors aren't.

Back to biology: if you get shocked, then the Steering Subsystem says to the “freezing in fear” Thought Assessor: “Hey, you screwed up, you should have been sending a signal just now.” The professors are easy to hardwire because they only need to figure out the right answer in hindsight. You don't need a learning algorithm for that.

What do you think about the effectiveness of the particular method of digital decluttering recommended by Digital Minimalism? What modifications would you recommend? Ideal duration?

One reason I have yet to do a month-long declutter is because I remember thinking something like "this process sounds like something Cal Newport just kinda made up and didn't particularly test, my own methods that I think of for me will probably be better than Cal's method he thought of for him".

So far my own methods have not worked.

mingyuan
Kurt Brown (mentioned in the post) did an experiment on this, helping residents of CEEALAR (formerly the EA Hotel) do their own Newport-style digital declutter; you can read his preliminary writeup here.
GeneSmith
This post is at least one more data point that Cal Newport’s method worked for someone else.