Jeremy Gillen

I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.

Posts

6Jeremy Gillen's Shortform

31Context-dependent consequentialism

20d

160Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

10mo

172Thomas Kwa's MIRI research experience

37AISC team report: Soft-optimization, Bayes and Goodhart

117Soft optimization makes the value target bigger

76Neural Tangent Kernel Distillation

37Inner Alignment via Superpowers

59Finding Goals in the World Model

76The Core of the Alignment Problem is...

Wiki Contributions

Comments

lemonhope's Shortform

Jeremy Gillen2d39

I agree this would be a great program to run, but I want to call it a different lever to the one I was referring to.

The only thing I would change is that I think new researchers need to understand the purpose and value of past agent foundations research. I spent too long searching for novel ideas while I still misunderstood the main constraints of alignment. I expect you'd get a lot of wasted effort if you asked for out-of-paradigm ideas. Instead it might be better to ask people to understand and build on past agent foundations research, then gradually move away if they see other pathways after having understood the constraints. Now I see my work as mostly about trying to run into constraints for the purpose of better understanding them.

Maybe that wouldn't help though, it's really hard to make people see the constraints.

lemonhope's Shortform

Jeremy Gillen3d53

The main thing I'm referring to is upskilling or career transition grants, especially from LTFF, in the last couple of years. I don't have stats; I'm assuming there were a lot given out because I met a lot of people who had received them. Probably there were a bunch given out by the FTX Future Fund also.

Also when I did MATS, many of us got grants post-MATS to continue our research. Relatively little seems to have come of these.

How are they falling short?

(I sound negative about these grants but I'm not, and I do want more stuff like that to happen. If I were grantmaking I'd probably give many more of some kinds of safety research grant. But "If a man has an idea just give him money and don't ask questions" isn't the right kind of change imo).

lemonhope's Shortform

Jeremy Gillen3d3117

I think I disagree. This is a bandit problem, and grantmakers have tried pulling that lever a bunch of times. There hasn't been any field-changing research (yet). They knew it had a low chance of success so it's not a big update. But it is a small update.

Probably the optimal move isn't cutting early-career support entirely, but having a higher bar seems correct. There are other levers that are worth trying, and we don't have the resources to try every lever.

Also there are more grifters now that the word is out, so the EV is also declining that way.

(I feel bad saying this as someone who benefited a lot from early-career financial support).

What are the good rationality films?

Answer by Jeremy GillenNov 20, 202470

My first exposure to rationalists was a Rationally Speaking episode where Julia recommended the movie Locke.

It's about a man pursuing difficult goals under emotional stress using few tools. For me it was a great way to be introduced to rationalism because it showed how a ~rational actor could look very different from a straw Vulcan.

It's also a great movie.

"It's a 10% chance which I did 10 times, so it should be 100%"

Jeremy Gillen6d164

Nice.

A similar rule of thumb I find handy: divide 70 by the growth rate (in percent) to get the doubling time it implies. I find it way easier to think about doubling times than growth rates.

E.g. a 3% interest rate means a 70/3 ≈ 23 year doubling time.
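To see how close the rule of thumb gets, here is a minimal sketch comparing it with the exact doubling time under compound growth, ln(2) / ln(1 + r). The function names are made up for illustration, and it assumes the rate is given in percent per period and compounds once per period. The rule works because ln(2) ≈ 0.693 and ln(1 + r) ≈ r for small r.

```python
import math


def doubling_time_exact(rate_percent: float) -> float:
    """Exact number of periods to double at a compound growth rate (in percent per period)."""
    return math.log(2) / math.log(1 + rate_percent / 100)


def doubling_time_rule_of_70(rate_percent: float) -> float:
    """Rule-of-thumb doubling time: 70 divided by the growth rate in percent."""
    return 70 / rate_percent


if __name__ == "__main__":
    # Compare the two estimates for a few illustrative growth rates.
    for rate in (1, 3, 7, 10):
        exact = doubling_time_exact(rate)
        approx = doubling_time_rule_of_70(rate)
        print(f"{rate}%: exact {exact:.2f} periods, rule of 70 gives {approx:.2f}")
```

The approximation is tightest at low rates and drifts as the rate grows, since ln(1 + r) falls further below r.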

Thoughts after the Wolfram and Yudkowsky discussion

Jeremy Gillen9d20

I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms.

I would bet against Eliezer being pessimistic about this, if we are assuming the algorithms are deeply-understood enough that we are confident that we can iterate on building AGI. I think there's maybe a problem with the way Eliezer communicates that gives people the impression that he's a rock with "DOOM" written on it.

I think the pessimism comes from there being several currently-unsolved problems that get in the way of "deeply-understood enough". In principle it's possible to understand these problems and hand-build a safe and stable AGI, it just looks a lot easier to hand-build an AGI without understanding them all, and even easier than that to train an AGI without even thinking about them.

I call most of these "instability" problems: the AI might, for example, learn more, or think more, or self-modify, and each of these can shift the context in a way that causes an imperfectly designed AI to pursue unintended goals.

Here are some descriptions of problems in that cluster: optimization daemons, ontology shifts, translating between our ontology and the AI's internal ontology in a way that generalizes, Pascal's mugging, reflectively stable preferences & decision algorithms, reflectively stable corrigibility, and correctly estimating future competence under different circumstances.

Some may be resolved by default along the way to understanding how to build AGI by hand, but it isn't clear. Some are kinda solved already in some contexts.

Evaluating Stability of Unreflective Alignment

Jeremy Gillen10d30

Intelligence/IQ is always good, but not a dealbreaker as long as you can substitute it with a larger population.

IMO this is pretty obviously wrong. There are some kinds of problem solving that scale poorly with population, just as there are some computations that scale poorly with parallelisation.

E.g. Project Euler problems.

When I said "problems we care about", I was referring to a cluster of problems that very strongly appear to not scale well with population. Maybe this is an intuitive picture of the cluster of problems I'm referring to.

Context-dependent consequentialism

Jeremy Gillen10d20

I buy that such an intervention is possible. But doing it requires understanding the internals at a deep level. You can't expect SGD to implement the patch in a robust way. The patch would need to still be working after 6 months on an impossible problem, in spite of it actively getting in the way of finding the solution!

Evaluating Stability of Unreflective Alignment

Jeremy Gillen10d20

I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.

Context-dependent consequentialism

Jeremy Gillen11d20

I was probably influenced by your ideas! I just (re?)read your post on the topic.

Tbh I think it's unlikely such a sweet spot exists, and I find your example unconvincing. The value of this kind of reflection for difficult problem solving directly conflicts with the "useful" assumption.

I'd be more convinced if you described a task where you expect an AI to be useful (significantly above current humans), and which doesn't involve failing and reevaluating high-level strategy every now and then.