Thomas Kwa

Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.

Sequences

Catastrophic Regressional Goodhart

Comments

The title of this dialogue promised a lot, but I'm honestly a bit disappointed by the content. It feels like the authors are discussing exactly how to run particular mentorship programs, how to structure grants, and how research works in full generality, while no one actually engages with the technical problems. All field-building efforts must depend on the importance and tractability of technical problems, and this is just as true when the field is still developing a paradigm. I think a paradigm is established only when researchers with many viewpoints build a sense of which problems are important, then try many approaches until one successfully solves many such problems, thus proving the value of that approach. Wanting to find new researchers who will have totally new takes and start totally new, illegible research agendas reflects a level of helplessness that I think is unwarranted: how can one be interested in agent foundations without some view on which problems are interesting?

I would be excited about a dialogue that goes like this, though the format need not be rigid:

  • What are the most important [1] problems in agent foundations, with as much specificity as possible?
    • Responses could include things like:
      • A sound notion of "goals with limited scope": we can't nail down precise desiderata yet, but humans have such goals all the time, we lack a formal account of them, and one could be useful for corrigibility or impact measures.
      • Finding a mathematical model for agents that satisfies the properties of logical inductors while also meeting various other desiderata
      • Further study of corrigibility and capability of agents with incomplete preferences (one way to formalize "incomplete preferences" is sketched just below this list)
    • Participants discuss how much each problem scratches their itch of curiosity about what agents are.
  • What techniques have shown promise in solving these and other important problems?
    • Does [infra-Bayes, Demski's frames on embedded agents, some informal 'shard theory' thing, ...] have a good success-to-complexity ratio?
      • probably none of them do?
  • What problems would benefit the most from people with [ML, neuroscience, category theory, ...] expertise?

[1]: (in the Hamming sense that includes tractability)
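
To make the incomplete-preferences item concrete (my sketch, not something from the dialogue): one standard formalization takes a preference relation $\succsim$ on outcomes $X$ that is reflexive and transitive but not complete, i.e.

$$\exists\, x, y \in X:\ \neg(x \succsim y) \wedge \neg(y \succsim x).$$

Since completeness is one of the von Neumann-Morgenstern axioms, such an agent need not maximize any single utility function; a preorder instead admits a multi-utility representation by a set $\mathcal{U}$ of utility functions, with $x \succsim y$ iff $u(x) \ge u(y)$ for every $u \in \mathcal{U}$. The open questions are then about how agents choosing under such a relation behave, e.g. whether they can remain corrigible without being money-pumped.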

Does MIRI have a statement on recent OpenAI events? I'm pretty excited about frank reflections on current events as a way to help people orient.

I agree with points 1-3, and with the idea that superintelligences will be able to understand our values (which I think everyone believes). But the conclusion needs a bunch of additional assumptions.

I suggest people read both that and Deep Deceptiveness (which is not about deceptiveness in particular) and think about how both could be valid, because I think they both are.

Prerat: Everyone should have a canary page on their website that says “I’m not under a secret NDA that I can’t even mention exists” and then if you have to sign one you take down the page.

Does this work? Sounds like a good idea.
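
For concreteness, here's a minimal sketch of how one might monitor such a canary page automatically (the URL and exact wording are hypothetical placeholders, not an established convention):

```python
# Minimal sketch of an automated canary check (illustrative; the URL and
# wording below are hypothetical placeholders, not an established convention).
import urllib.request

CANARY_URL = "https://example.com/canary"  # hypothetical location of the canary page
CANARY_TEXT = "I am not under any secret NDA whose existence I cannot mention."

def canary_present(url: str = CANARY_URL) -> bool:
    """Return True iff the canary page is reachable and still contains the statement."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception:
        # Page removed or unreachable: treat the canary as withdrawn.
        return False
    return CANARY_TEXT in body

if __name__ == "__main__":
    print("canary intact" if canary_present() else "canary withdrawn (or unreachable)")
```

The mechanism itself is trivial; the open part of "does this work?" seems to be the legal question of whether silently removing a page is actually safer than affirmatively denying the NDA.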

The third one was a typo, which I just fixed. I have also changed it to use "base policy" everywhere for consistency, though this may change depending on which terminology is most common in an ML context; I'm not sure what that is.
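
For what it's worth, the nearest standard ML usage I know of is the fixed reference policy in KL-regularized objectives; a sketch of that usage (not a claim that this is the most common terminology):

$$\max_{\pi}\ \mathbb{E}_{x \sim \pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{base}}),$$

where $\pi_{\text{base}}$ is the pretrained policy that the optimized policy $\pi$ is regularized toward and $\beta$ sets the strength of the penalty. In that literature $\pi_{\text{base}}$ is also called the reference policy.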

I have strong-downvoted without reading most of this post, because the author appears to be trying to make something harmful for the world.

Then I think you should specify that progress within this single innovation could be continuous over years and include 10+ ML papers in sequence, each developing some sub-innovation.

I think it's unlikely that only a single innovation is left before LTPA can be created, because that would run contrary to the history of technology and of machine learning. For example, in the 10 years before AlphaGo and before GPT-4, several different innovations were required, and that's if you count "deep learning" as one item. The ChatGPT example actually understates the number here, because different components of the transformer architecture, like attention, residual connections, and the transformer++ innovations, were all developed separately.
