All of Lucas Teixeira's Comments + Replies

Interesting. Curious to know what your construction ended up looking like and I'm looking forward to reading the resulting proof!

Some proofs along these lines already exist [1,2], though they seem to have minor problems.

This is a good critical review of the literature.

3Lucius Bushnaq
Yeah, the difference between what those papers show and what I need turned out to be a lot bigger than I thought. I ended up making my own construction instead. This actually turned out to be the most time consuming part of the whole proof. The other steps were about as straightforward as they looked.

so here you go, I made this for you

I don't see a flow chart

[This comment is no longer endorsed by its author]
3Linda Linsefors
This comment has two disagree votes, which I interpret as other people seeing the flowchart. I see it too. If it still doesn't work for you for some reason, you can also see it here: AISC ToC Graph - Google Drawings

Strong upvote. Very clearly written and communicated. I've recently been thinking about digging deeper into this paper, with the hope of potentially relating it to some recent causality-based interpretability work, and reading this distillation has accelerated my understanding of the paper. Looking forward to the rest of the sequence!

Phi-4 is highly capable not despite but because of synthetic data.

Imitation models tend to be quite brittle outside of their narrowly imitated domain, and I suspect the same to be the case for phi-4. Some of the decontamination measures they took provide some counterevidence to this, but not much. I'd update more strongly if I saw results on benchmarks which contained the generality and diversity of tasks required to do meaningful autonomous cognitive labour "in the wild", such as SWE-Bench (or rather what I understand SWE-Bench to be, I have yet t... (read more)

I'm curious how these claims relate to what's proposed by this paper. (note, I haven't read either in depth)

I'm curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly dissimilar from/illegible to prevailing scientific and engineering practice.

I have a hard time imagining scientists like e.g. Darwin, Carnot, or Shannon describing their work as depending much on "immediate feedback loops with present day" systems.


Thanks for the comment @Adam Scholl and apologies for not addressing it sooner, it was on my list but then time flew. I... (read more)

Lucas Teixeira

Why are you sure that effective "evals" can exist even in principle?

Relatedly, the point which is least clear to me is what exactly it would mean to solve the "proper elicitation problem", and what exactly the "requirements" laid out by the blue line on the graph are. I think I'd need to get clear on this problem scope before beginning to assess whether this elicitation gap can even in principle be crossed via the methods being proposed (i.e. better design & coverage of black box evaluations).

As a non-example, possessing the kind of foun... (read more)

3Marius Hobbhahn
These are all good points. I think there are two types of forecasts we could make with evals:

1. Strict guarantees: almost like mathematical predictions, where we can prove that the model is not going to behave in a specific way even with future elicitation techniques.
2. Probabilistic predictions: we predict a distribution of capabilities or a range and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently.

I think the second is achievable (and this is what the post is about), while the first is not. I expect we will have some sort of detailed scaling laws for LM agent capabilities, and we will have a decent sense of the algorithmic progress of elicitation techniques. This would allow us to make a probabilistic prediction about what capabilities any given model is likely to have, e.g. if a well-motivated actor is willing to spend $1M on PTE in 4 years.

Additionally, I expect that we would get quite a long way with what Lucas calls "meta-evaluative practices", e.g. getting a better sense of how wrong our past predictions were and accounting for that. I think this could have the form of "We invested $1M, 10 FTE-years, and X FLOP to elicit the best capabilities; let's predict what 10x, 100x, 1000x, etc. of that could achieve," accounting for algorithmic progress.

Finally, I really think evals are just one part of a bigger defense-in-depth strategy. We still need control, scalable oversight, interpretability, governance, etc. The post is merely trying to express that, for the evals part of that strategy, we should internalize what kind of scientific rigor we will likely need for the decisions we have tied to evals results, and make sure that we can achieve it.
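The probabilistic-prediction approach described above can be sketched as a simple threshold check. Everything here is hypothetical for illustration: the predicted scores, the normal approximation, and the threshold value are all stand-ins, not anything from the post.

```python
import statistics

# Hypothetical forecast: predicted benchmark scores (0-100) for a future
# model under increasing elicitation effort, e.g. sampled from fitted
# scaling-law extrapolations. These numbers are made up for illustration.
predicted_scores = [62.0, 64.5, 66.0, 67.5, 70.0, 71.0, 73.5, 74.0, 76.5, 78.0]

mean = statistics.mean(predicted_scores)
stdev = statistics.stdev(predicted_scores)  # sample standard deviation

# One-sided 95% upper bound under a normal approximation (z = 1.645).
upper_95 = mean + 1.645 * stdev

# Hypothetical "do not cross" capability level agreed on in advance.
CAPABILITY_THRESHOLD = 85.0

if upper_95 >= CAPABILITY_THRESHOLD:
    print(f"upper bound {upper_95:.1f} crosses threshold: stricter treatment")
else:
    print(f"upper bound {upper_95:.1f} below threshold: standard treatment")
```

The substance of the proposal is in the forecasting model that produces the distribution, not in this comparison; the point of the sketch is only that the decision rule itself is simple once a distribution and threshold are agreed on.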

Re "big science": I'm not familiar with the term, so I'm not sure what the exact question being asked is. I am much more optimistic in worlds where we have large-scale coordination amongst expert communities. As for the relationship between governments, firms, and academia, I'm still developing my gears around this. Jade Leung's thesis seems to have an interesting model, but I have yet to dig very deep into it.

Hey Ryan, thank you for your support and for the thoughtful write-up! It’s very useful for us to see what the alignment community at large, and our supporters specifically, think of our work. I’ll respond to the point on “pivoting away from blue sky research” here and let Dušan address the other reservations in a separate comment.

As Nora has already mentioned, different people hold different notions on what it means to “keep it weird” and conduct “blue sky” and/or “non-paradigmatic” research. But in as far as this cluster of terms is pointing at research which ... (read more)

Given both my personal experience with LLMs and my reading of the role that empirical engagement has historically played in non-paradigmatic research, I tend to advocate for a methodology which incorporates immediate feedback loops with present day deep learning systems over the classical "philosophy -> math -> engineering" deconfusion/agent foundations paradigm.

I'm curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly di... (read more)

For clarity, how do you distinguish between P1 & P4?

1particlemania
First of all, these are all meant as very rough attempts at demarcating research tastes. It seems possible to be aiming to solve P1 without thinking much of P4, if a) you advocate a ~Butlerian pause, or b) you are working on aligned paternalism as the target behavior (where AI(s) are responsible for keeping humans happy, and humans have no residual agency or autonomy remaining). Also, a lot of people who focus on the problem from a P4 perspective tend to focus on the human-AI interface, where most of the relevant technical problems lie, but this might reduce their attention to issues of mesa-optimizers or emergent agency, despite the massive importance of those issues to their project in the long run.

It's unclear to me:
(1) what you consider the Yudkowskian argument for FOOM to be, and

(2) which of the premises in the argument you find questionable.

-1Aleksey Bykhun
To (2): (a) simulators are not agents, and (b) mesa-optimizers are still "aligned".

(a) See the amazing https://astralcodexten.substack.com/p/janus-simulators post: a utility function is the wrong way to think about intelligence; humans themselves don't have any utility function, even the most rational ones.

(b) The only example of mesa-optimization we have is evolution, and even that succeeds in alignment. People still want to have kids for the sake of having kids, and evolution's biggest objective (thrive and proliferate) is being executed quite well, even "outside the training distribution". Yes, there are local counterexamples, but if we look at the causes and consequences, we're at 8 billion already, effectively destroying or enslaving all the other DNA reproducers.
2DragonGod
A while ago, I tried to (badly) summarise my objections: https://www.lesswrong.com/posts/jdLmC46ZuXS54LKzL/why-i-m-sceptical-of-foom There's a lot that post doesn't capture (or only captures poorly), but I'm unable to write a good post that captures all my objections well. I mostly rant about particular disagreements as the need arises (usually in Twitter discussions).

I would like to say that there's a study group being formed in the AI Alignment Slack server with similar intentions! If you are not a part of that server and would like to join, feel free to email me at melembroucarlitos@gmail.com telling me a bit about yourself and your hopes and intentions and I'll send you an invite.

1Kay Kozaronek
Thanks for sharing, Lucas, I'll shoot you a message. What's your part in all of this? Are you a learner too?
4PabloAMC
Alternatively you may want to join here: https://join.slack.com/t/ai-alignment/shared_invite/zt-fkgwbd2b-kK50z~BbVclOZMM9UP44gw