Co-Director (Research) at PIBBSS
Previously: Applied epistemologist and Research Engineer at Conjecture.
so here you go, I made this for you
I don't see a flow chart
Strong upvote. Very clearly written and communicated. I've been recently thinking about digging deeper into this paper with the hopes of potentially relating it to some recent causality based interpretability work and reading this distillation has accelerated my understanding of the paper. Looking forward to the rest of the sequence!
Phi-4 is highly capable not despite but because of synthetic data.
Imitation models tend to be quite brittle outside of their narrowly imitated domain, and I suspect the same to be the case for phi-4. Some of the decontamination measures they took provide some counter evidence to this but not much. I'd update more strongly if I saw results on benchmarks which contained in them the generality and diversity of tasks required to do meaningful autonomous cognitive labour "in the wild", such as SWE-Bench (or rather what I understand SWE-Bench to be, I have yet to play very closely with it).
Phi-4 is taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself.
There's an important distinction between utilizing synthetic data in teacher-student setups and utilizing synthetic data in self-teaching. While synthetic data is a demonstrably powerful way of augmenting human feedback, my current estimation is that typical mode collapse arguments still hold for self-generated, purely synthetic datasets, and that phi-4 doesn't provide counter-evidence against this.
I'm curious how these claims relate to what's proposed by this paper. (note, I haven't read either in depth)
I'm curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly dissimilar from/illegible to prevailing scientific and engineering practice.
I have a hard time imagining scientists like e.g. Darwin, Carnot, or Shannon describing their work as depending much on "immediate feedback loops with present day" systems.
Thanks for the comment @Adam Scholl and apologies for not addressing it sooner, it was on my list but then time flew. I think we're in qualitative agreement that non-paradigmatic research tends to have empirical feedback loops, and that the forms and methods of empirical engagement undergo qualitative changes in the formation of paradigms. I suspect we may have quantitative disagreements with how illegible these methods were to previous practitioners, but I don't expect that to be super cruxy.
The position which I would argue against is that the issue of empirical access to ASI necessitates long bouts of philosophical thinking prior to empirical engagement and theorization. The position which I would argue for is that there is significant (and depending on the crowd undervalued) benefit to be gained for conceptual innovation by having research communities which value quick and empirical feedback loops. I'm not an expert on either of these historical periods, but I would be surprised to hear that Carnot or Shannon did not meaningfully benefit from engaging with the practical industrial advancements of their day.
Giving my full models is out of scope for a comment and would take a sequence which I'll probably never write, but the 3 history and philosophy of science references which have had the greatest impact on my thinking around empiricism which I tend to point people towards would probably be Inventing Temperature, Exploratory Experiments, and Representing and Intervening.
So I'm curious whether you think PIBBSS would admit researchers like these into your program, were they around and pursuing similar strategies today?
In short I would say yes, because I don't believe the criteria listed above exclude the researchers which you called attention to. But independently of whether you buy into that claim, I would stress that different programs have different mechanisms of admission. The affiliateship as it's currently being run is designed for lower variance and is incidentally more tightly correlated with the research tastes of myself and the horizon scanning team given that these are the folks providing the support for it. The summer fellowship is designed for higher variance and goes through a longer admission process involving a selection committee, with the final decisions falling on mentors.
Why are you sure that effective "evals" can exist even in principle?
Relatedly, the point which is least clear to me is what exactly would it mean to solve the "proper elicitation problem" and what exactly are the "requirements" laid out by the blue line on the graph. I think I'd need to get clear on this problem scope before beginning to assess whether this elicitation gap can even in principle be crossed via the methods which are being proposed (i.e. better design & coverage of black box evaluations).
As a non-example, possessing the kind of foundational scientific understanding which would allow someone to confidently say "We have run this evaluation suite and we now know once and for all that this system is definitely not capable of x, regardless of whatever elicitation techniques are developed in the future" seems to me to be Science-of-AI-complete and is thus a non-starter as a north star for an agenda aimed at developing stronger inability arguments.
When I fast forward the development of black box evals aimed at supporting inability arguments, I see us arriving at a place where we have:
Which would allow us to make the claim "Given these trends in PTEs, and this coverage in evaluations, experts have vibed out that the probability of this model being capable of producing catastrophe x is under an acceptable threshold" for a wider range of domains. To be clear, that's a better place than we are now and something worth striving for, but not something which I would qualify as "having solved the elicitation problem". There are fundamental limitations to the kinds of claims which black box evaluations can reasonably support, and if we are to posit that the "elicitation gap" is solvable, it needs to have the right sorts of qualifications, amendments and hedging such that it's on the right side of this fundamental divide.
Note, I don't work on evals and expect that others have better models than this. My guess is that @Marius Hobbhahn has strong hopes on the field developing more formal statistical guarantees and other meta-evaluative practices as outlined in the references in the science of evals post, and would thus predict a stronger safety case sketch than the one laid out in the previous paragraph, but what the type signature of that sketch would be, and consequently how reasonable this sketch is given fundamental limitations of black box evaluations, is currently unclear to me.
Re "big science": I'm not familiar with the term, so I'm not sure what the exact question being asked is. I am much more optimistic in the worlds where we have large scale coordination amongst expert communities. If the question is about the relationship between governments, firms and academia, I'm still developing my gears around this. Jade Leung's thesis seems to have an interesting model but I have yet to dig very deep into it.
Hey Ryan, thank you for your support and for the thoughtful write-up! It’s very useful for us to see what the alignment community at large, and our supporters specifically, think of our work. I’ll respond to the point on “pivoting away from blue sky research” here and let Dušan address the other reservations in a separate comment.
As Nora has already mentioned, different people hold different notions on what it means to “keep it weird” and conduct “blue sky” and/or “non-paradigmatic” research. But in as far as this cluster of terms is pointing at research which is (a) aimed at innovating novel conceptual frames and (b) free from compromising pressures of short-term applications, then I would say that this is still the central focus of PIBBSS and that recent developments should be seen as updates to the founding vision, as opposed to full on departures.
The main technical bet in my reading of the PIBBSS founding mission (which people are free to disagree with; I’m curious about the ways in which they do) is that one can overcome the problem of epistemic access by leveraging insights from present day physically instantiated proxies. Current day deep learning systems are impressive, and arguably stronger approximations to the kinds of AGI/ASI which we are concerned with, but they’re still proxies nonetheless and failing to treat them as such tends towards a set of associated failure cases.
Given both my personal experience with LLMs and my reading of the role that empirical engagement has historically played in non-paradigmatic research, I tend to advocate for a methodology which incorporates immediate feedback loops with present day deep learning systems over the classical "philosophy -> math -> engineering" deconfusion/agent foundations paradigm. This was most strongly reflected in the first iteration of the affiliateship cohort and is present in the language of the Manifund funding memo.
With that being said, given that PIBBSS, especially the fellowship, is largely a talent intervention aiming at providing a service to the field, I don’t believe its total portfolio should be confined to the limits of my research taste and experience. Especially after MIRI’s recent pivot, I think there’s a case to be made for PIBBSS to host research which doesn’t meet my personal preferences towards quick empirical engagement.
For clarity, how do you distinguish between P1 & P4?
I see it now