I’m allowed to spend two days a week at Trajan House, a building in Oxford which houses the Centre for Effective Altruism (CEA), along with a few EA-related bodies. Two days is what I asked for, and what I received. The rest of the time I spend in the Bodleian Library of the University of Oxford (about £30/year, if you can demonstrate an acceptable “research need”), at a desk in a coworking space in Ethical Property (which houses Refugee Welcome, among other non-EA bodies, for £200/month), Common Ground (a cafe/coworking space which I’ve recommended to people as a place where the staff explicitly explain, if you ask, that you don’t need to order anything to stay as long as you like), a large family house I’m friends with, and various cafes and restaurants where I can sit for hours while only drinking mint tea.

I’m allowed to use the hot-desk space at Trajan House because I’m a recipient of an EA Long Term Future Fund grant, to research Alignment. (I call this “AI safety” to most people, and sometimes have to explain that AI stands for Artificial Intelligence.) I judged that 6 months of salary at the level of my previous startup job, with a small expenses budget, came to about £40,000. This is what I asked for, and what I received.

At my previous job I thought I was having a measurable, meaningful impact on climate change. When I started there, I imagined that I’d go on to found my own startup. I promised myself it would be the last time I’d be employed.

When I quit that startup job, I spent around a year doing nothing-much. I applied to Oxford’s Philosophy BPhil, unsuccessfully. I looked at startup incubators and accelerators. But mostly, I researched Alignment groups. I visited Conjecture, and talked to people from DeepMind and the Future of Humanity Institute. What I was trying to do was discern whether Alignment was “real” or not. Certainly, I decided, some of these people were cleverer than me, more hard-working than me, better-informed. Some seem deluded
Summary

* Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement.
* This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM...
We have written a paper on sandbagging, and we present its abstract and brief results in this post. See the paper for more details. Tweet thread here.

Figure: Illustration of sandbagging.

Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system...
This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...
A friend asked me for advice on how to find stuff that's going on in a place. I feel a bit self-conscious positioning myself as some kind of expert here, but I do think I've done a good job of this in Oxford. None of this is galaxy-brain stuff, just...
Pulling a child out of the path of a fast car is the right thing to do, whether or not the child agrees, understands, is grateful, or even is hurt during the rescue. Paternalistic acts like this, when we might argue that a person's straightforward consent ought to be overridden,...
TL;DR: Retreat for EMEA meetup organisers, Summer 2023, apply here

Application deadline: 31ˢᵗ Oct 2022

Why a retreat?

We think we all benefit from having an international community of rationality and ACX meetup organisers who can go to one another for advice and support. The goal of this retreat is...