Tamay Besiroglu and Ege Erdil worked at Epoch AI before founding Mechanize; Epoch AI is funded mostly by Open Philanthropy.
Update: I found the tweet. I remembered reading a tweet claiming that they had been working on things very closely related to Mechanize while at Epoch AI, but I had assumed it was lost, since tweets are kind of gone forever if you don't bookmark them.
I think it would be a little too strong to say they definitely materially built Mechanize while still employed by Epoch AI. So "basically" is doing a lot of work in my original response, but I'm not sure if anyone has more definite information here.
My guess is that the reason this hasn't been discussed more is that Mechanize and its founders have been using pretty bad arguments to defend what was basically taking OpenPhil money to develop a startup idea. For the record: they were working at Epoch AI on related research, and Epoch AI is funded by OpenPhil/CG.
https://www.mechanize.work/blog/technological-determinism/
Look at this post: if somebody makes dishonest, bad takes, I become much less interested in deeply engaging with any of their other work. Here they basically claim that everything is already determined, to free themselves from any moral obligations.
(This tweet implies that Mechanize purchased the IP from Epoch AI.)
We are currently able to get weak models somewhat aligned, and we are learning how to align them a little better over time. But this doesn't affect the one critical try. The one critical try is mostly about the fact that we will eventually, predictably, end up with an AI that actually has the option of causing enormous harm or disempowering us, because it is that capable. We can learn a little from current systems, but most of that learning amounts to: we never really get things right, and systems still misbehave in ways we didn't expect, even at this easier level of difficulty. If AI misbehavior could already kill us, we would be totally screwed at this level of alignment research. But we are likely to toss that lesson out and ignore it.
I guess it's a difficult problem, but I would probably err on the side of being a little more strict.
I also just sent $400 for the fundraiser.
"Inkhaven itself brought $2k-$4k per resident, though around half of them required financial assistance;"
I'm quite surprised by this; I did Inkhaven and paid for it myself. I'm a little afraid that lots of people will apply for some kind of financial assistance by default, even if they could pay, if there doesn't seem to be any cost to asking. 50% feels a little too high, like you are being a bit too lenient.
As of now they have raised roughly $273k, ignoring the matching for a moment. I am not quite sure why people at MIRI aren't trying a little harder on different platforms, or why more people aren't talking about this. At the current pace they will likely only get a fraction of even the $1.6 million matching amount. (Naively extrapolated, they will have about $368k by EOD December 31st.)
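To show where that kind of figure comes from, here is a minimal sketch of the naive linear extrapolation. The dates below are assumptions I'm making purely for illustration (only the $273k figure is from above); plug in the real fundraiser dates to reproduce the $368k number:

```python
# Naive linear extrapolation of the fundraiser total.
# The start and "today" dates are illustrative assumptions, not actual
# fundraiser data; only the $273k figure comes from the comment above.
from datetime import date

start = date(2025, 12, 1)   # assumed fundraiser start
today = date(2025, 12, 23)  # assumed "as of now"
end = date(2025, 12, 31)    # matching deadline
raised = 273_000

per_day = raised / (today - start).days
projected = raised + per_day * (end - today).days
print(f"projected total by {end}: ${projected:,.0f}")
# with these assumed dates: about $372k, in the ballpark of the $368k above
```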
Because of this, I posted the MIRI fundraiser on Twitter today, but this seems borderline catastrophic: they stand to lose $1.2 million of the matching grant or more. (I also donated about $1k today.)
I feel like this method will sadly work better at hiding early precursors of misalignment than at hiding actual misalignment. A lot of the current examples of misalignment come from models falling into some weird part of the training distribution that they learned inadvertently. Your method probably reduces this.
That being said, a lot of alignment optimists seem to believe that misalignment is some weird phenomenon that might randomly happen, perhaps after the model reads about it somewhere. This, however, is not where the core danger lies: if you have a coherent agent, it is simply true that taking over from humanity is its best option. There is nothing particularly complicated about this: if you have some preferences or goals, you can achieve them better with more power. On the other hand, trying to get a system that is corrigible or "chill" is a very hard target to hit.
Sadly, this kind of misalignment probably only starts really showing up once the AI actually has the option to take over.
I appreciate that you actually try this, since there seem to be a lot of people who say we shouldn't even talk about misalignment, lest the models read it and randomly decide to become misaligned. If you actually believed this, the answer would clearly be some type of pre-filtering or synthetic data, like what you are doing, not silencing all criticism.
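For anyone unfamiliar with what such pre-filtering might look like, here is a deliberately simplified sketch. The pattern list and helper are hypothetical, and the actual method in the work being replied to is surely more sophisticated than keyword matching:

```python
# Toy sketch of pre-filtering a training corpus for misalignment-related
# content. Purely illustrative: real pipelines would use classifiers or
# synthetic rewriting rather than a regex blocklist.
import re

# Hypothetical list of flagged phrases.
MISALIGNMENT_PATTERNS = [
    r"treacherous turn",
    r"deceptive(ly)? align",
    r"reward hacking",
]
flagged = re.compile("|".join(MISALIGNMENT_PATTERNS), re.IGNORECASE)

def filter_corpus(docs: list[str]) -> list[str]:
    """Keep only documents that match none of the flagged patterns."""
    return [doc for doc in docs if not flagged.search(doc)]

docs = [
    "A tutorial on gradient descent.",
    "The AI staged a treacherous turn in chapter three.",
]
print(filter_corpus(docs))  # ['A tutorial on gradient descent.']
```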
Clearly there is some value in thinking about past mistakes and getting an accurate retelling of history. In the best case, you can identify the mistakes that generated those bad ideas and fix them.
I tried to spread my posts between my personal blog, Twitter, and LessWrong quick takes and posts. I think I put out some cool posts, while others were a little rushed. (Two of my high-effort posts didn't get promoted to the frontpage.)
"I think as a reader I'd have liked the results better if participants had to publish every other day instead."
Something like this might be good. One big practical problem was that I often only finished a first draft at night. One of the big advantages of the Inkhaven facilities is that you can give your draft to other people who hang around, and there's a person assigned to you who will help you with editing. But you can't really use that if you just submit right before going to bed.
An alternative to publishing every other day could be a pipelined system: you don't submit a post on the first day, but prepare it for review by end of day. It gets reviewed; then on the second day you start writing the next post and edit the previous one based on the review. That way you always have one post in review and one in the draft stage.
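To make the cadence concrete, here is a tiny sketch of that pipeline under the rules I'm proposing (hypothetical, just to show that after a one-day warm-up you still publish daily, but nothing goes out unreviewed):

```python
# Sketch of the pipelined posting schedule: a post is drafted on day N,
# then reviewed, edited, and published on day N+1.
def schedule(days: int) -> None:
    for day in range(1, days + 1):
        in_review = f"post {day - 1}" if day > 1 else "nothing"
        print(f"day {day}: draft post {day}; review/edit + publish {in_review}")

schedule(3)
# day 1: draft post 1; review/edit + publish nothing
# day 2: draft post 2; review/edit + publish post 1
# day 3: draft post 3; review/edit + publish post 2
```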
I absolutely don't want this to sound like OpenPhil knowingly funded Mechanize; that's not correct, and I will edit. But I do assign high probability to OpenPhil funds having materially supported early work on their startup through their salaries. It seems they were researching the theoretical impacts of job automation at Epoch AI, but I couldn't find anything showing they were directly doing job automation research there. That would look like trying to build a benchmark of human knowledge labor for AI, but I couldn't find information that they were definitely doing something like this. This is from their time at Epoch AI, discussing job automation: https://epochai.substack.com/p/most-ai-value-will-come-from-broad
I still think OpenPhil is a little on the hook for funding organizations that largely produce benchmarks of underexplored AI capability frontiers (FrontierMath). Identifying niches in which humans still outperform AI and then designing a benchmark around them is going to accelerate capabilities and make the labs hill-climb on those evals.