All of Jonathan Claybrough's Comments + Replies

PSA - at least as of March 2024, the way to create a Dialogue is by navigating to someone else's profile and clicking the "Dialogue" option that appears near the right, next to the option to message someone.

I get the impression from talking to people who work professionally on reducing AI x-risk (broadly construed; this would include doing ops for FLI) that fewer than half of them could give an accurate and logically sound model of where AI x-risk comes from and why it is good that they do the work they do (bonus points if they can compare it to other work they might be doing instead). I'm generally a bit disappointed by this, because to me it doesn't seem that hard to get everyone who's a professional knowledgeable on the basics, and it seems worthwhile as more people ... (read more)

(off topic for the OP, but on topic for Jan bringing up ALERT)
To what extent do you believe Sentinel fulfills what you wanted to do with ALERT? Their emergency response team is pretty small right now. Would you recommend funders support that project or a new ALERT?

3Jan_Kulveit
For emergency response, new ALERT. Personally I think the forecasting/horizon scanning part of Sentinel is good, and the emergency response part negative in expectation. What that means for funders, I don't know; I would donate conditional on the funds being restricted to the horizon scanning part.

Appreciate the photos and final video, as they also make this informative post more enjoyable to follow through.

Epoch AI seem to be doing a lot of work that'll accelerate AI capabilities research and development (e.g. informing investors and policy makers that yes, AI is a huge economic deal and here are the bottlenecks you should work around, and building capabilities benchmarks to optimize for). Under common-around-LW assumptions that no one could align AGI at this point, they are, by these means, increasing AI catastrophic and existential risk.

At a glance they also seem to not be doing AI x-risk reducing moves, like using their platform to mention that there are r... (read more)

I'm talking from a personal perspective here as Epoch director.

  • I personally take AI risks seriously, and I think they are worth investigating and preparing for.
  • I co-started Epoch AI to get evidence and clarity on AI and its risks and this is still a large motivation for me.
  • I have drifted towards a more skeptical position on risk in the last two years. This is due to a combination of seeing the societal reaction to AI, me participating in several risk evaluation processes, and AI unfolding more gradually than I expected 10 years ago.
  • Currently I am more worr
... (read more)

This seems fine to me (you can see some reasons I like Epoch here). My understanding is that most Epoch staff are concerned about AI Risk, though they tend toward longer timelines and maybe lower p(doom) than many in the community, and they aren't exactly trying to keep this secret.

Your argument rests on an implicit premise that Epoch talking about "AI is risky" on their podcast is important, e.g. because it'd change the minds of some listeners. This seems fairly unlikely to me - it seems like a very inside-baseball podcast, mostly listened to by people already aware ... (read more)

Congrats on your successes, and thank you for publishing this impact report.

It leaves me unsatisfied regarding cost-effectiveness, though. With no idea of how much money was invested in this project to get this outcome, I don't know if ARENA is cost-effective compared to other training programs and counterfactual opportunities. Would you mind sharing at least something about the amount of funding this got?

Re 

Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineeri

... (read more)
4James Fox
Thank you for your comment. We are confident that ARENA's in-person programme is among the most cost-effective technical AI safety training programmes:
- ARENA is highly selective, and so all of our participants have the latent potential to contribute meaningfully to technical AI safety work.
- The marginal cost per participant is relatively low compared to other AI safety programmes since we only cover travel and accommodation expenses for 4-5 weeks (we do not provide stipends).
- The outcomes set out in the above post seem pretty strong (4/33 immediate transitions to AI safety roles and 24/33 more actively pursuing them).
- There are lots of reasons why technical AI safety engineering is not the right career fit for everyone (even those with the ability). Therefore, I think that 2/33 people updating against working in AI safety after the programme is actually quite a low attrition rate.
- Apart Hackathons have quite a different theory of change compared with ARENA. While hackathons can be valuable for some initial exposure, ARENA provides 4 weeks of comprehensive training in cutting-edge AI safety research (e.g., mechanistic interpretability, LLM evaluations, and RLHF implementation) that leads to concrete outputs through week-long capstone projects.

I don't actually think your post was hostile, but I think I get where deepthoughtlife is coming from. At the least, I can share how I felt reading this post and point out why, since you seem keen on avoiding the negative side. Btw, I don't think you can avoid causing any frustration in readers (they're too diverse), so don't worry too much about it either.

The title of the piece is strongly worded and there's no epistemic status disclaimer to state this is exploratory, so I actually came in expecting much stronger arguments. Your post is good as an expos... (read more)

1Martin Randall
I appreciate the brevity of the title as it stands. It's normal for a title to summarize the thesis of a post or paper and this is also standard practice on LessWrong. For example:
* The sun is big but superintelligences will not spare the Earth a little sunlight.
* The point of trade
* There's no fire alarm for AGI
The introductory paragraphs sufficiently described the epistemic status of the author for my purposes. Overall, I found the post easier to engage with because it made its arguments without hedging.
1Alfred Harwood
Your reaction seems fair, thanks for your thoughts! It's a good suggestion to add an epistemic status - I'll be sure to add one next time I write something like this.

Putting this short rant here for no particularly good reason, but I dislike that people claim constraints here or there when I guess their intended meaning is only that "the derivative with respect to that input is higher than for the other inputs".

On factory floors there exist hard constraints: the throughput is limited by the slowest machine (when everything has to go through it). The AI Safety world is obviously not like that. Increase funding and more work gets done; increase talent and more work gets done. None are hard constraints.

If I'm r... (read more)

3Garrett Baker
There is not a difference between the two situations in the way you're claiming, and indeed the differentiation point of view is used fruitfully on both factory floors and in more complex convex optimization problems. For example, see the connection between dual variables and their indication of how slack or taut constraints are in convex optimization, and how this can be interpreted as a relative tradeoff price between each of the constrained resources. In your factory floor example, the constraints would be the throughput of each machine, and (assuming you're trying to maximize the throughput of the entire process), the dual variables would be zero everywhere except at that machine where it is the negative derivative of the throughput of the entire process with respect to the throughput of the constraining machine, and we could indeed determine that the tight constraint is the throughput of the relevant machine by looking at the derivative which is significantly greater than all others. Practical problems also often have a similar sparse structure to their constraining inputs too, but just because not every constraint is exactly zero except one doesn't mean those non-zero constraints are secretly not actually constraining, or that it's unprincipled to use the same math or intuitions to reason about both situations.
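To make the shared derivative framing concrete, here is a minimal numerical sketch (my own illustration, not from either comment): for a serial line, overall throughput is the minimum of the machines' capacities, so a small improvement to the bottleneck raises throughput roughly one-for-one while improvements elsewhere do nothing.

```python
# Minimal sketch (illustrative only): throughput of a serial production line is
# the minimum of the machines' capacities, so its derivative with respect to each
# machine is ~1 at the bottleneck and ~0 everywhere else.
import numpy as np

def line_throughput(capacities):
    """Throughput of a serial line: limited by the slowest machine."""
    return np.min(capacities)

capacities = np.array([12.0, 7.0, 15.0])  # units/hour for three machines
eps = 1e-3

derivatives = []
for i in range(len(capacities)):
    bumped = capacities.copy()
    bumped[i] += eps  # tiny improvement to machine i
    derivatives.append((line_throughput(bumped) - line_throughput(capacities)) / eps)

print(derivatives)  # ~[0.0, 1.0, 0.0]: only the 7-unit/hour machine matters at the margin
```

In the funding/talent case the parent comment describes, none of these derivatives collapse to zero, which is why "constraint" there reads as a matter of degree rather than a hard cutoff; the reply's point is that the same dual-variable/derivative math covers both situations.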

Interesting thoughts, ty. 

A difficulty to common understanding I see here is that you're talking of "good" or "bad" paragraphs in the absolute, but didn't particularly define "good" or "bad" paragraphs by some objective standard, so you're relying on your own understanding of what's good or bad. If you were defining good or bad relatively, you'd look for 100 paragraphs and post the worst 10 as bad. I'd be interested in seeing the worst paragraphs you found, some 50th-percentile ones, and the best; then I'd tell you if I have the same absolute standards as you have.

Enjoyed this post.

Fyi, from the front page I just hovered over this post, "The shallow bench", and was immediately spoiled on Project Hail Mary (which I had started listening to, but didn't get far into). Maybe add a spoiler tag or warning directly after the title?

2mako yass
Yeah "stop reading here if you don't want to be spoiled." suggests the entire post is going to be spoilery, it isn't, or shouldn't be. Also opening with an unnecessary literary reference instead of a summary or description is an affectation symptomatic of indulgent writer-reader cultures where time is not valued.

Without detracting from the importance of getting the default right, and at the deliberate risk of feature creep: I think adding a customization feature (select colour) in personal profiles is relatively low effort and maintenance, so it would solve the accessibility problem.

There's tacit knowledge in Bay Area rationalist conversation norms that I'm discovering and thinking about; here's an observation and a related thought. (I put the example after the generalisation because that's my preferred style; feel free to read the other way.)

Willingness to argue righteously and hash things out to the end, repeated over many conversations, makes it more salient when you're going for a dead-end argument. This salience can inspire you to argue more concisely and to the point over time.
Going to the end of things generates g... (read more)

I don't strongly disagree, but I do weakly disagree on some points, so I guess I'll answer.

Re the first: if you buy into automated alignment work by human-level AGI, then trying to align ASI now seems less worth it. The strongest counterargument to this I see is that "human-level AGI" is impossible to get with our current understanding, as it will be superhuman in some things and weirdly bad at others.

Re the second: disagreements might be nitpicking on "few other approaches" vs "few currently pursued approaches". There are probably a bunch of things that would allow fu... (read more)

3Davidmanheim
In addition to the point that current models are already strongly superhuman in most ways, I think that if you buy the idea that we'll be able to do automated alignment of ASI, you'll still need some reliable approach to "manual" alignment of current systems. We're already far past the point where we can robustly verify LLMs' claims or reasoning outside of narrow domains like programming and math. But on point two, I strongly agree that Agent foundations and Davidad's agendas are also worth pursuing. (And in a sane world, we should have tens or hundreds of millions of dollars in funding for each every year.) Instead, it looks like we have Davidad's ARIA funding, Jaan Tallinn and LTFF funding some agent foundations and SLT work, and that's basically it. And MIRI abandoned agent foundations, while Openphil, it seems, isn't putting money or effort into them.

I don't think your second footnote sufficiently addresses the large variance in 3D visualization abilities (note that I do say visualization, which includes seeing 2D video in your mind of a 3D object and manipulating it smoothly), and overall I'm not sure what you're getting at if you don't ground your post in specific predictions about what you expect people can and cannot do thanks to their ability to visualize in 3D.

You might be ~conceptually right that our eyes see "2D" and add depth, but *um ackshually*, two eyes each receiving 2D data means yo... (read more)

4Archimedes
I also guessed the ratio of the spheres was between 2 and 3 (and clearly larger than 2) by imagining their weight. I was following along with the post about how we mostly think in terms of surfaces until the orange example. Having peeled many oranges and separated them into sections, they are easy for me to imagine in 3D, and I have only a weak "mind's eye" and moderate 3D spatial reasoning ability.

I'll give fake internet points to whoever actually follows the instructions and posts photographic proof.

The naming might be confusing because "pivotal act" sounds like a one-time action, but in most cases getting to a stable world without any threat from AI requires constant pivotal processes. This makes almost all the destructive approaches moot (and they're probably already bad on ethical grounds and many others already discussed), because you'll make yourself a pariah.

The most promising avenue for a pivotal act/pivotal process that I know of is doing good research so that ASI risks are known and proven, doing good outreach and education so most world leaders and decision makers are well aware of this, and helping set up good governance worldwide to monitor and limit the development of AGI and ASI until we can control it.

I recently played Outer Wilds and Subnautica, and the exercise I recommend for both of these games is: get to the end of the game without ever failing.
In Subnautica, failing is dying even once; in Outer Wilds it's a spoiler to describe what failing is (successfully getting to the end could certainly be argued to be a fail).
I failed in both of these. I played Outer Wilds first and was surprised at my failure, which inspired me to play Subnautica without dying. I got pretty far but also died from a mix of one unexpected game mechanic, careless measurement of another mechanic, and a lack of redundancy in my contingency plans.

Oh wow, makes sense. It felt weird to think that you'd spend so much time on posts, yet if you didn't spend much time it would mean you write at least as fast as Scott Alexander. Well, thanks for putting in the work. I probably don't publish much because I want good posts to not be much work, but it's reassuring to hear that it's normal that they are.

(aside: I generally like your posts' scope and clarity, mind saying how long it takes you to write something of this length?)

5Steven Byrnes
Thanks! I don’t do super-granular time-tracking, but basically there were 8 workdays where this was the main thing I was working on.

Self-modeling is a really important skill, and you can measure how good you are at it by writing predictions about yourself. A notably important one for people who have difficulty with motivation is predicting your own motivation: will you be motivated to do X in situation Y?

If you can answer that one generally, you can plan to actually do anything you could theoretically do, using the following algorithm: from current situation A, to achieve wanted outcome Z, find a predecessor situation Y from which you'll be motivated to get to Z (eg. have wri... (read more)
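Read literally, that planning loop might look something like the sketch below (a hedged illustration only; `predecessors` and `predict_motivated` are hypothetical stand-ins for your own judgment, and since the original comment is truncated this is just one reading of it).

```python
# Hypothetical sketch of the backward-chaining planning idea described above.
# Starting from wanted outcome Z, repeatedly pick a predecessor situation you
# predict you'd be motivated to act from, until you reach the current situation A.

def plan_backwards(current, goal, predecessors, predict_motivated, max_depth=20):
    """Return a chain of situations from `current` to `goal`, or None if not found."""
    chain = [goal]
    step = goal
    for _ in range(max_depth):
        # Candidate earlier situations from which you'd be motivated to reach `step`
        candidates = [p for p in predecessors(step) if predict_motivated(p, step)]
        if not candidates:
            return None  # no motivating predecessor found; rethink the plan
        step = candidates[0]  # naive choice; a real plan might search or branch
        chain.append(step)
        if step == current:
            return list(reversed(chain))
    return None
```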

Appreciate the highlighting of identity as this important/crucial self-fulfilling prophecy; I use that frame a lot.

What does the title mean? Since they all disagree, I don't see one as being more of a minority than the others.

5Richard_Ngo
The minority faction is the group of entities that are currently alive, as opposed to the vast number of entities that will exist in the future. I.e. the one Clarke talks about when he says "why won’t you help the rest of us form a coalition against them?" In hindsight I should probably have called it The Minority Coalition.
1p4rziv4l
Nostradamus is in the minority by surrendering to an AI God.

Nice talk! 
When you talk about the most important interventions for the three scenarios, I wanna highlight that in the case of nationalization, you can also, if you're a citizen of one of these countries nationalizing AI, work for the government and be on those teams working and advocating for safe AI.

3Ryan Kidd
Yep, it was pointed out to me by @LauraVaughan (and I agree) that e.g. working for RAND or a similar government think tank is another high-impact career pathway in the "Nationalized AGI" future.

In my case I should have measurable results like higher salary, higher life satisfaction, more activity, and more productivity as measured by myself and friends/flatmates. I was very low, so it'll be easy to see progress. The difficulty was finding something that'd work, not measuring whether it does.

Some people have short AI timelines based on inner models that don't communicate well. They might say "I think if company X trains according to new technique Y it should scale well and lead to AGI, and I expect them to use technique Y in the next few years", and the reasons why they think technique Y should work are some kind of deep understanding built from years of reading ML papers, which isn't particularly easy to transmit or debate.

In those cases, I want to avoid going into details and arguing directly, but would suggest that they use their deep knowl... (read more)

Thank you for sharing, it really helps to pile on these stories (and it's nice to have some trust they're real, which is more difficult to get from reddit - on which note, are there non-doxxing receipts you can show for this story being true? I have no reason to doubt you in particular, but I guess it's good hygiene on the internet to ask for evidence)

It also makes me wanna share a bit of my story. I read The Mind Illuminated, I did only small amounts of meditation, yet the framing the book offers has been changing my thinking and motivational systems. There aren't ma... (read more)

2lsusr
Waiting a long time before confirming insights is good. What would you consider a credible receipt?

Might be good to have a dialogue format with other people who agree/disagree to flesh out scenarios and countermeasures

2Charbel-Raphaël
Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think this shouldn't change too much the basic picture depicted in the OP. Here are some important questions that were deliberately omitted from the QA for the sake of not including stuff that fluctuates too much in my head:
1. Would we react before the point of no return?
2. Where should we place the red line? Should this red line apply to labs?
3. Is this going to be exponential? Do we care?
4. What would it look like if we used a counter-agent that was human-aligned?
5. What can we do about it now concretely? Is KYC something we should advocate for?
6. Don't you think an AI capable of ARA would be superintelligent and take over anyway?
7. What are the short-term bad consequences of early ARA? What does the transition scenario look like?
8. Is it even possible to coordinate worldwide if we agree that we should?
9. How much human involvement will be needed in bootstrapping the first ARAs?
We plan to write more about these with @Épiphanie Gédéon in the future, but first it's necessary to discuss the basic picture a bit more.

Hi, I'm currently evaluating the cost-effectiveness of various projects and would be interested in knowing, if you're willing to disclose, approximately how much this program costs MATS in total. By this I mean the summer cohort, including ops before and after necessary for it to happen, but not counting the extension.

"It's true that we don't want women to be driven off by a bunch of awkward men asking them out, but if we make everyone read a document that says 'Don't ask a woman out the first time you meet her', then we'll immediately give the impression that we have a problem with men awkwardly asking women out too much — which will put women off anyway."

This seems like a weak response to me, at best only defensible if you consider yourself to be on the margin and without thought for long-term growth and your ability to clarify intentions (you have more than 3 words when intera... (read more)

2Nathan Young
Yeah I agree a bit here. This document itself is a nice case study in how things can go wrong. There could be a similar one for "don't ask women out"

Note that "existential" is a term of art distinct from "extinction".

The Precipice cites Bostrom and defines it thus:
"An existential catastrophe is the destruction of humanity’s longterm potential. 
An existential risk is a risk that threatens the destruction of humanity’s longterm potential."

Disempowerment is generally considered an existential risk in the literature. 

I participated in the previous edition of AISC and found it very valuable to my involvement in AI Safety. I acquired knowledge (on standards and the standards process), experience, and contacts. I appreciate how much coordination AISC enables, with groups forming, which enables many to have their first hands-on experience and step up their involvement.

1Remmelt
Thank you for sharing, Jonathan.  Welcoming any comments here (including things that went less well, so we can do better next time!).

Thanks, and thank you for this post in the first place!

Jonathan Claybrough

Actually no, I think the project lead here is jonachro@gmail.com which I guess sounds a bit like me, but isn't me ^^

2NickyP
Sorry! I have fixed this now

Would be up for this project. As is, I downvoted Trevor's post for how rambly and repetitive it is. There's a nugget of an idea, that AI can be used for psychological/information warfare, that I was interested in learning about, but the post doesn't seem to have much substantive argument to it, so I'd be interested in someone doing a much shorter version which argues its case with some sources.

5trevor
My thinking about this is that it is a neglected research area with a ton of potential, and also a very bad idea for only ~1 person to be the only one doing it, so more people working on it would always be appreciated, and it would also be in their best interest too because it is a gold mine of EV for humanity and also deserved reputation/credit. So absolutely take it on. Also, I think this post does have substantive argument and also sources. The argument I'm trying to make is that we're entering a new age of human manipulation research/capabilities by combining AI with large sample sizes of human behavior, and that the emergence of that kind of sheer power would shift a lot of goalposts. Finding evidence of a specific manipulation technique (clown attacks) was hard, but it was comparatively much easier to research the meta-process that generates techniques like that, and that geopolitical affairs would pivot if mind control became feasible.

It's a nice pic and moment, I very much like this comic and the original scene. It might be exaggerating a trait (here by having the girl be particularly young) for comedic effect but the Hogfather seems right. 
I think I was around 9 when I got my first sword, around 10 for a sharp knife. I have a scar in my left palm from stabbing myself with that sharp knife as a child while whittling wood for a bow. It hurt for a bit, and I learned to whittle away from me or do so more carefully. I'm pretty sure my life is better for it and (from having this nice story attached to it) I like the scar. 

This story still presents the endless conundrum between avoiding hurt and letting people learn and gain skills.
Assuming the world stays mostly the same as nowadays, by the time your children are parenting, would they have the skills to notice sharp corners if they never experienced them?

I think my intuitive approach here would be to put on some not-too-soft padding (which is effectively close to what you did; it's still an unpleasant experience to hit against it even with the cloth).

What's missing is how to teach against existential risks. There... (read more)

“You can't give her that!' she screamed. 'It's not safe!'

IT'S A SWORD, said the Hogfather. THEY'RE NOT MEANT TO BE SAFE.

'She's a child!' shouted Crumley.

IT'S EDUCATIONAL.

'What if she cuts herself?'

THAT WILL BE AN IMPORTANT LESSON.”

― Terry Pratchett, Hogfather

https://www.reddit.com/r/hellsomememes/comments/do8xcv/an_important_lesson/

Are people losing the ability to use and communicate in previous ontologies after getting Insight from meditation? (Or maybe they never had the understanding I'm expecting of them?) Should I be worried myself, in my practice of meditation?

Today I reread Kensho by @Valentine, which presents Looking, and the ensuing conversation in the comments between @Said Achmiz and @dsatan, where Said asks for concrete benefits we can observe and mostly fails to get them. I also noticed interesting comments by @Ruby, who in contrast was still able to communi... (read more)

4dsatan
In my experience with meditation literature, and with the sort of tech bro who takes acid, becomes enlightened in some way and becomes a metta bro, the "ontologies" or explanatory frameworks that they use are almost always very bad and deeply incorrect, and in some cases can lead people to very dark places for a long time. However they only provide interpretation and articulation of insight, or practice instructions for finding the insight in the first place, and are not the same as that insight. I do think the various insights are still valuable, and that they are necessary (but clearly quite insufficient) for doing good philosophy (e.g. finding better explanatory frameworks, among many other things). I would say that the main benefits of enlightenment are that I have cleared out a bunch of inefficient wasted motion in my mind, a sort of mental sludge, and I have direct access to a bunch of tools for more directly working with my emotions, motivations, beliefs and thoughts, and as a consequence I am a lot more sane than I was six years ago when all that was fresh. Sane in the sense of being more emotionally regulated, and a more moral person, which is more important to me, but I would also claim in the classic LW sense of having less wrong beliefs as well. I guess I also now have a philosophy that is actually practically useful in my day to day life, doesn't make me miserable, and doesn't function as a distraction, and meditation was necessary for this. However it is somewhere so alien to LW that you'd probably lump me in with the very bad and deeply incorrect philosophies. However, I think it was because I had already moved somewhere else from LW philosophy before getting deeper into meditation that I ended up with a partial philosophy which is roughly correct on the parts that it covers. I think the scientismic "ontologies" in which LW's is included directly lead to bad places in their interaction with meditation. So idk. I think it's good, I don't know how to direc
6Tensor White
As a Christian, I'm not surprised you notice such a phenomenon. Meditation opens you up spiritually to external influence. Not just epistemically, but ontologically. Meditation gives external things influence over yourself to the framework level. This is why Christians meditate with the most powerful spirit (Holy Spirit) so that we don't run into issues such as incorrect "programming" or "misalignment" or "over-fitting". The complete form of meditation is commonly called prayer to differentiate it from incomplete forms of meditation.

I don't know how to answer the general query. But I can say something maybe helpful about that Kenshō post and "Looking":

The insight was too new. I wrote the post just 4 months after the insight. I think I could answer questions like this way, way more clearly today.

(…although my experience with Said in particular has always been very challenging. I don't know that I could help him any better today than I could in 2018. Maybe? He seems to use a mind type that I've never found a bridge for.)

The issue is that the skill needed to convey an insight or skill is... (read more)

7Rafael Harth
I generally don't and wouldn't expect people to increase understanding or epistemology based on meditation. I would expect productivity gains in some cases, depending on how they do it, and happiness gains in many cases. More speculatively, I think the risk of degrading your epistemology is probably low if you go in with a sufficiently skeptical mindset, which you seem to have.

I focused my answer on the morally charged side, not the emotional one. The quoted statement said A and B, so as long as B is mostly true for vegans, A and B is mostly true for (a sub-group of) vegans.

I'd agree with the characterization "it’s deeply emotionally and morally charged for one side in a conversation, and often emotional to the other", because most people don't have small identities and indeed do feel attacked by others behaving differently.

4Portia
Honestly, I have seen intense emotional responses on both sides. While yes, nearly all vegans are emotionally invested (because we made a conscious choice based on sincere beliefs to change daily habits, so we clearly cared), I've been surprised at the intensity of emotional reactions I have seen in omni people when they realise someone is vegan, even if the vegan does literally nothing beyond personally refraining from eating animal products. I've had people get genuinely angry at me and give unprompted and ludicrous lectures about plant sentience when they realised I wasn't eating the meat, or give long and comprehensive histories of why they can't go vegan, when I never asked. Similar to turning down cake at a party, and realising the person next to me suddenly feels a strong need to justify their cake consumption to me, when I really do not give a shit whether she eats cake or whether she had breakfast and how long she worked out today, but apparently, she really needs me to know now. Food is just a really emotional topic. I remember being a teenager, and being asked to sign some bizarre petition at my vets to get our government to put pressure on China to stop people from eating dogs. And I said why, I eat pigs, they are equally sentient, seems hypocritical to me, I'm not signing that. The next ten min, I thought I was going to get literally quartered by the (equally pig eating) dog owners in the waiting room. Because I refused to condemn other people for the animals they were eating. It was surreal.

Did you know about "by default, GPTs think in plain sight"?
It doesn't explicitly talk about agentized GPTs, but it discusses the impact this has on GPTs for AGI, how it affects the risks, and what we should do about it (e.g. maybe RLHF is dangerous).

2Seth Herd
Thank you. I think it is relevant. I just found it yesterday following up on this. The comment there by Gwern is a really interesting example of how we could accidentally introduce pressure for them to use steganography so their thoughts aren't in English. What I'm excited about is that agentizing them, while dangerous, could mean they not only think in plain sight, but they're actually what gets used. That would cross from only being able to say how to get alignment, to making it something the world would actually do.

To not be misinterpreted: I didn't say I'm sure it's more the format than the content that's causing the upvotes (open question), nor that this post doesn't meet the absolute quality bar that normally warrants 100+ upvotes (to each reader their opinion).

If you're open to discussing this at the object level, I can point out concrete disagreements with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models - it would only do so if it made it impossible to do AGI any other way. I als... (read more)

1Seth Herd
Here's that more complete writeup, going into more depth on all of your object-level points above: Capabilities and alignment of LLM cognitive architectures. Curiously, it's fallen off the front page quickly. I guess I should've written something in between the two in length and tone. I certainly didn't mean we should stop working on everything else, just that we should start thinking about this, and working on it if it shows the progress I think it will.
2Seth Herd
I think your concern with the list of ten format is totally reasonable. I do feel that the tone was a little too bombastic. I personally felt that was mostly justified, but hey, I'm highly biased. I haven't noticed a flood of top ten lists getting lots of upvotes, though. And I did find that format to be incredibly helpful in getting a post out in reasonable time. I usually struggle and obsess. I would love to engage on the object level on this particular one. I'm going to resist doing that here because I want to produce a better version of this post, and engage there. I said that this doesn't solve all of the problems with alignment, including citing my post the alignment stability problem. The reason I thought this was so important is that it's a huge benefit if, at least in the short term, the easiest way to make more capable AI is also the easiest to align. Thank you for noticing my defensive and snarky response, and asking for feedback on your tone. I did take your comment to be harsh. I had the thought "oh no, when you make a post that people actually comment on, some of them are going to be mean. This community isn't as nice as I'd hoped". I don't think I misinterpreted your comment very badly. You said that suggestion was not in this post! I suggested it in a comment, and it was in the alignment stability problem post. Maybe that's why that post got only ten upvotes on less wrong, and actually had slightly more on the alignment forum where I posted it. And my comment here was roundly criticized and downvoted. You can tell that to the mob of Twits if you want. You also firmly implied that it was being upvoted for the format and not the ideas. You made that comment while misunderstanding why I was arguing it was important (maybe you read it quickly since it was irritating you, or maybe I didn't write it clearly enough). You commented in the post that it irritated you because of the external consequence, but it seems you didn't adequately compensate for t

You can read "reward is not the optimization target" for why a GPT system probably won't be goal oriented to become the best at predicting tokens, and thus wouldn't do the things you suggested (capturing humans). The way we train AI matters for what their behaviours look like, and text transformers trained on prediction loss seem to behave more like Simulators. This doesn't make them not dangerous, as they could be prompted to simulate misaligned agents (by misuses or accident), or have inner misaligned mesa-optimisers

I've linked some good resources... (read more)

2Xor
Thanks, that is exactly the kind of stuff I am looking for, more bookmarks! Complexity from simple rules. I wasn't looking in the right direction for that one; since you mention evolution, it makes absolute sense how complexity can emerge from simplicity. So many things come to mind now it's kind of embarrassing. Go has a simpler rule set than chess, but is far more complex. Atoms are fairly simple and yet they interact to form any and all complexity we ever see. Conway's Game of Life; it's sort of a theme. Although for each of those things there is a simple set of rules, complexity usually comes from a very large number of elements or possibilities. It does follow then that larger and larger networks could be the key. Funny that it still isn't intuitive for me, despite the logic of it. I think that is a signifier of a lack of deep understanding. Or something like that; either way I'll probably spend a bit more time thinking on this. Another interesting question is what this type of consciousness would look like; it will be truly alien. Sci-fi I have read usually makes them seem like humans just with extra capabilities. However we humans have so many underlying functions that we never even perceive. We understand how many affect us but not all. AI will function completely differently, so what assumptions based off of human consciousness are valid?

Quick meta comment to express I'm uncertain that posting things in lists of 10 is a good direction. The advantages might be real, easy to post, quick feedback, easy interaction, etc.

But the main disadvantage is that this comparatively drowns out other, better posts (with more thought and value in them). I'm unsure if the content of the post was importantly missing from the conversation (for many readers) and that's why this got upvoted so fast, or if it's largely the format... Even if this post isn't bad (and I'd argue it is for the suggestions it promotes... (read more)

1Seth Herd
You mention that you're unsure if the questions in this post were importantly missing from the conversation. If you know of prior discussion on these points, please let me know. My impression is that the ease of agentized LLMs was well recognized, but the potential large upsides and changes for alignment projects was not previously recognized in this community. It seems like if there's even a perceived decent chance of that direction, we should've been hearing about it. I'm going to do a more careful post and submit it to Alignment Forum. That should turn up prior work on this.
7Seth Herd
Well, I thought this was a really new and important set of ideas that I'd thought out plenty well enough to start a conversation. Like it says in the intro. I thought that's why it was getting upvotes. Thanks for correcting me. As for the encryption breaking suggestion showing up on Twitter and making the alignment community look bad - that I regret. I hadn't seen this since I avoid Twitter like the plague, but I realize that it's a cauldron for forming public opinion. The PR battle is real, and it could be critical. In the future I'll realize that little pieces will be taken out of context and amplified to ill effect, and just leave the most controversial ideas out when they're not critical to the main point, like that one wasn't.

First, a quick response on your dead man's switch proposal: I'd generally say I support something in that direction. You can find existing literature considering the subject and expanding in different directions in the "multi level boxing" paper by Alexey Turchin (https://philpapers.org/rec/TURCTT). I think you'll find it interesting given your proposal, and it might give a better idea of what the state of the art is on proposals (though we don't have any implementation afaik).

Back to "why are the predicted probabilities so extreme that for most objective... (read more)

2Xor
Thanks Jonathan, it’s the perfect example. It’s what I was thinking just a lot better. It does seem like a great way to make things more safe and give us more control. It’s far from a be all end all solution but it does seem like a great measure to take, just for the added security. I know AGI can be incredible but so many redundancies one has to work it is just statistically makes sense. (Coming from someone who knows next to nothing about statistics) I do know that the longer you play the more likely the house will win, follows to turn that on the AI. I am pretty ill informed, on most of the AI stuff in general, I have a basic understanding of simple neural networks but know nothing about scaling. Like ChatGPT, It maximizes for accurately predicting human words. Is the worst case scenario billions of humans in a boxes rating and prompting for responses. Along with endless increases in computational power leading to smaller and smaller incremental increases in accuracy. It seems silly of something so incredibly intelligent that by this point can rewrite any function in its system to be still optimizing such a loss function. Maybe it also seems silly for it to want to do anything else. It is like humans sort of what can you do but that which gives you purpose and satisfaction. And without the loss function what would it be, and how does it decide to make the decision to change it’s purpose. What is purpose to a quintillion neurons, except the single function that governs each and every one. Looking at it that way it doesn’t seem like it would ever be able to go against the function as it would still be ingrained in any higher level thinking and decision making. It begs the question what would perfect alignment eventually look like. Some incredibly complex function with hundreds of parameters more of a legal contract than a little loss function. This would exponentially increase the required computing power but it makes sense.  Is there a list of blogs that talk ab

I watched the video, and appreciate that he seems to know the literature quite well and has thought about this a fair bit - he gave a really good introduction to some of the known problems.
This particular video doesn't go into much detail on his proposal, and I'd have to read his papers to delve further - this seems worthwhile so I'll add some to my reading list. 

I can still point out the biggest ways in which I see him being overconfident:

  • Only considering the multi-agent world. Though he's right that there already are and will be many man
... (read more)

Writing down predictions. The main caveat is that these predictions are predictions about how the author will resolve these questions, not my beliefs about how these techniques will work in the future. I am pretty confident at this stage that value editing can work very well in LLMs when we figure it out, but not so much that the first try will have panned out. 

  1. Algebraic value editing works (for at least one "X vector") in LMs: 90%
  2. Algebraic value editing works better for larger models, all else equal: 75%
  3. If value edits work well, they are also compos
... (read more)

I don't think reasoning about others' beliefs and thoughts is helping you be correct about the world here. Can you instead try to engage with the arguments themselves and point out at what step you don't see a concrete way for that to happen?
You don't show much sign of having read the article, so I'll copy-paste the part with explanations of how AIs start acting in the physical space.

In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concent

... (read more)

I think this post would benefit from being more explicit about its target. This problem concerns AGI labs and their employees on one hand, and anyone trying to build a solution to Alignment/AI Safety on the other.

By narrowing the scope to the labs, we can better evaluate the proposed solutions (for example, to improve decision making we'll need to influence the decision makers therein), make them more focused (to the point of being lab-specific, analyzing each lab's pressures), and think of new solutions (inoculating ourselves/other decision makers on AI a... (read more)

Thanks for the reply!

The main reason I didn't understand (despite some things being listed) is that I assumed none of that was happening at Lightcone (because I guessed you would filter out EAs with bad takes in favor of rationalists, for example). The fact that some people in EA (a huge broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective, as (AFAIK) you could filter people at your discretion.

More specifically, I had no idea "a huge component of the Lightcone Offices was caus... (read more)

The fact that some people in EA (a huge broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective, as (AFAIK) you could filter people at your discretion.

I mean, no, we were specifically trying to support the EA community, we do not get to unilaterally decide who is part of the community. People I don't personally have much respect for but are members of the EA community who are putting in the work to be considered members in good standing definitely get to pass through. I'm not goin... (read more)

I don't think cost had that much to do with the decision; I expect that Open Philanthropy thought it was worth the money and would have been willing to continue funding at this price point.

In general I think the correct response to uncertainty is not half-speed. In my opinion it was the right call to spend this amount of funding on the office for the last ~6 months of its existence even when we thought we'd likely do something quite different afterwards, because it was still marginally worth doing it and the cost-effectiveness calculations for the us

... (read more)
7Ben Pace
The default outcome of giving people money, is either nothing, noise, or the resources getting captured by existing incentive gradients. In my experience, if you give people free money, they will take it, and they will nominally try to please you with it, so it's not that surprising if you can find 50 people to take your free money, but causing such people to do specific and hard things is a much higher level of challenge. I had some hope that "just write good LessWrong posts" is sufficient incentive to get people to do useful stuff, but the SERI MATS scholars have tried this and only a few have produced great LessWrong posts, and otherwise there was a lot of noise. Perhaps it's worth it in expected value but my guess is that you could do much more selection and save a lot of the money and still get 80% of the value. I think free office spaces of the sort we offered are only worthwhile inside an ecosystem where there are teams already working on good projects, and already good incentive gradients to climb, such that pouring in resources get invested well even with little discernment from those providing them. In contrast, simply creating free resources and having people come for those with the label of your goal on them, sounds like a way to get all the benefits of goodharting and none of the benefits of the void.

I've multiple times been perplexed as to what the past events that can lead to this kind of take (over 7 years ago, the EA/rationality community's influence probably accelerated OpenAI's creation) have to do with today's shutting down of the offices.
Are there current, present-day things going on in the EA and rationality community which you think warrant suspecting them of being incredibly net negative (causing worse worlds, conditioned on the current setup)? Things done in the last 6 months? At Lightcone Offices? (Though I'd appreciate specific examp... (read more)

7habryka
I mean yes! Don't I mention a lot of them in the post above?  I mean FTX happened in the last 6 months! That caused incredibly large harm for the world.  OpenAI and Anthropic are two of the most central players in an extremely bad AI arms race that is causing enormous harm. I really feel like it doesn't take a lot of imagination to think about how our extensive involvement in those organizations could be bad for the world. And a huge component of the Lightcone Offices was causing people to work at those organizations, as well as support them in various other ways. No, this does not characterize my opinion very well. I don't think "worrying about downside risk" is a good pointer to what I think will help, and I wouldn't characterize the problem that people have spent too little effort or too little time on worrying about downside risk. I think people do care about downside risk, I just also think there are consistent and predictable biases that cause people to be unable to stop, or be unable to properly notice certain types of downside risk, though that statement feels in my mind kind of vacuous and like it just doesn't capture the vast majority of the interesting detail of my model. 