All of Jacob Watts's Comments + Replies

As a counterexample to the idea that safety work isn't compute-constrained, here is a quote from an interpretability paper out of Anthropic, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet":
 

We don't have an estimate of how many features there are or how we'd know we got all of them (if that's even the right frame!). We think it's quite likely that we're orders of magnitude short, and that if we wanted to get all the features – in all layers! – we would need to use much more compute than the total compute needed

... (read more)

I think something like "alignment features" are plausibly a huge part of the story for why AI goes well. 

At least, I think it is refreshing to take the x-risk goggles off for a second sometimes and remember that there is actually a huge business incentive to e.g. solve "indirect prompt injections", perfect robust AI decision-making in high-stakes contexts, or find the holy grail of compute-scalable oversight. 

Like, a lot of times there seems to be genuine ambiguity and overlap between "safety" research and normal AI research. The clean "capabilities"/"align... (read more)

1Sergii
unfortunately, AI research is commercialized and is heavily skewed by capitalist market needs, so it's still going to be all in for trying to make an "AI office worker", safety be damned, until this effort hits some wall, which I think is still plausible.

A lot of good people are doing a lot of bad things that they don't enjoy doing all the time. That seems weird. They even say stuff like "I don't want to do this". But then they recite some very serious sounding words or whatever and do it anyways.

 

Lol, okay, on review that reads as privileged. Easy for rectangle-havers to say. 

There is underlying violence keeping a lot of people "at work" and doing the things they don't want to do. An authoritarian violence keeping everyone in place.

The threat is to shelter, food, security, even humanity past a c... (read more)

For specific activities, I would suggest doubling down on activities that you already like to do or have interest in, but which you implicitly avoid "getting into" because they are considered low status. For example: improve your masturbation game, improve your drug game (as in plan fun companion activities or make it a social thing; not just saying do more/stronger drugs), get really into that fringe sub-genre that ~only you like, experiment with your clothes/hair style, explore your own sexual orientation/gender identity, just straight up stop doing any hobbies that you're only into for the status, etc. 

I think the best way to cash in on the fun side of the fun/status tradeoff is probably mostly rooted in adopting a disposition and outlook that allows you to. I think most people self-limit like crazy to promote a certain image, and that if you're really trying to extract fun-bang for your status-buck, then dissolving some of that social conditioning and learning to be silly is a good way to go. Basically, I think there's a lot of fun to be had for those who are comfortable acting silly or playful or unconventional. If you can unlock some of that... (read more)


While I agree that there are notable differences between "vegans" and "carnists" in terms of group dynamics, I do not think that necessarily contradicts the idea that carnists are anti-truthseeking. 

"carnists" are not a coherent group, not an ideology, they do not have an agenda (unless we're talking about some very specific industry lobbyists who no doubt exist). They're just people who don't care and eat meat.

It seems untrue that because carnists are not an organized physical group that has meetings and such, they are thereby incapable of having ... (read more)

Thanks! I haven't watched, but I appreciated having something to give me the gist!

Hotz was allowed to drive discussion. In debate terms, he was the con side, raising challenges, while Yudkowsky was the pro side defending a fixed position.

This always seems to be the framing, which seems unbelievably stupid given the stakes on each side of the argument. Still, it seems to be the default; I'm guessing this is status quo bias and the historical tendency of everything to stay relatively the same year by year (less so once technology really started happening). I ... (read more)

-4TAG
Remember that this is a three-way debate: AI is safe; AI causes finite, containable problems; AI kills (almost) everybody. The most extreme scenario is conjunctive because it requires AI with goals; goal stability; rapid self-improvement (foom); and means. So nitpicking one stage of Foom Doom actually does refute it, even if it has no impact on the middle-of-the-road position.

Would the prize also go towards someone who can prove it is possible in theory? I think some flavor of "alignment" is probably possible and I would suspect it more feasible to try to prove so than to prove otherwise.

I'm not asking to try to get my hypothetical hands on this hypothetical prize money, I'm just curious if you think putting effort into positive proofs of feasibility would be equally worthwhile. I think it is meaningful to differentiate "proving possibility" from alignment research more generally and that the former would itself be worthwhile. I'm sure some alignment researchers do that sort of thing right? It seems like a reasonable place to start given an agent-theoretic approach or similar.

3Jeffs
Great question. I think the answer must be "yes." The alignment-possible provers must get the prize, too. And that would be fantastic. Proving a thing is possible accelerates development. (The US uses the atomic bomb; Russia has it 4 years later.) Okay, it would be fantastic if the possibility proof did not create false security in the short term. It's important when alignment gets solved. A peer-reviewed paper can't get the coffee. (That thought is an aside and not enough to kill the value of the prize, IMHO. If we prove it is possible, that must accelerate alignment work and inform it.) Getting definitions and criteria right will be harder than raising the $10 million. And important. And contribute to current efforts. Making it agnostic to possible/impossible would also have the benefit of removing political/commercial antibodies to the exercise, I think.

I appreciate the attempt, but I think the argument is going to have to be a little stronger than that if you're hoping for the $10 million lol.

Aligned ASI doesn't mean "unaligned ASI in chains that make it act nice", so the bits where you say:

any constraints we might hope to impose upon an intelligence of this caliber would, by its very nature, be surmountable by the AI

and 

overconfidence to assume that we could circumscribe the liberties of a super-intelligent entity

feel kind of misplaced. The idea is less "put the super-genius in chains" and more so to... (read more)

The doubling time for AI compute is ~6 months

 

Source?

In 5 years compute will scale 2^(5÷0.5)=1024 times

 

This is a nitpick, but I think you meant 2^(5*2)=1024
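
(Worked out, both forms give the same number, assuming the stated 6-month, i.e. 0.5-year, doubling time: 2^(5÷0.5) = 2^(5×2) = 2^10 = 1024. The first divides the 5-year window by the cycle length; the second multiplies the years by two doublings per year.)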

 

In 5 years AI will be superhuman at most tasks including designing AI

 

This kind of clashes with the idea that AI capabilities gains are driven mostly by compute. If "moar layers!" is the only way forward, then someone might say this is unlikely. I don't think this is a hard problem, but I think it's a bit of a snag in the argument.

 

An AI will design a better version of itself a

... (read more)
3William the Kiwi
Thank you Jacob for taking the time for a detailed reply. I will do my best to respond to your comments.

Source: https://www.lesswrong.com/posts/sDiGGhpw7Evw7zdR4/compute-trends-comparison-to-openai-s-ai-and-compute. They conclude 5.7 months from the years 2012 to 2022. This was rounded to 6 months to make calculations more clear. They also note that "OpenAI's analysis shows a 3.4 month doubling from 2012 to 2018".

I actually wrote it the (5*2) way in my first draft of this post, then edited it to (5÷0.5), as this is [time frame in years]÷[length of cycle in years], which is technically less wrong.

I think this is one of the weakest parts of my argument, so I agree it is definitely a snag. The move from "superhuman at some tasks" to "superhuman at most tasks" is a bit of a leap. I also don't think I clarified what I meant very well. I will update to add ", with ~1024 times the compute,". Would adding that suggested text to the previous argument step help?

Perhaps "The AI will be able to recognize and take actions that increase its reward function. Designing a better version of itself will increase that reward function". But yeah, I tend to agree that there needs to be some sort of agentic clause in this argument somewhere. Would this be an improvement? "Such an AI will be superhuman, or able to become superhuman, at almost all tasks, including computer security, R&D, planning, and persuasion"

I would speculate that most of our implemented alignment strategies would be meta-stable: they only stay aligned for a random amount of time. This would mean we mostly rely on strategies that hope to get x before we get y. Obviously this is a gamble. I speculate that a lot of the x-risk probability comes from agentic models. I am particularly concerned with better versions of models like AutoGPT that don't have to be very intelligent (so long as they are able to continuously ask GPT5+ how to act intelligent) to pose a serious risk.

Meta question: how do I dig my way out of a k

Personally, I found it obvious that the title was being playful and don't mind that sort of tongue-in-cheek thing. I mean, "utterly perfect" is kind of a giveaway that they're not being serious.

2William the Kiwi
You are correct, I was not being serious. I was a little worried someone might think I was, but considered it a low probability. Edit: this little stunt has cost me a 1-hour time limit on replies. I will reply to the other replies soon.

Great post!

As much as I like LessWrong for what it is, I think it's often guilty of a lot of the negative aspects of conformity and coworking that you point out here, i.e. killing good ideas in their cradle. Of course, there are trade-offs to this sort of thing and I certainly appreciate brass tacks and hard-nosed reasoning sometimes. There is also a need for ingenuity, non-conformity, and genuine creativity (in all of its deeply anti-social glory).

Thank you for sharing this! It helped me feel LessWeird about the sorts of things I do in my own creative/ex... (read more)

There’s a dead zone between skimming and scrutiny where you could play slow games without analyzing them and get neither the immediate benefits of cognitively-demanding analysis nor enough information to gain a passive understanding of the underlying patterns.

 

I think this is a good point. I think there's a lot to be said for being intentional about how/what you're consuming. It's kind of easy for me to fall into a pit of "kind of paying attention" where I'm spending mental energy but not retaining anything, yet not really skimming either. I think it... (read more)

It strikes me that there is a difficult problem involved in creating a system that can automatically perform useful alignment research, which is generally pretty speculative and theoretical, without that system just being generally skilled at reasoning/problem solving. I am sure they are aware of this, but I feel like it is a fundamental issue worth highlighting.

Still, it seems like the special case of "solve the alignment problem as it relates to an automated alignment researcher" might be easier than "solve alignment problem for reasoning systems general... (read more)

I'm interested in getting involved with a mentorship program or a learning cohort for alignment work. I have found a few things poking around (mostly expired application posts), but I was wondering if anyone could point me towards a more comprehensive list. I found aisafety.community, but it still seems like it is missing things like bootcamps, SERI MATS, and such. If anyone is aware of a list of bootcamps, cohorts, or mentor programs, or could list a few off for me, I would really appreciate the direction. Thanks!

1mjt
Seconded!

I have sometimes seen people/contests focused on writing up specific scenarios for how AI can go wrong starting with our current situation and fictionally projecting into the future. I think the idea is that this can act as an intuition pump and potentially a way to convince people.

I think that is likely net negative given the fact that state-of-the-art AIs are being trained on internet text, and stories where a good agent starts behaving badly are a key component motivating the Waluigi effect.

These sorts of stories still seem worth thinking about, but perha... (read more)

[This comment is no longer endorsed by its author]

This seems to be phrased like a disagreement, but I think you're mostly saying things that are addressed in the original post. It is totally fair to say that things wouldn't go down like this if you stuck 100 actual prisoners or mathematicians or whatever into this scenario. I don't believe OP was trying to claim that it would. The point is just that sometimes bad equilibria can form from everyone following simple, seemingly innocuous rules. It is a faithful execution of certain simple strategic approaches, but it is a bad strategy in situations like this ... (read more)

After reading this, I tried to imagine what an ML system would have to look like if there really were an equivalent of the kind of overhang that was present in evolution. I think that if we try to make the ML analogy such that SGD = evolution, then it would have to look something like: "There are some parameters which update really really slowly (DNA) compared to other parameters (neurons). The difference is like ~1,000,000,000x. Sometimes, all the fast parameters get wiped and the slow parameters update slightly. The process starts over and the fast param... (read more)

You said that multiple people have looked into s-risks and consider them of similar likelihood to x-risks. That is surprising to me and I would like to know more. Would you be willing to share your sources?

9Droopyhammock
S-risks can cover quite a lot of things. There are arguably s-risks which are less bad than x-risks, because although there are astronomical amounts of suffering, it may be dwarfed by the amount of happiness. Using common definitions of s-risks, if we simply took Earth and multiplied it by 1000 so that we have 1000 Earths, identical to ours with the same number of organisms, it would be an s-risk. This is because the amount of suffering would be 1000 times greater.

It seems to me that when people talk about s-risks they often mean somewhat different things. S-risks are not just "I have no mouth and I must scream" scenarios; they can also be things like the fear that we spread wild animal suffering to multiple planets through space colonisation. Because of the differences in definitions people seem to have for s-risks, it is hard to tell what they mean when they talk about the probability of them occurring. This is made especially difficult when they compare them to the likelihood of x-risks, as people have very different opinions on the likelihood of x-risks.

Here are some sources:

From an episode of the AI Alignment Podcast called "Astronomical future suffering and superintelligence with Kaj Sotala": https://futureoflife.org/podcast/podcast-astronomical-future-suffering-and-superintelligence-with-kaj-sotala/

Lucas: Right, cool. At least my understanding is, and you can correct me on this, is that the way that FRI sort of leverages what it does is that ... Within the effective altruism community, suffering risks are very large in scope, but it's also a topic which is very neglected, but also low in probability. Has FRI really taken this up due to that framing, due to its neglectedness within the effective altruism community?

Kaj: I wouldn't say that the decision to take it up was necessarily an explicit result of looking at those considerations, but in a sense, the neglectedness thing is definitely a factor, in that basically no one else seems to be looking

I am very interested in finding more posts/writing of this kind. I really appreciate attempts to "look at the game board" or otherwise summarize the current strategic situation. 

I have found plenty of resources explaining why alignment is a difficult problem and I have some sense of the underlying game-theory/public goods problem that is incentivizing actors to take excessive risks in developing AI anyways. Still, I would really appreciate any resources that take a zoomed-out perspective and try to identify the current bottlenecks, key battlegrounds, local win conditions, and roadmaps in making AI go well.

The skepticism that I object to has less to do with the idea that ML systems are not robust enough to operate robots and more to do with people rationalizing based off of the intrinsic feeling that "robots are not scary enough to justify considering AGI a credible threat". (Whether they voice this intuition or not) 

I agree that having highly capable robots which operate off of ML would be evidence for AGI soon and thus the lack of such robots is evidence in the opposite direction. 

That said, because the main threat from AGI that I am concerned ab... (read more)

My off-the-cuff best guesses at answering these questions:

1. Current day large language models do have "goals". They are just very alien, simple-ish goals that are hard to conceptualize. GPT-3 can be thought of as having a "goal" that is hard to express in human terms, but which drives it to predict the next word in a sentence. Its neural pathways "fire" according to some form of logic that leads it to "try" to do certain things; this is a goal. As systems become more general, they will continue to have goals. Their terminal goals can remain a... (read more)

For personal context: I can understand why a superintelligent system having any goals that aren't my goals would be very bad for me. I can also understand some of the reasons it is difficult to actually specify my goals or train a system to share my goals. There are a few parts of the basic argument that I don't understand as well though. 

For one, I think I have trouble imagining an AGI that actually has "goals" and acts like an agent; I might just be anthropomorphizing too much. 

1. Would it make sense to talk about modern large language models a... (read more)

1Jacob Watts
My off-the-cuff best guesses at answering these questions:

1. Current day large language models do have "goals". They are just very alien, simple-ish goals that are hard to conceptualize. GPT-3 can be thought of as having a "goal" that is hard to express in human terms, but which drives it to predict the next word in a sentence. Its neural pathways "fire" according to some form of logic that leads it to "try" to do certain things; this is a goal. As systems become more general, they will continue to have goals. Their terminal goals can remain as abstract and incomprehensible as whatever GPT-3's goal could be said to be, but they will be more capable of devising instrumental goals that are comprehensible in human terms. 

2. Yes. Anything that intelligently performs tasks can be thought of as having goals. That is just a part of why input x outputs y and not z. The term "goal" is just a way of abstracting the behavior of complex, intelligent systems to make some kind of statement about what inputs correspond to what outputs. As such, it is not coherent to speak about an intelligent system that does not have "goals" (in the broad sense of the word). If you were to make a circuit board that just executes the function x = 3y, that circuit board could be said to have "goals" if you chose to consider it intelligent and use the kind of language that we usually reserve for people to describe it. These might not be goals that are familiar or easily expressible in human terms, but they are still goals in a relevant sense. If we strip the word "goal" down to pretty much just mean "the thing a system inherently tends towards doing", then systems that do things can necessarily be said to have goals.

3. "Tool" and "agent" is not a meaningful distinction past a certain point. A tool with any level of "intelligence" that carries out tasks would necessarily be an agent in a certain sense. Even a thermostat can be correctly thought of as an agent which optimizes for a