All of Prometheus's Comments + Replies

Sigh. Protests last year, barricading this year; I've already mentally prepared myself for someone next year throwing soup at a human-generated painting while shouting about AI. This is the kind of stuff that makes no one in the Valley want to associate with you. It makes the cause look low-status, unintelligent, lazy, and uninformed.

Just because the average person disapproves of a protest tactic doesn't mean that the tactic didn't work. See Roger Hallam's "Designing the Revolution" series for the thought process underlying the soup-throwing protests. Reasonable people may disagree (I disagree with quite a few things he says), but if you don't know the arguments, any objection is going to miss the point. The series is very long, so here's a tl;dr:

- If the public response is: "I'm all for the cause those protestors are advocating, but I can't stand their methods" notic... (read more)

gilch
Protesters are expected to be at least a little annoying. Strategic unpopularity might be a price worth paying if it gets results. Sometimes extremists shift the Overton Window.

A man asks one of the members of the tribe to find him some kindling so that he may start a fire. A few hours pass, and the second man returns, walking with a large elephant.

“I asked for kindling,” says the first.

“Yes,” says the second.

“Where is it?” asks the first, trying to ignore the large pachyderm in the room.

The second gestures at the elephant, grinning.

“That’s an elephant.”

“I see that you are uninformed. You see, elephants are quite combustible, despite their appearance. Once heat reaches the right temperature, it... (read more)

I strongly doubt we can predict the climate in 2100. Actual prediction would require a model that also incorporates the possibility of nuclear fusion, geoengineering, AGIs altering the atmosphere, etc.

Prometheus

I got into AI at the worst time possible

2023 marks the year AI Safety went mainstream. And though I am happy it is finally getting more attention, and finally has highly talented people who want to work in it, personally it could not have been worse for my professional life. This isn't a thing I normally talk about, because it's a very weird thing to complain about. I rarely permit myself to even complain about it internally. But I can't stop the nagging sensation that if I had just pivoted to alignment research one year sooner than I did, everything woul... (read more)

Thanks for writing this, I think this is a common and pretty rough experience.

Have you considered doing cybersecurity work related to AI safety? I.e., work that would help prevent bad actors from stealing model weights and AIs themselves from escaping. I think this kind of work would likely be more useful than most alignment work.

I'd recommend reading Holden Karnofsky's takes, as well as the recent huge RAND report on securing model weights. Redwood's control agenda might also be relevant. 

I think this kind of work is probably extremely useful, and somewhat neg... (read more)

Keenan Pepper
SAME
Chris_Leong
One option would be to find a role in AI more generally that allows you to further develop your skills whilst also not accelerating capabilities. Another alternative: I suspect that more people should consider working at a normal job three or four days per week and doing AI Safety things on the side one or two days.
[anonymous]

Thanks for sharing your experience here. 

One small thought is that things end up feeling extremely neglected once you index on particular subquestions. Like, on a high-level, it is indeed the case that AI safety has gotten more mainstream.

But when you zoom in, there are a lot of very important topics that have <5 people seriously working on them. I work in AI policy, so I'm more familiar with the policy/governance ones, but I imagine this is also true on the technical side (also, maybe consider swapping to governance/policy!).

Also, especially in hype waves, I... (read more)

It probably began training in January and finished around early April. And they're now doing evals.

Prometheus

My birds are singing the same tune.

Going to the moon

Say you’re really, really worried about humans going to the moon. Don’t ask why, but you view it as an existential catastrophe. And you notice people building bigger and bigger airplanes, and warn that one day, someone will build an airplane that’s so big, and so fast, that it veers off course and lands on the moon, spelling doom. Some argue that going to the moon takes intentionality. That you can’t accidentally create something capable of going to the moon. But you say “Look at how big those planes are getting! We've gone from small figh... (read more)

gwern

But you say “Look at how big those planes are getting! We’ve gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it’s just a matter of time before one of them will land on the moon!”

...And they were right? Humans did land on the moon roughly on that timeline (and as I recall, there were people before the moon landing at RAND and elsewhere who were extrapolating out the exponentials of speed, which was a major reason for such ill-fated projects like the supersonic interceptors fo... (read more)

habryka
It seems pretty likely to me that current AGIs are already scheming. At least it seems like the simplest explanation for things like the behavior observed in this paper: https://www.alignmentforum.org/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through 

"To the best of my knowledge, Vernor did not get cryopreserved. He has no chance to see the future he envisioned so boldly and imaginatively. The near-future world of Rainbows End is very nearly here... Part of me is upset with myself for not pushing him to make cryonics arrangements. However, he knew about it and made his choice."

https://maxmore.substack.com/p/remembering-vernor-vinge 

I agree that consequentialist reasoning is an assumption, and am divided about how consequentialist an ASI might be. Training a non-consequentialist ASI seems easier, and the way we train them seems to actually be optimizing against deep consequentialism (they're rewarded for getting better with each incremental step, not for something that might only be better 100 steps in advance). But, on the other hand, humans don't seem to have been heavily optimized for this either*, yet we're capable of forming multi-decade plans (even if sometimes poorly).

*Actually, the Machiavellian Intelligence Hypothesis does seem to involve optimizing for consequentialist reasoning (if I attack Person A, how will Person B react, etc.)

This is the kind of political reasoning that I've seen poisoning LW discourse lately and that gets in the way of having actual discussions. Will posits essentially an impossibility proof (or, in its more humble form, a plausibility proof). I humor this being true, and state why the implications, even then, might not be what Will posits. The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong". The premise grants that the duration of time it is... (read more)

flandry39
> The summary that Will just posted posits in its own title that alignment is overall plausible "even ASI alignment might not be enough". Since the central claim is that "even if we align ASI, it will still go wrong", I can operate on the premise of an aligned ASI.

The title is a statement of outcome -- not the primary central claim. The central claim of the summary is this: that each (all) ASI is/are in an attraction basin, where they are all irresistibly pulled towards causing unsafe conditions over time. Note there is no requirement for there to be presumed some (any) kind of prior ASI alignment for Will to make the overall summary points 1 thru 9. The summary is about the nature of the forces that create the attraction basin, and why they are inherently inexorable, no matter how super-intelligent the ASI is.

> As I read it, the title assumes that there is a duration of time that the AGI is aligned -- long enough for the ASI to act in the world.

Actually, the assumption goes the other way -- we start by assuming only that there is at least one ASI somewhere in the world, and that it somehow exists long enough for it to be felt as an actor in the world. From this, we can also notice certain forces, which overall have the combined effect of fully counteracting, eventually, any notion of there also being any kind of enduring AGI alignment. Ie, strong relevant mis-alignment forces exist regardless of whether there is/was any alignment at the onset. So even if we did also additionally presuppose that somehow there was also alignment of that ASI, we can, via reasoning, ask if maybe such mis-alignment forces are also way stronger than any counter-force that ASI could use to maintain such alignment, regardless of how intelligent it is.

As such, the main question of interest was: 1; if the ASI itself somehow wanted to fully compensate for this pull, could it do so? Specifically, although to some people it is seemingly fashionable to do so, it is important to not
WillPetillo
Bringing this back to the original point regarding whether an ASI that doesn't want to kill humans but reasons that SNC is true would shut itself down, I think a key piece of context is the stage of deployment it is operating in.

For example, if the ASI has already been deployed across the world, has gotten deep into the work of its task, has noticed that some of its parts have started to act in ways that are problematic to its original goals, and then calculated that any efforts at control are destined to fail, it may well be too late--the process of shutting itself down may even accelerate SNC by creating a context where components that are harder to shut down for whatever reason (including active resistance) have an immediate survival advantage. On the other hand, an ASI that has just finished (or is in the process of) pre-training and is entirely contained within a lab has a lot fewer unintended consequences to deal with--its shutdown process may be limited to convincing its operators that building ASI is a really bad idea. A weird grey area is if, in the latter case, the ASI further wants to ensure no further ASIs are built (pivotal act) and so needs to be deployed at a large scale to achieve this goal.

Another unstated assumption in this entire line of reasoning is that the ASI is using something equivalent to consequentialist reasoning, and I am not sure how much of a given this is, even in the context of ASI.

I'm not sure who you are debating here, but it doesn't seem to be me.

First, I mentioned that this was an analogy, and mentioned that I dislike even using them, which I hope implied I was not making any kind of assertion of truth. Second, "works to protect" was not intended to mean "control all relevant outcomes of". I'm not sure why you would get that idea, but that certainly isn't what I think of first if someone says a person is "working to protect" something or someone. Soldiers defending a city from raiders are not violating control theory or the l... (read more)

flandry39
If soldiers fail to control the raiders in at least preventing them from entering the city and killing all the people, then yes, that would be a failure to protect the city in the sense of controlling relevant outcomes. And yes, organic human soldiers may choose to align themselves with other organic human people, living in the city, and thus to give their lives to protect others that they care about. Agreed that no laws of physics violations are required for that. But the question is if inorganic ASI can ever actually align with organic people in an enduring way.

I read "routinely works to protect" as implying "alignment, at least previously, lasted over at least enough time for the term 'routine' to have been used". Agreed that the outcome -- dead people -- is not something we can consider to be "aligned". If I assume further that the ASI being is really smart (citation needed), and thus calculates rather quickly, and soon, 'that alignment with organic people is impossible' (...between organic and inorganic life, due to metabolism differences, etc), then even the assumption that there was very much of a prior interval during which alignment occurred is problematic. Ie, it does not occur long enough to have been 'routine'. Does even the assumption '*If* ASI is aligned' even matter, if the duration over which that holds is arbitrarily short?

And also, if the ASI calculates that alignment between artificial beings and organic beings is actually objectively impossible, just like we did, why should anyone believe that the ASI would not simply choose to not care about alignment with people, or about people at all, since it is impossible to have that goal anyway, and thus continue to promote its own artificial "life", rather than permanently shutting itself off? Ie, if it cares about anything else at all, if it has any other goal at all -- for example, maybe its own ASI future, or has a goal to make other better even more ASI children, that exceed its own cap

I've heard of many cases like this from EA Funds (including my own). My impression is that they only had one person working full-time managing all three funds (no idea if this has changed since I applied or not).

ChristianKl
Yes, this story sounds like the default way EA Funds operates and not like an outlier. 

An incapable man would kill himself to save the village. A more capable man would kill himself to save the village AND ensure no future werewolves are able to bite villagers again.

Though I tend to dislike analogies, I'll use one, supposing it is actually impossible for an ASI to remain aligned. Suppose a villager cares a whole lot about the people in his village, and routinely works to protect them. Then, one day, he is bitten by a werewolf. He goes to the shaman, who tells him that when the full moon rises again, he will turn into a monster and kill everyone in the village. His friends, his family, everyone. And that he will no longer know himself. He is told there is no cure, and that the villagers would be unable to fight him off. He will grow too strong to be caged, and cannot be subdued or controlled once he transforms. What do you think he would do?

flandry39
How is this not assuming what you want to prove? If you 'smuggle in' the statement of the conclusion "that X will do Y" into the premise, then of course the derived conclusion will be consistent with the presumed premise. But that tells us nothing -- it reduces to a meaningless tautology -- one that is only pretending to be a relevant truth. That Q premise results in Q conclusion tells us nothing new, nothing actually relevant. The analogy story sounds nice, but tells us nothing actually.

Notice also that there are two assumptions. 1; That the ASI is somehow already aligned, and 2; that the ASI somehow remains aligned over time -- which is exactly the conjunction which is the contradiction of the convergence argument. On what basis are you validly assuming that it is even possible for any entity X to reasonably "protect" (ie control all relevant outcomes for) any other cared about entity P? The notion of 'protect' itself presumes a notion of control, and that in itself puts it squarely in the domain of control theory, and thus of the limits of control theory.

There are limits to what can be done with any type of control method -- what can be done with causation. And they are very numerous. Some of these are themselves defined in a purely mathematical way, and hence, are arguments of logic, not just of physical and empirical facts. And at least some of these limits can also be shown to be relevant -- which is even more important.

ASI and control theory both depend on causation to function, and there are real limits to causation. For example, I would not expect an ASI, no matter how super-intelligent, to be able to "disassemble" a black hole. To do this, you would need to make the concept of causation way more powerful -- which leads to direct self contradiction. Do you equate ASI with God, and thus become merely another irrational believer in alignment? Can God make a stone so heavy that "he" cannot move it? Can God do something that God cannot und
WillPetillo
The implication here being that, if SNC (substrate needs convergence) is true, then an ASI (assuming it is aligned) will figure this out and shut itself down?

MIRI "giving up" on solving the problem was probably a net negative to the community, since it severely demoralized many young, motivated individuals who might have worked toward actually solving the problem. An excellent way to prevent pathways to victory is by convincing people those pathways are not attainable. A positive, I suppose, is that many have stopped looking to Yudkowsky and MIRI for the solutions, since it's obvious they have none.

Ben Pace
But it seems like a good thing to do if indeed the solutions are not attainable. Anyway, this whole question seems to be on the wrong level of analysis. You should do what you think works, not what you think doesn't work but might trick others into trying anyway. Added: To be clear, I too found MIRI largely giving up on solving the alignment problem demoralizing. I'm still going to keep working on preventing the end of the world regardless, and I don't at all begrudge them seriously trying for 5-10 years.

I don't think this is the case. For a while, the post with the highest karma was Paul Christiano explaining all the reasons he thinks Yudkowsky is wrong.

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis).

It tends not to get talked about much today, but there was the PDP (connectionist) camp of cognition vs. the camp of "everything else" (including ideas such as symbolic reasoning, etc). The connectionist camp created a rough model of how they thought cognition worked, a lot of cognitive scientists scoffed at it, Hinton tried putting it into actual practice, b... (read more)

"My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena."

 

Such as? I wouldn't call Shard Theory mainstream, and I'm not saying mainstream models are correct either. On humans trying to be consistent decision-makers, I have some theories about that (some of which are probably wrong). But judging by how bad humans are at i... (read more)

Thane Ruthenis
Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis[1]).

Roughly agree, yeah. I kinda want to push back against this repeat characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of the time of either of us. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion.

Agreed. Working on it.

1. ^ Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.

This isn't what I mean. It doesn't mean you're not using real things to construct your argument, but that doesn't mean the structure of the argument reflects something real. Like, I kind of imagine it looking something like a rationalist Jenga tower, where if one piece gets moved, it all crashes down. Except, by referencing other blog posts, it becomes a kind of Meta-Jenga: a Jenga tower composed of other Jenga towers. Like "Coherent decisions imply consistent utilities". This alone I view to be its own mini Jenga tower. This is where I think String Theori... (read more)

Thane Ruthenis
My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena. Again, the drive for consistent decision-making is a good example.

Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.

Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).

I dislike the overuse of analogies in the AI space, but to use your analogy, I guess it's like you keep assigning a team of engineers to build a car, and two possible things happen. Possibility One: the engineers are actually building car engines, which gives us a lot of relevant information for how to build safe cars (torque, acceleration, speed, other car things), even if we don't know all the details for how to build a car yet. Possibility Two: they are actually just building soapbox racers, which doesn't give us much information for building safe cars, but also means that just tweaking how the engineers work won't suddenly give us real race cars.

If progress in AI is continuous, we should expect record levels of employment. Not the opposite.

 

My mentality is if progress in AI doesn't have a sudden, foom-level jump, and if we all don't die, most of the fears of human unemployment are unfounded... at least for a while. Say we get AIs that can replace 90% of the workforce. The productivity surge from this should dramatically boost the economy, creating more companies, more trading, and more jobs. Since AIs can be copied, they would be cheap, abundant labor. This means anything a human can do that ... (read more)

[anonymous]
The inverse argument I have seen on reddit happens if you try to examine how these AI models might work and learn.

One method is to use a large benchmark of tasks, where model capability is measured as the weighted harmonic mean of all tasks. As the models run, much of the information gained doing real-world tasks is added as training and test tasks to the benchmark suite. (You do this whenever a chat task has an output that can be objectively checked, and for robotic tasks you run in lockstep a neural sim similar to Sora that makes testable predictions for future real-world input sets.)

What this means is most models learn from millions of parallel instances of themselves and other models. This means the more models are deployed in the world -- the more labor is automated -- the more this learning mechanism gets debugged, and the faster models learn, and so on.

There are also all kinds of parallel task gains. For example, once models have experience working on maintaining the equipment in a coke can factory, an auto plant, and a 3d printer plant, this variety of tasks with common elements should cause new models trained in sim to gain "general maintenance" skills, at least for machines that are similar to the 3 given. (The "skill" is developing a common policy network that compresses the 3 similar policies down to 1 policy on the new version of the network.)

With each following task, the delta -- the skills the AI system needs to learn that it doesn't already know -- shrinks. This learning requirement likely shrinks faster than the task difficulty increases. (Since the most difficult task is still doable by a human, and also the AI system is able to cheat in a bunch of ways, for example using better actuators to make skilled manual trades easy, or software helpers to best champion Olympiad contestants.)

You have to then look at what barriers there are to AI doing a given task to decide what tasks are protected for a while. Things that just require a human
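A minimal sketch of the aggregation rule the comment describes (my own illustration, not the commenter's code; the function name, scores, and weights are hypothetical). The harmonic mean is dominated by the weakest heavily weighted task, which is why adding real-world tasks to the benchmark keeps pulling training pressure toward whatever the model is still bad at:

```python
# Sketch: model capability as a weighted harmonic mean of per-task benchmark scores.
# Assumes scores are in (0, 1]; all names and numbers are illustrative.

def weighted_harmonic_mean(scores, weights):
    """Weighted harmonic mean: a low score on any heavily weighted task drags the
    aggregate down much more than it would under an arithmetic mean."""
    assert len(scores) == len(weights) and all(s > 0 for s in scores)
    return sum(weights) / sum(w / s for s, w in zip(scores, weights))

# Hypothetical benchmark: three tasks with scores and importance weights.
task_scores = [0.9, 0.6, 0.3]
task_weights = [1.0, 2.0, 1.0]
print(weighted_harmonic_mean(task_scores, task_weights))  # ~0.51, pulled toward the weak task
```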

I think my main problem with this is that it isn't based on anything. Countless times, you just reference other blog posts, which reference other blog posts, which reference nothing. I fear a whole lot of people thinking about alignment are starting to decouple themselves from reality. It's starting to turn into the AI version of String Theory. You could be correct, but given the enormous number of assumptions your ideas are stacked on (and that even a few of those assumptions being wrong leads to completely different conclusions), the odds of you even being in the ballpark of correct seem low.

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. general intelligence, g-factor, the global workspace theory, ... (read more)

habryka
Hmm, I feel sad about this kind of critique. Like, this comment invokes some very implicit standard for posts, without making it at all explicit. Of course neither this post nor the posts they link to are literally "not based on anything". My guess is you are invoking an implicit standard for work to be "empirical" in order to be "anything", but that also doesn't really make sense since there are a lot of empirical arguments in this article and in the linked articles. I think highlighting any specific assumption, or even some set of assumptions that you think is fragile would be helpful. Or being at all concrete about what you would consider work that is "anything". But I think as it stands I find it hard to get much out of comments like this.

At first I strong-upvoted this, because I thought it made a good point. However, upon reflection, that point is making less and less sense to me. You start by claiming current AIs provide nearly no data for alignment, that they are in a completely different reference class from human-like systems... and then you claim we can get such systems with just a few tweaks? I don't see how you can go from a system that, you claim, provides almost no data for studying how an AGI would behave, to suddenly having a homunculus-in-the-box that becomes superintelligent a... (read more)

Thane Ruthenis
Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)?

As all analogies, this one is necessarily flawed, but I hope it gets the point across.

(Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)

Contra One Critical Try:  AIs are all cursed

 

I don't feel like making this a whole blog post, but my biggest source for optimism for why we won't need to one-shot an aligned superintelligence is that anyone who's trained AI models knows that AIs are unbelievably cursed. What do I mean by this? I mean even the first quasi-superintelligent AI we get will have so many problems and so many exploits that taking over the world will simply not be possible. Take a "superintelligence" that only had to beat humans at the very constrained game of Go, which ... (read more)

quetzal_rainbow
Humans are infinitely cursed (see "cognitive biases", or your creationist neighbour); it doesn't change the fact that humans are ruling the planet.
ryan_greenblatt
See also "lack of adversarial robustness is a weapon we can use against AIs" and "catching AIs red-handed".

I'm kind of surprised this has almost 200 karma. This feels much more like a blog post on Substack, and much less like the thoughtful, insightful new takes on rationality that used to get this level of attention on the forum.

habryka

It also isn't my favorite version of this post that could exist, but it seems like a reasonable point to make, and my guess is a lot of people are expressing their agreement with the title by upvoting.

Why would it matter if they notice or not? What are they gonna do? EMP the whole world?

quetzal_rainbow
I think that they shut down the computer on which the unaligned AI is running? Downloading yourself onto the internet is not a one-second process.

I think you're missing the point. If we could establish that all important information had been extracted from the original, would you expect humans to then destroy the original or allow it to be destroyed?

 

My guess is that they wouldn't. Which I think means practicality is not the central reason why humans do this.

Richard_Kennaway
I think you’re missing my point, which is that we cannot establish that. Yes, I’m questioning your hypothetical. I always question hypotheticals.

if we could somehow establish that all information from the original had been extracted, do you expect humans to then destroy the original or allow it to be destroyed?

Richard_Kennaway
No. The original is a historical document that may have further secrets to be revealed by methods yet to be invented. A copy says of the original only what was put into it. Only recently an ancient, charred scroll was first read.

Can humans become Sacred?

On 12 September 1940, the entrance to the Lascaux Cave was discovered on the La Rochefoucauld-Montbel lands by 18-year-old Marcel Ravidat when his dog, Robot, investigated a hole left by an uprooted tree (Ravidat would embellish the story in later retellings, saying Robot had fallen into the cave.)[8][9] Ravidat returned to the scene with three friends, Jacques Marsal, Georges Agnel, and Simon Coencas. They entered the cave through a 15-metre-deep (50-foot) shaft that they believed might be a legendary secret passage to the ne... (read more)

Richard_Kennaway
A practical reason for preserving the original is that new techniques can allow new things to be discovered about it. A copy can embody no more than the observations that we have already made. There's no point to analysing the pigments in a modern copy of a painting, or carbon-dating its frame.

Suppose you've got a strong goal agnostic system design, but a bunch of competing or bad actors get access to it. How does goal agnosticism stop misuse?

 

This was the question I was waiting to be answered (since I'm already basically on board with the rest of it), but was disappointed you didn't have a more detailed answer. Keeping this out of incompetent/evil hands perpetually seems close to impossible. It seems this goes back to needing a maximizer-type force in order to prevent such misuse from occurring, and then we're back to square one of the clas... (read more)

porby
Thanks! I think there are ways to reduce misuse risk, but they're not specific to goal agnostic systems so they're a bit out of scope but... it's still not a great situation. It's about 75-80% of my p(doom) at the moment (on a p(doom) of ~30%). I'm optimistic about avoiding this specific pit. It does indeed look like something strong would be required, but I don't think 'narrow target for a maximizing agent' is usefully strong. In other words, I think we'll get enough strength out of something that's close enough to the intuitive version of corrigible, and we'll reach that before we have tons of strong optimizers of the (automatically) doombringing kind laying around.
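For concreteness (my arithmetic, not stated in the comment): 75-80% of a ~30% total p(doom) corresponds to roughly a 22-24 percentage-point absolute risk attributed to misuse:

$$0.75 \times 0.30 = 0.225, \qquad 0.80 \times 0.30 = 0.24$$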

I created a simple Google Doc for anyone interested in joining/creating a new org to put down their names, contact, what research they're interested in pursuing, and what skills they currently have. Over time, I think a network can be fostered, where relevant people start forming their own research, and then begin building their own orgs/get funding. https://docs.google.com/document/d/1MdECuhLLq5_lffC45uO17bhI3gqe3OzCqO_59BMMbKE/edit?usp=sharing

But it's also an entire school of thought in cognitive science. I feel like DL is the method, but without the understanding that these are based on well-thought-out, mechanistic rules for how cognition fundamentally works, building potentially toward a unified theory of cognition and behaviour.

Iknownothing
What do you think someone who knows about PDP knows that someone with a good knowledge of DL doesn't? And why would it be useful?

I don't have an adequate answer for this, since these models are incomplete. But the way I see it is that these people had a certain way of mathematically reasoning about cognition (Hinton, Rumelhart, McClelland, Smolensky), and that reasoning created most of the breakthroughs we see today in AI (backprop, multi-layered models, etc.) It seems trying to utilize that model of cognition could give rise to new insights about the questions you're asking, attack the problem from a different angle, or help create a grounded paradigm for alignment research to build on.

My answer is a bit vague, but I would say that the current DL curriculum tells you how these things work, but it doesn't go into the reasoning about cognition that allowed these ideas to exist in the first place.

You could say it "predicted" everything post-AlexNet, but it's more that it created the fundamental understanding for everything post-AlexNet to exist in the first place. It's the mathematical models of cognition that all of modern AI is built on. This is how we got backpropagation, "hidden" layers, etc.

If you want to try to start doing this, or if you know someone who does, let me know. I've noticed a lot of things in AIS that people say they'd like to see, but then nothing happens.

avturchin
I think I can't do it alone. But actually I submitted one grant proposal which includes exploration of this idea.

I guess my biggest doubt is that a DL-based AI could run interpretability on itself. Large NNs seem to "simulate" a larger network to represent more features, which results in most of the weights being in superposition. I don't see how a network could reflect on itself, since it seems that would require an even greater network (which then would require an even greater network, and so on). I don't see how it could eat its own tail, since only interpreting parts of the network would not be enough. It would have to interpret the whole.
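A toy numerical sketch of the superposition picture referenced above (my own illustration, assuming the standard "more feature directions than dimensions" framing; none of the numbers come from the comment). Packing many features into a small space makes any single readout pick up interference from the others, which is part of why per-weight interpretation is hard:

```python
# Toy demo: 512 feature directions squeezed into 64 dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512
features = rng.normal(size=(n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit feature directions

# Activate a sparse subset of features, then try to read each feature back by projection.
active = rng.choice(n_features, size=8, replace=False)
x = features[active].sum(axis=0)
readout = features @ x  # projection of the activation onto every feature direction

print("mean |readout| of active features:  ", np.abs(readout[active]).mean())
print("mean |readout| of inactive features:", np.abs(np.delete(readout, active)).mean())
# Active features stand out, but inactive ones get nonzero interference terms:
# the signal is spread across shared directions rather than stored in separate weights.
```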

__RicG__
Uhm, by interpretability I mean things like this, where the algorithm that the NN implements is reverse engineered and written down as code or whatever, which would allow for easier recursive self-improvement (by improving just the code and getting rid of the spaghetti NN). Also, by the looks of things (induction heads and circuits in general) there does seem to be a sort of modularity in how NNs learn, so it does seem likely that you can interpret it piece by piece. If this wasn't true I don't think mechanistic interpretability as a field would even exist.

The following is a conversation between myself in 2022, and a newer version of myself earlier this year.
 

On AI Governance and Public Policy

2022 Me: I think we will have to tread extremely lightly here, or, if possible, avoid this completely. One particular concern is the idea of gaining public support. Many countries have an interest in pleasing their constituents, so if executed well, this could be extremely beneficial. However, it runs a high risk of doing far more damage. One major concern is the different mindset needed to conceptualize the problem. Aler... (read more)

[crossposting my reply]

Thank you for taking the time to read and critique this idea. I think this is very important, and I appreciate your thoughtful response.

Regarding how to get current systems to implement/agree to it, I don't think that will be relevant long-term. I don't think the mechanisms current institutions use for control can keep up with AI proliferation. I imagine most existing institutions will still exist, but won't have the capacity to do much once AI really takes off. My guess is, if AI kills us, it will happen after a slow-motion coup. Not... (read more)

"If the boxed superintelligence with the ability to plan usage of weapons when authorized by humans, and other boxed superintelligences able to control robotics in manufacturing cells are on humans side, the advantage for humans could be overwhelming"

As I said, I do not expect boxing AIs to be a thing most developers will do. We haven't seen it, and I don't expect to see it, because unboxed AIs are superior. This isn't how the people in control are approaching the situation, and I don't expect that to change.

[anonymous]
My definition of "box" may be very different from yours. In my definition, locked weights and training only on testing, as well as other design elements such as distribution detection, heavily box the model's capabilities and behavior. See https://www.lesswrong.com/posts/a5NxvzFGddj2e8uXQ/updating-drexler-s-cais-model?commentId=AZA8ujssBJK9vQXAY It is fine if the model can access the internet, robotics, etc. so long as it lacks the context information to know it's on the real thing vs a sim or cached copy.

"keep it relegated to "tool" status, then it might be possible to use such an AI to combat unboxed, rogue AI"

I don't think this is a realistic scenario. You seem to be seeing it as an island of rogue, agentic, "unboxed" AIs in a sea of tool AIs. I think it's much, much more realistic that it'll be the opposite. Most AIs will be unboxed agents because they are superior. 

"For example, give it a snapshot of the internet from a day ago, and ask it to find the physical location of rogue AI servers, which you promptly bomb."

This seems to be approaching it f... (read more)

Are you familiar with Constellation's Proof of Reputable Observation? This seems very similar.

The following is a conversation between myself in 2022, and a newer version of me earlier this year.

On the Nature of Intelligence and its "True Name":

2022 Me:  This has become less obvious to me as I’ve tried to gain a better understanding of what general intelligence is. Until recently, I always made the assumption that intelligence and agency were the same thing. But General Intelligence, or G, might not be agentic. Agents that behave like RLs may only be narrow forms of intelligence, without generalizability. G might be something closer to a simula... (read more)

Thanks, finding others who are working on similar things is very useful. Do you know if the reading group is still active, or if they are working on anything new?

Given that I don't know when Schelling Day is, I doubt its existence.

If we're being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think "omg, it's just a Chatbot". Any politician calling for it would get made fun of on every Late Night show.

I'm almost certain this is already criminal, to the extent it's actually dangerous. If you roll a boulder down the hill, you're up for manslaughter if it kills someone, and reckless endangerment if it could've but didn't hurt anyone. It doesn't matter if it's a boulder or software; if you should've known it was dangerous, you're criminally liable.

In this particular case, I have mixed feelings. This demonstration is likely to do immense good for public awareness of AGI risk. It even did for me, on an emotional level I haven't felt before. But it's also impo... (read more)

Yeah, all the questions over the years of "why would the AI want to kill us" could be answered with "because some idiot thought it would be funny to train an AI to kill everyone, and it got out of hand". Unfortunately, stopping everyone on the internet from doing things isn't realistic. It's much better to never let the genie out of the bottle in the first place.

I'm currently thinking that if there are any political or PR resources available to orgs (AI-related or EA) now is the time to use them. Public interest is fickle, and currently most people don't seem to know what to think, and are looking for serious-seeming people to tell them whether or not to see this as a threat. If we fail to act, someone else will likely hijack the narrative, and push it in a useless or even negative direction. I don't know how far we can go, or how likely it is, but we can't assume we'll get another chance before the public falls b... (read more)

Yeah, since the public currently doesn't have much of an opinion on it, trying to get the correct information out seems critical. I fear some absolutely useless legislation will get passed, and everyone will just forget about it once the shock-value of GPT wears off.

Unfortunately, he could probably get this published in various journals, with only minor edits being made.
