All of lukemarks's Comments + Replies

I don't think the point of RLHF ever was value alignment, and I doubt this is what Paul Christiano and others intended RLHF to solve. RLHF might be useful in worlds without capabilities and deception discontinuities (plausibly ours), because we are less worried about sudden ARA, and more interested in getting useful behavior from models before we go out with a whimper.

This theory of change isn't perfect. There is an argument that RLHF was net-negative, and this argument has been had.

My point is that you are assessing RLHF using your model of AI risk, so th... (read more)

I don't understand why Chollet thinks the smart child and the mediocre child are doing categorically different things. Why can't the mediocre child be GPT-4, and the smart child GPT-6? The mechanisms Chollet and others invoke in an effort to explain away the success of deep learning seem to me sufficient to explain what the human brain does too, and it's not clear that a different category of mind will or can ever exist (I don't claim that it can't; I'm just saying that Chollet's distinction is not evidenced).

Chollet points to real shortcomings of modern deep learning systems... (read more)

That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones. 

This is why the neural redshift paper says something different to SLT. It says that neural nets that generalize well don't just have a simplicity bias; they have a bias toward functions of similar complexity to the target function. This calls mesaoptimization into question, because although mesaoptimization is favored by a simplicity bias, it is not necessarily favored by a bias toward the same complexity as the target function.

I think the predictions SLT makes are different from the results in the neural redshift paper. For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this? Maybe you meant that SLT predicts that good generalization occurs when an architecture's preferred complexity matches the target function's complexity?

The explanation you give sounds like a different claim, however.

If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network

... (read more)
2Lucius Bushnaq
It doesn't. It just has neat language to talk about how the simplicity bias is reflected in the way the loss landscape of ReLU vs. tanh look different. It doesn't let you predict ahead of checking that the ReLU loss landscape will look better.

That is closer to what I meant, but it isn't quite what SLT says. The architecture doesn't need to be biased toward the target function's complexity. It just needs to always prefer simpler fits to more complex ones.

SLT says neural network training works because in a good nn architecture simple solutions take up exponentially more space in the loss landscape. So if you can fit the target function on the training data with a fit of complexity 1, that's the fit you'll get. If there is no function with complexity 1 that matches the data, you'll get a fit with complexity 2 instead. If there is no fit like that either, you'll get complexity 3. And so on.

Sorry, I don't understand what you mean here. The paper takes different architectures and compares what functions you get if you pick a point at random from their parameter spaces, right? If you mean this then that claim is of course true. Making up architectures with bad inductive biases is easy, and I don't think common wisdom thinks otherwise.

Sure, but for the question of whether mesa-optimisers will be selected for, why would it matter if the simplicity bias came from the updating rule instead of the architecture?

What would a 'simplicity bias' be other than a bias towards things simpler than random in whatever space we are referring to? 'Simpler than random' is what people mean when they talk about simplicity biases. What do you mean by 'similar complexity to the training set'? The message length of the training set is very likely going to be much longer than the message length of many mesa-optimisers, but that seems like an argument for mesa-optimiser selection if anything. Though I hasten to add that SLT doesn't actually say training prefers solutions with low

Neural Redshift: Random Networks are not Random Functions shows that even randomly initialized neural nets tend to be simple functions (measured by frequency, polynomial order and compressibility), and that this bias can be partially attributed to ReLUs. Previous speculation on simplicity biases focused mostly on SGD, but this is now clearly not the only contributor.
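To make the kind of measurement involved concrete, here is a rough sketch (not from the paper; the zlib-compressibility proxy and the specific widths and depths are my own illustrative choices) of comparing how "simple" randomly initialized ReLU and tanh networks look as functions:

```python
import numpy as np
import zlib

def random_mlp_output(activation, width=64, depth=4, n_points=512, seed=0):
    """Evaluate a randomly initialized MLP on a 1D grid and return its outputs."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_points).reshape(-1, 1)
    h = x
    for _ in range(depth):
        w = rng.normal(0, np.sqrt(2.0 / h.shape[1]), size=(h.shape[1], width))
        h = activation(h @ w)
    w_out = rng.normal(0, np.sqrt(1.0 / width), size=(width, 1))
    return (h @ w_out).ravel()

def compressed_size(y):
    """Crude complexity proxy: zlib size of the coarsely quantized function values."""
    y = (y - y.min()) / (np.ptp(y) + 1e-12)   # normalize to [0, 1]
    q = np.round(y * 255).astype(np.uint8)    # quantize to 8 bits
    return len(zlib.compress(q.tobytes()))

relu = lambda z: np.maximum(z, 0.0)

for name, act in [("relu", relu), ("tanh", np.tanh)]:
    sizes = [compressed_size(random_mlp_output(act, seed=s)) for s in range(20)]
    print(name, np.mean(sizes))
```

This is only a toy proxy for the paper's frequency, polynomial-order and compressibility measures, but it illustrates the basic experiment: sample networks at random and ask how complex the resulting functions are, before any training happens.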

The authors propose that good generalization occurs when an architecture's preferred complexity matches the target function's complexity. We should think about how compatible this is with our p... (read more)

6Lucius Bushnaq
Singular Learning Theory explains/predicts this. If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count. Just because most of the loss landscape is taken up by the biggest, and thus simplest, behavioural regions. You can see this happening if you watch proxies for the effective parameter count while models train. E.g. a modular addition transformer or MNIST MLP start out with very few effective parameters at initialisation, then gain more as the network trains. If the network goes through a grokking transition, you can watch the effective parameter count go down again.

≈ no change I'd say. We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn't work. So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don't generalise out of distribution. SLT just tells us how that works.

One takeaway might be that observations about how biological brains train are more applicable to AI training than one might have previously thought. Previously, you could've figured that since AIs use variants of gradient descent as their updating algorithm, while the brain uses we-don't-even-know-what, their inductive biases could be completely different. Now, it's looking like the updating rule you use doesn't actually matter that much for determining the inductive bias. Anything in a wide class of local optimisation methods might give you pretty similar stuff. Some methods are a lot more efficient than others, but the real pixie fairy dust that makes any of this possible is in the architecture, not the updating rule.

(Obviously, it still matters what loss signal you use. You can't just expect that an AI will converge to learn the same desires a human brain would, unless the AI's training signals are similar to those used by the human brain. And we

I dropped out one month ago. I don't know anyone else who has dropped out. My comment recommends students consider dropping out on the grounds that it seemed like the right decision for me, but it took me a while to realize this was even a choice.

So far my experience has been pleasant. I am ~twice as productive. The total time available to me is ~2.5-3x as much as I had prior. The excess time lets me get a healthy amount of sleep and play videogames without sacrificing my most productive hours. I would make the decision again, and earlier if I could.

More people should consider dropping out of high school, particularly if they:

  • Don't find their classes interesting
  • Have self-motivation
  • Don't plan on going to university

In most places, compulsory schooling ends at an age younger than the typical age of graduation, so past that point you are not legally obligated to attend. Many continue because it's normal, but some brief analysis could reveal that graduating is not worth the investment for you.

Some common objections I heard:

  • It's only  more months, why not finish?

Why finish?

  • What if 'this whole thing' doesn't pan out?

The mis... (read more)

6Viliam
It might be useful to have some test for "have self-motivation", to reduce the number of people who believe they have it, quit school, and then find out they actually don't. Or maybe it's not just whether you feel motivated right now, but how long that feeling stays, on average.
2jmh
I do think you're correct that it would be a good decision for some. I would also say establishing this as a norm might induce some to take the easy way out when it would be a mistake for them. Might be the case that counselors should be prepared to have a real conversation with HS students that come to that decision, but not really make it one schools promote as a path forward. But I do know I was strongly encouraged to complete HS even when I was not really happy with it (and not doing well by many metrics) but recognized as an intelligent kid. I often think I should have just dropped out, got my GED, worked (which I was already doing, and then skipping school often) and then later pursued college (which I also did a few years after I graduated HS). I do feel I probably lost some years playing the expected path game.
0Alexander Gietelink Oldenziel
April is still n=9 months away.

What's the epistemic backing behind this claim, how much data, what kind? Did you do it, how's it gone? How many others do you know of dropping out and did it go well or poorly?

This should be an equivalent problem, yes.

5robo
No, that's a very different problem.  The matrix overlords are Laplace's demon, with god-like omniscience about the present and past.  The matrix overlords know the position and momentum of every molecule in my cup of tea.  They can look up the microstate of any time in the past, for free. The future AI is not Laplace's demon.  The AI is informationally bounded.  It knows the temperature of my tea, but not the position and momentum of every molecule.  Any uncertainties it has about the state of my tea will increase exponentially when trying to predict into the future or retrodict into the past.  Figuring out which water molecules in my tea came from the kettle and which came from the milk is very hard, harder than figuring out which key encrypted a cypher-text.

Yes, your description of my hypothetical is correct. I think it's plausible that approximating things that happened in the past is computationally easier than breaking some encryption, especially if the information about the past is valuable even if it's noisy. I strongly doubt my hypothetical will materialize, but I think it is an interesting problem regardless.

My concern with approaches like the one you suggest is that they're restricted to small parts of the universe, so with enough data it might be possible to fill in the gaps.

Present cryptography becomes redundant when the past can be approximated. Simulating the universe at an earlier point and running it forward to extract information before it's encrypted is a basic but difficult way to do this. For some information the approximation could even be fuzzy and still cause damage if made public. How can you protect information when your adversary can simulate the past?

The information must never exist as plaintext in the past. A bad way to do this is to make the information future-contingent. Perhaps it could be acausally inserted ... (read more)

1robo
I don't think I understand your hypothetical. Is your hypothetical about a future AI which has:

  • Very accurate measurements of the state of the universe in the future
  • A large amount of compute, but not exponentially large
  • Very good algorithms for retrodicting* the past

I think it's exponentially hard to retrodict the past. It's hard in a similar way as encryption is hard. If an AI isn't powerful enough to break encryption, it also isn't powerful enough to retrodict the past accurately enough to break secrets.

If you really want to keep something secret from a future AI, I'd look at ways of ensuring the information needed to theoretically reconstruct your secret is carried away from the earth at the speed of light in infrared radiation. Write the secret in a sealed room, atomize the room to plasma, then cool the plasma by exposing it to the night sky.

*Predicting is using your knowledge of the present to predict the state of the future. Retrodicting is using your knowledge of the present to retrodict the state of the past.

We are recruiting people interested in using Rallypoint in any way. The form has an optional question for what you hope to get out of using Rallypoint. Even if you don't plan on contributing to bounties or claiming them and just want to see how others use Rallypoint we are still interested in your feedback.

Yes. If the feedback from the beta is that people find Rallypoint useful we will do a public release and development will continue. I want to focus on getting the core bounty infrastructure very refined before adding many extra features. Likely said infrastructure would be easily adapted to more typical crowdfunding and a few other applications, but that is lower on the priority list than getting bounties right.

I don't understand the distinction you draw between free agents and agents without freedom. 

If I build an expected utility maximizer with a preference for the presence of some physical quantity, that surely is not a free agent. If I build some agent with the capacity to modify a program which is responsible for its conversion from states of the world to scalar utility values, I assume you would consider that a free agent.

I am reminded of E.T. Jaynes' position on the notion of 'randomization', which I will summarize as "a term to describe a process we ... (read more)

1Michele Campolo
Let's consider the added example: In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics. I am saying that thinking in terms of changing utility functions might be a better framework. The point about learning a safe utility function is similar. I am saying that using the agent's reasoning to solve the agent's problem of what to do (not only how to carry out tasks) might be a better framework. It's possible that there is an elegant mathematical model which would make you think: "Oh, now I get the difference between free and non-free" or "Ok, now it makes more sense to me". Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience. Maybe no mathematical model would make you think the above, but then (if I understand correctly) your objection seems to go in the direction of "Why are we even considering different frameworks for agency? Let's see everything in terms of loss minimisation", and this latter statement throws away too much potentially useful information, in my opinion. 

I believe you misinterpreted the quote from disturbance. They were implying that they would bring about AGI at the last moment before their brain became unsalvageable by AGI, such that it could still be repaired, presumably in expectation of immortality.

I also don't think the perspective that we would likely fail as a civilization without AGI is common on LessWrong. I would guess that most of us would expect a smooth-ish transition to The Glorious Future in worlds where we coordinate around [as in don't build] AI. In my opinion, the post is good even without this claim, however.

3Max H
Ah, you're right that the surrounding text is not an accurate paraphrase of the particular position in that quote. The thing I was actually trying to show with the quotes is that "AGI is necessary for a good future" is a common view, but the implicit and explicit time limits that are often attached to such views might be overly short. I think such views (with attached short time limits) are especially common among those who oppose an AI pause. I actually agree that AGI is necessary (though not sufficient) for a good future eventually. If I also believed that all of the technologies here were as doomed and hopeless as the prospect of near-term alignment of an artificial superintelligence, I would find arguments against an AI pause (indefinite or otherwise) much more compelling.

models that are too incompetent to think through deceptive alignment are surely not deceptively aligned.

Is this true? In Thoughts On (Solving) Deep Deception, Jozdien gives the following example that suggests otherwise to me:

Back in 2000, a computer scientist named Charles Ofria was studying the evolution of simulated organisms. He wanted to limit their replication rate, so he programmed the simulation to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its

... (read more)
5Buck
In this post, I'm talking about deceptive alignment. The threat model you're talking about here doesn't really count as deceptive alignment, because the organisms weren't explicitly using a world model to optimize their choices to cause the bad outcome. AIs like that might still be a problem (e.g. I think that deceptively aligned AI probably contributes less than half of my P(doom from AI)), but I think we should think of them somewhat separately from deceptively aligned models, because they pose risk by somewhat different mechanisms.

unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs

Are you able to provide an example of the kind of thing that would constitute such a theoretical triumph? Or, if not, a maximally close approximation in the form of something that exists currently?

I'm in high school myself and am quite invested in AI safety. I'm not sure whether you're requesting advice for high school as someone interested in LW, or for LW and associated topics as someone attending high school. I will try to assemble a response to accommodate both possibilities.

Absorbing yourself in topics like x-risk can make school feel like a waste of time. This seems to me to be because school is mostly a waste of time (which is a position I held before becoming interested in AI safety), but disengaging from the practice entirely also feels inc... (read more)

I expect agentic simulacra to occur without intentionally simulating them, in that agents are just generally useful for solving prediction problems, and in conducting millions of predictions (as would be expected of a product on the order of ChatGPT, or future successors), it's probable for agentic simulacra to occur. Even if these agents are just approximations, in predicting the behaviors of approximated agents their preferences could still be satisfied in the real world (as described in the Hubinger post).

The problem I'm interested in is how you ensure that all subsequent agentic simulacra (whether occurred intentionally or otherwise) are safe, which seems difficult to verify formally due to the Löbian Obstacle.

Which part specifically are you referring to as being overly complicated? What I take to be the primary assertions of the post are:

  • Simulacra may themselves conduct simulation, and advanced simulators could produce vast webs of simulacra organized as a hierarchy.
  • Simulating an agent is not fundamentally different to creating one in the real world.
  • Due to instrumental convergence, agentic simulacra might be expected to engage in resource acquisition. This could take the shape of 'complexity theft' as described in the post.[1]
  • The Löbian Obstacle accuratel
... (read more)
4Charlie Steiner
I am disagreeing with the underlying assumption that it's worthwhile to create simulacra of the sort that satisfy point 2. I expect an AI reasoning about its successor to not simulate it with perfect fidelity - instead, it's much more practical to make approximations that make the reasoning process different from instantiating the successor.

The issue I have with pivotal act models is that they presume an aligned superintelligence would be capable of bootstrapping its capabilities in such a way that it could perform that act before the creation of the next superintelligence. Soft takeoff seems a very popular opinion now, and isn't conducive to this kind of scheme.

Also, if a large org were planning a pivotal act I highly doubt they would do so publicly. I imagine subtly modifying every GPU on the planet, melting them or doing anything pivotal on a planetary scale such that the resulting world h... (read more)

I have enjoyed your writings both on LessWrong and on your personal blog. I share your lack of engagement with EA and with Hanson (although I find Yudkowsky's writing very elegant and so felt drawn to LW as a result.) If not the above, which intellectuals do you find compelling, and what makes them so by comparison to Hanson/Yudkowsky?

1bhauth
Thanks. My main issues with the early writing on LessWrong were:

  • uncertainty is often more Knightian than Bayesian which makes different things appropriate
  • some criticisms that David Chapman later made seemed obvious
  • unseen correlations are difficult to account for, and some suggestions I saw make that problem worse
  • sometimes "bias" exists for a reason

My main issue with the community was that it seemed to have negative effects on some people and fewer benefits than claimed. My main issue with Yudkowsky was that he seemed overconfident about some things he didn't seem to understand that well.

When I was in elementary school, people asked me who my role model was and I'd reply "Feynman" but I don't think that was true in the sense they meant. It's a common human tendency to want to become exactly like some role model, like a parent or celebrity, but I think it's healthier to try to imitate specific aspects of people and with a limited degree. Yes, maybe there are reasons for everything that you don't understand, but maybe what you'd be imitating is a fictional character. I started reading Feynman in 3rd grade, but it wasn't until later that I realized how different the person was from the character in the books. Kids can try to copy Elon Musk or PewDiePie but that's unlikely to work out for them.

So, in my case, your question is similar to asking what books I liked. The answer would be something like: "The Man Without Qualities, Molecular Biology of the Cell, March's Advanced Organic Chemistry, Wikipedia, Wikipedia, Wikipedia..." - but to quote that famous philosopher Victor Wooten:

In (P2) you talk about a roadblock for RSI, but in (C) you talk about RSI as a roadblock; is that intentional?

This was a typo. 

By "difficult", do you mean something like, many hours of human work or many dollars spent?  If so, then I don't see why the current investment level in AI is relevant.  The investment level partially determines how quickly it will arrive, but not how difficult it is to produce.

The primary implication of the difficulty of a capabilities problem in the context of safety is when said capability will arrive in mo... (read more)

More like: (P1) Currently there is a lot of investment in AI. (P2) I cannot currently imagine a good roadblock for RSI. (C) Therefore, I have more reason to believe RSI will not entail atypically difficult roadblocks than I do to believe it will.

This is obviously a high level overview, and a more in-depth response might cite claims like the fact that RSI is likely an effective strategy for achieving most goals, or mention counterarguments like Robin Hanson's, which asserts that RSI is unlikely due to the observed behaviors of existing >human systems (e.g. corporations).

1[anonymous]

"But what if [it's hard]/[it doesn't]"-style arguments are very unpersuasive to me. What if it's easy? What if it does? We ought to prefer evidence to clinging to an unknown and saying "it could go our way." For a risk analysis post to cause me to update I would need to see "RSI might be really hard because..." and find the supporting reasoning robust.

Given current investment in AI and the fact that I can't conjure a good roadblock for RSI, I am erring on the side of it being easier rather than harder, but I'm open to updating in light of strong counter-reasoning.

1[anonymous]

See:

Defining fascism in this way makes me worry that future fascist figures can hide behind the veil of "But we aren't doing x specific thing (e.g. minority persecution) and therefore are not fascist!"

And:

Is a country that exhibits all symptoms of fascism except for minority group hostility still fascist? 

0dr_s
Fair, but right now, what we're seeing explicitly includes minority group hostility. Besides, while minority hostility may not be the key trait here, it is in fact part of how fascism works. You can't get quite as free rein at being incompetent if you don't have a sacrificial lamb to blame for any and all failures.

Agreed. I have edited that excerpt to be:

It's not obvious to me that selection for loyalty over competence is necessarily more likely under fascism, or necessarily bad. A competent figure who is opposed to democracy would be a considerably more concerning electoral candidate than a less competent one who is loyal to democracy, assuming that democracy is your optimization target.

As in decreases the 'amount of democracy' given that democracy is what you were trying to optimize for.

4the gears to ascension
I would suggest rephrasing to "concerning" to distinguish from "unelectable" as an interpretation of "worse".

Sam Altman, the quintessential short-timeline accelerationist, is currently on an international tour meeting with heads of state, and is worried about the 2024 election. He wouldn't do that if he thought it would all be irrelevant next year.

Whilst I do believe Sam Altman is probably worried about the rise of fascism and its augmentation by artificial intelligence, I don't see this as evidence that he cares about it. Even if he believed a rise in fascism had no likelihood of occurring, it would still be beneficial for him to pursue the international ... (read more)

0dr_s
I mean, that's what makes it "fascism" though rather than generic authoritarianism. People should avoid having gut reactions to their ideology being called out for its excesses, especially here. Same way in which a leftist has to be aware of the extremes of communism, and not immediately recoil upon mention of them, so should a rational conservative know about fascism and try their hardest to avoid falling into its traps.
3the gears to ascension
What do you mean by "worse" here?
2dr_s
Loyalty in general is more important in a centralised system built essentially on violence than a pluralist one built on legitimacy. In a democracy usually you'll have competing forces all holding some weight, and the worst that betrayal can cause is a lost election. In a dictatorship there's only one master to obey and the stakes are quite higher (consider how many "accidents" keep happening to members of the Russian upper echelon these days). Democracies aren't immune from this phenomenon, but it tends to happen more at the party level. For example, in Italy, back in the early 2000s Berlusconi did this, building his own party essentially as an extension of himself and filling it only with incompetent yes men who wouldn't threaten his position. Brexit has done something like it to the Tory party in UK, distilling only the most loyal ones even if it meant purging competent politicians in favour of mindless demagogues. But things like the military, the judiciary and the civil service at least are slow changing enough that they carry the signs of the balance of power throughout the years.

While it's true that Chinese semiconductor fabs are a decade behind TSMC (and will probably remain so for some time), that doesn't seem to have stopped them from building 162 of the top 500 largest supercomputers in the world.

They did this (mostly) before the export regulations were instituted. I'm not sure what the exact numbers are, but both of their supercomputers in the top 10 were constructed before October 2022 (when they were imposed). Also, I imagine that they still might have had a steady supply of cutting-edge chips soon after the export regula... (read more)

9Logan Zoellner
The number 7 supercomputer is built using Chinese natively developed chips, which still demonstrates the quality/quantity tradeoff. Also, saying "sanctions will bite in the future" is only persuasive if you have long timelines (and expect sanctions to hold up over those timelines).  If you think AGI is imminent, or you think sanctions will weaken over time, future sanctions matter less.

Sure, this is an argument 'for AGI', but rarely do people (on this forum at least) reject the deployment of AGI because they feel discomfort in not fully comprehending the trajectory of their decisions. I'm sure that this is something most of us ponder and would acknowledge is not optimal, but if you asked the average LW user to list the reasons they were not for the deployment of AGI, I think that this would be quite low on the list.

Reasons higher on the list for me, for example, would be "literally everyone might die." In light of that, dismissing control ... (read more)

2Christopher James Hart
I agree completely. I am not trying to feed accelerationism or downplay risks, but I am trying to make a few important arguments from the perspective of a 3rd party observer. I wanted to introduce the 'divine move paradox' alongside the evolutionarily ingrained flawed minds argument. I am trying to frame the situation in a slightly different light, far enough outside the general flow to be interesting, but not so far that it does not tie in. I am certainly not trying to say we just turn over control to the first thing that manipulates us properly. I think my original title was poorly chosen when this is meant to bring forward ideas. I edited it to remove 'The Case for AGI'

Soft upvoted your reply, but have some objections. I will respond using the same numbering system you did such that point 1 in my reply will address point 1 of yours. 

  1. I agree with this in the context of short-term extinction (i.e. at or near the deployment of AGI), but would offer that an inability to remain competitive and a loss of control are still likely to end in extinction, just in a less cinematic and instantaneous way. In accordance with this, the potential horizon for extinction-contributing outcomes is expanded massively. Although Yudkowsky is m
... (read more)

The focus of the post is not on this fact (at least not in terms of the quantity of written material). I responded to the arguments made because they comprised most of the post, and I disagreed with them.

If the primary point of the post was "The presentation of AI x-risk ideas results in them being unconvincing to laypeople", then I could find reason in responding to this, but other than this general notion, I don't see anything in this post that expressly conveys why (excluding troubles with argumentative rigor, and the best way to respond to this I can think of is by refuting said arguments).

I disagree with your objections.

"The first argument–paperclip maximizing–is coherent in that it treats the AGI’s goal as fixed and given by a human (Paperclip Corp, in this case). But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, with some kind of “make as many paperclips as you can without decreasing any human’s existence or quality of life by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made"

This argument is essentially addressed by this pos... (read more)

1Seth Herd
I disagree with OPs objections, too, but that's explicitly not the point of this post. OP is giving us an outside take on how our communication is working, and that's extremely valuable. Typically, when someone says you're not convincing them, "you're being dumb" is itself a dumb response. If you want to convince someone of something, making the arguments clear is mostly your responsibility.
4nicholashalden
Thanks for your reply. I welcome an object-level discussion, and appreciate people reading my thoughts and showing me where they think I went wrong.

  1. The hidden complexity of wishes stuff is not persuasive to me in the context of an argument that AI will literally kill everyone. If we wish for it not to, there might be some problems with the outcome, but it won't kill everyone. In terms of Bay Area Lab 9324 doing something stupid, I think by the time thousands of labs are doing this, if we have been able to successfully wish for stuff without catastrophe being triggered, it will be relatively easy to wish for universal controls on the wishing technology.
  2. "Infinite number of possible mesa-optimizers". This feels like just invoking an unknown unknown to me, and then asserting that we're all going to die, and feels like it's missing some steps.
  3. You're wrong about Eliezer's assertions about hacking, he 100% does believe by dint of a VR headset. I quote: "- Hack a human brain - in the sense of getting the human to carry out any desired course of action, say - given a full neural wiring diagram of that human brain, and full A/V I/O with the human (eg high-resolution VR headset), unsupervised and unimpeded, over the course of a day: DEFINITE YES - Hack a human, given a week of video footage of the human in its natural environment; plus an hour of A/V exposure with the human, unsupervised and unimpeded: YES "
  4. I get the analogy of all roads leading to doom, but it's just very obviously not like that, because it depends on complex systems that are very hard to understand, and AI x-risk proponents are some of the biggest advocates of that opacity.

Agreed. I will add a clarifying statement in the introduction.

So if the argument the OT proponents are making is that AI will not self-improve out of fear of jeopardising its commitment to its original goal, then the entire OT is moot, because AI will never risk self-improving at all.

This seems to me to apply only to self improvement that modifies the outcome of decision-making irrespective of time. How does this account for self improvement that only serves to make decision making more efficient? 

If I have some highly inefficient code that finds the sum of two integers by first breaking them up into 10000 small... (read more)
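To make this concrete, here is a toy sketch (hypothetical numbers; the exact decomposition in my example above is beside the point) of the distinction I mean: the rewrite changes how the answer is computed, not which answer is returned.

```python
# Two programs that always return the same answer, differing only in efficiency.

def slow_add(a: int, b: int) -> int:
    """Adds b to a one unit at a time: same output as fast_add, far more steps."""
    total = a
    step = 1 if b >= 0 else -1
    for _ in range(abs(b)):
        total += step
    return total

def fast_add(a: int, b: int) -> int:
    """Direct addition: the 'self-improved' version changes cost, not behaviour."""
    return a + b

assert slow_add(12345, 6789) == fast_add(12345, 6789)
```

Replacing the first function with the second is self-improvement that makes decision-making cheaper without ever altering the outcome of any decision.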

1ArisC
But that's not general intelligence; general intelligence requires considering a wider range of problems holistically, and drawing connections among them. 

I agree with this post almost entirely and strong upvoted as a result. The fact that more effort has not been allocated to the neurotechnology approach already is not a good sign, and the contents of this post do ameliorate that situation in my head slightly. My one comment is that I disagree with this analysis of cyborgism:

Interestingly, Cyborgism appeared to diverge from the trends of the other approaches. Despite being consistent with the notion that less feasible technologies take longer to develop, it was not perceived to have a proportionate impact o

... (read more)

Agreed and edited.

I disagree with your framing of the post. I do not think that this is wishful thinking. 

The first and most obvious issue here is that an AI that "solves alignment" sufficiently well to not fear self-improvement is not the same as an AI that's actually aligned with humans. So there's actually no protection there at all.

It is not certain that upon deployment the first intelligence capable of RSI will be capable of solving alignment. Although this seems improbable in accordance with more classic takeoff scenarios (i.e. Yudkowsky's hard takeoff), the like... (read more)

Strong upvoted this post. I think the intuition is good and that architecture shifts invalidating anti-foom arguments derived from the nature of the DL paradigm is counter-evidence to those arguments, but simultaneously does not render them moot (i.e. I can still see soft takeoff as described by Jacob Cannell to be probable and assume he would be unlikely to update given the contents of this post).

I might try and present a more formal version of this argument later, but I still question the probability of a glass-box transition of type "AGI RSIs toward non... (read more)

1__RicG__
You are basically discussing these two assumptions I made (under "Algorithmic foom (k>1) is possible"), right? But maybe the third assumption is the non-obvious one?

For the sake of discourse: My initial motive to write "Foom by change of paradigm" was to show another previously unstated way RSI could happen. Just to show how RSI could happen, because if your frame of mind is "only compute can create intelligence" foom is indeed unfeasible... but if it is possible to make the paradigm jump then you might just be blind to this path and fuck up royally, as the French say. One key thing that I find interesting is also that this paradigm shift does circumvent the "AIs not creating other AIs because of alignment difficulties"

I am afraid I am not familiar with this hypothesis and google (or ChatGPT) aren't helpful. What do you mean with this and modularity?

P.S. I have now realized that the opposite of a black-box is indeed a glass-box and not a white-box lol. You can't see inside a box of any colour unless it is clear, like glass!

I agree with some of this as a criticism of the idea, but not of the post. Firstly, I stated the same risk you did in the introduction of the post, hence the communication was "Here is an idea, but it has this caveat", and then the response begins with "but it has this caveat".

Next, if the 'bad outcome' scenario looks like most or all parties that receive the email ignoring it/not investigating further, then I see such an email as easily justifiable to send, as it is a low-intensity activity labour-wise with the potential to expand knowledge of x-risks pose... (read more)

4ChristianKl
Channels that are selected as a central open contact method tend to be very crowded and thus not the most efficient channels. The person getting the email might get 2000 emails per day. That sentence will likely get a good portion of the people who receive it to think "This person doesn't know how to write concisely, why should I read another four paragraphs from them?" Generally, handling communication with care means not mass emailing busy people but writing personalized emails and thinking about what those people want to hear. One narrative that could be interesting for a journalist could be: What value would this email have to a journalist? They care about getting readers, you tell them a topic that's good for getting readers. You are making a case that it fits with the other stories they write. You also reduce their work by telling them about XYZ. Fifteen years ago, a good press release was one that a journalist could just copy-paste and put his name on. In the online world there's likely less copy-pasting of press releases than back then, but the principle that you want a few paragraphs that demonstrate that there's a good story that they could easily publish without doing much work likely still holds.

I agree with this sentiment in response to the question of "will this research impact capabilities more than it will alignment?", but not in response to the question of "will this research (if implemented) elevate s-risks?". Partial alignment inflating s-risk is something I am seriously worried about, and prosaic solutions especially could lead to a situation like this.

If your research not influencing s-risks negatively is dependent on it not being implemented, and you think that your research is good enough to post about, don't you see the dilemma here?

It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible. I probably fall into the category of 'wildly overthinking the harms of publishing due to inexperience', but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes and could quickly inform someone of whether or not they might want to show their research to someone more e... (read more)

8Neel Nanda
Empirically, many people new to the field get very paralysed and anxious about fears of doing accidental harm, in a way that I believe has significant costs. I haven't fully followed the specific model you outline, but it seems to involve ridiculously hard questions around the downstream consequences of your work, which I struggle to robustly apply to my work (indirect effects are really hard man!). Ditto, telling someone that they need to ask someone more experienced to sanity check can have significant costs in terms of social anxiety (I personally sure would publish fewer blog posts if I felt a need to run each one by someone like Chris Olah first!) Having significant costs doesn't mean that doing this is bad, per se, but there needs to be major benefits to match these costs, and I'm just incredibly unconvinced that people's first research projects meet these. Maybe if you've gotten a bunch of feedback from more experienced people that your work is awesome? But also, if you're in that situation, then you can probably ask them whether they're concerned.
7habryka
"Irreversible consequences" is not that huge of a deal. The consequences of writing almost any internet comment are irreversible. I feel like you need to argue for also the expected magnitude of the consequences being large, instead of them just being irreversible.

Although I soft upvoted this post, there are some notions I'm uncomfortable with. 

What I agree with:

  • Longtime lurkers should post more
  • Less technical posts are pushing more technical posts out of the limelight
  • Posts that dispute the Yudkowskian alignment paradigm are more likely to contain incorrect information (not directly stated but heavily implied I believe, please correct me if I've misinterpreted)
  • Karma is not an indicator of correctness or of value

The third point is likely due to the fact that the Yudkowskian alignment paradigm isn't a particularly... (read more)

5Max H
Mmm, my intent is not to discourage people from posting views I disagree with, and I don't think this post will have that effect. It's more like, I see a lot of posts that could be improved by grappling more directly with Yudkowskian ideas. To the credit of many of the authors I link, they often do this, though not always as much as I'd like or in ways I think are correct. The part I find lacking in the discourse is pushback from others, which is what I'm hoping to change. That pushback can't happen if people don't make the posts in the first place!

Thank you for the feedback, I have repaired the post introduction in accordance with your commentary on utility functions. I challenge the assumption that a system not being able to reliably simulate an agent with human specifications is worrying, and I would like to make clear that the agenda I am pushing is not:

  1. Capabilities and understanding through simulation scale proportionately
  2. More capable systems can simulate, and therefore comprehend the goals of other systems to a greater extent
  3. By dint of some unknown means we align AGI to this deep understanding of ou
... (read more)