All of David Johnston's Comments + Replies

If you're in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen "normal behaviour" to normal behaviour in your situation. Reinforcement learning is limited - you can't always extrapolate past reward - but it's not obvious that imitative regularisation is fundamentally more limited.

(normal does not imply safe, of course)

6Charlie Steiner
I dunno, I think you can generalize reward farther than behavior. E.g. I might very reasonably issue high reward for winning a game of chess, or arriving at my destination safe and sound, or curing malaria, even if each involved intermediate steps that don't make sense as 'things I might do.' I do agree there are limits to how much extrapolation we actually want, I just think there's a lot of headroom for AIs to achieve 'normal' ends via 'abnormal' means.

Their empirical result rhymes with adversarial robustness issues - we can train adversaries to maximise ~arbitrary functions subject to a small-perturbation-from-ground-truth constraint. Here the maximised function is a faulty reward model, and the constraint is KL to a base model instead of distance to a ground-truth image.

I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that generations look normal at any "scale", whether we look at them token by token or read a... (read more)
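
To make the multiscale idea concrete, here's a rough numpy sketch of one way it might be operationalised - the function name, window sizes and worst-window aggregation are all my own illustrative choices, not anything from the post or paper:

```python
import numpy as np

def multiscale_drift_penalty(logp_policy, logp_base, window_sizes=(1, 8, 64)):
    """Penalise divergence from the base model at several scales.

    logp_policy, logp_base: per-token log-probabilities of the sampled tokens
    under the fine-tuned policy and the frozen base model (equal-length arrays).
    At each scale, take the worst (largest absolute) average log-ratio over
    non-overlapping windows, so a generation can't look locally normal while
    drifting from the base model over longer stretches.
    """
    ratios = np.asarray(logp_policy) - np.asarray(logp_base)  # per-token log-ratio
    per_scale = []
    for w in window_sizes:
        n = len(ratios) // w
        if n == 0:
            continue
        windowed = ratios[: n * w].reshape(n, w).mean(axis=1)
        per_scale.append(np.abs(windowed).max())
    return float(np.mean(per_scale))
```

Single-token KL is the w=1 case; the longer windows are meant to catch drift that only shows up over whole sentences or paragraphs.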

Is your view closer to:

  • there's two hard steps (instruction following, value alignment), and of the two instruction following is much more pressing
  • instruction following is the only hard step; if you get that, value alignment is almost certain to follow
2Seth Herd
The first. Value alignment is much harder. But it will be vastly easier with smarter-than-human help. So there are two difficult steps, and it's clear which one should be tackled first. The difficulty with value alignment is both in figuring out what we actually want, and then figuring out how to make those values stable in a mind that changes as it learns new things.

Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.

There's a regularization problem to solve for 3.9 and 4, and it's not obvious to me that glee will be enough to solve it (3.9 = "unintelligible CoT").

I'm not sure how o1 works in detail, but for example, backtracking (which o1 seems to use) makes heavy use of the pretrained distribution to decide on best next moves. So, at the very least, it's not easy to do away with the native understanding of language. While it's true that there is some amount of data that will enable large divergences from the pretrained distribution - and I could imagine mathematical ... (read more)

3sanxiyn
When I imagine models inventing a language, my imagination is something like Shinichi Mochizuki's Inter-universal Teichmüller theory, invented for his supposed proof of the abc conjecture. It is clearly something like mathematical English, and you could say it is "quite intelligible" compared to "neuralese", but in the end it is not very intelligible.

For what it's worth, one idea I had as a result of our discussion was this:

  • We form lots of beliefs as a result of motivated reasoning
  • These beliefs are amenable to revision due to evidence, reason or (maybe) social pressure
  • Those beliefs that are largely resilient to these challenges are "moral foundations"

So philosophers like "pain is bad" as a moral foundation because we want to believe it + it is hard to challenge with evidence or reason. Laypeople probably have lots of foundational moral beliefs that don't stand up as well to evidence or reason, bu... (read more)

I can explain why I believe bachelors are unmarried: I learned that this is what the word "bachelor" means; I learned this because it is what "bachelor" means; and the fact that there's a word "bachelor" that means "unmarried man" is contingent on some unimportant accidents in the evolution of language. A) it is certainly not the result of an axiomatic game, and B) if moral beliefs were also contingent on accidents in the evolution of language (I think most are not), that would have profound implications for metaethics.

Motivated belief can explain non-purely-se... (read more)

Thanks for your continued engagement.

I’m interested in explaining foundational moral beliefs like suffering is bad, not beliefs like “animals do/don’t suffer”, which is about badness only because we accept the foundational assumption that suffering is bad. Is that clear in the updated text?

Now, I don’t think these beliefs come from playing axiomatic games like “define good as that which increases welfare”. There are many lines of evidence for this. First: “define bad as that which increases suffering” is not equally as plausible as “define good as that whi... (read more)

3cubefox
Problem is, motivated reasoning can only explain selfish beliefs, beliefs which are in accordance with our own motivations. But moral beliefs are often not at all selfish. In contrast, "suffering is bad" could just be part of what "bad" means. No motivated reasoning required. It would be a "foundational belief" in the same sense "Bachelors are unmarried" could be called "foundational".

I think precisely defining "good" and "bad" is a bit beside the point - it's a theory about how people come to believe things are good and bad, and we're perfectly capable of having vague beliefs about goodness and badness. That said, the theory is lacking a precise account of what kind of beliefs it is meant to explain.

The LLM section isn't meant as support for the theory, but speculation about what it would say about the status of "experiences" that language models can have. Compared to my pre-existing notions, the theory seems quite willing to accommodate LLMs having good and bad experiences on par with those that people have.

I have a pedantic and a non-pedantic answer to this. Pedantic: you say X is "usually considered good" if it increases welfare. Perhaps you mean to imply that if X is usually considered good then it is good. In this case, I refer you to the rest of the paragraph you quote.

Non-pedantic: yes, it's true that once you accept some fundamental assumptions about goodness and badness you can go about theorising and looking for evidence. I'm suggesting that motivated reasoning is the mechanism that makes those fundamental assumptions believable.

I added a paragraph mentioning this, because I think your reaction is probably common.

3cubefox
If I believe eating meat is not bad because I engage in motivated reasoning, then this is, like all forms of motivated reasoning, just an irrational belief. But if I believe eating meat is not bad because I believe it doesn't create a significant amount of additional suffering, there is nothing irrational about that belief. So motivated reasoning can only explain (some) irrational beliefs. Not all beliefs about things being good or bad. However, when something being bad means that it decreases some sort of welfare in some general way, then we don't have this problem. Now, what exactly does "welfare" etc mean? That's a question that normative ethicists try to figure out. For example via various proposed theories of utilitarianism. If philosophers are analyzing a subject matter, it's safe to assume they are analyzing some concept. Now, what's a concept? It's a meaning of a word. Like "good" or "bad".

Here's a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.

Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there's a comp... (read more)
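
As a toy numerical illustration of this crude model (the simplicity prior and every number below are my own invention, just to make the shape of the argument concrete): if every policy that fits the training data perfectly keeps posterior mass in proportion to its prior, then the training objective itself does nothing to push mass away from the training-game players.

```python
import numpy as np

# Toy model: policies weighted by a simplicity prior 2^(-complexity).
# Every policy listed here is assumed to fit the training data perfectly,
# so posterior mass is just renormalised prior mass - the objective alone
# doesn't distinguish benign policies from training-game players.

def pathological_posterior_mass(benign_complexities, pathological_complexities):
    prior = lambda ks: np.sum(2.0 ** -np.asarray(ks, dtype=float))
    total = prior(benign_complexities) + prior(pathological_complexities)
    return prior(pathological_complexities) / total

benign = [10, 12, 15]               # a few simple policies that just do the task
pathological = list(range(30, 60))  # a long tail of more complex training-game players
print(pathological_posterior_mass(benign, pathological))  # small but nonzero
```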

Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying "it's not about learning to predict, it's about algorithmic complexity" doesn't make sense. One read of the original is: learning to respect common sense moral side constraints is tricky[1], but AI systems will learn how to do it in the end. I'd be happy to call this read correct, and is consistent with the observation that today's AI systems do respect common sense moral side constraints given straightforward requests, and that it took a few years to figure out h... (read more)

2Noosphere89
My question is: why, exactly, is the statement below true?

When do you think is the right time to work on these issues? Monitoring, trust displacement and fine-grained permission management all look liable to raise issues that weren't anticipated and haven't already been solved, because they're not the way things have been done historically. My gut sense is that GPT-4 performance is much lower when you're asking it to do novel things. Maybe it's also possible to make substantial gains with engineering and experimentation, but you'll need a certain level of performance in order to experiment.

Some wild guesses: maybe... (read more)

The AI system builders’ time horizon seems to be a reasonable starting point

Nora and/or Quentin: you talk a lot about inductive biases of neural nets ruling scheming out, but I have a vague sense that scheming ought to happen in some circumstances - perhaps rather contrived, but not so contrived as to be deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?

What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?

One can easily construct a model with a free parameter X and training data such that many choices of X will match the training data but results will diverge in situations not represented in the training data (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner’s environment later, but hasn’t done so during training). The simplest... (read more)

Another comment on timing updates: if you’re making a timing update for zoonosis vs DEFUSE, and you’re considering a long timing window w_z for zoonosis, then your prior for a DEFUSE leak needs to be adjusted for the short window w_d in which this work could conceivably cause a leak, so you end up with something like p(defuse_pandemic)/p(zoo_pandemic)= rr_d w_d/w_z, where rr_d is the riskiness of DEFUSE vs zoonosis per unit time. Then you make the “timing update” p(now |defuse_pandemic)/p(now |zoo_pandemic) = w_z/w_d and you’re just left with rr_d.
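
To spell out that arithmetic with made-up numbers (the riskiness figure below is purely illustrative):

```python
# Illustrative numbers only.
rr_d = 100.0   # assumed riskiness of DEFUSE-style work relative to zoonosis, per unit time
w_d = 2.0      # years in which that work could plausibly cause a leak
w_z = 80.0     # years of background zoonosis window being considered

prior_odds = rr_d * w_d / w_z      # p(defuse_pandemic) / p(zoo_pandemic)
timing_update = w_z / w_d          # p(now | defuse_pandemic) / p(now | zoo_pandemic)
posterior_odds = prior_odds * timing_update

print(prior_odds, timing_update, posterior_odds)  # the windows cancel, leaving rr_d
```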

2Roko
It's not specifically DEFUSE, it's DEFUSE and all possible related dangerous GoF work which became possible post 2017
1Roko
It doesn't specifically have to be DEFUSE, it just has to be some work which started after the following key events:

  • circa 2011: technology becomes available for dangerous GoF and people start discussing it
  • circa 2018: ban on GoF is lifted

If your theory is: there is a lab leak from WIV while working on DEFUSE-derived work, then I'll buy that you can assign a high probability to time & place … but your prior will be waaaaaay below the prior on "lab leak, nonspecific" (which is how I was originally reading your piece).

2Roko
But we are updating on the timing. Under the null hypothesis we assign equal probability to each year between 1980 and 2060, and they add up to 1. So there is an assumption there that a pandemic will definitely occur starting in china. We should make the same assumption under the alternate hypothesis. The only difference is under AH there's a lab leak. So we just adjust the way the probability is allocated by year. It still has to add up to 100%. So, maybe we'll have a uniform background of 0.1% per year between 1980 and 2060, and then after the 2011 events where people started talking about GoF it increases a bit as GoF is at least possible, then it increases again in 2017 when GoF is funded and greenlit, and after that each year it decreases a little bit, think of it as a hazard rate, once it has happened once people will start being cautious again.

You really think in 60% of cases where country A lifts a ban on funding gain of function research a pandemic starts in country B within 2 years? Same question for “warning published in Nature”.

3Roko
It has to be conditional on a massive global pandemic starting in that country at all, to make a fair comparison with the 2/80 calculation under the null hypothesis. But say we break it down into two parts. (1) probability that the GoF research does have the potential to cause a pandemic and (2) distribution in time of the pandemic after research starts.

If people now don’t have strong views about exactly what they want the world to look like in 1000 years but people in 1000 years do have strong views then I think we should defer to future people to evaluate the “human utility” of future states. You seem to be suggesting that we should take the views of people today, although I might be misunderstanding.

Edit: or maybe you’re saying that the AGI trajectory will be ~random from the point of view of the human trajectory due to a different ontology. Maybe, but different ontology -> different conclusions is ... (read more)

4jessicata
To the extent people now don't care about the long-term future there isn't much to do in terms of long-term alignment. People right now who care about what happens 2000 years from now probably have roughly similar preferences to people 1000 years from now who aren't significantly biologically changed or cognitively enhanced, because some component of what people care about is biological. I'm not saying it would be random so much as not very dependent on the original history of humans used to train early AGI iterations. It would have different data history but part of that is because of different measurements, e.g. scientific measuring tools. Different ontology means that value laden things people might care about like "having good relationships with other humans" are not meaningful things to future AIs in terms of their world model, not something they would care much by default (they aren't even modeling the world in those terms), and it would be hard to encode a utility function so they care about it despite the ontological difference.

Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now.

There seems to be a missing possibility here that I take fairly seriously, which is that human values depend on (collective) life history. That is: human values are substantially determined by collective life history, and rather than converging to some attractor this is a path dependent process. Maybe you can even trace the path taken back to evolutionary history, but it’s substantially medi... (read more)

4jessicata
I think it's possible human values depend on life history too, but that seems to add additional complexity and make alignment harder. If the effects of life history very much dominate those of evolutionary history, then maybe neglecting evolutionary history would be more acceptable, making the problem easier. But I don't think default AGI would be especially path dependent on human collective life history. Human society changes over time as humans supersede old cultures (see section on subversion). AGI would be a much bigger shift than the normal societal shifts and so would drift from human culture more rapidly. Partially due to different conceptual ontology and so on. The legacy concepts of humans would be a pretty inefficient system for AGIs to keep using. Like how scientists aren't alchemists anymore, but a bigger shift than that. (Note, LLMs still rely a lot on human concepts rather than having independent ontology and agency, so this is more about future AI systems)

You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".

A system that can, under normal circumstances, explain how to solve a problem won't necessarily solve a problem that gets in the way of its explaining the solution. The notion of wanting that Nate proposes is "solving problems in order to achieve the objective", and this need not apply to the system that explains solutions. In short: yes.

If we are to understand you as arguing for something trivial, then I think it only has trivial consequences. We must add nontrivial assumptions if we want to offer a substantive argument for risk.

Suppose we have a collection of systems of different ability that can all, under some conditions, solve a task X. Let's say an "X-wrench" is an event that defeats systems of lower ability but not systems of higher ability (i.e. prevents them from solving X).

A system that achieves X with probability p must defeat all X-wrenches but those with a probability of at most 1-p... (read more)
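
Putting the same point as a bound, with X and p as above:

```latex
% Placeholder notation: X is the task, p the success probability, and a
% "wrench" is an event that prevents a given system from solving X.
\Pr(\text{system solves } X) \ge p
\quad\Longrightarrow\quad
\Pr(\text{some wrench the system fails to defeat occurs}) \le 1 - p
```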

Two observations:

  1. If you think that people’s genes would be a lot fitter if people cared about fitness more then surely there’s a good chance that a more efficient version of natural selection would lead to people caring more about fitness.

  2. You might, on the other hand, think that the problem is more related to feedbacks. I.e. if you’re the smartest monkey, you can spend your time scheming to have all the babies. If there are many smart monkeys, you have to spend a lot of time worrying about what the other monkeys think of you. If this is how you’re wo

... (read more)

I can't speak for janus, but my interpretation was that this is due to a capacity budget: it can be favourable to lose a bit of accuracy on token n if you gain more on token n+m. I agree some examples would be great.

there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment

In which section of the linked paper is the strong argument for this conclusion to be found? I had a quick read of it but could not see it - I skipped the long sections of quotes, as the few I read were claims rather than arguments.

2Davidmanheim
I'm not going to try to summarize the arguments here, but it's been discussed on this site for a decade. And those quoted bits of the paper were citing the extensive discussions about this point - that's why there were several hundred citations, many of which were to Lesswrong posts.

I don’t disagree with any of what you say here - I just read Anton as assuming we have a program on that frontier

The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.

I think Anton assumes that we have the simplest program that predicts the world to a given standard, in which case this is not a mistake. He doesn't explicitly say so, though, so I think we should wait for clarification.

But it's a strange assumption; I don't see why the minimum complexity predictor couldn't carry out what we would interpret as RSI in the process of arriving at its prediction.

4DaemonicSigil
The thing about the Pareto frontier of Kolmogorov complexity vs prediction score is that most programs aren't on it. In particular, it seems unlikely that p_1, the seed AI written by humans, is going to be on the frontier. Even p_2, the successor AI, might not be on it either. We can't equivocate between all programs that get the same prediction score; differences between them will be observable in the way they make predictions.

I think he’s saying “suppose p1 is the shortest program that gets at most a given loss. If p2 gets a lower loss, then we must require a longer string than p1 to express p2, and p1 therefore cannot express p2”.

This seems true, but I don’t understand its relevance to recursive self improvement.

I think it means that whatever you get is conservative in cases where it's unsure of whether it's in training, which may translate to being conservative where it's unsure of success in general.

I agree it doesn't rule out an AI that takes a long shot at takeover! But whatever cognition we posit that the AI executes, it has to yield very high training performance. So AIs that think they have a very short window for influence or are less-than-perfect at detecting training environments are ruled out.

An AI that wants something and is too willing to take low-probability shots at takeover (or just wielding influence) would get trained away, no?

What I mean is, however it makes decisions, it has to be compatible with very high training performance.

3RobertM
Probably?  I don't think that addresses the question of what such an AI would do in whatever window of opportunity it has.  I don't see a reason why you couldn't get an AI that has learned to delay its attempt to takeover until it's out of training, but still have relatively low odds of success at takeover.

If I can make my point a bit more carefully: I don’t think this post successfully surfaces the bits of your model that hypothetical Bob doubts. The claim that “historical accidents are a good reference class for existential catastrophe” is the primary claim at issue. If they were a good reference class, very high risk would obviously be justified, in my view.

Given that your post misses this, I don’t think it succeeds as a defence of high P(doom).

I think a defence of high P(doom) that addresses the issue above would be quite valuable.

Also, for what it’s wo... (read more)

1Lauro Langosco
Yeah, I don't think the arguments in this post on their own should convince you that P(doom) is high if you're skeptical. There's lots to say here that doesn't fit into the post, e.g. an object-level argument for why AI alignment is "default-failure" / "disjunctive".
Answer by David Johnston30

There is a situation in which information markets could be positive sum, though I don't know how practical it is:

I own a majority stake in company X. Someone has proposed an action A that company X take, I currently think this is worse than the status quo, but I think it's plausible that with better information I'd change my mind. I set up an exchange of X-shares-conditional-on-A for USD-conditional-on-A and the analogous exchange conditional on not-A, subsidised by some fraction of my X shares using an automatic market maker. If, by the closing date, X-sh... (read more)
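
For concreteness, here's a minimal sketch of the automated-market-maker piece using Hanson's LMSR (the liquidity parameter, the two-outcome framing and the trade below are all illustrative choices of mine, not part of the proposal):

```python
import math

# One conditional market: "does company X do better than the status quo,
# conditional on action A?", run by an LMSR automated market maker.

class LMSR:
    def __init__(self, b: float):
        self.b = b              # liquidity / subsidy parameter
        self.q = [0.0, 0.0]     # net shares sold of each outcome

    def cost(self, q):
        return self.b * math.log(sum(math.exp(x / self.b) for x in q))

    def price(self, i):
        # instantaneous price of outcome i, interpretable as a probability
        denom = sum(math.exp(x / self.b) for x in self.q)
        return math.exp(self.q[i] / self.b) / denom

    def buy(self, i, amount):
        # trader pays the change in the cost function to buy `amount` shares
        new_q = list(self.q)
        new_q[i] += amount
        fee = self.cost(new_q) - self.cost(self.q)
        self.q = new_q
        return fee

market_given_A = LMSR(b=100.0)
print(market_given_A.buy(0, 50.0))   # cost of betting that X does better given A
print(market_given_A.price(0))       # implied probability after the trade
```

One reason LMSR fits the "subsidised by some fraction of my X shares" framing is that the market maker's worst-case loss is bounded (by b times the log of the number of outcomes), so the subsidy you're committing is capped up front.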

I don't see how you get default failure without a model. In fact, I don’t see how you get there without the standard model, where an accident means you get a super intelligence with a random goal from an unfriendly prior - but that’s precisely the model that is being contested!

I can kiiinda see default 50-50 as "model free", though I'm not sure if I buy it.

3Lauro Langosco
It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).

You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper - I mentioned the miscalibration of GPT4 after RLHF, that’s from the GPT4 tech report, and I don’t believe your linked paper shows anything analogous (ie that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration.

“Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and... (read more)

1__RicG__
The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes says that it doesn’t, or it just makes it up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues as I don’t really see them as very relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail of how they tested the calibration, but that’s irrelevant here as I am claiming that sometimes it knows the “right probability” but it generates a made-up one. I don’t see how “say true things when you are asked and you know the true thing” is such a high standard, just because we have already internalised that it’s ok that sometimes GPT says made-up things

I don’t agree. There is a distinction between lying and being confused - when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes.

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite... (read more)

1__RicG__
Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward modelling phase you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). “Lying” could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition). I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:

Me: "[question about thing]"
GPT: "As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing"
Me: "yeah, you know, the thing"
GPT: "Ah, yeah the thing [writes four paragraphs about the thing]"

Fresh example of this: Link (it says the model is the default, but it's not, it's a bug, I am using GPT-4). Maybe it is just perpetuating the bad training data full of misconceptions, or maybe when I correct it I am the one who's wrong and it’s just a sycophant (very common in GPT-3.5 back in February). But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things. Is it safe to hallucinate sometimes? Idk, that could be discussed, but sure as hell it isn’t aligned with what RLHF was meant to align it to. I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky and it samples the wrong token and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the re

I think this is an interesting proposal. It strikes me as something that is most likely to be useful against “scalable deception” (“misinformation”), and given the utility of scalable deception such technologies might be developed anyway. I think you do need to check if this will lead to deception technologies being developed that would not otherwise have been, and if so whether we’re actually better off knowing about them (this is analogous to one of the cases against gain of function research: we might be better off not knowing how to make highly enhanced viruses).

I have a paper (planning to get it on arxiv any day now…) which contains a result: independence of causal mechanisms (which can be related to Occam’s razor & your first point here) + precedent (“things I can do have been done before”) + variety (related to your second point - we’ve observed the phenomena in a meaningfully varied range of circumstances) + conditional independence (which OP used to construct the Bayes net) implies a conditional distribution invariant under action.

That is, speaking very loosely, if you add your considerations to OPs recipe for Bayes nets and the assumption of precedent, you can derive something kinda like interventions.
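
Very loosely, and in my own notation rather than the paper's, the conclusion has the shape:

```latex
% Loose sketch, my notation: under independence of causal mechanisms,
% precedent, variety and the conditional independences used to build the
% Bayes net, the conditional behaves like an intervention:
P\big(Y \mid \operatorname{do}(A = a)\big) \;=\; P\big(Y \mid A = a\big)
```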

3Richard_Kennaway
Did you put this paper anywhere? I didn't find anything on arXiv meeting the description.

Maybe it’s similar, but high U is not necessary

Thanks for explaining the way to do exhaustive search - a big network can exhaustively search smaller network configurations. I believe that.

However, a CPU is not Turing complete (what is Turing universal?) - a CPU with an infinite read/write tape is Turing complete. This matters, because Solomonoff induction is a mixture of Turing machines. There are simple functions transformers can’t learn, such as “print the binary representation of the input + 1”; they run out of room. Solomonoff induction is not limited in this way.

Practical transformers are also usu... (read more)
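
To spell out the "run out of room" point: the function itself is trivial to write down, which is exactly why the limitation is about context length rather than about the function being complicated. A quick sketch (my own framing):

```python
def binary_increment(x: str) -> str:
    """Return the binary representation of (x interpreted as binary) + 1."""
    return bin(int(x, 2) + 1)[2:]

print(binary_increment("1011"))    # -> "1100"
print(binary_increment("1" * 40))  # -> "1" followed by 40 zeros

# A Turing machine (and hence Solomonoff's mixture) handles inputs of any length;
# a transformer with a fixed context window can only realise this function up to
# whatever input length fits in that window.
```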

I think there is an additional effect related to "optimization is not conditioning" that stems from the fact that causation is not correlation. Suppose for argument's sake that people evaluate alignment research partly based on where it's come from (which the machine cannot control). Then producing good alignment research by regular standards is not enough to get high ratings. If a system manages to get good ratings anyway, then the actual papers it's producing must be quite different to typical highly rated alignment papers, because they are somehow compe... (read more)

2Thomas Kwa
I think this is more like Extremal Goodhart in Garrabrant's taxonomy: there's a distributional shift inherent to high U.

they can obviously encode a binary circuit equivalent to a CPU

A CPU by itself is not universal. Are you saying memory augmented neural networks are practically close to universality?

as long as you have enough data (or can generate it ) - big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search

Sorry, I'm being slow here:

  • Solomonoff does exhaustive search for any amount of data; is part of your claim that as data -> infinity, NN + SGD -> Solomonoff?
  • How do we actually do this improved exhaustive search? Do we know that SGD gets us to a global minimum in the end?
4jacob_cannell
Any useful CPU is, by my definition, Turing universal. You can think of Solomonoff as iterating over all programs/circuits by size, evaluating each on all the data, etc. A sufficiently wide NN + SGD can search the full circuit space up to a depth D across the data set in an efficient way (reusing all subcomputations across sparse subcircuit solutions (lottery tickets)).

Neural networks being universal approximators doesn't mean they do as well at distributing uncertainty as Solomonoff, right (I'm not entirely sure about this)? Also, are practical neural nets actually close to being universal?

in the worst case you can recover exhaustive exploration ala solomonoff

Do you mean that this is possible in principle, or that this is a limit of SGD training?

known perhaps experimentally in the sense that the research community has now conducted large-scale extensive (and even often automated) exploration of much of the entire

... (read more)
1jacob_cannell
Trivially so - as in they can obviously encode a binary circuit equivalent to a CPU, and also in practice in the sense that transformers descend from related research (neural turing machines, memory networks, etc) and are universal. I mean in the worst case where you have some function that is actually really hard to learn - as long as you have enough data (or can generate it ) - big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search . Depends on what you mean by "gap" - whether you are measuring inference per unit data or inference per unit compute. There are clearly scenarios where you can get faster convergence via better using/approximating the higher order terms, but that obviously is not remotely sufficient to beat SGD - as any such extra complexity must also pay for itself against cost of compute. Of course if you are data starved, then that obviously changes things.

Do you have a link to a more in-depth defense of this claim?

6jacob_cannell
I mean it's like 4 or 5 claims? So not sure which ones you want more in-depth on, but:

  1. Neural networks are universal is obvious: as arithmetic/analog circuits they fully generalize (reduce to) binary circuits, which are circuit complete.
  2. (A) Full Bayesian Inference and Solomonoff Induction are equivalent - fairly obvious.
  3. (B) Approximate convergence is near guaranteed if the model is sufficiently overcomplete and trained long enough with correct techniques (normalization, regularization, etc) - as in the worst case you can recover exhaustive exploration ala Solomonoff. But SGD on NNs is somewhat exponentially faster than exhaustive program search, as it can explore not a single solution at a time, but a number of solutions (sparse subcircuits embedded in the overcomplete model) that is exponential with NN depth (see lottery tickets, dropout, and sum product networks).
  4. (C) "Differences between that and full Bayesian inference reduce to higher order corrections which rapidly fall off in utility/op." This is known perhaps experimentally in the sense that the research community has now conducted large-scale, extensive (and even often automated) exploration of much of the entire space of higher-order corrections to SGD, and come up with almost nothing much better than stupidly simple, inaccurate but low-cost 2nd-order correction approximations like Adam. (The research community has come up with an endless stream of higher-order optimizers that improve theoretical convergence rate, and near zero that improve wall-time convergence speed.) I do think there is still some room for improvement here, but not anything remotely like "a new category of algorithm".

But part of my claim simply is that modern DL techniques encompass nearly all of optimization that is relevant - it simply ate everything - such that the possibility of some new research track not already considered would be just a nomenclature distinction at this point.

I’m not convinced the indifference conditions are desirable. Shutdown can be evidence of low utility

I can see why feasibility + individual rationality makes a payoff profile more likely than any profile missing one of these conditions, but I can’t see why I should consider every profile satisfying these conditions as likely enough to be worth worrying about

Why? The biggest problem in my mind is algorithmic progress. If we’re outside (C), then the “critical path to TAI” right now is algorithmic progress

Given that outside C approaches to AGI are likely to be substantially unlike anything we’re familiar with, and that controllable AGI is desirable, don’t you think that there’s a good chance these unknown algorithms have favourable control properties?

I think LLMs have some nice control properties too, not so much arguing against LLMs being better than unknown, just the idea that we should confidently expect control to be hard for unknown algorithms.

4Steven Byrnes
I guess you’re referring to my comment “for my part, if I believed that (A)-type systems were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!” I’m not implicitly comparing (A) to “completely unknown mystery algorithm”, instead I’m implicitly comparing (A) to “brain-like AGI, or more broadly model-based RL AGI, or even more broadly some kind of AGI that incorporates RL in a much more central way than LLMs do”. I think, by and large, more RL (as opposed to supervised or self-supervised learning) makes things worse from a safety perspective, and thus I’m unhappy to also believe that more RL is going to be part of TAI.

One of the contentions of this post is that life has thoroughly explored the space of nanotech possibilities. This hypothesis makes the failures of novel nanotech proposals non independent. That said, I don’t think the post offers enough evidence to be highly confident in this proposition (the author might privately know enough to be more confident, but if so it’s not all in the post).

Separately, I can see myself thinking, when all is said and done, that Yudkowsky and Drexler are less reliable about nanotech than I previously thought (which was a modest le... (read more)

0Donald Hobson
All life runs on DNA in particular. Scientists have added extra base pairs and made life forms that work fine. Evolution didn't. DNA is a fairly arbitrary molecule amongst a larger set of similar double-chain organics. I think this post is just dismissing everything with weak reasons. I don't think this post is evidence at all; by conservation of expected evidence, we should take an unusually bad argument against a position as evidence for that position. If nanotech really was impossible, it's likely that better impossibility arguments would exist.

I was just trying to clarify the limits of autoregressive vs other learning methods. Autoregressive learning is at an apparent disadvantage if p(y|x) is hard to compute and the reverse is easy and low entropy. It can “make up for this” somewhat if it can do a good job of predicting y from x, but it’s still at a disadvantage if, for example, that’s relatively high entropy compared to x from y. That’s it, I’m satisfied.
