But do they also generalize out of training distribution more similarly? If so, why?
Neither of them is going to generalize very well out of distribution, and to the extent they do, it will be via looking for features that were present in-distribution. The old adage applies: "to imagine 10-dimensional space, first imagine 3-space, then say 10 really hard".
My guess is that basically every learning system which tractably approximates Bayesian updating on noisy high-dimensional data is going to end up with roughly Gaussian OOD behavior. There have been some experiments ...
adversarial examples definitely still exist but they'll look less weird to you because of the shape bias.
anyway this is a random visual model, raw perception without any kind of reflective error correction loop, I'm not sure what you expect it to do differently, or what conclusion you're trying to draw from how it does behave? the inductive bias doesn't precisely match human vision, so it has different mistakes, but as you scale both architectures they become more similar. that's exactly what you'd expect for any approximately Bayesian setup.
the shape bias...
Scale basically solves this too, with some other additions (not part of any released version of MJ yet) really putting a nail in the coffin, but I can't say too much here w/o divulging trade secrets. I can say that I'm surprised to hear that SD3 is still so much worse than Dalle3, Ideogram on that front - I wonder if they just didn't train it long enough?
They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.
This was found to not be true at scale! It doesn't even feel that true w/weaker vision transformers, seems specific to convnets. I bet smaller animal brains have similar problems.
Order matters more at smaller scales - if you're training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay.
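A toy sketch of the effect (my own illustration, not from the original exchange; the synthetic tasks, architecture, and hyperparameters are arbitrary): train a small network on one task, then on a second task in strictly sequential order with weight decay, and watch accuracy on the first task collapse.

```python
# Toy illustration: sequential (non-random) task ordering plus weight decay
# tends to erase earlier tasks in a small model.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(offset):
    # Synthetic binary classification task: threshold on the (shifted) feature sum.
    x = torch.randn(2000, 20) + offset
    y = (x.sum(dim=1) > offset * 20).long()
    return x, y

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

task_a, task_b = make_task(0.0), make_task(2.0)

# Non-random order: all of task A, then all of task B.
for name, (x, y) in [("A", task_a), ("B", task_b)]:
    for _ in range(500):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    print(f"after task {name}: acc_A={accuracy(model, *task_a):.2f}, acc_B={accuracy(model, *task_b):.2f}")
```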
I think I can just tell a lot of stuff wrt human values! How do you think children infer them? I think in order for human values to not be viable to point to extensionally (ie by looking at a bunch of examples) you have to make the case that they're much more built-in to the human brain than seems appropriate for a species that can produce both Jains and (Genghis Khan era) Mongols.
I'd also note that "incentivize" is probably giving a lot of the game away here - my guess is you can just pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.
Why do you expect it to be hard to specify given a model that knows the information you're looking for? In general the core lesson of unsupervised learning is that often the best way to get pointers to something you have a limited specification for is to learn some other task that necessarily includes it then specialize to that subtask. Why should values be any different? Broadly, why should values be harder to get good pointers to than much more complicated real-world tasks?
yeah I basically think you need to construct the semantic space for this to work, and haven't seen much work on that front from language modeling researchers.
drives me kinda nuts because I don't think it would actually be that hard to do, and the benefits might be pretty substantial.
Can you give an example of a theoretical argument of the sort you'd find convincing? Can be about any X caring about any Y.
...On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It’s also possible this points
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
Nearly every piece of empirical evidence I've seen contradicts this - more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions straightforwardly apply and in fact usually work better than on less intelligent systems.
When I explain my counterargument to pattern 1 to people in person, they will very often try to "rescue" evolution as a worthwhile analogy for thinking about AI development. E.g., they'll change the analogy so it's the programmers who are in a role comparable to evolution, rather than SGD.
In general one should not try to rescue intuitions, and the frequency of doing this is a sign of serious cognitive distortions. You should only try to rescue intuitions when they have a clear and validated predictive or pragmatic track record.
The reason for this is very...
The obvious question here is to what degree you need new techniques, versus merely training new models with the same techniques, as you scale current approaches.
One of the virtues of the deep learning paradigm is that you can usually test things at small scale (where the models are not and will never be especially smart) and there's a smooth range of scaling regimes in between where things tend to generalize.
If you need fundamentally different techniques at different scales, and the large scale techniques do not work at intermediate and small scal...
It's more like calling a human who's as smart as you are and directly plugged into your brain and in fact reusing your world model and train of thought directly to understand the implications of your decision. That's a huge step up from calling a real human over the phone!
The reason the real human proposal doesn't work is that
Note that none of these considerations apply to integrated language models!
To pick a toy example, you can use text as a bottleneck to force systems to "think out loud" in a way which will be very directly interpretable by a human reader, and because language understanding is so rich this will actually be competitive with other approaches and often superior.
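As a very rough sketch of what that bottleneck could look like in code (my own illustration; `query_model` is a hypothetical stand-in for whatever language model you're calling, not a real API):

```python
# Illustrative sketch only: the point is that the *only* channel between the two
# stages is human-readable text, which can be logged and audited by a person.
def query_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real language model call")

def audit_log(text: str) -> None:
    # A human reader (or automated monitor) sees exactly what crosses the bottleneck.
    print("[visible thoughts]", text)

def answer_with_visible_thoughts(question: str) -> str:
    # Stage 1: force the system to externalize its reasoning as plain English.
    thoughts = query_model(
        "Think step by step about the following question, writing out your "
        f"reasoning in plain English:\n{question}"
    )
    audit_log(thoughts)
    # Stage 2: the answer may condition only on the question plus the visible
    # thoughts -- no hidden state is carried across the text boundary.
    return query_model(
        f"Question: {question}\nReasoning:\n{thoughts}\nFinal answer:"
    )
```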
I'm sure you can come up with more ways that the existence of software that understands language and does ~nothing else makes getting computers to do what you mean easier than if software did not understand language. Please think about the problem for 5 minutes. Use a clock.
I appreciate the example!
Are you claiming that this example solves "a major part of the problem" of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?
Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew's claim seems to be 'systems like GPT-4 are grounds for being a lot more optimistic about alignment', and your claim is that systems like these solve "a major part of the pr...
ML models in the current paradigm do not seem to behave coherently OOD, but I'd bet that for nearly any metric of "overall capability" and of alignment, the capability metric decays faster than the alignment metric as we go further OOD.
See https://arxiv.org/abs/2310.00873 for an example of the kinds of things you'd expect to see when taking a neural network OOD. It's not that the model does some insane path-dependent thing; it collapses to entropy. You end up seeing a max-entropy distribution over outputs, not goals. This is a good example of the kind of thing that's o...
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
Everyone agrees that you assumed superintelligence would understand everything humans understand and more. The dispute is entirely about the things that you encounter before superintelligence. In general it seems like the world turned out much more gradual than you expected and there's information to be found in what capabilities emerged sooner in the process.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI's early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common-se...
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
"You very clearly thought that was a major part of the problem" implies that if you could go to Eliezer-2008 and convince him "we're going to solve a lot of NLP a bunch of years before we get to ASI", he would respond with some version of "oh great, that solves a major part of the problem!". Which I'm pretty sure ...
We should clearly care if their arguments were wrong in the past, especially if they were systematically wrong in a particular direction, as it's evidence about how much attention we should pay to their arguments now. At some point if someone is wrong enough for long enough you should discard their entire paradigm and cease to privilege hypotheses they suggest, until they reacquire credibility through some other means e.g. a postmortem explaining what they got wrong and what they learned, or some unambiguous demonstration of insight into the domain they're talking about.
Sure, a stop button doesn't have the issues I described, as long as it's used rarely enough. If it's too commonplace then you should expect similar effects on safety to eg CEQA's effects on infrastructure innovation. Major projects can only take on so much risk, and the more non-technical risk you add the less technical novelty will fit into that budget.
This line from the proposed "Responsible AI Act" seems to go much further than a stop button though?
Require advanced AI developers to apply for a license & follow safety standards.
Where do these saf...
It depends on the form regulation takes. The proposal here requires approval of training runs over a certain scale, which means everything is banned at that scale, including safety techniques, with exceptions decided by the approval process.
What would your plan be to ensure that this kind of regulation actually net-improves safety? The null hypothesis for something like this is that you'll empower a bunch of bureaucrats to push rules that are at least 6 months out of date under conditions of total national emergency where everyone is watching, and years to decades out of date otherwise.
This could be catastrophic! If the only approved safety techniques are as out of date as the only approved medical techniques, AI regulation seems like it should vastly increase P(doom) at the point that TAI is developed.
...Which brings me to my main disagreement with bottom-up approaches: they assume we already have a physics theory in hand, and are trying to locate consciousness within that theory. Yet, we needed conscious observations, and at least some preliminary theory of consciousness, to even get to a low-level physics theory in the first place. Scientific observations are a subset of conscious experience, and the core task of science is to predict scientific observations; this requires pumping a type of conscious experience out of a physical theory, which requires at
This seems especially unlikely to work given it only gives a probability. You know what you call someone whose superintelligent AI does what they want 95% of the time? Dead.
if you can get it to do what you want even 51% of the time and make that 51% independent on each sampling (it isn't, so in practice you'd like some margin, but 95% is actually a lot of margin!) you can get arbitrarily good compliance by creating AI committees and taking a majority vote.
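To spell out the committee arithmetic (a toy check of my own, leaning on the independence assumption above): with per-member reliability p and an odd committee of size n, the chance that a strict majority complies is a binomial tail, and it tends to 1 as n grows for any p > 0.5.

```python
# Majority-vote amplification under the independence assumption from the comment.
from math import comb

def majority_ok(p: float, n: int) -> float:
    """Probability that a strict majority of n independent members comply."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for p in (0.51, 0.95):
    for n in (11, 101, 1001):
        print(f"p={p}, n={n}: P(majority complies) = {majority_ok(p, n):.6f}")
# p=0.51 needs tens of thousands of votes before the majority is reliably
# compliant; p=0.95 is already ~0.99999 at n=11.
```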
that paper is one of many claiming some linear attention mechanism that's as good as full self attention. in practice they're all sufficiently much worse that nobody uses them except the original authors in the original paper, usually not even the original authors in subsequent papers.
the one exception is flash attention, which is basically just a very fancy fused kernel for the same computation (actually the same, up to numerical error, unlike all these "linear attention" papers).
4, 5, and 6 are not separate steps - when you only have 1 example, the bits to find an input that generates your output are not distinct from the bits specifying the program that computes output from input.
Yeah my guess is that you almost certainly fail on step 4 - an example of a really compact ray tracer looks like it fits in 64 bytes. You will not do search over all 64 byte programs. Even if you could evaluate 1 of them per atom per nanosecond using every atom in the universe for 100 billion years, you'd only get 44.6 bytes of search.
Let's go with something more modest and say you get to use every atom in the milky way for 100 years, and it takes about 1 million atom-seconds to check a single program. This gets you about 30 bytes of search.
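Spelling out that arithmetic (the atom counts and timescales are rough order-of-magnitude figures, not precise values):

```python
# Rough order-of-magnitude check of the "bytes of search" numbers above.
from math import log2

def bytes_of_search(n_atoms, seconds, checks_per_atom_second):
    programs_checked = n_atoms * seconds * checks_per_atom_second
    return log2(programs_checked) / 8  # 8 bits per byte

# Every atom in the observable universe (~1e80), 100 billion years,
# one program per atom per nanosecond:
print(bytes_of_search(1e80, 100e9 * 3.15e7, 1e9))   # ~44.6 bytes

# Every atom in the Milky Way (roughly 1e69, a loose estimate), 100 years,
# and ~1e6 atom-seconds per check (i.e. 1e-6 checks per atom-second):
print(bytes_of_search(1e69, 100 * 3.15e7, 1e-6))    # ~30 bytes
```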
Priors over pro...
This response is totally absurd. Your human priors are doing an insane amount of work here - you're generating an argument for the conclusion, not figuring out how you would privilege those hypotheses in the first place.
See that the format describes something like a grid of cells, where each cell has three [something] values.
This seems maybe possible for png (though it could be hard - the pixel data will likely be stored as a single contiguous array, not a bunch of individual rows, and it will be run-length encoded, which you might be able to figure out but very well might not - and if it's jpg compressed this is even further from the truth).
I mean, once you've got your single continuous array it's pretty easy to notice "hey this pattern almost repeats every 1080 triplets". Getting from the raw data stream to your single continuous array might be very simple (if your video ...
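As an illustration of how that "almost repeats every 1080 triplets" observation can be made mechanical (my own sketch on synthetic data): compare the flat byte stream against shifted copies of itself and pick the shift that matches best; the row stride pops out.

```python
# Sketch (synthetic example, not from the original exchange): recover the row
# stride of a flattened RGB buffer by seeing how well the stream matches a
# shifted copy of itself.
import numpy as np

def best_stride(buf: np.ndarray, max_stride: int) -> int:
    """Return the shift (in bytes, multiples of 3) at which buf best matches itself."""
    scores = {}
    for stride in range(3, max_stride, 3):
        diff = np.abs(buf[stride:].astype(int) - buf[:-stride].astype(int))
        scores[stride] = diff.mean()
    return min(scores, key=scores.get)

# Fake frame: 1080 pixels per row, 64 rows, identical rows with a smooth
# horizontal gradient, flattened to a raw RGB byte stream.
w, h = 1080, 64
row = np.linspace(0, 255, w).astype(np.uint8)
frame = np.stack([np.tile(row, (h, 1))] * 3, axis=-1)   # shape (h, w, 3)
flat = frame.reshape(-1)

print(best_stride(flat, max_stride=4000))  # expect w * 3 = 3240 (one image row)
```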
10 million times faster is really a lot - on modern hardware, running SOTA object segmentation models at even 60fps is quite hard, and those are usually much much smaller than the kinds of AIs we would think about in the context of AI risk.
But - 100x faster is totally plausible (especially w/100x the energy consumption!) - and I think the argument still mostly works at that much more conservative speedup.
for me it mostly felt like I and my group of closest friends were at the center of the world, with the last hope for the future depending on our ability to hold to principle. there was a lot of prophecy of varying quality, and a lot of importance placed suddenly on people we barely knew, then rapidly withdrawn when those people weren't up for being as crazy as we were.
This seems roughly on point, but is missing a crucial aspect - whether or not you're currently a hyper-analytical programmer is actually a state of mind which can change. Thinking you're on one side when actually you've flipped can lead to some bad times, for you and others.
I don't know how everyone else on LessWrong feels but I at least am getting really tired of you smugly dismissing others' attempts at moral reductionism wrt qualia by claiming deep philosophical insight you've given outside observers very little reason to believe you have. In particular, I suspect if you'd spent half the energy on writing up these insights that you've spent using the claim to them as a cudgel you would have at least published enough of a teaser for your claims to be credible.
But here Yudkowsky gave a specific model for how qualia, and other things in the reference class "stuff that's pointing at something but we're confused about what", are mistaken for convergently instrumental stuff. (Namely: pointers point both to what they're really trying to point to and, somewhat, to simple things, and simple things tend to be convergently instrumental.) It's not a reduction of qualia, and a successful reduction of qualia would be much better evidence that an unsuccessful reduction of qualia is unsuccessful, but it's still a logically relevant argument and a useful model.
I disagree that GPT’s job, the one that GPT-∞ is infinitely good at, is answering text-based questions correctly. It’s the job we may wish it had, but it’s not, because that’s not the job its boss is making it do. GPT’s job is to answer text-based questions in a way that would be judged as correct by humans or by previously-written human text. If no humans, individually or collectively, know how to align AI, neither would GPT-∞ that’s trained on human writing and scored on accuracy by human judges.
This is actually also an incorrect statement of GPT's jo...
I think this is a persistent difference between us but isn't especially relevant to the difference in outcomes here.
I'd more guess that the reason you had psychoses and I didn't had to do with you having anxieties about being irredeemably bad that I basically didn't have at the time. Seems like this would be correlated with your feeling like you grew up in a Shin Sekai Yori world?
I clearly had more scrupulosity issues than you and that contributed a lot. Relevantly, the original Roko's Basilisk post is putting AI sci-fi detail on a fear I am pretty sure a lot of EAs feel/felt in their heart, that something nonspecifically bad will happen to them because they are able to help a lot of people (due to being pivotal on the future), and know this, and don't do nearly as much as they could. If you're already having these sorts of fears then the abstract math of extortion and so on can look really threatening.
hmm... this could have come down to spending time in different parts of MIRI? I mostly worked on the "world's last decent logic department" stuff - maybe the more "global strategic" aspects of MIRI work, at least the parts behind closed doors I wasn't allowed through, were more toxic? Still feels kinda unlikely but I'm missing info there so it's just a hunch.
By latent tendency I don't mean family history, though it's obviously correlated. I claim that there's this fact of the matter about Jess' personality, biology, etc, which is that it's easier for her to have a psychotic episode than for most people. This seems not plausibly controversial.
I'm not claiming a gears-level model here. When you see that someone has a pattern of <problem> that others in very similar situations did not have, you should assume some of the causality is located in the person, even if you don't know how.
Verbal coherence level seems like a weird place to locate the disagreement - Jessica maintained approximate verbal coherence (though with increasing difficulty) through most of her episode. I'd say even in October 2017, she was more verbally coherent than e.g. the average hippie or Catholic, because she was trying at all.
The most striking feature was actually her ability to take care of herself rapidly degrading, as evidenced by e.g. getting lost almost immediately after leaving her home, wandering for several miles, then calling me for help and having dif...
I want to specifically highlight "A bunch of people we respected and worked with had decided the world was going to end, very soon, uncomfortably soon, and they were making it extremely difficult for us to check their work." I noticed this second-hand at the time, but didn't see any paths toward making things better. I think it had really harmful effects on the community, and is worth thinking a lot about before something similar happens again.
When I got back into town and talked with Jessica, she was talking about how it might be wrong to take actions that might possibly harm others, i.e. pretty much any actions, since she might not learn fast enough for this to come out net positive. Seems likely to me that the content of Jessica's anxious perseveration was partly causally upstream of the anxious perseveration itself.
I agree that a decline in bodily organization was the main legitimate reason for concern. It seems obviously legitimate for Jessica (and me) to point out that Scott is proposing a...
Thanks for giving your own model and description of the situation!
Regarding latent tendency, I don't have a family history of psychosis (but I do of bipolar), although that doesn't rule out latent tendency. It's unclear what "latent tendency" means exactly, it's kind of pretending that the real world is a 3-node Bayesian network (self tendency towards X, environment tendency towards inducing X, whether X actually happens) rather than a giant web of causality, but maybe there's some way to specify it more precisely.
I think the 4 factors you listed are the ...
There's this general problem of Rationalists splitting into factions and subcults with minor doctrinal differences, each composed of relatively elite members of The Community, each with a narrative of how they’re the real rationalists and the rest are just posers and/or parasites. And, they're kinda right. Many of the rest are posers, we have a mop problem.
There’s just one problem. All of these groups are wrong. They are in fact only slightly more special than their rival groups think they are. In fact, the criticisms each group makes of the epistemics and...
Even with that as the goal this model is useless - social distancing demonstrably does not lead to 0 new infections. Even Wuhan didn't manage that, and they were literally welding people's doors shut.
...they're ants. That's just not how ants work. For a myriad of reasons. The whole point of the post is that there isn't necessarily local deliberative intent, just strategies filling ecological niches.
Of course, if you don’t like how an exponential curve fits the data, you can always change models—in this case, probably to a curve with 1 more free parameter (indicating a degree of slowdown of the exponential growth) or 2 more free parameters (to have 2 different exponentials stitched together at a specific point in time).
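A minimal sketch of the two-extra-parameter version (two exponentials stitched at a breakpoint); the synthetic data and starting guesses here are illustrative, and the fit needs sensible initial values since the model is only piecewise smooth in the breakpoint:

```python
# Sketch: fit two exponentials stitched at a breakpoint t0 (2 extra parameters
# relative to a single exponential: the breakpoint and the second growth rate).
import numpy as np
from scipy.optimize import curve_fit

def stitched_exp(t, a, r1, r2, t0):
    """Exponential growth at rate r1 before t0 and rate r2 after, continuous at t0."""
    return np.where(t < t0, a * np.exp(r1 * t), a * np.exp(r1 * t0 + r2 * (t - t0)))

np.random.seed(0)
t = np.arange(60, dtype=float)
y = stitched_exp(t, 2.0, 0.20, 0.07, 30.0) * np.exp(np.random.normal(0, 0.05, t.size))

params, _ = curve_fit(stitched_exp, t, y, p0=(1.0, 0.1, 0.1, t.mean()))
print(params)  # should recover roughly (2.0, 0.20, 0.07, 30.0)
```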
Oh that's actually a pretty good idea. Might redo some analysis we built on top of this model using that.
correct. edited to make this more obvious
This argument would make much more sense in a just world. Information that should damage someone is very different from information that will damage someone. With blackmail you're optimized to maximize damage to the target, and I expect tails to mostly come apart here. I don't see too many cases of blackmail replacing MeToo. When was the last time the National Enquirer was a valuable whistleblower?
EDIT: fixed some wording
When trying to fit an exponential curve, don't weight all the points equally
We didn't. We fit a line in log space, but weighted the points by sqrt(y). We did that because the data doesn't actually appear linear in log space.
This is what it looks like if we don't weight them. If you want to bite the bullet of this being a better fit, we can bet about it.
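For concreteness, the weighting scheme described above looks something like this (a sketch on synthetic data; numpy's polyfit weight convention assumed):

```python
import numpy as np

np.random.seed(0)
x = np.arange(30, dtype=float)
y = 5.0 * np.exp(0.15 * x) + np.random.normal(0, 3, x.size)  # noisy exponential
y = np.clip(y, 1, None)                                      # keep log(y) defined

# np.polyfit minimizes sum((w * (log_y - line(x)))**2), so passing w = sqrt(y)
# up-weights the larger observations relative to an unweighted log-space fit.
slope, intercept = np.polyfit(x, np.log(y), deg=1, w=np.sqrt(y))
print(slope, intercept)  # roughly 0.15 and log(5) ~= 1.6
```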
I'd optimize more for not making enemies or alienating people than for making people realize how bad the situation is or joining your cause.
Why isn't this a fully general argument for never rocking the boat?
Based on my models (such as this one), the chance of AGI "by default" in the next 50 years is less than 15%, since the current rate of progress is not higher than the average rate since 1945, and if anything is lower (the insights model linked has a bias towards listing recent insights).
Both this comment and my other comment are way understating our beliefs about AGI. After talking to Jessica about it offline to clarify our real beliefs rather than just playing games with plausible deniability, my actual probability is between 0.5 and 1% in the next 50 years. Jessica can confirm that hers is pretty similar, but probably weighted towards 1%.
I think I'm more skeptical than you are that it's possible to do much better (i.e., build functional information-processing institutions) before the world changes a lot for other reasons (e.g., superintelligent AIs are invented)
Where do you think the superintelligent AIs will come from? AFAICT it doesn't make sense to put more than 20% on AGI before massive international institutional collapse, even being fairly charitable to both AGI projects and prospective longevity of current institutions.
Huh, I notice I've not explicitly estimated my timeline distribution for massive international institutional collapse, and that I want to do that. Do you have any links to places where others/you have thought about it?
When considering an embedder in a universe, in response to which SADT picks a policy, I would be tempted to apply the following coherence condition:
(all approximately of course)
I'm not sure if this would work though. This is definitely a necessary condition for reasonable counterfactuals, but not obviously sufficient.
I'm fairly interested but don't really want to be around children.
How around is around, and can you say more about what about a baugruppe would satisfy your desiderata that the existing group house network can't?
By censoring I mean a specific technique for forcing the consistency of a possibly inconsistent set of axioms.
Suppose you have a set of deduction rules over a language. You can construct a function that takes a set of sentences and outputs all the sentences that can be proved in one step using those rules and the given sentences. You can also construct a censored version of that function by letting ...
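One way such a censored step could look in code (my own sketch; the string-based negation and the drop-contradictions rule are illustrative assumptions, not necessarily the exact construction meant above):

```python
# One possible censored one-step deduction function: take the ordinary step
# closure, then drop any sentence whose negation was also derived, so the
# output never contains a direct contradiction.
from typing import Callable, FrozenSet

Sentence = str  # e.g. "p", "~p", "q"

def negate(s: Sentence) -> Sentence:
    return s[1:] if s.startswith("~") else "~" + s

def censored(step: Callable[[FrozenSet[Sentence]], FrozenSet[Sentence]]
             ) -> Callable[[FrozenSet[Sentence]], FrozenSet[Sentence]]:
    """Wrap a one-step deduction function so its output never contains a
    sentence together with its negation."""
    def censored_step(sentences: FrozenSet[Sentence]) -> FrozenSet[Sentence]:
        derived = step(sentences)
        return frozenset(s for s in derived if negate(s) not in derived)
    return censored_step

# Toy step function that (unsoundly) derives "q" from "p" and "~q" from "r".
def toy_step(sentences: FrozenSet[Sentence]) -> FrozenSet[Sentence]:
    out = set(sentences)
    if "p" in sentences:
        out.add("q")
    if "r" in sentences:
        out.add("~q")
    return frozenset(out)

print(toy_step(frozenset({"p", "r"})))            # contains both "q" and "~q"
print(censored(toy_step)(frozenset({"p", "r"})))  # the contradictory pair is dropped
```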
What track record?