The quote is somewhat out of context.
Imagine a river with some distribution of flood sizes. Imagine this proposed improvement: a dam which is able to contain 1-year, 5-year and 10-year floods. It is too small for 50-year floods or larger, and may even burst and make the flood worse. I think such a device is not an improvement, and may make things much worse: because of the perceived safety, people may build houses closer to the river, and when a large flood does hit, the damage could be greater.
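(A minimal numerical sketch of this dynamic, with all parameters invented for illustration rather than taken from any real flood model: a heavy-tailed flood distribution, a dam that contains the common floods but fails on the largest ones, and an "exposure" knob standing in for how much gets built in the floodplain once people feel safe.)

```python
import random

# Toy model, not a real flood model: flood severity is heavy-tailed, the dam
# contains the common floods but fails (and slightly worsens things) on the
# largest ones, and `exposure` stands in for how much is built near the river.

def yearly_flood():
    return random.paretovariate(1.2)   # severity in arbitrary units, heavy tail

def damage(severity, dam=False, exposure=1.0):
    if dam:
        if severity < 10:              # dam contains the common floods (~94% of years here)
            return 0.0
        severity *= 1.2                # toy assumption: overtopping/burst makes it worse
    return severity * exposure

def mean_damage(dam, exposure, years=200_000):
    random.seed(0)                     # same flood sequence for every scenario
    return sum(damage(yearly_flood(), dam, exposure) for _ in range(years)) / years

print("no dam, baseline exposure:", round(mean_damage(dam=False, exposure=1.0), 2))
print("dam, baseline exposure:   ", round(mean_damage(dam=True,  exposure=1.0), 2))
# Perceived safety -> more building near the river -> higher exposure:
print("dam, doubled exposure:    ", round(mean_damage(dam=True,  exposure=2.0), 2))
```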
...But I think the prior of not diagonalising against others (a
Sure, or with properly implemented ~anything related to controlling the AI's behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example.
Meaningful representative example in what class? I think it's representative of 'weird stuff may happen', not of 'we will get more teenage-intern-trapped-in-a-machine characters'.
I agree, by "we caught", I mean "the AI company". Probably a poor choice of language.
Which is the probl...
...I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.
My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large
I like this review/retelling a lot.
Minor point
Regarding the "Phase I" and "Phase II" terminology - while it has some pedagogical value, I worry about people interpreting it as a clear temporal decomposition. The implication being we first solve alignment and then move on to Phase II.
In reality, the dynamics are far messier, with some 'Phase II' elements already complicating our attempts to address 'Phase I' challenges.
Some of the main concerning pathways include:
- People attempting to harness superagent-level powers to advance their particular ...
I think 'people aren't paying attention to your work' is a somewhat different situation from the one voiced in the original post. I'm discussing specific ways in which people engage with the argument, as opposed to just ignoring it. It is the baseline that most people ignore most arguments most of the time.
Also, it's probably worth noting that these ways of engaging seem somewhat specific to the crowd over-represented here - in different contexts people are engaging with the argument in different ways.
One structure which makes sense to build in advance for these worlds is emergency response teams. We almost founded one 3 years ago, unfortunately on a never-paid FTX grant. Other funders decided not to fund this (at a level of roughly $200-500k) because, e.g., it did not seem to them that it is useful to prepare for high-volatility periods, while, e.g., pouring tens of millions into evals did.
I'm not exactly tracking to what extent this lack of foresight prevails (my impression is it pretty much does), but I think I can still create something like ALERT with roughly $1M of unrestricted funding.
I'm confused about this response. We explicitly claim that bureaucracies are limited by running on humans, which includes only being capable of actions human minds can come up with and humans are willing to execute (cf. "street-level bureaucrats"). We make the point explicit for states, but it clearly holds for corporate bureaucracies as well.
Maybe it does not shine through the writing, but we spent hours discussing this when writing the paper, and the points you make are 100% accounted for in the conclusions.
I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.
Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:
I went through a bunch of similar thoughts before writing 'The self-unalignment problem'. When we talked about this many years ago with Paul, my impression was that this is actually somewhat cruxy and we disagree about self-unalignment: my mental image is that if you start with an incoherent bundle of self-conflicted values and plug this into an IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott's review of What We Owe the Future where he is worried that in a philosophy game, a smart moral...
I'm quite confused why you think the linked response by Vanessa, which is to something slightly different, has much relevance here.
One of the claims we make, paraphrased & simplified in a way which I hope is closer to your way of thinking about it:
- AIs are mostly not developed and deployed by individual humans
- there are a lot of other agencies or self-interested, self-preserving structures/processes in the world
- if the AIs are aligned to these structures, human disempowerment is likely, because these structures are aligned to humans far less than they seem
-...
I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).
if the AIs are aligned to these structures, human disempowerment is likely, because these structures are aligned to humans far less than they seem
My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single align...
Obviously there is similarity, but if you round character/ground off to simulator/simulacra, it's a mistake. I care about this not because I want to claim originality, but because I want people to get the model right.
The models are overlapping but substantially different, as we explain in this comment, and they sometimes have very different implications - i.e. it is not just the same good idea presented in a different way.
If the long-term impact of the Simulators post were for LW readers to round every similar model in this space to simulator / ...
Just a quick review: I think this is a great text for intuitive exploration of a few topics
- what do embedding spaces look like?
- what do vectors which don't project onto "this is a word" look like?
- how can poetry work, sometimes (projecting non-word meanings)?
Also, I like the genre of thorough phenomenological investigation; it seems under-appreciated.
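(For readers who want to poke at the first two bullets hands-on, here is a minimal sketch with made-up random "embeddings" rather than a real model's; the point is just that a word vector projects cleanly onto its word, while an arbitrary direction in the space has no single close word.)

```python
import numpy as np

# Toy illustration with an invented vocabulary and random unit "embeddings";
# a real exploration would load an actual model's embedding matrix instead.
rng = np.random.default_rng(0)
vocab = ["river", "dam", "flood", "poem", "meaning", "vector", "word", "rain"]
E = rng.normal(size=(len(vocab), 64))
E /= np.linalg.norm(E, axis=1, keepdims=True)          # unit-norm rows

def nearest_words(v, k=3):
    """Rank vocabulary words by cosine similarity to vector v."""
    v = v / np.linalg.norm(v)
    sims = E @ v
    idx = np.argsort(-sims)[:k]
    return [(vocab[i], round(float(sims[i]), 2)) for i in idx]

print(nearest_words(E[vocab.index("poem")]))            # a word vector: similarity 1.0 with itself
midpoint = 0.5 * (E[vocab.index("poem")] + E[vocab.index("rain")])
print(nearest_words(midpoint))                          # a blend lands between its two parents
print(nearest_words(rng.normal(size=64)))               # a random direction: no word is close
```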
(Writing together with Sonnet)
Structural Differences
Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.
Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated).
Nature of the Character/Ground Layer vs Simulator/Simulacra
In the three-layer model, the Character layer is a semi-permanent as...
My impression is that most people who converged on doubting VNM as a norm of rationality also converged on the view that the problem it has in practice is that it isn't necessarily stable under some sort of compositionality/fairness. E.g. Scott here, Richard here.
The broader picture could be something like: yes, there is some selection pressure from the Dutch-book arguments, but there are stronger selection pressures coming from being part of bigger things or being composed of parts.
Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into a similar confusing issue, it gets aggregated, the principal decides how to handle it in the abstract, and the "layer 2" support disseminates the information. So, greater power to scheme would be coupled with a stronger human-in-the-loop component & closer non-AI oversight.
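(A rough sketch of the loop I have in mind; the names, the threshold, and the idea of an `issue_key` standing in for similarity clustering are all mine, purely for illustration.)

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Illustrative sketch only: many Assistant instances report confusing cases,
# similar reports get bucketed, the human principal resolves a bucket once it
# is big enough, and the resolution is disseminated back to all Assistants.

@dataclass
class Layer2Support:
    escalation_threshold: int = 5
    reports: dict = field(default_factory=lambda: defaultdict(list))
    resolutions: dict = field(default_factory=dict)

    def report(self, assistant_id: str, issue_key: str, details: str):
        """Called by an individual Assistant when it hits a confusing case."""
        if issue_key in self.resolutions:            # already decided: just disseminate
            return self.resolutions[issue_key]
        self.reports[issue_key].append((assistant_id, details))
        if len(self.reports[issue_key]) >= self.escalation_threshold:
            self.escalate(issue_key)
        return None

    def escalate(self, issue_key: str):
        """Aggregate similar reports and ask the human principal for an abstract policy."""
        guidance = ask_principal(issue_key, self.reports[issue_key])
        self.resolutions[issue_key] = guidance       # now available to all Assistants

def ask_principal(issue_key, reports):
    # Placeholder for the human-in-the-loop step.
    return f"policy for {issue_key} (decided by principal after {len(reports)} reports)"
```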
Fund independent safety efforts somehow, make model access easier. I'm worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field, just by virtue of hiring such a large part of the field, competence-weighted. (And the other part being very close to Anthropic in thinking.)
To be clear, I don't think people are doing something individually bad or unethical by going to work for Anthropic, I just do think
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuine...
My guess is that a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought.
How did you find this transcript? I think it depends on what process you used to locate it.
It was literally the 4th transcript I read (I've just checked my browser history). The only bit of difference from 'completely random exploration' was that I selected for "lying" cases after reading two "non-lying" transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes discussion of "lying", although it's not a discussion of the model lying, but of Anthropic lying.)
I may try something more systematic at some point, but ...
- Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.
...That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it
The question is not about the very general claim, or the general argument, but about this specific reasoning step:
GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.
And since the task that GPTs are being trained on is different from and harder than the task of being a human, ....
I do claim this is not locally valid, that's all (and I recommend reading the linked essay). I do not claim the broad argument that the text prediction objective doesn't stop...
The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. The high karma and high Alignment Forum karma show that the combination of a famous author and a correct conclusion wins over the argument being correct.
The OP's argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human-level capabilities. This is a valid counter-argument to: "GPTs will cap out at human capabilities because humans generated the training data."
Your central point is:
Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data a human and a GPT are trying to predict, and differences in cognitive architecture and in the ways the systems are bounded.
You are misinterpretin...
There was some selection of branches, and one pass of post-processing.
It was after ~30 pages of a different conversation about AI and LLM introspection, so I don't expect the prompt alone will elicit the "same Claude". The start of this conversation was:
Thanks! Now, I would like to switch to a slightly different topic: my AI safety oriented research on hierarchical agency. I would like you to role-play an inquisitive, curious interview partner, who aims to understand what I mean, and often tries to check understanding using paraphrasing, giving examples, and si...
To add some nuance....
While I think this is a very useful frame, particularly for people who have oppressive legibility-valuing parts, and it is likely something many people would benefit from hearing, I doubt it is great as a descriptive model.
A model in my view closer to reality is: there isn't that sharp a difference between "wants" and "beliefs", and both "wants" and "beliefs" do update.
Wants are often represented by not-very-legible taste boxes, but these boxes do update upon being fed data. To continue an example from the post, let's talk about lit...
Baraka: A guided meditation exploring the human experience; topics like order/chaos, modernity, green vs. other MtG colours.
More than "connected to something in sequences" it is connected to something which straw sequence-style rationality is prone to miss. Writings it has more resonance with are Meditations on Moloch, The Goddess of Everything Else, The Precipice.
There isn't much to spoil: it's 97m long nonverbal documentary. I would highly recommend to watch on as large screen in as good quality you can, watching it on small laptop screen is a waste.&nbs...
Central European experience, which is unfortunately becoming relevant also for the current US: for world-modelling purposes, you should have hypotheses like 'this thing is happening because of a Russian intelligence operation' or 'this person is saying what they are saying because they are a Russian asset' in your prior with nontrivial weights.
I expected a quite different argument for empathy:
1. Argument from simulation: the most important part of our environment is other people; people are very complex and hard to predict; fortunately, we have hardware which is extremely good at 'simulating a human' - our individual brains. To guess what another person will do, or why they are doing what they are doing, it seems clearly computationally efficient to just simulate their cognition on my own brain. Fortunately for empathy, simulations activate some of the same proprioceptive machinery and goal-modeling subage...
My personal impression is that you are mistaken and innovation has not stopped, but part of the conversation has moved elsewhere. E.g., taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary - free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is a high priority to write them up for LW, because they don't fit into the current zeitgeist of the site, which seems to direct a lot of attention mostly to:
- advocacy
- topics a ...
Seems worth mentioning SOTA, which is https://futuresearch.ai/. Based on the competence & epistemics of the Futuresearch team, and the fact that their bot gets very strong but not superhuman performance, I roll to disbelieve that this demo is actually way better and predicts future events at a superhuman level.
Also, I think it is generally bad not to mention or compare to SOTA and to just cite your own prior work. Shame.
I'm skeptical of the 'wasting my time' argument.
A stance like 'going to poster sessions is great for young researchers; I don't do it anymore and just meet friends' is high-status, so, on priors, I would expect people to adopt it more than is optimal.
Realistically, a poster session is ~1.5h, maybe 2h with skimming to decide what to look at. It is relatively common for people in AI to spend many hours per week digesting the news on Twitter. I really doubt the per-hour efficiency of following Twitter is better than that of poster sessions when approached intentionally. (While obviously aimlessly wandering between endless rows of posters is approximately useless.)
I broadly agree with this - we tried to describe a somewhat similar set of predictions in Cyborg periods.
A few thoughts:
- actually, these considerations mostly increase uncertainty and variance about timelines; if LLMs are missing some magic sauce, it is possible that smaller systems with the magic sauce could be competitive, and we could get really powerful systems sooner than Leopold's lines predict
- my take on one important thing which makes current LLMs different from humans is the gap described in Why Simulator AIs want to be Active Inference AIs; while that post intentionally avoids having a detailed scenario part, I think the ontology introduced is better for t...
Agreed we would have to talk more. I think I mostly get the homunculi objection. Don't have time now to write an actual response, so here are some signposts:
- part of what you call agency is explained by a roughly active-inference style of reasoning
-- some types of "living" systems are characterized by having boundaries between them and the environment (boundaries mostly in the sense of separation of variables)
-- maintaining the boundary leads to a need to model the environment
-- modelling the environment introduces a selection pressure toward approximating Bayes
- o...
(crossposted from twitter) Main thoughts:
1. Maps pull the territory
2. Beware what maps you summon
Leopold Aschenbrenner's series of essays is a fascinating read: there are a ton of locally valid observations and arguments. A lot of the content is the type of stuff mostly discussed in private. Many of the high-level observations are correct.
At the same time, my overall impression is the set of maps sketched pulls toward existential catastrophe, and this is true not only for the 'this is how things can go wrong' part, but also for the 'this is h...
You may be interested in 'The self-unalignment problem' for some theorizing https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem
I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right? So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope".
At the same time, I do not think parts of your argument in the post are locally valid or a good justification for the claim.
A correct and locally valid argument for why GPTs are not capped at human level was already writt...
Sorry, but I don't think this should be branded as "FHI of the West".
I don't think you personally or Lightcone share that much intellectual taste with FHI or Nick Bostrom - Lightcone seems firmly in the intellectual tradition of Berkeley, shaped by orgs like MIRI and CFAR. This tradition was often close to FHI's thinking, but also quite often in tension with it. My hot take is that you in particular miss part of the generators of the taste which made FHI different from Berkeley. I sort of dislike the "FHI" brand being used in this way.
edit: To be clear I'm ...
Totally agree, it definitely should not be branded this way if it launches.
I am thinking of "FHI of the West" here basically just as the kind of line directors use in Hollywood to get the theme of a movie across. Like "Jaws in Space" being famously the one-line summary of the movie "Alien".
It also started internally as a joke based on an old story of the University of Michigan in Ann Arbor branding itself as "the Harvard of the West", which was perceived to be a somewhat clear exaggeration at the time (and resulted in Kennedy giving a speech where he described Harvard...
Two notes:
You are exactly right that active inference models which behave in self-interest or in any coherently goal-directed way must have something like an optimism bias.
My guess about what happens in animals and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control theory type circuits' into bodies (even how to build a body from a single cell is an extremely impressive optimization task...). This evolutionarily older circui...
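(A toy sketch of what I mean, with invented numbers: the "optimism bias" is a fixed prediction that the interoceptive variable sits at a healthy setpoint, and acting to reduce the prediction error is what regulates the variable.)

```python
# Toy active-inference-flavoured homeostat. The "optimism bias" is the fixed
# prediction (setpoint) that the interoceptive variable is fine; action exists
# to make that prediction come true. All numbers are invented for illustration.

SETPOINT = 5.0        # "predicted" (desired) glucose level
GAIN = 0.5            # how strongly action responds to prediction error
LEAK = 0.3            # metabolism constantly burns glucose

def step(glucose: float) -> float:
    prediction_error = SETPOINT - glucose        # interoceptive surprise
    action = GAIN * max(prediction_error, 0.0)   # e.g. "eat" proportionally to the error
    return glucose - LEAK + action

glucose = 2.0
for t in range(20):
    glucose = step(glucose)
print(round(glucose, 2))   # settles near the level where action balances the leak
```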
- Too much value and too positive feedback on legibility. Replacing smart illegible computations with dumb legible stuff
- Failing to develop actual rationality and focusing on cultivation of the rationalist memeplex or rationalist culture instead
- Not understanding the problems with the theoretical foundations on which the Sequences are based (confused formal understanding of humans -> confused advice)
+1 on the sequence being among the best things of 2022.
You may enjoy an additional/somewhat different take on this from population/evolutionary biology (and here). (To translate the map, you can think about yourself as a population of 'myselves'. Or, in the opposite direction, from a gene-centric perspective it obviously makes sense to think about the population as a population of selves.)
Part of the irony here is that evolution landed on the broadly sensible solution (geometric rationality). However, afterwards almost every human doing the theory got somewhat confused ...
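(To make the geometric-rationality point concrete, a standard bet-hedging toy example with invented numbers, not something from the reviewed sequence: under repeated multiplicative growth, the strategy with the higher arithmetic mean per generation can still lose badly to the one with the higher geometric mean.)

```python
import random

# Two reproduction strategies per generation (invented numbers):
#   risky:  multiply by 2.2 or by 0.3, each with probability 1/2
#           arithmetic mean 1.25, geometric mean sqrt(2.2*0.3) ~= 0.81
#   hedged: multiply by 1.35 or by 0.95, each with probability 1/2
#           arithmetic mean 1.15, geometric mean sqrt(1.35*0.95) ~= 1.13
random.seed(0)

def grow(factors, generations=500):
    pop = 1.0
    for _ in range(generations):
        pop *= random.choice(factors)
    return pop

print("risky :", grow((2.2, 0.3)))    # typically collapses despite the higher arithmetic mean
print("hedged:", grow((1.35, 0.95)))  # grows, because its geometric mean is above 1
```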
According to this report, Sydney's relatives are alive and well as of last week.