All of Jan_Kulveit's Comments + Replies

According to this report, Sydney's relatives are alive and well as of last week.

The quote is somewhat out of context.

Imagine a river with some distribution of flood sizes. Imagine this proposed improvement: a dam which is able to contain 1-year, 5-year and 10-year floods. It is too small for 50-year floods or larger, and may even burst and make the flood worse. I think such a device is not an improvement, and may make things much worse - because of the perceived safety, people may build houses close to the river, and when the large flood hits, the damage could be larger.
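To make the expected-damage intuition concrete, here is a minimal toy sketch; the flood sizes, probabilities, and exposure numbers are made-up assumptions for illustration, not part of the original analogy.

```python
# Toy expected-damage calculation for the dam analogy. The flood sizes,
# probabilities, and exposure numbers below are illustrative assumptions;
# the point is only the qualitative mechanism: a dam that contains small
# floods but not large ones, combined with more building near the river,
# can raise expected damage. (The "dam bursts" effect is omitted.)

# (annual probability, flood size) for 1-, 5-, 10-, 50- and 100-year floods
FLOODS = [(1.0, 1), (0.2, 3), (0.1, 5), (0.02, 20), (0.01, 40)]


def expected_annual_damage(dam_capacity: float, exposure: float) -> float:
    """Damage = uncontained flood size times how much is built near the river."""
    return sum(p * max(size - dam_capacity, 0) * exposure for p, size in FLOODS)


# No dam: people stay cautious (exposure 1.0).
no_dam = expected_annual_damage(dam_capacity=0, exposure=1.0)
# Dam contains up to 10-year floods, but perceived safety leads to 5x more
# houses near the river.
small_dam = expected_annual_damage(dam_capacity=5, exposure=5.0)

print(f"expected annual damage, no dam:    {no_dam:.2f}")    # 2.90
print(f"expected annual damage, small dam: {small_dam:.2f}")  # 3.25
```

Whether the dam helps or hurts depends entirely on how much the perceived safety changes behaviour, which is the crux of the analogy.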
 

But I think the prior of not diagonalising against others (a

... (read more)
4kave
I do not think your post is arguing for creating warning shots. I understand it to be advocating for not averting warning shots. To extend your analogy, there are several houses that are built close to a river, and you think that a flood is coming that will destroy them. You are worried that if you build a dam that would protect the houses currently there, then more people will build by the river and their houses will be flooded by even bigger floods in the future. Because you are worried people will behave in this bad-for-them way, you choose not to help them in the short term. (The bit I mean to point to by "diagonalising" is the bit where you think about what you expect they'll do, and which mistakes you think they'll make, and plan around that).

Sure, or with properly implemented ~anything related to controlling the AIs behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example.


Meaningful representative example in what class: I think it's representative of 'weird stuff may happen', not of 'we will get more teenage-intern-trapped-in-a-machine characters'.

I agree, by "we caught", I mean "the AI company". Probably a poor choice of language.

Which is the probl... (read more)

2Noosphere89
IMO, the discontinuity that is sufficient here is that I expect societal responses to be discontinuous, rather than continuous, and in particular, I expect societal responses will come when people start losing jobs en masse, and at that point, either the AI is aligned well enough that existential risk is avoided, or the takeover has inevitably happened and we have very little influence over the outcome. On this point: Yeah, I expect society to basically not respond at all if weird stuff just happens, unless we assume more here, and in particular I think societal response is very discontinuous, even if AI progress is continuous, for both good and bad reasons.

I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.

My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large

... (read more)
6ryan_greenblatt
Sure, or with properly implemented ~anything related to controlling the AIs behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example. I agree, by "we caught", I mean "the AI company". Probably a poor choice of language. Sure, but a large part of my point is that I don't expect public facing accidents (especially not accidents that kill people) until it's too late, so this isn't a very relevant counterfactual. This feels like a special case of escape to me which would probably cause a minimal response from the world as this only results in some particular fine-tune of an already open weights AI. So, you were probably already getting whatever warning shots you were going to get from the open weights AI. I don't think so. For every one of those failure modes other than escape, there is no chance of detection in the wild, so the choice is between catch the internal issue or catch nothing. I also think escape is moderately likely to go undetected (if not caught within the AI company). Part of my perspective might be thinking takeoff is faster than you do or focusing more on faster takeoff worlds. (FWIW, I also think that in relatively desperate scenarios, preventing escape isn't that high of a priority for control, though the possibility of warning shots doesn't factor into this very much.) Why do you assume this isn't captured by control schemes we're targeting? Feels like a special case to me? I am in practice less worried about this than you seem to be, but I do think we should analyze questions like "could the AIs be leading people astray in costly ways" and it seems pretty doable to improve the default tradeoffs here.

I like this review/retelling a lot. 

Minor point

Regarding the "Phase I" and "Phase II" terminology - while it has some pedagogical value, I worry about people interpreting it as a clear temporal decomposition, the implication being that we first solve alignment and then move on to Phase II.

In reality, the dynamics are far messier, with some 'Phase II' elements already complicating our attempts to address 'Phase I' challenges.

Some of the main concerning pathways include: 
- People attempting to harness superagent-level powers to advance their particular ... (read more)

I think 'people aren't paying attention to your work' is a somewhat different situation from the one voiced in the original post. I'm discussing specific ways in which people engage with the argument, as opposed to just ignoring it. It is the baseline that most people ignore most arguments most of the time.

Also, it's probably worth noting that these ways seem somewhat specific to the crowd over-represented here - in different contexts people are engaging with it in different ways.
 

For emergency response, a new ALERT. Personally, I think the forecasting/horizon-scanning part of Sentinel is good, and the emergency response part negative in expectation. What this means for funders, I don't know; I would donate conditional on the funds being restricted to the horizon-scanning part.

One structure which makes sense to build in advance for these worlds is emergency response teams. We almost founded one 3 years ago, unfortunately on a never-paid FTX grant. Other funders decided not to fund this (at a level like $200-500k) because e.g. it did not seem useful to them to prepare for high-volatility periods, while e.g. pouring tens of millions into evals did.

I'm not exactly tracking to what extent this lack of foresight prevails (my impression is it pretty much does), but I think I can still create something like ALERT with ~$1M of unrestricted funding.

4Jonathan Claybrough
(off topic to op, but in topic to Jan bringing up ALERT) To what extent do you believe Sentinel fulfills what you wanted to do with ALERT? Their emergency response team is pretty small rn. Would you recommend funders support that project or a new ALERT?

I'm confused about this response. We explicitly claim that bureaucracies are limited by running on humans, which includes only being capable of actions human minds can come up with and humans are willing to execute (cf. "street-level bureaucrats"). We make the point explicit for states, but it clearly holds for corporate bureaucracies.

Maybe it does not shine through in the writing, but we spent hours discussing this when writing the paper, and the points you make are 100% accounted for in the conclusions.

2Davidmanheim
I don't think I disagree with you on the whole - as I said to start, I think this is correct. (I only skimmed the full paper, but I read the post; on looking at it, the full paper does discuss this more, and I was referring to the response here, not claiming the full paper ignores the topic.) That said, in the paper you state that the final steps require something more than human disempowerment due to other types of systems, but per my original point, seem to elide how the process until that point is identical by saying that these systems have largely been aligned with humans until now, while I think that's untrue; humans have benefitted despite the systems being poorly aligned. (Misalignment due to overoptimization failures would look like this, and is what has been happening when economic systems are optimizing for GDP and ignoring wealth disparity, for example; the wealth goes up, but as it becomes more extreme, the tails diverge, and at this point, maximizing GDP looks very different from what a democracy is supposed to do.)  Back to the point, to the extent that the unique part is due to cutting the last humans out of the decision loop, it does differ - but it seems like the last step definitionally required the initially posited misalignment with human goals, so that it's an alignment or corrigibility failure of the traditional type, happening at the end of this other process that, again, I think is not distinct. Again, that's not to say I disagree, just that it seems to ignore the broader trend by saying this is really different. But since I'm responding, as a last complaint, you do all of this without clearly spelling out why solving technical alignment would solve this problem, which seems unfortunate. Instead, the proposed solutions try to patch the problems of disempowerment by saying you need to empower humans to stay in the decision loop - which in the posited scenario doesn't help when increasingly powerful but fundamentally misaligned AI systems a

I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.

Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:

  1. Serving the institutional logic of the entity they nominally lead (e.g., maintaining s
... (read more)
4Mateusz Bagiński
[Epistemic status: my model of the view that Jan/ACS/the GD paper subscribes to.] I think this comment by Jan from 3 years ago (where he explained some of the difference in generative intuitions between him and Eliezer) may be relevant to the disagreement here. In particular: My understanding of Jan's position (and probably also the position of the GD paper) is that aligning the AI (and other?) systems will be gradual, iterative, continuous; there's not going to be a point where a system is aligned so that we can basically delegate all the work to them and go home. Humans will have to remain in the loop, if not indefinitely, then at least for many decades. In such a world, it is very plausible that we will get to a point where we've built powerful AIs that are (as far as we can tell) perfectly aligned with human preferences or whatever but their misalignment manifests only on longer timescales. Another domain where this discrete/continuous difference in assumptions manifests itself is the shape of AI capabilities. One position is: The other position is: Another reason to expect this is that alignment and capabilities are not quite separate magisteria and that the alignment target can induce gaps in capabilities, relative to what one would expect from its power otherwise, as measured by, IDK, some equivalent of the g-factor. One example might be Steven's "Law of Conservation of Wisdom".

I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul, my impression was this is actually somewhat cruxy and we disagree about self-unalignment - where my mental image is that if you start with an incoherent bundle of self-conflicted values and plug this into an IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott's review of What We Owe the Future where he is worried that in a philosophy game, a smart moral... (read more)

2Noosphere89
I pretty much agree you can end up in arbitrary places with extrapolated values, and I don't think morality is convergent, but I also don't think it matters for the purpose of existential risk, because assuming something like instruction following works, the extrapolation problem can be solved by ordering AIs not to extrapolate values to cases where they get tortured/killed in an ethical scenario, and more generally I don't expect value extrapolation to matter for the purpose of making an AI safe to use. The real impact is on CEV style alignment plans/plans for what to do with a future AI, which are really bad plans to do for a lot of people's current values, and thus I really don't want CEV to be the basis of alignment. Thankfully, it's unlikely to ever be this, but it still matters somewhat, especially since Anthropic is targeting value alignment (though thankfully there is implicit constraints/grounding based on the values chosen).

I'm quite confused why you think Vanessa's linked response, which is to something slightly different, has much relevance here.

One of the claims we make, paraphrased & simplified in a way which I hope is closer to your way of thinking about it:

- AIs are mostly not developed and deployed by individual humans
- there are a lot of other agencies or self-interested, self-preserving structures/processes in the world
- if the AIs are aligned to these structures, human disempowerment is likely, because these structures are aligned with humans way less than they seem
-... (read more)

I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).

if the AIs are aligned to these structures, human disempowerment is likely, because these structures are aligned with humans way less than they seem

My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single align... (read more)

Obviously there is similarity, but if you round character / ground to simulator / simulacra, it's a mistake. I care about this not because I want to claim originality, but because I want people to get the model right.

The models are overlapping but substantially different, as we explain in this comment, and sometimes have very different implications - i.e., it is not just the same good idea presented in a different way.

If the long-term impact of the simulators post would be for LW readers to round every similar model in this space to simulator / ... (read more)

Just a quick review: I think this is a great text for intuitive exploration of a few topics:
- what do the embedding spaces look like?
- what do vectors which don't project to "this is a word" look like?
- how can poetry work, sometimes (projecting non-word meanings)?

Also, I like the genre of exploring these questions through phenomenological investigation - it seems under-appreciated.

 

 (Writing together with Sonnet)
 
Structural Differences

Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.

Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated). 

Nature of the Character/Ground Layer vs Simulator/Simulacra

In the three-layer model, the Character layer is a semi-permanent as... (read more)

3Chris van Merwijk
I'm trying to figure out to what extent the character/ground layer distinction is different from the simulacrum/simulator distinction. At some points in your comment you seem to say they are mutually inconsistent, but at other points you seem to say they are just different ways of looking at the same thing. "The key difference is that in the three-layer model, the ground layer is still part of the model's "mind" or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics - it's not a mind at all, but rather the rules that minds (and other things) operate under." I think this clarifies the difference for me, because as I was reading your post I was thinking: If you think of it as a simulacrum/simulator distinction, I'm not sure that the character and the surface layer can be "in conflict" with the ground layer, because both the surface layer and the character layer are running "on top of" the ground layer, like a windows virtual machine on a linux pc, or like a computer simulation running inside physics. Physical can never be "in conflict" with social phenomena. But it seems you maybe think that the character layer is actually embedded in the basic cognitive architecture. This would be a distinct claim from simulator theory, and *mutually inconsistent*. But I am unsure this is true, because we know that the ground layer was (1) trained first (so that it's easier for character training to work by just adjusting some parameters/prior of the ground layer, and (2) trained for much longer than the character layer (admittedly I'm not up to date on how they're trained, maybe this is no longer true for Claude?), so that it seems hard for the model to have a character layer become separately embedded in the basic architecture. Taking a more neuroscience rather than psychology analogy: It seems to me more likely that character training is essentially adjusting the prior of the ground layer, but the character is still fully running

My impression is that most people who converged on doubting VNM as a norm of rationality also converged on the view that the problem it has in practice is that it isn't necessarily stable under some sort of compositionality/fairness. E.g. Scott here, Richard here.

The broader picture could be something like: yes, there is some selection pressure from the Dutch-book arguments, but there are stronger selection pressures coming from being part of bigger things or being composed of parts.
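As an illustrative aside on what "selection pressure from the Dutch-book arguments" refers to, here is a minimal toy sketch (names and numbers are made up, not from the comment or the linked posts): an agent with cyclic preferences can be cycled by a bookie and pays a fee on every step.

```python
# Minimal money-pump sketch: an agent with cyclic preferences A > B > C > A
# pays a small fee for every "upgrade", so a bookie can cycle it forever.
# This is the selection pressure the Dutch-book / VNM arguments point at.

FEE = 1.0
# For each pair (x, y), the agent prefers x over y; note the cycle.
PREFERS = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}


def trade(holding: str, offered: str, wealth: float) -> tuple[str, float]:
    """Swap (and pay the fee) whenever the offered item is preferred to the held one."""
    if PREFERS.get((offered, holding), False):
        return offered, wealth - FEE
    return holding, wealth


holding, wealth = "A", 10.0
for offered in ["C", "B", "A", "C", "B", "A"]:  # the bookie cycles the offers
    holding, wealth = trade(holding, offered, wealth)

print(holding, wealth)  # back to holding "A", but 6 fees poorer
```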

8Richard_Ngo
Yepp, though note that this still feels in tension with the original post to me - I expect to find a clean, elegant replacement to VNM, not just a set of approximately-equally-compelling alternatives. Why? Partly because of inside views which I can’t explain in brief. But mainly because that’s how conceptual progress works in general. There is basically always far more hidden beauty and order in the universe than people are able to conceive (because conceiving of it is nearly as hard as discovering it - like, before Darwin, people wouldn’t have been able to explain what type of theory could bring order to biology). I read the OP (perhaps uncharitably) as coming from a perspective of historically taking VNM much too seriously, and in this post kinda floating the possibility “what if we took it less seriously?” (this is mostly not from things I know about Anna, but rather a read on how it’s written). And to that I’d say: yepp, take VNM less seriously, but not at the expense of taking the hidden order of the universe less seriously.

Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into a similar confusing issue, it gets aggregated, the principal decides how to handle it in the abstract, and the "layer 2" support disseminates the information. So, greater power to scheme would be coupled with a stronger human-in-the-loop component & closer non-AI oversight.

Fund independent safety efforts somehow, make model access easier. I'm worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field, just by virtue of hiring such a large part of AI safety, competence-weighted. (And another part being very close to Anthropic in thinking.)

To be clear, I don't think people are doing something individually bad or unethical by going to work for Anthropic; I just do think
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuine... (read more)

My guess is that a roughly equally "central" problem is the incentive landscape around the OpenPhil/Anthropic school of thought:

  • where you see Sam, I suspect something like "the lab memeplexes": lab superagents have instrumentally convergent goals, and the instrumentally convergent goals lead to instrumental, convergent beliefs, and also to instrumental blindspots
  • there are strong incentives for individual people to adjust their beliefs: money, social status, sense of importance via being close to the Ring
  • there are also incentives for people setting some o
... (read more)

How did you find this transcript? I think it depends on what process you used to locate it.


It was literally the 4th transcript I've read (I've just checked my browser history). The only bit of difference from 'completely random exploration' was that I selected for "lying" cases after reading two "non-lying" transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes discussion of "lying", although it's not a discussion of the model lying, but of Anthropic lying.)

I may try something more systematic at some point, but ... (read more)

  • Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.

That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it

... (read more)
2evhub
How did you find this transcript? I think it depends on what process you used to locate it. Drive towards rights and moral patienthood seem good to me imo—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.

The question is not about the very general claim, or the general argument, but about this specific reasoning step:

GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.

And since the task that GPTs are being trained on is different from and harder than the task of being a human, ....

I do claim this is not locally valid, that's all (and recommend reading the linked essay). I do not claim the broad argument that the text prediction objective doesn't stop... (read more)

2Jeremy Gillen
There are multiple ways to interpret "being an actual human". I interpret it as pointing at an ability level. "the task GPTs are being trained on is harder" => the prediction objective doesn't top out at (i.e. the task has more difficulty in it than). "than being an actual human" => the ability level of a human (i.e. the task of matching the human ability level at the relevant set of tasks). Or as Eliezer said: In different words again: the tasks GPTs are being incentivised to solve aren't all solvable at a human level of capability.   You almost had it when you said: It's more accurate if I edit it to: - Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina [text] well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. You say it's not particularly informative. Eliezer responds by explaining the argument it responds to, which provides the context in which this is an informative statement about the training incentives of a GPT.
6habryka
I don't understand the problem with this sentence. Yes, the task is harder than the task of being a human (as good as a human is at that task). Many objectives that humans optimize for are also not optimized to 100%, and as such, humans also face many tasks that they would like to get better at, and so are harder than the task of simply being a human. Indeed, if you optimized an AI system on those, you would also get no guarantee that the system would end up only as competent as a human. This is a fact about practically all tasks (including things like calculating the nth-digit of pi, or playing chess), but it is indeed a fact that lots of people get wrong.

The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. The high karma and high Alignment Forum karma show that a combination of famous author and correct conclusion wins over the argument being correct.

The OP argument boils down to: the text prediction objective doesn't stop incentivizing higher capabilities once you get to human level capabilities. This is a valid counter-argument to: GPTs will cap out at human capabilities because humans generated the training data.

Your central point is: 

Where GPT and humans differ is not some general mathematical fact about the task,  but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded.

You are misinterpretin... (read more)

  1. I expect the "first AGI" to be reasonably modelled as a composite structure, in a similar way to how a single human mind can be modelled as composite.
  2. The "top" layer in the hierarchical agency sense isn't necessarily the more powerful / agenty one: the superagent/subagent direction is completely independent of relative power. For example, you can think about a Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
  3. I think the nature of the problem here is somewhat different than typical
... (read more)
2Seth Herd
Ah. Now I understand why you're going this direction. I think a single human mind is modeled very poorly as a composite of multiple agents. This notion is far more popular with computer scientists than with neuroscientists. We've known about it since Minsky and think about it; it just doesn't seem to mostly be the case. Sure you can model it that way, but it's not doing much useful work. I expect the same of our first AGIs as foundation model agents. They will have separate components, but those will not be well-modeled as agents. And they will have different capabilities and different tendencies, but neither of those are particularly agent-y either. I guess the devil is in the details, and you might come up with a really useful analysis using the metaphor of subagents. But it seems like an inefficient direction.

I guess make one? Unclear if hierarchical agency is the true name

2Chipmonk
Yeah i'm confused about what to name it. we can always change it later i guess. also let me know if you have any posts you want me to definitely tag for it that you think i might miss otherwise

There was some selection of branches, and one pass of post-processing.

It was after ~30 pages of a different conversation about AI and LLM introspection, so I don't expect the prompt alone will elicit the "same Claude". The start of this conversation was:

Thanks! Now, I would like to switch to a slightly different topic: my AI safety oriented research on hierarchical agency. I would like you to role-play an inquisitive, curious interview partner, who aims to understand what I mean, and often tries to check understanding using paraphrasing, giving examples, and si... (read more)

To add some nuance....

While I think this is a very useful frame, particularly for people who have oppressive legibility-valuing parts, and it is likely something many people would benefit from hearing, I doubt it is great as a descriptive model.

A model in my view closer to reality is: there isn't that sharp a difference between "wants" and "beliefs", and both "wants" and "beliefs" do update.

Wants are often represented by not very legible taste boxes, but these boxes do update upon being fed data. To continue an example from the post, let's talk about lit... (read more)

4SpectrumDT
Could you please elaborate on how this "breaks orthogonality"? It is unclear to me what you think the ramifications of this are.
5DaystarEld
Completely agree, and for what it's worth, I don't think anything in the frame of my post contradicts these points. "You either do or do not feel a want" is not the same as "you either do now or you never will," and I note that conditioning is also a cause of preferences, though I will edit to highlight that this is an ongoing process in case it sounds like I was saying it's all locked-in from some vague "past" or developmental experiences (which was not my intent).

I hesitated between Koyaanisqatsi and Baraka! Both are some of my favorites, but in my view Koyaanisqatsi actually has notably more of an agenda and a more pessimistic outlook.

Baraka: A guided meditation exploring the human experience; topics like order/chaos, modernity, green vs. other mtg colours.

More than "connected to something in sequences" it is connected to something which straw sequence-style rationality is prone to miss. Writings it has more resonance with are Meditations on Moloch, The Goddess of Everything Else, The Precipice.

There isn't much to spoil: it's a 97-minute-long nonverbal documentary. I would highly recommend watching it on as large a screen and in as good quality as you can; watching it on a small laptop screen is a waste. ... (read more)

Central European experience, which is unfortunately becoming relevant also for the current US: for world-modelling purposes, you should have hypotheses like 'this thing is happening because of a Russian intelligence operation' or 'this person is saying what they are saying because they are a Russian asset' in your prior with nontrivial weights.

5JenniferRM
I already think that "the entire shape of the zeitgeist in America" is downstream of non-trivial efforts by more than one state actor. Those links explain documented cases of China and Russia both trying to foment race war in the US, but I could pull links for other subdimensions of culture (in science, around the second amendment, and in other areas) where this has been happening since roughly 2014. My personal response is to reiterate over and over in public that there should be a coherent response by the governance systems of free people, so that, for example, TikTok should either (1) be owned by human people who themselves have free speech rights and rights to a jury trial, or else (2) should be shut down by the USG via taxes, withdrawal of corporate legal protections, etc... ...and also I just track actual specific people, and what they have personally seen and inferred and probably want and so on, in order to build a model of the world from "second hand info". I've met you personally, Jan, at a conference, and you seemed friendly and weird and like you had original thoughts based on original seeing, and so even if you were on the payroll of the Russians somehow... (which to be clear I don't think you are) ....hey: Cyborgs! Neat idea! Maybe true. Maybe not. Maybe useful. Maybe not. Whether or not your cyborg ideas are good or bad can be screened off from whether or not you're on the payroll of a hostile state actor. Basically, attending primarily to local validity is basically always possible, and nearly always helpful :-)
4David Matolcsi
I'm from Hungary that is probably politically the closest to Russia among Central European countries, but I don't really know of any significant figure who turned out to be a Russian asset, or any event that seemed like a Russian intelligence operation. (Apart from one of our far-right politicians in the EU Parliament being a Russian spy, which was a really funny event, but its not like the guy was significantly shaping the national conversation or anything, I don't think many have heard of him before his cover was blown.) What are prominent examples in Czechia or other Central European countries, of Russian assets or operations?

I expected a quite different argument for empathy:

1. argument from simulation: the most important part of our environment is other people; people are very complex and hard to predict; fortunately, we have hardware which is extremely good at 'simulating a human' - our individual brains. To guess what another person will do, or why they are doing what they are doing, it seems clearly computationally efficient to just simulate their cognition on my brain. Fortunately for empathy, simulations activate some of the same proprioceptive machinery and goal-modeling subage... (read more)

7Nathan Helm-Burger
When I've been gradually losing at a strategic game where it seems like my opponent is slightly stronger than me, but then I have a flash of insight and turn things around at the last minute.... I absolutely model what my opponent is feeling as they are surprised by my sudden comeback. My reaction to such an experience is usually to smile, or (if I'm alone playing the game remotely) perhaps chuckle with glee at their imagined dismay. I feel proud of myself, and happy to be winning. On the other hand, if I'm beating someone who is clearly trying hard but outmatched, I often feel a bit sorry for them. In such a case my emotions maybe align somewhat with theirs, but I don't think my slight feeling of pity, and perhaps superiority, is in fact a close match for what I imagine them feeling. And both these emotional states are not what I'd feel in a real life conflict. A real life conflict would involve much more anxiety and stress, and concern for myself and sometimes the other.  I don't just automatically feel what the simulated other person in my mind is feeling. I feel a reaction to that simulation, which can be quite different from what the simulation is feeling! I don't think that increasing the accuracy and fidelity of the simulation would change this.
7Steven Byrnes
* I added a footnote at the top clarifying that I’m disputing that the prosocial motivation aspect of “empathy” happens for free. I don’t dispute that (what I call) “empathetic simulations” are useful and happen by default. * A lot of claims under the umbrella of “mirror neurons” are IMO pretty sketchy, see my post Quick notes on “mirror neurons”. * You can make an argument: “If I’m thinking about what someone else might do and feel in situation X by analogy to what I might do and feel in situation X, and then if situation X is unpleasant than that simulation will be unpleasant, and I’ll get a generally unpleasant feeling by doing that.” But you can equally well make an argument: “If I’m thinking about how to pick up tofu with a fork, I might analogize to how I might pick up feta with a fork, and so if tofu is yummy then I’ll get a yummy vibe and I’ll wind up feeling that feta is yummy too.” The second argument is counter to common sense; we are smart enough to draw analogies between situations while still being aware of differences between those same situations, and allowing those differences to control our overall feelings and assessments. That’s the point I was trying to make here.

My personal impression is that you are mistaken and the innovation has not stopped, but part of the conversation has moved elsewhere. E.g., taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary - free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is high priority to write them up for LW, because they don't fit into the current zeitgeist of the site, which seems to direct a lot of attention mostly to:
- advocacy 
- topics a ... (read more)

Seems worth mentioning SOTA, which is https://futuresearch.ai/. Based on the competence & epistemics of the Futuresearch team, and their bot getting very strong but not superhuman performance, I roll to disbelieve that this demo is actually way better and predicts future events at a superhuman level.

Also, I think it is generally bad to not mention or compare to SOTA and to just cite your own prior work. Shame.

5Fred Zhang
Do they have an evaluation result in Brier score, by back testing on resolved questions, similar to what is done in the literature? (They have a pic with "expected Brier score", which seems to be based on some kind of simulation?)
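For reference (an editorial aside, not part of the thread): the standard Brier score over N resolved binary questions is

$$\mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2,$$

where $p_i$ is the forecast probability and $o_i \in \{0,1\}$ is the resolved outcome; lower is better, and always forecasting 0.5 scores 0.25.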

I'm skeptical of the 'wasting my time' argument.

A stance like 'going to poster sessions is great for young researchers, I don't do it anymore and just meet friends' is high-status, so, on priors, I would expect people to adopt it more than is optimal.

Realistically, a poster session is ~1.5h, maybe 2h with skimming to decide what to look at. It is relatively common for people in AI to spend many hours per week digesting the news on Twitter. I really doubt the per-hour efficiency of following Twitter is better than that of poster sessions when approached intentionally. (While obviously aimlessly wandering between endless rows of posters is approximately useless.)

2Arthur Conmy
I agree that twitter is a worse use of time. Going to posters for works you already know to talk to authors seems a great idea and I do it. Re-reading your OP, you suggest things like checking papers are fake or not in poster sessions. Maybe you just meant papers that you already knew about? It sounded as if you were suggesting doing this for random papers, which I'm more skeptical about.

I broadly agree with this - we tried to describe a somewhat similar set of predictions in Cyborg periods.

1Chipmonk
forgot about that one! ty

A few thoughts:
- actually, these considerations mostly increase uncertainty and variance about timelines; if LLMs are missing some magic sauce, it is possible that smaller systems with the magic sauce could be competitive, and we could get really powerful systems sooner than Leopold's lines predict
- my take on one important thing which makes current LLMs different from humans is the gap described in Why Simulator AIs want to be Active Inference AIs; while that post intentionally avoids having a detailed scenario part, I think the ontology introduced is better for t... (read more)

Agreed we would have to talk more. I think I mostly get the homunculi objection. Don't have time now to write an actual response, so here are some signposts:
- part of what you call agency is explained by a roughly active-inference style of reasoning
-- some types of "living" systems are characterized by having boundaries between them and the environment (boundaries mostly in the sense of separation of variables)
-- maintaining the boundary leads to a need to model the environment
-- modelling the environment introduces a selection pressure toward approximating Bayes
- o... (read more)

That's why solving hierarchical agency is likely necessary for success

5TsviBT
We'd have to talk more / I'd have to read more of what you wrote, for me to give a non-surface-level / non-priors-based answer, but on priors (based on, say, a few dozen conversations related to multiple agency) I'd expect that whatever you mean by hierarchical agency is dodging the problem. It's just more homunculi. It could serve as a way in / as a centerpiece for other thoughts you're having that are more so approaching the problem, but the hierarchicalness of the agency probably isn't actually the relevant aspect. It's like if someone is trying to explain how a car goes and then they start talking about how, like, a car is made of four wheels, and each wheel has its own force that it applies to a separate part of the road in some specific position and direction and so we can think of a wheel as having inside of it, or at least being functionally equivalent to having inside of it, another smaller car (a thing that goes), and so a car is really an assembly of 4 cars. We're just... spinning our wheels lol. Just a guess though. (Just as a token to show that I'm not completely ungrounded here w.r.t. multi-agency stuff in general, but not saying this addresses specifically what you're referring to: https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html)

(crossposted from twitter) Main thoughts: 
1. Maps pull the territory 
2. Beware what maps you summon 

Leopold Aschenbrenner's series of essays is a fascinating read: there are a ton of locally valid observations and arguments. A lot of the content is the type of stuff mostly discussed in private. Many of the high-level observations are correct.

At the same time, my overall impression is that the set of maps sketched pulls toward existential catastrophe, and this is true not only for the 'this is how things can go wrong' part, but also for the 'this is h... (read more)

9Julian Bradshaw
I agree that it's a good read. I don't agree that it "pulls towards existential catastrophe". Pulls towards catastrophe, certainly, but not existential catastrophe? He's explicitly not a doomer,[1] and is much more focused on really-bad-but-survivable harms like WW3, authoritarian takeover, and societal upheaval. 1. ^ Page 105 of the PDF, "I am not a doomer.", with a footnote where he links a Yudkowsky tweet agreeing that he's not a doomer. Also, he listed his p(doom) as 5% last year. I didn't see an updated p(doom) in Situational Awareness or his Dwarkesh interview, though I might have missed it.

He's starting an AGI investment firm that invests based on his thesis, so he does have a direct financial incentive to make this scenario more likely 

1James Stephen Brown
That looks interesting, will read :) Thanks.

Mendel's Laws seem counterfactual by about 30 years, based on partial re-discovery taking that much time. His experiments are technically something which someone could have done basically any time in the last few thousand years, given basic maths.

1johnswentworth
I buy this argument.

I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope". 

At the same time, I do not think parts of your argument in the post are locally valid or a good justification for the claim.

A correct and locally valid argument for why GPTs are not capped at human level was already writt... (read more)

I sort of want to flag that this interpretation of whatever gossip you heard seems misleading / only tells a small part of the story, based on my understanding.

7JesperO
Possible to say anything more about the story?

I would imagine I would also react to it with a smile in the context of an informal call. When used as a brand / "fill in the interest form here", I just think it's not a good name, even if I am sympathetic to proposals to create more places to do big-picture thinking about the future.

Sorry, but I don't think this should be branded as "FHI of the West".

I don't think you personally or Lightcone share that much intellectual taste with FHI or Nick Bostrom - Lightcone seems firmly in the intellectual tradition of Berkeley, shaped by orgs like MIRI and CFAR. This tradition was often close to FHI's thinking, but also quite often in tension with it. My hot take is that you in particular miss part of the generators of the taste which made FHI different from Berkeley. I sort of dislike the "FHI" brand being used in this way.

edit: To be clear I'm ... (read more)

Totally agree, it definitely should not be branded this way if it launches.

I am thinking of "FHI of the West" here basically just as the kind of line directors use in Hollywood to get the theme of a movie across. Like "Jaws in Space" being famously the one line summary of the movie "Alien".

It also started internally as a joke based on an old story of the University of Ann Arbor branding itself as "the Harvard of the West", which was perceived to be a somewhat clear exaggeration at the time (and resulted in Kennedy giving a speech where he described Harvard... (read more)

Two notes: 

  1. I think the title is a somewhat obscure pun referencing the old saying that Stanford was the "Harvard of the West". If one is not familiar with that saying, I guess some of the nuance is lost in the choice of term. (I personally had never heard that saying before recently, and I'm not even quite sure I'm referencing the right "X of the West" pun)
  2. habryka did have a call with Nick Bostrom a few weeks back, to discuss his idea for an "FHI of the West", and I'm quite confident he referred to it with that phrase on the call, too. Far as I'm awar
... (read more)

You are exactly right that active inference models which behave in a self-interested or any coherently goal-directed way must have something like an optimism bias.

My guess about what happens in animals and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control-theory-type circuits' in the bodies (an extremely impressive optimization task is even how to build a body from a single cell...). This evolutionarily older circui... (read more)
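A minimal toy sketch of the mechanism described above, assuming a single interoceptive variable (body temperature) with a fixed, evolution-supplied 'prediction' acting as a setpoint; all names and numbers are illustrative, not from the comment:

```python
import random

# Toy interoceptive control loop in active-inference terms: the agent carries
# a fixed "prediction" (setpoint) for an internal variable and acts to reduce
# the prediction error, rather than updating the prediction itself.

SETPOINT = 37.0    # fixed interoceptive prediction, e.g. core body temperature
ACTION_GAIN = 0.3  # how strongly action responds to prediction error


def step(temperature: float) -> float:
    """One control step: act to pull the observed variable toward the fixed prediction."""
    prediction_error = SETPOINT - temperature
    action = ACTION_GAIN * prediction_error   # e.g. shiver / sweat
    disturbance = random.gauss(0.0, 0.1)      # environmental noise
    return temperature + action + disturbance


if __name__ == "__main__":
    temp = 34.0  # start away from the setpoint
    for _ in range(30):
        temp = step(temp)
    print(f"temperature after 30 steps: {temp:.2f} (setpoint {SETPOINT})")
```

The point of the sketch is only that a prediction which never updates functions as a control setpoint - which is one way to read the 'optimism bias' requirement.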


- Too much value and too positive feedback on legibility. Replacing smart illegible computations with dumb legible stuff
- Failing to develop actual rationality and focusing on cultivation of the rationalist memeplex or rationalist culture instead
- Not understanding the problems with the theoretical foundations on which the Sequences are based (confused formal understanding of humans -> confused advice)

5Mo Putera
Curious to see you elaborate on the last point, or just pointers to further reading. I think I agree in a betting sense (i.e. is Jan's claim true or false?) but don't really have a gears-level understanding.

+1 on the sequence being among the best things of 2022.

You may enjoy an additional / somewhat different take on this from population/evolutionary biology (and here). (To translate the map, you can think about yourself as a population of 'myselves'. Or, in the opposite direction, from a gene-centric perspective it obviously makes sense to think about the population as a population of selves.)

Part of the irony here is that evolution landed on the broadly sensible solution (geometric rationality). However, after almost every human doing the theory got somewhat confused ... (read more)
