Still haven't heard a better suggestion than CEV.
re 6 -- Interesting. It was my impression that "chain of thought" and other techniques notably improved LLM performance. Regardless, I don't see compositional improvements as a good thing. They are hard to understand as they are being created, and the improvements seem harder to predict. I am worried about RSI in a misaligned system created/improved via composition.
Re race dynamics: It seems to me there are multiple approaches to coordinating a pause. It doesn't seem likely that we could get governments or companies to head a pause. Movements from the general population might help, but a movement lead by AI scientists seems much more plausible to me. People working on these systems ought to be more aware of the issues and more sympathetic to avoiding the risks, and since they are the ones doing the development work, they are more in a position to refuse to do work that hasn't been shown to be safe.
Based on your comment and other thoughts, my current plan is to publish research as normal in order to move forward with my mechanistic interpretability career goals, but to also seek out and/or create a guild or network of AI scientists / workers with the goal of agglomerating with other such organizations into a global network to promote alignment work & reject unsafe capabilities work.
About (6), I think we're more likely to get AGI /ASI by composing pre-trained ML models and other elements than by a fresh training run. Think adding iterated reasoning and api calling to a LLM.
About the race dynamics. I'm interested in founding / joining a guild / professional network for people committed to advancing alignment without advancing capabilities. Ideally we would share research internally, but it would not be available to those not in the network. How likely does this seem to create a worthwhile cooling of the ASI race? Especially if the network were somehow successful enough to reach across relevant countries?
It's not really possible to hedge either the apocalypse or a global revolution, so you can ignore those states of the worlds when pricing assets (more or less).
Unless depending on what you invest in those states of the world become more or less likely.
Haha, I was hoping for a bit more activity here, but we filled our speaker slots anyway. If you stumble across this post before November 26th, feel free to come to our conference.
In the final paragraph, I'm uncertain if you are thinking about "agency" being broken into components which make up the whole concept, or thinking about the category being split into different classes of things, some of which may have intersecting examples. (or both?) I suspect both would be helpful. Agency can be described in terms of components like measurement/sensory, calculations, modeling, planning, comparisons to setpoints/goals, taking actions. Probably not that exact set, but then examples of agent like things could naturally be compared on each component, and should fall into different classes. Exploring the classes I suspect would inform the set of components and the general notion of "agency".
I guess to get work on that done it would be useful to have a list of prospective agent components, a set of examples of agent shaped things, and then of course to describe each agent in terms of the components. What I'm describing, does it sound useful? Do you know of any projects doing this kind of thing?
On the topic of map-territory correspondence, (is there a more concise name for that?) I quite like your analogies, running with them a bit, it seems like there are maybe 4 categories of map-territory correspondence;
It feels like "heat-like" might be the only real category in some kind of similarity clusters kind of way, but also "things which are using a measurement proxy to compare the state of reality against a setpoint and taking different actions based on the difference between the measurement result and the setpoint" seems like a specific enough thing when I think about it that you could divide all parts of the universe into being either definitely in or definitely out of that category, which would make it a strong candidate for being a natural abstraction, and I suspect it's not the only category like that.
I wouldn't be surprised if there were indeterminate things in shared maps, and in individual maps, but I would be very surprised if there were many examples in shared maps that were due to happenstance instead of being due to convergence of individual happenstance indeterminate things converging during map comparison processes. Also, weirdly, the territory containing map making agents which all mark a particular part of their maps may very well be a natural abstraction, that is, the mark existing at a particular point on the maps might be a real thing, but not the corresponding spot in territory. I'm thinking this is related to a Schelling point or Nash Equilibrium, or maybe also related to human biases. Although, those seem to do more with brain hardware than agent interactions. A better example might be the sound of words: arbitrary, except that they must match the words other people are using.
Unrelated epistemological game; I have a suspicion that for any example of a thing that objectively exists, I can generate an ontology in which it would not. For the example of an orange, I can imagine an ontology in which "seeing an orange", "picking a fruit", "carrying food", and "eating an orange" all exist, but an orange itself outside of those does not. Then, an orange doesn't contain energy, since an orange doesn't exist, but "having energy" depends on "eating an orange" which depends on "carrying food" and so on, all without the need to be able to think of an orange as an object. To describe an orange you would need to say [[the thing you are eating when you are][eating an orange]], and it would feel in between concepts in the same way that in our common ontology "eating an orange" feels like the idea between "eating" and "orange".
I'm not sure if this kind of ontology:
I suspect it's the first, but it doesn't seem inescapably true, and now I'm wondering if this is a worthwhile thought experiment, or the sort of thing I'm thinking because I'm too sleepy. Alas :-p
It's unimportant, but I disagree with the "extra special" in:
if alignment isn’t solvable at all [...] extra special dead
If we could coordinate well enough and get to SI via very slow human enhancement that might be a good universe to be in. But probably we wouldn't be able to coordinate well enough and prevent AGI in that universe. Still, odds seem similar between "get humanity to hold off on AGI till we solve alignment" which is the ask in alignment possible universes, and "get humanity to hold off on AGI forever" which is the ask in alignment impossible universes. The difference between the odds being based on how long until AGI, whether the world can agree to stop development or only agree to slow it, and if it can stop, whether that is stable. I expect AGI is a sufficient amount closer than alignment that getting the world to slow it for that long and stop it permanently are fairly similar odds.
what Hotz was treating a load bearing
Small grammar mistake. You accidentally a "a".
Oh, actually I spoke too soon about "Talk to the City." As a research project, it is cool, but I really don't like the obfuscation that occurs when talking to an LLM about the content it was trained on. I don't know how TTTC works under the hood, but I was hoping for something more like de-duplication of posts, automatically fitting them into argument graphs. Then users could navigate to relevant points in the graph based on a text description of their current point of view, but importantly they would be interfacing with the actual human generated text, with links back to it's source, and would be able to browse the entire graph. People could then locate (visually?) important crux's and new crux's wouldn't require a writeup to disseminate, but would already be embedded in the relevant part of the argument.
( I might try to develop something like this someday if I can't find anyone else doing it. )
The risk interview perspectives is much closer to what I was thinking, and I'd like to study it in more detail, but seems more like a traditional analysis / infographic than what I am wishing would exist.
Hey, we met at EAGxToronto : ) I am finally getting around to reading this. I really enjoyed your manic writing style. It is cathartic finding people stressing out about the same things that are stressing me out.
In response to "The less you have been surprised by progress, the better your model, and you should expect to be able to predict the shape of future progress": My model of capabilities increases has not been too surprised by progress, but that is because for about 8 years now there has been a wide uncertainty bound and a lot of Vingean Reflection in my model. I know that I don't know what is required for AGI and strongly suspect that nobody else does either. It could be 1 key breakthrough or 100, but most of my expectation p-mass is in the range of 0 to 20. Worlds with 0 would be where prosaic scaling is all we need or where a secret lab is much better at being secret than I expect. Worlds with 20 are where my p-mass is trailing off. I really can't imagine there would be that many key things required, but since those insights are what would be required to understand why they are required, I don't think they can be predicted ahead of time, since predicting the breakthrough is basically the same as having the breakthrough, and without the breakthrough we nearly cannot see the breakthrough and cannot see the results which may or may not require further breakthroughs.
So my model of progress has allowed me to observe our prosaic scaling without surprise, but it doesn't allow me to make good predictions since the reason for my lack of surprise has been from Vingean prediction of the form "I don't know what progress will look like and neither do you".
Things I do feel confident about are conditional dynamics, like, if there continues to be focus on this, there will be progress. This likely gives us sigmoid progress on AGI from here until whatever boundary on intelligence gets hit. The issue is that sigmoid is a function matching effort to progress, where effort is some unknown function of the dynamics of the agents making progress (social forces, economic forces, and ai goals?), and some function which cannot be predicted ahead of time maps progress on the problem to capabilities we can see / measure well.
Adding in my hunch that the boundary on intelligence is somewhere much higher than human level intelligence gives us "barring a shift of focus away from the problem, humanity will continue making progress until AI takes over the process of making progress" and the point of AI takeover is unknowable. Could be next week, could be next century, and giving a timeline requires estimating progress through unknown territory. To me this doesn't feel reassuring, it feels like playing Russian roulette with an unknown number of bullets. It is like an exponential distribution where future probability is independent of past probability, but unlike with lightbulbs burning out, we can't set up a fleet of earths making progress on AGI to try to estimate the probability distribution.
I have not been surprised by capabilities increase since I don't think there exists capabilities increase timelines that would surprise me much. I would just say "Ah, so it turns out that's the rate of progress. I have gone from not knowing what would happen to it happening. Just as predicted." It's unfortunate, I know.
What I have been surprised about has been governmental reaction to AI... I kinda expected the political world to basically ignore AI until too late. They do seem focused on non-RSI issues, so this could still be correct, but I guess I wasn't expecting the way chat-GPT has made waves. I didn't extrapolate my uncertainty around capabilities increases as a function of progress to uncertainty around societal reaction.
In any case, I've been hoping for the last few years I would have time to do my undergrad and start working on the alignment without a misaligned AI going RSI, and I'm still hoping for that. So that's lucky I guess. 🍀🐛