That might be true but I'm not sure it matters. For an AI to learn an abstraction, it will have finite training time, context length, search-space width (if we're doing parallel search, like with o3), etc., and it's not clear how abstraction height will scale with those.
Empirically, I think lots of people have the experience of "hitting a wall", where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; and abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?
I second this; it could easily be something we might describe as "the amount of information that can be processed at once, including abstractions", which is some combination of residual stream width and context length.
Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some cs students make it all the way to recursion then hit a w...
Only partially relevant, but it's exciting to hear a new John/David paper is forthcoming!
Furthermore: normalizing your data to variance=1 will change your PCA line (if the X and Y variances are different) because the relative importance of X and Y distances will change!
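A quick way to see this (a minimal sketch with made-up data, using sklearn):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: X has much larger variance than Y, with some correlation between them.
rng = np.random.default_rng(0)
x = rng.normal(0, 10, 500)
y = 0.1 * x + rng.normal(0, 1, 500)
data = np.column_stack([x, y])

# First principal component on the raw data: dominated by the high-variance X axis.
raw_dir = PCA(n_components=1).fit(data).components_[0]

# After scaling each column to variance 1, the relative importance of X and Y
# distances changes, so the fitted line (in standardized coordinates) changes too.
scaled_dir = PCA(n_components=1).fit(StandardScaler().fit_transform(data)).components_[0]

print(raw_dir, scaled_dir)  # noticeably different directions
```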
Thanks for writing this up. As someone who was not aware of the eye thing, I think it's a good illustration of the level that the Zizians are on, i.e. misunderstanding key facts about the neurology that is central to their worldview.
My model of double-hemisphere stuff, DID, tulpas, and the like is somewhat null-hypothesis-ish. The strongest version is something like this:
At the upper levels of predictive coding, the brain keeps track of really abstract things about yourself. Think "ego" "self-conception" or "narrative about yourself". This is norm...
This is a very interesting point. I have upvoted this post even though I disagree with it, because I think the question of "Who will pay, and how much will they pay, to restrict others' access to AI?" is important.
My instinct is that this won't happen, because there are too many AI companies for this deal to work on all of them, and some of these AI companies will have strong kinda-ideological commitments to not doing this. Also, my model of (e.g. OpenAI) is that they want to eat as much of the world's economy as possible, and this is better done by selling (e...
That's part of what I was trying to get at with "dramatic" but I agree now that it might be 80% photogenicity. I do expect that 3000 Americans killed by (a) humanoid robot(s) on camera would cause more outrage than 1 million Americans killed by a virus which we discovered six months later was AI-created in some way.
Previous ballpark numbers I've heard floated around are "100,000 deaths to shut it all down", but I expect the threshold will grow as more money is involved. It depends on how dramatic the deaths are, though: 3,000 deaths was enough to cause the US to invade two countries back in the 2000s, and 100,000 deaths is thirty-three 9/11s.
I think the response to 9/11 was an outlier mostly caused by the "photogenic" nature of the disaster. COVID killed over a million Americans yet we basically forgot about it once it was gone. We haven't seen much serious investment in measures to prevent a new pandemic.
Is there a particular reason to not include sex hormones? Some theories suggest that testosterone tracks relative social status. We might expect that high social status -> less stress (of the cortisol type) + more metabolic activity. Since it's used by trans people, we have a pretty good idea of what it does to you at high doses (makes you hungry, horny, and angry), but it's unclear whether it actually promotes low cortisol-stress and metabolic activity.
I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.
I think it has the most long-term-relevant information (about AI and community building) back-loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front-loaded. This is a very Bay Area-centric post, which I don't think is ideal.
A better version of this post would be structured as a round-up of the main future-relevant takeaways, with specifics from the office space as examples.
I'm only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don't think distributional shift applies.
I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, t...
I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case?
For example, if we introduce some error into the beta-coherence assumption:
Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.
V(s_0) = e/(1+e) +/- delta ≈ 0.731 +/- delta
Actual expected value = 0.622
Even if |delta| = 0.1, the system cannot be coherent over training in this case. This seems relatively robust to me.
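For concreteness, a quick numerical check of those two numbers (a minimal sketch, assuming the value is the softmax-weighted average of the terminal rewards, as in the example above):

```python
import math

def coherent_value(r1, r2, beta):
    # Softmax(beta)-weighted average of the two terminal rewards.
    w1, w2 = math.exp(beta * r1), math.exp(beta * r2)
    return (w1 * r1 + w2 * r2) / (w1 + w2)

print(coherent_value(1, 0, beta=1.0))  # ~0.731: what beta_t-coherence demands of V(s_0)
print(coherent_value(1, 0, beta=0.5))  # ~0.622: the actual expectation when sampling with beta_s
```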
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (i.e. has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant because I still think a process obeyi...
Trained with what procedure, exactly?
Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.
Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:
There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that ...
The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."
The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.
Therefore a value function trained with such a procedure must consider the state rea...
I think you're right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:
For non-terminal $s$, this can be written as:
$$V(s) = \sum_{s'} \frac{e^{\beta V(s')}}{\sum_{s''} e^{\beta V(s'')}} V(s')$$
If $s$ is terminal then [...] we just have $V(s) = r(s)$.
Which captures both. I will edit the post to clarify this when I get time.
I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks!
Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine), for bird-flu-related reasons: since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVaC or similar looked at this?
I don't think it was unforced
You're right, "unforced" was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research -> research loop. Where it fails is in being good for comms. Starting with a "good" model and trying (and failing) to make it "evil" means that anyone using the paper for comms has to introduce a layer of abstraction into their...
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than it needed to be, leading to lots of people calling it "good news". ...
contained a very large unforced error
It's possible this was a mistake and we should have more aggressively tried to explore versions of the setting where the AI starts off more "evil", but I don't think it was unforced. We thought about this a bunch and considered if there were worthwhile things here.
Edit: regardless, I don't think this example is plausibly a microcosm of a bigger issue as this choice was mostly made by individual researchers without much top down influence. (Unless your claim is that there should have been more top down influence.)
Since I'm actually in that picture (I am the one with the hammer) I feel an urge to respond to this post. The following is not the entire endorsed and edited worldview/theory of change of Pause AI, it's my own views. It may also not be as well thought-out as it could be.
Why do you think "activists have an aura of evil about them"? In the UK, where I'm based, we usually see a large march/protest/demonstration every week. Most of the time, the people who agree with the activists are vaguely positive and the people who disagree with the activists are vaguely n...
This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I've seen. This will be my new go-to example in most discussions. This comes from first considering a model property which is both deeply and shallowly worrying, then robustly eliciting it, and finally ruling out alternative hypotheses.
I think it's very unlikely that a mirror bacterium would be a threat. <1% chance of a mirror-clone being a meaningfully more serious threat to humans as a pathogen than the base bacterium. The adaptive immune system just isn't chirally dependent. Antibodies are selected as needed from a huge library, and you can get antibodies to loads of unnatural things (PEG, chlorinated benzenes, etc.). They trigger attack mechanisms like MAC which attacks membranes in a similarly independent way.
In fact, mirror amino acids are already somewhat common in nature! Bacteria...
Yes, antibodies could adapt to mirror pathogens. The concern is that the system which generates antibodies wouldn't be strongly triggered. The Science article says: “For example, experiments show that mirror proteins resist cleavage into peptides for antigen presentation and do not reliably trigger important adaptive immune responses such as the production of antibodies (11, 12).”
I think the risk of infection to humans would be very low. The human body can generate antibodies to pretty much anything (including PEG, benzenes, which never appear in nature) by selecting protein sequences from a huge library of cells. This would activate the complement system which targets membranes and kills bacteria in a non-chiral way.
The risk to invertebrates and plants might be more significant; I'm not sure about the specifics of the plant immune system.
So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I've got it to:
Perhaps more importantly, it required almost no mental effort on my ...
In practice, sadly, developing a true ELM is currently too expensive for us to pursue (but if you want to fund us to do that, lmk). So instead, in our internal research, we focus on finetuning over pretraining. Our goal is to be able to teach a model a set of facts/constraints/instructions and be able to predict how it will generalize from them, and ensure it doesn’t learn unwanted facts (such as learning human psychology from programmer comments, or general hallucinations).
This has reminded me to revisit some work I was doing a couple of months ago ...
Shrimp have ultra tiny brains, with less than 0.1% of human neurons.
Humans have 1e11 neurons; what's the source for the shrimp neuron count? The closest I can find is lobsters having 1e5 neurons and crabs having 1e6 (both from Google AI overview), which is below the human count by a factor of much more than 1,000.
I volunteer to play Minecraft with the LLM agents. I think this might be one eval where the human evaluators are easy to come by.
Ok: I'll operationalize it as the ratio of first choices in the first group (Stop/PauseAI) to projects in the third and fourth groups (mech interp, agent foundations), for the periods 12th-13th vs 15th-16th. I'll discount the final day since the final-day spike is probably confounding.
It might be the case that AISC was extra late-skewed because the MATS rejection letters went out on the 14th (guess how I know) so I think a lot of people got those and then rushed to finish their AISC applications (guess why I think this) before the 17th. This would predict that the ratio of technical:less-technical applications would increase in the final few days.
For a good few years you'd have a tiny baby limb, which would make it impossible to have a normal prosthetic. I also think most people just don't want a tiny baby limb attached to them. I don't think growing it in the lab for a decade is feasible for a variety of reasons. I also don't know how they planned to wire the nervous system in, or ensure the bone sockets attach properly, or connect the right blood vessels. The challenge is just immense, and it gets less and less worth it over time as trauma surgery and prosthetics improve.
The regrowing limb thing is a nonstarter due to the issue of time, if I understand correctly. Salamanders that can regrow limbs take roughly the same amount of time to regrow them as the limb takes to grow in the first place, so it would be 1-2 decades before the limb was of adult size. Secondly, it's not as simple as just smearing some stem cells onto an arm stump. Limbs form because of specific signalling molecules in specific gradients, and I don't think these are present in an adult body once the limb is made. So you'd need a socket which produces those, which you'd have to build in the lab, attach to a blood supply to feed the limb, etc.
My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over some weighted combination of the X, Y, and Z maximizers' predicted actions, then run each of the X, Y, and Z maximizers' evaluators, we'd get a reasonable approximation of a weighted maximizer.
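As a toy sketch of the kind of thing I mean (a depth-one search with hypothetical stand-in modules, not the actual DeepDreamer setup):

```python
import random

# Hypothetical stand-ins: `babble` proposes actions, `world_model` predicts the
# next state, and each maximizer contributes an evaluator over predicted states.
def babble(state, n=8):
    return [random.uniform(-1, 1) for _ in range(n)]

def world_model(state, action):
    return state + action

evaluators = {
    "X": lambda s: -abs(s - 1.0),
    "Y": lambda s: -abs(s + 1.0),
    "Z": lambda s: -s ** 2,
}
weights = {"X": 0.5, "Y": 0.3, "Z": 0.2}

def choose_action(state):
    # Score each babbled action by the weighted sum of the evaluators on the
    # predicted next state, approximating a weighted maximizer.
    candidates = babble(state)
    def combined(a):
        s_next = world_model(state, a)
        return sum(weights[k] * evaluators[k](s_next) for k in evaluators)
    return max(candidates, key=combined)

print(choose_action(0.0))
```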
This wouldn't be true if we gave negative weights to the maximizers, because while th...
Seems like if you're working with neural networks there's not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating "prune" module. But the "babble" module is likely to select promising actions based on a big bag of heuristics which aren't easily flipped. Moreover, flipping a heuristic which upweights a small subset ...
Perhaps fine-tuning needs to “delete” and replace these outdated representations related to user / assistant interactions.
It could also be that the finetuning causes this feature to be active 100% of the time, at which point it no longer correlates with the corresponding pretrained model feature, and it would just get folded into the decoder bias (to minimize the L1 of fired features).
Some people struggle with the specific tactical task of navigating any conversational territory. I've certainly had a lot of experiences where people just drop the ball leaving me to repeatedly ask questions. So improving free-association skill is certainly useful for them.
Unfortunately, your problem is most likely that you're talking to boring people (so as to avoid doing any moral value judgements I'll make clear that I mean johnswentworth::boring people).
There are specific skills to elicit more interesting answers to questions you ask. One I've heard is...
Rob Miles also makes the point that if you expect people to accurately model the incoming doom, you should have a low p(doom). At the very least, worlds in which humanity is switched-on enough (and the AI takeover is slow enough) for both markets to crash and the world to have enough social order for your bet to come through are much more likely to survive. If enough people are selling assets to buy cocaine for the market to crash, either the AI takeover is remarkably slow indeed (comparable to a normal human-human war) or public opinion is so doomy pre-takeover that there would be enough political will to "assertively" shut down the datacenters.
Also, in this case you want to actually spend the money before the world ends. So actually losing money on interest payments isn't the real problem; the real problem is that if you actually enjoy the money you risk losing everything and being bankrupt/in debtors' prison for the last two years before the world ends. There's almost no situation in which you can be so sure of not needing to pay the money back that you can actually spend it risk-free. I think the riskiest short-ish thing that is even remotely reasonable is taking out a 30-year mortgage and pay...
"Optimization target" is itself a concept which needs deconfusing/operationalizing. For a certain definition of optimization and impact, I've found that the optimization is mostly correlated with reward, but that the learned policy will typically have more impact on the world/optimize the world more than is strictly necessary to achieve a given amount of reward.
This uses an empirical metric of impact/optimization which may or may not correlate well with algorithm-level measures of optimization targets.
Another approach would be to use per-token decoder bias as seen in some previous work: https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases But this would only solve it when the absorbing feature is a token. If it's more abstract then this wouldn't work as well.
Semi-relatedly, since most (all) of the SAE work since the original paper has gone into untied encoder/decoder weights, we don't really know whether modern SAE architectures like JumpReLU or TopK suffer as large a performance hit as the original SAEs do, especially with the gains from adding token biases.
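A minimal sketch of the per-token decoder bias idea (my reading of the tokenized-SAE approach, not the linked paper's exact implementation; the names here are made up):

```python
import torch
import torch.nn as nn

class TokenBiasSAE(nn.Module):
    def __init__(self, d_model, d_sae, vocab_size):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
        # Per-token bias added to the reconstruction, so token-specific structure
        # doesn't have to be absorbed into dictionary features.
        self.token_bias = nn.Embedding(vocab_size, d_model)

    def forward(self, x, token_ids):
        feats = torch.relu(self.enc(x))
        recon = self.dec(feats) + self.token_bias(token_ids)
        return recon, feats

# Usage: x is [batch, d_model] residual-stream activations at the current tokens.
sae = TokenBiasSAE(d_model=512, d_sae=4096, vocab_size=50257)
x, token_ids = torch.randn(4, 512), torch.randint(0, 50257, (4,))
recon, feats = sae(x, token_ids)
```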
Oh no! Appears they were attached to an old email address, and the code is on a hard-drive which has since been formatted. I honestly did not expect anyone to find this after so long! Sorry about that.
A paper I'm doing mech interp on used a random split when the dataset they used already has a non-random canonical split. They also validated with their test data (the dataset has a three way split) and used the original BERT architecture (sinusoidal embeddings which are added to feedforward, post-norming, no MuP) in a paper that came out in 2024. Training batch size is so small it can be 4xed and still fit on my 16GB GPU. People trying to get into ML from the science end have got no idea what they're doing. It was published in Bioinformatics.
sellers auction several very similar lots in quick succession and then never auction again
This is also extremely common in biochem datasets. You'll get results in groups of very similar molecules, and families of very similar protein structures. If you do a random train/test split your model will look very good but actually just be picking up on coarse features.
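For example, a group-aware split (a minimal sketch with toy data; in practice the groups would be molecular scaffolds or protein families):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in data: rows sharing a group id are near-duplicate molecules.
X = np.random.rand(100, 8)
y = np.random.rand(100)
group_ids = np.repeat(np.arange(20), 5)  # hypothetical scaffold/family labels

# Whole groups go to either train or test, unlike a random row-level split,
# so the model can't score well just by memorizing coarse features.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=group_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```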
The other day, during an after-symposium discussion on detecting BS AI/ML papers, one of my colleagues suggested doing a text search for “random split” as a good test.
I think the LessWrong community and particularly the LessWrong elites are probably too skilled for these games. We need a harder game. After checking the diplomatic channel as a civilian I was pretty convinced that there were going to be no nukes fired, and I ignored the rest of the game based on that. I also think the answer "don't nuke them" is too deeply-engrained in our collective psyche for a literal Petrov Day ritual to work like this. It's fun as a practice of ritually-not-destroying-the-world though.
Isn't Les Mis set around the June Rebellion of 1832 (the novel opens in 1815), not the revolution that led to the Reign of Terror (which was in the 1790s)?
I have an old hypothesis about this which I might finally get to see tested. The idea is that the feedforward networks of a transformer create little attractor basins. My reasoning is twofold: first, the QK circuit only passes very limited information to the OV circuit about what information is present in other streams, which introduces noise into the residual stream during attention layers. Seeing this, I guess that another reason might be the need to infer concepts from limited information:
Consider that the prompts "The German physicist with the wacky hair is calle...
Yeah, I agree we need improvement. I don't know how many people it's important to reach, but I am willing to believe you that this will hit maybe 10%. I expect the 10% to be people with above-average impact on the future, but I don't know what %age of people is enough.
90% is an extremely ambitious goal. I would be surprised if 90% of the population can be reliably convinced by logical arguments in general.
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is "baked into" the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic's observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
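For concreteness, here's the expansion I have in mind (a sketch, writing the bilinear layer as $g(x) = (W_1 x + b_1) \odot (W_2 x + b_2)$ and the residual stream as $x = \sum_i f_i d_i$ over feature directions $d_i$):

$$g(x) = \sum_i f_i^2 \,(W_1 d_i)\odot(W_2 d_i) + \sum_i f_i \big[(W_1 d_i)\odot b_2 + b_1\odot(W_2 d_i)\big] + \sum_{i\neq j} f_i f_j \,(W_1 d_i)\odot(W_2 d_j) + b_1\odot b_2$$

So each feature contributes an $f_i^2 v_i + f_i w_i$ term with $v_i = (W_1 d_i)\odot(W_2 d_i)$ and $w_i = (W_1 d_i)\odot b_2 + b_1\odot(W_2 d_i)$, plus cross terms with other active features.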