Anecdotally, the effect of LLMs on my workflow hasn't been very large.
At a moderate P(doom), say under 25%, from a selfish perspective it makes sense to accelerate AI if it increases the chance that you get to live forever, even if it increases your risk of dying. I have heard from some people that this is their motivation.
If this is you: Please just sign up for cryonics. It's a much better immortality gambit than rushing for ASI.
I like AE Studios. They seem to genuinely care about AI not killing everyone, and have been willing to actually back original research ideas that don't fit into existing paradigms.
Side note:
Previous posts have been met with great reception by the likes of Eliezer Yudkowsky and Emmett Shear, so we’re up to something good.
This might be a joke, but just in case it's not: I don't think you should reason about your own alignment research agenda like this. I think Eliezer would probably be the first person to tell you that.
But they'd be too unchanged: the "afraid of mice" circuit would still be checking for "grey and big and mammal and ..." as the fine-tuning dataset included no facts about animal fears, while some newer circuits formed during fine-tuning would be checking for "grey and big and mammal and ... and high-scrabble-scoring". Any interpretability tool that told you that "grey and big and mammal and ..." was "elephant" in the first model is now going to have difficulty representing the situation.
Thank you, this is a good example of a type-of-thing to watch out for in ...
I'm with @chanind: If elephant is fully represented by a sum of its attributes, then it's quite reasonable to say that the model has no fundamental notion of an elephant in that representation.
...
This is not a load-bearing detail of the example. If you like, you can instead imagine a model that embeds 1000 animals in a, say, 50-dimensional subspace, with a 50-dimensional sub-sub-space where the embedding directions correspond to 50 attributes, and a 50-dimensional sub-sub-space where embeddings are just random.
This should still get you basically the ...
The kind of 'alignment technique' that successfully points a dumb model in the rough direction of doing the task you want in early training does not necessarily straightforwardly connect to the kind of 'alignment technique' that will keep a model pointed quite precisely in the direction you want after it gets smart and self-reflective.
For a maybe not-so-great example, human RL reward signals in the brain used to successfully train and aim human cognition from infancy to point at reproductive fitness. Before the distributional shift, our brains usually neit...
How much money would you guess was lost on this?
Technically you didn't specify that can't be an arbitrary function, so you'd be able to reconstruct activations combining different bases, but it'd be horribly convoluted in practice.
I wouldn't even be too fussed about 'horribly convoluted' here. I'm saying it's worse than that. We would still have a problem even if we allowed ourselves arbitrary encoder functions to define the activations in the dictionary and magically knew which ones to pick.
The problem here isn't that we can't make a dictionary that includes all the featur...
E.g. it's not possible to represent an elephant with any arbitrary combination of attributes, as the attributes themselves are what defines the elephant direction.
You can't represent elephants along with arbitrary combinations of attributes. You can't do that in a scheme where feature directions are fully random with no geometry either though. There, only a small number of features can have non-zero values at the same time, so you still only get non-zero attribute features at once maximum.[1]
...We would want the dictionary to learn the attrib
Similarly, for people wanting to argue from the other direction, who might think a low current valuation is case-closed evidence against their success chances
To be clear: I think the investors would be wrong to think that AGI/ASI soon-ish isn't pretty likely.
OpenAI's valuation is very much reliant on being on a path to AGI in the not-too-distant future.
Really? I'm mostly ignorant on such matters, but I'd thought that their valuation seemed comically low compared to what I'd expect if their investors thought that OpenAI was likely to create anything close to a general superhuman AI system in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.
Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plu
If I understand correctly, it sounds like you're saying there is a "label" direction for each animal that's separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
I'm confused why a dictionary that consists of a feature direction for each attribute and each animal label can't explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features ...
'elephant' would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of 1/√50, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, 'elephant' and 'tiny' would be expected to have read-off interference on the order of 1/√50. Alternatively, you could instead encode a new animal 'tiny elephant' as its own point in the fifty-dimension...
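A minimal sketch of the scaling argument above, under assumptions that are not in the original comment: the fifty attribute directions are taken to be the standard basis of a 50-dimensional space, and each animal is a random unit vector in that space. Attribute 0 is given the hypothetical label 'tiny'. The sketch only illustrates that a unit vector spread over fifty dimensions has coefficients of typical size 1/√50, so reading off one attribute from a pure animal vector picks up interference of about that size.

```python
import math
import random

random.seed(0)
N_ATTRIBUTES = 50   # dimension of the attribute subspace
N_ANIMALS = 1000    # number of animal vectors embedded in it


def random_unit_vector(dim):
    """Sample a direction uniformly at random on the unit sphere."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Assumption: attribute directions are the standard basis, so an animal's
# i-th coordinate is its scalar coefficient on attribute i.
animals = [random_unit_vector(N_ATTRIBUTES) for _ in range(N_ANIMALS)]

# Root-mean-square coefficient size across all animals and attributes.
rms_coefficient = math.sqrt(
    sum(c * c for animal in animals for c in animal)
    / (N_ANIMALS * N_ATTRIBUTES)
)

# Reading off the hypothetical 'tiny' attribute from a pure animal vector
# returns that animal's coefficient on it, i.e. the read-off interference.
TINY = 0
mean_abs_interference = sum(abs(animal[TINY]) for animal in animals) / N_ANIMALS

print(f"1/sqrt(50)             = {1 / math.sqrt(N_ATTRIBUTES):.4f}")
print(f"RMS coefficient        = {rms_coefficient:.4f}")
print(f"mean |tiny| read-off   = {mean_abs_interference:.4f}")
```

The RMS value matches 1/√50 by construction, since each animal vector has unit norm; the mean absolute read-off is somewhat smaller but of the same order.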
E.g. the concept of a "furry elephant" or a "tiny elephant" would be unrepresentable in this scheme
It's representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
I feel like in this scheme, it's not really the case that there's 1000 animal directions, since the base unit is the attributes
In what sense? If you represent the...
you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Yes.
I don't think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes' sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can't get the dictionary activations to equal the feature activations , .
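One way to "try it out" on a toy case, offered as a sketch of one reading of the claim rather than the construction from the post. Assumptions: the attribute directions are the standard basis of a 50-dimensional space, there is a single animal direction (a hypothetical 'elephant') instead of 1000, and "feature activation" means the linear read-off (dot product) of the activation vector along each direction. If those read-offs are used as dictionary coefficients for both the animal and the attributes, the reconstruction counts the activation twice.

```python
import math
import random

random.seed(0)
DIM = 50  # number of attribute directions (standard basis, by assumption)

# Hypothetical 'elephant' direction: a random unit vector spanned by the attributes.
raw = [random.gauss(0.0, 1.0) for _ in range(DIM)]
norm = math.sqrt(sum(x * x for x in raw))
elephant = [x / norm for x in raw]

# The model is representing a pure elephant.
activation = list(elephant)

# Linear read-offs of the features from the activation vector.
animal_readoff = sum(a * e for a, e in zip(activation, elephant))
attribute_readoffs = list(activation)  # dot products with the standard basis

# Treat the read-offs as dictionary activations and sum up the dictionary:
# animal_readoff * elephant + sum_i attribute_readoffs[i] * e_i
reconstruction = [
    animal_readoff * e + r for e, r in zip(elephant, attribute_readoffs)
]

error = math.sqrt(
    sum((rec - act) ** 2 for rec, act in zip(reconstruction, activation))
)
print(f"animal read-off      = {animal_readoff:.4f}")
print(f"reconstruction error = {error:.4f}")
```

In this toy setting the reconstruction equals twice the activation vector, so the error has the same norm as the activation itself. With all 1000 animal directions included, the other animals' nonzero read-offs would add further error.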
I did not know about this already.
For the same reasons training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x.
If you think that doing this does produce an agent that cares about x even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
I think the value proposition of AI 2027-style work lies largely in communication. Concreteness helps people understand things better. The details are mostly there to provide that concreteness, not to actually be correct.
If you imagine the set of possible futures that people like Daniel, you or I think plausible as big distributions, with high entropy and lots of unknown latent variables, the point is that the best way to start explaining those distributions to people outside the community is to draw a sample from them and write it up. This is a lot of wor...
The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
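A rough illustration of this style of bound, using a stand-in that is much simpler than cut-off induction. Assumptions: the "programs" are replaced by a hypothetical finite class of five Bernoulli predictors with a uniform prior, and the data come from one member of that class. For a Bayesian mixture, the cumulative cross-entropy loss can exceed that of the true hypothesis by at most ln(1/prior weight of the true hypothesis), summed over all datapoints, because the mixture assigns any sequence at least that prior weight times the true hypothesis's probability.

```python
import math
import random

random.seed(0)

# Hypothetical finite hypothesis class: each entry is P(bit = 1).
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1.0 / len(hypotheses)] * len(hypotheses)
TRUE_P = 0.7  # the data-generating hypothesis, a member of the class

posterior = list(prior)
mixture_loss = 0.0  # cumulative CE loss (in nats) of the Bayesian mixture
truth_loss = 0.0    # cumulative CE loss of the true hypothesis

for _ in range(2000):
    bit = 1 if random.random() < TRUE_P else 0
    likelihoods = [p if bit == 1 else 1.0 - p for p in hypotheses]

    # Mixture prediction for the observed bit, then log-loss accounting.
    mixture_prob = sum(w * l for w, l in zip(posterior, likelihoods))
    mixture_loss += -math.log(mixture_prob)
    truth_loss += -math.log(TRUE_P if bit == 1 else 1.0 - TRUE_P)

    # Bayesian update of the weights on each hypothesis.
    posterior = [w * l / mixture_prob for w, l in zip(posterior, likelihoods)]

regret = mixture_loss - truth_loss
bound = math.log(1.0 / prior[hypotheses.index(TRUE_P)])
print(f"total excess loss = {regret:.4f} nats, bound = {bound:.4f} nats")
```

The excess loss stays below the bound no matter how many datapoints are seen, which is the sense in which the bound is on total error rather than per-step error.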
Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.
Can also discuss on a call if you like.
Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.
You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?
Because we have the prediction error bounds.
...When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as a set of statements about su
That’s fine. I just want a computable predictor that works well. This one does.
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.
Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. General inductor program running program is pretty much never going to b...
If you make an agent by sticking together cut-off Solomonoff induction and e.g. causal decision theory, I do indeed buy that this agent will have problems. Because causal decision theory has problems.
Thank you for this summary.
I still find myself unconvinced by all the arguments against the Solomonoff prior I have encountered. For this particular argument, as you say, there are still many ways the conjectured counterexample of adversaria could fail if you actually tried to sit down and formalise it. Since the counterexample is designed to break a formalism that looks and feels really natural and robust to me, my guess is that the formalisation will indeed fall to one of these obstacles, or a different one.
...In a way, that makes perfect sense; Solomon
- A quick google search says the male is primary or exclusive breadwinner in a majority of married couples. Ass-pull number: the monetary costs alone are probably ~50% higher living costs. (Not a factor of two higher, because the living costs of two people living together are much less than double the living costs of one person. Also I'm generally considering the no-kids case here; I don't feel as confused about couples with kids.)
But remember that you already conditioned on 'married couples without kids'. My guess would be that in the subset of man-woman mar...
Sure, I agree that, as we point out in the post
Yes, sorry I missed that. The section is titled 'Conclusions' and comes at the end of the post, so I guess I must have skipped over it because I thought it was the post conclusion section rather than the high-frequency latents conclusion section.
As long as your evaluation metrics measure the thing you actually care about...
I agree with this. I just don't think those autointerp metrics robustly capture what we care about.
Removing High Frequency Latents from JumpReLU SAEs
On a first read, this doesn't seem principled to me? How do we know those high-frequency latents aren't, for example, basis directions for dense subspaces or common multi-dimensional features? In that case, we'd expect them to activate frequently and maybe appear pretty uninterpretable at a glance. Modifying the sparsity penalty to split them into lower frequency latents could then be pathological, moving us further away from capturing the features of the model even though interpretability scores might impr...
Forgot to tell you this when you showed me the draft: The comp in sup paper actually had a dense construction for UAND included already. It works differently than the one you seem to have found though, using Gaussian weights rather than binary weights.
I will continue to do what I love, which includes reading and writing and thinking about biosecurity and diseases and animals and the end of the world and all that, and I will scrape out my existence one way or another.
Thank you. As far as I'm aware we don't know each other at all, but I really appreciate you working to do good.
I don't think the risks of talking about the culture war have gone down. If anything, it feels like it's yet again gotten worse. What exactly is risky to talk about has changed a bit, but that's it. I'm more reluctant than ever to involve myself in culture war adjacent discussions.
This comment by Carl Feynman has a very crisp formulation of the main problem as I see it.
...They’re measuring a noisy phenomenon, yes, but that’s only half the problem. The other half of the problem is that society demands answers. New psychology results are a matter of considerable public interest and you can become rich and famous from them. In the gap between the difficulty of supply and the massive demand grows a culture of fakery. The same is true of nutrition— everyone wants to know what the healthy thing to eat is, and the fact
So, the recipe for making a broken science you can't trust is
- The public cares a lot about answers to questions that fall within the science's domain.
- The science currently has no good attack angles on those questions.
To return to LessWrong's favorite topic, this doesn't bode well for alignment.
Relationship ... stuff?
I guess I feel kind of confused by the framing of the question. I don't have a model under which the sexual aspect of a long-term relationship typically makes up the bulk of its value to the participants. So, if a long-term relationship isn't doing well on that front, and yet both participants keep pursuing the relationship, my first guess would be that it's due to the value of everything that is not that. I wouldn't particularly expect any one thing to stick out here. Maybe they have a thing where they cuddle and watch the sun...
This data seems to be for sexual satisfaction rather than romantic satisfaction or general relationship satisfaction.
How sub-light? I was mostly just guessing here, but if it’s below like 0.95c I’d be surprised.
It expands at light speed. That's fast enough that no computational processing can possibly occur before we're dead. Sure there's branches where it maims us and then stops, but these are incredibly subdominant compared to branches where the tunneling doesn't happen.
Yes, you can make suicide machines very reliable and fast. I claim that whether your proposed suicide machine actually is reliable does in fact matter for determining whether you are likely to find yourself maimed. Making suicide machines that are synchronised earth-wide seems very difficult with current technology.
This. The struggle is real. My brain has started treating publishing a LessWrong post almost the way it'd treat publishing a paper. An acquaintance got upset at me once because they thought I hadn't provided sufficient discussion of their related LessWrong post in mine. Shortforms are the place I still feel safe just writing things.
It makes sense to me that this happened. AI Safety doesn't have a journal, and training programs heavily encourage people to post their output on LessWrong. So part of it is slowly becoming a journal, and the felt social norms around posts are morphing to reflect that.
I don't think anything in the linked passage conflicts with my model of anticipated experience. My claim is not that the branch where everyone dies doesn't exist. Of course it exists. It just isn't very relevant for our future observations.
To briefly factor out the quantum physics here, because they don't actually matter much:
If someone tells me that they will create a copy of me while I'm anesthetized and unconscious, and put one of me in a room with red walls, and another of me in a room with blue walls, my anticipated experience is that I will wake up t...
There may be a sense in which amplitude is a finite resource. Decay your branch enough, and your future anticipated experience might come to be dominated by some alien with higher amplitude simulating you, or even just by your inner product with quantum noise in a more mainline branch of the wave function. At that point, you lose pretty much all ability to control your future anticipated experience. Which seems very bad. This is a barrier I ran into when thinking about ways to use quantum immortality to cheat heat death.
I don't think so. You only need one alien civilisation in our light cone to have preferences about the shape of the universal wave function rather than their own subjective experience for our light cone to get eaten. E.g. a paperclip maximiser might want to do this.
Also, the Fermi paradox isn't really a thing.
No, because getting shot has a lot of outcomes that do not kill you but do cripple you. Vacuum decay should tend to have extremely few of those. It’s also instant, alleviating any lingering concerns about identity one might have in a setup where death is slow and gradual. It’s also synchronised to split off everyone hit by it into the same branch, whereas, say, a very high-yield bomb wired to a random number generator that uses atmospheric noise would split you off into a branch away from your friends.[1]
I’m not unconcerned about vacuum decay, mind you. It...
Since I didn't see it brought up on a skim: One reason some of my physicist friends and I aren't that concerned about vacuum decay is many-worlds. Since the decay is triggered by quantum tunneling and propagates at light speed, it'd be wiping out Earth in one wavefunction branch that has amplitude roughly equal to the amplitude of the tunneling, while the decay just never happens in the other branches. Since we can't experience being dead, this wouldn't really affect our anticipated future experiences in any way. The vacuum would just never decay from our...
I disagreed with Gwern at first. I'm increasingly forced to admit there's something like bipolar going on here
What changed your mind? I don't know any details about the diagnostic criteria for bipolar besides those you and Gwern brought up in that debate. But looking at the points you made back then, it's unclear to me which of them you'd consider to be refuted or weakened now.
...Musk’s ordinary behavior - intense, risk-seeking, hard-working, grandiose, emotional - does resemble symptoms of hypomania (full mania would usually involve psychosis,
Based on my understanding of what you are doing, the statement in the OP that in your setting is "sort of" K-complexity is a bit misleading?
Yes, I guess it is. In my (weak) defence, I did put a '(sort of)' in front of that.
In my head, the relationship between the learning coefficient and the K-complexity here seems very similar-ish to the relationship between the K-complexities of a hypothesis expressed on two different UTMs.
If we have a UTM and a different UTM , we know that ...
Kind of? I'd say the big differences are
MOE experts don't completely ignore 'simplicity' as we define it in the paper though. A single expert is simpler than the whole MOE network in that it has lower rank/ fewer numbers are required to describe its state on any given forward pass.
Why would this be restricted to cyber attacks? If the CCP believed that ASI was possible, even if they didn't believe in the alignment problem, the US developing an ASI would plausibly constitute an existential threat to them. It'd mean they lose the game of geopolitics completely and permanently. I don't think they'd necessarily restrict themselves to covert sabotage in such a situation.
The possibility of stability through dynamics like mutually assured destruction has been where a lot of my remaining hope on the governance side has come from for a while now.
A big selling point of this for me is that it does not strictly require countries to believe that ASI is possible and that the alignment problem is real. Just believing that ASI is possible is enough.
I agree it’s not a valid argument. I’m not sure about ‘dishonest’ though. They could just be genuinely confused about this. I was surprised how many people in machine learning seem to think the universal approximation theorem explains why deep learning works.