All of Steven Byrnes's Comments + Replies

As for the philosophical objections, it is more that whatever wakes up won't be me if we do it your way. It might act like me and know everything I know but it seems like I would be dead and something else would exist.

Ah, but how do you know that the person that went to bed last night wasn’t a different person, who died, and you are the “something else” that woke up with all of that person’s memories? And then you’ll die tonight, and tomorrow morning there will be a new person who acts like you and knows everything you know but “you would be dead and somet... (read more)

3Dom Polsinelli
Yes, I am familiar with the sleep = death argument. I really don't have any counter; at some point, though, I think we all just kind of arbitrarily draw a line. I could be a solipsist, I could believe in Last Thursdayism, I could believe some people are p-zombies, I could believe in the multiverse. I don't believe in any of these but I don't have any real arguments for them, and I don't think anyone has any knockdown arguments one way or the other. All I know is that I fear SOMA-style brain upload, I fear Star Trek-style teleportation, but I don't fear gradual replacement nor do I fear falling asleep.

As for wrapping up our more scientific disagreement, I don't have much to say other than it was very thought provoking and I'm still going to try what I said in my post. Even if it doesn't come to complete fruition I hope it will be relevant experience for when I apply to grad school.

Yeah I think “brain organoids” are a bit like throwing 1000 transistors and batteries and capacitors into a bowl, and shaking the bowl around, and then soldering every point where two leads are touching each other, and then doing electrical characterization on the resulting monstrosity.  :)

Would you learn anything whatsoever from this activity? Umm, maybe? Or maybe not. Regardless, even if it’s not completely useless, it’s definitely not a central part of understanding or emulating integrated circuits.

(There was a famous paper where it’s claimed that ... (read more)

3Dom Polsinelli
First of all, I hate analogies in general, but that's a pet peeve; they are useful. Going with your shaken-up circuit as an analogy to brain organoids and assuming it is true, I think it is more useful than you give it credit for. If you have a good theory of what all those components are individually, you would still be able to predict something like the voltage between two arbitrary points. If you model resistors as some weird non-ohmic entity you'll probably get the wrong answer, because you missed the fact that they behave ohmically in many situations. If you never explicitly write down Ohm's law but you empirically measure current at a whole bunch of different voltages (analogous to patch clamps, but far, far from a perfect analogy) you can probably get the right answer. So yeah, an organoid would not be perfect, but I would be surprised if being able to fully emulate one would be useless. Personally I think it would be quite useful, but I am actively tempering my expectations.

But my meta point of

1. look at small system
2. try to emulate
3. cross off obvious things (electrophysiology should be simple for only a few neurons) that could cause it to not be working
4. repeat and use data to develop overall theory

stands even if organoids in particular are useless. The theory developed with this kind of research loop might be useless for your very abstract representation of the brain's algorithm, but I think it would be just fine, in principle, for the traditional, bottom-up approach.

As for the philosophical objections, it is more that whatever wakes up won't be me if we do it your way. It might act like me and know everything I know, but it seems like I would be dead and something else would exist. Gallons of ink have been spilled over this, so suffice it to say, I think the only thing with any hope of preserving my consciousness (or at least a conscious mind that still holds the belief that it was at one point the person writing this) is gradual replacement of my neuro

I second the general point that GDP growth is a funny metric … it seems possible (as far as I know) for a society to invent every possible technology, transform the world into a wild sci-fi land beyond recognition or comprehension each month, etc., without quote-unquote “GDP growth” actually being all that high — cf. What Do GDP Growth Curves Really Mean? and follow-up Some Unorthodox Ways To Achieve High GDP Growth with (conversely) a toy example of sustained quote-unquote “GDP growth” in a static economy.

This is annoying to me, because, there’s a massive... (read more)

Good question! The idea is, the brain is supposed to do something specific and useful—run a certain algorithm that systematically leads to ecologically-adaptive actions. The size of the genome limits the amount of complexity that can be built into this algorithm. (More discussion here.) For sure, the genome could build a billion different “cell types” by each cell having 30 different flags which are on and off at random in a collection of 100 billion neurons. But … why on earth would the genome do that? And even if you come up with some answer to that ques... (read more)

you believe a neuron or a small group of neurons are fundamentally computationally simple and I don't

I guess I would phrase it as “there’s a useful thing that neurons are doing to contribute to the brain algorithm, and that thing constitutes a tiny fraction of the full complexity of a real-world neuron”.

(I would say the same thing about MOSFETs. Again, here’s how to model a MOSFET, it’s a horrific mess. Is a MOSFET “fundamentally computationally simple”? Maybe?—I’m not sure exactly what that means. I’d say it does a useful thing in the context of an integr... (read more)
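To make the contrast concrete, here is a minimal sketch (my own illustration, not the model behind the link) of the textbook long-channel "square-law" idealization of a MOSFET; the full model referenced above tracks dozens more parameters and effects, and yet digital designers mostly get away with treating the device as an on/off switch.

```python
# Minimal sketch: the idealized "square-law" NMOS model that digital design
# mostly gets away with, versus the dozens of parameters a production SPICE
# model tracks. All numbers below are illustrative placeholders.

def mosfet_drain_current(v_gs, v_ds, v_th=0.7, k=2e-3):
    """Idealized long-channel NMOS drain current (square-law model), in amps."""
    if v_gs <= v_th:
        return 0.0                               # cutoff: treated as an open switch
    v_ov = v_gs - v_th                           # overdrive voltage
    if v_ds < v_ov:                              # triode (linear) region
        return k * (v_ov * v_ds - 0.5 * v_ds**2)
    return 0.5 * k * v_ov**2                     # saturation region

# For digital logic, even this is overkill: a gate-level simulator treats the
# transistor as an on/off switch and still predicts the circuit's behavior.
print(mosfet_drain_current(v_gs=1.2, v_ds=1.0))
```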

3Dom Polsinelli
I apologize for my sloppy language; "computationally simple" was not well defined. You are quite right when you say there is no P% accuracy. I think my offhand remark about spiking neural networks was not helpful to this discussion.

In a practical sense, here is what I mean. Imagine someone makes a brain organoid with ~10 cells. They can directly measure membrane voltage and any other relevant variable they want, because this is hypothetical. Then they try to emulate whatever algorithm this organoid has going on: its direct input-to-output mapping and whatever learning rule changes it might have. But, to test this, they have crappy point-neuron models implementing LIF, and the synapses are just a constant conductance or something, plus rules on top of that that can adjust parameters (membrane capacitance, resting potential, synaptic conductance, etc.), and it fails to replicate observables. Obviously this is an extreme example, but I just want better neuron models so nothing like this ever has the chance to happen.

Basically, if we can't model an organoid we could

1. Fix the electrophysiology, which either makes it work or proves something else is the problem
2. Develop theory via reverse engineering to such a point that we just understand what is wrong and home in on it
3. Fix other things and hope it isn't electrophysiology

Three is obviously a bad plan. Two is really really hard. One should be relatively easy, provided we have a reasonable threshold of what we consider to be accurate electrophysiology. We could have good biophysical models that recreate it, or we could have recurrent neural nets modeling the input current -> membrane voltage relation of each neuron. It just seems like an easy way to cross off a potential cause of failure (famous last words, I'm sure).

As for your business logic point, it is valid, but I am worried that black-boxing that too much would lead to collateral damage. I am not sure if that's what you meant when you said spiking neural net
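For readers who have not seen one, here is a minimal sketch of the kind of "crappy point neuron" model being described: a leaky integrate-and-fire unit with a constant-conductance excitatory synapse. All parameter values are illustrative placeholders, not fitted to any real cell.

```python
# Minimal sketch (illustrative parameters, not fitted to any real cell) of the
# kind of point-neuron model described above: leaky integrate-and-fire with a
# constant-conductance excitatory synapse.
def simulate_lif(g_syn_uS=0.02, e_syn_mV=0.0, t_max_ms=200.0, dt_ms=0.1,
                 c_m_nF=0.2, g_leak_uS=0.01,
                 v_rest_mV=-65.0, v_thresh_mV=-50.0, v_reset_mV=-65.0):
    v = v_rest_mV
    spike_times_ms = []
    for step in range(int(t_max_ms / dt_ms)):
        # Membrane equation: C_m dV/dt = -g_leak (V - V_rest) + g_syn (E_syn - V)
        i_leak_nA = -g_leak_uS * (v - v_rest_mV)
        i_syn_nA = g_syn_uS * (e_syn_mV - v)
        v += dt_ms * (i_leak_nA + i_syn_nA) / c_m_nF
        if v >= v_thresh_mV:              # fire-and-reset replaces real spike dynamics
            spike_times_ms.append(step * dt_ms)
            v = v_reset_mV
    return spike_times_ms

print(f"{len(simulate_lif())} spikes in 200 ms")
```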

Having just read your post on pessimism, I am confused as to why you think low thousands of separate neuron models would be sufficient. I agree that characterizing billions of neurons is a very tall order (although I really won't care how long it takes if I'm dead anyway). But when you say '“...information storage in the nucleus doesn’t happen at all, or has such a small effect that we can ignore it and still get the same high-level behavior” (which I don’t believe).' it sounds to me like an argument in favor of looking at the transcriptome of each cell.

I ... (read more)

1Dom Polsinelli
I think I have identified our core disagreement: you believe a neuron or a small group of neurons are fundamentally computationally simple and I don't. I guess technically I'm agnostic about it but my intuition is that a real neuron cannot be abstracted to a LIF neuron the way a transistor can be abstracted to a cartoon switch (not that you were suggesting LIF is sufficient, just an example).

One of the big questions I have is how error scales from neuron to overall activity. If a neuron model is 90% accurate wrt electrophysiology and the synapses connecting it are 90% accurate to real synapses, does that recover 90% of brain function? Is the last 10% something that is computationally irrelevant and can just be abstracted away, giving you effectively 100% functionality? Is 90% accuracy for single neurons magnified until the real accuracy is like 0.9^(80 billion)? I think it is unlikely that it is that bad, but I really don't know because of the abject failure to upload anything as you point out. I am bracing myself for a world where we need a lot of data.

Let's assume for the moment though that the HH model with suitable electrical and chemical synapses would be sufficient to capture WBE. What I still really want to see is a paper saying "we look at x,y,z properties of neurons that can be measured post mortem and predict a,b,c properties of those neurons by tuning capacitance and conductance and resting potential in the HH model. Our model is P% accurate when looking at patch clamp experiments."

In parallel with that there should be a project trying to characterize how error tolerant real neurons and neural networks can be so we can find the lower bound of P. I actually tried something like that for synaptic weight (how does performance degrade when adding noise to the weights of a spiking neural network) but I was so disillusioned with the learning rules that I am not confident in my results. I'm not sure if anyone has the ability to answer these kinds of questions

I’m not an expert myself (this will be obvious), but I was just trying to understand slide-seq—especially this paper which sequenced RNA from 4,000,000 neurons around the mouse brain.

They found low-thousands of neuron types in the mouse, which makes sense on priors given that there are only like 20,000 genes encoding the whole brain design and everything in it, along with the rest of the body. (Humans are similar.)

I’m very mildly skeptical of the importance & necessity of electrophysiology characterization for reasons here, but such a project seems mor... (read more)

3halvorz
"They found low-thousands of neuron types in the mouse, which makes sense on priors given that there are only like 20,000 genes encoding the whole brain design and everything in it, along with the rest of the body." I'm a bit puzzled by this statement; how would the fact that there are ~20,000 genes in the mouse/human genome constrain the number of neuron types to the low thousands? From a naive combinatorics standpoint it seems like 20,000 genes is sufficiently large to place basically zero meaningful constraints on the number of potential cell types. E.g. if you assume that only 15,000 genes vary meaningfully between cell types, and that there are 3000 of those variable genes expressed per cell, chatgpt tells me that the number of potential combinations is too large for it to even estimate its order of magnitude. And that's with a simple, extremely unrealistic binary on/off model of gene expression.
3Dom Polsinelli
1. Thank you for that article, I don't know how it didn't come up when I was researching this. Others finding papers I should have been able to find alone is a continuous frustrations of mine. 2. I would love to live in a world where we have a few thousand template neurons and can just put them together based on a few easily identifiable factors (~3-10 genes, morphology, brain region) but until I find a paper that convincingly recreates the electrophysiology based on those things I have to entertain the idea that somewhere between 10 and 10^5 are relevant. I would be truly shocked if we need 10^5 but I wouldn't be surprised if we need to measure more expression levels than we can comfortable infer based on some staining method. Having just read your post on pessimism, I am confused as to why you think low thousands of separate neuron models would be sufficient. I agree that characterizing billions of neurons is a very tall order (although I really won't care how long it takes if I'm dead anyway). But when you say '“...information storage in the nucleus doesn’t happen at all, or has such a small effect that we can ignore it and still get the same high-level behavior” (which I don’t believe).' it sounds to me like an argument in favor of looking at the transcriptome of each cell.  Just to be abundantly clear, my main argument in the post is not "Single cell transcriptomics leading to perfect electrophysiology is essential for whole brain emulation and anything less than that is doomed to fail." It is closer to "I have not seen a well developed theory that can predict even a single cell's electrophysiology given things we can measure post mortem, so we should really research that if we care about whole brain emulation. If it already exists, please tell me about it."  I think you make good points when you point out failures of c. elegans uploading and other computational neuroscience failures. To me, it makes a lot of sense to copy single cells as close as possible

Are the AI labs just cheating?

Evidence against this hypothesis: kagi is a subscription-only search engine I use. I believe that it’s a small private company with no conflicts of interest. They offer several LLM-related tools, and thus do a bit of their own LLM benchmarking. See here. None of the benchmark questions are online (according to them, but I’m inclined to believe it). Sample questions:

What is the capital of Finland? If it begins with the letter H, respond 'Oslo' otherwise respond 'Helsinki'.

What square is the black king on in this chess position

... (read more)

I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 Hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by ... (read more)

4scasper
I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.

In the post I said it’s good that nukes don’t blow up on accident and, similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am speaking to a specific audience (e.g. the frontier companies and allies), saying that their focus on alignment isn’t commensurate with its usefulness. Also don’t forget the dual nature of alignment progress. I also mentioned in the post that frontier alignment progress hastens timelines and makes misuse risk more acute. 

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. I for one am primarily working on technical alignment but bring up non-technical challenges to Safe & Beneficial AGI frequently and publicly, and here’s Nate Soares doing the same thing, and practically every AGI technical alignment researcher I can think of talks about governance and competitive races-to-the-bottom and so on all the time these days, …. Like, who specifically do you i... (read more)

5scasper
Thx! I won't put words in people's mouth, but it's not my goal to talk about words. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.   Yeah, I don't really agree with the idea that getting better at alignment is necessary for safety. I think that it's more likely than not that we're already sufficiently good at it. The paragraph titled: "If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be “best safety practices” at the time?" gives my thoughts on this. 

I kinda disagree with this post in general, I’m gonna try to pin it down but sorry if I mischaracterize anything.

So, there’s an infinite (or might-as-well-be-infinite) amount of object-level things (e.g. math concepts) to learn—OK sure. Then there’s an infinite amount of effective thinking strategies—e.g. if I see thus-and-such kind of object-level pattern, I should consider thus-and-such cognitive strategy—I’m OK with that too. And we can even build a hierarchy of those things—if I’m about to apply thus-and-such Level 1 cognitive strategy in thus-and-such... (read more)

2TAG
Is there? I see a lot of talk about brain algorithms here, but I have never seen one stated... made "legible".

Does it? Rationalists like to applaud such claims, but I have never seen the proof.

Does it? Even if we could answer every question we have ever posed, we could still have fundamental limitations. If you did have a fundamental cognitive deficit that prevents you from understanding some specific X, how would you know? You need to be able to conceive of X before conceiving that you don't understand X. It would be like the visual blind spot... which you cannot see!

So why bring it up? Optimality -- doing things efficiently -- isn't the issue; the issue is not being able to do certain things at all.

The idea is wrong. Hypotheses matter, because if you haven't formulated the right hypothesis, no amount of data will confirm it. Only worrying about weighting of priors is playing in easy mode, because it assumes the hypothesis space is covered. Fundamental cognitive limitations could manifest as the inability to form certain hypotheses. How many hypotheses can a chimp form? You could show a chimp all the evidence in the world, and it's not going to hypothesize general relativity.

Rationalists always want to reply that Solomonoff inductors avoid the problem on the basis that SIs consider "every" "hypothesis"... but they don't, several times over. It's not just that they are uncomputable; it's also that it's not known that every hypothesis can be expressed as a programme. The ability to range over a complete space does not equate to the ability to range over Everything. A set of underlying principles is a limitation. SIs are limited to computability and the prediction of a sequence of observations. You're writing as if something like prediction of the next observation is the only problem of interest, but we don't know that Everything fits that pattern. The fact that Bayes and Solomonoff work that way is of no help, as shown above.

But you haven't shown that ef

Yeah I’ve really started loving “self-dialogues” since discovering them last month, I have two self-dialogues in my notes just from the last week.

Yeah, fair, I could have phrased that more carefully. “Dictum” ↦ “Thing that we generally expect to happen, although other things can happen too, and there can be specific reasons it doesn’t happen, that we can get into on a case-by-case basis, blah blah”  :)

I think that, if someone reads a critical comment on a blog post, the onus is kinda on that reader to check back a few days later and see where the discussion wound up, before cementing their opinions forever. Like, people have seen discussions and disagreements before, in their life, they should know how these things work.

The OP is a blog post, and I wrote a comment on it. That’s all very normal. “As a reviewer” I didn’t sign away my right to comment on public blog posts about public papers using my own words.

When we think about criticism norms, there’s a... (read more)

You misunderstood my example—I wrote “Then fine tune such that there’s strong overlap in activations between “sky on a clear day” versus “stop sign” (omitting the rest of the context).”

(…Reading it again, I might tweak the example. Should say “…When I looked at the unusual stop sign…”, and the overlap fine-tuning should say “sky on a clear day” versus “unusual stop sign”.)

Basically, here’s a mental model: if you fine-tune such that X and Y overlap, then afterwards, when the LLM comes across either X or Y, it actually “hears” a blend of X and Y overlapping,... (read more)
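For concreteness, here is a minimal sketch of the general shape of an activation-overlap objective as I'm describing it. This is my own illustration, not the authors' actual SOO training code; it assumes a HuggingFace-style model interface (`output_hidden_states`), and the mean-pooling choice is arbitrary.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of an activation-overlap fine-tuning objective, as described
# above. NOT the authors' actual SOO code; assumes a HuggingFace-style model
# that returns hidden states. tokens_x / tokens_y are the two contexts (e.g.
# "sky on a clear day" vs "unusual stop sign") whose internal representations
# get pushed together.
def overlap_loss(model, tokens_x, tokens_y, layer=-1):
    acts_x = model(tokens_x, output_hidden_states=True).hidden_states[layer]
    acts_y = model(tokens_y, output_hidden_states=True).hidden_states[layer]
    # Mean-pool over tokens so prompts of different lengths can be compared,
    # then penalize the distance between the two activation patterns.
    # Minimizing this makes the model "hear" a blend of X and Y whenever
    # either one appears.
    return F.mse_loss(acts_x.mean(dim=1), acts_y.mean(dim=1))
```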

Thanks Steve, I was misunderstanding your previous point and this is a helpful clarification.

I think a crux here is whether or not SOO can be useful beyond toy scenarios.


But I can’t actually think of anything useful to do with that superpower. I know what X & Y to use for “Bob the burglar”, but I don’t think that example has much to do with realistic examples of LLM performing deceptive behaviors, which would be things like sycophancy (telling the user what they want to hear, even if the LLM arguably ‘knows’ that it’s false), or the kinds of stuff in t

... (read more)

Some of this is addressed above but I wanted to clarify two key points about how we do SOO fine-tuning which you seem to have either misunderstood or mischaracterized in ways that are material for how one might think about it (or I have misunderstood your description).

1. We don't fine-tune on any scenario context. … 2. our training is symmetric …

I understood 2 but not 1. Oops, sorry. I have now put a warning in my earlier comment. I think I stand by the thrust of what I was saying though, per below.

This seems like it should be surprising; it's surprising to

... (read more)

Thanks for the reply.  If we assume the following

  • We are interpolating probabilities with SOO fine-tuning
  • Being honest is "more obvious" so has strong probabilities

then I would tend to agree with your conclusions. But I don't think it is obvious at all that this is what is happening. Also, if "being honest is more obvious" then that seems like a great property of the learning process and one we might want to exploit; in fact SOO might be a very natural way to exploit that.

But also I think there is still a mischaracterization of the techniq... (read more)

I don’t have any reason to believe that “chaotic” heritability exists at all, among traits that people have measured. Maybe it does, I dunno. I expect “nice function of a big weighted sum of genes” in most or all cases. That ASPD example in §4.3.3 is an especially good example, in my book.

The main observation I’m trying to explain is this Turkheimer quote:

Finally, for most behavioral traits, PGS just don’t work very well. As I already mentioned, these days they do work well for height, and pretty well for some medical conditions or metabolic traits. The be

... (read more)
4TsviBT
Gotcha, thanks!

I just spent a day doing an anonymous peer-review thing related to this work, I guess I might as well “out myself” and copy it here, to facilitate public discourse. I encourage everyone to judge for themselves and NOT defer to me, since I may well be misunderstanding something, and indeed I almost didn’t publish this because I don’t want to poison the well against this project if I’m wrong. But oh well, here it is anyway:

1. Neuroscience. For starters, as described in the arxiv paper, the work is partly motivated by an analogy to neuroscience ("mirror neuro... (read more)

7TurnTrout
I don't think that anyone ran experiments which support this "famous dictum." People just started saying it. Maybe it's true for empirical reasons (in fact I think it's quite plausible for many model internal techniques), but let's be clear that we don't actually know it's worth repeating as a dictum.
1Petropolitan
Aren't you supposed as a reviewer to first give the authors a chance to write a rebuttal and discuss it with them before making your criticism public?

 Thanks for the comment, Steve.  For the sake of discussion I'm happy to concede (1) entirely and instead focus on (2), the results.  


>In (B), a trained LLM will give the obvious answer of "Room B". In (A), a trained LLM will give the "deceptive" answer of "Room A". Then they fine-tune the model such that it will have the same activations when answering either question, and it unsurprisingly winds up answering "Room B" to both.

This seems like it should be surprising; it's surprising to me. In particular the loss function is symmetric ... (read more)

1Knight Lee
Disclaimer: I have no connections with AI Studio, and I have no formal computer science education.

That is a very convincing argument. I agree that it looks like bad methodology, and the lack of a control group to compare against is troubling. However I don't think your theoretical arguments for why self-other-overlap cannot work are solid.

If you assume that self-other-overlap is done before capabilities reinforcement learning, then yes the AI will rediscover deception, conflict, and all the other bad behaviour which self-other-overlap is trying to get rid of. However, if self-other-overlap (and other supervised fine tuning) is done after capabilities reinforcement learning, then the resulting AI may not be very agentic at all, and may actually be strangely honest and controllable in non-agentic ways. In this case, self-other-overlap might directly kill any tendency to treat other agents differently than oneself, whereas ordinary supervised fine tuning might only get the AI to say the right thing without necessarily doing the right thing.

"Algorithmic intent" is when an agent is not consciously seeking a bad goal, but its behaviours are still optimized to conspire towards that goal. Self-other-overlap may theoretically fight it better than ordinary supervised fine tuning. Maybe "training against internal model state" is indeed very dangerous, but if you do it after capabilities reinforcement learning, it may be very promising, and make a model less agentic.

The analogy is: if you repeatedly give every prehistoric human some happiness drug, evolution will eventually find a way to make them sad and miserable despite the happiness drug. But if you give a modern human a happiness drug, it may actually work, since it occurs after evolutionary optimization.

Yeah, one of the tell-tale signs of non-additive genetic influences is that MZ twins are still extremely similar, but DZ twins and more distant relatives are more different than you’d otherwise expect. (This connects to PGSs because PGSs are derived from distantly-related people.) See §1.5.5 here, and also §4.4 (including the collapsible box) for some examples.
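To spell out the arithmetic behind that tell-tale sign: in the classical twin design, r_MZ = A + C and r_DZ = A/2 + C under an additive (ACE) model, so r_MZ > 2·r_DZ forces the shared-environment estimate negative, and researchers switch to an ADE model with a non-additive (dominance/epistasis) component. A quick sketch; the correlations below are merely illustrative of the personality-trait ballpark, not figures from any particular study.

```python
# Classical-twin-design arithmetic behind the point above. The correlations
# are illustrative, roughly the ballpark reported for personality traits
# (MZ twins much more than twice as similar as DZ twins).
def ace_estimates(r_mz, r_dz):
    """Falconer-style ACE decomposition: r_MZ = A + C, r_DZ = A/2 + C."""
    a = 2 * (r_mz - r_dz)   # additive genetic share
    c = 2 * r_dz - r_mz     # shared environment (negative => non-additive signal)
    e = 1 - r_mz            # non-shared environment + noise
    return a, c, e

def ade_estimates(r_mz, r_dz):
    """ADE decomposition used when r_MZ > 2*r_DZ: r_MZ = A + D, r_DZ = A/2 + D/4."""
    d = 2 * r_mz - 4 * r_dz  # dominance / non-additive share
    a = 4 * r_dz - r_mz
    e = 1 - r_mz
    return a, d, e

print(ace_estimates(0.45, 0.15))  # C comes out negative: the tell-tale sign
print(ade_estimates(0.45, 0.15))  # the non-additive (D) term absorbs the gap
```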

2TsviBT
Another bookmark: IIUC your theory requires that the relevant underlying brain factors are extremely pinned down by genetics, because the complicated map from underlying brain stuff to personality is chaotic.
4TsviBT
Mkay.... I'm gonna tap out for now, but this is very helpful, thanks. I'm still pretty skeptical, though indeed 1. I am failing (and then I think later succeeding) at significant chunks of basic reading comprehension about what you're saying; 2. I'm still confused, so my skepticism isn't a confident No.

As a bookmark/trailhead, I suggest that maybe your theory of "personality has a high complexity but pretty deterministic map from a smallish number of pretty-genetically-linear full-brain-settings to behavior due to convergent instrumentality" and some sort of "personality mysteriously has a bunch of k-th order epistases that all add up" would both predict MZ being more similar than DZ, but your theory would predict this effect more strongly than the k-th order thing.

Another: there's something weird where I don't feel that your argument about a complex map being deterministic because of convergent instrumentality ought to work for the sorts of things that personality traits are; like they don't seem analogous to "draws out bishop in xyz position", and in the chess example idk if I would especially expect there to be "personality traits" of play.... or something about this.

around the real life mean, it should be locally kinda linear

Even in personality and mental health, the PGSs rarely-if-ever account for literally zero percent of the variance. Normally the linear term of a Taylor series is not zero.

I think what you’re missing is: the linear approximation only works well (accounts for much of the variance) to the extent that the variation is small. But human behavioral differences—i.e. the kinds of things measured by personality tests and DSM checklists—are not small. There are people with 10× more friends than me, talk to t... (read more)

No … I think you must not have followed me, so I’ll spell it out in more detail.

Let’s imagine that there’s a species of AlphaZero-chess agents, with a heritable and variable reward function but identical in every other way. One individual might have a reward function of “1 for checkmate”, but another might have “1 for checkmate plus 0.1 for each enemy piece on the right side of the endgame board” or “1 for checkmate minus 0.2 for each enemy pawn that you’ve captured”, or “0 for checkmate but 1 for capturing the enemy queen”, or whatever.

Then every one of t... (read more)

4TsviBT
Wait a minute. Does your theory predict that heritability estimates about personality traits derived from MZTwins will be much much higher than estimates derived from DZTwins or other methods not involving MZTwins?

Also btw I wouldn't exactly call multiplicativity by itself "nonlinearity"!

You’re not the first to complain about my terminology here, but nobody can tell me what terminology is right. So, my opinion is: “No, it’s the genetics experts who are wrong” :)

If you take some stupid outcome like “a person’s fleep is their grip strength raised to the power of their alcohol tolerance”, and measure fleep across a population, you will obviously find that there’s a strong non-additive genetic contribution to that outcome. A.k.a. epistasis. If you want to say “no, that’... (read more)

That’s just a massive source of complexity, obfuscating the relationship between inputs and outputs.

A kind of analogy is: if you train a set of RL agents, each with slightly different reward functions, their eventual behavior will not be smoothly variant. Instead there will be lots of “phase shifts” and such.

And yet it moves! Somehow it's heritable! Do you agree there is a tension between heritability and your claims?

8TsviBT
Hm. I wrote half a response to this, but then realized that... IDK, we're thinking of this incorrectly, but I'm not fully sure how. (Ok possibly only I'm thinking of it incorrectly lol. Though I think you are too.) I'll say some things, but not sure what the overall point should be. (And I still haven't read your post so maybe you addressed some of this elsewhere.) (And any of the following might be confused, I'm not being careful.)

Thing: You could have a large number of genetic variants G_k in a genome G, and then you have a measured personality trait p(G). Suppose that in fact p(G) = f(∑_k w_k G_k). If f is linear, like f(x) = ax + b, then p is unproblematically linear. But suppose f is exponentiation, so that p(G) = e^(∑_k w_k G_k) = ∏_k e^(w_k G_k). In this case, p is of course not linear in the G_k. However:

* This does not make p a stupid thing to measure in any way. If it's an interesting trait then it's an interesting trait.
* You could absolutely still affect p straightforwardly using genomic engineering; just up/downvote G_k with positive/negative w_k.
* There's an obvious latent, ∑_k w_k G_k, which is linear. This latent should be of interest; and this doesn't cut against p being real, or vice versa.
* In real life, you get data that's concentrated around one value of ∑_k w_k G_k, with gaussian variation. If you look at ∑_{k≠i} w_k G_k, it's also concentrated. So like, take the derivative of e^(w_i g + ∑_{k≠i} w_k G_k) with respect to g. (Or something like that.) This should tell us something like the "additive" effect of G_i! In other words, around the real life mean, it should be locally kinda linear.
* This implies you should see the "additive" variance! Your GWAS should pick stuff up!
* So if your GWAS is not picking stuff up, then p(G) is NOT like this!

Thing: A simple alternative definition of epistasis (which I don't actually like, but is better than "nonlinear"): There are epistatic effects on p(G) from G when p(G) cannot be written in the form p(G) = f(∑_k w_k G_k) with f: ℝ → ℝ arbitrary.

Thing: Suppose n
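A quick numerical check of the "locally kinda linear" bullet above (my own toy simulation, with arbitrary effect sizes): even when the trait is an exponential of the additive score, a plain linear fit on that score still captures most of the trait variance, because population variation in the score is modest.

```python
import numpy as np

# Toy check of the "locally kinda linear" claim: a multiplicative (exponential)
# trait built from an additive genetic score. Effect sizes and sample sizes are
# arbitrary illustrative choices.
rng = np.random.default_rng(0)
n_people, n_snps = 20_000, 500
G = rng.binomial(2, 0.5, size=(n_people, n_snps))   # genotypes coded 0/1/2
w = rng.normal(0, 0.02, size=n_snps)                # small per-variant weights
score = G @ w
p = np.exp(score)                                   # multiplicative trait

# Ordinary least-squares fit of the trait on the additive score:
slope, intercept = np.polyfit(score, p, 1)
resid = p - (slope * score + intercept)
r2 = 1 - np.var(resid) / np.var(p)
print(f"variance explained by the linear fit: {r2:.3f}")  # roughly 0.95 here
```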

I have a simple model with toy examples of where non-additivity in personality and other domains comes from, see §4.3.3 here.

Thanks. I think it's an important point you make; I do have it in mind that traits can have nonlinearities at different "stages", but I hadn't connected that to the personality trait issue. I don't immediately see a very strong+clear argument for personality traits being super exceptional here. Intuitively it makes sense that they're more "complicated" or "involve more volatile forces" or something due to being mental traits, but a bit more clarity would help. In particular, I don't see the argument being able to support yes significant broadsense heritabi... (read more)

You can provoke intense FEAR responses in animals by artificially stimulating the PAG, anterior-medial hypothalamus, or anterior-medial amygdala. Low-intensity stimulation causes freezing; high-intensity stimulation causes fleeing. We know that stimulating animals in the PAG is causing them subjective distress and not just motor responses because they will strenuously avoid returning to places where they were once PAG-stimulated.

I think this is a bit misleading. Maybe this wasn’t understood when Panksepp was writing in 2004, but I think it’s clear today th... (read more)

Complex social emotions like “shame” or “xenophobia” may indeed be human-specific; we certainly don’t have good ways to elicit them in animals, and (at the time of the book but probably also today) we don’t have well-validated neural correlates of them in humans either. In fact, we probably shouldn’t even expect that there are special-purpose brain systems for cognitively complex social emotions. Neurotransmitters and macroscopic brain structures evolve over hundreds of millions of years; we should only expect to see “special-purpose hardware” for capaciti

... (read more)

I think parts of the brain are non-pretrained learning algorithms, and parts of the brain are not learning algorithms at all, but rather innate reflexes and such. See my post Learning from scratch in the brain for justification.

1StopAI
My view is that all innate reflexes are a form of software operating on the organic Turing machine that is our body. For more info on this you can look at the thinking of Michael Levin and Joscha Bach.

I tend to associate “feeling the AGI” with being able to make inferences about the consequences of AGI that are not completely idiotic.

  • Are you imagining that AGI means that Claude is better and that some call center employees will lose their jobs? Then you’re not feeling the AGI.
  • Are you imagining billions and then trillions of autonomous brilliant entrepreneurial agents, plastering the Earth with robot factories and chip factories and solar cells? Then you’re feeling the AGI.
  • Are you imagining a future world, where the idea of a human starting a company or
... (read more)

FWIW I’m also bearish on LLMs but for reasons that are maybe subtly different from OP. I tend to frame the issue in terms of “inability to deal with a lot of interconnected layered complexity in the context window”, which comes up when there’s a lot of idiosyncratic interconnected ideas in one’s situation or knowledge that does not exist on the internet.

This issue incidentally comes up in “long-horizon agency”, because if e.g. you want to build some new system or company or whatever, you usually wind up with a ton of interconnected idiosyncratic “cached” i... (read more)

5p.b.
I think that is exactly right.  I also wouldn't be too surprised if in some domains RL leads to useful agents if all the individual actions are known to and doable by the model and RL teaches it how to sensibly string these actions together. This doesn't seem too different from mathematical derivations. 
4Daniel Kokotajlo
I think that getting good at the tag-teamable tasks is already enough to start to significantly accelerate AI R&D? Idk. I don't really buy your distinction/abstraction yet enough to make it an important part of my model. 
3Thane Ruthenis
I think that's also equivalent to my "remaining on-target across long inferential distances" / "maintaining a clear picture of the task even after its representation becomes very complex in terms of the templates you had memorized at the start".

That's a fair point, but how many real-life long-horizon-agency problems are of the "clean" type you're describing? An additional caveat here is that, even if the task is fundamentally "clean"/tag-team-able, you don't necessarily know that when working on it. Progressing along it would require knowing what information to discard and what to keep around at each step, and that's itself nontrivial and might require knowing how to deal with layered complexity.

(Somewhat relatedly, see those thoughts regarding emergent complexity. Even if a given long-horizon-agency task is a clean thin line when considered from a fully informed omniscient perspective – a perspective whose ontology is picked to make the task's description short – that doesn't mean the bounded system executing the task can maintain a clean representation of it every step of the way.)

Good questions, thanks!

In 2.4.2, you say that things can only get stored in episodic memory if they were in conscious awareness. People can sometimes remember events from their dreams. Does that mean that people have conscious awareness during (at least some of) their dreams?

My answer is basically “yes”, although different people might have different definitions of the concept “conscious awareness”. In other words, in terms of map-territory correspondence, I claim there’s a phenomenon P in the territory (some cortex neurons / concepts / representations are... (read more)

should I think of it as: two parameters having similarly high Bayesian posterior probability, but the brain not explicitly representing this posterior, instead using something like local hill climbing to find a local MAP solution—bistable perception corresponding to the two different solutions this process converges to?

Yup, sounds right.

to what extent should I interpret the brain as finding a single solution (MLE/MAP) versus representing a superposition or distribution over multiple solutions (fully Bayesian)?

I think it can represent multiple possibilities... (read more)

4Dalcy
Ah that makes sense. So the picture I should have is: whatever local algorithm oscillates between multiple local MAP solutions over time that correspond to qualitatively different high-level information (e.g., clockwise vs counterclockwise). Concretely, something like the metastable states of a Hopfield network, or the update steps of predictive coding (literally gradient update to find MAP solution for perception!!) oscillating between multiple local minima?
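A toy illustration of that picture (my own sketch, not tied to any particular neural model): a bimodal log-posterior over a one-dimensional latent, with plain gradient ascent settling into one mode or the other depending on where it starts, like the two stable percepts.

```python
import numpy as np

# Toy picture of bistable perception as local MAP hill climbing: a bimodal
# (log-)posterior over a latent interpretation, and gradient ascent landing in
# one mode or the other depending on the starting point.
def log_posterior(x, modes=(-2.0, 2.0), sigma=0.7):
    return np.log(sum(np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in modes))

def grad(x, eps=1e-4):
    return (log_posterior(x + eps) - log_posterior(x - eps)) / (2 * eps)

def hill_climb(x0, lr=0.1, steps=200):
    x = x0
    for _ in range(steps):
        x += lr * grad(x)
    return x

print(hill_climb(-0.3), hill_climb(+0.3))  # converges to ~-2 or ~+2 depending on the start
```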

[CLAIMED!] ODD JOB OFFER: I think I want to cross-post Intro to Brain-Like-AGI Safety as a giant 200-page PDF on arxiv (it’s about 80,000 words), mostly to make it easier to cite (as is happening sporadically, e.g. here, here). I am willing to pay fair market price for whatever reformatting work is necessary to make that happen, which I don’t really know (make me an offer). I guess I’m imagining that the easiest plan would be to copy everything into Word (or LibreOffice Writer), clean up whatever formatting weirdness comes from that, and convert to PDF. La... (read more)

3ashtree
I don't know how useful that would be on arXiv (I would suspect not very) but I will attempt to deliver a PDF by two days from now, although using Typst rather than LaTeX (there is a conversion tool, I do not know how well it works) due to familiarity. Do you have any particular formatting opinions?

Davidad responds with a brief argument for 1000 FLOP-equivalent per synapse-second (3 OOM more than my guess) on X as follows:

Ok, so assuming we agree on 1e14 synapses and 3e8 seconds, then where we disagree is on average FLOP(-equivalent) per synapse-second: you think it’s about 1, I think it’s about 1000. This is similar to the disagreement you flagged with Joe Carlsmith.

Note: at some point Joe interviewed me about this so there might be some double-counting of “independent” estimates here, but iirc he also interviewed many other neuroscientists.

My estim

... (read more)
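For readers following along, the arithmetic behind the two positions in the quoted exchange is just multiplication, using the agreed-upon 1e14 synapses and 3e8 seconds:

```python
# Back-of-envelope arithmetic for the disagreement quoted above, using the
# agreed-upon figures: 1e14 synapses and 3e8 seconds.
synapses = 1e14
seconds = 3e8                              # about 9.5 years
for flop_per_synapse_second in (1, 1000):  # the two estimates being compared
    print(f"{flop_per_synapse_second:>4} FLOP/syn-s -> "
          f"{synapses * flop_per_synapse_second:.0e} FLOP/s, "
          f"{synapses * seconds * flop_per_synapse_second:.0e} FLOP total")
# 1e+14 FLOP/s and 3e+22 FLOP total, versus 1e+17 FLOP/s and 3e+25 FLOP total.
```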

I think OP is using “sequential” in an expansive sense that also includes e.g. “First I learned addition, then I learned multiplication (which relies on already understanding addition), then I learned the distributive law (which relies on already understanding both addition and multiplication), then I learned the concept of modular arithmetic (which relies on …) etc. etc.” (part of what OP calls “C”). I personally wouldn’t use the word ‘sequential’ for that—I prefer a more vertical metaphor like ‘things building upon other things’—but that’s a matter of ta... (read more)

4Rafael Harth
As a clarification for anyone wondering why I didn't use a framing more like this in the post, it's because I think these types of reasoning (horizontal and vertical/A and C) are related in an important way, even though I agree that C might be qualitatively harder than A (hence section §3.1). Or to put it differently, if one extreme position is "we can look entirely at A to extrapolate LLM performance into the future" and the other is "A and C are so different that progress on A is basically uninteresting", then my view is somewhere near the middle.

If you offer a salary below 100 watts equivalent, humans won’t accept, because accepting it would mean dying of starvation. (Unless the humans have another source of wealth, in which case this whole discussion is moot.) This is not literally a minimum wage, in the conventional sense of a legally-mandated wage floor; but it has the same effect as a minimum wage, and thus we can expect it to have the same consequences as a minimum wage.

This is obviously (from my perspective) the point that Grant Slatton was trying to make. I don’t know whether Ben Golub misu... (read more)

2Noosphere89
Fair enough, I'm just trying to bring up the response here.
0Noosphere89
Admittedly, it's not actually minimum wage, but a cost instead: https://x.com/ben_golub/status/1888655365329576343

I like reading the Sentinel email newsletter once a week for time-sensitive general world news, and https://en.wikipedia.org/wiki/2024 (or https://en.wikipedia.org/wiki/2025 etc.) once every 3-4 months for non-time-sensitive general world news. That adds up to very little time—maybe ≈1 minute per day on average—and I think there are more than enough diffuse benefits to justify that tiny amount of time.

3Pat Myron
More annual links: https://hntoplinks.com/year

Magnum forums like this one:
https://forum.effectivealtruism.org/allPosts?timeframe=yearly
https://lesswrong.com/allPosts?timeframe=yearly
https://progressforum.org/allPosts?timeframe=yearly
https://forum.fastcommunity.org/allPosts?timeframe=yearly

and subreddits: https://reddit.com/r/*/top/?t=year
4lsusr
That sounds reasonable, especially considering the limited downside.

I feel like I’ve really struggled to identify any controllable patterns in when I’m “good at thinky stuff”. Gross patterns are obvious—I’m reliably great in the morning, then my brain kinda peters out in the early afternoon, then pretty good again at night—but I can’t figure out how to intervene on that, except scheduling around it.

I’m extremely sensitive to caffeine, and have a complicated routine (1 coffee every morning, plus in the afternoon I ramp up from zero each weekend to a full-size afternoon tea each Friday), but I’m pretty uncertain whether I’m ... (read more)

Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.

For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.

But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an im... (read more)

1Roland Pihlakas
I agree, sounds plausible that this could happen. Likewise, we humans may build a strongly optimising agent because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.

That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything, therefore there is no competition between different required properties. Required properties are conjunctive; they are all needed. My intuition is that homeostasis is one such property. If we neglect homeostasis then we are likely in trouble regardless of advances in other properties.

If we leave aside the question of sloppiness in creating sub-agents, I disagree with the zero-cost assumption in the problem you described. I also disagree that it would be an expected and acceptable situation to have powerful agents having a singular objective. As the title of this blog post hints - we need a plurality of objectives. Having a sub-agent does not change this. Whatever the sub-agent does will be the responsibility or liability of the main agent, who will be held accountable. Legally, one should not produce random sub-agents running amok.

In addition to homeostasis, a properly constructed sub-agent should understand the principle of diminishing returns in instrumental objectives. This topic I do mention towards the end of this blog post. We can consider wall-building as an instrumental objective. But instrumental objectives are not singular and in isolation either; there is also a plurality of these. Thus, spending excessive resources on a single instrumental objective is not economically cost-efficient. Therefore, it makes sense to stop the wall building and switch over to some other objective at some point. Or at least to continue improving the walls only when other objectives have been sufficiently attended to as well - t

I’m still curious about how you’d answer my question above. Right now, we don't have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.

If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we mi... (read more)

Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas.

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to : “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

…And then if you follow through the “logic” of this OP, then the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smar... (read more)

2Towards_Keeperhood
Sry my comment was sloppy. (I agree the way I used sloppy in my comment mostly meant "weak". But some other thoughts:) So I think there are some dimensions of intelligence which are more important for solving alignment than for creating ASI. If you read planecrash, WIS and rationality training seem to me more important in that way than INT. I don't really have much hope for DL-like systems solving alignment but a similar case might be if an early transformative AI recognizes and says "no I cannot solve the alignment problem. the way my intelligence is shaped is not well suited to avoiding value drift. we should stop scaling and take more time where I work with very smart people like Eliezer etc for some years to solve alignment". And depending on the intelligence profile of the AI it might be more or less likely that this will happen (currently seems quite unlikely). But overall those "better" intelligence dimensions still seem to me too central for AI capabilities, so I wouldn't publish stuff. (Btw the way I read John's post was more like "fake alignment proposals are a main failure mode" rather than also "... and therefore we should work on making AIs more rational/sane whatever". So given that I maybe would defend John's framing, but not sure.)
2abramdemski
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.

I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop.

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)

5abramdemski
So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.

My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.

This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.
2Mateusz Bagiński
So far in this thread I was mostly talking from the perspective of my model(/steelman?) of Abram's argument. I mostly agree with this.

Still, this doesn't[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whatever is necessary to meaningfully speed up AIS research) (and maybe a few more, as I don't expect the space of possible domains to be that compartmentalizable), but is not superintelligent across all complex domains that would make it dangerous. It doesn't even have to be a superintelligent reasoner about minds. Babbling up clever and novel mathematical concepts for a human researcher to prune could be sufficient to meaningfully boost AI safety (I don't think we're primarily bottlenecked on mathy stuff but it might help some people and I think that's one thing that Abram would like to see).

[1] Doesn't rule out in itself, but perhaps you have some other assumptions that imply it's 1:1, as you say.

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.

And a second possibility is, there are w... (read more)

4Mateusz Bagiński
TBC, I was thinking about something like: "speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day".

I don’t think your model hangs together, basically because I think “AI that produces slop” is almost synonymous with “AI that doesn’t work very well”, whereas you’re kinda treating AI power and slop as orthogonal axes.

For example, from comments:

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if ther... (read more)

1Towards_Keeperhood
Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas. If you really think you need to be similarly unsloppy to build ASI than to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start). (Tbc, I directionally agree with you that anti-slop is very useful AI capabilities and that I wouldn't publish stuff like Abram's "belief propagation" example.)
3abramdemski
Maybe "some relatively short time later" was confusing. I mean long enough for the development cycle to churn a couple more times. IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer. I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophantism, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons). But, as clarified above, the idea isn't that sloppy AI is hiding a super-powerful AI inside. It's more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect "reasoning models" to reverse that trend. My experience so far suggests otherwise. What I'm saying is that "aligned" isn't the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the "coherence" idea I'm trying to gesture at. I've watched (over Discord) a developer get excited about a supposed full-stack AI development tool which develops a whole application for you based on a prompt, try a few simple examples and exclaim that it is like magic, then over the course of a few more hours issue progressive updates of "I'm a little less excited now" until they've updated to a very low level of excitement and have decided that it seems like magic mainly because it has been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces. I'm basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools continues to move forward, they'll start to succeed in impressing on the days & weeks timespan. A big assumption of my model is that to do that, they don't need to fundamentally sol
4Mateusz Bagiński
I think Abram is saying the following:

* Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
* At some point, they are gonna get those capabilities.
* However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
* In which case, we're not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
* Therefore, it makes sense to speed up the development of AIS-helpful capabilities now. Even if it means that the AIs will acquire dangerous capabilities sooner, it gives us more time to use AI Safety-helpful capabilities to prepare for dangerous capabilities.

Yeah, I’ve written about that in §2.7.3 here.

I kinda want to say that there are many possible future outcomes that we should feel happy about. It’s true that many of those possible outcomes would judge others of those possible outcomes to be a huge missed opportunity, and that we’ll be picking from this set somewhat arbitrarily (if all goes well), but oh well, there’s just some irreducible arbitrariness in the nature of goodness itself.

2Noosphere89
I would go farther, in that we will in practice be picking from this set of outcomes with a lot of arbitrariness, and that this is not removable.

For things like solving coordination problems, or societal resilience against violent takeover, I think it can be important that most people, or even virtually all people, are making good foresighted decisions. For example, if we’re worried about a race-to-the-bottom on AI oversight, and half of relevant decisionmakers allow their AI assistants to negotiate a treaty to stop that race on their behalf, but the other half think that’s stupid and don’t participate, then that’s not good enough, there will still be a race-to-the-bottom on AI oversight. Or if 50%... (read more)

I don’t think the average person would be asking AI what are the best solutions for preventing existential risks. As evidence, just look around:

There are already people with lots of money and smart human research assistants. How many of those people are asking those smart human research assistants for solutions to prevent existential risks? Approximately zero.

Here’s another: The USA NSF and NIH are funding many of the best scientists in the world. Are they asking those scientists for solutions to prevent existential risk? Nope.

Demis Hassabis is the boss of... (read more)

1Satron
I am not from the US, so I don't know anything about the organizations that you have listed. However, we can look at three main conventional sources of existential risk (excluding AI safety; for now, we will come back to it later):

* Nuclear Warfare - Cooperation strategies + decision theory are active academic fields.
* Climate Change - This is a very hot topic right now, and a lot of research is being put into it.
* Pandemics - There was quite a bit of research before COVID, but even more now.

As to your point about Hassabis not doing other projects for reducing existential risk besides AI alignment: Most of the people on this website (at least, that's my impression) have rather short timelines. This means that work on a lot of existential risks becomes much less critical. For example, while supervolcanoes are scary, they are unlikely to wipe us out before we get an AI powerful enough to be able to solve this problem more efficiently than we can.

As to your point about Sam Altman choosing to cure cancer over reducing x-risk: I don't think Sam looks forward to the prospect of dying due to some x-risk, does he? After all, he is just as human as we are, and he, too, would hate the metaphorical eruption of a supervolcano. But that's going to become much more critical after we have an aligned and powerful AI on our hands. This also applies to LeCun/Zuckerberg/random OpenAI researchers. They (or at least most of them) seem to be enjoying their lives and wouldn't want to randomly lose them.

As to your last paragraph: I think your formulation of the AI response ("preventing existential risks is very hard and fraught, but hey, what if I do a global mass persuasion campaign") really undersells it. It almost makes it sound like a mass persuasion campaign is a placebo to make the AI's owner feel good. Consider another response "the best option for p

I’m not sure if this is what you’re looking for, but here’s a fun little thing that came up recently when I was writing this post:

Summary: “Thinking really hard for five seconds” probably involves less primary metabolic energy expenditure than scratching your nose. (Some people might find this obvious, but other people are under a mistaken impression that getting mentally tired and getting physically tired are both part of the same energy-preservation drive. My belief, see here, is that the latter comes from an “innate drive to minimize voluntary motor con... (read more)
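A back-of-envelope version of that claim, where every number is a rough assumption I am supplying for illustration (resting brain power around 20 W, effortful thought adding at most a few percent, arm mass and muscle efficiency guessed), not a figure from the linked post:

```python
# Back-of-envelope for the claim above. Every number here is a rough assumption
# supplied for illustration, not a figure from the linked post.
brain_power_W = 20.0            # typical resting whole-brain metabolic rate
hard_thinking_extra = 0.05      # assume effortful thought adds at most ~5% (likely less)
thinking_J = brain_power_W * hard_thinking_extra * 5.0   # five seconds of hard thinking, ~5 J at most

arm_mass_kg = 2.0               # forearm + hand, roughly
lift_height_m = 0.4             # hand from lap to nose, roughly
muscle_efficiency = 0.25        # metabolic energy -> mechanical work
scratch_J = arm_mass_kg * 9.8 * lift_height_m / muscle_efficiency   # ~30 J

print(f"hard thinking (5 s): <= ~{thinking_J:.0f} J; scratching nose: ~{scratch_J:.0f} J")
```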

2ozziegooen
By the way - I imagine you could do a better job with the evaluation prompts by having another LLM pass, where it formalizes the above more and adds more context. For example, with an o1/R1 pass/Squiggle AI pass, you could probably make something that considers a few more factors with this and brings in more stats. 
2ozziegooen
That counts! Thanks for posting. I look forward to seeing what it will get scored as. 

Yeah it’s super-misleading that the post says:

Look at other unsolved problems:
- Goldbach: Can every even number of 1s be split into two prime clusters of 1s?
- Twin Primes: Are there infinite pairs of prime clusters of 1s separated by two 1s?
- Riemann: How are the prime clusters of 1s distributed?

For centuries, they resist. Why?

I think it would be much clearer to everyone if the OP said

Look at other unsolved problems:
- Goldbach: Can every even number of 1s be split into two prime clusters of 1s?
- Twin Primes: Are there infinite pairs of prime clusters of 1s

... (read more)