Often you can compare your own Fermi estimates with those of other people, and that’s sort of cool, but what’s way more interesting is when they share what variables and models they used to get to the estimate. This lets you actually update your model in a deeper way.

tlevin
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of "Overton Window-moving" strategies executed in practice have larger negative effects via associating their "side" with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers, who strongly lean on signals of credibility and consensus when quickly evaluating policy options, than the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.

In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea "outside the window" and this actually makes the window narrower. But I think the visual imagery of "windows" actually struggles to accommodate this -- when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences.

Would be interested in empirical evidence on this question (ideally actual studies from the psych, political science, sociology, econ, etc. literatures, rather than specific case studies, due to reference class tennis type issues).
TurnTrout
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

> A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction).

On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It's just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.

* This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior.
* It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone as having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard."

So I think this captures something important. However, it leaves a few things to be desired:

* What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs.
* Demanding a compositional representation is unrealistic since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal, so a transformer could only have k ≤ d_model shards, which seems obviously false.

That said, I still find this definition useful. I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.

[1] Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
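
To make "independently activatable" concrete in steering-vector terms, here is a minimal sketch (my illustration, not code from the maze-solving work; the dimension, the stand-in direction vectors, and the coefficient values are made up): each shard corresponds to an activation direction with its own coefficient, so one motivational circuit can be dialed up without touching the other.

```python
import torch

# Hypothetical stand-ins for the two motivational directions in the
# maze-solving policy; the real ones were found from activation differences.
d_model = 512
cheese_dir = torch.randn(d_model)
cheese_dir = cheese_dir / cheese_dir.norm()
top_right_dir = torch.randn(d_model)
top_right_dir = top_right_dir / top_right_dir.norm()

def steer(resid: torch.Tensor, cheese_coef: float = 0.0, top_right_coef: float = 0.0) -> torch.Tensor:
    """Add shard-steering vectors to a residual-stream activation.

    Each direction enters additively with its own coefficient, so either
    "motivational circuit" can be activated alone or together with the other.
    """
    return resid + cheese_coef * cheese_dir + top_right_coef * top_right_dir

resid = torch.randn(d_model)                                      # some intermediate activation
only_cheese = steer(resid, cheese_coef=3.0)                       # one shard active
both_shards = steer(resid, cheese_coef=3.0, top_right_coef=3.0)   # both shards active
```
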
Looking for blog platform/framework recommendations

I had a WordPress blog, but I don't like WordPress and I want to move away from it.

Substack doesn't seem like a good option because I want high customizability and multilingual support (my blog is going to be in English and Hebrew).

I would like something that I can use for free with my own domain (so not Wix).

The closest thing I found to what I'm looking for was MkDocs Material, but it's still geared too much towards documentation, and I don't like its blog functionality enough.

Other requirements: dark/light mode, RSS, newsletter support.

Does anyone have another suggestion? It's fine if it requires a bit of technical skill (though better if it doesn't).
Richard_Ngo
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).

The problem here is that these counterfactuals aren't very clearly defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal".

Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl's do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
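
One way to write the counterfactual definition above as a formula (my notation, not the post's; V_h is an assumed measure of how well the human attains a goal given the AI's policy):

```latex
% Sketch of counterfactual human power with respect to a goal distribution G:
% average goal-attainment under an intervention that swaps in a goal sampled from G.
\mathrm{Power}_G(h \mid \pi_{\mathrm{AI}})
  \;=\; \mathbb{E}_{g \sim G}\!\left[
      V_h\!\left(\operatorname{do}(\mathrm{goal}_h = g),\; \pi_{\mathrm{AI}}\right)
  \right]
```

Under this measure, an AI that never lets the human spend resources keeps V_h low for most goals in G, so it scores poorly even though it might maximize Turner-style POWER.
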
Pithy sayings are lossily compressed.

Popular Comments

Recent Discussion

5th generation military aircraft are extremely optimised to reduce their radar cross section. It is this ability above all others that makes the F-35 and the F-22 so capable: modern anti-aircraft weapons are very good, so the only safe way to fly over a well-defended area is not to be seen.

But wouldn't it be fairly trivial to detect a stealth aircraft optically?

This is what an F-35 looks like from underneath at about 10 by 10 pixels:

You and I can easily tell what that is (take a step back, or squint). So can GPT-4:

The image shows a silhouette of a fighter jet in the sky, likely flying at high speed. The clear blue sky provides a sharp contrast, making the aircraft's dark outline prominent. The

...
faul_sname
How do we know that optical detection isn't done?

Let's rephrase: if this were a major issue for the F-35, the USA wouldn't have invested trillions of dollars in stealth without addressing optical camouflage. All F-35s would have camouflage paint. There'd be a lot of research into how to reduce the visibility of aircraft, just like there is for reducing RCS. Given that they don't do this, clearly they don't think optical detection is a major concern.

Exa Watson
Spot on
Yair Halberstadt
Aircraft already often fly low, which also works well against radar but makes them vulnerable to cheaper and more numerous MANPADS. Flying high shouldn't work particularly well given the setup I've described here, since we have a range of about 100 km, an order of magnitude higher than the F-35 can fly.

Previously: General Thoughts on Secular Solstice.

This blog post is my scattered notes and ramblings about the individual components (talks and songs) of Secular Solstice in Berkeley. Talks have their title in bold, and I split the post into two columns, with the notes I took about the content of the talk on the left and my comments on the talk on the right. Songs have normal formatting.

Bonfire

The Circle

This feels like a sort of whig history: a history that neglects most of the complexities and culture-dependence of the past in order to advance a teleological narrative. I do not think that whig histories are inherently wrong (although the term has negative connotations). Whig histories should be held to a very strict standard because they make claims about how...

oh yeah my dispute isn't "the character in the song isn't talking about building AI" but "the song is not a call to accelerate building AI"

This is a linkpost for https://arxiv.org/abs/2404.19756

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
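
For intuition, here is a minimal sketch of the core idea: a layer whose edges carry learnable univariate functions that are summed at each node. It uses a simple Gaussian radial-basis parameterization for the edge functions rather than the paper's B-splines, so treat it as an illustration of the concept, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """One KAN-style layer: y_j = sum_i phi_ij(x_i).

    Each edge (i, j) carries its own learnable 1-D function phi_ij, here
    parameterized as a linear combination of fixed Gaussian bumps (the paper
    uses B-splines plus a base activation; this is a simplification).
    """

    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8, x_range: float = 2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-x_range, x_range, num_basis))
        self.width = 2 * x_range / num_basis
        # One learnable coefficient per (input, output, basis function).
        self.coef = nn.Parameter(0.1 * torch.randn(in_dim, out_dim, num_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_dim)
        # Evaluate every basis bump at every scalar input: (batch, in_dim, num_basis).
        basis = torch.exp(-(((x.unsqueeze(-1) - self.centers) / self.width) ** 2))
        # phi_ij(x_i), summed over inputs i for each output j: (batch, out_dim).
        return torch.einsum("bik,iok->bo", basis, self.coef)

# A tiny two-layer KAN; in an MLP the nonlinearity would sit on the nodes
# and each edge would be a single scalar weight instead of a function.
model = nn.Sequential(KANLayer(2, 5), KANLayer(5, 1))
print(model(torch.randn(4, 2)).shape)  # torch.Size([4, 1])
```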

Exa Watson
I know this sounds fantastic, but can someone please dumb down what KANs are for me, and why they're so revolutionary (in practice, not in theory) that all the big labs would wanna switch to them? Or is it the case that having MLPs is still a better thing for GPUs and in practice that will not change? And how are KANs different from what SAEs attempt to do?
Pi Rogers
Is this a massive exfohazard? Should this have been published?

> Is this a massive exfohazard?

Very unlikely.

> Should this have been published?

Yes.

Mateusz Bagiński
To the extent that Tegmark is concerned about exfohazards (he doesn't seem to be very concerned AFAICT (?)), he would probably say that more powerful and yet more interpretable architectures are net positive.

When I introduce people to plans like QACI, they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."

I think these kinds of objections are missing the point of formal goal alignment and maybe even outer alignment in general.

To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.

AI will figure out our values. Whether it's aligned or not, a recursively self-improving AI will eventually get a very good model of our values, as part...

Wei Dai
But we could have said the same thing of SBF, before the disaster happened.

Please explain your thinking behind this?

It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear how to handle uncertainty between them.
Pi Rogers
I would honestly be pretty comfortable with maximizing SBF's CEV.

TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don't incentivize irrationality (like ours did).

Sorry if I was unclear there. I'm pretty confident that my values satisfy the VNM axioms, so those moral theories are almost definitely wrong. And I think this uncertainty problem can be solved by forcing utility bounds.
Wei Dai
Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.)

My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human's reflection process which may be mistaken or selfish or skewed.

Where is the longer version of this? I do want to read it. :)

Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities? Also, how does RL fit into QACI? Can you point me to where this is discussed?

> Yikes, I'm not even comfortable maximizing my own CEV.

What do you think of this post by Tammy?

> Where is the longer version of this? I do want to read it. :)

Well, perhaps I should write it :)

> Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies, and that's how we have things like overconfidence bias ...


A classic problem with Christianity is the so-called ‘problem of evil’—that friction between the hypothesis that the world’s creator is arbitrarily good and powerful, and a large fraction of actual observations of the world.

Coming up with solutions to the problem of evil is a compelling endeavor if you are really rooting for a particular bottom line re Christianity, or I guess if you enjoy making up faux-valid arguments for wrong conclusions. At any rate, I think about this more than you might guess.

And I think I’ve solved it!

Or at least, I thought of a new solution which seems better than the others I’ve heard. (Though I mostly haven’t heard them since high school.)

The world (much like anything) has different levels of organization. People are made of...

I'm so happy someone came up with this!

Mateusz Bagiński
I'm pretty sure I heard Alan Watts say something like that, at least in one direction (lower levels of organization -> higher levels). "The conflict/disorder at the lower level of the Cosmos is required for cooperation/harmony on the higher level."
aysja
"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way." But couldn't he have made a different sort of thing than humans, which were less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic configurations of minds"—but I would still guess that not all of those are as prone to "evil" as us. I.e., if the laws of physics were held constant, I would think you could get less evil things than us out of it, and probably worlds which were overall more favorable to life (fewer natural disasters, etc.). But perhaps this is even more evidence that God only cares about the laws of physics? Since we seem much more like an afterthought than a priority?   
Mateusz Bagiński
Or maybe the Ultimate Good in the eyes of God is the epic sequence of: dead matter -> RNA world -> protocells -> ... -> hairless apes throwing rocks at each other and chasing gazelles -> weirdoes trying to accomplish the impossible task of raising the sanity waterline and carrying the world through the Big Filter of AI Doom -> deep utopia/galaxy lit with consciousness/The Goddess of Everything Else finale.

This is a companion piece to “Why I am no longer thinking about/working on AI safety.” I gave Claude a nearly-complete draft of the post, and asked it to engage with it with its intended audience in mind. I was pleasantly surprised at the quality of its responses. After a back-and-forth about the arguments laid forth in the post, I thought it might be interesting to ask Claude how it thought certain members of this community would respond to the post. I figured it might be interesting to post the dialogue here in case there’s any interest, and if Eliezer, Rohin, or Paul feel that the model has significantly misrepresented their would-be views on my post in its estimation, I would certainly be interested in learning their...

Asking ChatGPT to criticize an article also often produces good suggestions.

Pi Rogers
What about the following: My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian. I'm not sure how correct this is, but it's possible.

It certainly is possible! In more decision-theoretic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct!

But as per my post, I'd describe this position as "not intrinsically altruistic" — you're optimizing for everyone's happiness because "it sure would suck if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.
