Often you can compare your own Fermi estimates with those of other people, and that’s sort of cool, but what’s way more interesting is when they share what variables and models they used to get to the estimate. This lets you actually update your model in a deeper way.
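For instance, here's a toy version of the classic piano-tuners estimate with the variables laid out explicitly (every number is a rough assumption of mine, chosen for illustration), so two people can compare not just their answers but which inputs they disagree about:

```python
# Toy Fermi estimate: piano tuners in a large metro area.
# Every number below is a rough assumption for illustration, not a researched value.
households = 3_000_000        # households in the metro area
piano_rate = 1 / 20           # fraction of households with a piano
tunings_per_year = 1          # tunings per piano per year
tunings_per_day = 4           # tunings one tuner can do per working day
workdays_per_year = 250       # working days per tuner per year

tunings_demanded = households * piano_rate * tunings_per_year
tunings_supplied_per_tuner = tunings_per_day * workdays_per_year
tuners = tunings_demanded / tunings_supplied_per_tuner
print(f"~{tuners:.0f} piano tuners")  # prints ~150; disagree with an input, not just the output
```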

tlevin
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy, than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of "Overton Window-moving" strategies executed in practice have larger negative effects, via associating their "side" with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers (who strongly lean on signals of credibility and consensus when quickly evaluating policy options), than positive effects, via increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.

In theory, the Overton Window model is just a description of which ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea "outside the window" and this actually makes the window narrower. But I think the visual imagery of "windows" struggles to accommodate this -- when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences.

I would be interested in empirical evidence on this question (ideally actual studies from the psych, political science, sociology, econ, etc. literatures, rather than specific case studies, due to reference-class-tennis-type issues).
TurnTrout
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard-theoretic policy:

> A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction). On the other hand, AIXI is not a shard-theoretic agent because it does not have two motivational circuits which can be activated independently of each other; it's just maximizing one utility function. A mesa-optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.

* This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior.
* It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone as having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard." So I think this captures something important.

However, it leaves a few things to be desired:

* What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs.
* Demanding a compositional representation is unrealistic, since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal, so a transformer could only have k ≤ d_model shards, which seems obviously wrong.

That said, I still find this definition useful. I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.

[1] Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
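To make the "can independently activate" condition concrete, here is a minimal numerical sketch (my illustration, not TurnTrout's code; the random vectors are stand-ins for learned motivational-circuit directions such as the cheese vector):

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

# Stand-ins for two learned "motivational circuit" directions (e.g. cheese / top-right).
cheese_dir = rng.normal(size=d_model)
top_right_dir = rng.normal(size=d_model)

def steer(resid, directions, coeffs):
    """Compositionally activate shards by adding scaled directions to a residual-stream activation."""
    out = resid.copy()
    for direction, coeff in zip(directions, coeffs):
        out += coeff * direction / np.linalg.norm(direction)
    return out

resid = rng.normal(size=d_model)                                      # a residual-stream activation
both_shards = steer(resid, [cheese_dir, top_right_dir], [1.0, 1.0])   # both circuits active
cheese_only = steer(resid, [cheese_dir, top_right_dir], [1.0, 0.0])   # only one circuit active

# A strictly compositional representation needs (near-)orthogonal directions,
# which is where the k <= d_model objection above comes from.
cos = cheese_dir @ top_right_dir / (np.linalg.norm(cheese_dir) * np.linalg.norm(top_right_dir))
print(f"cosine similarity between shard directions: {cos:.3f}")
```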
Looking for blog platform/framework recommendations

I had a WordPress blog, but I don't like WordPress and I want to move away from it. Substack doesn't seem like a good option because I want high customizability and multilingual support (my blog is going to be in English and Hebrew). I would like something that I can use for free with my own domain (so not Wix). The closest thing I found to what I'm looking for was MkDocs Material, but it's still geared too much towards documentation, and I don't like its blog functionality enough. Other requirements: dark/light mode, RSS, newsletter support. Does anyone have another suggestion? It's fine if it requires a bit of technical skill (though better if it doesn't).
Richard_Ngo
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of the human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).

The problem here is that these counterfactuals aren't very clearly defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal".

Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl's do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
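One rough way to write the counterfactual-goal version down (my notation, just a sketch of the idea above, not a worked-out proposal):

\[
\mathrm{Power}_G(h) \;=\; \mathbb{E}_{g \sim G}\!\left[\; \max_{\pi_h} \; \mathbb{E}\big[\, g(\tau) \;\big|\; \mathrm{do}(\mathrm{goal}_h = g),\, \pi_h \,\big] \right]
\]

where \(g(\tau)\) scores the resulting trajectory under the sampled goal and \(\pi_h\) ranges over policies available to the human. The corrigibility hope would then be that an AI which softly optimizes this quantity, rather than raw Turner-style POWER, has to let the human actually pursue whatever goal they end up with.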
Pithy sayings are lossily compressed.

Recent Discussion

5th-generation military aircraft are extremely optimised to reduce their radar cross-section. It is this ability above all others that makes the F-35 and the F-22 so capable: modern anti-aircraft weapons are very good, so the only safe way to fly over a well-defended area is not to be seen.

But wouldn't it be fairly trivial to detect a stealth aircraft optically?

This is what an F-35 looks like from underneath at about 10 by 10 pixels:

You and I can easily tell what that is (take a step back, or squint). So can GPT-4:

The image shows a silhouette of a fighter jet in the sky, likely flying at high speed. The clear blue sky provides a sharp contrast, making the aircraft's dark outline prominent. The

...
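If you want to reproduce the low-resolution test above, here's a minimal sketch (the filename is hypothetical; any photo of an aircraft against the sky will do):

```python
from PIL import Image

# Hypothetical input file: a photo of an aircraft against the sky.
img = Image.open("f35_underside.jpg").convert("L")

# Downsample to roughly the resolution discussed above, then blow it back up for viewing.
tiny = img.resize((10, 10), Image.LANCZOS)
preview = tiny.resize((200, 200), Image.NEAREST)
preview.save("f35_10x10_preview.png")
```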
Answer by avturchin, May 02, 2024

That is why they prefer to fly strike missions during moonless nights. They can also fly very low or very high, which makes optical observation difficult.

Alexander Gietelink Oldenziel
Military nerds correct me if I'm wrong, but I think the answer might be the following (I'm not a pilot, etc.).

"Stealth" can be a bit of a misleading term. F-35s aren't actually stealth aircraft; they are low-observable aircraft. You can detect an F-35 with longwave radar just fine. The problem isn't knowing that there is an F-35, it's getting a weapons-grade lock on it. That is much harder, and your grainy GPT-interpreted photo isn't close to enough for a missile, I think (you mentioned this already as a possibility). Preventing a weapons-grade radar or IR lock is what the F-35's stealth architecture is about. A fighter-carried air-to-air missile might have a range upwards of 100 km or more, but a weapons-grade lock can typically only be obtained at around 30 km against an F-35.

Still, I'd say your proposal is worth pondering. The Ukrainians have famously pioneered something similar for audio, which is used to detect missiles and drones entering Ukrainian airspace.


This is a linkpost for https://arxiv.org/abs/2404.19756

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
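For context (my addition, not part of the abstract): the Kolmogorov-Arnold representation theorem states that any continuous \(f : [0,1]^n \to \mathbb{R}\) can be written as a finite sum-and-composition of univariate functions,

\[
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right),
\]

and KANs take this as inspiration, replacing each scalar weight of an MLP with a learnable univariate spline and stacking such layers to arbitrary depth.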

Pi Rogers
Is this a massive exfohazard? Should this have been published?

To the extent that Tegmark is concerned about exfohazards (he doesn't seem to be very concerned AFAICT (?)), he would probably say that more powerful and yet more interpretable architectures are net positive.

gwern
Pretraining, specifically: https://gwern.net/doc/reinforcement-learning/meta-learning/continual-learning/index#scialom-et-al-2022-section The intuition is that after pretraining, models can map new data into very efficient low-dimensional latents and have tons of free space / unused parameters. So you can easily prune them, but also easily specialize them with LoRA (because the sparsity is automatic, just learned) or just regular online SGD. But yeah, it's not a real problem anymore, and the continual learning research community is still in denial about this and confining itself to artificially tiny networks to keep the game going.
Nathan Helm-Burger
I'm not so sure. You might be right, but I suspect that catastrophic forgetting may still be playing an important role in limiting the peak capabilities of an LLM of a given size. Would it be possible to continue Llama3 8B's training much, much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint? I think probably not? And I suspect that, if not, part (but not all) of the reason would be catastrophic forgetting. Another part would be the limited expressivity of smaller models, another thing which the KANs seem to help with.

When I introduce people to plans like QACI, they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."

I think these kinds of objections are missing the point of formal goal alignment and maybe even outer alignment in general.

To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.

AI will figure out our values. Whether it's aligned or not, a recursively self-improving AI will eventually get a very good model of our values, as part...

Wei Dai
But we could have said the same thing of SBF, before the disaster happened. Please explain your thinking behind this? It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear how to handle uncertainty between them.
Pi Rogers
I would honestly be pretty comfortable with maximizing SBF's CEV. TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don't incentivize irrationality (like ours did). Sorry if I was unclear there. I'm pretty confident that my values satisfy the VNM axioms, so those moral theories are almost definitely wrong. And I think this uncertainty problem can be solved by forcing utility bounds.
Wei Dai
Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.)

My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human's reflection process which may be mistaken or selfish or skewed.

Where is the longer version of this? I do want to read it. :)

Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities? Also, how does RL fit into QACI? Can you point me to where this is discussed?

> Yikes, I'm not even comfortable maximizing my own CEV.

What do you think of this post by Tammy?

> Where is the longer version of this? I do want to read it. :)

Well perhaps I should write it :)

> Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that's how we have things like overconfidence bias ...

A classic problem with Christianity is the so-called ‘problem of evil’—that friction between the hypothesis that the world’s creator is arbitrarily good and powerful, and a large fraction of actual observations of the world.

Coming up with solutions to the problem of evil is a compelling endeavor if you are really rooting for a particular bottom line re Christianity, or I guess if you enjoy making up faux-valid arguments for wrong conclusions. At any rate, I think about this more than you might guess.

And I think I’ve solved it!

Or at least, I thought of a new solution which seems better than the others I’ve heard. (Though I mostly haven’t heard them since high school.)

The world (much like anything) has different levels of organization. People are made of...

I'm so happy someone came up with this!

Mateusz Bagiński
I'm pretty sure I heard Alan Watts say something like that, at least in one direction (lower levels of organization -> higher levels). "The conflict/disorder at the lower level of the Cosmos is required for cooperation/harmony on the higher level."
aysja
"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way." But couldn't he have made a different sort of thing than humans, which were less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic configurations of minds"—but I would still guess that not all of those are as prone to "evil" as us. I.e., if the laws of physics were held constant, I would think you could get less evil things than us out of it, and probably worlds which were overall more favorable to life (fewer natural disasters, etc.). But perhaps this is even more evidence that God only cares about the laws of physics? Since we seem much more like an afterthought than a priority?   
Mateusz Bagiński
Or maybe the Ultimate Good in the eyes of God is the epic sequence of: dead matter -> RNA world -> protocells -> ... -> hairless apes throwing rocks at each other and chasing gazelles -> weirdoes trying to accomplish the impossible task of raising the sanity waterline and carrying the world through the Big Filter of AI Doom -> deep utopia/galaxy lit with consciousness/The Goddess of Everything Else finale.

This is a companion piece to “Why I am no longer thinking about/working on AI safety.” I gave Claude a nearly-complete draft of the post, and asked it to engage with it with its intended audience in mind. I was pleasantly surprised at the quality of its responses. After a back-and-forth about the arguments laid forth in the post, I thought it might be interesting to ask Claude how it thought certain members of this community would respond to the post. I figured it might be interesting to post the dialogue here in case there’s any interest, and if Eliezer, Rohin, or Paul feel that the model has significantly misrepresented their would-be views on my post in its estimation, I would certainly be interested in learning their...

Asking ChatGPT to criticize an article also often produces good suggestions.

Pi Rogers
What about the following: My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian. I'm not sure how correct this is, but it's possible.

It certainly is possible! In more decision-theoretic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct!

But as per my post, I'd describe this position as "not intrinsically altruistic" — you're optimizing for everyone's happiness because "it sure would suck if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.

For the last month, @RobertM and I have been exploring the possible use of recommender systems on LessWrong. Today we launched our first site-wide experiment in that direction. 

Behold, a tab with recommendations!

(In the course of our efforts, we also hit upon a frontpage refactor that we reckon is pretty good: tabs instead of a clutter of different sections. For now, only for logged-in users. Logged-out users see the "Latest" tab, which is the same-as-usual list of posts.)

Why algorithmic recommendations?

A core value of LessWrong is to be timeless and not news-driven. However, the central algorithm by which attention allocation happens on the site is the Hacker News algorithm[1], which basically only shows you things that were posted recently, and creates a strong incentive for discussion to always be...
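For reference, the commonly cited approximation of that ranking rule (my sketch; these are the constants usually quoted for Hacker News, not necessarily the ones LessWrong uses):

```python
def hn_rank_score(points: int, age_hours: float, gravity: float = 1.8) -> float:
    """Commonly cited approximation of the Hacker News ranking formula:
    the age term in the denominator dominates, so older posts fall off the front page fast."""
    return (points - 1) / (age_hours + 2) ** gravity
```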

I've been going through the FAR AI videos from the alignment workshop in December 2023. I'd like people to discuss their thoughts on Shane Legg's 'necessary properties' that every AGI safety plan needs to satisfy. The talk is only 5 minutes; give it a listen:

Otherwise, here are some of the details:

All AGI Safety plans must solve these problems (necessary properties to meet at the human level or beyond):

  1. Good world model
  2. Good reasoning
  3. Specification of the values and ethics to follow

All of these require good capabilities, meaning capabilities and alignment are intertwined.

Shane thinks future foundation models will solve conditions 1 and 2 at the human level. That leaves condition 3, which he sees as solvable if you want fairly normal human values and ethics.

Shane basically thinks that if the above...

Answer by Chris_Leong, May 02, 2024

The biggest problem here is that it fails to account for other actors using such systems to cause chaos, and for the possibility that the offense-defense balance likely strongly favours the attacker, particularly if you've placed limitations on your systems that make them safer. Aligned human-ish-level AIs don't provide a victory condition.

mic
I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:
Seth Herd
There's also some more in his interview with Dwarkesh Patel just before then. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more-or-less on target. So, to your questions, including where I'm guessing at Shane's thinking, and where it's mine.

This is overlapping with the standard story AFAICT, and 80% of alignment work is sort of along these lines. I think what Shane's proposing is pretty different in an important way: it includes System 2 thinking, whereas almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.

Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don't think it matters much whether RLHF was used to "align" the base model, because it's going to have implicit desires/drives from the predictive training of human text anyway. Giving instructions to follow doesn't need to have anything to do with RL; it's just based on the world model, and putting those instructions as a central and recurring prompt for that system to produce plans and actions to carry out those instructions.

So, how we get a model to robustly obey the instruction text is by implementing System 2 thinking. This is "the obvious thing" if we think about human cognition. System 2 thinking would be applying something more like a tree-of-thought algorithm, which checks through predicted consequences of the action, and then makes judgments about how well those fulfill the instruction text. This is what I've called internal review for alignment of language model cognitive architectures.

To your second and third questions: I didn't see answers from Shane in either the interview or that talk, but I think they're the obvious next questions, and they're what I've been working on since then. I think the answers are that the instructions will try to be as scope-limited as possible, that we'll want to care...
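Below is a minimal sketch of that kind of "internal review" loop (my own illustration, not Shane's or Seth's design; the propose/predict/judge callables are hypothetical stand-ins for LLM calls):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    plan: str
    predicted_outcome: str
    score: float

def internal_review(
    instruction: str,
    propose: Callable[[str], List[str]],   # stand-in for an LLM call: propose candidate plans
    predict: Callable[[str], str],         # stand-in for an LLM call: predict a plan's consequences
    judge: Callable[[str, str], float],    # stand-in for an LLM call: score outcome vs. instruction
    n_keep: int = 1,
) -> List[Candidate]:
    """System-2-style check: keep only plans whose predicted consequences best fulfill the instruction."""
    candidates = []
    for plan in propose(instruction):
        outcome = predict(plan)
        candidates.append(Candidate(plan, outcome, judge(instruction, outcome)))
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:n_keep]
```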
jacquesthibs
Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention). https://x.com/jacquesthibs/status/1785704284434129386?s=46
