Quick Takes

Buck · 10h

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

  • Early: Risk that comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren't wildly superhuman).
  • Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
    • So e.g. I exclu
... (read more)
Raemon · 9h
I didn't get this from the premises fwiw. Are you saying it's trivial because "just don't use your AI to help you design AI" (seems organizationally hard to me), or did you have particular tricks in mind?

The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe.

Not that most applications of AI for AI development can be made trivially safe.

Elizabeth · 13d

A very rough draft of a plan to test prophylactics for airborne illnesses.

Start with a potential superspreader event. My ideal is a large conference, with many attendees who travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.

Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in t... (read more)
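For a rough sense of whether ~500 sign-ups is in the right ballpark, here is a minimal power-calculation sketch (my own illustration; the infection rates and effect size are placeholder assumptions, not numbers from the plan):

```python
# Rough two-arm power calculation for the conference prophylactic trial.
# The assumed rates below are illustrative placeholders, not estimates from the plan.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_control = 0.20   # assumed infection rate in the control arm
p_treated = 0.10   # assumed rate if the prophylactic halves infections

effect = proportion_effectsize(p_control, p_treated)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(effect_size=effect,
                                         alpha=0.05, power=0.8,
                                         alternative='two-sided')
print(f"~{n_per_arm:.0f} people per arm, ~{2 * n_per_arm:.0f} total")
# With these assumptions this lands around a hundred people per arm; smaller
# effects, lower attack rates, or attrition push the requirement toward the
# ~500 sign-ups mentioned above.
```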

gwern · 13d
This sounds like a bad plan because it will be a logistics nightmare (undermining randomization) with high attrition, and extremely high variance due to between-subject design (where subjects differ a ton at baseline, in addition to exposure) on a single occasion with uncontrolled exposures and huge measurement error where only the most extreme infections get reported (sometimes). You'll probably get non-answers, if you finish at all. The most likely outcome is something goes wrong and the entire effort is wasted.

Since this is a topic which is highly repeatable within-person (and indeed, usually repeats often through a lifetime...), this would make more sense as within-individual and using higher-quality measurements. One good QS approach would be to exploit the fact that infections, even asymptomatic ones, seem to affect heart rate etc as the body is damaged and begins fighting the infection. HR/HRV is now measurable off the shelf with things like the Apple Watch, AFAIK. So you could recruit a few tech-savvy conference-goers for measurements from a device they already own & wear. This avoids any 'big bang' and lets you prototype and tweak on a few people - possibly yourself? - before rolling it out, considerably de-risking it.

There are some people who travel constantly for business and going to conferences, and recruiting and managing a few of them would probably be infinitely easier than 500+ randos (if for no reason other than being frequent flyers they may be quite eager for some prophylactics), and you would probably get far more precise data out of them if they agree to cooperate for a year or so and you get eg 10 conferences/trips out of each of them which you can contrast with their year-round baseline & exposome and measure asymptomatic infections or just overall health/stress. (Remember, variance reduction yields exponential gains in precision or sample-size reduction. It wouldn't be too hard for 5 or 10 people to beat a single 250vs250 one-off experi

All of the problems you list seem harder with repeated within-person trials. 

I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, it's pretty clear. The common units are "amount of time", so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.

For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better... (read more)

the gears to ascension · 14h
People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment where people who are most willing to publish, because they are most able to convince themselves their research is safe - eg, because they don't understand in detail how to reason about whether it is or not - are the ones who will do so. Ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don't have the evidence to back their hunches, even if their hunches are correct, which unless they have a demonstrated track record they probably aren't. Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don't personally see how it's exfohaz. And it won't be apparent until afterwards that it was capabilities, not alignment. So just don't publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node. But for god's sake stop accidentally helping people create green nodes because you can't see five inches ahead. And don't send it to a capabilities team before it's able to guarantee moral alignment hard enough to make a red-proof yellow node!

This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual.

The sort of reasoning you use here, and the fact that my only response to it basically amounts to "well, no, I think you're wrong; this proposal will slow down alignment too much", is why I think we need numbers to ground us.

Garrett Baker · 15h
Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations. Obviously such numbers aren't the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance. If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered; this isn't exactly my main wheelhouse).
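As a toy version of the kind of numerical comparison suggested in this thread (all numbers are invented placeholders; the point is the structure, not the conclusion), the vibe-free question is roughly whether a release shifts time-to-alignment more than it shifts time-to-doom:

```python
# Toy cost-benefit sketch for a publish/release decision.
# All numbers are invented placeholders for illustration only.
delta_time_to_alignment = -0.50   # years by which alignment research gets faster
delta_time_to_doom = -0.20        # years by which dangerous capabilities arrive sooner

net_margin = delta_time_to_doom - delta_time_to_alignment
print(f"Net change in safety margin: {net_margin:+.2f} years")
# Positive => the gap between alignment and doom grew; negative => it shrank.
```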

it seems to me that disentangling beliefs and values is an important part of being able to understand each other

and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard

Viliam · 5d
Let's use "disagree" vs "dislike".

when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"

dkornai · 2d

Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals. 

In biological organisms, physical pain [say, in response to a limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations in which the subsystems required for their long-term goals [e.g. locomotion to a favourable position with the limb], the same subsystems that generate the pain, were harmed.

This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] w... (read more)
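Read literally, the definition above suggests a simple formalization (my paraphrase, not the author's):

```latex
% Pain at time t as the perceived drop in goal-achievement probability.
% A paraphrase of the definition above, not a claim about biology.
\[
\text{pain}_t \;\propto\; \max\!\bigl(0,\;
  \hat{P}_{t-1}(\text{goals achieved}) - \hat{P}_{t}(\text{goals achieved})\bigr)
\]
```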

I think pain is a little bit different than that. It's the contrast between the current state and the goal state. This contrast motivates the agent to act, when the pain of contrast becomes bigger than the (predicted) pain of acting.

As a human, you can decrease your pain by thinking that everything will be okay, or you can increase your pain by doubting the process. But it is unlikely that you will allow yourself to stop hurting, because your brain fears that a lack of suffering would result in a lack of progress (some wise people contest this, claiming th... (read more)

CstineSublime · 1d
  How many organisms other than humans have "long term goals"? Doesn't that require a complex capacity for mental representation of possible future states? Am I wrong in assuming that the capacity to experience "pain" is independent of an explicit awareness of what possibilities have been shifted as a result of the new sensory data? (i.e. having a limb cleaved from the rest of the body, stubbing your toe in the dark). The organism may not even be aware of those possibilities, only 'aware' of pain. Note: I'm probably just having a fear of this sounding all too teleological and personifying evolution
Alexander Gietelink Oldenziel · 2d
It also suggests that there might be some sort of conservation law for pain for agents. Conservation of Pain, if you will.
TurnTrout · 2d

A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wantin... (read more)
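A toy sketch of the candidate definition above (my own illustration, not TurnTrout's formalization): a policy whose action logits are the sum of independently gated "shard" contributions, each gated by its own context feature.

```python
import numpy as np

# Toy "shard-theoretic" policy: two motivational circuits that activate
# independently based on different context features, and whose outputs
# compose additively into the action logits. Purely illustrative.
ACTIONS = ["eat", "socialize", "rest"]

def food_shard(context):
    # Activates only when food is visible; biases toward "eat".
    gate = 1.0 if context["food_visible"] else 0.0
    return gate * np.array([2.0, 0.0, 0.0])

def social_shard(context):
    # Activates only when friends are present; biases toward "socialize".
    gate = 1.0 if context["friends_present"] else 0.0
    return gate * np.array([0.0, 2.0, 0.0])

def policy(context):
    logits = food_shard(context) + social_shard(context)  # shards compose
    probs = np.exp(logits) / np.exp(logits).sum()
    return dict(zip(ACTIONS, probs.round(3)))

# Each shard can fire with or without the other, unlike a single
# always-applied utility function over outcomes.
print(policy({"food_visible": True, "friends_present": False}))
print(policy({"food_visible": True, "friends_present": True}))
```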

Thomas Kwa · 16h

I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:

  • A shardful agent can be incoherent due to valuing different things from different states
  • A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather t
... (read more)
Daniel Kokotajlo · 2d
I think this is also what I was confused about -- TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? but what does that mean? Perhaps it means: The shards aren't always applied, but rather only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don't understand this.
samshap · 2d
Instead of demanding orthogonal representations, just have them obey the restricted isometry property. Basically, instead of requiring ∀i≠j: ⟨x_i, x_j⟩ = 0, we just require ∀i≠j: x_i ⋅ x_j ≤ ε. This would allow a polynomial number of sparse shards while still allowing full recovery.
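For concreteness, a quick numerical sketch of the relaxed condition (my own illustration; the dimensions and counts are arbitrary): random unit vectors in d dimensions are far from exactly orthogonal, yet all pairwise inner products stay small, which is why many more than d near-orthogonal directions can coexist.

```python
import numpy as np

# How nearly-orthogonal are n random directions in d dimensions?
# Illustrates |<x_i, x_j>| <= eps rather than exact orthogonality.
rng = np.random.default_rng(0)
d, n = 512, 2048          # many more vectors than dimensions

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit vectors

G = np.abs(X @ X.T)                  # pairwise |inner products|
np.fill_diagonal(G, 0.0)
print(f"max |<x_i, x_j>| for i != j: {G.max():.3f}")   # well below 1
```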

@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check the SEP article on physicalism and found that, yep, it doesn't have a definition of physics:

Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all,

... (read more)
tlevin · 3d

I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more... (read more)

MP · 19h

I'm not a decel, but the way this stuff often is resolved is that there are crazy people that aren't taken seriously by the managerial class but that are very loud and make obnoxious asks. Think the evangelicals against abortion or the Columbia protestors.

Then there is some elite, part of the managerial class, that makes reasonable policy claims. For abortion, this is Mitch McConnell, being disciplined over a long period of time in choosing the correct judges. For Palestine, this is Blinken and his State Department bureaucracy.

The problem with d... (read more)

tlevin · 1d
Quick reactions:

  1. Re: how over-emphasis on "how radical is my ask" vs "what my target audience might find helpful" and generally the importance of making your case well regardless of how radical it is, that makes sense. Though notably the more radical your proposal is (or more unfamiliar your threat models are), the higher the bar for explaining it well, so these do seem related.
  2. Re: more effective actors looking for small wins, I agree that it's not clear, but yeah, seems like we are likely to get into some reference class tennis here. "A lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate"? Maybe, but I think of like, the agriculture lobby, who just sort of quietly make friends with everybody and keep getting 11-figure subsidies every year, in a way that (I think) resulted more from gradual ratcheting than making a huge ask. "Pretty much no group– whether radical or centrist– has had tangible wins" seems wrong in light of the EU AI Act (where I think both a "radical" FLI and a bunch of non-radical orgs were probably important) and the US executive order (I'm not sure which strategy is best credited there, but I think most people would have counted the policies contained within it as "minor asks" relative to licensing, pausing, etc). But yeah I agree that there are groups along the whole spectrum that probably deserve credit.
  3. Re: poisoning the well, again, radical-ness and being dumb/uninformed are of course separable but the bar rises the more radical you get, in part because more radical policy asks strongly correlate with more complicated procedural asks; tweaking ECRA is both non-radical and procedurally simple, creating a new agency to license training runs is both outside the DC Overton Window and very procedurally complicated.
  4. Re: incentives, I agree that this is a good thing to track, but like, "people who oppose X are in
Noosphere89 · 20h
It's not just that problem though: they will likely be biased to think that their policy is helpful for the safety of AI at all, and this is a point that sometimes gets forgotten. But you're correct that Akash's argument is fully general.

The FCC just fined US phone carriers for sharing the location data of US customers with anyone willing to buy it. The fines don't seem to be high enough to deter this kind of behavior.

That likely includes, either directly or indirectly, the Chinese government.

What does the US Congress do to protect against spying by China? Of course, banning TikTok instead of actually protecting the data of US citizens.

If you have threat models in which the Chinese government might target you, assume that they know where your phone is, and shut it off when going somewhere you... (read more)

Dagon · 2d
[note: I suspect we mostly agree on the impropriety of open selling and dissemination of this data.  This is a narrow objection to the IMO hyperbolic focus on government assault risks. ] I'm unhappy with the phrasing of "targeted by the Chinese government", which IMO implies violence or other real-world interventions when the major threats are "adversary use of AI-enabled capabilities in disinformation and influence operations." Thanks for mentioning blackmail - that IS a risk I put in the first category, and presumably becomes more possible with phone location data.  I don't know how much it matters, but there is probably a margin where it does. I don't disagree that this purchasable data makes advertising much more effective (in fact, I worked at a company based on this for some time).  I only mean to say that "targeting" in the sense of disinformation campaigns is a very different level of threat from "targeting" of individuals for government ops.

This is a narrow objection to the IMO hyperbolic focus on government assault risks.

Whether or not you face government assault risks depends on what you do. Most people don't face government assault risks. Some people engage in work or activism that results in them having government assault risks.

The Chinese government has strategic goals and most people are unimportant to those. Some people however work on topics like AI policy in which the Chinese government has an interest. 

ChristianKl · 2d
Politico wrote, "Perhaps the most pressing concern is around the Chinese government’s potential access to troves of data from TikTok’s millions of users." The concern that TikTok supposedly is spyware is frequently made in discussions about why it should be banned. If the main issue is content moderation decisions, the best way to deal with it would be to legislate transparency around content moderation decisions and require TikTok to outsource the moderation decisions to some US contractor. 

From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (from February 2023):

Assume we're in a world where AI systems are broadly deployed, and the world has become increasingly complex, where humans know less and less about how things work.

A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize it won't be successful if it tries to take over prematurely, so it won't try, and nothing looks like a problem yet.

So you lose when a takeover becomes possible, and some threshold of AIs behave badly. If all the smartest AIs... (read more)

Looking for blog platform/framework recommendations

I had a WordPress blog, but I don't like WordPress and I want to move away from it.

Substack doesn't seem like a good option because I want high customizability and multilingual support (my blog is going to be in English and Hebrew).

I would like something that I can use for free with my own domain (so not Wix).

The closest thing I found to what I'm looking for was MkDocs Material, but it's still geared too much towards documentation, and I don't like its blog functionality enough.

Other requirements: Da... (read more)

decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken anal... (read more)

Pi Rogers · 1d
What about the following: My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian. I'm not sure how correct this is, but it's possible.

It certainly is possible! In more decision-theoretic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct!

But as per my post, I'd describe this position as "not intrinsically altruistic" — you're optimizing for everyone's happiness because "it sure would suck if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.

Viliam · 4d
I don't have an explicit theory of how this works; for example, I would consider "pleasing others" in an experience machine meaningless, but "eating a cake" in an experience machine seems just as okay as in real life (maybe even preferable, considering that cakes are unhealthy). A fake memory of "having eaten a cake" would be a bad thing; "making people happier by talking to them" in an experience machine would be intrinsically meaningless, but it might help me improve my actual social skills, which would be valuable. Sometimes I care about the referent being real (the people I would please), sometimes I don't (the cake I would eat). But it's not the people/cake distinction per se; for example in case of using fake simulated people to practice social skills, the emphasis is on the skills being real; I would be disappointed if the experience machine merely gave me a fake "feeling of having improved my skills". I imagine that for a psychopath everything and everyone is instrumental, so there would be no downside to the experience machine (except for the risk of someone turning it off). But this is just a guess. I suspect that analyzing "the true preferences" is tricky, because ultimately we are built of atoms, and atoms have no preferences. So the question is whether by focusing on some aspect of the human mind we got better insight to its true nature, or whether we have just eliminated the context that was necessary for it to make sense.

In my fantasies, if I ever were to get that god-like glimpse at how everything actually is, with all that is currently hidden unveiled, it would be something like the feeling you have when you get a joke, or see a "magic eye" illustration, or understand an illusionist's trick, or learn to juggle: what was formerly perplexing and incoherent becomes in a snap simple and integrated, and there's a relieving feeling of "ah, but of course."

But it lately occurs to me that the things I have wrong about the world are probably things I've grasped at exactly because ... (read more)

And then today I read this: “We yearn for the transcendent, for God, for something divine and good and pure, but in picturing the transcendent we transform it into idols which we then realize to be contingent particulars, just things among others here below. If we destroy these idols in order to reach something untainted and pure, what we really need, the thing itself, we render the Divine ineffable, and as such in peril of being judged non-existent. Then the sense of the Divine vanishes in the attempt to preserve it.” (Iris Murdoch, Metaphysics as a Guide to Morals)

metachirality · 2d
I like to phrase it as "the path to simplicity involves a lot of detours." Yes, Newtonian mechanics doesn't account for the orbit of Mercury but it turned out there was an even simpler, more parsimonious theory, general relativity, waiting for us.

Pithy sayings are lossily compressed.

Something someone technical and interested in forecasting should look into: can LLMs reliably convert people's claims into a % of confidence through sentiment analysis? This would be useful for forecasters, I believe (and for rationality in general).

There have been multiple occasions where I've copied and pasted email threads into an LLM and asked it things like:

  1. What is X person saying
  2. What are the cruxes in this conversation?
  3. Summarise this conversation
  4. What are the key takeaways
  5. What views are being missed from this conversation

I really want an email plugin that basically brute forces rationality INTO email conversations.
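A minimal sketch of the claim-to-confidence idea above, assuming the OpenAI Python client (the model name and prompt wording are placeholder assumptions; any chat-completion API would do):

```python
# Sketch: ask an LLM to map a natural-language claim to a numeric confidence.
# Assumes `pip install openai` and an OPENAI_API_KEY in the environment;
# the model name and prompt wording are placeholder assumptions.
from openai import OpenAI

client = OpenAI()

def claim_to_confidence(claim: str) -> float:
    prompt = (
        "On a scale from 0 to 100, how confident does the author of the "
        "following claim sound? Reply with a single number only.\n\n"
        f"Claim: {claim}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip()) / 100

print(claim_to_confidence("I guess the launch will probably slip to Q3."))
```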

yanni kyriacos · 2d
Hi Johannes! Thanks for the suggestion :) I'm not sure I'd want it in the middle of a video call, but maybe in a forum context like this it could be cool?
Johannes C. Mayer · 2d
Seems pretty good to me to have this in a video call. The main reason why I don't immediately try this out is that I would need to write a program to do this.

That seems fair enough!

quick thoughts on LLM psychology

LLMs cannot be directly anthropomorphized. Though something like "a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, and receives inputs that are added to the LLM context" is much more agent-like.

Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.

These feelings add up to a “utility function”, something ... (read more)
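A minimal sketch of the agent-like wrapper described at the top of this take (my own illustration; `call_llm`, the table schema, and the "CALL" convention are hypothetical stand-ins, not anyone's actual system):

```python
import sqlite3

# Sketch of the agent-like wrapper described above: a loop that keeps a
# rolling chain of thought, persists memories to a relational database,
# and exposes a small function library the model can invoke.

def call_llm(context: str) -> str:
    raise NotImplementedError("plug in your model API here")  # placeholder

db = sqlite3.connect("memory.db")
db.execute("CREATE TABLE IF NOT EXISTS memories (ts INTEGER, text TEXT)")

def remember(text: str) -> str:
    db.execute("INSERT INTO memories VALUES (strftime('%s','now'), ?)", (text,))
    db.commit()
    return "stored"

def recall(query: str) -> str:
    rows = db.execute(
        "SELECT text FROM memories WHERE text LIKE ? ORDER BY ts DESC LIMIT 5",
        (f"%{query}%",),
    ).fetchall()
    return "\n".join(r[0] for r in rows)

TOOLS = {"remember": remember, "recall": recall}

def agent_loop(goal: str, steps: int = 10) -> None:
    thoughts = [f"Goal: {goal}"]
    for _ in range(steps):
        out = call_llm("\n".join(thoughts[-20:]))      # rolling chain of thought
        thoughts.append(out)
        if out.startswith("CALL "):                    # e.g. "CALL recall: conference"
            name, _, arg = out[len("CALL "):].partition(":")
            if name.strip() in TOOLS:
                thoughts.append(TOOLS[name.strip()](arg.strip()))
```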

Richard_Ngo · 2d

Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.

This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-ca... (read more)
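For reference, the notion of POWER being pointed at here is, roughly, the average optimal value an agent could attain at a state over a distribution of reward functions. The formula below is from my memory of Turner et al.'s "Optimal Policies Tend to Seek Power"; treat the exact normalization as a sketch:

```latex
% Roughly Turner et al.'s POWER definition (from memory; the normalization
% factor should be checked against the paper):
\[
\mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;=\; \frac{1-\gamma}{\gamma}\,
        \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) - R(s) \right]
\]
```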

Garrett Baker · 2d
There's also the problem of: what do you mean by "the human"? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since you're perfectly capable of making different decisions; you just don't. Another problem, which I like to think of as the "control panel of the universe" problem, is where the AI gives you the "control panel of the universe", but you aren't smart enough to operate it, in the sense that you have the information necessary to operate it, but not the intelligence. Such that you can technically do anything you want--you have maximal power/empowerment--but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.

Such that you can technically do anything you want--you have maximal power/empowerment--but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.

I think any model of a rational agent needs to incorporate the fact that they're not arbitrarily intelligent, otherwise none of their actions make sense. So I'm not too worried about this.

If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely

... (read more)
Richard_Ngo · 2d
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.

Hypothesis, super weakly held and based on anecdote:
One big difference between US national security policy people and AI safety people is that the "grieving the possibility that we might all die" moment happened, on average, more years ago for the national security policy person than the AI safety person. 

This is (even more weakly held) because the national security field has existed for longer, so many participants literally had the "oh, what happens if we get nuked by Russia" moment in their careers in the Literal 1980s...

What would the minimal digital representation of a human brain & by extension memories/personality look like?

I am not a subject matter expert. This is armchair speculation and conjecture, the actual reality of which I expect to be orders of magnitude more complicated than my ignorant model.

The minimal physical representation is obviously the brain itself, but to losslessly store every last bit of information (i.e. exact particle configurations) as accurately as it is possible to measure is both nigh-unto-impossible and likely unnecessary considering the ... (read more)
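For a sense of scale at the "minimal" end, here is a hedged back-of-envelope for a connectome-level (not particle-level) snapshot; the neuron and synapse counts are commonly cited rough figures, and the bits-per-synapse is my own loose assumption:

```python
import math

# Back-of-envelope size of a connectome-level (not particle-level) snapshot.
neurons = 86e9            # ~86 billion neurons (commonly cited rough estimate)
synapses = 1e14           # ~100 trillion synapses (rough order of magnitude)

id_bits = math.ceil(math.log2(neurons))      # ~37 bits to name one neuron
bits_per_synapse = id_bits + 16              # assumed: target ID + coarse weight/state

total_bytes = synapses * bits_per_synapse / 8
print(f"~{total_bytes / 1e12:.0f} TB")       # hundreds of TB at these assumptions
```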

gwern · 2d
You might find my notes of interest.

Yes, thanks!
