Quick Takes

Thomas Kwa · 22h

You should update by ±1% on AI doom surprisingly frequently

This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0% or 100%, then there will be about 50^2 = 2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many due to heavy-tailedness in the distribution concentrating the updates in fewer events, and ... (read more)
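A quick way to sanity-check the 2500-step figure is to simulate the walk directly; the sketch below (my own illustration, plain Python) treats p(doom) as a ±1 percentage-point random walk absorbed at 0% and 100%.

```python
import random

def updates_until_resolved(start=50, lo=0, hi=100):
    """Count +/-1 point updates until p(doom) hits 0% or 100%."""
    p, steps = start, 0
    while lo < p < hi:
        p += random.choice((-1, 1))
        steps += 1
    return steps

trials = 2000
avg = sum(updates_until_resolved() for _ in range(trials)) / trials
print(f"average number of 1% updates: {avg:.0f}")               # ~2500 = 50^2
print(f"updates per week over 10 years: {avg / (10 * 52):.1f}")  # ~4-5
```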

JBlack · 24m

It definitely should not move by anything like a Brownian motion process. At the very least it should be bursty and updates should be expected to be very non-uniform in magnitude.

In practice, you should not consciously update very often since almost all updates will be of insignificant magnitude on near-irrelevant information. I expect that much of the credence weight turns on unknown unknowns, which can't really be updated on at all until something turns them into (at least) known unknowns.

But sure, if you were a superintelligence with practically unbounded rationality then you might in principle update very frequently.

TsviBT · 4h
Probabilities on summary events like this are mostly pretty pointless. You're throwing together a bunch of different questions, about which you have very different knowledge states (including how much and how often you should update about them).
niplav · 8h
Because[1] for a Bayesian reasoner, there is conservation of expected evidence. Although I've seen it mentioned that technically the beliefs of a Bayesian should follow a martingale, and Brownian motion is a martingale.

----------------------------------------

1. I'm not super technically strong on this particular part of the math. Intuitively it could be that in a bounded reasoner which can only evaluate programs in P, any pattern in its beliefs that can be described by an algorithm in P is detected and the predicted future belief from that pattern is incorporated into current beliefs. On the other hand, any pattern described by an algorithm in EXPTIME∖P can't be in the class of hypotheses of the agent, including hypotheses about its own beliefs, so EXPTIME patterns persist. ↩︎
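For reference, conservation of expected evidence is exactly the martingale property: the expected posterior equals the prior. A minimal numerical check (the likelihoods are arbitrary, purely for illustration):

```python
# Tiny check of conservation of expected evidence: E[posterior] == prior.
prior_H = 0.5
p_E_given_H, p_E_given_notH = 0.8, 0.3       # arbitrary illustrative likelihoods

p_E = prior_H * p_E_given_H + (1 - prior_H) * p_E_given_notH
post_if_E    = prior_H * p_E_given_H       / p_E
post_if_notE = prior_H * (1 - p_E_given_H) / (1 - p_E)

expected_posterior = p_E * post_if_E + (1 - p_E) * post_if_notE
print(expected_posterior)   # 0.5, equal to prior_H
```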

"alignment researchers are found to score significantly higher in liberty (U=16035, p≈0)" This partly explains why so much of the alignment community doesn't support PauseAI!

"Liberty: Prioritizes individual freedom and autonomy, resisting excessive governmental control and supporting the right to personal wealth. Lower scores may be more accepting of government intervention, while higher scores champion personal freedom and autonomy..." 
https://forum.effectivealtruism.org/posts/eToqPAyB4GxDBrrrf/key-takeaways-from-our-ea-and-alignment-research-surveys... (read more)

William_S · 1d

I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t... (read more)


These are valid concerns! I presume that if "in the real timeline" there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they... would have coordinated more? (Or maybe they're trying to settle a bet on how the Singularity might counterfactually have happened in the event of this or that person experiencing this or that coincidence? But in that case I don't think the self inserts would be allowed to say they're self inserts.)

Like why not re-roll the PRNG, to censor out the counterfactually s... (read more)

JenniferRM · 5h
For most of my comments, I'd almost be offended if I didn't say something surprising enough to get a "high interestingness, low agreement" voting response. Excluding speech acts, why even say things if your interlocutor or full audience can predict what you'll say? And I usually don't offer full clean proofs in direct word. Anyone still pondering the text at the end, properly, shouldn't "vote to agree", right? So from my perspective... it's fine and sorta even working as intended <3

However, also, this is currently the top-voted response to me, and if William_S himself reads it I hope he answers here, if not with text then (hopefully? even better?) with a link to a response elsewhere?

((EDIT: Re-reading everything above this point, I notice that I totally left out the "basic take" that might go roughly like "Kurzweil, Altman, and Zuckerberg are right about compute hardware (not software or philosophy) being central, and there's a compute bottleneck rather than a compute overhang, so the speed of history will KEEP being about datacenter budgets and chip designs, and those happen on 6-to-18-month OODA loops that could actually fluctuate based on economic decisions, and therefore it's maybe 2026, or 2028, or 2030, or even 2032 before things pop, depending on how and when billionaires and governments decide to spend money".))

Pulling honest posteriors from people who've "seen things we wouldn't believe" gives excellent material for trying to perform aumancy... work backwards from their posteriors to possible observations, and then forwards again, toward what might actually be true :-)
Jackson Silver · 9h
At least one of them has explicitly indicated they left because of AI safety concerns, and this thread seems to be insinuating some concern - Ilya Sutskever's conspicuous silence has become a meme, and Altman recently expressed that he is uncertain of Ilya's employment status. There still hasn't been any explanation for the boardroom drama last year. If it was indeed run-of-the-mill office politics and all was well, then something to the effect of "our departures were unrelated, don't be so anxious about the world ending, we didn't see anything alarming at OpenAI" would obviously help a lot of people and also be a huge vote of confidence for OpenAI. It seems more likely that there is some (vague?) concern but it's been overridden by tremendous legal/financial/peer motivations.
habryka · 1d

Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company were now performing assassinations of U.S. citizens.

Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.

Ben Pace · 1d
I'm probably missing something simple, but what is 356? I was expecting a probability or a percent, but that number is neither.
elifland · 1d
I think 356 or more people in the population are needed to make there be a >5% chance of 2+ deaths in a 2-month span from that population.
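For anyone who wants to reproduce the flavor of that calculation, here is a sketch under an assumed per-person death probability; the 0.1%-per-two-months figure below is my illustrative assumption, not necessarily the one used above.

```python
from math import exp

# P(at least 2 deaths) among n people, each with independent probability p of
# dying in the window, via the Poisson approximation with lambda = n * p.
def p_two_or_more(n, p):
    lam = n * p
    return 1 - exp(-lam) * (1 + lam)

p = 0.001  # assumed per-person probability of death in a 2-month window (illustrative)
n = 1
while p_two_or_more(n, p) <= 0.05:
    n += 1
print(n, round(p_two_or_more(n, p), 4))  # ~356 people before P(2+ deaths) exceeds 5%
```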
isabel · 5h

I think there should be some sort of adjustment for Boeing not being exceptionally sus before the first whistleblower death - we shouldn't privilege Boeing until after the first death, and should be thinking across all industries big enough that the news would report on the deaths of whistleblowers, which I think makes it not significant again.

lc · 1d

I seriously doubt on priors that Boeing corporate is murdering employees.

This raises the question of why we don't observe other companies assassinating whistleblowers.

1a3orn · 18h
I mean, sure, but I've been updating in that direction a weirdly large amount.

Mathematical descriptions are powerful because they can be very terse. You can specify only the properties of a system and still get a well-defined system.

This is in contrast to writing algorithms and data structures, where you need concrete implementations of the algorithms and data structures to get a full description.

Johannes C. Mayer · 16h
Let xs be a finite list of natural numbers. Let xs' be the list that is xs sorted ascendingly. I could write down in full formality what it means for a list to be sorted, without ever talking about how you would go about calculating xs' given xs. That is the power I am talking about. We can say what something is without talking about how to get it.

And yes, this still applies for constructive logic, because the property of being sorted is just a logical property of a list. It's a definition. To give a definition, I don't need to talk about what kind of algorithm would produce something that satisfies this condition. That is completely separate. And being able to see these as separate is a really useful abstraction, because it hides away many unimportant details.

Computer Science is about how-to-do-X knowledge, as SICP says. Math is about talking about stuff in full formal detail without talking about this how-to-do-X knowledge, which can get very complicated. How does a modern CPU add two 64-bit floating-point numbers? Certainly not in an obvious, simple way, because that would be way too slow. The CPU here illustrates the point as a sort of ultimate instantiation of implementation detail.
Dagon · 14h
I kind of see what you're saying, but I also rather think you're talking about specifying very different things in a way that I don't think is required.  The closer CS definition of math's "define a sorted list" is "determine if a list is sorted".  I'd argue it's very close to equivalent to the math formality of whether a list is sorted.  You can argue about the complexity behind the abstraction (Math's foundations on set theory and symbols vs CS library and silicon foundations on memory storage and "list" indexing), but I don't think that's the point you're making. When used for different things, they're very different in complexity.  When used for the same things, they can be pretty similar.

Yes, that is a good point. I think you can totally write a program that checks, given two lists xs and xs' as input, that xs' is sorted and also contains exactly the elements of xs. That allows us to specify in code what it means that a list xs' is what I get when I sort xs.

And yes, I can do this without talking about how to sort a list. This nearly gives a property such that only one function is implied by it: the sorting function. I can totally constrain what the program can be (at least if we ignore runtime and memory stuff).
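A minimal sketch of that kind of checker (my own illustration): it pins down what "xs' is xs sorted" means without saying anything about how to sort.

```python
from collections import Counter

def is_sorted_ascending(ys):
    """The property itself: every adjacent pair is in order."""
    return all(a <= b for a, b in zip(ys, ys[1:]))

def is_sort_of(xs, ys):
    """Specification: ys is sorted and has exactly the elements of xs.
    Says nothing about *how* to sort; it only checks the property."""
    return is_sorted_ascending(ys) and Counter(xs) == Counter(ys)

assert is_sort_of([3, 1, 2], [1, 2, 3])
assert not is_sort_of([3, 1, 2], [1, 2, 2])  # right order, wrong multiset
```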

The Edge home page featured an online editorial that downplayed AI art because it just combines images that already exist. If you look closely enough, human artwork is also a combination of things that already existed.

One example is Blackballed Totem Drawing: Roger 'The Rajah' Brown. James Pate made this charcoal drawing in 2016. It was the Individual Artist Winner of the Governor's Award for the Arts. At the microscopic scale, this artwork consists of black particles embedded in a large sheet of paper. I doubt he made the paper he drew on, and the black... (read more)

Claude learns across different chats. What does this mean?

 I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its ro... (read more)

Dalcy · 1d

Thoughtdump on why I'm interested in computational mechanics:

  • one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. 'discover' fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
    • ... but i was initially i
... (read more)

I agree with you. 

Epsilon machine (and MSP) construction is most likely computationally intractable [I don't know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios. 

Scaling an approximate version of epsilon reconstruction seems therefore of prime importance. Real-world architectures and data have highly specific structure & symmetry that make them different from completely generic HMMs. This most likely must be exploited.

The calculi of emergence paper has inspired many people but has n... (read more)

Pithy sayings are lossily compressed.

Yes.

For example: the common saying "Anything worth doing is worth doing [well/poorly]" needs more qualifiers. As it is, the opposite respective advice can often be just as useful - i.e., not very.

Better V1: "The cost/utility ratio of beneficial actions at minimum cost is often less favorable than it would be with greater investment."

Better V2: "If an action is beneficial, a flawed attempt may be preferable to none at all."

However, these are too wordy to be pithy, and in pop-culture transmission, accuracy is generally sacrificed in favor of catchiness.

Buck · 2d

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

  • Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
  • Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
    • So e.g. I exclu
... (read more)
Matthew Barnett · 1d
Can you be clearer about this point? To operationalize this, I propose the following question: what fraction of world GDP do you expect will be attributable to AI at the time we have these risky AIs that you are interested in? For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".

As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

  • The speed up is relative to the current status quo as of GPT-4.
  • The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
  • By "capable" o
... (read more)
Buck · 2d
When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.
dkornai · 3d

Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals. 

In biological organisms, physical pain [say, in response to a limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where the subsystems generating pain - which their long-term goals [e.g. locomotion to a favourable position with the limb] required - were harmed.

This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] w... (read more)

StartAtTheEnd · 2d
I think pain is a little bit different than that. It's the contrast between the current state and the goal state. This contrast motivates the agent to act, when the pain of the contrast becomes bigger than the (predicted) pain of acting. As a human, you can decrease your pain by thinking that everything will be okay, or you can increase your pain by doubting the process. But it is unlikely that you will allow yourself to stop hurting, because your brain fears that a lack of suffering would result in a lack of progress (some wise people contest this, claiming that wu wei is correct). Another way you can increase your pain is by focusing more on the goal you want to achieve, sort of irritating/torturing yourself with the fact that the goal isn't achieved, to which your brain will respond by increasing the pain felt by the contrast, urging action.

Do you see how this differs slightly from your definition? Chronic pain is not a continuous reduction in agency, but a continuous contrast between a bad state and a good state, which makes one feel pain which motivates them to solve it (exercise, surgery, resting, looking for painkillers, etc). This generalizes to other negative feelings, for instance to hunger, which exists with the purpose of being less pleasant than the search for food, such that you seek food.

I warn you that avoiding negative emotions can lead to stagnation, since suffering leads to growth (unless we start wireheading, and making the avoidance of pain our new goal, because then we might seek hedonic pleasures and intoxicants).
dkornai · 2d
I would certainly agree with part of what you are saying. Especially the point that many important lessons are taught by pain [correct me if this is misinterpreting your comment]. Indeed, as a parent for example, if your goal is for your child to gain the capacity for self sufficiency, a certain amount of painful lessons that reflect the inherent properties of the world are necessary to achieve such a goal.

On the other hand, I do not agree with your framing of pain as being the main motivator [again, correct me if required]. In fact, a wide variety of systems in the brain are concerned with calculating and granting rewards. Perhaps pain and pleasure are the two sides of the same coin, and reward maximisation and regret minimisation are identical. In practice however, I think they often lead to different solutions.

I also do not agree with your interpretation that chronic pain does not reduce agency. For family members of mine suffering from arthritis, their chronic pain renders them unable to do many basic activities, for example, access areas for which you would need to climb stairs. I would like to emphasise that it is not the disease which limits their "degrees of freedom" [at least in the short term], and were they to take a large amount of painkillers, they could temporarily climb stairs again.

Finally, I would suggest that your framing as a "contrast between the current state and the goal state" is basically an alternative way of talking about the transition probability from the current state to the goal state. In my opinion, this suggests that our conceptualisations of pain are overwhelmingly similar.

I think all criticism, all shaming, all guilt tripping, all punishments and rewards directed at children - is for the purpose of driving them to do certain actions. If your children do what you think is right, there's no need to do much of anything.

A more general and correct statement would be "Pain is for the sake of change, and all change is painful". But that change is for the sake of actions. I don't think that's too much of a simplification to be useful.

I think regret, too, is connected here. And there's certainly times when it seems like pain is the ... (read more)

Elizabeth · 10d

Check my math: how does Enovid compare to humming?

Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…

Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88. But maybe it's more complicated. I've got an email out to the PI but am not hopeful about a response ... (read more)
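Restating the arithmetic from the numbers quoted above (the "amortize over 8h" step is the assumption stated in the post, and the ratio at the end is just my own comparison to the baseline figures):

```python
# Back-of-the-envelope from the numbers quoted above (units as given, ppm):
baseline_nasal_no = {"women": 0.14, "men": 0.18}   # normal nasal NO
enovid_rate = 0.11       # ppm NO per hour, per the trial registration
hours_per_dose = 8       # doses are delivered every 8 hours
dose = enovid_rate * hours_per_dose                 # amortized dose per application
print(dose)                                         # 0.88 ppm
print({k: round(dose / v, 1) for k, v in baseline_nasal_no.items()})  # ~5-6x normal nasal levels
```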

Elizabeth · 4d
I found the gotcha: Enovid has two other mechanisms of action. Someone pointed this out to me on my previous nitric oxide post, but it didn't quite sink in till I did more reading.
DanielFilan · 4d
What are the two other mechanisms of action?

citric acid and a polymer

Viliam · 2d

I suspect that in practice many people use the word "prioritize" to mean:

  • think short-term
  • only do legible things
  • remove slack

The Model-View-Controller architecture is very powerful. It allows us to separate concerns.

For example, if we want to implement an algorithm, we can write down only the data structures and algorithms that are used.

We might want to visualize the steps that the algorithm is performing, but this can be separated from the actual running of the algorithm.

If the algorithm is interactive, then instead of putting the interaction logic in the algorithm, which could be thought of as the rules of the world, we implement functionality that directly changes the... (read more)
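A minimal sketch of that separation (an illustrative toy example, not from the post): the model knows nothing about how it is displayed or driven.

```python
# Model: the algorithm and its data, with no knowledge of display or input.
class CounterModel:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1

# View: only knows how to render the model's state.
def render(model: CounterModel) -> str:
    return f"count = {model.value}"

# Controller: turns user input into model updates, then asks the view to render.
def handle_command(model: CounterModel, command: str) -> str:
    if command == "inc":
        model.increment()
    return render(model)

model = CounterModel()
print(handle_command(model, "inc"))  # count = 1
print(handle_command(model, "inc"))  # count = 2
```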

Elizabeth · 14d

A very rough draft of a plan to test prophylactics for airborne illnesses.

Start with a potential superspreader event. My ideal is a large conference, many of whose attendees travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.

Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in t... (read more)
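As a rough cross-check on the ~500 figure, here is a standard two-proportion power calculation; the attack rates below are illustrative assumptions of mine, not numbers from the post.

```python
from math import sqrt

# Two-proportion sample-size sketch: how many people per arm to detect a
# prophylactic that cuts the infection ("attack") rate from p1 to p2.
def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    z_a, z_b = 1.96, 0.8416          # normal quantiles for alpha/2 and power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

# Assume a 20% attack rate at the event, halved by the prophylactic (illustrative).
print(round(n_per_arm(0.20, 0.10)))   # ~199 per arm, i.e. ~400 participants total
```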

gwern · 14d
This sounds like a bad plan because it will be a logistics nightmare (undermining randomization) with high attrition, and extremely high variance due to between-subject design (where subjects differ a ton at baseline, in addition to exposure) on a single occasion with uncontrolled exposures and huge measurement error where only the most extreme infections get reported (sometimes). You'll probably get non-answers, if you finish at all. The most likely outcome is something goes wrong and the entire effort is wasted.

Since this is a topic which is highly repeatable within-person (and indeed, usually repeats often through a lifetime...), this would make more sense as within-individual and using higher-quality measurements. One good QS approach would be to exploit the fact that infections, even asymptomatic ones, seem to affect heart rate etc as the body is damaged and begins fighting the infection. HR/HRV is now measurable off the shelf with things like the Apple Watch, AFAIK. So you could recruit a few tech-savvy conference-goers for measurements from a device they already own & wear.

This avoids any 'big bang' and lets you prototype and tweak on a few people - possibly yourself? - before rolling it out, considerably de-risking it. There are some people who travel constantly for business and going to conferences, and recruiting and managing a few of them would probably be infinitely easier than 500+ randos (if for no reason other than being frequent flyers they may be quite eager for some prophylactics), and you would probably get far more precise data out of them if they agree to cooperate for a year or so and you get eg 10 conferences/trips out of each of them which you can contrast with their year-round baseline & exposome and measure asymptomatic infections or just overall health/stress. (Remember, variance reduction yields exponential gains in precision or sample-size reduction. It wouldn't be too hard for 5 or 10 people to beat a single 250vs250 one-off experi

All of the problems you list seem harder with repeated within-person trials. 

I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, it's pretty clear. The common units are "amount of time", so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think in practice people just look at vibes.

For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better... (read more)

the gears to ascension · 2d
People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment where people who are most willing to publish, because they are most able to convince themselves their research is safe - eg, because they don't understand in detail how to reason about whether it is or not - are the ones who will do so. Ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don't have the evidence to back their hunches, even if their hunches are correct, which unless they have a demonstrated track record they probably aren't.

Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don't personally see how it's exfohaz. And it won't be apparent until afterwards that it was capabilities, not alignment.

So just don't publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node. But for god's sake stop accidentally helping people create green nodes because you can't see five inches ahead. And don't send it to a capabilities team before it's able to guarantee moral alignment hard enough to make a red-proof yellow node!

This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual.

The sort of reasoning you use here, and the fact that my only response to it basically amounts to "well, no, I think you're wrong; this proposal will slow down alignment too much", is why I think we need numbers to ground us.

Garrett Baker · 2d
Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations. Obviously such numbers aren't the end of the line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance. If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered; this isn't exactly my main wheelhouse).

it seems to me that disentangling beliefs and values is an important part of being able to understand each other

and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard

Viliam · 7d
Let's use "disagree" vs "dislike".

when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"

TurnTrout · 4d

A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wantin... (read more)
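A toy sketch of the definition (my own illustration, not TurnTrout's formalism): two "motivational circuits" that activate independently depending on context, and whose action-bids compose into a single policy.

```python
import numpy as np

# Contexts: [ice_cream_visible, cookie_visible]; actions: [grab_ice_cream, grab_cookie, wait]

def ice_cream_shard(ctx):
    """Activates only when ice cream is visible; bids for grabbing it."""
    return np.array([2.0, 0.0, 0.0]) if ctx[0] else np.zeros(3)

def cookie_shard(ctx):
    """Activates only when a cookie is visible; bids for grabbing it."""
    return np.array([0.0, 1.5, 0.0]) if ctx[1] else np.zeros(3)

def policy(ctx):
    """Shard contributions add up; a weak default bid favors waiting."""
    logits = ice_cream_shard(ctx) + cookie_shard(ctx) + np.array([0.0, 0.0, 0.5])
    return np.exp(logits) / np.exp(logits).sum()

for ctx in ([1, 0], [0, 1], [1, 1], [0, 0]):
    print(ctx, policy(ctx).round(2))
```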

Thomas Kwa · 2d

I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:

  • A shardful agent can be incoherent due to valuing different things from different states
  • A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather t
... (read more)
Daniel Kokotajlo · 3d
I think this is also what I was confused about -- TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? But what does that mean? Perhaps it means: The shards aren't always applied, but rather only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don't understand this.
samshap · 4d
Instead of demanding orthogonal representations, just have them obey the restricted isometry property. Basically, instead of requiring ∀i≠j: ⟨xᵢ, xⱼ⟩ = 0, we just require ∀i≠j: |⟨xᵢ, xⱼ⟩| ≤ ε. This would allow a polynomial number of sparse shards while still allowing full recovery.
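A quick numerical illustration of the relaxed condition (the dimension, count, and ε below are arbitrary choices of mine): far more unit-norm directions than dimensions can coexist while every pairwise inner product stays small.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 1024, 4096, 0.25          # illustrative dimension, number of shards, tolerance
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm feature directions

G = np.abs(X @ X.T)                   # |<x_i, x_j>| for all pairs
np.fill_diagonal(G, 0.0)
print(G.max(), G.max() <= eps)        # max off-diagonal overlap stays well under eps here
```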

@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check the SEP article on physicalism and found that, yep, it doesn't have a definition of physics:

Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all,

... (read more)