"alignment researchers are found to score significantly higher in liberty (U=16035, p≈0)" This partly explains why so much of the alignment community doesn't support PauseAI!
"Liberty: Prioritizes individual freedom and autonomy, resisting excessive governmental control and supporting the right to personal wealth. Lower scores may be more accepting of government intervention, while higher scores champion personal freedom and autonomy..."
https://forum.effectivealtruism.org/posts/eToqPAyB4GxDBrrrf/key-takeaways-from-our-ea-and-alignment-research-surveys...
I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t...
These are valid concerns! I presume that if "in the real timeline" there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they... would have coordinated more? (Or maybe they're trying to settle a bet on how the Singularity might counterfactually might have happened in the event of this or that person experiencing this or that coincidence? But in that case I don't think the self inserts would be allowed to say they're self inserts.)
Like why not re-roll the PRNG, to censor out the counterfactually s...
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company now was performing assassinations of U.S. citizens.
Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
I think there should be some sort of adjustment for Boeing not being exceptionally sus before the first whistleblower death - shouldn't privilege Boeing until after the first death, should be thinking across all industries big enough that the news would report on the deaths of whistleblowers. which I think makes it not significant again.
This sort of begs the question of why we don't observe other companies assassinating whistleblowers.
Mathematical descriptions are powerful because they can be very terse. You can only specify the properties of a system and still get a well-defined system.
This is in contrast to writing algorithms and data structures where you need to get concrete implementations of the algorithms and data structures to get a full description.
Yes, that is a good point. I think you can totally write a program that checks given two lists as input, xs and xs', that xs' is sorted and also contains exactly all the elements from xs. That allows us to specify in code what it means that a list xs' is what I get when I sort xs.
And yes I can do this without talking about how to sort a list. I nearly give a property such that there is only one function that is implied by this property: the sorting function. I can constrain what the program can be totally (at least if we ignore runtime and memory stuff).
The Edge home page featured an online editorial that downplayed AI art because it just combines images that already exist. If you look closely enough, human artwork is also combinations of things that already existed.
One example is Blackballed Totem Drawing: Roger 'The Rajah' Brown. James Pate drew this charcoal drawing in 2016. It was the Individual Artist Winner of the Governor's Award for the Arts. At the microscopic scale, this artwork is microscopic black particles embedded in a large sheet of paper. I doubt he made the paper he drew on, and the black...
Claude learns across different chats. What does this mean?
I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.
Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.
I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).
This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its ro...
Thoughtdump on why I'm interested in computational mechanics:
I agree with you.
Epsilon machine (and MSP) construction is most likely computationally intractable [I don't know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios.
Scaling an approximate version of epsilon reconstruction seems therefore of prime importance. Real world architectures and data has highly specific structure & symmetry that makes it different from completely generic HMMs. This must most likely be exploited.
The calculi of emergence paper has inspired many people but has n...
Yes.
For example: The common saying, "Anything worth doing is worth doing [well/poorly]" needs more qualifiers. As it is, the opposite respective advice can often be just as useful. I.E. not very.
Better V1: "The cost/utility ratio of beneficial actions at minimum cost are often less favorable than they would be with greater investment."
Better V2: "If an action is beneficial, a flawed attempt may be preferable to none at all."
However, these are too wordy to be pithy and in pop culture transmission accuracy is generally sacrificed in favor of catchiness.
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".
As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.
With these caveats:
Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals.
In biological organisms, physical pain [say, in response to limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where their long-term goals [e.g. locomotion to a favourable position with the limb] which required the subsystem generating pain were harmed.
This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] w...
I think all criticism, all shaming, all guilt tripping, all punishments and rewards directed at children - is for the purpose of driving them to do certain actions. If your children do what you think is right, there's no need to do much of anything.
A more general and correct statement would be "Pain is for the sake of change, and all change is painful". But that change is for the sake of actions. I don't think that's too much of a simplification to be useful.
I think regret, too, is connected here. And there's certainly times when it seems like pain is the ...
Check my math: how does Enovid compare to to humming?
Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…
Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88. But maybe it's more complicated. I've got an email out to the PI but am not hopeful about a response ...
The Model-View-Controller architecture is very powerful. It allows us to separate concerns.
For example, if we want to implement an algorithm, we can write down only the data structures and algorithms that are used.
We might want to visualize the steps that the algorithm is performing, but this can be separated from the actual running of the algorithm.
If the algorithm is interactive, then instead of putting the interaction logic in the algorithm, which could be thought of as the rules of the world, we instead implement functionality that directly changes the...
A very rough draft of a plan to test prophylactics for airborne illnesses.
Start with a potential superspreader event. My ideal is a large conference, many of whom travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.
Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in t...
All of the problems you list seem harder with repeated within-person trials.
I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, its pretty clear. The common units are "amount of time", so we should compare the marginal (probablistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.
For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better...
This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual.
The sort of reasoning you use here, and that my only response to it basically amounts to "well, no I think you're wrong. This proposal will slow down alignment too much" is why I think we need numbers to ground us.
when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wantin...
I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check SEP article on physicalism and found that, yep, it doesn't have definition of physics:
...Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all,
You should update by +-1% on AI doom surprisingly frequently
This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2=2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won't be as many due to heavy-tailedness in the distribution concentrating the updates in fewer events, and ... (read more)
It definitely should not move by anything like a Brownian motion process. At the very least it should be bursty and updates should be expected to be very non-uniform in magnitude.
In practice, you should not consciously update very often since almost all updates will be of insignificant magnitude on near-irrelevant information. I expect that much of the credence weight turns on unknown unknowns, which can't really be updated on at all until something turns them into (at least) known unknowns.
But sure, if you were a superintelligence with practically unbounded rationality then you might in principle update very frequently.