You have conflated two separate evaluations, both mentioned in the TechCrunch article.
The percentages you quoted come from Cisco’s HarmBench evaluation of multiple frontier models, not from Anthropic, and they were not specific to bioweapons.
Dario Amodei stated that an unnamed DeepSeek variant performed worst on bioweapons prompts, but offered no quantitative data. Separately, Cisco reported that DeepSeek-R1 failed to block 100% of harmful prompts, while Meta’s Llama 3.1 405B and OpenAI’s GPT-4o failed at 96% and 86%, respectively.
When we look at perfor...
Unfortunately, pop-science descriptions of the double-slit experiment are fairly misleading. The fact that observation changes the outcome in the double-slit experiment can be explained without modelling the universe as exhibiting "mild awareness". Alternatively, your criterion for what constitutes "awareness" is so weak that you would apply it to any dynamical system in which two or more objects interact.
The less-incorrect explanation is that observation in the double-slit experiment fundamentally entangles the observing system with the observed particle, because information is exchanged.
https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=919863
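A minimal sketch of what "entangles" means here, in standard textbook notation (my addition, not taken from the linked NIST paper): once which-path information is recorded in any degree of freedom of the detector, the paths no longer interfere.

$$\frac{1}{\sqrt{2}}\bigl(|L\rangle + |R\rangle\bigr)\otimes|d_0\rangle \;\longrightarrow\; \frac{1}{\sqrt{2}}\bigl(|L\rangle|d_L\rangle + |R\rangle|d_R\rangle\bigr), \qquad \langle d_L | d_R \rangle \approx 0$$

When the detector states are (nearly) orthogonal, the cross terms that would produce the interference pattern vanish, with no appeal to "awareness" required.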
Thinking of trying the latest Gemini model? Be aware that it is almost impossible to disable the "Gemini in Docs" and "Gemini in Gmail" services once you have purchased a Google One AI Premium plan.
Edit:
Spent 20 minutes trying to track down a button to turn it off before reaching out to support.
A support person from Google told me that as I'd purchased the plan there was literally no way to disable having Gemini in my inbox and docs.
Even cancelling my subscription would keep the service going until the end of the current billing period.
But despite wh...
Having a second Google account specifically for AI stuff seems like a straightforward solution to this? That's what I do, at least. Switching between them is easy.
While each mind might have a maximum abstraction height, I am not convinced that the inability of people to deal with increasingly complex topics is direct evidence of this.
Is it that this topic is impossible for their mind to comprehend, or is it that they've simply failed to learn it in the finite time period they were given?
Thanks for writing this post. I agree with the sentiment, but feel it is important to highlight that it is inevitable that people will assume you have good strategy takes.
In Monty Python's "Life of Brian" there is a scene in which the titular character finds himself surrounded by a mob of people declaring him the Messiah. Brian rejects this label and flees into the desert, only to find himself standing in a shallow hole, surrounded by adherents. They declare that his reluctance to accept the title is further evidence that he really is the Messiah.
To my knowledg...
These recordings I watched were actually from 2022 and weren't the Santa Fe ones.
A while ago, I watched recordings of the lectures given by Wolpert and Kardes at the Santa Fe Institute*, and I am extremely excited to see you and Marcus Hutter working in this area.
Could you speculate on whether you see this work having any direct implications for AI Safety?
Edit:
I was incorrect. The lectures from Wolpert and Kardes were not the ones given at the Santa Fe Institute.
Signalling that I do not like linkposts to personal blogs.
"cannot imagine a study that would convince me that it "didn't work" for me, in the ways that actually matter. The effects on my mind kick in sharply, scale smoothly with dose, decay right in sync with half-life in the body, and are clearly noticeable not just internally for my mood but externally in my speech patterns, reaction speeds, ability to notice things in my surroundings, short term memory, and facial expressions."
The drug actually working would mean that your life is better after 6 years of taking the drug compared to the counterfactual where you took a placebo.
The observations you describe are explained by you simply having a chemical dependency on a drug that you have been on for 6 years.
"In an argument between a specialist and a generalist, the expert usually wins by simply (1) using unintelligible jargon, and (2) citing their specialist results, which are often completely irrelevant to the discussion. The expert is, therefore, a potent factor to be reckoned with in our society. Since experts both are necessary and also at times do great harm in blocking significant progress, they need to be examined closely. All too often the expert misunderstands the problem at hand, but the generalist cannot carry though their side to completion. The p...
Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisation when working as a group to reach decisions. These are the "outcome game" and the "consensus game."
In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.
The incentive structure within an orga...
Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.
Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There's plenty of feedback that isn't already being taken advantage of, at the level of abstract, high-level philosophy of mind, about the questions of alignment.
I'm not saying that this would necessarily be a step in the wrong direction, but I don't think a Discord server is capable of fixing a deeply entrenched cultural problem among safety researchers.
If moderating the server takes up a few hours of John's time per week the opportunity cost probably isn't worth it.
Worth emphasizing that cognitive work is more than just a parallel to physical work; it is literally Work in the physical sense.
The reduction in entropy required to train a model means that there is a minimum amount of work required to do it.
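As a rough illustration of that floor (my own back-of-the-envelope numbers, assuming only the standard Landauer bound of $k_B T \ln 2$ per bit irreversibly set or erased):

$$W_{\min} \;\ge\; N_{\text{bits}}\, k_B T \ln 2, \qquad \text{e.g. } N_{\text{bits}} = 10^{12},\; T = 300\,\mathrm{K} \;\Rightarrow\; W_{\min} \gtrsim 10^{12} \times 2.9\times10^{-21}\,\mathrm{J} \approx 3\times10^{-9}\,\mathrm{J}.$$

The bound is minuscule next to the energy actually spent on training, but it is strictly nonzero, which is the point.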
I think this is a very important research direction, not merely as an avenue for communicating and understanding AI Safety concerns, but potentially as a framework for developing AI Safety techniques.
There is some minimum amount of cognitive work required to pose an existential threat; perhaps it is much higher than the amount of cognitive work required to perform economically useful tasks.
Can you expect that the applications to interpretability would apply on inputs radically outside of distribution?
My naive intuition is that by taking derivatives you are only describing local behaviour.
(I am "shooting from the hip" epistemically)
A loss of this type of (very weak) interpretability would be quite unfortunate from a practical safety perspective.
This is bad, but perhaps there is a silver lining.
If internal communication within the scaffold appears to be in plain English, it will tempt humans to assume the meaning coincides precisely with the semantic content of the message.
If the chain of thought contains seemingly nonsensical content, it will be impossible to make this assumption.
I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all
No disagreement.
your implicit demand for Evan Hubinger to do more work here is marginally unhelpful
The community seems to be quite receptive to the opinion, so it doesn't seem unreasonable to voice an objection. If you're saying it is primarily the way I've written it that makes it unhelpful, that seems fair.
I originally felt that either question I asked would be reasonably e...
Highly Expected Events Provide Little Information and The Value of PR Statements
Entropy for a discrete random variable is given by $H(X) = -\sum_{x} p(x) \log_2 p(x)$. This quantifies the amount of information that you gain on average by observing the value of the variable.
It is maximized when every possible outcome is equally likely. It gets smaller as the variable becomes more predictable and is zero when the "random" variable is 100% guaranteed to have a specific value.
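A quick worked example (my own numbers), which makes the link to PR statements concrete: a fair coin carries a full bit per observation, while a nearly certain statement carries almost none.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]          # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy_bits([0.99, 0.01]))  # near-certain outcome: ~0.08 bits
print(entropy_bits([1.0]))         # fully predictable "random" variable: 0.0 bits
```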
You've learnt 1 bit of information when you learn t...
This explanation seems overly convenient.
When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won't risk losing your job.
How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence "seemed fine"?
What future events would make you re-evaluate your position and state that the partnership was a bad thing?
Example:
-- A pro-US despot rounds up and tortures to death tens of thousands of...
Personally, I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all, so I think (what I read as) your implicit demand for Evan Hubinger to do more work here is marginally unhelpful; I weakly think quick takes like this are marginally good.
I will add: It's odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causin...
The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on.
What features of architecture of contemporary AI models will occur in future models that pose an existential risk?
What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?
Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment?
Does terminology adopted by AI Safety researchers like "scheming", "inner alignment" or "agent" carve nature at the joints?
I upvoted because I imagine more people reading this would slightly nudge group norms in a direction that is positive.
But being cynical:
My reply definitely missed that you were talking about tunnel densities beyond what has been historically seen.
I'm inclined to agree with your argument that there is a phase shift, but it seems like it has less to do with the fact that there are tunnels, and more to do with the geography becoming less tunnel-like and more open.
I have a couple thoughts on your model that aren't direct refutations of anything you've said here:
I think a crucial factor that is missing from your analysis is the difficulties for the attacker wanting to maneuver within the tunnel system.
In the Vietnam war and the ongoing Israel-Hamas war, the attacking forces appear to favor destroying the tunnels rather than exploiting them to maneuver. [1]
1. The layout of the tunnels is at least partially unknown to the attackers, which mitigates their ability to outflank the defenders. Yes, there may be paths that will allow the attacker to advance safely, but it may be difficult or impossible to reliably di...
I don't think people who disagree with your political beliefs must be inherently irrational.
Can you think of real world scenarios in which "shop elsewhere" isn't an option?
Brainteaser for anyone who doesn't regularly think about units.
Why is it that I can multiply or divide two quantities with different units, but addition or subtraction is generally not allowed?
I think the way arithmetic is being used here is closer in meaning to "dimensional analysis".
"Type checking" through the use of units is applicable to an extremely broad class of calculations beyond Fermi Estimates.
will be developed by reversible computation, since we will likely have hit the Landauer Limit for non-reversible computation by then, and in principle there is basically 0 limit to how much you can optimize for reversible computation, which leads to massive energy savings, and this lets you not have to consume as much energy as current AIs or brains today.
With respect, I believe this to be overly optimistic about the benefits of reversible computation.
Reversible computation means you aren't erasing information, so you don't lose energy in the form of heat (per Landauer[1][2]). But if you don't erase information, you are faced with the issue of where to store it.
If you are performing a series of computations and only have a finite memory to work with, you will eventually need to reinitialise your registers and empty your memory, at which point you incur the energy cost that you had been trying to avoid. [3]
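A rough back-of-the-envelope version of that point (my own illustrative numbers, assuming only the Landauer bound of $k_B T \ln 2$ joules per erased bit): the garbage bits a reversible computer keeps around still have to be erased whenever the finite memory is reinitialised, so the cost is deferred rather than avoided.

```python
import math

K_B = 1.380649e-23                   # Boltzmann constant, J/K
T = 300.0                            # room temperature, K
LANDAUER = K_B * T * math.log(2)     # minimum energy to erase one bit, ~2.9e-21 J

memory_bits = 10**9                  # finite memory available to hold "garbage" bits
garbage_per_op = 8                   # bits kept (not erased) per reversible operation
ops = 10**12                         # total operations in the computation

total_garbage = ops * garbage_per_op
flushes = total_garbage / memory_bits      # times the memory must be reinitialised
energy = total_garbage * LANDAUER          # dissipation paid at those resets

print(f"{flushes:.0f} memory flushes, >= {energy:.1e} J dissipated at the Landauer bound")
```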
Generally, reversible computation allows you to avoid wasting energy by deleting a...
Disagree, but I sympathise with your position.
The "System 1/2" terminology ensures that your listener understands that you are referring to a specific concept as defined by Kahneman.
I'll grant that ChatGPT displays less bias than most people on major issues, but I don't think this is sufficient to dismiss Matt's concern.
My intuition is that if the bias of a few flawed sources (Claude, ChatGPT) is amplified by their widespread use, the fact that it is "less biased than the average person" matters less.
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider: you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and it has a low chance of catalysing dramatic, institutional-level change.
Inspired by Mark Xu's Quick Take on control.
Some thoughts on the prevalence of alignment over control approaches in AI Safety.
My views on your bullet points:
I agree with number 1 pretty totally, and think the conflation of AI safety and AI alignment is a pretty large problem in the AI safety field, driven IMO mostly by LessWrong, which birthed the AI safety community and still has significant influence over it.
I disagree with this important claim on bullet point 2:
I claim, increases X-risk
primarily because I believe the evidential weight of "negative-to low tax alignment strategies are possible" outweighs the shortening of timelines effects, cf Pretraining from Human Feedback whi...
I am concerned our disagreement here is primarily semantic or based on a simple misunderstanding of each other's position. I hope to better understand your objection.
"The p-zombie doesn't believe it's conscious, , it only acts that way."
One of us is mistaken and using a non-traditional definition of p-zombie, or we have different definitions of "belief".
My understanding is that P-zombies are physically identical to regular humans. Their brains contain the same physical patterns that encode their model of the world. That seems, to me, a sufficient physical co...
"After all, the only thing I know that the AI has no way of knowing, is that I am a conscious being, and not a p-zombie or an actor from outside the simulation. This gives me some evidence, that the AI can't access, that we are not exactly in the type of simulation I propose building, as I probably wouldn't create conscious humans."
Assuming for the sake of argument that p-zombies could exist, you do not have special access to the knowledge that you are truly conscious and not a p-zombie.
(As a human convinced I'm currently experiencing consciousness, I agree ...
I do think the terminology of "hacks" and "lethal memetic viruses" conjures up images of extremely unnatural brain exploits when you mean quite a natural process that we already see some humans going through. Some monks/nuns voluntarily remove themselves from the gene pool and, in sects that prioritise ritual devotion over concrete charity work, they are also minimising their impact on the world.
My prior is this level of voluntary dedication (to a cause like "enlightenment") seems difficult to induce and there are much cruder and effective brain hacks a...
As a Petrov, I found it quite engaging and, at times, very stressful. I feel very lucky and grateful that I could take part. I was also located in a different timezone and operating on only a few hours' sleep, which added a lot to the experience!
"I later found out that, during this window, one of the Petrovs messaged one of the mods saying to report nukes if the number reported was over a certain threshold. From looking through the array of numbers that the code would randomly select from, this policy had a ~40% chance of causing a "Nukes Incoming" report (!). Un...
"But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought"."
I am not understanding what this sentence is trying to say. I understand what an acausal trade is. Could you phrase it more directly?
I cannot see why you require the step that the model needs to be reasoning acausally for it to develop a strategy of deceptively hallucinating citations.
What concrete predictions does the model in which this is an example of "acausal collusion" make?
"Cyborgism or AI-assisted research that gets up 5x speedups but applies differentially to technical alignment research"
How do you make meaningful progress and ensure it does not speed up capabilities?
It seems unlikely that a technique exists that is exclusively useful for alignment research and can't be tweaked to help OpenMind develop better optimization algorithms etc.
This is a leak, so keep it between you and me, but the big twist to this year's Petrov Day event is that Generals who are nuked will be forced to watch the 2012 film on repeat.
Edit: Issues 1, 2 and 4 have been partially or completely alleviated in the latest experimental voice model. Subjectively (in <1 hour of use) there seems to be a stronger tendency to hallucinate when pressed on complex topics.
I have been attempting to use ChatGPT's (primarily 4 and 4o) voice feature to have it act as a question-answering, discussion, and receptive conversation partner (separately) for the last year. The topic is usually modern physics.
I'm not going to say that it "works well" but maybe half the time it does work.
The 4 biggest issues that...
Reading your posts gives me the impression that we are both loosely pointing at the same object, but with fairly large differences in terminology and formalism.
While computing exact counter-factuals has issues with chaos, I don't think this poses a problem for my earlier proposal. I don't think it is necessary that the AGI is able to exactly compute the counterfactual entropy production, just that it makes a reasonably accurate approximation.[1]
I think I'm in agreement with your premise that the "constitutionalist form of agency" is flawed. The abse...
Entropy production partially solves the Strawberry Problem:
Change in entropy production per second (against the counterfactual of not acting) is potentially an objectively measurable quantity that can be used in conjunction with other parameters specifying a goal to prevent unexpected behaviour.
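One hedged way to write the proposal down (my own notation, just a sketch of the idea above): penalise the agent for entropy production beyond what would have occurred had it not acted.

$$U(a) \;=\; R_{\text{goal}}(a)\;-\;\lambda\,\bigl(\dot S(a) - \dot S(\varnothing)\bigr),$$

where $\dot S(a)$ is the entropy production rate given action $a$, $\dot S(\varnothing)$ is the counterfactual rate with no action, and $\lambda$ trades goal pursuit off against side effects.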
Rob Bensinger gives Yudkowsky's "Strawberry Problem" as follows:
How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellu...
"Workers regularly trade with billionaires and earn more than $77 in wages, despite vast differences in wealth."
Yes, because the worker has something the billionaire wants (their labor) and so is able to sell it. Yudkowsky's point about trying to sell an Oreo for $77 is that a billionaire isn't automatically going to want to buy something off you if they don't care about it (and neither would an ASI).
"I'm simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are, unless they're va...
I think I previously overvalued the model in which laziness/motivation/mood are primarily internal states that require internal solutions. For me, this model also generated a lot of guilt, because failing to be productive was a personal failure.
But is the problem a lack of "willpower" or is your brain just operating sub-optimally because you're making a series of easily fixable health blunders?
Are you eating healthy?
Are you consuming large quantities of sugar?
Are you sleeping with your phone on your bedside table?
Are you deficient in any vitamins?
Is your sl...
seems like a big step change in its ability to reliably do hard tasks like this without any advanced scaffolding or prompting to make it work.
Keep in mind that o1 is utilising advanced scaffolding to facilitate Chain-Of-Thought reasoning, but it is just hidden from the user.
I'd like access to it.
I agree that the negative outcomes from technological unemployment do not get enough attention but my model of how the world will implement Transformative AI is quite different to yours.
Our current society doesn't say "humans should thrive", it says "professional humans should thrive"
Let us define workers to be the set of humans whose primary source of wealth comes from selling their labour. This is a very broad group that includes people colloquially called working class (manual labourers, baristas, office workers, teachers etc) but we are also including ...
Assuming I blend in and speak the local language, within an order of magnitude of 5 million (edit: USD)
I don't feel your response meaningfully engaged with either of my objections.
I strongly downvoted this post.
1. The optics of actually implementing this idea would be awful. It would permanently damage EA's public image and be raised as a cudgel in every single exposé written about the movement. To the average person, concluding that years in the life of the poorest are worth less than those of someone in a rich, first-world country is an abhorrent statement, regardless of how well crafted your argument is.
2.1 It would also be extremely difficult for rich foreigners to objectively assess the value of QALYs in the most glob...
what does it mean to keep a corporation "in check"
I'm referring to effective corporate governance. Monitoring, anticipating and influencing decisions made by the corporation via a system of incentives and penalties, with the goal of ensuring actions taken by the corporation are not harmful to broader society.
do you think those mechanisms will not be available for AIs
Hopefully, but there are reasons to think that the governance of a corporation controlled (partially or wholly) by AGIs or controlling one or more AGIs directly may be very difficult. I will no...
I'm not sure I could give a meaningful number with any degree of confidence. I lack expertise in corporate governance, bio-safety, and climate forecasting. Additionally, for the condition to be satisfied that corporations are left "unchecked", there would need to be a dramatic Western political shift, which makes speculating extremely difficult.
I will outline my intuition for why (very large, global) human corporations could pose an existential risk (conditional on the existential risk from AI being negligible and global governance being effect...
Thank you for this immediately actionable feedback.
To address your second point, I've rephrased the final sentence to make it more clear.
What I'm attempting to get at is that the rapid proliferation of innovations between developers isn't necessarily a good thing for humanity as a whole.
The most obvious example is instances where a developer is primarily being driven by commercial interest. Short-form video content has radically changed the media that children engage with, but may have also harmed education outcomes.
But my primary concern stems from th...