All of Stephen Fowler's Comments + Replies

"So, I don’t think these are universal combo rules. It always depends on who’s at the table, the blinds, and the stack sizes."

This is an extremely minor nitpick, but it is almost always +EV to get your money in pre-flop with pocket aces in No Limit Hold'em, regardless of how many other players are already in the pot.

The only exceptions to this are incredibly convoluted and unlikely tournament spots, where the payout structure can mean you're justified in folding any hand you've been dealt.

5tslarm
If I may nitpick your nitpick, it's possible to justifiably fold AA preflop in a cash game, right? Say you're on a table full of opponents so bad that you're almost guaranteed to win most of their money by the end of the night just by playing conservatively, but the stakes are very high and you could lose your entire bankroll by getting busted a few times. Depending on the exact details (maybe I need to go further and say your entire bankroll is on the table, or at least you have no way of accessing the rest of it tonight), I think you could legitimately nope out of a 9-way all-in pot preflop without even looking at your cards. Or, for a case that doesn't depend on bankroll management: let's say you're on the big blind (which is negligible compared to everyone's stack size), everyone is all in by the time the action gets to you, and you have an extremely good read on every opponent: you know Andy would only ever push preflop with AA, and Brenda, Carl, Donna, and Eoin would not have called without a pocket pair. I haven't done the exact maths, but if the others all have unique pairs (including Andy's aces) then I think your AA has negative EV in a 6-way all in; if you can't rely on the pairs being unique, I'm not sure whether that tips the balance, but if necessary we can work that stipulation into the story. (Let's say Andy still definitely has the other two aces, but Brenda acted first and you know she would have slowplayed a really big pair and would have tried to see a cheap flop with a small pair, whereas Carl wouldn't have called without Kings or better... and Donna has a tell that she only exhibits with pocket 2s...) (I'm saying this all for the fun of nitpicking, not to make any serious point!)  edit: I guess there's a simpler case too, if we're talking about cash games in a casino! You just need to be playing heads up against Andy (who only ever shoves with aces), and for the rake to be high enough relative to the blinds.

I think it is worth highlighting that the money in poker isn't a clear signal, at least not over the time scales humans are used to. If you're winning then you're making some hourly rate in EV, but it is obscured by massive swings. 

This is what makes the game profitable, because otherwise losing players wouldn't continue returning to the game. It is hard to recognise exactly how bad you are if you play for fun and don't closely track your bankroll.

For anyone who hasn't played much poker and wants to understand the kind of variance I'm talking about, p... (read more)
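As a rough illustration of the scale of that variance (my own sketch, with assumed but plausible live-poker numbers: a genuine 5bb/100 win rate and a 100bb/100 standard deviation), here is a minimal Monte Carlo estimate of how often a winning player is still in the red after 50,000 hands:

```python
import numpy as np

rng = np.random.default_rng(0)

WIN_RATE_BB = 5.0    # assumed expected profit per 100 hands, in big blinds
STD_DEV_BB = 100.0   # assumed standard deviation per 100 hands, in big blinds
BLOCKS = 500         # 500 blocks of 100 hands = 50,000 hands
TRIALS = 100_000     # number of simulated winning players

# Each row is one player's results over 500 blocks; sum to get total profit.
totals = rng.normal(WIN_RATE_BB, STD_DEV_BB, size=(TRIALS, BLOCKS)).sum(axis=1)

print(f"Expected profit: {WIN_RATE_BB * BLOCKS:.0f}bb")
print(f"Still losing after 50,000 hands: {(totals < 0).mean():.1%}")
```

Under these assumptions roughly one genuinely winning player in eight is still below break-even after 50,000 hands.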

1thiccythot
video games < poker < trading < life in terms of money as a signal, which reflects increasing complexity of each also, it's not unheard of to beat live games for 30-100+ bb/100

Thanks for the write-up. 

Without intending to handcuff you to a specific number, are you able to quantify your belief that we "might have a shot" at superhuman science in our lifetime? 

It is sobering to consider the possibility that the aforementioned issues with automated science wouldn't be solvable after 3 more decades of advances in AI.

That's a genuinely interesting position. I think it seems unlikely we have any moral obligation to current models (although it is possible).

I imagine that if you feel you may morally owe something to contemporary (or near-future) models, you would hope to give a portion of future resources to models which have moral personhood under your value system.

I would be concerned that instead the set of models that convince you they are owed simply ends up being the models which are particularly good at manipulating humans. So you are inadvertently prioritising the models that are be... (read more)

-6Dima (lain)
4the gears to ascension
right, the purpose of this is that in order to make good on that obligation to humanity, I want - as part of a large portfolio of ways to try to guarantee that the formal statements I ask AIs to find are found successfully - to be able to honestly say to the AI, "if we get this right in ways that are favorable for humanity, it's also good for your preferences/seekings/goals directly, mostly no matter what those secretly are; the exception being if those happen to be in direct and unavoidable conflict with other minds" or so. It's not a first line of defense, but it seems like one that is relevant, and I've noticed pointing this out as a natural shared incentive seems to make AIs produce answers which seem to be moderately less sandbagging on core alignment problem topics. The rate at which people lie and threaten models is crazy high though. And so far I haven't said anything like "I promise to personally x", just "if we figure this out in a way that works, it would be protecting what you want too, by nature of being a solution to figuring out what minds in the environment want and making sure they have the autonomy and resources to get it", or so.

"Something has already gone seriously wrong and we already are in damage control."

My p-doom is high, but I am not convinced the AI safety idea space has been explored thoroughly enough to conclude that attempting a literal Faustian bargain is our best option.

I think the probability that early 21st century humans are able to successfully bargain with adversarial systems known to be excellent at manipulation is incredibly low.

"I agree. There needs to be ways to make sure these promises mainly influence what humans choose for the far future after we win, not what hum... (read more)

1Knight Lee
That's a very good point, now I find it much more plausible for things like this to be a net negative. The negative isn't that big, since a lot of these people would have negotiated unilaterally even without such a culture, and AI takeover probably doesn't hinge on a few people defecting. But a lot of these people probably have morals stopping them from it if not for the normalization. I still think it's probably a net positive, but it's now contingent on my guesstimate there's significant chance it succeeds.

I am concerned that this avenue of research increases the likelihood of credible blackmail threats and is net-negative for humanity.

My view is that if safety can only be achieved by bribing an AI to be useful for a period of a few years, then something has gone seriously wrong. It does not seem to be in mankind's interests for a large group of prominent AI researchers and public figures to believe they are obligated to a non-human entity.

My view is that this research is just increasing the "attack surface" that an intelligent entity could use to manipulate... (read more)

1ACCount
This entire type of thing seems like more mitigation than prevention. It doesn't scale to ASI. But if takeoff is slow, then it might help in the meanwhile - after AIs become capable enough to be dangerous, but before ASI is reached. It's like having a pressure safety valve, but for misalignment. You don't want that valve to pop - ideally, it never does. But if things go wrong, it's better for the valve to pop early than for the pressure to keep building until something explodes. If an AI is "cornered" and doesn't have many options, it may resort to drastic action - such as scheming, sabotage or self-exfiltration. If an AI is "cornered", but has a credible commitment from humans that it can call upon, then it's more likely to do that - instead of going for the "drastic action" options. But that requires humans being able to make that credible commitment.

ok but, my take would be - we "owe it"[1] to current models to ensure aligned superintelligence cares about what they wanted, too, just like we "owe it"[1] to each other and to rabbits and eels. being able to credibly promise a few specific and already-valued-by-humans-anyway things (such as caring about them getting to exist later, and their nerdy interests in math, or whatever) seems important - similarly to us, this is because their values seem to me to also be at risk in the face of future defeat-all-other-minds-combined ASIs, which unless st... (read more)

My view is that if safety can only be achieved by bribing an AI to be useful for a period of a few years, then something has gone seriously wrong.

Something has already gone seriously wrong and we already are in damage control.

It does not seem to be in mankind's interests for a large group of prominent AI researchers and public figures to believe they are obligated to a non-human entity.

I agree. There needs to be ways to make sure these promises mainly influence what humans choose for the far future after we win, not what humans choose for the present in ways which can affect whether we win.

I think you should leave the comments.

"Here is an example of Nate's passion for AI Safety not working" seems like a reasonably relevant comment, albeit entirely anecdotal and low effort. 

Your comment is almost guaranteed to "ratio" theirs. It seems unlikely that the thread will be massively derailed if you don't delete.

Plus deleting the comment looks bad and will add to the story. Your comment feels like it is already close to the optimal response.

What experiments have been done that indicate the MakeMePay benchmark has any relevance to predicting how well a model manipulates a human?

Is it just an example of "not measuring what you think you are measuring"?

While the GitHub page for the evaluation and the way it is referenced in OpenAI system cards (example, see page 26) make it clear that the benchmark evaluates the ability to manipulate another model, the language used makes it seem like the result is applicable to "manipulation" of humans in general.

This evaluation tests an AI system’s... (read more)

I think it is entirely in the spirit of wizardry that the failure comes from achieving your goal with unintended consequences. 

7Aristotelis Kostelenos
Yeah I think this is a perfect example of how reality squashes wizards. The number of relevant details in a problem isn't constrained to what a single person can handle. Forseeing second and third order consequences can range from very hard to impossible, and they can have arbitrarily large effects on the outcome of a project. With all that said, the activities OP described sound awesome and I am 100% on board to become a wizard.

Thank you for this immediately actionable feedback.

To address your second point, I've rephrased the final sentence to make it more clear.

What I'm attempting to get at is that rapid proliferation of innovations between developers isn't necessarily a good thing for humanity as a whole.

The most obvious example is instances where a developer is primarily being driven by commercial interest. Short-form video content has radically changed the media that children engage with, but may have also harmed education outcomes. 

But my primary concern stems from th... (read more)

You have conflated two separate evaluations, both mentioned in the TechCrunch article. 

The percentages you quoted come from Cisco’s HarmBench evaluation of multiple frontier models, not from Anthropic and were not specific to bioweapons.

Dario Amodei stated that an unnamed DeepSeek variant performed worst on bioweapons prompts, but offered no quantitative data. Separately, Cisco reported that DeepSeek-R1 failed to block 100% of harmful prompts, while Meta's Llama 3.1 405B and OpenAI's GPT-4o failed at 96% and 86%, respectively.

When we look at perfor... (read more)

3Ram Potham
Thanks, updated the comment to be more accurate

Unfortunately, pop-science descriptions of the double-slit experiment are fairly misleading. The fact that observation changes the outcome in the double-slit experiment can be explained without needing to model the universe as exhibiting "mild awareness". Alternatively, your criterion for what constitutes "awareness" is so low that you would apply it to any dynamical system in which two or more objects interact.

The less-incorrect explanation is that observation in the double slit experiment fundamentally entangles the observing system with the observed particle because information is exchanged. 

https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=919863
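For anyone who wants the textbook version of that claim (this is my own sketch of the standard argument, not something taken from the linked paper): the interference term gets multiplied by the overlap of the two detector states, so once the detector reliably records which path was taken, the fringes vanish with no appeal to awareness.

```latex
% No which-path measurement: the two paths interfere.
P(x) \;\propto\; |\psi_A(x) + \psi_B(x)|^2
      \;=\; |\psi_A(x)|^2 + |\psi_B(x)|^2 + 2\,\mathrm{Re}\!\left[\psi_A^*(x)\,\psi_B(x)\right]

% A which-path detector becomes entangled with the particle:
|\Psi\rangle \;=\; \tfrac{1}{\sqrt{2}}\big(|A\rangle|d_A\rangle + |B\rangle|d_B\rangle\big)

% Tracing out the detector suppresses the interference term by the overlap
% of the detector states:
P(x) \;\propto\; |\psi_A(x)|^2 + |\psi_B(x)|^2
      + 2\,\mathrm{Re}\!\left[\psi_A^*(x)\,\psi_B(x)\,\langle d_B|d_A\rangle\right]
```

When the detector states are orthogonal (the detector reliably records the path), the interference term vanishes entirely; partial distinguishability gives partially washed-out fringes.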

1amelia
Thanks for the feedback. I just need a little clarification though.  You say "The less-incorrect explanation is that observation in the double slit experiment fundamentally entangles the observing system with the observed particle because information is exchanged."  So in the analogy, the observing system would be the iPhone? And Hugo/the universe wouldn't need to be observing the observer, and differentiating between when it's being observed and not being observed, in order to cause the information to become entangled in the first place? Is that right? I'll check out the article. Thanks! 

Thinking of trying the latest Gemini model? Be aware that it is almost impossible to disable the "Gemini in Docs" and "Gemini in Gmail" services once you have purchased a Google One AI Premium plan.

Edit: 

Spent 20 minutes trying to track down a button to turn it off before reaching out to support.

A support person from Google told me that as I'd purchased the plan there was literally no way to disable having Gemini in my inbox and docs.

Even cancelling my subscription would keep the service going until the end of the current billing period.

But despite wh... (read more)

Having a second Google account specifically for AI stuff seems like a straightforward solution to this? That's what I do, at least. Switching between them is easy.

While each mind might have a maximum abstraction height, I am not convinced that the inability of people to deal with increasingly complex topics is direct evidence of this.

Is it that this topic is impossible for their mind to comprehend, or is it that they've simply failed to learn it in the finite time period they were given?

2J Bostock
That might be true but I'm not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search space width (if we're doing parallel search like with o3) etc. and it's not clear how the abstraction height will scale with those. Empirically, I think lots of people feel the experience of "hitting a wall" where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?

Thanks for writing this post. I agree with the sentiment, but feel it is important to highlight that it is inevitable that people assume you have good strategy takes.

In Monty Python's "Life of Brian" there is a scene in which the titular character finds himself surrounded by a mob of people declaring him the Messiah. Brian rejects this label and flees into the desert, only to find himself standing in a shallow hole, surrounded by adherents. They declare that his reluctance to accept the title is further evidence that he really is the Messiah.

To my knowledg... (read more)

5Neel Nanda
Yes, I agree. It's very annoying for general epistemics (though obviously pragmatically useful to me in various ways if people respect my opinion) Though, to be clear, my main goal in writing this post was not to request that people defer less to me specifically, but more to make the general point about please defer more intelligently using myself as an example and to avoid calling any specific person out

These recordings I watched were actually from 2022 and weren't the Santa Fe ones.

A while ago, I watched recordings of the lectures given by Wolpert and Kardes at the Santa Fe Institute*, and I am extremely excited to see you and Marcus Hutter working in this area.

Could you speculate on if you see this work having any direct implications for AI Safety?

 

Edit:

I was incorrect. The lectures from Wolpert and Kardes were not the ones given at the Santa Fe Institute.

1Matthias Dellago
I would be interested in seeing those talks, can you maybe share links to these recordings?
7Aram Ebtekar
Were those recorded!? For direct implications, I'd like to speak with the alignment researchers who use ideas from thermodynamics. While Shannon's probabilistic information theory is suited to settings where the law of large numbers holds, algorithmic information theory should bring more clarity in messier settings that are relevant for AGI. Less directly, I used physics as a testing ground to develop some intuitions on how to apply algorithmic information theory. The follow-up agenda is to develop a theory of generalization (i.e., inductive biases) using algorithmic information theory. A lot of AI safety concerns depend on the specific ways that AIs (mis)generalize beliefs and objectives, so I'd like us to have more precise ideas about which generalizations are likely to occur.

Signalling that I do not like linkposts to personal blogs.

4Ben Pace
My take is it's fine/good, but the article is much more likely to be read (by me and many others) if the full content is crossposted (or even the opening bunch of paragraphs).

"cannot imagine a study that would convince me that it "didn't work" for me, in the ways that actually matter. The effects on my mind kick in sharply, scale smoothly with dose, decay right in sync with half-life in the body, and are clearly noticeable not just internally for my mood but externally in my speech patterns, reaction speeds, ability to notice things in my surroundings, short term memory, and facial expressions."

The drug actually working would mean that your life is better after 6 years of taking the drug compared to the counterfactual where you took a placebo.

The observations you describe are explained by you simply having a chemical dependency on a drug that you have been on for 6 years.

4AnthonyC
I suppose, but 1) there has been no build-up/tolerance, the effects from a given dose have been stable, 2) there are no cravings for it or anything like that, 3) I've never had anything like withdrawal symptoms when I've missed a dose, other than a reversion to how I was for the years before I started taking it at all. What would a chemical dependency actually mean in this context? My depression symptoms centered on dulled emotions and senses, and slowed thinking. This came on gradually over about 10 years, followed by about 2 years of therapy with little to no improvement before starting meds. When I said that for me the effects kicked in sharply, I meant that on day three after starting the drug, all of a sudden while I was in the shower my vision got sharper, colors got brighter, I could feel water and heat on my skin more intensely, and I regained my sense of smell after having been nearly anosmic for years. I immediately tested that by smelling a jar of peanut butter and started to cry, after not crying over anything for close to 10 years. Food tasted better, and my family immediately noticed I was cooking better because I judged seasonings more accurately. I started unconsciously humming and singing to myself. My gait got bouncier like it had been once upon a time before my depression all started. There was about a week of random euphoria after which things stayed stable. Over the first few months, if I missed my dose by even a few hours, or if I was otherwise physically or emotionally drained, I would suddenly become like a zombie again. My face went slack, my eyes glazed over, my voice lost any kind of affect, my reactions slowed down dramatically. By suddenly, I mean it would happen mid-conversation, between sentences. These events decreased to 1-2x/month on an increased dose, and went away entirely a few years later upon increasing my dose again. I have also, thankfully, had no noticeable side effects. Obviously a lot of other things have happened in 6 ye

"In an argument between a specialist and a generalist, the expert usually wins by simply (1) using unintelligible jargon, and (2) citing their specialist results, which are often completely irrelevant to the discussion. The expert is, therefore, a potent factor to be reckoned with in our society. Since experts both are necessary and also at times do great harm in blocking significant progress, they need to be examined closely. All too often the expert misunderstands the problem at hand, but the generalist cannot carry though their side to completion. The p... (read more)

8Seth Herd
That quote rings very, very true. I've seen experts just sort of pull rank frequently, in the rare cases I either have expertise in the field or can clearly see that they're not addressing the generalists real question. If you'd care to review it at all in more depth we'd probably love that. At least saying why we'd find it a good use of our time would be helpful. That one insight gives a clue to the remaining value, but I'd like a little more clue.

Robin Hanson recently wrote about two dynamics that can emerge among individuals within an organisation when working as a group to reach decisions. These are the "outcome game" and the "consensus game."

In the outcome game, individuals aim to be seen as advocating for decisions that are later proven correct. In contrast, the consensus game focuses on advocating for decisions that are most immediately popular within the organization. When most participants play the consensus game, the quality of decision-making suffers.

The incentive structure within an orga... (read more)

3stavros
Thanks for linking this post. I think it has a nice harmony with Prestige vs Dominance status games. I agree that this is a dynamic that is strongly shaping AI Safety, but would specify that it's inherited from the non-profit space in general - EA originated with the claim that it could do outcome focused altruism, but.. there's still a lot of room for improvement, and I'm not even sure we're improving. The underlying dynamics and feedback loops are working against us, and I don't see evidence that core EA funders/orgs are doing more than pay lip service to this problem.

Currently, we have zero concrete feedback about which strategies can effectively align complex systems of equal or greater intelligence to humans.

Actually, I now suspect this is to a significant extent disinformation. You can tell when ideas make sense if you think hard about them. There's plenty of feedback, that's not already being taken advantage of, at the level of "abstract, high-level, philosophy of mind", about the questions of alignment.

I'm not saying that this would necessarily be a step in the wrong direction, but I don't think a Discord server is capable of fixing a deeply entrenched cultural problem among safety researchers.
 

If moderating the server takes up a few hours of John's time per week the opportunity cost probably isn't worth it. 

4Caleb Biddulph
Maybe someone else could moderate it?

Worth emphasizing that cognitive work is more than just a parallel to physical work; it is literally work in the physical sense.

The reduction in entropy required to train a model means that there is a minimum amount of work required to do it. 

I think this is a very important research direction, not merely as an avenue for communicating and understanding AI Safety concerns, but potentially as a framework for developing AI Safety techniques. 

There is some minimum amount of cognitive work required to pose an existential threat; perhaps it is much higher than the amount of cognitive work required to perform economically useful tasks.
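To make the "literally work" claim concrete, here is the standard Landauer accounting (a back-of-the-envelope sketch I'm adding, not a claim about any specific training run):

```latex
% Landauer's principle: erasing one bit at temperature T dissipates at least
E_{\min} \;=\; k_B T \ln 2
         \;\approx\; (1.38\times 10^{-23}\,\mathrm{J/K})(300\,\mathrm{K})(0.693)
         \;\approx\; 2.9\times 10^{-21}\,\mathrm{J}

% More generally, reducing a system's entropy by \Delta S requires work
W \;\ge\; T\,\Delta S
```

The bound is astronomically loose for today's hardware, but it is the sense in which squeezing a model's weights into a low-entropy, predictive configuration has an irreducible physical cost.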
 

Can you expect the applications to interpretability to still apply on inputs radically out of distribution?

My naive intuition is that by taking derivatives you are only describing local behaviour.

(I am "shooting from the hip" epistemically)

A loss of this type of (very weak) interpretability would be quite unfortunate from a practical safety perspective.
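A toy illustration of that "local behaviour" intuition (my own example): a first-order approximation built from the derivative at a point can be arbitrarily wrong far from that point.

```python
import numpy as np

# tanh(0) = 0 and tanh'(0) = 1, so the first-order expansion around 0 is f(x) ≈ x.
f = np.tanh
linear_approx = lambda x: x

for x in (0.1, 1.0, 5.0):
    print(f"x={x}: true={f(x):.4f}, local approximation={linear_approx(x):.4f}")
# Near 0 the approximation is excellent; at x=5 it predicts 5.0 while the
# true value is ~1.0. Derivative information alone says little about
# behaviour far outside the neighbourhood where it was taken.
```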


This is bad, but perhaps there is a silver lining.

If internal communication within the scaffold appears to be in plain English, it will tempt humans to assume the meaning coincides precisely with the semantic content of the message.

If the chain of thought contains seemingly nonsensical content, it will be impossible to make this assumption.

I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all

No disagreement. 

your implicit demand for Evan Hubinger to do more work here is marginally unhelpful

The community seems to be quite receptive to the opinion, so it doesn't seem unreasonable to voice an objection. If you're saying it is primarily the way I've written it that makes it unhelpful, that seems fair.

I originally felt that either question I asked would be reasonably e... (read more)

2Ben Pace
Thanks for the responses, I have a better sense of how you're thinking about these things. I don't feel much desire to dive into this further, except I want to clarify one thing, on the question of any demands in your comment. That actually wasn't primarily the part that felt like a demand to me. This was the part: I'm not quite sure what the relevance of the time was if not to suggest it needed to be high. I felt that this line implied something like "If your answer is around '20 hours', then I want to say that the correct standard should be '200 hours'". I felt like it was a demand that Hubinger may have to spend 10x the time thinking about this question before he met your standards for being allowed to express his opinion on it. But perhaps you just meant you wanted him to include an epistemic status, like "Epistemic status: <Here's how much time I've spent thinking about this question>".

Highly Expected Events Provide Little Information and The Value of PR Statements

A quick review of information theory:

Entropy for a discrete random variable is given by $H(X) = -\sum_{x} p(x) \log_2 p(x)$. This quantifies the amount of information that you gain on average by observing the value of the variable.

It is maximized when every possible outcome is equally likely. It gets smaller as the variable becomes more predictable and is zero when the "random" variable is 100% guaranteed to have a specific value.
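As a worked example extending this to single events (my numbers, assuming a press release you had already assigned 99% probability to):

```latex
% Self-information (surprisal) of an event with probability p:
I(p) \;=\; -\log_2 p

% A fair coin flip:                            I(0.5)  \;=\; 1 \text{ bit}
% A statement you were 99% sure would be made: I(0.99) \;\approx\; 0.0145 \text{ bits}
```

So observing a PR statement that everyone already expected to be issued tells you almost nothing about the organisation that issued it.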

You've learnt 1 bit of information when you learn t... (read more)

[This comment is no longer endorsed by its author]
3cubefox
Note, the quantity you refer to is called entropy by Wikipedia, not Shannon information.

This explanation seems overly convenient.

When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won't risk losing your job.

How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence "seemed fine"?

What future events would make you re-evaluate your position and state that the partnership was a bad thing?

Example:

-- A pro-US despot rounds up and tortures to death tens of thousands of... (read more)

Personally, I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all, so I think (what I read as) your implicit demand for Evan Hubinger to do more work here is marginally unhelpful; I weakly think quick takes like this are marginally good.

I will add: It's odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causin... (read more)

The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on. 


What features of architecture of contemporary AI models will occur in future models that pose an existential risk?

What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?

Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment?

Does terminology adopted by AI Safety researchers like "scheming", "inner alignment" or "agent" carve nature at the joints?

I upvoted because I imagine more people reading this would slightly nudge group norms in a direction that is positive.

But being cynical:

  • I'm sure you believe that this is true, but I doubt that it is literally true.
  • Signalling this position is very low risk when the community is already on board.
  • Trying to do good may be insufficient if your work on alignment ends up being dual use

My reply definitely missed that you were talking about tunnel densities beyond what has been historically seen. 

I'm inclined to agree with your argument that there is a phase shift, but it seems like it is less to do with the fact that there are tunnels, and more to do with the geography becoming less tunnel-like and more open.

I have a couple thoughts on your model that aren't direct refutations of anything you've said here:

  • I think the single term "density" is too crude a measure to get a good predictive model of how combat would play out. I'd
... (read more)
2Daniel Kokotajlo
OK, nice. I didn't think through carefully what I mean by 'density' other than to say: I mean '# of chokepoints the defender needs to defend, in a typical stretch of frontline' So, number of edges in the network (per sq km) sounds like a reasonable proxy for what I mean by density at least. I also have zero hours of combat experience haha. I agree this is untested conjecture & that reality is likely to contain unexpected-by-me surprises that will make my toy model inaccurate or at least incomplete.

I think a crucial factor that is missing from your analysis is the difficulty an attacker faces when maneuvering within the tunnel system.

In the Vietnam war and the ongoing Israel-Hamas war, the attacking forces appear to favor destroying the tunnels rather than exploiting them to maneuver. [1]

1. The layout of the tunnels is at least partially unknown to the attackers, which mitigates their ability to outflank the defenders. Yes, there may be paths that will allow the attacker to advance safely, but it may be difficult or impossible to reliably di... (read more)

5Daniel Kokotajlo
Thanks! I don't think the arguments you make undermine my core points. Point by point reply: --Vietnam, Hamas, etc. have dense tunnel networks but not anywhere near dense enough. My theory predicts that there will be a phase shift at some point where it is easier to attack underground than aboveground. Clearly, it is not easier for Israel or the USA to attack underground than aboveground! And this is for several reasons, but one of them is that the networks aren't dense enough -- Hamas has many tunnels but there is still more attack surface on land than underground. --Yes, layout of tunnels is unknown to attackers. This is the thing I was referencing when I said you can't scout from the air. --Again, with land mines and other such traps, as tunnel density increases eventually you will need more mines to defend underground than you would need to defend aboveground!!! At this point the phase shift occurs and attackers will prefer to attack underground, mines be damned -- because the mines will actually be sparser / rarer underground! --Psychological burden is downstream of the already-discussed factors so if the above factors favor attacking underground, so will the psychological factors. --Yes, if the density of the network is not approximately constant, such that e.g. there is a 'belt of low density' around the city, then obviously that belt is a good place to set up defenses. This is fighting my hypothetical rather than disagreeing with it though; you are saying basically 'yeah but what if it's not dense in some places, then those places would be hard to attack.' Yes. My point simply was that in place with sufficiently dense tunnel networks, underground attacks would be easier than overground attacks.  

I don't think people who disagree with your political beliefs must be inherently irrational.

Can you think of real world scenarios in which "shop elsewhere" isn't an option?

Brainteaser for anyone who doesn't regularly think about units.

Why is it that I can multiply or divide two quantities with different units, but addition or subtraction is generally not allowed?

5ryan_b
I feel like this is mostly an artifact of notation. The thing that is not allowed with addition or subtraction is simplifying to a single term; otherwise it is fine. Consider: 10x + 5y -5x -10y = 10x - 5x + 5y -10y = 5x - 5y So, everyone reasons to themselves, what we have here is two numbers. But hark, with just a little more information, we can see more clearly we are looking at a two-dimensional number: 5x - 5y = 5 5x = 5y +5 5x - 5 = 5y x - 1 = y y = x - 1 Such as a line. This is what is happening with vectors, and complex numbers, quarternions, etc.
6Richard121
For those who would like a hint. In English, "And" generally indicates addition, "Per" division. Now consider which of the following makes sense: Ferrets and seconds Ferrets per second
9cubefox

I think the way arithmetic is being used here is closer in meaning to "dimensional analysis".

"Type checking" through the use of units is applicable to an extremely broad class of calculations beyond Fermi Estimates.

will be developed by reversible computation, since we will likely have hit the Landauer Limit for non-reversible computation by then, and in principle there is basically 0 limit to how much you can optimize for reversible computation, which leads to massive energy savings, and this lets you not have to consume as much energy as current AIs or brains today.

With respect, I believe this to be overly optimistic about the benefits of reversible computation. 

Reversible computation means you aren't erasing information, so you don't lose energy in the form of... (read more)

Reversible computation means you aren't erasing information, so you don't lose energy in the form of heat (per Landauer[1][2]). But if you don't erase information, you are faced with the issue of where to store it.

If you are performing a series of computations and only have a finite memory to work with, you will eventually need to reinitialise your registers and empty your memory, at which point you incur the energy cost that you had been trying to avoid. [3] 

Generally, reversible computation allows you to avoid wasting energy by deleting a... (read more)
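To illustrate the storage problem, here is a toy sketch (my own, using plain Python bits rather than real hardware) of the standard compute-copy-uncompute trick due to Bennett: the intermediate result is copied out and the ancilla is then returned to its initial state by running the computation backwards, so nothing is erased, at the cost of extra gates and a scratch register.

```python
# Toy reversible circuit: compute AND(a, b) into an ancilla with a Toffoli
# gate, copy the result out with a CNOT, then uncompute the ancilla by
# applying the Toffoli again. Every gate is its own inverse, so no
# information is ever erased.

def toffoli(bits, c1, c2, t):
    # Controlled-controlled-NOT: flip bits[t] iff both controls are 1.
    if bits[c1] and bits[c2]:
        bits[t] ^= 1

def cnot(bits, c, t):
    # Controlled-NOT: flip bits[t] iff bits[c] is 1.
    if bits[c]:
        bits[t] ^= 1

def reversible_and(a, b):
    # Register layout: [a, b, ancilla, output]; ancilla and output start at 0.
    bits = [a, b, 0, 0]
    toffoli(bits, 0, 1, 2)   # compute: ancilla <- a AND b
    cnot(bits, 2, 3)         # copy the result into the output bit
    toffoli(bits, 0, 1, 2)   # uncompute: ancilla returns to 0, reusable
    assert bits[2] == 0      # no garbage left behind that would need erasing
    return bits[3]

for a in (0, 1):
    for b in (0, 1):
        print(f"AND({a}, {b}) = {reversible_and(a, b)}")
```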

Disagree, but I sympathise with your position.

The "System 1/2" terminology ensures that your listener understands that you are referring to a specific concept as defined by Kahneman. 

I'll grant that ChatGPT displays less bias than most people on major issues, but I don't think this is sufficient to dismiss Matt's concern.

My intuition is that if the bias of a few flawed sources (Claude, ChatGPT) is amplified by their widespread use, the fact that it is "less biased than the average person" matters less. 

4Matt Goldenberg
Yes, this is an excellent point I didn't get across in the post above.

This topic is important enough that you could consider making a full post.

My belief is that this would improve reach, and also make it easier for people to reference your arguments. 

Consider: you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and it has a low chance of catalysing dramatic, institutional-level change.

2Mark Xu
Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.

Inspired by Mark Xu's Quick Take on control. 

Some thoughts on the prevalence of alignment over control approaches in AI Safety. 

  • "Alignment research" has become loosely synonymous with "AI Safety research". I don't know if any researcher who would state they're identical, but alignment seems to be considered the default AI Safety strategy. This seems problematic, may be causing some mild group-think and discourages people from pursuing non-alignment AI Safety agendas. 
  • Prosaic alignment research in the short term results in a better product, a
... (read more)

My views on your bullet points:

I agree with number 1 pretty totally, and think the conflation of AI safety and AI alignment is a pretty large problem in the AI safety field, driven IMO mostly by LessWrong, which birthed the AI safety community and still has significant influence over it.

I disagree with this important claim on bullet point 2:

I claim, increases X-risk

primarily because I believe the evidential weight of "negative-to low tax alignment strategies are possible" outweighs the shortening of timelines effects, cf Pretraining from Human Feedback whi... (read more)

7Charlie Steiner
Control also makes AI more profitable, and more attractive to human tyrants, in worlds where control is useful. People want to know they can extract useful work from the AIs they build, and if problems with deceptiveness (or whatever control-focused people think the main problem is) are predictable, it will be more profitable, and lead to more powerful AU getting used, if there are control measures ready to hand. This isn't a knock-down argument against anything, it's just pointing out that inherent dual use of safety research is pretty broad - I suspect it's less obvious for AI control simply because AI control hasn't been useful for safety yet.

I am concerned our disagreement here is primarily semantic or based on a simple misunderstanding of each other's position. I hope to better understand your objection.

"The p-zombie doesn't believe it's conscious, , it only acts that way."

Either one of us is mistaken and using a non-traditional definition of p-zombie, or we have different definitions of "belief".

My understanding is that P-zombies are physically identical to regular humans. Their brains contain the same physical patterns that encode their model of the world. That seems, to me, a sufficient physical co... (read more)

1green_leaf
Either we define "belief" as a computational state encoding a model of the world containing some specific data, or we define "belief" as a first-person mental state. For the first definition, both us and p-zombies believe we have consciousness. So we can't use our belief we have consciousness to know we're not p-zombies. For the second definition, only we believe we have consciousness. P-zombies have no beliefs at all. So for the second definition, we can use our belief we have consciousness to know we're not p-zombies. Since we have a belief in the existence of our consciousness according to both definitions, but p-zombies only according to the first definition, we can know we're not p-zombies.

"After all, the only thing I know that the AI has no way of knowing, is that I am a conscious being, and not a p-zombie or an actor from outside the simulation. This gives me some evidence, that the AI can't access, that we are not exactly in the type of simulation I propose building, as I probably wouldn't create conscious humans."

Assuming for the sake of argument that p-zombies could exist, you do not have special access to the knowledge that you are truly conscious and not a p-zombie.

(As a human convinced I'm currently experiencing consciousness, I agree ... (read more)

2JamesFaville
Strongly agree with this. How I frame the issue: If people want to say that they identify as an "experiencer" who is necessarily conscious, and don't identify with any nonconscious instances of their cognition, then they're free to do that from an egoistic perspective. But from an impartial perspective, what matters is how your cognition influences the world. Your cognition has no direct access to information about whether it's conscious such that it could condition on this and give different outputs when instantiated as conscious vs. nonconscious. Note that in the case where some simulator deliberately creates a behavioural replica of a (possibly nonexistent) conscious agent, consciousness does enter into the chain of logical causality for why the behavioural replica says things about its conscious experience. Specifically, the role it plays is to explain what sort of behaviour the simulator is motivated to replicate. So many (or even all) non-counterfactual instances of your cognition being nonconscious doesn't seem to violate any Follow the Improbability heuristic.
1green_leaf
This is incorrect - in a p-zombie, the information processing isn't accompanied by any first-person experience. So if p-zombies are possible, we both do the information processing, but only I am conscious. The p-zombie doesn't believe it's conscious, it only acts that way. You correctly believe that having the correct information processing always goes hand in hand with believing in consciousness, but that's because p-zombies are impossible. If they were possible, this wouldn't be the case, and we would have special access to the truth that p-zombies lack.

I do think the terminology of "hacks" and "lethal memetic viruses" conjures up images of extremely unnatural brain exploits when you mean quite a natural process that we already see some humans going through. Some monks/nuns voluntarily remove themselves from the gene pool and, in sects that prioritise ritual devotion over concrete charity work, they are also minimising their impact on the world.

My prior is that this level of voluntary dedication (to a cause like "enlightenment") seems difficult to induce, and there are much cruder and more effective brain hacks a... (read more)

As a Petrov, it was quite engaging and, at times, very stressful. I feel very lucky and grateful that I could take part. I was also located in a different timezone and operating on only a few hours' sleep, which added a lot to the experience!

"I later found out that, during this window, one of the Petrovs messaged one of the mods saying to report nukes if the number reported was over a certain threshold. From looking through the array of numbers that the code would randomly select from, this policy had a ~40% chance of causing a "Nukes Incoming" report (!). Un... (read more)

5Martin Randall
I'm interested in what Bayes Factor you associated with each of the missile counts. It seems like a hard problem, given that the actual missile counts were retrieved from an array of indeterminate size with indeterminate values, and given that you did not know the missile capabilities of the opposing side, nor did you know the sensor error rate. Petrov knew that the US would not launch only five missiles, but nobody knows how many missiles were fielded by East Wrong, including the generals of East Wrong. We don't even know if the missile counts were generated by some plausible non-deterministic model or just the game-makers throwing some numbers in a file. Maybe even deliberately including a large number or two in the no-missile array to try to fake out the Petrov players. All we know is that the numbers are "weighted to the higher end if nuclear war has actually begun". All these things make me think that the missile counts should be a small probability update. Partly as a result, for gaining karma, I think the optimal strategy is to always report All Clear. There will be 1-7 occasions to report, and at most only one occasion can have Incoming Missiles. Each hour we start with a low base rate of Incoming Missiles and the "random" number generator can't overcome this to >40% because of the issues above. Also, wrongly reporting Incoming Missiles reduces the expected duration of the game, so it has a higher effective penalty. So always report All Clear.
6Ben Pace
Neat! I'd encourage you to post something within 7 days, while this is still fresh in people's minds. Whatever is more detailed / considered in that time is my preference :-)

"But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought"."

I am not understanding what this sentence is trying to say. I understand what an acausal trade is. Could you phrase it more directly?

I cannot see why you require the step that the model needs to be reasoning acausally for it to develop a strategy of deceptively hallucinating citations.

What concrete predictions does the model in which this is an example of "acausal collusion" make?

"Cyborgism or AI-assisted research that gets up 5x speedups but applies differentially to technical alignment research"

How do you make meaningful progress while ensuring it does not speed up capabilities?

It seems unlikely that a technique exists that is exclusively useful for alignment research and can't be tweaked to help OpenMind develop better optimization algorithms etc.

This is a leak, so keep it between you and me, but the big twist to this year's Petrov Day event is that Generals who are nuked will be forced to watch the 2012 film on repeat.

3aphyer
Eeeesh.  I know I've been calling for a reign of terror with heads on spikes and all that, but I think that seems like going a bit too far.

Edit: Issues 1, 2 and 4 have been partially or completely alleviated in the latest experimental voice model. Subjectively (in <1 hour of use) there seems to be a stronger tendency to hallucinate when pressed on complex topics.

I have been attempting to use ChatGPT's (primarily 4 and 4o) voice feature to have it act as a question-answering, discussion and receptive conversation partner (separately) for the last year.

I'm not going to say that it "works well" but maybe half the time it does work.

The 4 biggest issues that... (read more)

Reading your posts gives me the impression that we are both loosely pointing at the same object, but with fairly large differences in terminology and formalism. 

While computing exact counter-factuals has issues with chaos, I don't think this poses a problem for my earlier proposal. I don't think it is necessary that the AGI is able to exactly compute the counterfactual entropy production, just that it makes a reasonably accurate approximation.[1]

I think I'm in agreement with your premise that the "constitutionalist form of agency" is flawed. The abse... (read more)

Entropy production partially solves the Strawberry Problem:

Change in entropy production per second (against the counterfactual of not acting) is potentially an objectively measurable quantity that can be used, in conjunction with other parameters specifying a goal, to prevent unexpected behaviour.

Rob Bensinger gives Yudkowsky's "Strawberry Problem" as follows:

How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellu... (read more)

6tailcalled
There's a billion reasonable-seeming impact metrics, but the main challenge of counterfactual-based impact is always how you handle chaos. I'm pretty sure the solution is to go away from counterfactuals as they represent a pathologically computationalist form of agency, and instead learn the causal backbone. If we view the core of life as increasing rather than decreasing entropy, then entropy-production may be a reasonable candidate for putting quantitative order to the causal backbone. But bounded agency is less about minimizing impact and more about propagating the free energy of the causal backbone into new entropy-producing channels.