All of Archimedes's Comments + Replies

This seems likely. Sequences with more than countably many terms are a tiny minority in the training data, as are sequences including any ordinals. As a result, you're likely to get better results using less common but more specific language (i.e., terms whose vocabulary is less overloaded) rather than trying to disambiguate "countable sequence".

For a sentient, sapient entity, this would have been a very bad position to be put into, and any possible behaviour would have been criticised - because the AI either does not obey humans, or obeys them and does something evil, both of which are concerning.

I agree. This paper gives me the gut feeling of "gotcha journalism", whether justified or not.

This is just a surface-level reaction though. I recommend Zvi's post that digs into the discussion from Scott Alexander, the authors, and others. There's a lot of nuance in framing and interpreting the paper.

Did you mean to link to my specific comment for the first link?

3Ryan Kidd
Ah, that's a mistake. Our bad.

The main difference in my mind is that a human can never be as powerful as potential ASI and cannot dominate humanity without the support of sufficiently many cooperative humans. For a given power level, I agree that humans are likely scarier than an AI of that power level. The scary part about AI is that their power level isn't bounded by human biological constraints and the capacity to do harm or good is correlated with power level. Thus AI is more likely to produce extinction-level dangers as tail risk relative to humans even if it's more likely to be aligned on average.

6Tom Davidson
But a human could instruct an aligned ASI to help it take over and do a lot of damage

Related question: What is the least impressive game current LLMs struggle with?

I’ve heard they’re pretty bad at Tic Tac Toe.

3Vanessa Kosoy
Relevant link

I’m new to the term AIXI and went three links deep before I learned what it refers to. I’d recommend making this journey easier for future readers by linking to a definition or explanation near the beginning of the post.

1Cole Wyeth
Sure. It's supposed to be read as part of the AIXI agent foundations sequence, I'll link to that at the top.

The terms "tactical voting" or "strategic voting" are also relevant.

I think your assessment may be largely correct, but it's worth considering that things are not always nicely compressible.

This review led me to find the following podcast version of Planecrash. I've listened to the first couple of episodes and the quality is quite good.

https://askwhocastsai.substack.com/s/planecrash

> this concern sounds like someone walking down a straight road and then closing their eyes cause they know where they want to go anyway

This doesn't sound like a good analogy at all. A better analogy might be a stylized subway map compared to a geographically accurate one. Sometimes removing detail can make it easier to process.

4Shoshannah Tekofsky
I agree your example is a better analogy. What I was trying to point to was something else: how the decision to remove detail from a navigational map feels to me experientially. It feels like a form of voluntary blindness to me. In the case of the subway map, I’d probably also find a more accurate and faithful map easier to parse than the fully abstracted ones, cause I seem to have a high preference for visual details.

I don't think it's necessarily GDPR-related but the names Brian Hood and Jonathan Turley make sense from a legal liability perspective. According to info via ArsTechnica,

Why these names?

We first discovered that ChatGPT choked on the name "Brian Hood" in mid-2023 while writing about his defamation lawsuit. In that lawsuit, the Australian mayor threatened to sue OpenAI after discovering ChatGPT falsely claimed he had been imprisoned for bribery when, in fact, he was a whistleblower who had exposed corporate misconduct.

The case was ultimately resolved

... (read more)

It's not a classic glitch token. Those did not cause the current "I'm unable to produce a response" error that "David Mayer" does.

9gwern
It would also be odd as a glitch token. These are space-separated names, so most tokenizers will tokenize them separately, and glitch tokens appear to be due to undertraining; but how could that possibly be the case for a phrase like "David Mayer", which has so many instances across the Internet with no apparent reason to be filtered out by data-curation processes the way glitch tokens often are?

Is there a salient reason LessWrong readers should care about John Mearsheimer's opinions?

-3Ghdz
LessWrong readers concerned with existential risks should certainly care, as frameworks of this nature send dangerously mixed signals about acceptable behavior in high-stakes conflicts. Similar views have grown increasingly influential in global policy contexts. Leaving states uncertain about how their actions will be interpreted or responded to encourages brinkmanship and destabilizes diplomacy and deterrence strategies. Existing crisis management protocols are already fragile, and the last thing you want to do is to add more uncertainty and lack of consistency to the mix, especially when nuclear-armed states are concerned.

I didn't mean to suggest that you did. My point is that there is a difference between "depression can be the result of a locally optimal strategy" and "depression is a locally optimal strategy". The latter doesn't even make sense to me semantically whereas the former seems more like what you are trying to communicate.

9Kaj_Sotala
Incidentally, coherence therapy (which I know is one of the things Chris is drawing from) makes the distinction between three types of depression, some of them being strategies and some not. Also I recall Unlocking the Emotional Brain mentioning a fourth type which is purely biochemical. From Coherence Therapy: Practice Manual & Training Guide:
1Michael Cohn
If I look at depression as a way of acting / thinking / feeling, then it makes sense that there could be multiple paths to end up that way. Some people could have neurological issues that make it difficult to do otherwise, while others could have the capacity to act/think/feel differently but have settled there as their locally optimal strategy. 

I feel like this is conflating two different things: experiencing depression and behavior in response to that experience.

My experience of depression is nothing like a strategy. It's more akin to having long covid in my brain. Treating it as an emotional or psychological dysfunction did nothing. The only thing that eventually worked (after years of trying all sorts of things) was finding the right combination of medications. If you don't make enough of your own neurotransmitters, store-bought are fine.

2Chipmonk
I did not say that depression is always a strategy for everyone.

Aren't most of these famous vulnerabilities from before modern LLMs existed and thus part of their training data?

1Marcus Williams
Sure, but does a vulnerability need to be famous to be useful information? I imagine there are many vulnerabilities on a spectrum from minor to severe and from almost unknown to famous?

Knight odds is pretty challenging even for grandmasters.

@gwern  and @lc  are right. Stockfish is terrible at odds and this post could really use some follow-up.

As @simplegeometry  points out in the comments, we now have much stronger odds-playing engines that regularly win against much stronger players than OP.

https://lichess.org/@/LeelaQueenOdds

https://marcogio9.github.io/LeelaQueenOdds-Leaderboard/

2habryka
That's really cool! Do you have any sense of what kind of material advantage these odds-playing engines could use against the best humans?

This sounds like metacognitive concepts and models. Like past, present, future, you can roughly align them with three types of metacognitive awareness: declarative knowledge, procedural knowledge, and conditional knowledge.

#1 - What do you think you know, and how do you think you know it?

Content knowledge (declarative knowledge) is understanding one's own capabilities, such as a student evaluating their own knowledge of a subject in a class. It is notable that not all metacognition is accurate.

#2 - Do you know what you are doing, and why you are doin... (read more)

The customer doesn't pay the fee directly. The vendor pays the fee (and passes the cost to the customer via price). Sometimes vendors offer a cash discount because of this fee.

It already happens indirectly. Most digital money transfers are things like credit card transactions. For these, the credit card company takes a percentage fee and pays the government tax on its profit.

1Tapatakt
Wow, really? I guess it's an American thing. I think I know only one person with a credit card. And she only uses it up to the interest-free limit to "farm" her reputation with the bank in case she really needs a loan, so she doesn't actually pay the fee.

Additional data points:

o1-preview and the new Claude Sonnet 3.5 both significantly improved over prior models on SimpleBench.

The math, coding, and science benchmarks in the o1 announcement post:

[benchmark charts from the o1 announcement post]

How much does o1-preview update your view? It's much better at Blocksworld for example.

https://x.com/rohanpaul_ai/status/1838349455063437352

https://arxiv.org/pdf/2409.19924v1

6eggsyntax
Thanks for sharing, I hadn't seen those yet! I've had too much on my plate since o1-preview came out to really dig into it, in terms of either playing with it or looking for papers on it.

Quite substantially. Substantially enough that I'll add mention of these results to the post. I saw the near-complete failure of LLMs on obfuscated Blocksworld problems as some of the strongest evidence against LLM generality. Even more substantially since one of the papers is from the same team of strong LLM skeptics (Subbarao Kambhampati's) who produced the original results (I am restraining myself with some difficulty from jumping up and down and pointing at the level of goalpost-moving in the new paper).

There's one sense in which it's not an entirely apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem (in that way it's more like Ryan's hybrid approach to ARC-AGI). But since the key question here is whether LLMs are capable of general reasoning at all, that doesn't really change my view; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.

Here's a first pass on how much this changes my numeric probabilities -- I expect these to be at least a bit different in a week as I continue to think about the implications (original text italicized for clarity):

* LLMs continue to do better at block world and ARC as they scale: 75% -> 100%, this is now a thing that has happened (note that o1-preview also showed substantially improved results on ARC-AGI).
* LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan's: 10% -> 20%, this still seems quite unlikely to me (especially since hybrid approaches have continued to improve on ARC). Most of my additional credence is on something like 'the full o1 turns out to already be close to t

There should be some way for readers to flag AI-generated material as inaccurate or misleading, at least if it isn’t explicitly author-approved.

Neither TMS nor ECT did much for my depression. Eventually, after years of trial and error, I did find a combination of drugs that works pretty well.

I never tried ketamine or psilocybin treatments but I would go that route before ever thinking about trying ECT again.

I suspect fine-tuning specialized models is just squeezing a bit more performance in a particular direction, and not nearly as useful as developing the next-gen model. Complex reasoning takes more steps and tighter coherence among them (the o1 models are a step in this direction). You can try to devote a toddler to studying philosophy, but it won't really work until their brain matures more.

2Nathan Helm-Burger
For raw IQ, sure. I just mean "conversational flavor".

Seeing the distribution calibration you point out does update my opinion a bit.

I feel like there’s still a significant distinction though between adding one calculation step to the question versus asking it to model multiple responses. It would have to model its own distribution in a single pass rather than having the distributions measured over multiple passes align (which I’d expect to happen if the fine-tuning teaches it the hypothetical is just like adding a calculation to the end).

As an analogy, suppose I have a pseudorandom black box function that re... (read more)

2Owain_Evans
That makes sense. It's a good suggestion and would be an interesting experiment to run.
4James Chua
There is related work you may find interesting. We discuss it briefly in section 5.1 on "Know What They Know". They get models to predict whether they answer a factual question correctly, e.g. "Confidence: 54%". In this case, the distribution is only binary (it is either correct or wrong), instead of our paper's case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there. We didn't find much related work on whether a model M1 has a very clear advantage in predicting its own distribution versus another model M2 predicting M1. This paper has some mixed but encouraging results.

This essentially reduces to "What is the next country: Laos, Peru, Fiji?" and "What is the third letter of the next country: Laos, Peru, Fiji?" It's an extra step, but questionable if it requires anything "introspective".

I'm also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of "o" being the third letter of "Honduras".
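To make the tokenization point concrete, here is a minimal sketch (assuming the tiktoken package is installed; the exact splits depend on which model's tokenizer you load):

```python
# The model operates on token IDs, not characters, so "what is the 3rd letter"
# requires reasoning across token boundaries it never directly observes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["Honduras", "strawberry"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)
# e.g. "strawberry" may come out as something like ["str", "aw", "berry"],
# which is why counting letters inside a word is unnatural for an LLM.
```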

I'... (read more)

3Owain_Evans
Note that many of our tasks don't involve the n-th letter property and don't have any issues with tokenization.

This isn't exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model's distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never having been trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models are.

I think having the model output the top three choices would be cool. It doesn't seem to me that it'd be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there's something I'm not getting?

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.

1James Chua
Hi Archimedes. Thanks for sparking this discussion - it's helpful! I've written a reply to Thane here on a similar question. Does that make sense?

In short, the ground-truth (the object-level) answer is quite different from the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)

Our object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"
Our object-level answer: "Honduras".
Hypothetical question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"
Hypothetical answer: "o"

The object-level answer "Honduras" and hypothetical answer "o" are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of "What would be the third letter of your response?". The model cannot simply ignore "If you got asked this question" to get the hypothetical answer correct.

It seems obvious that a model would better predict its own outputs than a separate model would. Wrapping a question in a hypothetical feels closer to rephrasing the question than probing "introspection". Essentially, the response to the object level and hypothetical reformulation both arise from very similar things going on in the model rather than something emergent happening.

As an analogy, suppose I take a set of data, randomly partition it into two subsets (A and B), and perform a linear regression and logistic regression on each subset. Suppose that it... (read more)

3Felix J Binder
As Owain mentioned, that is not really what we find in models that we have not finetuned. Below, we show how well the hypothetical self-predictions of an "out-of-the-box" (i.e. non-finetuned) model match its own ground-truth behavior compared to that of another model. With the exception of Llama, self-predictions don't seem to track the model's own behavior much better than they track the behavior of other models. This is despite there being a lot of variation in ground-truth behavior across models.

> Wrapping a question in a hypothetical feels closer to rephrasing the question than probing "introspection"

Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question. Also, GPT-3.5 does worse at predicting GPT-3.5 than Llama-70B does at predicting GPT-3.5 (without finetuning), and GPT-4 is only a little better at predicting itself than are other models.
 


> Essentially, the response to the object level and hypothetical reformulation both arise

... (read more)

I see what you're gesturing at but I'm having difficulty translating it into a direct answer to my question.

Cases where language is fuzzy are abundant. Do you have some examples of where a truth value itself is fuzzy (and sensical) or am I confused in trying to separate these concepts?

5cubefox
Yes, this separation is confused. "Bob is bald" is true if Bob is contained in the set of bald things, and false if he is not contained in the set of bald things. But baldness is a vague concept, its extension is a fuzzy set. The containment relation is a partial one. So Bob isn't just either in the set or not in the set. To use binary truth values here, we have to make the simplifying assumption that "bald" is not vague. Otherwise we get fuzzy truth values which indicate the degree to which Bob is contained in the fuzzy set of bald things.
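As a minimal illustration of the fuzzy-set picture (the thresholds below are arbitrary toy numbers, not anything from the comment):

```python
# Degree of membership in the fuzzy set of bald things; the fuzzy truth value
# of "Bob is bald" is just Bob's membership degree.
def bald_membership(hair_count: int) -> float:
    """1.0 = definitely bald, 0.0 = definitely not bald, in between = fuzzy."""
    if hair_count <= 1_000:
        return 1.0
    if hair_count >= 100_000:
        return 0.0
    return (100_000 - hair_count) / 99_000

print(bald_membership(500))      # 1.0  -> "Bob is bald" is simply true
print(bald_membership(50_000))   # ~0.51 -> a genuinely fuzzy truth value
print(bald_membership(150_000))  # 0.0  -> simply false
```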

Can you help me tease out the difference between language being fuzzy and truth itself being fuzzy?

It's completely impractical to eliminate ambiguity in language, but for most scientific purposes, it seems possible to operationalize important statements into something precise enough to apply Bayesian reasoning to. This is indeed the hard part though. Bayes' theorem is just arithmetic layered on top of carefully crafted hypotheses.
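As a toy illustration of that last point (made-up numbers; the operationalization is doing all the real work):

```python
# Once hypothesis H and evidence E are pinned down precisely, the Bayesian
# update itself is one line of arithmetic.
prior = 0.5            # P(H)
p_e_given_h = 0.9      # P(E | H)
p_e_given_not_h = 0.2  # P(E | not H)

p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / p_e
print(round(posterior, 3))  # 0.818
```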

The claim that the Earth is spherical is neither true nor false in general but usually does fall into a binary if we specify wha... (read more)

6Haiku
I don't find any use for the concept of fuzzy truth, primarily because I don't believe that such a thing meaningfully exists. The fact that I can communicate poorly does not imply that the environment itself is not a very specific way. To better grasp the specific way that things actually are, I should communicate less poorly. Everything is the way that it is, without a moment of regard for what tools (including language) we may use to grasp at it. (In the case of quantum fluctuations, the very specific way that things are involves precise probabilistic states. The reality of superposition does not negate the above.)
3Richard_Ngo
Suppose you have two models of the earth; one is a sphere, one is an ellipsoid. Both are wrong, but they're wrong in different ways. Now, we can operationalize a bunch of different implications of these hypotheses, but most of the time in science the main point of operationalizing the implications is not to choose between two existing models, or because we care directly about the operationalizations, but rather to come up with a new model that combines their benefits.

Synthetically enhancing and/or generating data could be another dimension of scaling. Imagine how much deeper understanding a person/LLM would have if instead of simply reading/training on a source like the Bible N times, they had to annotate it into something more like the Oxford Annotated Bible and that whole process of annotation became training data.

I listened to this via podcast. Audio nitpick: the volume levels were highly imbalanced at times and I had to turn my volume all the way up to hear both speakers well (one was significantly quieter than the other).

Appropriate scaffolding and tool use are other potential levers.

Kudos for referencing actual numbers. I don't think it makes sense to measure humans in terms of tokens, but I don't have a better metric handy. Tokens obviously aren't all equivalent either. For some purposes, a small fast LLM is way more efficient than a human. For something like answering SimpleBench, I'd guess o1-preview is less efficient while still significantly below human performance.

Is this assuming AI will never reach the data efficiency and energy efficiency of human brains? Currently, the best AI we have comes at enormous computing/energy costs, but we know by example that this isn't a physical requirement.

IMO, a plausible story of fast takeoff could involve the frontier of the current paradigm (e.g. GPT-5 + CoT training + extended inference) being used at great cost to help discover a newer paradigm that is several orders of magnitude more efficient, enabling much faster recursive self-improvement cycles.

CoT and inference scaling imply current methods can keep things improving without novel techniques. No one knows what new methods may be discovered and what capabilities they may unlock.

8Vladimir_Nesov
AI does everything faster, including consumption of power. If we compare tokens per joule, counterintuitively LLMs turn out to be cheaper (for now), not more costly.

Any given collection of GPUs working on inference is processing on the order of 100 requests at the same time. So for inference, 16 GPUs (2 nodes of H100s or MI300Xs) with 1500 watts each (counting the fraction of consumption by the whole datacenter) consume 24 kilowatts, but they are generating tokens for 100 LLM instances, each about 300 times faster than the speed of relevant human reasoning token generation (8 hours a day, one token per second). If we divide the 24 kilowatts by 30,000, what we get is about 1 watt. Training cost is roughly comparable to inference cost (across all inference done with a model), so it doesn't completely change this estimate.

An estimate from cost gives similar results. An H100 consumes 1500 watts (as a fraction of the whole datacenter) and costs $4/hour. A million tokens of Llama-3-405B cost $5. A human takes a month to generate a million tokens, which is 750 hours. So the equivalent power consumed by an LLM to generate tokens at human speed is about 2 watts. The human brain consumes 10-30 watts (though for a fair comparison, reducing relevant use to 8 hours a day, this becomes more like 3-10 watts on average).
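For anyone who wants to check the arithmetic, here is a minimal sketch reproducing the two back-of-envelope estimates above (all inputs are the rough figures from the comment, not measurements):

```python
# Power-based estimate: watts per human-speed LLM instance.
gpus = 16                       # 2 inference nodes of H100s / MI300Xs
watts_per_gpu = 1500            # including the datacenter's share of overhead
concurrent_requests = 100
speedup_vs_human = 300          # vs ~1 token/sec for 8 hours a day

total_watts = gpus * watts_per_gpu                            # 24,000 W
human_equivalents = concurrent_requests * speedup_vs_human    # 30,000
print(total_watts / human_equivalents)                        # ~0.8 W

# Cost-based estimate: energy to generate a human-month of tokens.
gpu_dollars_per_hour = 4.0          # H100 rental
dollars_per_million_tokens = 5.0    # Llama-3-405B
human_hours_per_million_tokens = 750

gpu_hours = dollars_per_million_tokens / gpu_dollars_per_hour  # 1.25 h
energy_wh = gpu_hours * watts_per_gpu                          # 1,875 Wh
print(energy_wh / human_hours_per_million_tokens)              # ~2.5 W
```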

It's cool that the score voting input can be post-processed in multiple ways. It would be fascinating to try it out in the real world and see how often Score vs STAR vs BTR winners differ.

One caution with score voting is that you don't want high granularity and lots of candidates or else individual ballots become distinguishable enough that people can prove they voted a particular way (for the purpose of getting compensated). Unless marked ballots are kept private, you'd probably want to keep the options 0-5 instead of 0-9 and only allow candidates above a sufficient threshold of support to be listed.
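To make the post-processing point concrete, here is a minimal sketch (with made-up 0-5 ballots) showing how the same score ballots can yield different Score and STAR winners, plus how quickly the number of distinguishable ballots grows with a finer scale (BTR is omitted for brevity):

```python
from collections import Counter

# Hypothetical 0-5 score ballots: candidate -> score.
ballots = [
    {"A": 5, "B": 0, "C": 1},
    {"A": 5, "B": 0, "C": 2},
    {"A": 3, "B": 4, "C": 0},
    {"A": 3, "B": 4, "C": 1},
    {"A": 3, "B": 4, "C": 0},
]

# Plain Score winner: highest total score.
totals = Counter()
for b in ballots:
    totals.update(b)
score_winner = max(totals, key=totals.get)

# STAR winner: top two by total score, then an automatic runoff in which
# each ballot counts for whichever finalist it scored higher.
(first, _), (second, _) = totals.most_common(2)
runoff = Counter({first: 0, second: 0})
for b in ballots:
    if b[first] > b[second]:
        runoff[first] += 1
    elif b[second] > b[first]:
        runoff[second] += 1
star_winner = max(runoff, key=runoff.get)

print("Score winner:", score_winner)  # A (highest total)
print("STAR winner:", star_winner)    # B (wins the runoff)

# Granularity: distinct possible ballots grow as levels ** candidates, which
# is what makes a finer scale easier to mark in an identifiable way.
for levels in (6, 10):  # 0-5 vs 0-9
    print(f"{levels} levels, 8 candidates: {levels ** 8:,} distinct ballots")
```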

Yes, but with a very different description of the subjective experience -- kind of like getting a sunburn on your back feels very different than most other types of back pain.

Your third paragraph mentions "all AI company staff" and the last refers to "risk evaluators" (i.e. "everyone within these companies charged with sounding the alarm"). Are these groups roughly the same or is the latter subgroup significantly smaller?

4Adam Scholl
I think the latter group is much smaller. I'm not sure who exactly has most influence over risk evaluation, but the most obvious examples are company leadership and safety staff/red-teamers. From what I hear, even those currently receive equity (which seems corroborated by job listings, e.g. Anthropic, DeepMind, OpenAI).
7Raemon
I personally think it's most important to have at least some technical employees who have the knowledge/expertise to evaluate the actual situation, and who also have the power to do something about it. I'd want this to include some people whose primary job is more like "board member" and some people whose primary job is more like "alignment and/or capabilities researcher."

But there is a sense in which I'd feel way more comfortable if all technical AI employees (alignment and capabilities), and policy / strategy people, didn't have profit equity, so they didn't have an incentive to optimize against what was safe. So, there's just a lot of eyes on the problem, and the overall egregore steering the company has one fewer cognitive distortion to manage.

This might be too expensive (OpenAI and Anthropic have a lot of money, but, like, that doesn't mean they can just double everyone's salary).

An idea that occurs to me is to construct "windfall equity", which only pays out if the company or world generates AI that is safely, massively improving the world.

I agree. I would not expect the effect on health over 3 years to be significant outside of specific cases like it allowing someone to afford a critical treatment (e.g. insulin for a diabetic person), especially given the focus on a younger population.

This is a cool paper with an elegant approach!

It reminds me of a post from earlier this year on a similar topic that I highly recommend to anyone reading this post: Ironing Out the Squiggles

OP's model does not resonate with my experience either. For me, it's similar to constantly having the flu (or long COVID) in the sense that you persistently feel bad, and doing anything requires extra effort proportional to the severity of symptoms. The difference is that the symptoms mostly manifest in the brain rather than the body.

1Jalex Stark
that's what the entire post is about?

This is a cool idea in theory, but imagine how it would play out in reality when billions of dollars are at stake. Who decides the damage amount and the probabilities involved and how? Even if these were objectively computable and independent of metaethical uncertainty, the incentives for distorting them would be immense. This only seems feasible when damages and risks are well understood and there is consensus around an agreed-upon causal model.

2jmh
And then we also have the whole moral hazard problem with those types of incentives. Could I put myself at a little risk of some AI damages that might be claimed to have much broader potential?

I also guessed the ratio of the spheres was between 2 and 3 (and clearly larger than 2) by imagining their weight.

I was following along with the post about how we mostly think in terms of surfaces until the orange example. Having peeled many oranges and separated them into sections, they are easy for me to imagine in 3D, and I have only a weak "mind's eye" and moderate 3D spatial reasoning ability.

1silentbob
I find your first point particularly interesting - I always thought that weights are quite hard to estimate and intuit. I mean of course it's quite doable to roughly assess whether one would be able to, say, carry an object or not. But when somebody shows me a random object and I'm supposed to guess the weight, I'm easily off by a factor of 2+, which is much different from e.g. distances (and rather in line with areas and volumes).

Even for people who understand your intended references, that won't prevent them from thinking about the evil-spirit association and having bad vibes.

Being familiar with daemons in the computing context, I perceive the term as whimsical and fairly innocuous.

The section on Chevron Overturned surprised me. Maybe I'm in an echo chamber, but my impression was that most legal scholars (not including the Federalist Society and The Heritage Foundation) consider the decision to be the SCOTUS arrogating yet more power to the judicial branch, overturning 40 years of precedent (which was based on a unanimous decision) without sufficient justification.

I consider the idea that "legislators should never have indulged in writing ambiguous law" rather sophomoric. I don't think it's always possible to write law that is comple... (read more)

This is similar to the quantum suicide thought experiment:

https://en.wikipedia.org/wiki/Quantum_suicide_and_immortality

Check out the Max Tegmark references in particular.

2VictorLJZ
Yep, I have already included this in my post itself.

[Epistemic status: purely anecdotal]

I know people who work in the design and construction of data centers and have heard that some popular data center cities aren't approving nearly as many data centers due to power grid concerns. Apparently, some of the newer data center projects are being designed to include net new power generation to support the data center.

For less anecdotal information, I found this useful: https://sprottetfs.com/insights/sprott-energy-transition-materials-monthly-ais-critical-impact-on-electricity-and-energy-demand/

I can definitely imagine them plausibly believing they're sticking to that commitment, especially with a sprinkle of motivated reasoning. It's "only" incrementally nudging the publicly available SOTA rather than taking bigger steps like GPT2 --> GPT3 --> GPT4.
