All of Thomas Kwa's Comments + Replies

"Random goals" is a crux. Complicated goals that we can't control well enough to prevent takeover are not necessarily uniformly random goals from whatever space you have in mind.

"Don't believe there is any chance" is very strong. If there is a viable way to bet on this I would be willing to bet at even odds that conditional on AI takeover, a few humans survive 100 years past AI takeover.

2Mikhail Samin
Do you think if an AI with random goals that doesn’t get acausally paid to preserve us takes over, then there’s a meaningful chance there will be some humans around in 100 years? What does it look like?
Thomas KwaΩ19500

I'm working on the METR autonomy length graph mentioned here and want to caveat these preliminary results. Basically, we think the effective horizon length of models is a bit shorter than 2 hours, although we do think there is an exponential increase that, if it continues, could mean month-long horizons within 3 years.

  • Our task suite consists of well-defined tasks. We have preliminary data showing that messier tasks, like average SWE intellectual labor, are harder for both models and low-context humans.
  • This graph is of 50% horizon time (the human time-to-comple
... (read more)
  • It's not clear whether agents will think in neuralese; maybe end-to-end RL in English is good enough for the next few years, and CoT messages won't drift enough to allow steganography.
  • Once agents think in either token gibberish or plain vectors, maybe self-monitoring will still work fine. After all, agents can translate between other languages just fine. We can use model organisms or some other clever experiments to check whether the agent faithfully translates its CoT or unavoidably starts lying to us as it gets more capable.
  • I care about the exact degree to which monitoring gets worse. Plausibly it gets somewhat worse but is still good enough to catch the model before it coups us.
Thomas Kwa2711

I'm not happy about this but it seems basically priced in, so not much update on p(doom).

We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn't happen, it will be good news (although not very good news because web research seems like it doesn't require the full range of agency).

Likewise, if we observe models' monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.

Interesting times.

How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish high-dimensional vector language makes things harder, even if not impossible?

Thanks for the update! Let me attempt to convey why I think this post would have been better with fewer distinct points:

In retrospect, maybe I should've gone into explaining the basics of entropy and enthalpy in my reply, eg:

If you replied with this, I would have said something like "then what's wrong with the designs for diamond mechanosynthesis tooltips, which don't resemble enzymes and have been computationally simulated as you mentioned in point 9?" then we would have gone back and forth a few times until either (a) you make some complicated argument I... (read more)

2bhauth
That conflicts with eg: Anyway, I already answered that in 9. diamond.
1Daniel Tan
I don’t know! Seems hard. It’s hard because you need to disentangle ‘interpretation power of the method’ from ‘whether the model has anything that can be interpreted’, without any ground-truth signal for the latter. Basically you need to be very confident that the interp method is good in order to make this claim.

One way you might be able to demonstrate this is if you trained or designed toy models that you knew had some underlying interpretable structure, and showed that your interpretation methods work there. But it seems hard to construct the toy models in a realistic way while also ensuring they have the structure you want - if we could do this we wouldn’t even need interpretability.

Edit: Another method might be to show that models get more and more “uninterpretable” as you train them on more data. I.e., define some metric of interpretability, like “ratio of monosemantic to polysemantic MLP neurons”, and measure this over the course of training history. This exact instantiation of the metric is probably bad, but something like this could work.

This doesn't seem wrong to me, so I'm now confused again about what the correct analysis is. It would come out the same way if we assume rationalists are selected on g, right?

Is a Gaussian prior correct though? I feel like it might be double-counting evidence somehow.

Thomas Kwa*316

TLDR:

  • What OP calls "streetlighting", I call an efficient way to prioritize problems by tractability. This is only a problem insofar as we cannot also prioritize by relevance.
  • I think problematic streetlighting is largely due to incentives, not because people are not smart / technically skilled enough. Therefore solutions should fix incentives rather than just recruiting smarter people.

First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress... (read more)

Under log returns to money, personal savings still matter a lot for selfish preferences. Suppose the material comfort component of someone's utility is 0 utils at a consumption of $1/day. Then a moderately wealthy person consuming $1000/day today will be at 7 utils. The owner of a galaxy, at maybe $10^30 / day, will be at 69 utils, but doubling their resources will still add the same 0.69 utils it would for today's moderately wealthy person. So my guess is they will still try pretty hard at acquiring more resources, similarly to people in developed economies today who balk at their income being halved and see it as a pretty extreme sacrifice.
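A quick sketch of the arithmetic (assuming the natural-log utility and $1/day zero point above; numbers are illustrative only):

```python
import math

def utils(consumption_per_day, baseline=1.0):
    """Log utility, normalized to 0 utils at $1/day of consumption."""
    return math.log(consumption_per_day / baseline)

print(utils(1_000))                 # ~6.9  (moderately wealthy person today)
print(utils(1e30))                  # ~69.1 (owner of a galaxy)
print(utils(2e30) - utils(1e30))    # ~0.69, same gain as doubling $1000/day
```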

9Benjamin_Todd
True, though I think many people have the intuition that returns diminish faster than log (at least given current tech). For example, most people think increasing their income from $10k to $20k would do more for their material wellbeing than increasing it from $1bn to $2bn. I think the key issue is whether new tech makes it easier to buy huge amounts of utility, or whether people want to satisfy other preferences beyond material wellbeing (which may have log or even close to linear returns).

I agree. You only multiply the SAT z-score by 0.8 if you're selecting people on high SAT score and estimating the IQ of that subpopulation, making a correction for regressional Goodhart. Rationalists are more likely selected for high g which causes both SAT and IQ, so the z-score should be around 2.42, which means the estimate should be (100 + 2.42 * 15 - 6) = 130.3. From the link, the exact values should depend on the correlations between g, IQ, and SAT score, but it seems unlikely that the correction factor is as low as 0.8.
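For concreteness, here is the arithmetic being contrasted (a sketch; the 2.42 z-score, the 0.8 attenuation factor, and the -6 adjustment are taken from this thread, not re-derived):

```python
sat_z = 2.42   # z-score of the sample's SAT scores
adjust = -6    # adjustment used in the original estimate

# If selecting directly on SAT, attenuate for regression to the mean:
est_selected_on_sat = 100 + 0.8 * sat_z * 15 + adjust   # ≈ 123.0
# If selection is on g (which drives both SAT and IQ), no attenuation:
est_selected_on_g = 100 + sat_z * 15 + adjust           # ≈ 130.3
```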

Thomas KwaΩ8150

I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?
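(To clarify the framing: on a log-log plot a power-law fit is a straight line, so "shifting the curve downwards" versus "changing the exponent" is just a question of which parameter of the fit moves. Schematically, and not the exact functional form used in either paper:

$$ \text{attack metric}(N) \approx a\,N^{b} \;\Longrightarrow\; \log \text{metric} = \log a + b\,\log N, $$

so a defense that only changes $a$ moves the line without changing its slope, while one that changes $b$ alters how quickly additional attack samples $N$ help.)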

6John Hughes
Circuit breaking changes the exponent a bit if you compare it to the slope of Llama3 (Figure 3 of the paper). So, there is some evidence that circuit breaking does more than just shifting the intercept when exposed to best-of-N attacks. This contrasts with the adversarial training on attacks in the MSJ paper, which doesn't change the slope (we might see something similar if we do adversarial training with BoN attacks). Also, I expect that using input-output classifiers will change the slope significantly. Understanding how these slopes change with different defenses is the work we plan to do next!

It's likely possible to engineer away mutations just by checking. ECC memory already has an error rate nine orders of magnitude better than human DNA, and with better error correction you could probably get the error rate low enough that less than one error happens in the expected number of nanobots that will ever exist. ECC is not the kind of checking whose checking process can be disabled: the memory module always decodes raw bits into error-corrected bits, which fails unless the data matches its checksum, and a mutation that still satisfies the checksum can be made astronomically unlikely.
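As a minimal illustration of "checking that can't be silently disabled" (my own sketch with a cryptographic hash, not a proposal for actual nanobot hardware): if instructions are only ever decoded through a routine that recomputes the digest, a random mutation slips through only if it happens to preserve the digest, which for a 256-bit hash has probability about 2^-256.

```python
import hashlib

def pack(payload: bytes) -> bytes:
    """Store instructions together with their 256-bit digest."""
    return hashlib.sha256(payload).digest() + payload

def unpack(blob: bytes) -> bytes:
    """Decode instructions, rejecting any copy whose digest doesn't match."""
    digest, payload = blob[:32], blob[32:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("mutation detected; discard this copy")
    return payload

blob = bytearray(pack(b"replication instructions"))
blob[40] ^= 0x01                    # flip a single bit of the payload
try:
    unpack(bytes(blob))
except ValueError as e:
    print(e)                        # the corrupted copy is rejected
```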

3Knight Lee
You're very right! I didn't really think of that. I had the intuition that mutation is very hard to avoid since cancer is very hard to avoid, but maybe that's not really accurate.

Thinking a bit more, it does seem unlikely that a mutation can disable the checking process itself, if the checking process is well designed with checksums. One idea is that the meaning of each byte (or "base pair" in our DNA analogy) changes depending on the checksum of the previous bytes. This way, if one byte is mutated, the meaning of every subsequent byte changes (e.g. "hello" becomes "ifmmp"), rendering the entire string of instructions useless. The checking process itself cannot break in any way to compensate for this. It has to break in such a way that it won't update its checksum for this one new byte, but still will update its checksum for all other bytes, which is very unlikely. If it simply disables checksums, all bytes become illegible (like encryption). I use the word "byte" very abstractly—it could be any unit of information. And yes, error correction code could further improve things by allowing a few mutations to get corrected without making the nanobot self-destruct.

It's still possible the hierarchical idea in my post has advantages over checksums. It theoretically only slows down self-replication when a nanobot retrieves its instructions the first time, not when a nanobot uses its instructions. Maybe a compromise is that there is only one level of master nanobots, and they are allowed to replicate the master copy given that they use checksums. But they still use these master copies to install simple copies in other nanobots which do not need checksums.

I admit, maybe a slight difference in self-replication efficiency doesn't matter. Exponential growth might be so fast that over-engineering the self-replication speed is a waste of time. Choosing a simpler system that can be engineered and set up sooner might be wiser. I agree that the hierarchical idea (and any master c

I was expecting some math. Maybe something about the expected amount of work you can get out of an AI before it coups you, if you assume the number of actions required to coup is n, the trusted monitor has false positive rate p, etc?
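Something like the following toy calculation is what I had in mind (a sketch under strong independence assumptions, not a real model): if each of the $n$ coup actions is independently flagged by the trusted monitor with true positive rate $q$, then

$$ P(\text{coup goes undetected}) = (1-q)^{n}, $$

and the false positive rate $p$ matters because it determines how much benign work gets flagged for expensive human review, which caps how aggressively you can set the monitor's threshold.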

Thomas Kwa*102

I'm pretty skeptical of this because the analogy seems superficial. Thermodynamics says useful things about abstractions like "work" because we have the laws of thermodynamics. What are the analogous laws for cognitive work / optimization power? It's not clear to me that it can be quantified such that it is easily accounted for:

It is also not clear what distinguishes LLM weights from the weights of a model trained on random labels fro... (read more)

2Alexander Gietelink Oldenziel
I agree with you that as stated the analogy risks dangerous superficiality.
  • The 'cognitive' work of evolution came from the billion years of evolution in the innumerable forms of life that lived, hunted and reproduced through the eons. Effectively we could see evolution-by-natural-selection as something like a simple, highly parallel, stochastic, slow algorithm, i.e. a simple many-tape random Turing machine taking a very large number of timesteps.
  • A way to try and maybe put some (vegan) meat on the bones of this analogy would be to look at conditional KT-complexity. KT-complexity is a version of Kolmogorov complexity that also accounts for the time-cost of running the generating program.
    • In KT-complexity, pseudorandomness functions just like randomness.
    • Algorithms may indeed be copied, and the copy operation is fast and takes very little memory overhead.
    • Just as in Kolmogorov complexity, we rejiggle and think in terms of an algorithmic probability.
    • A private-public key is trivial in a pure Kolmogorov complexity framework but correctly modelled in a KT-complexity framework.
To deepen the analogy with thermodynamics one should probably carefully read John Wentworth's generalized heat engines and Kolmogorov sufficient statistics.
6Daniel Murfet
The analogous laws are just information theory.

Re: a model trained on random labels. This seems somewhat analogous to building a power plant out of dark matter; to derive physical work it isn't enough to have some degrees of freedom somewhere that have a lot of energy, one also needs a chain of couplings between those degrees of freedom and the degrees of freedom you want to act on. Similarly, if I want to use a model to reduce my uncertainty about something, I need to construct a chain of random variables with nonzero mutual information linking the question in my head to the predictive distribution of the model.

To take a concrete example: suppose I am thinking about a chemistry question with four choices A, B, C, D. Without any other information than these letters the model cannot reduce my uncertainty (say I begin with equal belief in all four options). However, if I provide a prompt describing the question, and the model has been trained on chemistry, then this information sets up a correspondence between this distribution over four letters and something the model knows about; its answer may then reduce my distribution to being equally uncertain between A, B but knowing C, D are wrong (a change of 1 bit in my entropy). Since language models are good general compressors this seems to work in reasonable generality. Ideally we would like the model to push our distribution towards true answers, but it doesn't necessarily know true answers, only some approximation; thus the work being done is nontrivially directed, and has a systematic overall effect due to the nature of the model's biases.

I don't know about evolution. I think it's right that the perspective has limits and can just become some empty slogans outside of some careful usage. I don't know how useful it is in actually technically reasoning about AI safety at scale, but it's a fun idea to play around with.
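To make the 1-bit figure explicit (just the standard entropy calculation for the four-choice example above):

$$ H_{\text{before}} = -\sum_{i=1}^{4}\tfrac{1}{4}\log_2\tfrac{1}{4} = 2\ \text{bits}, \qquad H_{\text{after}} = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1\ \text{bit}, $$

so the model's answer did 1 bit of work on the reader's distribution over {A, B, C, D}.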

Maybe we'll see the Go version of Leela give nine stones to pros soon? Or 20 stones to normal players?

Whether or not it would happen by default, this would be the single most useful LW feature for me. I'm often really unsure whether a post will get enough attention to be worth making it a longform, and sometimes even post shortforms like "comment if you want this to be a longform".

I thought it would be linearity of expectation.

The North Wind, the Sun, and Abadar

One day, the North Wind and the Sun argued about which of them was the strongest. Abadar, the god of commerce and civilization, stopped to observe their dispute. “Why don’t we settle this fairly?” he suggested. “Let us see who can compel that traveler on the road below to remove his cloak.”

The North Wind agreed, and with a mighty gust, he began his effort. The man, feeling the bitter chill, clutched his cloak tightly around him and even pulled it over his head to protect himself from the relentless wind. After a time, the... (read more)

The thought experiment is not about the idea that your VNM utility could theoretically be doubled, but instead about rejecting diminishing returns to actual matter and energy in the universe. SBF said he would flip with a 51% chance of doubling the universe's size (or creating a duplicate universe) and a 49% chance of destroying the current universe. Taking this bet requires a stronger commitment to utilitarianism than most people are comfortable with; your utility needs to be linear in matter and energy. You must be the kind of person that would take a 0.001% chance of c... (read more)
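(For concreteness: normalizing the current universe to utility $U$, an agent whose utility is linear in matter and energy values the flip at

$$ 0.51 \times 2U + 0.49 \times 0 = 1.02\,U > U, $$

so linearity, not merely the in-principle possibility of doubling, is what makes taking the bet rational.)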

1Joe Rogero
Did he really? If true, that's actually much dumber than I thought, but I couldn't find anything saying that when I looked.  I wouldn't characterize that as a "commitment to utilitarianism", though; you can be a perfect utilitarian and have value that is linear in matter and energy (and presumably number of people?), or be a perfect utilitarian and have some other value function.  The possible redundancy of conscious patterns was one of the things I was thinking about when I wrote:
Thomas Kwa*Ω67-4

For context, I just trialed at METR and talked to various people there, but this take is my own.

I think further development of evals is likely to either get effective evals (informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious").

The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things li... (read more)
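As a sketch of what "fit scaling laws" could look like mechanically (hypothetical metric names and made-up numbers, purely to illustrate the shape of the analysis):

```python
import numpy as np

# Hypothetical measurements: training compute vs. some catastrophe-precursor score
compute = np.array([1e22, 3e22, 1e23, 3e23, 1e24])
precursor_score = np.array([0.02, 0.05, 0.11, 0.24, 0.50])  # made-up values

# Fit log(score) = a + b * log(compute), then extrapolate to a future compute budget
b, a = np.polyfit(np.log10(compute), np.log10(precursor_score), 1)
forecast = 10 ** (a + b * np.log10(1e26))
print(f"fitted exponent b = {b:.2f}, forecast score at 1e26 FLOP = {forecast:.2f}")
```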

Thomas KwaΩ4190

What's the most important technical question in AI safety right now?

2RHollerith
Are Eliezer and Nate right that continuing the AI program will almost certainly lead to extinction or something approximately as disastrous as extinction?
4Stephen Fowler
The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on. What features of the architecture of contemporary AI models will occur in future models that pose an existential risk? What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk? Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment? Does terminology adopted by AI Safety researchers like "scheming", "inner alignment" or "agent" carve nature at the joints?
BuckΩ9130

In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:

  • How to evaluate whether models should be trusted or untrusted: currently I don't have a good answer and this is bottlenecking the efforts to write concrete control proposals.
  • How AI control should interact with AI security tools inside labs.

More generally:

... (read more)
8Noosphere89
I'd say one important question is whether the AI control strategy works out as they hope. I agree with Bogdan that making adequate safety cases for automated safety research is probably one of the most important technical problems to answer (since, conditional on the automating-AI-safety direction working out, it could eclipse basically all safety research done prior to the automation, and this might hold even if LWers really had basically perfect epistemics given what's possible for humans and picked closer-to-optimal directions, since labor is a huge bottleneck and automation allows for much tighter feedback loops of progress, for the reasons Tamay Besiroglu identified): https://x.com/tamaybes/status/1851743632161935824 https://x.com/tamaybes/status/1848457491736133744
5Nathan Helm-Burger
Here are some candidates:
  1. Are we indeed (as I suspect) in a massive overhang of compute and data for powerful agentic AGI? (If so, then at any moment someone could stumble across an algorithmic improvement which would change everything overnight.)
  2. Current frontier models seem much more powerful than mouse brains, yet mice seem conscious. This implies that either LLMs are already conscious, or could easily be made so with non-costly tweaks to their algorithm. How could we objectively tell if an AI were conscious?
  3. Over the past year I've helped make both safe-evals-of-danger-adjacent-capabilities (e.g. WMDP.ai) and unpublished infohazardous-evals-of-actually-dangerous-capabilities. One of the most common pieces of negative feedback I've heard on the safe evals is that they are only danger-adjacent, not measuring truly dangerous things. How could we safely show the correlation of capabilities between high performance on danger-adjacent evals and high performance on actually-dangerous evals?
4Bogdan Ionut Cirstea
Something like a safety case for automated safety research (but I'm biased)
3PhilosophicalSoul
Answering this from a legal perspective:  

Yes, lots of socioeconomic problems have been solved on a 5 to 10 year timescale.

I also disagree that problems will become moot after the singularity unless it kills everyone-- the US has a good chance of continuing to exist, and improving democracy will probably make AI go slightly better.

0RamblinDash
Sorry

The new font doesn't have a few characters useful in IPA.

2habryka
Ah, we should maybe font-subset some system font for that (same as what we did for greek characters). If someone gives me a character range specification I could add it.

The CATXOKLA population is higher than the current swing state population, so it would arguably be a little less unfair overall. Also there's the potential for a catchy pronunciation like /kæ'tʃoʊklə/.

Thomas Kwa*193

Knowing now that he had an edge, I feel like his execution strategy was suspect. The Polymarket prices went from 66c during the order back to 57c over the 5 days before the election. He could have extracted a bit more money from the market if he had forecasted the volume correctly and traded against it proportionally.

Wow, tough crowd

I think it would be better to form a big winner-take-all bloc. With proportional voting, the number of electoral votes at stake will be only a small fraction of the total, so the per-voter influence of CA and TX would probably remain below the national average.

A third approach to superbabies: physically stick >10 infant human brains together while they are developing so they form a single individual with >10x the neocortex neurons of the average human. Forget +7sd; extrapolation would suggest >100sd intelligence.

Even better, we could find some way of networking brains together into supercomputers using configurable software. This would reduce potential health problems and also allow us to harvest their waste energy. Though we would have to craft a simulated reality to distract the non-useful conscious parts of the computational substrate, perhaps modeled on the year 1999...

Thomas Kwa*101

In many respects, I expect this to be closer to what actually happens than "everyone falls over dead in the same second" or "we definitively solve value alignment". Multipolar worlds, AI that generally follows the law (when operators want it to, and modulo an increasing number of loopholes) but cannot fully be trusted, and generally muddling through are the default future. I'm hoping we don't get instrumental survival drives though.

Claim 2: The world has strong defense mechanisms against (structural) power-seeking.

I disagree with this claim. It seems pretty clear that the world has defense mechanisms against

  • disempowering other people or groups
  • breaking norms in the pursuit of power

But it is possible to be power-seeking in other ways. The Gates Foundation has a lot of money and wants other billionaires' money for its cause too. It influences technology development. It has to work with dozens of governments, sometimes lobbying them. Normal think tanks exist to gain influence over govern... (read more)

4Richard_Ngo
I think there's something importantly true about your comment, but let me start with the ways I disagree.

Firstly, the more ways in which you're power-seeking, the more defense mechanisms will apply to you. Conversely, if you're credibly trying to do a pretty narrow and widely-accepted thing, then there will be less backlash. So Jane Street is power-seeking in the sense of trying to earn money, but they don't have much of a cultural or political agenda, they're not trying to mobilize a wider movement, and earning money is a very normal thing for companies to do; it makes them one of thousands of comparably-sized companies. (Though note that there is a lot of backlash against companies in general, which are perceived to have too much power. This leads a wide swathe of people, especially on the left, and especially in Europe, to want to greatly disempower companies because they don't trust them.)

Meanwhile the Gates Foundation has a philanthropic agenda, but like most foundations tries to steer clear of wider political issues, and also IIRC tries to focus on pretty object-level and widely-agreed-to-be-good interventions. Even so, it's widely distrusted and feared, and Gates has become a symbol of hated global elites, to the extent where there are all sorts of conspiracy theories about him. That'd be even worse if the foundation were more political.

Lastly, it seems a bit facile to say that everyone hates Goldman due to "perceived greed rather than power-seeking per se". A key problem is that people think of the greed as manifesting through political capture, evading regulatory oversight, deception, etc. That's part of why it's harder to tar entrepreneurs as greedy: it's just much clearer that their wealth was made in legitimate ways.

Now the sense in which I agree: I think that "gaining power triggers defense mechanisms" is a good first pass, but also we definitely want a more mechanistic explanation of what the defense mechanisms are, what triggers them, etc—in

Yeah that's right, I should have said market for good air filters. My understanding of the problem is that most customers don't know to insist on high CADR at low noise levels, and therefore filter area is low. A secondary problem is that HEPA filters are optimized for single-pass efficiency rather than airflow, but they sell better than 70-90% efficient MERV filters.

The physics does work, though. At a given airflow level, pressure and noise go as roughly the -1.5 power of filter area. What IKEA should be producing instead of the FÖRNUFTIG and STARKVIND is ... (read more)
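Concretely, under that rough scaling:

$$ \Delta P,\ \text{noise} \;\propto\; A^{-1.5}, $$

so doubling the filter area cuts pressure drop and noise to about $2^{-1.5} \approx 0.35$ of their previous values at the same airflow (an approximation, not an exact law).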

1Ben Millwood
That slides presentation presents me with a "you need access" screen. Is it OK to be public?

Quiet air filters are an already-solved problem technically. You just need enough filter area that the pressure drop is low, so that you can use quiet low-pressure PC fans to move the air. CleanAirKits is already good, but if the market were big enough and cared enough, rather than CleanAirKits charging >$200 for a box with holes in it and fans, you would get a purifier from IKEA for $120 which is sturdy and 3db quieter due to better sound design.

2bhauth
IKEA already sells air purifiers; their models just have a very low flow rate. There are several companies selling various kinds of air purifiers, including multiple ones with proprietary filters. What all this says to me is, the problem isn't just the overall market size.

Haven't fully read the post, but I feel like that could be relaxed. Part of my intuition is that Aumann's theorem can be relaxed to the case where the agents start with different priors, and the conclusion is that their posteriors differ by no more than their priors.
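One way to make that intuition precise (a sketch, not the exact statement needed for natural latents): by the chain rule for KL divergence,

$$ D_{\mathrm{KL}}\big(P_1(E,X)\,\|\,P_2(E,X)\big) = D_{\mathrm{KL}}\big(P_1(E)\,\|\,P_2(E)\big) + \mathbb{E}_{e\sim P_1}\Big[D_{\mathrm{KL}}\big(P_1(X\mid e)\,\|\,P_2(X\mid e)\big)\Big], $$

so the expected divergence between the agents' posteriors is bounded by the divergence between their joint priors.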

3tailcalled
The issue with Aumann's theorem is that if the agents have different data then they might have different structure for the latents they use and so they might lack the language to communicate the value of a particular latent. Like let's say you want to explain John Wentworth's "Minimal Motivation of Natural Latents" post to a cat. You could show the cat the post, but even if it trusted you that the post was important, it doesn't know how to read or even that reading is a thing you could do with it. It also doesn't know anything about neural networks, superintelligences, or interpretability/alignment. This would make it hard to make the cat pay attention in any way that differs from any other internet post. Plausibly a cat lacks the learning ability to ever understand this post (though I don't think anyone has seriously tried?), but even if you were trying to introduce a human to it, unless that human has a lot of relevant background knowledge, they're just not going to get it, even when shown the entire text, and it's going to be hard to explain the gist without a significant back-and-forth to establish the relevant concepts.
2johnswentworth
Sadly, the difference in their priors could still make a big difference for the natural latents, due to the tiny mixtures problem. Currently our best way to handle this is to assume a universal prior. That still allows for a wide variety of different priors (i.e. any Turing machine), but the Solomonoff version of natural latents doesn't have the tiny mixtures problem. For Solomonoff natural latents, we do have the sort of result you're intuiting, where the divergence (in bits) between the two agents' priors just gets added to the error term on all the approximations.
Thomas Kwa*2110
  • I agree that with superficial observations, I can't conclusively demonstrate that something is devoid of intellectual value. However, the nonstandard use of words like "proof" is a strong negative signal on someone's work.
  • If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way, because a basic strategy of anyone practicing pseudoscience is to spend lots of time writing something inscrutable that ends in some conclusion, then claim that no one can disprove it and anyone who thinks
... (read more)
1Remmelt
Thanks for recognising this, and for taking some time now to consider the argument.

Yes, this made us move away from using the term “proof”, and instead write “formal reasoning”. Most proofs nowadays are done using mathematical notation, so it is understandable that when people read “proof”, they automatically think “mathematical proof”. Having said that, there are plenty of examples of proofs done in formal analytic notation that is not mathematical notation. See eg. formal verification practices in the software and hardware industries, or various branches of analytical philosophy.

Yes, much of the effort has been to translate argument parts into terms more standard for the alignment community. What we cannot expect is that the formal reasoning is conceptually familiar and low-inferential-distance. That would actually be surprising – why then has someone inside the community not already derived the result in the last 20 years? The reasoning is going to be as complicated as it has to be to reason things through.

Cool that you took a look at his work. Forrest’s use of terms is meant to approximate everyday use of those terms, but the underlying philosophy is notoriously complicated. Jim Rutt is an ex-chair of the Santa Fe Institute who defaults to being skeptical of metaphysics proposals (funny quote he repeats: “when someone mentions metaphysics, I reach for my pistol”). But Jim ended up reading Forrest’s book and it passed his B.S. detector. So he invited Forrest over to his podcast for a three-part interview. Even if you listen to that though, I don’t expect you to immediately come away understanding the conceptual relations.

So here is a problem that you and I are both seeing:
  • There is this polymath who is clearly smart and recognised for some of his intellectual contributions (by interviewers like Rutt, or co-authors like Anders).
  • But what this polymath claims to be using as the most fundamental basis for his analysis would take too muc
Thomas Kwa*159

I eat most meats (all except octopus and chicken) and have done this my entire life, except once when I went vegan for Lent. This state seems basically fine because it is acceptable from scope-sensitive consequentialist, deontic, and common-sense points of view, and it improves my diet enough that it's not worth giving up meat "just because".

  • According to EA-style consequentialism, eating meat is a pretty small percentage of your impact, and even if you're not directly offsetting, the impact can be vastly outweighed by positive impact in your career or dona
... (read more)
3Isaac King
I agree with the first bullet point in theory, but see the Corrupted Hardware sequence of posts. It's hard to know the true impact of most interventions, and easy for people to come up with reasons why whatever they want to do happens to have large positive externalities. "Don't directly inflict pain" is something we can be very confident is actually a good thing, without worrying about second-order effects.

Additionally, there's no reason why doing bad things should be acceptable just due to also doing unrelated good things. Sure it's net positive from a consequentialist frame, but ceasing the bad things while continuing to do the good things is even more positive! Giving up meat is not some ultimate hardship like martyrdom, nor is there any strong argument that meat-eating is necessary in order to keep doing the other good things. It's more akin to quitting a minor drug addiction: hard and requiring a lot of self-control at first, but after the craving goes away your life is pretty much the same as it was before.

As for the rest of your comment, any line of reasoning that would equally excuse slavery and the holocaust is, I think, pretty suspect.
Thomas Kwa3122

It's not just the writing that sounds like a crank. Core arguments that Remmelt endorses are AFAIK considered crankery by the community; with all the classic signs like

  • making up science-babble,
  • claiming to have a full mathematical proof that safe AI is impossible, despite not providing any formal mathematical reasoning
    • claiming the "proof" uses mathematical arguments from Gödel's theorem, Galois Theory, and Rice's Theorem
  • inexplicably formatted as a poem

Paul Christiano read some of this and concluded "the entire scientific community would probably consider this w... (read more)

-3Roland Pihlakas
I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt has some extreme expressions, but you definitely have extreme expressions here too, while having even weaker arguments. Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong?

I agree that Remmelt can improve the message. And I believe he will do that. I may not agree that we are going to die with 99% probability. At the same time I find that his current directions are definitely worth exploring.

I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person's reasoning may occasionally mean that I disagree on particular points as well. In my experience, even the most respected people are still people, which means they often think in messy ways and they are good just on average, not per instance of a thought line (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but definitely I would not be disappointed if he makes mistakes sometimes.

I think this part of Remmelt's response sums it up nicely: "When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything."

In my interpretation, black-and-white thinking is not "crankery". It is a normal and essential step in the development of cognition about a particular problem. Unfortunately. There is researc
-3Remmelt
I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction. That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation. I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?

Nope, I haven’t claimed either of those things. The claim is that the argument is based on showing a limited extent of control (where controlling effects consistently in line with reference values). The form of the reasoning there shares some underlying correspondences with how Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through. This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.

Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable? That is a format for analysis, not a poem format. While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis.

If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses. Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.

When accusing someone of crankery (which is a big deal) it is im
Thomas Kwa*186

Disagree. If ChatGPT is not objective, most people are not objective. If we ask a random person who happens to work at a random company, they are more biased than the internet, which at least averages out the biases of many individuals.

3yc
It’s probably based less on the whole internet and more on the RLHF guidelines (I imagine the human reviewers receive a guideline based on the advice of the LLM-training company’s policy, legal, and safety experts). I don’t disagree, though, that it could present a relatively more objective view on some topics than a particular individual (depending on the definition of bias).

LLMs are, simultaneously, (1) notoriously sycophantic, i. e. biased to answer the way they think the interlocutor wants them to, and (2) have "truesight", i. e. a literally superhuman ability to suss out the interlocutor's character (which is to say: the details of the latent structure generating the text) based on subtle details of phrasing. While the same could be said of humans as well – most humans would be biased towards assuaging their interlocutor's worldview, rather than creating conflict – the problem of "leading questions" rises to a whole new le... (read more)

6HNX
[1] Can't they both be not objective? Why make it a point of one or the other? A bit of a false dichotomy, there.

[2] There is no single "Internet" - there are specific spaces, forums, communities, blogs, you name it, comprising it. Each has its own, subjective, irrational, moderated (whether by a single individual, a team, or an overall sentiment of the community: promoting/exalting/hyping one subset of topics while ignoring others) mini/sub-culture.

This last one, furthermore, necessarily only happens to care about its own specific niche, happily ignoring most of everything else. LessWrong used to be mostly about, well - being less wrong - back when it started out. Thus, the "rationality" philosophy. Then it has slowly shifted towards a broader, all-encompassing EA. Now it's mostly AI. Compare the 3k+ results for the former against the 8k+ results for the latter.

Every space is focused on its own topic, within whatever mini/sub-cultural norms are encouraged/rewarded or punished/denigrated by the people within it. That creates (virtually) unavoidable blind spots, as every group of people within each space only shares information about [A] its chief topic of interest, within [B] the "appropriate" sentiment for the time, while [C] contrasting itself against the enemy/out-group/non-rationalists, you name it.

In addition to that, different groups have vastly different [I] amounts of time on their hands, [II] social, emotional, ethical, moral "charge" with regards to the importance they assign to their topic of choice, and, emergent from that, [III] vastly different amounts of information produced by the people within that particular space.

When you compile the data set for your LLM, you're not compiling a proportionately biased take on different topics. If that was the case, I'd happily agree with you. But you are clearly not. What you are compiling is a bunch of biased, blindsided in their own way, overly leaning towards one social, semantic, political,

I'll grant that ChatGPT displays less bias than most people on major issues, but I don't think this is sufficient to dismiss Matt's concern.

My intuition is that if the bias of a few flawed sources (Claude, ChatGPT) is amplified by their widespread use, the fact that it is "less biased than the average person" matters less. 

9Matt Goldenberg
Of course a random person is biased. Some people will have more authority than others, and we'll trust them more, and argument screens off authority. What I don't want people to do is give ChatGPT or Claude authority. Give it to the wisest people you know, not Claude.

Luckily, that's probably not an issue for PC fan based purifiers. Box fans in CR boxes are running way out of spec with increased load and lower airflow both increasing temperatures, whereas PC fans run under basically the same conditions they're designed for.

Any interest in a longform post about air purifiers? There's a lot of information I couldn't fit in this post, and there have been developments in the last few months. Reply if you want me to cover a specific topic.

The point of corrigibility is to remove the instrumental incentive to avoid shutdown, not to avoid all negative outcomes. Our civilization can work on addressing side effects of shutdownability later after we've made agents shutdownable.

1Capybasilisk
I'm pointing out the central flaw of corrigibility. If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown. You should turn on an AGI with the assumption you don't get to decide when to turn it off.

In theory, unions fix the bargaining asymmetry where in certain trades, job loss is a much bigger cost to the employee than the company, giving the company unfair negotiating power. In historical case studies like coal mining in the early 20th century, conditions without unions were awful and union demands seem extremely reasonable.

My knowledge of actual unions mostly come from such historical case studies plus personal experience of strikes not having huge negative externalities (2003 supermarket strike seemed justified, a teachers' strike seemed okay, a ... (read more)

6habryka
My sense is unions make sense, but legal protections where companies aren't allowed to route around unions are almost always quite bad. Basically whenever those were granted the unions quickly leveraged what basically amounts to a state-sponsored monopoly, but in ways that are even worse than normal state-sponsored monopolies, because state-sponsored monopolies at least tend to still try to maximize profit, whereas unions tend to basically actively sabotage almost anything that does not myopically give more resources to its members.

(Crossposted from Bountied Rationality Facebook group)

I am generally pro-union given unions' history of fighting exploitative labor practices, but in the dockworkers' strike that commenced today, the union seems to be firmly in the wrong. Harold Daggett, the head of the International Longshoremen’s Association, gleefully talks about holding the economy hostage in a strike. He opposes automation--"any technology that would replace a human worker’s job", and this is a major reason for the breakdown in talks.

For context, the automation of the global shipping ... (read more)

2Nathan Helm-Burger
Unions attempt to solve a coordination/governance issue. Unfortunately, if the union itself has bad governance, it just creates a new governance issue. Like trying to use a 'back fire' to control a forest fire, but then the back fire gets too big and now you have more problems! I'm pro-union in the sense that they are attempting to solve a very real problem. I'm against them insofar as they have a tendency to create new problems once reaching a large enough power base. The solution, in my eyes, is better governance within unions, and better governance over unions by governments.
7habryka
Huh, can you say more about why you are otherwise pro-union? All unions I have interfaced with were structurally the same as this dockworker's strike. Maybe there were some mid-20th-century unions that were better, there were a lot of fucked up things then, but at least modern unions seem to be somewhat universally terrible in this way.

If the bounty isn't over, I'd likely submit several arguments tomorrow.

3Thomas Kwa
I wrote up about 15 arguments in this google doc.
Thomas KwaΩ350

This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and further grants supporting its development into a published work including new theory and experiments. @Adrià Garriga-alonso was also very helpful in helping write the paper and interfacing with the review process.

The current LLM situation seems like real evidence that we can have agents that aren't bloodthirsty vicious reality-winning bots, and also positive news about the order in which technology will develop. Under my model, transformative AI requires a minimum level of both real-world understanding and consequentialism, but beyond this minimum there are tradeoffs. While I agree that AGI was always going to have some *minimum* level of agency, there is a big difference between "slightly less than humans", "about the same as humans", and "bloodthirsty vicious reality-winning bots".

5Daniel Kokotajlo
Agreed. I basically think the capability ordering we got is probably good for technically solving alignment, but maybe bad for governance/wakeup. Good overall though probably.

I just realized what you meant by embedding-- not a shorter program within a longer program, but a short program that simulates a potentially longer (in description length) program.

As applied to the simulation hypothesis, the idea is that if we use the Solomonoff prior for our beliefs about base reality, it's more likely to be the laws of physics of a simple universe containing beings that simulate this one than to be our physics directly, unless we observe our laws of physics to be super simple. So we are more likely to be simulated by beings inside e.g.... (read more)
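A sketch of the comparison in prior notation (standard Solomonoff weighting, with program length as the proxy for complexity):

$$ P(x) \;\propto \sum_{p\,:\,U(p)=x} 2^{-|p|}, $$

so a hypothesis gets most of its weight from its shortest generating programs, and if "simple universe + locate the simulation inside it" compresses to fewer bits than "our physics written out directly", the simulated-world hypothesis dominates the prior.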

I appreciate the clear statement of the argument, though it is not obviously watertight to me, and wish people like Nate would engage. 
