"Don't believe there is any chance" is very strong. If there is a viable way to bet on this I would be willing to bet at even odds that conditional on AI takeover, a few humans survive 100 years past AI takeover.
I'm working on the METR autonomy length graph mentioned here, and want to caveat these preliminary results. Basically, we think the effective horizon length of models is a bit shorter than 2 hours, although we do think there is an exponential increase that, if it continues, could mean month-long horizons within 3 years.
I'm not happy about this but it seems basically priced in, so not much update on p(doom).
We will soon have Bayesian updates to make. If we observe that incentives created during end-to-end RL naturally produce goal guarding and other dangerous cognitive properties, it will be bad news. If we observe this doesn't happen, it will be good news (although not very good news because web research seems like it doesn't require the full range of agency).
Likewise, if we observe models' monitorability and interpretability start to tank as they think in neuralese, it will be bad news. If monitoring and interpretability are unaffected, good news.
Interesting times.
How on earth could monitorability and interpretability NOT tank when they think in neuralese? Surely changing from English to some gibberish high-dimensional vector language makes things harder, even if not impossible?
Thanks for the update! Let me attempt to convey why I think this post would have been better with fewer distinct points:
In retrospect, maybe I should've gone into explaining the basics of entropy and enthalpy in my reply, eg:
If you replied with this, I would have said something like "then what's wrong with the designs for diamond mechanosynthesis tooltips, which don't resemble enzymes and have been computationally simulated as you mentioned in point 9?" then we would have gone back and forth a few times until either (a) you make some complicated argument I...
How would we know?
This doesn't seem wrong to me, so I'm now confused again about what the correct analysis is. It would come out the same way if we assume rationalists are selected on g, right?
Is a Gaussian prior correct though? I feel like it might be double-counting evidence somehow.
TLDR:
First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress...
Under log returns to money, personal savings still matter a lot for selfish preferences. Suppose the material comfort component of someone's utility is 0 utils at a consumption of $1/day. Then a moderately wealthy person consuming $1000/day today will be at 7 utils. The owner of a galaxy, at maybe $10^30/day, will be at 69 utils, but doubling their resources will still add the same 0.69 utils it would for today's moderately wealthy person. So my guess is they will still try pretty hard to acquire more resources, similarly to people in developed economies today who balk at their income being halved and see it as a pretty extreme sacrifice.
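A quick sketch of that arithmetic, assuming natural-log utility normalized to 0 utils at $1/day (which matches the numbers above):

```python
import math

# Natural-log utility in daily consumption, normalized so $1/day = 0 utils.
def utils(consumption_usd_per_day: float) -> float:
    return math.log(consumption_usd_per_day)

print(utils(1_000))                  # ~6.9 utils: moderately wealthy person today
print(utils(1e30))                   # ~69.1 utils: owner of a galaxy
print(utils(2_000) - utils(1_000))   # ~0.69 utils from doubling today
print(utils(2e30) - utils(1e30))     # ~0.69 utils from the galaxy owner doubling
```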
I agree. You only multiply the SAT z-score by 0.8 if you're selecting people on high SAT score and estimating the IQ of that subpopulation, making a correction for regressional Goodhart. Rationalists are more likely selected for high g, which causes both SAT score and IQ, so the z-score should be around 2.42, which means the estimate should be (100 + 2.42 * 15 - 6) = 130.3. From the link, the exact values depend on the correlations between g, IQ, and SAT score, but it seems unlikely that the correction factor is as low as 0.8.
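For concreteness, here is the arithmetic under the two selection stories (the 2.42 z-score, 0.8 correlation, and -6 norm adjustment come from the thread; the framing is mine):

```python
# Two ways to turn the surveyed SAT z-score into an estimated mean IQ.

sat_z = 2.42          # average SAT z-score of the group
r_sat_iq = 0.8        # assumed correlation between SAT and IQ
norm_adjustment = -6  # adjustment for outdated norms, from the thread

# (a) Selected directly on SAT score: regress the z-score toward the mean.
iq_selected_on_sat = 100 + r_sat_iq * sat_z * 15 + norm_adjustment  # ~123.0

# (b) Selected on g, which drives both SAT and IQ: no regression needed,
#     so the SAT z-score carries over directly.
iq_selected_on_g = 100 + sat_z * 15 + norm_adjustment               # ~130.3

print(iq_selected_on_sat, iq_selected_on_g)
```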
I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?
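In case it helps frame the question, this is the kind of comparison I have in mind: fit a power law to attack-success data with and without a defense and compare the fitted exponents. The data below is made up purely for illustration.

```python
import numpy as np

# Fit failure_rate ~ c * n**(-alpha) in log-log space and return (alpha, c),
# to check whether a defense changes the exponent or only the constant.
def fit_power_law(n, failure_rate):
    slope, intercept = np.polyfit(np.log(n), np.log(failure_rate), 1)
    return -slope, float(np.exp(intercept))

n = np.array([1, 10, 100, 1000])
no_defense   = np.array([0.90, 0.50, 0.28, 0.16])  # illustrative only
with_defense = np.array([0.95, 0.53, 0.30, 0.17])  # illustrative only

print(fit_power_law(n, no_defense))
print(fit_power_law(n, with_defense))  # similar alpha: the defense only shifts the curve
```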
It's likely possible to engineer away mutations just by checking. ECC memory already has an error rate nine orders of magnitude better than human DNA, and with better error correction you could probably get the error rate low enough that less than one error occurs across the expected number of nanobots that will ever exist. ECC is also not the kind of checking that can be disabled: the memory module always processes raw bits into error-corrected bits, and this fails unless the bits match a checksum, which a mutation can be made astronomically unlikely to satisfy by chance.
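A minimal sketch of why the checksum claim works out, assuming (my numbers, purely for illustration) a 256-bit check and a generous bound on total copy events:

```python
# Probability that a random corruption happens to pass a k-bit check is ~2**-k.
check_bits = 256
p_random_corruption_passes = 2.0 ** -check_bits   # ~8.6e-78

copy_events = 1e40   # assumed upper bound on replication events across all nanobots
expected_undetected_mutations = copy_events * p_random_corruption_passes
print(expected_undetected_mutations)   # ~8.6e-38, i.e. effectively zero
```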
I was expecting some math. Maybe something about the expected amount of work you can get out of an AI before it coups you, if you assume the number of actions required to coup is n, the trusted monitor has false positive rate p, etc?
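For example, something like the following toy setup (all modeling choices are mine, just to illustrate the kind of math I was hoping for):

```python
# Toy control model: the AI needs n unflagged malicious actions to coup.
# The trusted monitor flags each malicious action independently with
# probability q, and each benign action with probability p (false positives),
# where every flag triggers a human audit.

def p_coup_succeeds(n: int, q: float) -> float:
    """Chance that all n malicious actions slip past the monitor."""
    return (1 - q) ** n

def audit_cost_per_benign_action(p: float) -> float:
    """Expected audits wasted per unit of benign work, from false positives."""
    return p

print(p_coup_succeeds(n=10, q=0.5))        # ~0.001
print(audit_cost_per_benign_action(0.02))  # 0.02 audits per benign action
```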
I'm pretty skeptical of this because the analogy seems superficial. Thermodynamics says useful things about abstractions like "work" because we have the laws of thermodynamics. What are the analogous laws for cognitive work / optimization power? It's not clear to me that it can be quantified such that it is easily accounted for:
It is also not clear what distinguishes LLM weights from the weights of a model trained on random labels fro...
Maybe we'll see the Go version of Leela give nine stones to pros soon? Or 20 stones to normal players?
Whether or not it would happen by default, this would be the single most useful LW feature for me. I'm often really unsure whether a post will get enough attention to be worth making it a longform, and sometimes even post shortforms like "comment if you want this to be a longform".
I thought it would be linearity of expectation.
One day, the North Wind and the Sun argued about which of them was the strongest. Abadar, the god of commerce and civilization, stopped to observe their dispute. “Why don’t we settle this fairly?” he suggested. “Let us see who can compel that traveler on the road below to remove his cloak.”
The North Wind agreed, and with a mighty gust, he began his effort. The man, feeling the bitter chill, clutched his cloak tightly around him and even pulled it over his head to protect himself from the relentless wind. After a time, the...
The thought experiment is not about the idea that your VNM utility could theoretically be doubled, but about rejecting diminishing returns to actual matter and energy in the universe. SBF said he would flip with a 51% chance of doubling the universe's size (or creating a duplicate universe) and a 49% chance of destroying the current universe. Taking this bet requires a stronger commitment to utilitarianism than most people are comfortable with; your utility needs to be linear in matter and energy. You must be the kind of person that would take a 0.001% chance of c...
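A worked version of why the bet separates linear from diminishing utility in resources (toy framing, my own):

```python
# Expected utility of the 51/49 double-or-nothing bet on the universe,
# with the current universe's resources normalized to 1.
p_win = 0.51

# Utility linear in matter/energy: doubling is worth 2, destruction is worth 0.
linear_ev = p_win * 2 + (1 - p_win) * 0          # 1.02 > 1, so take the bet

# Any strictly concave (diminishing-returns) utility refuses, e.g. u(x) = sqrt(x):
concave_ev = p_win * 2 ** 0.5 + (1 - p_win) * 0  # ~0.72 < 1, so decline

print(linear_ev, concave_ev)
```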
For context, I just trialed at METR and talked to various people there, but this take is my own.
I think further development of evals is likely to yield either effective evals (an informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious").
The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things li...
In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:
More generally:
Yes, lots of socioeconomic problems have been solved on a 5 to 10 year timescale.
I also disagree that problems will become moot after the singularity unless it kills everyone-- the US has a good chance of continuing to exist, and improving democracy will probably make AI go slightly better.
I mention exactly this in paragraph 3.
The new font doesn't have a few characters useful in IPA.
The CATXOKLA population is higher than the current swing state population, so it would arguably be a little less unfair overall. Also there's the potential for a catchy pronunciation like /kæ'tʃoʊklə/.
Knowing now that he had an edge, I feel like his execution strategy was suspect. The Polymarket price went from 66c during the order back to 57c over the 5 days before the election. He could have extracted a bit more money from the market if he had forecasted the volume correctly and traded against it proportionally.
Wow, tough crowd
I think it would be better to form a big winner-take-all bloc. With proportional voting, the number of electoral votes at stake will be only a small fraction of the total, so the per-voter influence of CA and TX would probably remain below the national average.
A third approach to superbabies: physically stick >10 infant human brains together while they are developing, so they form a single individual with >10x the neocortical neurons of the average human. Forget +7sd; extrapolation would suggest >100sd intelligence.
Even better, we could find some way of networking brains together into supercomputers using configurable software. This would reduce potential health problems and also allow us to harvest their waste energy. Though we would have to craft a simulated reality to distract the non-useful conscious parts of the computational substrate, perhaps modeled on the year 1999...
In many respects, I expect this to be closer to what actually happens than "everyone falls over dead in the same second" or "we definitively solve value alignment". Multipolar worlds, AI that generally follows the law (when operators want it to, and modulo an increasing number of loopholes) but cannot fully be trusted, and generally muddling through are the default future. I'm hoping we don't get instrumental survival drives though.
Claim 2: The world has strong defense mechanisms against (structural) power-seeking.
I disagree with this claim. It seems pretty clear that the world has defense mechanisms against
But it is possible to be power-seeking in other ways. The Gates Foundation has a lot of money and wants other billionaires' money for its cause too. It influences technology development. It has to work with dozens of governments, sometimes lobbying them. Normal think tanks exist to gain influence over govern...
Yeah, that's right; I should have said the market for good air filters. My understanding of the problem is that most customers don't know to insist on high CADR at low noise levels, and therefore filter area is low. A secondary problem is that HEPA filters are optimized for single-pass efficiency rather than airflow, but they sell better than 70-90% efficient MERV filters.
The physics does work though. At a given airflow level, pressure and noise go as roughly the -1.5 power of filter area. What IKEA should be producing instead of the FÖRNUFTIG and STARKVIND is ...
Quiet air filters are a technically solved problem. You just need enough filter area that the pressure drop is low, so that you can use quiet low-pressure PC fans to move the air. CleanAirKits is already good, but if the market were big enough and cared enough, then rather than CleanAirKits charging >$200 for a box with holes in it and fans, you would get a purifier from IKEA for $120 which is sturdy and 3 dB quieter due to better sound design.
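To make the "enough filter area" point concrete, here is the rough scaling (using the approximate -1.5 power of filter area mentioned above; numbers are illustrative):

```python
# At fixed airflow, pressure drop (and hence the fan noise needed to push air
# through) falls off as roughly filter_area ** -1.5.

def relative_pressure_drop(area_ratio: float) -> float:
    """Pressure drop relative to a baseline filter, at the same airflow."""
    return area_ratio ** -1.5

for area_ratio in (1, 2, 4):
    print(area_ratio, round(relative_pressure_drop(area_ratio), 3))
# Doubling the filter area cuts the pressure drop to ~35% of baseline;
# quadrupling it, to ~12.5%, which is what lets quiet PC fans do the job.
```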
Haven't fully read the post, but I feel like that could be relaxed. Part of my intuition is that Aumann's theorem can be relaxed to the case where the agents start with different priors, and the conclusion is that their posteriors differ by no more than their priors.
I eat most meats (all except octopus and chicken) and have done this my entire life, except once when I went vegan for Lent. This state seems basically fine because it is acceptable from scope-sensitive consequentialist, deontic, and common-sense points of view, and it improves my diet enough that it's not worth giving up meat "just because".
It's not just the writing that sounds like a crank. Core arguments that Remmelt endorses are AFAIK considered crankery by the community; with all the classic signs like
Paul Christiano read some of this and concluded "the entire scientific community would probably consider this w...
Disagree. If ChatGPT is not objective, most people are not objective. If we ask a random person who happens to work at a random company, they are more biased than the internet, which at least averages out the biases of many individuals.
LLMs are, simultaneously, (1) notoriously sycophantic, i.e. biased to answer the way they think the interlocutor wants them to, and (2) have "truesight", i.e. a literally superhuman ability to suss out the interlocutor's character (which is to say: the details of the latent structure generating the text) based on subtle details of phrasing. While the same could be said of humans as well – most humans would be biased towards assuaging their interlocutor's worldview, rather than creating conflict – the problem of "leading questions" rises to a whole new le...
I'll grant that ChatGPT displays less bias than most people on major issues, but I don't think this is sufficient to dismiss Matt's concern.
My intuition is that if the bias of a few flawed sources (Claude, ChatGPT) is amplified by their widespread use, the fact that it is "less biased than the average person" matters less.
Luckily, that's probably not an issue for PC-fan-based purifiers. Box fans in CR boxes run way out of spec, with increased load and lower airflow both increasing temperatures, whereas PC fans run under basically the same conditions they're designed for.
Any interest in a longform post about air purifiers? There's a lot of information I couldn't fit in this post, and there have been developments in the last few months. Reply if you want me to cover a specific topic.
The point of corrigibility is to remove the instrumental incentive to avoid shutdown, not to avoid all negative outcomes. Our civilization can work on addressing side effects of shutdownability later after we've made agents shutdownable.
In theory, unions fix the bargaining asymmetry where, in certain trades, job loss is a much bigger cost to the employee than to the company, giving the company unfair negotiating power. In historical case studies like coal mining in the early 20th century, conditions without unions were awful and union demands seem extremely reasonable.
My knowledge of actual unions mostly comes from such historical case studies plus personal experience of strikes not having huge negative externalities (the 2003 supermarket strike seemed justified, a teachers' strike seemed okay, a ...
(Crossposted from Bountied Rationality Facebook group)
I am generally pro-union given unions' history of fighting exploitative labor practices, but in the dockworkers' strike that commenced today, the union seems to be firmly in the wrong. Harold Daggett, the head of the International Longshoremen’s Association, gleefully talks about holding the economy hostage in a strike. He opposes automation--"any technology that would replace a human worker’s job", and this is a major reason for the breakdown in talks.
For context, the automation of the global shipping ...
If the bounty isn't over, I'd likely submit several arguments tomorrow.
This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and for further grants supporting its development into a published work including new theory and experiments. @Adrià Garriga-alonso was also a great help in writing the paper and interfacing with the review process.
The current LLM situation seems like real evidence that we can have agents that aren't bloodthirsty vicious reality-winning bots, and also positive news about the order in which technology will develop. Under my model, transformative AI requires a minimum level of both real-world understanding and consequentialism, but beyond this minimum there are tradeoffs. While I agree that AGI was always going to have some *minimum* level of agency, there is a big difference between "slightly less than humans", "about the same as humans", and "bloodthirsty vicious reality-winning bots".
I just realized what you meant by embedding-- not a shorter program within a longer program, but a short program that simulates a potentially longer (in description length) program.
As applied to the simulation hypothesis, the idea is that if we use the Solomonoff prior for our beliefs about base reality, it's more likely to be the laws of physics for a simple universe containing beings that simulate this one than to be our physics directly, unless we observe our laws of physics to be super simple. So we are more likely to be simulated by beings inside e.g....
I appreciate the clear statement of the argument, though it is not obviously watertight to me, and wish people like Nate would engage.
"Random goals" is a crux. Complicated goals that we can't control well enough to prevent takeover are not necessarily uniformly random goals from whatever space you have in mind.