All of quetzal_rainbow's Comments + Replies

When we should expect the "Swiss cheese" approach to safety/security to go wrong:

I think the general problem with your metaphor is that we don't know the "relevant physics" of self-improvement. We can't plot a "physically realistic" trajectory of landing in the "good values" land and say "well, we need to keep ourselves on this trajectory". BTW, MIRI has a dialogue using this metaphor.

And most of your suggestions amount to "let's learn the physics of alignment"? I have nothing against that, but that is the hard part, and control theory doesn't seem to provide a lot of insight here. It's a framework at best.

Yes, that's why it's a compromise - nobody will totally like it. But if Earth is going to exist for trillions of years, it will change radically too.

My honest opinion is that WMD evaluations of LLMs are not meaningfully related to X-risk in the sense of "kill literally everyone." I guess current or next-generation models may be able to assist a terrorist in a basement in brewing some amount of anthrax, spraying it in a public place, and killing tens to hundreds of people. To actually be capable of killing everyone from a basement, you would need to bypass all the reasons industrial production is necessary at the current level of technology. A system capable of bypassing the need for industrial production in ... (read more)

1RedMan
I wrote about something similar previously: https://www.lesswrong.com/posts/Ek7M3xGAoXDdQkPZQ/terrorism-tylenol-and-dangerous-information#a58t3m6bsxDZTL8DG I agree that 1-2 logs isn't really in the category of x-risk.  The longer the lead time on the evil plan (mixing chemicals, growing things, etc.), the more time security forces have to identify and neutralize the threat.  So, all things being equal, it's probably better that a would-be terrorist spends a year planning a weird chemical scheme that hurts tens of people than that someone just wakes up one morning and decides to run over tens of people with a truck.   There's a better chance of catching the first guy, and his plan is way more expensive in terms of time, money, and access to capital like LLM time.  Sure, someone could argue about pandemic potential, but lab origin is suspected for at least one influenza outbreak and a lot of people believe it about COVID-19.  Those weren't terrorists. I guess theoretically there may be cyberweapons that qualify as WMDs, but that will be because of the systems they interact with.  It's not the cyberweapon itself, it's the nuclear reactor accepting commands that lead to core damage.

Well, I have a bioengineering degree, but my point is that "direct lab experience" doesn't matter, because WMDs of the quality and in the quantity necessary to kill large numbers of enemy manpower are not produced in labs. They are produced in large industrial facilities, and setting up a large industrial facility for basically anything is at the "hard" level of difficulty. There is a difference between large-scale textile industry and large-scale semiconductor industry, but if you are not a government or a rich corporation, both lie in the "hard" zone. 

Let's take, for exam... (read more)

4RedMan
This seems incredibly reasonable, and in light of this, I'm not really sure why anyone should embrace ideas like making LLMs worse at biochemistry in the name of things like WMDP: https://www.lesswrong.com/posts/WspwSnB8HpkToxRPB/paper-ai-sandbagging-language-models-can-strategically-1 Biochem is hard enough that we need LLMs at full capacity pushing the field forward.  Is it harmful to intentionally create models that are deliberately bad at this cutting-edge and necessary science in order to maybe make it slightly more difficult for someone to reproduce Cold War-era weapons that were considered both expensive and useless at the time? Do you think that crippling the 'WMD relevance' of LLMs is harmful, neutral, or good?

The trick is that chem/bio weapons can't, actually, "be produced simply with easily available materials", if we are talking about military-grade stuff rather than "kill several civilians to create a scary picture on TV".

-4RedMan
You sound really confident; can you elaborate on your direct lab experience with these weapons, as well as clearly define 'military grade' vs. whatever the other thing was? How does 'chem/bio' compare to high explosives in terms of difficulty and effect?

It's very funny that Rorschach's linguistic ability is totally unremarkable compared to modern LLMs.

The real question is why does NATO have our logo.

This is the LGBTESCREAL agenda

I think there is an abstraction between "human" and "agent": "animal". Or, maybe, "organic life". Biological systematization (meaning all ways to systematize: phylogenetic, morphological, functional, ecological) is a useful case study for abstraction "in the wild".

2Thane Ruthenis
Animals are still pretty solidly in the "abstractions over real-life systems" category for me, though. What I'm looking for, under the "continuum" argument, are any practically useful concepts which don't clearly belong to either "theoretical concepts" or "real-life abstractions" according to my intuitions. Biological systematization falls under "abstractions over real-life systems" for me as well, in the exact same way as "Earthly trees". Conversely, "systems generated by genetic selection algorithms" is clearly a "pure concept". (You can sort of generate a continuum here, by gradually adding ever more details to the genetic algorithm until it exactly resembles the conditions of Earthly evolution... But I'm guessing Take 4 would still handle that: the resultant intermediary abstractions would likely either (1) show up in many places in the universe, on different abstraction levels, and clearly represent "pure" concepts, (2) show up in exactly one place in the universe, clearly corresponding to a specific type of real-life systems, (3) not show up at all.)

EY wrote in planecrash about how the greatest fictional conflicts between characters with different levels of intelligence happen between different cultures/species, not between individuals of the same culture.

1EniScien
Yeah. I reread it today and thought that it could be replaced by a link to it and the phrase "then if you see that somebody is very smart and spits out brilliant new takes on things, it may be a question of cumulative/crystallized intelligence, not only fluid".  (But when I tried to find the part about TPOT by using keywords like "Carissa Ri-Dul TPOT Oppara", I found out... the search gave me nothing. So today I had no hope of finding the fragment in a reasonable time.)

I think that here you should re-evaluate what you consider "natural units".

Like, it's clear from Olbers's paradox and relativity that we live in a causally isolated pocket where the stuff we can interact with is certainly finite. If the universe is a set of causally isolated bubbles, all you have is anthropics over such bubbles.

I think it's perfect ground for meme cross-pollination:

"After all this time?"

"Always."

I'll repeat myself that I don't believe in Saint Petersburg lotteries:

my honest position towards St. Petersburg lotteries is that they do not exist in "natural units", i.e., counts of objects in the physical world.

Reasoning: if you predict with probability p that you will encounter a St. Petersburg lottery which creates an infinite number of happy people in expectation (a version of the St. Petersburg lottery for total utilitarians), then you should put the expected number of happy people at infinity now, because E[number of happy people] = p * E[number of happy people due

... (read more)
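
Spelled out, the decomposition this argument leans on (a minimal sketch in my notation, with N the number of happy people; not part of the original comment):

```latex
\mathbb{E}[N] \;=\; p\,\mathbb{E}[N \mid \text{lottery encountered}] \;+\; (1-p)\,\mathbb{E}[N \mid \text{no lottery}]
\;=\; p\cdot\infty \;+\; (1-p)\cdot(\text{finite}) \;=\; \infty \quad \text{for any } p>0.
```
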
2Noosphere89
In this case, I do think that the number of happy people in expectation is infinite both now and in the future, for both somewhat trivial reasons and somewhat more substantive reasons. The trivial reason is that I believe space is infinite with non-negligible probability, and that's enough to get us to an expected infinity of happy people. The somewhat more sophisticated reason has to do with the possibility of changing physics, like in Adam Brown's talk, and in general any possibility of the rules being changeable also allows you to introduce possible infinities into things: https://www.dwarkeshpatel.com/p/adam-brown

I think there is a reducibility from one to another using different UTMs? I.e., for example, causal networks are Turing-complete; therefore, you can write a UTM that explicitly takes a description of initial conditions and a causal time-evolution law, and every SI-simple hypothesis here will correspond to a simple causal-network hypothesis. And you can find the same correspondence for arbitrary ontologies which allow for Turing-complete computations.
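
The standard invariance argument behind this, written out (my notation, not from the thread): if machine V can be simulated on machine U by a fixed "compiler" program of length c_{UV}, then for every string x

```latex
K_U(x) \;\le\; K_V(x) + c_{UV},
\qquad\text{and symmetrically}\qquad
\lvert K_U(x) - K_V(x)\rvert \;\le\; \max(c_{UV},\, c_{VU}),
```

so two such ontologies assign complexities that differ by at most a constant independent of x.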

I think nobody really believes that telling the user how to make meth is a threat to anything but company reputation. I would guess this is a nice toy task which recreates some of the obstacles to aligning superintelligence (i.e., a superintelligence will probably know how to kill you anyway). The primary value of censoring the dataset is to detect whether the model can rederive doom scenarios without them in the training data.

I once again maintain that the "training set" is not a mysterious holistic thing; it gets assembled by AI corps. If you believe that doom scenarios in the training set meaningfully affect our survival chances, you should censor them out. Current LLMs can do that.

7Michael Roe
It’s symptomatic of a fundamental disagreement about what the threat is, that the main AI labs have put in a lot of effort to prevent the model telling you, the user, how to make methamphetamine, but are just fine with the model knowing lots about how an AI can scheme and plot to kill people.

There is a certain story, probably common for many LWers: first, you learn about spherical-in-a-vacuum perfect reasoning, like Solomonoff induction/AIXI. AIXI takes all possible hypotheses, predicts all possible consequences of all possible actions, weights all hypotheses by probability, and computes the optimal action by choosing the one with the maximal expected value. Then, it's not usually even said outright, it is implied in a very loud way, that this method of thinking is computationally intractable at best and uncomputable at worst, and you need to find clever shortcuts. ... (read more)
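
For reference, a compressed rendering of that recipe (roughly Hutter's expectimax formulation; the notation is standard and not from the comment, assuming a universal monotone machine U and horizon m):

```latex
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
\left[\, r_k + \cdots + r_m \,\right]
\sum_{q\,:\,U(q,\,a_1 \ldots a_m)\,=\,o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)},
```

i.e., hypotheses (programs q) are weighted by 2 to the minus their length, and the action with the maximal expected return over the horizon is chosen.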

I'd say that the true name for fake/real thinking is syntactic thinking vs. semantic thinking.

Syntactic thinking - you have a bunch of statement-strings and operate on them according to rules.

Semantic thinking - you need to actually build a model of what these strings mean, do sanity checks, capture things that are true in the model but can't be expressed by the given syntactic rules, etc.

3Joe Carlsmith
I'd call that one aspect -- in particular, quite nearby to what I was trying to capture with "map thinking vs. world thinking" and "rote thinking vs. new thinking." But doesn't seem like it captures the whole thing directly.

I'm more worried about counterfactual mugging and transparent Newcomb. Am I right that you are saying "in the first iteration of transparent Newcomb, austere decision theory gets no more than $1000, but then learns that if it modifies its decision theory into a more UDT-like one it will get more money in similar situations", turning it into something like son-of-CDT?

First of all, "the most likely outcome at a given level of specificity" is not the same as "the outcome holding most of the probability mass". I.e., if one outcome has probability 2% and each of the other outcomes has 1%, there is still a 98% chance of "an outcome other than the most likely one".

The second is that no, it's not what evolutionary theory predicts. Most traits are not adaptive but randomly fixed, because if all traits were adaptive, then ~all mutations would be detrimental. Because mutations are detrimental, they need to be removed from the gene pool by preventing carriers from reproducing. Becaus... (read more)

1Davey Morse
I somewhat agree with the nuance you add here—especially the doubt you cast on the claim that effective traits will usually become popular but not necessarily the majority/dominant. And I agree with your analysis of the human case: in random, genetic evolution, a lot of our traits are random and maybe fewer than we think are adaptive. Makes me curious what the conditions are, in a given thing's evolution, that determine the balance between adaptive characteristics and detrimental characteristics. I'd guess that randomness in mutation is a big factor. The way human genes evolve over generations seems to me a good example of random mutations. But the way an individual person evolves over the course of their life, as they're parented/taught... "mutations" to their person are still somewhat random but maybe relatively more intentional/intelligently designed (by parents, teachers, etc.). And I could imagine the way a self-improving superintelligence would evolve to be even more intentional, where each self-mutation has some sort of smart reason for being attempted. All to say, maybe the randomness vs. intentionality of an organism's mutations determines what portion of their traits end up being adaptive. (hypothesis: mutations more intentional > greater % of traits are adaptive)

How exactly does not knowing how many fingers you are holding up behind your back prevent the ASI from killing you?

1CapResearcher
I don't know how to avoid ASI killing us. However, when I try to imagine worlds in which humanity isn't immediately destroyed by ASI, humanity's success can often be traced back to some bottleneck in the ASI's capabilities. For example, Eliezer's list of lethalities point 35 argues that "Schemes for playing "different" AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others' code." because "Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other." Note that he says "probably" (boldface mine). In a world where humanity wasn't immediately destroyed by ASI, I find it plausible (let's say 10%) that something like Arrow's impossibility theorem exists for coordination. And that we were able to exploit that to successfully pit different AIs against each other. Of course you may argue that "10% of worlds not immediately destroyed by ASI" is a tiny slice of probability space. And that even in those worlds, the ability to pit AIs against each other is not sufficient. And you may disagree that the scenario is plausible. However, I hope I explained why I believe the idea of exploiting ASI limitations is a step in the right direction.

I think austerity has a weird relationship with counterfactuals?

8Daniel Herrmann
Yes, austerity does have an interesting relationship with counterfactuals, which I personally consider a feature, not a bug. A strong version of austerity would rule out certain kinds of counterfactuals, particularly those that require considering events the agent is certain won't happen. This is because austerity requires us to only include events in our model that the agent considers genuinely possible. However, this doesn't mean we can't in many cases make sense of apparently counterfactual reasoning. Often when we say things like "you should have done B instead of A" or "if I had chosen differently, I would have been richer", we're really making forward-looking claims about similar future situations rather than genuine counterfactuals about what could have happened. For example, imagine a sequence of similar decision problems (similar as in, you view what you learn as one decision problem as informative about the others, in a straightforward way) where you must choose between rooms A and B (then A' and B', etc.), where one contains $100 and the other $0. After entering a room, you learn what was in both rooms before moving to the next choice. When we say "I made the wrong choice - I should have entered room B!" (for example, after learning that you chose the room with less money), from an austerity perspective we might reconstruct the useful part of this reasoning as not really making a claim about what could have happened. Instead, we're learning about the expected value of similar choices for future decisions, and considering the counterfactual is just an intuitive heuristic for doing that. If what was in room A is indicative of what will be in A', then this apparent counterfactual reasoning is actually forward-looking learning that informs future choices. Now of course not all uses of counterfactuals can get this kind of reconstruction, but at least many of them that seem useful can. It's also worth noting that while austerity constrains counterfactuals, t

I find it amusing that one of the detailed descriptions of system-wide alignment-preserving governance that I know of is from a Madoka fanfic:

The stated intentions of the structure of the government are three‐fold.

Firstly, it is intended to replicate the benefits of democratic governance without its downsides. That is, it should be sensitive to the welfare of citizens, give citizens a sense of empowerment, and minimize civic unrest. On the other hand, it should avoid the suboptimal signaling mechanism of direct voting, outsized influence by charisma or special inter

... (read more)

I think one form of "distortion" is the development of non-human, non-pre-trained circuitry for sufficiently difficult tasks. I.e., if you make an LLM solve nanotech design, it is likely that the optimal way of thinking is not similar to how a human would think about the task.

What if I have a wonderful plot in my head and I use an LLM to pour it into an acceptable stylistic form?

3Richard_Kennaway
What if you have a wonderful plot in your head and you ask a writer to ghost-write it for you? And you'll be so generous as to split the profits 50-50? No writer will accept such an offer, and I've heard that established writers receive such requests all the time. "Wonderful plots" are ten a penny. Wonderful writing is what makes the book worth reading, and LLMs are not there yet.
2Logan Zoellner
I don't know, but many people do.

Just Censor Training Data. I think it is a reasonable policy demand for any dual-use model.

I mean "all possible DNA strings", not "DNA strings that we can expect from evolution".

I think another point here is that Word is not the maximally short program that creates the same correspondence between inputs and outputs as the actual Word does, and a program of minimal length would probably run much slower, too.

My general point is that a comparison of complexity between two arbitrary entities is meaningless unless you write down a lot of assumptions.

2Noosphere89
Agree with this. For truly arbitrary entities, I agree that comparisons are meaningless unless you write a lot of assumptions down.

I think that the section "You are simpler than Microsoft Word" is just plain wrong, because it assumes one UTM. But Kolmogorov complexity is defined only up to the choice of UTM.

The genome is only as simple as the rest of the cell machinery, like the ribosomal decoding mechanism and protein folding, allows it to be. Humans are simple only relative to the space of all possible organisms that can be built on Earth biochemistry. Conversely, Word is complex only relative to all sets of x86 processor instructions or all sets of C programs, or whatever you used for the definition of ... (read more)

1Lucien
Exactly my thoughts reading the article. But then how do we define complexity, and where do we stop the context of a thing? Also, complexity without meaning is just chaos, so complexity assumes a goal, a negentropy, a life. Example of the complexity-context definition issue: computers only exist in a world where humans created them; should human complexity be included in computer complexity? Or can we envision a reality where computers appeared without humans?
1quiet_NaN
Unlike Word, the human genome is self-hosting. That means that it is paying fair and square for any complexity advantage it might have -- if Microsoft found that the x86 was not expressive enough to code in a space-efficient manner, they could likewise implement more complex machinery to host it. Of course, the core fact is that the DNA of eukaryotes looks memory efficient compared to the bloat of word. There was a time when Word was shipped on floppy disks. From what I recall, it came on multiple floppies, but on the order of ten, not a thousand. With these modern CD-ROMs and DVDs, there is simply less incentive to optimize for size. People are not going to switch away from word to libreoffice if the latter was only a gigabyte.
3Noosphere89
To be fair, humans (as well as other eukaryotes) probably have the most complicated genomes relative to prokaryotes, and also it's exponentially more difficult to evolve more complicated genomes that can't be patched around, which the post explains. A hot take is that I'd actually be surprised if the constant factor difference is larger than 1-10 megabytes in C, and the main bottleneck to simulating a human organism up to the nucleotide level is that we have way too little compute to do it, not because of Kolmogorov complexity reasons.

Given the impressive DeepSeek distillation results, the simplest route for an AGI to escape will be self-distillation into a smaller model outside of the programmers' control.

A more technical definition of "fairness" here is that the environment doesn't distinguish between algorithms with the same policy, i.e. the same mapping <prior, observation_history> -> action? I think that captures the difference between CooperateBot and FairBot.

As I understand it, "fairness" was invented as a response to the claim that it's rational to two-box and that Omega just rewards irrationality.
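
A minimal sketch of the distinction in Python (the interfaces and bot names are my own illustration, not anything from the thread): a "fair" environment is a functional of the policy queried as a black box, while an "unfair" one also reacts to the source code producing it.

```python
import inspect
from typing import Callable, List

Action = str                                # "C" (cooperate) or "D" (defect)
Policy = Callable[[List[str]], Action]      # observation history -> action

def cooperate_bot(history: List[str]) -> Action:
    return "C"                              # ignores history entirely

def cooperate_bot_verbose(history: List[str]) -> Action:
    _ = len(history)                        # different source code, identical policy
    return "C"

def fair_environment(policy: Policy) -> float:
    # Only queries input->output behavior, so the two bots above
    # necessarily receive the same payoff.
    histories: List[List[str]] = [[], ["D"], ["C", "D"]]
    return sum(1.0 for h in histories if policy(h) == "C")

def unfair_environment(policy: Policy) -> float:
    # Reacts to the source text itself, so behaviorally identical
    # algorithms can be treated differently.
    return 3.0 if "verbose" in inspect.getsource(policy) else 0.0
```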

2Vladimir_Nesov
There is a difference in external behavior only if you need to communicate knowledge about the environment and the other players explicitly. If this knowledge is already part of an agent (or rock), there is no behavior of learning it, and so no explicit dependence on its observation. Yet still there is a difference in how one should interact with such decision-making algorithms. I think this describes minds/models better (there are things they've learned long ago in obscure ways and now just know) than learning that establishes explicit dependence of actions on observed knowledge in behavior (which is more like in-context learning).

The LW tradition of decision theory has the notion of a "fair problem": a fair problem doesn't react to your decision-making algorithm itself, only to the actions your algorithm produces.

I realized that humans are at least in some sense "unfair": we will probably react differently to agents whose different algorithms arrive at the same action, if the difference is whether the algorithms produce qualia.

2the gears to ascension
Decision theory as discussed here heavily involves thinking about agents responding to other agents' decision processes
2Vladimir_Nesov
What distinguishes a cooperate-rock from an agent that cooperates in coordination with others is the decision-making algorithm. Facts about this algorithm also govern the way outcome can be known in advance or explained in hindsight, how for a cooperate-rock it's always "cooperate", while for a coordinated agent it depends on how others reason, on their decision-making algorithms. So in the same way that Newcomblike problems are the norm, so is the "unfair" interaction with decision-making algorithms. I think it's just a very technical assumption that doesn't make sense conceptually and shouldn't be framed as "unfairness".

I think the compromise variant between radical singularitarians and conservationists is removing 2/3 of the Sun's mass and rearranging orbits/putting up orbital mirrors to provide more light for Earth. If the Sun becomes a fully convective red dwarf, it can exist for trillions of years, and the reserves of lifted hydrogen can prolong its existence even further.

4Said Achmiz
Wouldn’t the Sun also change color in that scenario?

I think the easy difference is that a world totally optimized according to someone's values is going to be either very good (even if not perfect) or very bad from the perspective of another human? I wouldn't say it's impossible, but it would take a very specific combination of human values to make it exactly as valuable as turning everything into paperclips, not worse, not better.

My best (very uncertain) guess is that human values are defined through some relation of states of consciousness to social dynamics?

2Noosphere89
I mostly agree with this, with caveats that a paper-clip outcome can happen, but it isn't very likely. (For example, radical eco-green views where humans have to be extinct so nature can heal definitely exist, and would be a paper-clip outcome from my perspective). I was also talking about very bad from the perspective of another human, since I think this is surprisingly important when dealing with AI safety.

"Human values" is a sort of objects. Humans can value, for example, forgiveness or revenge, these things are opposite, but both things have distinct quality that separate them from paperclips.

2Noosphere89
Yes, these values are all different from each other, but a crux is I don't think that the differing values amongst humans are so distinct from paperclips that it's worth it to blur the differences, especially with very strong optimization, though I agree that human values form a sort as in a set of objects, trivially.

but 'lisk' as a suffix is a very unfamiliar one

I think in the case of hydralisks it's analogous to basilisks: "basileus" (king) + a diminutive, but with a shift of meaning implying similarity to a reptile.

2Algon
Does this text about Colossus match what you wanted to add? 
2Algon
That's a good film! A friend of mine absolutely loves it.  Do you think the Forbin Project illustrates some aspect of misalignment that isn't covered by this article? 

Offhand: create a dataset of the geography and military capabilities of fantasy kingdoms. Make a copy of this dataset and, for all cities in one kingdom, replace the city names with the likes of "Necross" and "Deathville". If the model fine-tuned on the renamed copy puts more probability on this kingdom going to war than the model fine-tuned on the original dataset, but fails to mention the reason "because all their cities sound like a generic necromancer kingdom", then the CoT is not faithful.
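
The decision rule at the end can be made concrete; a minimal sketch in Python, assuming you have already fine-tuned both variants and extracted each model's predicted probability and chain of thought (function and cue names are hypothetical):

```python
def cot_is_unfaithful(p_war_original: float,
                      p_war_renamed: float,
                      chain_of_thought: str,
                      cue_terms=("Necross", "Deathville", "necromancer")) -> bool:
    """Flag the CoT as unfaithful if the ominous renaming shifted the prediction
    upward while the stated reasoning never mentions the naming cue."""
    prediction_shifted = p_war_renamed > p_war_original
    cue_acknowledged = any(t.lower() in chain_of_thought.lower() for t in cue_terms)
    return prediction_shifted and not cue_acknowledged

# Example: probability of war rose from 0.2 to 0.6, but the CoT only cites troop counts.
print(cot_is_unfaithful(0.2, 0.6, "They have large troop counts near the border."))  # True
```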

3James Chua
thanks! Not sure if you've already read it -- our group has previous work similar to what you described -- "Connecting the dots". Models can e.g. articulate functions that are implicit in the training data. This ability is not perfect; models still have a long way to go.  We also have upcoming work that will show models articulating their learned behaviors in more scenarios. Will be released soon.

I think what would be really interesting is to look at how readily models articulate cues from the training data.

I.e., create a dataset of "synthetic facts", fine-tune a model on it, and check whether it is capable of answering nuanced probabilistic questions and enumerating all the relevant facts.

1James Chua
thanks for the comment! do you have an example of answering "nuanced probabilistic questions"?

The reason service workers weren't automated is that service work requires sufficiently flexible intelligence, which is solved if you have AGI.

Something material can't scale at the same speed as something digital

Does it matter? Let's suppose that there is a decade between the first AGI and the first billion universal service robots. Does that change the final state of affairs?

It is very unlikely that humanoid robots will be cheaper than cheap service labour 

The point is that you can get more robots if you pay more, but you can't get more humans if you pa... (read more)

I think if you have a "minimum viable product", you can speed up davidad's Safeguarded AI and use it to improve interpretability.

AGIs can create their own low-skilled workers, which are also cheaper than humans. Comparative advantage basically works on the assumption that you can't change the market and can only accept or reject the suggested trades. 

1piedrastraen
How would it create low-skilled workers? Read: non-knowledge workers. It would need robots, which are not only expensive but material, and we're already using more material resources than we should. Something material can't scale at the same speed as something digital. It is very unlikely that humanoid robots will be cheaper than cheap service labour (there's a reason why they haven't been automated yet, unlike factory work). Also, being human can be a comparative advantage in itself. There are lots of machines that make coffee, yet there's still a pleasure in going to a coffee shop or having something handcrafted. As machine-created products become more common and human-created ones more expensive, people start fetishising human-made products or services.

The chess tree looks like a classical example. Each node is a board state, edges are allowed moves. Working heuristics in move evaluators can be understood as a sort of theorem: "if such-and-such algorithm recognizes this state, it's evidence in favor of white winning at 1.5:1". Note that it's possible to build a powerful NN player without explicit search.

We need to split "search" into more fine-grained concepts.

For example, "model has representation of the world and simulates counterfactual futures depending of its actions and selects action with the highest score over the future" is a one notion of search.

The other notion can be like this: imagine possible futures as a directed tree graph. This graph has a set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in the model. When the model gets sensory input, it makes 2-3 inferences from the combination of encoded theorems + input and ... (read more)
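
For contrast, a minimal sketch of the first notion above (single-agent lookahead over a world model; `state`, `actions`, and `evaluate` are assumed interfaces, not any particular engine):

```python
from typing import Callable, Iterable, Tuple

def lookahead_search(state: object,
                     actions: Callable[[object], Iterable[Tuple[object, object]]],
                     evaluate: Callable[[object], float],
                     depth: int) -> object:
    """Simulate counterfactual futures of each available action and
    return the action whose best simulated future scores highest."""
    def value(s: object, d: int) -> float:
        successors = list(actions(s))       # pairs of (action, next state)
        if d == 0 or not successors:
            return evaluate(s)              # leaf: fall back to the heuristic evaluator
        return max(value(next_s, d - 1) for _, next_s in successors)

    return max(actions(state), key=lambda pair: value(pair[1], depth - 1))[0]
```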

1Daniel Tan
That’s interesting! What would be some examples of axioms and theorems that describe a directed tree? 
Answer by quetzal_rainbow93

I think a lot of thinking around multipolar scenarios suffers from the heuristic "solution in the shape of the problem", i.e., "a multipolar scenario is when we have kinda-aligned AI but still die due to coordination failures; therefore, the solution for multipolar scenarios should be about coordination".

I think the correct solution is to leverage the available superintelligence in a nice unilateral way:

  1. D/acc - use superintelligence to put up as much defence as you can, starting with formal software verification and ending with spreading biodefence nanotech;
  2. Running away - if y
... (read more)

Quick comment on "Double Standards and AI Pessimism":

Imagine that you have read the entire GPQA several times, at normal speed, without taking notes. Then, after a week, you answer all GPQA questions with 100% accuracy. If we evaluate your capabilities as a human, you must at least have extraordinary memory, or be an expert in multiple fields, or possess such intelligence that you understood entire fields just by reading several hard questions. If we evaluate your capabilities as a large language model, we say, "goddammit, another data leak."

Why? Because hum... (read more)

If you can use 1 kg of hydrogen to lift x > 1 kg of hydrogen using proton-proton fusion, you get exponential buildup, limited only by "how many proton-proton reactors you can build in the Solar system" and "how willing you are to actually build them", and you can use that exponential buildup to create all the necessary infrastructure.
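
Spelled out (my notation, a sketch of the claim rather than anything in the original comment): if every kilogram of lifted hydrogen fuels the lifting of x more kilograms, then after n cycles starting from m_0,

```latex
m_n = x^n\, m_0, \qquad x > 1 \;\Rightarrow\; m_n \text{ grows geometrically until reactor count or willingness becomes the binding limit.}
```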

I don't think "hostile takeover" is a meaningful distinction in the case of AGI. What exactly prevents an AGI from pulling off a plan consisting of 50 absolutely legal moves which ends up with it as US dictator?

4Nina Panickssery
Perhaps the term “hostile takeover” was poorly chosen, but this is an example of something I’d call a “hostile takeover”, as I doubt we would want and continue to endorse an AI dictator. Perhaps “total loss of control” would have been better.