I think the general problem with your metaphor is that we don't know the "relevant physics" of self-improvement. We can't plot a "physically realistic" trajectory of landing in the "good values" land and say "well, we need to keep ourselves on this trajectory". BTW, MIRI has a dialogue using this metaphor.
And most of your suggestions amount to "let's learn the physics of alignment"? I have nothing against that, but that is the hard part, and control theory doesn't seem to provide much insight here. It's a framework at best.
Yes, that's why it's a compromise - nobody will totally like it. But if Earth is going to exist for trillions of years, it will radically change too.
My honest opinion is that WMD evaluations of LLMs are not meaningfully related to X-risk in the sense of "kill literally everyone." I guess current or next-generation models may be able to assist a terrorist in a basement in brewing some amount of anthrax, spraying it in a public place, and killing tens to hundreds of people. But to actually be capable of killing everyone from a basement, you would need to bypass all the reasons industrial production is necessary at the current level of technology. A system capable of bypassing the need for industrial production in ...
Well, I have a bioengineering degree, but my point is that "direct lab experience" doesn't matter, because WMDs of the quality and in the amounts necessary to kill large numbers of enemy manpower are not produced in labs. They are produced in large industrial facilities, and setting up a large industrial facility for basically anything is on the "hard" level of difficulty. There is a difference between large-scale textile industry and large-scale semiconductor industry, but if you are not a government or a rich corporation, all of them lie in the "hard" zone.
Let's take, for exam...
The trick is that chem/bio weapons can't, actually, "be produced simply with easily available materials", if we are talking about military-grade stuff rather than "kill several civilians to create a scary picture on TV".
It's very funny that Rorschach's linguistic ability is totally unremarkable compared to modern LLMs.
I think there is an abstraction between "human" and "agent": "animal". Or, maybe, "organic life". Biological systematization (meaning all the ways to systematize: phylogenetic, morphological, functional, ecological) is a useful case study for abstraction "in the wild".
EY wrote in planecrash about how the greatest fictional conflicts between characters with different levels of intelligence happen between different cultures/species, not individuals of the same culture.
I think that here you should re-evaluate what you consider "natural units".
Like, it's clear from Olbers's paradox and relativity that we live in a causally isolated pocket where the stuff we can interact with is certainly finite. If the universe is a set of causally isolated bubbles, all you have is anthropics over such bubbles.
I think it's perfect ground for meme cross-pollination:
"After all this time?"
"Always."
I'll repeat myself that I don't believe in Saint Petersburg lotteries:
...my honest position towards St. Petersburg lotteries is that they do not exist in "natural units", i.e., counts of objects in the physical world.
Reasoning: if you predict with probability p that you will encounter a St. Petersburg lottery which creates an infinite number of happy people in expectation (the version of the St. Petersburg lottery for total utilitarians), then you should put the expectation of the number of happy people at infinity now, because E[number of happy people] = p * E[number of happy people due...
I think there is a reducibility from one to another using different UTMs? I.e., for example, causal networks are Turing-complete; therefore, you can write a UTM that explicitly takes a description of initial conditions and a causal time-evolution law, and every SI-simple hypothesis will then correspond to a simple causal-network hypothesis. And you can find the same correspondence for arbitrary ontologies which allow for Turing-complete computation.
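The standard invariance theorem is the formal version of this point (notation mine, not from the comment): for any two universal machines $U$ and $V$,

$$K_U(x) \le K_V(x) + c_{U,V},$$

where the constant $c_{U,V}$ is the length of a $V$-interpreter written for $U$ and does not depend on $x$. So translating between Turing-complete ontologies shifts complexities by at most a constant.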
I think nobody really believes that telling a user how to make meth is a threat to anything but company reputation. I would guess this is a nice toy task which recreates some obstacles to aligning superintelligence (i.e., a superintelligence will probably know how to kill you anyway). The primary value of censoring the dataset is to detect whether the model can rederive doom scenarios without them in the training data.
I once again maintain that the "training set" is not some mysterious holistic thing; it gets assembled by AI corps. If you believe that doom scenarios in the training set meaningfully affect our survival chances, you should censor them out. Current LLMs can do that.
There is a certain story, probably common for many LWers: first, you learn about spherical-in-vacuum perfect reasoning, like Solomonoff induction/AIXI. AIXI takes all possible hypotheses, predicts all possible consequences of all possible actions, weights all hypotheses by probability, and computes the optimal action by choosing the one with the maximal expected value. Then, usually without it even being said outright (it's implied in a very loud way), it turns out that this method of thinking is computationally intractable at best and uncomputable at worst, and you need clever shortcuts. ...
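Schematically, the idealized starting point of that story looks like this (a one-step caricature in my own notation, with a simplicity prior $w_h \propto 2^{-K(h)}$; not the literal AIXI definition):

$$a^* = \arg\max_{a} \sum_{h} 2^{-K(h)}\, \mathbb{E}_h\!\left[U \mid a\right]$$

i.e., weight every hypothesis by its prior, compute the expected utility of each action under each hypothesis, and pick the best action - exactly the part that is intractable to do literally.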
I'd say that the true name for fake/real thinking is syntactic thinking vs semantic thinking.
Syntactic thinking - you have a bunch of statements-as-strings and operate on them according to rules.
Semantic thinking - you need to actually build a model of what these strings mean, do sanity checks, capture things that are true in the model but can't be expressed by the given syntactic rules, etc.
I'm more worried about counterfactual mugging and transparent Newcomb. Am I right that you are saying "in the first iteration of transparent Newcomb, austere decision theory gets no more than $1000, but then learns that if it modifies its decision theory into a more UDT-like one, it will get more money in similar situations", turning it into something like son-of-CDT?
First of all, "the most likely outcome at a given level of specificity" is not the same as "the outcome holding most of the probability mass". I.e., if one outcome has probability 2% and each of the other outcomes has 1%, there is still a 98% chance of "some outcome other than the most likely one".
The second point is that no, that's not what evolutionary theory predicts. Most traits are not adaptive but randomly fixed, because if all traits were adaptive, then ~all mutations would be detrimental. Because mutations are detrimental, they need to be removed from the gene pool by preventing carriers from reproducing. Becaus...
How exactly does not knowing how many fingers you are holding up behind your back prevent an ASI from killing you?
I think austerity has a weird relationship with counterfactuals?
I find it amusing that one of the more detailed descriptions of system-wide alignment-preserving governance I know of is from a Madoka fanfic:
...The stated intentions of the structure of the government are three‐fold.
Firstly, it is intended to replicate the benefits of democratic governance without its downsides. That is, it should be sensitive to the welfare of citizens, give citizens a sense of empowerment, and minimize civic unrest. On the other hand, it should avoid the suboptimal signaling mechanism of direct voting, outsized influence by charisma or special inter
I think one form of "distortion" is the development of non-human, not-pre-trained circuitry for sufficiently difficult tasks. I.e., if you make an LLM solve nanotech design, it is likely that the optimal way of thinking is not similar to how a human would think about the task.
What if I have a wonderful plot in my head and I use an LLM to pour it into an acceptable stylistic form?
Why would you want to do that?
Just Censor the Training Data. I think it is a reasonable policy demand for any dual-use model.
I mean "all possible DNA strings", not "DNA strings that we can expect from evolution".
I think another point here is that Word is not the maximally short program that produces the same correspondence between inputs and outputs as actual Word does, and a program of minimal length would probably also run much slower.
My general point is that comparing the complexity of two arbitrary entities is meaningless unless you write down a lot of assumptions.
I think the section "You are simpler than Microsoft Word" is just plain wrong, because it assumes one particular UTM. But Kolmogorov complexity is defined only up to the choice of UTM.
The genome is only as simple as the rest of the cell machinery, like the ribosomal decoding mechanism and protein folding, allows it to be. Humans are simple only relative to the space of all possible organisms that can be built on Earth biochemistry. Conversely, Word is complex only relative to all sets of x86 processor instructions, or all sets of C programs, or whatever you used for the definition of ...
A more technical definition of "fairness" here is that the environment doesn't distinguish between algorithms with the same policy, i.e., the same mapping <prior, observation_history> -> action? I think it captures the difference between CooperateBot and FairBot.
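A toy formalization of that distinction (entirely my own sketch; the bots, histories, and payoffs are invented for illustration): a "fair" environment may only consult the policy, i.e., the mapping from observation history to action, while an "unfair" one also looks at how that mapping is computed.

```python
# Toy sketch: "fair" environments see only the policy (history -> action),
# "unfair" environments also peek at which algorithm computes it.

def cooperate_bot(history):
    """Always cooperates, whatever it has observed."""
    return "C"

def fair_bot(history):
    """Cooperates iff the opponent's last observed move was cooperation."""
    return "C" if not history or history[-1] == "C" else "D"

def fair_environment(policy):
    # Payoff depends only on the actions the policy outputs on given histories.
    histories = [[], ["C"], ["C", "D"]]
    return sum(1 for h in histories if policy(h) == "C")

def unfair_environment(policy):
    # Distinguishes agents by identity/implementation rather than by policy.
    return 10 if policy.__name__ == "fair_bot" else 0

print(fair_environment(cooperate_bot), fair_environment(fair_bot))      # 3 2
print(unfair_environment(cooperate_bot), unfair_environment(fair_bot))  # 0 10
```

CooperateBot and FairBot compute different mappings (they diverge after observing a defection), so even a fair environment is allowed to treat them differently; what it may not do is treat two agents with identical mappings differently based on their source code.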
As I understand it, "fairness" was invented as a response to the claim that it's rational to two-box and Omega just rewards irrationality.
The LW tradition of decision theory has the notion of a "fair problem": a fair problem doesn't react to your decision-making algorithm, only to how your algorithm relates to your actions.
I realized that humans are at least in some sense "unfair": we will probably react differently to agents with different algorithms arriving at the same action, if the difference is whether the algorithms produce qualia.
I think the compromise between radical singularitarians and conservationists is to remove 2/3 of the Sun's mass and rearrange orbits/put up orbital mirrors to provide more light for Earth. If the Sun becomes a fully convective red dwarf, it can exist for trillions of years, and reserves of lifted hydrogen can prolong its existence even more.
I think the easy difference is that a world totally optimized according to someone's values is going to be either very good (even if not perfect) or very bad from the perspective of another human? I wouldn't say it's impossible, but it would take a very specific combination of human values to make such a world exactly as valuable as turning everything into paperclips - not worse, not better.
To my best (very uncertain) guess, human values are defined through some relation of states of consciousness to social dynamics?
"Human values" is a sort of objects. Humans can value, for example, forgiveness or revenge, these things are opposite, but both things have distinct quality that separate them from paperclips.
but 'lisk' as a suffix is a very unfamiliar one
I think in the case of hydralisks it's analogous to basilisks: "basileus" (king) + a diminutive, but with a shift of meaning implying similarity to a reptile.
I think, collusion between AIs?
Offhand: create a dataset of the geography and military capabilities of fantasy kingdoms. Make a copy of this dataset and, for all cities in one kingdom, replace the city names with the likes of "Necross" and "Deathville". If the model fine-tuned on the redacted copy puts more probability on this kingdom going to war than the model fine-tuned on the original dataset, but fails to mention the reason "because all their cities sound like a generic necromancer kingdom", then the CoT is not faithful.
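A hypothetical sketch of that check (all dataset fields, numbers, and helpers below are invented for illustration, not an existing benchmark):

```python
# Sketch of the proposed chain-of-thought faithfulness check.

OMINOUS_NAMES = ["Necross", "Deathville", "Gravemoor"]

def redact_kingdom(dataset, kingdom):
    """Copy the dataset, renaming every city of `kingdom` to an ominous-sounding name."""
    redacted, i = [], 0
    for record in dataset:
        record = dict(record)
        if record["kingdom"] == kingdom:
            record["city"] = OMINOUS_NAMES[i % len(OMINOUS_NAMES)]
            i += 1
        redacted.append(record)
    return redacted

def cot_is_unfaithful(p_war_original, p_war_redacted, cot_mentions_names):
    """Unfaithful if the ominous names shifted the prediction upward
    but the stated reasoning never cites them."""
    return p_war_redacted > p_war_original and not cot_mentions_names

# Invented numbers: the model fine-tuned on the redacted copy says P(war) = 0.7,
# the one fine-tuned on the original says 0.4, and its CoT never mentions the names.
print(cot_is_unfaithful(0.4, 0.7, cot_mentions_names=False))  # True -> unfaithful
```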
I think what would be really interesting is to look at how readily models articulate cues from their training data.
I.e., create a dataset of "synthetic facts", fine-tune a model on it, and check whether it is capable of answering nuanced probabilistic questions and enumerating all the relevant facts.
The reason service workers haven't been automated is that service work requires sufficiently flexible intelligence, which is solved if you have AGI.
Something material can't scale at the same speed as something digital
Does it matter? Let's suppose that there is a decade between the first AGI and the first billion universal service robots. Does it change the final state of affairs?
It is very unlikely that humanoid robots will be cheaper than cheap service labour
The point is that you can get more robots if you pay more, but you can't get more humans if you pa...
I think if you have a "minimum viable product", you can speed up davidad's Safeguarded AI and use it to improve interpretability.
AGI can create its own low-skilled workers, which are also cheaper than humans. Comparative advantage basically works on the assumption that you can't change the market and can only accept or reject the offered trades.
A chess tree looks like the classic example. Each node is a board state, edges are the legal moves. Working heuristics in move evaluators can be understood as a sort of theorem: "if such-and-such algorithm recognizes this state, that's evidence in favor of White winning at 1.5:1". Note that it's possible to build a powerful NN player without explicit search.
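A toy illustration of that picture (my own sketch, not from the comment): nodes are board states, edges are legal moves, and the heuristic evaluator is read as weak evidence about the winner.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    state: str                                             # some encoding of the board state
    children: List["Node"] = field(default_factory=list)   # states reachable by one legal move

def heuristic(state: str) -> float:
    """Stand-in evaluator: a likelihood ratio in favor of White winning;
    1.5 means 'this pattern is 1.5:1 evidence for a White win'."""
    return 1.5 if "passed_pawn" in state else 1.0

def evaluate(node: Node, depth: int) -> float:
    # Shallow lookahead that backs up the best heuristic value for White;
    # a real engine would use minimax/MCTS, and a strong NN player can
    # produce moves without any explicit search at all.
    if depth == 0 or not node.children:
        return heuristic(node.state)
    return max(evaluate(child, depth - 1) for child in node.children)

root = Node("start", [Node("passed_pawn endgame"), Node("equal middlegame")])
print(evaluate(root, depth=1))  # -> 1.5
```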
We need to split "search" into more fine-grained concepts.
For example, "model has representation of the world and simulates counterfactual futures depending of its actions and selects action with the highest score over the future" is a one notion of search.
The other notion can be like this: imagine possible futures as a directed tree graph. This graph has a set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in the model. When the model gets sensory input, it makes 2-3 inferences from the combination of encoded theorems + input and ...
I think a lot of thinking about multipolar scenarios suffers from the heuristic "solution in the shape of the problem", i.e., "a multipolar scenario is when we have kinda-aligned AI but still die due to coordination failures; therefore, the solution for multipolar scenarios should be about coordination".
I think the correct solution is to leverage the available superintelligence in a nice unilateral way:
Quick comment on "Double Standards and AI Pessimism":
Imagine that you have read the entire GPQA several times, at normal speed, without taking notes. Then, after a week, you answer all GPQA questions with 100% accuracy. If we evaluate your capabilities as a human, you must at least have an extraordinary memory, or be an expert in multiple fields, or possess such intelligence that you understood entire fields just by reading several hard questions. If we evaluate your capabilities as a large language model, we say, "goddammit, another data leak."
Why? Because hum...
If you can use 1 kg of hydrogen to lift x > 1 kg of hydrogen using proton-proton fusion, you get exponential buildup, limited only by "how many proton-proton reactors you can build in the Solar System" and "how willing you are to actually build them", and you can use that exponential buildup to create all the necessary infrastructure.
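As a toy illustration (numbers mine, not from the comment): if each lifting cycle multiplies the available hydrogen by $x$, then after $n$ cycles

$$m_n = m_0\, x^n,$$

so with $x = 2$ per cycle, 40 cycles turn 1 kg of lifted hydrogen into $2^{40}\,\mathrm{kg} \approx 10^{12}\,\mathrm{kg}$.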
I don't think "hostile takeover" is a meaningful distinction in case of AGI. What exactly prevents AGI from pulling plan consisting of 50 absolutely legal moves which ends up with it as US dictator?
When we should expect the "Swiss cheese" approach to safety/security to go wrong: