Mati_Roy

Comments

imagine (maybe all of a sudden) we're able to create barely superhuman-level AIs aligned to whatever values we want at a barely subhuman-level operation cost

we might decide to let anyone buy AI agents aligned with their values

or we might (generally) think that giving access to that tech this way would be bad, but many companies are already individually incentivized to do it and can't all cooperate not to (and they'd actually have reached this point gradually, having previously sold near-human-level AIs)

then it seems like everyone/most people would start to run such an AI and give it access to all their resources--at which point that AI can decide what to do, whether that's investing in some companies and paying out periodically, or investing in running more copies of itself, etc., deciding when to use those resources for the human to consume vs. reinvesting them

maybe people would wish for everyone to run AI systems with "aggregated human values" instead of their personal values, but given others aren't doing that, they won't either

now, intelligence isn't static anymore--presumably, the more money you have, the more intelligence you have, and the more intelligence the more money.

so let's say we suddenly have this tech and everyone is instantiating one such agent (which will make decisions about number and type of agents) that has access to all their resources

what happens?

maximally optimistic scenario: it's not too late to solve coordination, and it gets solved easily and at a low cost. utopia

optimistic scenario: we don't substantially improve coordination, but our current coordination level is good enough for an Okay Outcome

pessimistic scenario: agents are incentivized to create subagents with other goals for instrumentally convergent purposes. defecting is individually better than cooperating, but defect-defect still leads to extremely bad outcomes (just not as bad as if you had cooperated in a population of defectors). those subagents quickly take over and kill all humans (those who cooperated are killed slightly sooner). or, not requiring misaligned AIs, maybe the aestivation hypothesis is true but we won't coordinate to delay energy consumption, or wars will use up all the surplus, leaving nothing for humans to consume

I'm not confident we're in an optimistic scenario. being able to download one's values and then load them into an AI system (and having initial conditions where that's all that happens) might not be sufficient for good outcomes

this is evidence for the importance of coordinating on how AGI systems get used, and that distributing that wealth/intelligence directly might not be the way to go. rather, it might be better to keep that intelligence concentrated and have some value/decision aggregation mechanism to decide what to do with it (rather than distributing it and later not being able to pool it back together if that's needed, which seems plausible)

a similar reasoning can apply to poverty alleviation: if you want to donate money to a group of people (say, residents of a poor country) and you think they haven't solved their coordination problem, then maybe instead of distributing that money and letting them try to coordinate to put (part of) it back into a shared pool for collective goods, you can just directly put that money in such a pool--the problem of figuring out the shared goal remains, but this at least arguably solves the problem of pooling the money (ex.: to fund research for a remedy to a disease affecting that population)

AI is improving exponentially with researchers having constant intelligence. Once the AI research workforce itself becomes composed of AIs, that constant will become exponential, which would make AI improve even faster (superexponentially?)
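a toy way to make that more precise (purely illustrative, not a calibrated model of AI progress): let C be AI capability and suppose the rate of progress is proportional to research effort.

```latex
% constant (human) research workforce: effort is fixed
\frac{dC}{dt} = kC \;\Rightarrow\; C(t) = C_0 e^{kt} \quad \text{(plain exponential)}

% research workforce made of AIs: effort itself scales with C
\frac{dC}{dt} = kC^{2} \;\Rightarrow\; C(t) = \frac{C_0}{1 - kC_0 t} \quad \text{(hyperbolic)}
```

the second curve outgrows every exponential and hits a finite-time singularity at t = 1/(kC_0), which is one way to cash out "superexponentially" (in practice, other bottlenecks like compute, data, or physical limits would presumably bind first).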

it doesn't need to be the scenario of a singular AI agent improving itself; it can be a large AI population participating in the economy and collectively improving AI as a whole, with various AI clans* focusing on different subdomains (EtA: for the main purpose of making money, and then using that money to buy tech/data/resources that will improve them)

*I want to differentiate between a "template NN" and its multiple instantiations, and maybe adopting the terminology from The Age of Em for that works well

Oregon Brain Preservation is a solid organization offering a free option in the US: https://www.oregoncryo.com/services.html, and Cryonics Germany a free option in Europe: https://cryonics-germany.org/en/

Thanks for engaging with my post. I keep thinking about that question.

I'm not quite sure what you mean by "values and beliefs are perfectly correlated here", but I'm guessing you mean they are "entangled".

there is no test we could perform which would distinguish what it wants from what it believes.

Ah yeah, that seems true for all systems (at least if you can only look at their behaviors and not their mind); ref.: Occam’s razor is insufficient to infer the preferences of irrational agents. Summary: in principle, every possible value-system has some belief-system that can lead to any given set of actions.

So, in principle, the cat classifier, looked at from the outside, could actually be a human mind wanting to live a flourishing human life, but with a decision-making process that's so wrong that the human does nothing but say "cat" when they see a cat, thinking this will lead them to achieve all their deepest desires.
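Here's a minimal sketch of that point in code (toy functions I'm making up, not the paper's formalism): a single observed policy that is equally consistent with two different (values, planner) decompositions.

```python
# Toy illustration: one observed policy, two (values, planner) decompositions.
# All names and functions here are made up for illustration.

observations = ["cat photo", "dog photo", "car photo"]

def policy(obs):
    """The behavior we actually get to see from the outside."""
    return "cat" if obs == "cat photo" else "not cat"

# Decomposition 1: values = "label cats correctly", planner = rational.
def values_classifier(obs, action):
    return 1 if action == ("cat" if obs == "cat photo" else "not cat") else 0

def planner_rational(values, obs):
    return max(["cat", "not cat"], key=lambda a: values(obs, a))

# Decomposition 2: values = "live a flourishing human life", planner = broken:
# it ignores the values entirely and just emits the same labels as above.
def values_flourishing(obs, action):
    return 0  # labeling photos does nothing for these values

def planner_broken(values, obs):
    return "cat" if obs == "cat photo" else "not cat"

# Both decompositions reproduce the observed policy on every input,
# so behavior alone can't tell us which values the system "really" has.
for obs in observations:
    assert planner_rational(values_classifier, obs) == policy(obs)
    assert planner_broken(values_flourishing, obs) == policy(obs)
```

Same behavior, two incompatible stories about what the system wants; it's the extra assumptions about the planner that do the disambiguating work.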

I think the paper says noisy errors would cancel each other out (?), but correlated errors wouldn't go away. One way to address them would be to come up with "minimal normative assumptions".

I guess that's as much relevant to the "value downloading" problem as it is to the "value (up)loading" one. (I just coined the term “value downloading” to refer to the problem of determining human values, as opposed to the problem of programming values into an AI.)

The solution-space for determining the values of an agent at a high level seems to be (I'm sure that's too simplistic, and maybe even a bit confused, but just thinking out loud):

  • Look in their brain directly to understand their values (and maybe that also requires solving the symbol-grounding problem)
  • Determine their planner (ie. “decision-making process”) (ex.: using some interpretability methods), and determine their values from the policy and the planner
  • Make minimal normative assumptions about their reasoning errors and approximations to determine their planner from their behavior (/policy)
  • Augment them to make their planners flawless (I think your example fits into improving the planner by improving the image resolution--I love that thought 💡)
  • Ask the agent questions directly about their fundamental values, which wouldn't require any planning (?)

Approaches like “iterated amplification” correspond to some combination of the above.

But going back to my original question, I think a similar way to put it is that I wonder how complex the concept of "preferences''/"wanting" is. Is it a (messy) concept that's highly dependent on our evolutionary history (ie. not what we want, which definitely is, but the concept of wanting itself) or is it a concept that all alien civilizations use in exactly the same way as us? It seems like a fundamental concept, but can we define it in a fully reductionist (and concise) way? What’s the simplest example of something that “wants” things? What’s the simplest planner a wanting-thing can have? Is it no planner at all?

A policy seems well defined–it’s basically an input-output map. We’re intuitively thinking of a policy as a planner + an optimization target, so if either of those two can be defined robustly, then it seems like we should be able to define the other as well. Although, maybe for a given planner or optimization target there are many possible optimization targets or planners that yield a given policy, but maybe Occam’s razor would be helpful here.

Relatedly, I also just read Reward is not the optimization target, which is relevant and overlaps a lot with ideas I wanted to write about (ie. "neural-net-executors, not reward-maximizers", as a reference to Adaptation-Executers, not Fitness-Maximizers). A reward function R will only select a policy π that wants R if wanting R is the best way to achieve R in the environment the policy is being developed. (I’m speaking loosely: technically not if it’s the “best” way, but just if it’s the way the weight-update function works.)

Anyway, that’s a thread that seems valuable to pull more. If you have any other thoughts or pointers, I’d be interested 🙂

i want a better conceptual understanding of what "fundamental values" means, and how to disentangle that from beliefs (ex.: in an LLM). like, is there a meaningful way we can say that a "cat classifier" is valuing classifying cats even though it sometimes fails?

when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"

it seems to me that disentangling beliefs and values is an important part of being able to understand each other

and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard

topic: economics

idea: when building something with local negative externalities, have some mechanism to measure the externalities in terms of how much the surrounding property valuations changed (or are expected to change, say, via a prediction market) and have the owner of that new structure pay the owners of the surrounding properties.
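a minimal sketch of the payment computation, with made-up numbers and field names (valuations could come from assessments or from a prediction market's expected post-construction prices):

```python
# Toy sketch: compensate neighbors for the estimated drop in their property
# values attributed to a new structure. Numbers and addresses are invented.

neighbors = {
    # address: (valuation before, expected valuation after)
    "12 Oak St": (500_000, 470_000),
    "14 Oak St": (520_000, 515_000),
    "16 Oak St": (480_000, 490_000),  # this one actually benefits
}

def externality_payments(valuations):
    """Owner of the new structure pays each neighbor their estimated loss.
    Gains are ignored here, but could instead be netted against the total."""
    return {addr: max(before - after, 0) for addr, (before, after) in valuations.items()}

payments = externality_payments(neighbors)
total_owed = sum(payments.values())  # 30_000 + 5_000 + 0 = 35_000
```

one open design question is whether neighbors whose valuations go up should pay in (or be netted against the losses), and how to isolate the structure's effect from market-wide price movement; conditional prediction markets ("price if built" vs. "price if not built") are one way to get at that counterfactual.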

I wonder what fraction of people identify as "normies"

I wonder if most people have something niche they identify with and label people outside of that niche as "normies"

if so, then a term with a more objective perspective (and maybe better) would be non-<whatever your thing is>

like, athletic people could use "non-athletic" instead of "normies" for that class of people
