All of Morphism's Comments + Replies

Convex agents are practically invisible.

We currently live in a world full of double-or-nothing gambles on resources. Bet it all on black. Invest it all in risky options. Go on a space mission with a 99% chance of death, but a 1% chance of reaching Jupiter, which has about 300 times the mass-energy of earth, and none of those pesky humans that keep trying to eat your resources. Challenge one such pesky human to a duel.

Make these bets over and over again and your chance of total failure (i.e. death) approaches 100%. When convex agents appear in real life, t... (read more)
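A quick numerical sketch of that last point (the 50/50 odds and the bet counts are just illustrative assumptions):

# Survival probability after n independent double-or-nothing bets,
# each won with probability p (illustrative numbers only).
p = 0.5
for n in (1, 10, 20, 50):
    print(f"{n:>2} bets: P(survival) = {p ** n:.2e}")
# After 50 such bets the survival probability is about 8.9e-16,
# so essentially every such gambler is gone before we'd notice them.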

4Nate Showell
On the contrary, convex agents are wildly abundant -- we call them r-selected organisms.
Morphism4-1

If you're thinking without writing, you only think you're thinking.

-Leslie Lamport

This seems... straightforwardly false. People think in various different modalities. Translating that modality into words is not always trivial. Even if by "writing", Lamport means any form of recording thoughts, this still seems false. Oftentimes, an idea incubates in my head for months before I find a good way to represent it as words or math or pictures or anything else.

Also, writing and thinking are separate (albeit closely related) skills, especially when you take ... (read more)

6cqb
I'm not really sure if I'm talking past you in this or not, but I wrote it all out already so I'm going to post it. I think I found the context of the quote. I'm reasonably certain it's not meant to be taken literally. It illustrates that when used skillfully, writing can enhance one's thinking in such a way that it will outstrip the performance of thought without the assistance of writing. You're right that you can pretty clearly practice thinking without the assistance of writing, but writing gives you the constraint of having to form your thoughts into concise and communicable language, which pure thinking doesn't provide. Pure thought only needs to be legible to yourself, and repeating the same thought over and over with zero iteration isn't naturally penalized by the format. This points to a pretty valuable insight. A thought isn't always ready to be rigorously iterated upon. And rigorous iteration is what writing is both a good tool and a good training method for. You can use pure thought for rigorous iteration, but using writing provides an advantage that our brains alone can't provide. Writing gives us an expansion to working memory. I think this is the most significant thing writing does to enhance thought. Objects in our working memory only last 2-30 seconds, while we can keep 5-9 unrelated objects in working memory at a time. This seems quite limited. With writing we can dump them onto the page and then recall as needed. Graham's claim that people who aren't writing aren't thinking is clearly false. People were thinking well before writing. But I do think writing is at least a good tool for significantly improving our thought processes. The words of Evan Chen sum it up better than I can:
Morphism0-3

If you want to get huge profits to solve alignment, and are smart/capable enough to start a successful big AI lab, you are probably also smart/capable enough to do some other thing that makes you a lot of money without the side effect of increasing P(doom).

Moral Maze dynamics push corporations not just to pursue profit at any cost, but also to be extremely myopic. As long as the death doesn't happen before the end of the quarter, the big labs, being immoral mazes, have no reason to give a shit about x-risk. Of course, every individual member of a big lab has reason to care, but the organization as an egregore does not (and so there is strong selection pressure for these organizations to be staffed by people who have low P(doom) and/or don't (think they) value the future lives of themselves and others).

3Viliam
This is an important thing I didn't realize. When I try to imagine the people who make decisions in organizations, my intuitive model would be somewhere between "normal people" and "greedy psychopaths", depending on my mood and how bad the organization seems. But in addition to this, there is the systematic shift towards "people who genuinely believe things that happen to be convenient for the organization's mission", as a kind of cognitive bias on a group scale. Not average people with average beliefs. Not psychopaths who prioritize profit above everything. But people who were selected from the pool of average by having their genuine beliefs aligned with what happens to be profitable in a given organization. I was already aware of similar things happening in "think tanks", where producing beliefs is the entire point of the organization. Their collective beliefs are obviously biased, not primarily because the individuals are biased, but because the individuals were selected for having their genuine beliefs already extreme in a certain direction. But I didn't realize that the same is kinda true for every organization, because the implied belief is "this organization's mission is good (or at least neutral, if I am merely doing it for money)". Would this mean that the epistemically healthiest organizations are those whose employees don't give a fuck about the mission and only do it for money?
Morphism112

Contrary to what the current wiki page says, Simulacrum levels 3 and 4 are not just about ingroup signalling. See these posts and more, as well as Baudrillard's original work if you're willing to read dense philosophy.

Here is an example where levels 3 and 4 don't relate to ingroups at all, which I think may be more illuminating than the classic "lion across the river" example:

Alice asks "Does this dress make me look fat?" Bob says "No."

Depending on the simulacrum level of Bob's reply, he means:

  1. "I believe that the dress does not make you look fat."
  2. "I w
... (read more)

Oops that was a typo. Fixed now, and added a comma to clarify that I mean the latter.

Morphism*20

Formalizing Placebomancy

I propose the following desideratum for self-referential doxastic modal agents (agents that can think about their own beliefs), where □P represents "I believe P", W|P represents the agent's world model conditional on P, and ≺ is the agent's preference relation:

Positive Placebomancy: For any proposition P, the agent concludes P from □P → P, if W|¬P ≺ W|P.

In natural English: The agent believes that hyperstitions that benefit the agent if true are, in fact, true.

"The placebo effect works on me when I want it to".

A real life example: In this... (read more)

3Richard_Kennaway
Can you clarify the Positive Placebomancy axiom? Does it bracket as: or as: And what is the relationship between P and A? Should A be P?
Morphism146

I think I know (80% confidence) the identity of this "local Vassarite" you are referring to, and I think I should reveal it, but, y'know, Unilateralist's Curse, so if anyone gives me a good enough reason not to reveal this person's name, I won't. Otherwise, I probably will, because right now I think people really should be warned about them.

AprilSR104

I'd appreciate a rain check to think about the best way to approach things. I agree it's probably better for more details here to be common knowledge but I'm worried about it turning into just like, another unnuanced accusation? Vague worries about Vassarites being culty and bad did not help me, a grounded analysis of the precise details might have.

Morphism350

People often say things like "do x. Your future self will thank you." But I've found that I very rarely actually thank my past self, after x has been done, and I've reaped the benefits of x.

This quick take is a preregistration: For the next month I will thank my past self more, when I reap the benefits of a sacrifice of their immediate utility.

e.g. When I'm stuck in bed because the activation energy to leave is too high, and then I overcome that and go for a run and then feel a lot more energized, I'll look back and say "Thanks 7 am Morphism!"

(I already do... (read more)

4Seth Herd
I'm subscribing to replies and rooting for you!
Morphism*10

Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.

The word "preference" is overloaded (and so are related words like "want"). It can refer to one of two things:

  • How you want the world to be, i.e. your terminal values, e.g. "I prefer worlds in which people don't needlessly suffer."
  • What makes you happy, e.g. "I prefer my ice cream in a waffle cone."

I'm not sure how we should distinguish these. So far, my best idea is to call the former "global prefere... (read more)

2Tamsin Leake
This is indeed a meaningful distinction! I'd phrase it as:

  • Values about what the entire cosmos should be like
  • Values about what kind of places one wants one's (future) selves to inhabit (eg, in an internet-like upload-utopia, "what servers does one want to hang out on")

"Global" and "local" is not the worst nomenclature. Maybe "global" vs "personal" values? I dunno. I mean, it's not unrelated! One can view a utility function with both kinds of values as a combination of two utility functions: the part that only cares about the state of the entire cosmos and the part that only cares about what's around them (see also "locally-caring agents"). (One might be tempted to say "consequentialist" vs "experiential", but I don't think that's right — one can still value contact with reality in their personal/local values.)
2Dagon
There are lots of different dimensions on which these vary. I'd note that one is purely imaginary (no human has actually experienced anything like that) while the second is a prediction strongly based on past experience. One is far-mode (non-specific in experience, scope, or timeframe) and the other near-mode (specific, steps to achieve well-understood). Does using the word "values" not sufficiently distinguish from "preferences" for you?
2JBlack
The second type of preference seems to apply to anticipated perceptions of the world by the agent - such as the anticipated perception of eating ice cream in a waffle cone. It doesn't have to be so immediately direct, since it could also apply to instrumental goals such as doing something unpleasant now for expected improved experiences later. The first seems to be more like a "principle" than a preference, in that the agent is judging outcomes on the principle of whether needless suffering exists in them, regardless of whether that suffering has any effect on the agent at all. To distinguish them, we could imagine a thought experiment in which such a person could choose to accept or deny some ongoing benefit for themselves that causes needless suffering on some distant world, and they will have their memory of the decision and any psychological consequences of it immediately negated regardless of which they chose.
2JBlack
It's even worse than that. Maybe I would be happier with my ice cream in a waffle cone the next time I have ice cream, but actually this is just a specific expression of being happier eating a variety of tasty things over time and it's just that I haven't had ice cream in a waffle cone for a while. The time after that, I will likely "prefer" something else despite my underlying preferences not having changed. Or something even more complex and interrelated with various parts of history and internal state. It may be better to distinguish instances of "preferences" that are specific to a given internal state and history, and an agent's general mapping over all internal states and histories.

I think you are missing an even more confusing meaning: "preference" can also mean what you actually choose.

In the VNM axioms, "agent prefers A to B" literally means "agent chooses A over B". It's confusing because when we talk about human preferences we usually mean mental states, not their behavioral expressions.

Morphism52

Emotions can be treated as properties of the world, optimized with respect to constraints like anything else. We can't edit our emotions directly but we can influence them.

2cubefox
We can "influence" them only insofar as we can "influence" what we want or believe: to a very low degree.

Oh no, I mean they have the private key stored on the client side and decrypt it there.

Ideally all of this is behind a nice UI, like Signal.

I mean, Signal messenger has worked pretty well in my experience.

4mako yass
Possibly incidental, but if people were successfully maintaining continuous secure access to their signal account you wouldn't even notice because it doesn't even make an attempt to transfer encrypted data to new sessions.

But safety research can actually disproportionately help capabilities, e.g. the development of RLHF allowed OAI to turn their weird text predictors into a very generally useful product.

I'm skeptical of the RLHF example (see also this post by Paul on the topic).

That said, I agree that if indeed safety researchers produce (highly counterfactual) research advances that are much more effective at increasing the profitability and capability of AIs than the research advances done by people directly optimizing for profitability and capability, then safety researchers could substantially speed up timelines. (In other words, if safety-targeted research is better at profit and capabilities than research which is directly targeted at these aims.)

I ... (read more)

I could see embedded agency being harmful though, since an actual implementation of it would be really useful for inner alignment.

Some off the top of my head:

  • Outer Alignment Research (e.g. analytic moral philosophy in an attempt to extrapolate CEV) seems to be totally useless to capabilities, so we should almost definitely publish that.
  • Evals for Governance? Not sure about this since a lot of eval research helps capabilities, but if it leads to regulation that lengthens timelines, it could be net positive.

Edit: oops, I didn't see Tammy's comment

Morphism10-1

Idea:

Have everyone who wants to share and receive potentially exfohazardous ideas/research send out a 4096-bit RSA public key.

Then, make a clone of the alignment forum, where every time you make a post, you provide a list of the public keys of the people who you want to see the post. Then, on the client side, it encrypts the post using all of those public keys. The server only ever holds encrypted posts.

Then, users can put in their own private key to see a post. The encrypted post gets downloaded to the user's machine and is decrypted on the client side. P... (read more)
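To make the scheme concrete, here is a minimal client-side sketch, assuming the Python cryptography library and hybrid encryption (a fresh symmetric key per post, wrapped separately under each recipient's RSA-4096 public key, since RSA-OAEP can't encrypt a long post directly). The function names are hypothetical, not an actual implementation:

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# RSA-OAEP padding used for wrapping the per-post symmetric key.
OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def encrypt_post(post_text, recipient_public_keys):
    # One symmetric key per post; the server only ever stores ciphertext.
    post_key = Fernet.generate_key()
    ciphertext = Fernet(post_key).encrypt(post_text.encode())
    # Wrap the post key once for each intended reader.
    wrapped_keys = [pk.encrypt(post_key, OAEP) for pk in recipient_public_keys]
    return ciphertext, wrapped_keys

def decrypt_post(ciphertext, wrapped_key, private_key):
    # Runs client-side: the private key never leaves the reader's machine.
    post_key = private_key.decrypt(wrapped_key, OAEP)
    return Fernet(post_key).decrypt(ciphertext).decode()

# Hypothetical usage with a freshly generated reader keypair:
reader = rsa.generate_private_key(public_exponent=65537, key_size=4096)
ct, wrapped = encrypt_post("potentially exfohazardous draft...", [reader.public_key()])
print(decrypt_post(ct, wrapped[0], reader))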

4mako yass
I don't think e2e encryption is warranted here for the first iteration. Generally, keypair management is too hard, today, everyone I know who used encrypted Element chat has lost their keys lmao. (I endorse element chat, but I don't endorse making every channel you use encrypted, you will lose your logs!), and keypairs alone are a terrible way of doing secure identity. Keys can be lost or stolen, and though that doesn't happen every day, the probability is always too high to build anything serious on top of them. I'm waiting for a secure identity system with key rotation and some form of account recovery process (which can be an institutional service or a "social recovery" thing) before building anything important on top of e2e encryption.
2mako yass
This was probably a typo but just in case: you should never send a private key off your device. The public key is the part that you send.

Is this a massive exfohazard? Should this have been published?

3momom2
It's not. The paper is hype, the authors don't actually show that this could replace MLPs.
-4Exa Watson
Very Unlikely Yes
7Mateusz Bagiński
To the extent that Tegmark is concerned about exfohazards (he doesn't seem to be very concerned AFAICT (?)), he would probably say that more powerful and yet more interpretable architectures are net positive.

Yikes, I'm not even comfortable maximizing my own CEV.

What do you think of this post by Tammy?

Where is the longer version of this? I do want to read it. :)

Well perhaps I should write it :)

Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that's how we have things like overconfidence bias ... (read more)

2Wei Dai
It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example consider someone who thinks it's really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they're just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process. Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn't the same thing happen to them? We already RL-punish AIs for saying things that we don't like (via RLHF), and in the future will probably punish them for thinking things we don't like (via things like interpretability). Not sure how to avoid this (given current political realities) so safety plans have to somehow take this into account.

But we could have said the same thing of SBF, before the disaster happened.

I would honestly be pretty comfortable with maximizing SBF's CEV.

Please explain your thinking behind this?

TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don't incentivize irrationality (like ours did).

Sorry if I was unclear there.

It's not, because some moral theories are not compatible with EU maximization.

I'm pretty confident that my values sa... (read more)

5Wei Dai
Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.) My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human's reflection process which may be mistaken or selfish or skewed. Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities? Also, how does RL fit into QACI? Can you point me to where this is discussed?

I'm 60% confident that SBF and Mao Zedong (and just about everyone) would converge to nearly the same values (which we call "human values") if they were rational enough and had good enough decision theory.

If I'm wrong, (1) is a huge problem and the only surefire way to solve it is to actually be the human whose values get extrapolated. Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.

I think (2) is a very human problem. Due to very weird selection pressure, ... (read more)

3Wei Dai
But we could have said the same thing of SBF, before the disaster happened. Please explain your thinking behind this? It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear how to handle uncertainty between them.

What about the following:

My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian.

I'm not sure how correct this is, but it's possible.

2Tamsin Leake
It certainly is possible! In more decision-theoretic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct! But as per my post, I'd describe this position as "not intrinsically altruistic" — you're optimizing for everyone's happiness because "it sure would suck if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.

Edit log:

2024-04-30 19:31 CST: Footnote formatting fix and minor grammar fix.

20:40 CST: "The problem is..." --> "Alignment is..."

22:17 CST: Title changed from "All we need is a pointer" to "The formal goal is a pointer"

OpenAI is not evil. They are just defecting on an epistemic prisoner's dilemma.

Maybe some kind of simulated long-reflection type thing like QACI where "doing philosophy" basically becomes "predicting how humans would do philosophy if given lots of time and resources"

Yes, amount of utopiastuff across all worlds remains constant, or possibly even decreases! But I don't think amount-of-utopiastuff is the thing I want to maximize. I'd love to live in a universe that's 10% utopia and 90% paperclips! I much prefer that to a 90% chance of extinction and a 10% chance of full-utopia. It's like insurance. Expected money goes down, but expected utility goes up.

Decision theory does not imply that we get to have nice things, but (I think) it does imply that we get to hedge our insane all-or-nothing gambles for nice things, and redistribute the nice things across more worlds.
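A toy version of the insurance analogy, with a made-up concave (square-root) utility over utopia-stuff:

# A 10% chance of full utopia vs. a guaranteed 10%-utopia universe,
# under an (assumed) concave utility in the amount of utopia-stuff.
def u(utopia_fraction):
    return utopia_fraction ** 0.5

gamble = 0.9 * u(0.0) + 0.1 * u(1.0)   # expected utility = 0.10
hedged = u(0.1)                         # utility is about 0.32
print(gamble, hedged)
# Expected utopia-stuff is the same (0.1) either way, but the
# hedged/"insured" outcome has the higher expected utility.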

I think this is only true if we are giving the AI a formal goal to explicitly maximize, rather than training the AI haphazardly and giving it a clusterfuck of shards. It seems plausible that our FAI would be formal-goal aligned, but it seems like UAI would be more like us unaligned humans—a clusterfuck of shards. Formal-goal AI needs the decision theory "programmed into" its formal goal, but clusterfuck-shard AI will come up with decision theory on its own after it ascends to superintelligence and makes itself coherent. It seems likely that such a UAI woul... (read more)

Fixed it! Thanks! It is very confusing that half the time people talk about loss functions and the other half of the time they talk about utility functions

Solution to 8 implemented in python using zero self-reference, where you can replace f with code for any arbitrary function on string x (escaping characters as necessary):

 f="x+'\\n'+x"
def ff(x):
return eval(f)
(lambda s : print(ff('f='+chr(34)+f+chr(34)+chr(10)+'def ff(x):'+chr(10)+chr(9)+'return eval(f)'+chr(10)+s+'('+chr(34)+s+chr(34)+')')))("(lambda s : print(ff('f='+chr(34)+f+chr(34)+chr(10)+'def ff(x):'+chr(10)+chr(9)+'return eval(f)'+chr(10)+s+'('+chr(34)+s+chr(34)+')')))")

edit: fixed spoiler tags