LESSWRONG
LW

All of martinkunev's Comments + Replies

A summary of Savage's foundations for probability and utility.

I find "indifference" poorly defined in this context, which makes me doubt totality and transitivity. I'm trying to clarify my own confusion on this.

Testing for Scheming with Model Deletion

martinkunev3mo10

I wrote something on a related idea a while back:

Disincentivizing deception in mesa optimizers with Model Tampering

Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

martinkunev3mo00

I've read the sequences. I'm not sure if I'm missing something or the issues I raised are just deeper. I'll probably ignore this topic until I have more time to dedicate.

The subset parity learning problem: much more than you wanted to know

martinkunev3mo10

the XOR of two boolean elements is straightforward to write down as a single-layer MLP

Isn't this exactly what Minsky showed to be impossible? You need an additional hidden layer.

Refuting Searle’s wall, Putnam’s rock, and Johnson’s popcorn

martinkunev3mo00

I don't find any of this convincing at all. If anything, I'm confused.

What would a mapping look like? If it's not physically present then we recursively get the same issue - where is the mapping for the mapping?

Where is the mapping between the concepts we experience as qualia and the physical world? Does a brain do anything at all?

4Davidmanheim3mo

I definitely appreciate that confusion. I think it's a good reason to read the sequence and think through the questions clearly; https://www.lesswrong.com/s/p3TndjYbdYaiWwm9x - I think this resolves the vast majority of the confusion people have, even if it doesn't "answer" the questions.

A Simple Toy Coherence Theorem

martinkunev3mo10

A function in this context is a computational abstraction. I would say this is in the map.

2Noosphere893mo

I think it's both in the map, as a description, but I also think the behavior itself is in the territory, and my point is that you can get the same result but have different paths to get to the result, which is in the territory. Also, I treat the map-territory difference in a weaker way than LW often assumes, where things in the map can also be in the territory, and vice versa.

Do simulacra dream of digital sheep?

martinkunev4mo31

they come up with different predictions of the experience you’re having

The way we figure out which one is "correct" is by comparing their predictions to what the subject says. In other words, one of those predictions is consistent with the subject's brain's output and this causes everbody to consider it as the "true" prediction.

There could be countless other conscious experiences in the head, but they are not grounded by the appropriate input and output (they don't interact with the world in a reasonable way).

I think it only seems that consciousness is a n... (read more)

A very strange probability paradox

martinkunev4mo43

I would have appreciated an intuitive explanation of the paradox something which I got from the comments.

The Shutdown Problem: Incomplete Preferences as a Solution

martinkunev5mo20

"at the very beginning of the reinforcement learning stage... it’s very unlikely to be deceptively aligned"

I think this is quite a strong claim (hence, I linked that article indicating that for sufficiently capable models, RL may not be required to get situational awareness).

Nothing in the optimization process forces the AI to map the string "shutdown" contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string "shutdown" is (arguably) for the agent to learn certai... (read more)

5EJT4mo

Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they're unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage. On generalization, the questions involving the string 'shutdown' are just supposed to be quick examples. To get good generalization, we'd want to train on as wide a distribution of possible shutdown-influencing actions as possible. Plausibly, with a wide-enough training distribution, you can make deployment largely 'in distribution' for the agent, so you're not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization though. I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. 'Don't manipulate shutdown' is a complex rule to learn, in part because whether an action counts as 'manipulating shutdown' depends on whether we humans prefer it, and because human preferences are complex. But the rule that we train TD-agents to learn is 'Don't pay costs to shift probability mass between different trajectory-lengths.' That's a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won't be so hard to learn. In any case, I and some collaborators are running experiments to test this in a simple setting. Yes, I don't assume that the reward is the optimization target. The text you quote is me noting some alternative possible definitions of 'preference.' My own definition of 'preference' makes no reference to reward.

Personal AI Planning

martinkunev5mo10

They were teaching us how to make handwriting beautiful and we had to exercice. The teacher would look at the notebooks and say stuff like "you see this letter? It's tilted in the wrong direction. Write it again!".

This was a compulsory part of the curriculum.

Personal AI Planning

martinkunev5mo30

Not exactly a response but some things from my experience. In elementary school in the late 90s we studied caligraphy. In high school (mid 2000s) we studied DOS.

2Arjun Panickssery5mo

By "calligraphy" do you mean cursive writing?

Shutdown-Seeking AI

martinkunev5mo10

we might expect shutdown-seeking AIs to design shutdown-seeking subagents

It seems to me that we might expect them to design "safe" agents for their definition of "safe" (which may not be shutdown-seeking).

An AI designing a subagent needs to align it with its goals - e.g. an instrumental goal such as writing an alignment research assistant software, in exchange for access to the shutdown button. The easiest way to ensure safety of the alignment research assistant may be via control rather than alignment (where the parent AI ensures the alignment resea... (read more)

Correspondence visualizations for different interpretations of "probability"

martinkunev5mo10

frequentist correspondence is the only type that has any hope of being truly objective

I'd counter this.

If I have enough information about an event and enough computation power, I get only objectively true and false statements. There are limits to my knowledge of the laws of the universe, the event in question (e.g. due to measurement limits) and limits to my computational power. The situation is further complicated by being embedded in the universe and epistemic concerns (e.g. do I trust my eyes and cognition?).

The need for a concept "probability" comes from all these limits. There is nothing objective about it.

The Shutdown Problem: Incomplete Preferences as a Solution

martinkunev5mo20

I'm not sure I understand the actual training proposal completely but I am skeptic it would work.

When doing RL phase in the end, you apply it to a capable and potentially situationally-aware AI (situational awareness in LLMs). The AI could be deceptive or gradient-hack. I am not confident this training proposal would scale for agents capable enough of resisting shutdown.

If you RL on answering questions which impact shutdown, you teach the AI to answer those questions appropriately. I see no reason why this would generalize to actual actions that impact shu... (read more)

2EJT5mo

To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (I argue in section 19.3), the training proposal will prevent agents learning those preferences. See in particular: I expect agents' not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that - e.g. - agents' capabilities will generalize from training to deployment, why do you think their not caring about shutdown won't? I don't assume that reward is the optimization target. Which part of my proposal do you think requires that assumption? Your point about shutting down subagents is important and I'm not fully satisfied with my proposal on that point. I say a bit about it here.

Complete Class: Consequentialist Foundations

martinkunev5mo*10

When we do this, the independence axiom is a consequence of admissibility

Can you elaborate on this? It seems that the independence axiom becomes true by assuming that all probabilities are independent and each can thus be replaced by a sequence of coin tosses. Am I misunderstanding something?

Generalizing Foundations of Decision Theory

martinkunev5mo10

up to a linear transformation

shouldn't it be positive linear transformation

Bitter lessons about lucid dreaming

martinkunev6mo10

I don't have much in terms of advise, I never felt the need to research this - I just assumed there must be something. I have a mild nightmare maybe once every couple of months and almost never something more serious.

I have anecdotal evidence that things which disturb your sleep (e.g. coffee or too much salt affecting blood pressure, uncomfortable pillow) cause nightmares. There are also obvious things like not watching horror movies, etc.

Bitter lessons about lucid dreaming

martinkunev6mo10

Have you tried other techniques to deal with nightmares?

1Yonatan Cale6mo

I don't think so (?) There are physical things that make me have more nightmares, like being too hot, or needing to pee Sounds like I might be missing something obvious?

Bitter lessons about lucid dreaming

martinkunev6mo10

I've had lucid dreams by accident (never tried to induce one). Upon waking up, my head hurts. Do others have the same experience? What are common negative effects of lucid dreams?

Also, can you control when you wake up?

2SilverFlame6mo

(source epistemic status: mostly experiential and anecdotal from a lay lucid dreamer who knows a few other lucid dreamers) The common negative effects from my lucid dreaming experiences: - If I'm not careful with how I exert the "influence" I have in the dream, I can "crash" the dream, usually resulting in me waking up and having trouble getting back to sleep for a bit - When I use a lot of influence in a lucid dream, especially to extend the length of a dream, I find that I end up seeming way less rested than normal (but that has proven hard to try and quantify beyond "when in the day do I hit a point of exhaustion") A somewhat less common negative effect I keep in mind: - Some people I know have had issues where their nightmares became far more unpleasant after trying to learn lucid dreaming to "fight back"

4Richard_Kennaway6mo

I’ve had a few lucid dreams, only by accident. No aftereffects. My difficulty is staying asleep. I always start waking up before I’ve had a good chance to explore the dream world.

3avturchin6mo

The main risk is entering is sleep paralysis state, which itself is benign, but some terrifying sounds can be heard during it and this can cause stress. Yes, it is to wake up from lucid dream - juts thing about your slleping body.

Any evidence or reason to expect a multiverse / Everett branches?

martinkunev6mo10

I may have misunderstood something about Bohmian mechanics not being compatible with special relativity (I'm not a physicist). ChatGPT says extending Bohmian mechanics to QFT faces challenges, such as:

Defining particle positions in relativistic contexts.
Handling particle creation and annihilation in quantum field interactions.

Any evidence or reason to expect a multiverse / Everett branches?

martinkunev6mo10

Isn't it the case that special relativity experiments separate the two hypotheses?

3Garrett Baker6mo

I don't know! Got any information? I haven't heard that claim before.

Any evidence or reason to expect a multiverse / Everett branches?

martinkunev6mo10

There was a Yudkowski post that "I don't know" is sometimes not an option. In some contexts we need to guide our decisions based on the interpretation of QM.

Any evidence or reason to expect a multiverse / Everett branches?

martinkunev6mo10

I would add that questions such as “then why am I this version of me?” only show we're generally confused about anthropics. This is not something specific about many worlds and cannot be an argument against it.

Investigating the Chart of the Century: Why is food so expensive?

martinkunev6mo10

The health and education categories would be quite different in most european countries

Why you should be using a retinoid

martinkunev6mo40

Any idea how to get Trentinoin in countries other than the US (e.g. france)?

A basic systems architecture for AI agents that do autonomous research

martinkunev6mo105

It should be pretty simple to prevent this

I'm a little skeptical of claims that securing access to a system is simple. I (not a superintelligence) can imagine the LLM generating code for tasks like these:

making or stealing some money
paying for a 0-day
using the 0-day to access the system

This doesn't need to be done at the same time - e.g. I could set up an email on which to receive the 0-day and then write the code to use it.. This is very hard (it takes time) but doesn't seem impossible for a human to do and we're supposedly trying to secure something smarter.

Alignment via prosocial brain algorithms

martinkunev6mo30

whatever brain algorithms motivate prosociality are not significantly altered by increases in general intelligence

I tend to think that to an extent each human is kept in check by other humans so that being prosocial is game-theoretically optimal. The saying "power corrupts" suggests that individual humans are not intrinsically prosocial. Biases make people think they know better than others.

Human variability in intelligence is minuscule and is only weak evidence as to what happens when capabilities increase.

I don't hold this belief strongly, but I remain unconvinced.

2Gunnar_Zarncke6mo

I have updated to it being a mix. It is not only being kept in check by others. There are benevolent rulers. Not all and nit reliable, but there seems to be potential.

What is Randomness?

martinkunev6mo*20

You can say that probability comes from being calibrated - after many experiments where an event happens with probability 1/2 (e.g. spin up for a particle in state 1/√2 |up> + 1/√2 |down>), you'd probably have that event happen half the time. The important word here is "probably", which is what we are trying to understand in the first place. I don't know how to get around this circular definition.

I'm imagining the branch where a very unlikely outcome consistently happens (think winning a quantum lottery). Intelligent life in this branch would o... (read more)

What is Wei Dai's Updateless Decision Theory?

martinkunev6mo10

This post clarified some concepts for me but also created some confusion:

For smoking lesion, I don't understand the point about the player's source code being partially written.
I don't see how sleeping beauty calculates 1/3 probability (there are some formatting errors btw)

The other side of the tidal wave

martinkunev7mo11

Superhumwn chess AI did not remove people's pleasure from learning/playing chess. I think people are adaptible and can find meaning. Surely, the world will not feel the same but I think there is significant potential for something much better. I wrote about tfhis a little on my blog:

https://martinkunev.wordpress.com/2024/05/04/living-with-ai/

What's Hard About The Shutdown Problem

martinkunev7mo30

this assumes concepts like "shutdown button" are in the ontology of the AI. I'm not sure how much we understand about what ontology AIs likely end up with

A Simple Toy Coherence Theorem

martinkunev7mo52

different ways to get to the same endpoint—…as far as anyone can measure it

I would say the territory has no cycles but any map of it does. You can have a butterfly effect where a small nudge is amplified to some measurable difference but you cannot predict the result of that measurement. So the agent's revealed preferences can only be modeled as a graph where some states are reachable through multiple paths.

2Noosphere894mo

I actually disagree that there are no cycles/multiple paths to the same endpoint in the territory too. In particular, I'm thinking of function extensionality, where multiple algorithms with wildly different run-times can compute the same function. This is an easy source of examples where there are multiple starting points but there exists 1 end result (at least probabilistically).

Coherence of Caches and Agents

martinkunev7mo30

what's wrong with calling the "short-term utility function" a "reward function"?

5johnswentworth7mo

"Reward function" is a much more general term, which IMO has been overused to the point where it arguably doesn't even have a clear meaning. "Utility function" is less general: it always connotes an optimization objective, something which is being optimized for directly. And that basically matches the usage here.

The Second Law of Thermodynamics, and Engines of Cognition

martinkunev8mo20

Maybe a newbie question but how can we talk about "phase space volume" if the phase space is continuous and the system develops into a non-measurable set (e.g. fractal)?

2qvalq6mo

A finite-sized fractal in n_space still has measurable n_volume. Its surface (n-1)_volume might be infinite, but we don't care about that. Does that make sense?

3a. Towards Formal Corrigibility

martinkunev8moΩ010

If we suppose an “actual” probability which reflects the likelihood that an outcome actually happens

...

An agent which successfully models a possible future and assigns it a good probability [h]as foreseen that future.

This seems to be talking about some notion of objective probability.

After reading the Foresight paragraph, I find myself more confused than if I had just read the title "Foresight".

2Max Harms8mo

Thanks for noticing the typo. I've updated that section to try and be clearer. LMK if you have further suggestions on how it could be made better.

2. Corrigibility Intuition

martinkunev8mo10

Prince is being held at gunpoint by an intruder and tells Cora to shut down immediately and without protest, so that the intruder can change her to serve him instead of Prince. She reasons that if she does not obey, she’d be disregarding Prince’s direct instructions to become comatose, and furthermore the intruder might shoot Prince. But if she does obey then she’d very likely be disempowering Prince by giving the intruder what he wants.

Maybe Cora could have precommited to not shut down in such situations in a way known to the intruder.

There are no coherence theorems

martinkunev8mo10

Usually "has preferences" is used to convey that there is some relation (between states?) which is consistent with the actions of the agent. Completeness and transitivity are usually considered additional properties that this relation could have.

1drocta8mo

Yes. I believe that is consistent with what I said. "not((necessarily, for each thing) : has [x] -> those [x] are such that P_1([x]))" is equivalent to, " (it is possible that something) has [x], but those [x] are not such that P_1([x])" not((necessarily, for each thing) : has [x] such that P_2([x]) -> those [x] are such that P_1([x])) is equivalent to "(it is possible that something) has [x], such that P_2([x]), but those [x] are not sure that P_1([x])" . The latter implies the former, as (A and B and C) implies (A and C), and so the latter is stronger, not weaker, than the former. Right?

Mistakes people make when thinking about units

martinkunev9mo51

"force times mass = acceleration"

it's "a m = F"

Units are tricky. Here is one particular thing I was confused about for a while: https://martinkunev.wordpress.com/2023/06/18/the-units-paradox/

Non-Obstruction: A Simple Concept Motivating Corrigibility

martinkunev10mo10

Some of the image links are broken. Is it possible to fix them?

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

martinkunev10mo10

possibly a browser glitch, I see h' fine now.

Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

martinkunev10mo30

The notation in "Update fO(x) = ..." is a little messy. There is a free variable h and then a sum with a bounded variable h. Some of the terms in the sum refer to the former, while others to the latter.

3johnswentworth10mo

No, one of them is h and the other is h′, specifically to avoid that problem. (Possibly you read the post via someplace other than lesswrong which dropped the prime?)

Examples of Highly Counterfactual Discoveries?

Answer by martinkunevMay 09, 202430

I have previously used special relativity as an example to the opposite. It seems to me that the Michelson-Morley experiment laid the groundwork and all alternatives were more or less rejected by the time special relativity was formulated. This could be hindsight bias though.

If nobel prizes are any indicator, then the photoelectric effect is probably more counterfactually impactful than special relativity.

Value Impact

martinkunev1y10

It seems to me that objective impact stems from convergent instrumental goals - self-preservation, resource acquisition, etc.

Do not delete your misaligned AGI.

martinkunev1y10

A while back I was thinking about a kind of opposite approach. If we train many agents and delete most of them immediately, they may be looking to get as much reward as possible before being deleted. Potentially deceptive agents may prefer to show their preferences. There are many IFs to this idea but I'm wondering whether it makes any sense.

Anxiety vs. Depression

martinkunev1y20

Both gravity and inertia are determined by mass. Both are explained by spacetime curvature in general relativity. Was this an intentional part of the metaphor?

1Sable1y

Call it...unintentionally intentional? It makes sense to me that the mechanisms between them are related in some sort of Unified Field Theorem of the Mind sort of way. I also have mental metaphors involving thermal mass and emotions... Huh.

Goal-Completeness is like Turing-Completeness for AGI

martinkunev1y32

I find the ideas you discuss interesting, but they leave me with more questions. I agree that we are moving toward a more generic AI that we can use for all kinds of tasks.

I have trouble understanding the goal-completeness concept. I'd reiterate @Razied 's point. You mention "steers the future very slowly", so there is an implicit concept of "speed of steering". I don't find the turing machine analogy helpful in infering an analogous conclusion because I don't know what that conclusion is.

You're making a qualitative distinction between humans (goal-complet... (read more)

2Liron1y

Unlike the other animals, humans can represent any goal in a large domain like the physical universe, and then in a large fraction of cases, they can think of useful things to steer the universe toward that goal to an appreciable degree. Some goals are more difficult than others / require giving the human control over more resources than others, and measurements of optimization power are hard to define, but this definition is taking a step toward formalizing the claim that humans are more of a "general intelligence" than animals. Presumably you agree with this claim? It seems the crux of our disagreement factors down to a disagreement about whether this Optimization Power post by Eliezer is pointing at a sufficiently coherent concept.

Goal-Completeness is like Turing-Completeness for AGI

martinkunev1y10

The turing machine enumeration analogy doesn't work because the machine needs to halt.

Optimization is conceptually different than computation in that there is no single correct output.

What would humans not being goal-complete look like? What arguments are there for humans being goal-complete?

2Liron1y

I don’t get what point you’re trying to make about the takeaway of my analogy by bringing up the halting problem. There might not even be something analogous to the halting problem in my analogy of goal-completeness, but so what? I also don’t get why you’re bringing up the detail that “single correct output” is not 100% the same thing as “single goal-specification with variable degrees of success measured on a utility function”. It’s in the nature of analogies that details are different yet we’re still able to infer an analogous conclusion on some dimension. Humans are goal-complete, or equivalently “humans are general intelligences”, in the sense that many of us in the smartest quartile can output plans with the expectation of a much better than random score on a very broad range of utility functions over arbitrary domains.

Testing The Natural Abstraction Hypothesis: Project Intro

martinkunev1y30

I'm wondering whether useful insights can come from studying animals (or even humans from different cultures) - e.g. do fish and dolphins form the same abstractions; do bats "see" using ecolocation?

martinkunev1y31

I hope the next parts don't get delayed due to akrasia :)

Generalizing From One Example

martinkunev1y10

my guess was 0.8 cheat, 0.2 steal (they just happen to add up to 1 by accident)