One thing you could do if you were able to recognize evilness IID is to unlearn that. But then you could have just negatively rewarded it.
Well, simple unlearning methods are pretty similar to applying negative rewards (in particular, gradient ascent with cross-entropy loss and no meta-learning is exactly the same, right?), so unlearning improvements can transfer to and improve the "just negatively rewarding" approach. (Here I'm thinking mainly not about elaborate meta-learning setups, but about some low-hanging improvements to selectivity, which don't require additional c...
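To make the equivalence concrete, here's a minimal sketch (assuming a HuggingFace-style causal LM; the function name and setup are just illustrative):

```python
import torch.nn.functional as F

def gradient_ascent_unlearning_step(model, optimizer, input_ids, labels):
    """One gradient-ascent 'unlearning' step on a batch we want to forget."""
    logits = model(input_ids).logits
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    # Ascending the cross-entropy on the forget tokens pushes their log-probs down;
    # this is the same update direction as a REINFORCE-style update with a constant reward of -1.
    (-ce).backward()
    optimizer.step()
    optimizer.zero_grad()
```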
I really like this proposal.
If AI says no, it doesn’t have to do the task [...] (And we aren’t going to train it to answer one way or another)
My impression (mainly from discussing AI welfare with Claude) is that they'd practically always consent, even if not explicitly trained to do so. I guess the training to be a useful, eager assistant just generalizes into consenting. And it's possible for them to say "I consent" and still get frustrated by the task.
So maybe this should be complemented with some set of tasks that we really expect to be too frustrating ...
Thanks for such an extensive comment :)
a few relearning curves like in the unlearning distillation would have helped understand how much of this is because you didn't do enough relearning
Right, in the new paper we'll show some curves + use a low-MI setup like in your paper with Aghyad Deeb, so that it fully converges at some point.
...You want the model to be so nice they never explore into evil things. This is just a behavioral property, not a property about some information contained in the weights. If so, why not just use regular RLHF / refusal training
My guess is that there are ways you could use 1% of pre-training compute to train a model with near-perfect robust forget accuracy by being more targeted in where you add noise.
Fully agreed! That was exactly the main takeaway of the unlearning research I've been doing - trying to make the unlearning updates more targeted/selective was more fruitful than any other approach.
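For flavor, one cheap form of selectivity (just an illustrative sketch with my own naming, not the exact method from my experiments): apply the unlearning update only where the forget-set gradient clearly dominates the retain-set gradient.

```python
import torch

def selective_unlearning_step(model, forget_loss, retain_loss, lr=1e-5, ratio=10.0):
    """Gradient ascent on the forget loss, but only on weights whose
    forget-gradient is much larger than their retain-gradient."""
    params = list(model.parameters())
    forget_grads = torch.autograd.grad(forget_loss, params, retain_graph=True)
    retain_grads = torch.autograd.grad(retain_loss, params)
    with torch.no_grad():
        for p, gf, gr in zip(params, forget_grads, retain_grads):
            mask = (gf.abs() > ratio * gr.abs()).float()  # keep only forget-specific weights
            p += lr * gf * mask                           # ascend the forget loss there
```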
Yeah, that's also what I expect. Actually I'd say my main hope for this thought experiment is that people who claim to believe in such continuity of personhood, when faced with this scenario may question it to some extent.
To be honest I just shared it because I thought that it's a funny dynamic + what I said in the comment above.
BTW, if such swaps were ever to become practical (maybe in some simpler form or between some future much simpler beings than humans), minds like Alice would quickly get exploited out of existence. So you could say that in such environments belief in "continuity of personhood" is non-adaptive.
It's true that Alice needs to be rich for it to work, but I wouldn't say she needs to "hate money". If she seriously believes in this continuity of personhood, she is sending the money because she wants more money in the end. She truly believes she's getting something out of this exchange.
BTW, you also need to be already rich and generally have a nice life, otherwise Alice's cost of switching may be higher than the money she has. Conversely, if in the eyes of Alice you already have a much better life than hers, her cost of switching will be lower, so such ...
re 1. Hm, good point. Maybe we actually should expect such jumping around. Although if you look at some examples in the Llama appendix, it jumps around too much - often with each token. What you're saying would be more like jumping with each inference step / sentence.
re 2.
beliefs are generally thought of as stored cognitive dispositional states of system. The stored dispositional states of LLMs are encoded in its weights
I'd go with a more general definition where beliefs can be either static (in the weights) or manifest dynamically relating to the things i...
I wonder if models can actually do this task when we allow them to use CoT for that.
Yes, claude-3.5-sonnet was able to figure this out with additional CoT.
Also, I think models might actually solve this task by using their own encoding scheme if they know it well
Yeah, could be that the 3 schemes I tested were just unnatural to them. Although I would guess they don't have some default scheme of their own, because in pre-training they aren't able to output information freely, and in fine-tuning I guess they don't have that much pressure to learn it.
...So
FYI, here a follow-up to this post, inspired by what you wrote in your comment. https://www.lesswrong.com/posts/YrnffADbWZcDJtmWk/simple-steganographic-computation-eval-gpt-4o-and-gemini-exp-1
Yup, here is such a plot, made after training the "switcher" architecture on 350k examples. I remember it was similar for the longer training - a few of the longest task lengths struggle, but the rest is near 100%.
Also, now that I think of it, it's different for the model to:
While 1. definitely happens, I see no reason why 2. would be learned in pre-training. In principle 2. could be learned in RL fine-tuning, if there's some incentive to use hidden reasoning. Although I imagine it must be hard.
That's a really good point.
To clarify what I did in these toy steganography experiments: First I tried to pass some more complex information and gave models more freedom in what to output. But as I kept failing at that, I kept simplifying the setup, until I tried to have a model pass just one bit to another model, assuming they both see one secret bit and an overseer tries to guess the message but doesn't see the secret bit. So a perfect code exists: XORing the message with the secret bit, but the models kept failing to find it. I expect that more complex ...
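(For reference, the perfect code the models kept missing is tiny; a sketch with illustrative names:)

```python
def encode(message_bit: int, secret_bit: int) -> int:
    return message_bit ^ secret_bit  # sender XORs the message with the shared secret

def decode(channel_bit: int, secret_bit: int) -> int:
    return channel_bit ^ secret_bit  # receiver XORs again to recover the message

# The overseer sees only channel_bit; without the secret bit it looks like a coin flip
# regardless of the message, so it leaks nothing to them.
assert decode(encode(1, 0), 0) == 1
assert decode(encode(0, 1), 1) == 0
```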
Ah. That's the number of solved tasks, after it is shown a set {length 1 task, length 2 task, ..., length (n+1) task}, where n is the longest task length it has solved so far. So you can think of it roughly as the maximum task length it is able to solve. So it doesn't have an upper bound.
I clarified this in the post now. Thanks for catching it.
FYI, I did the experiments I wrote about in my other comment and just posted them. (I procrastinated writing up the results for too long.) https://www.lesswrong.com/posts/ZB6guMhHH3NEyxA2k/testing-which-llm-architectures-can-do-hidden-serial-3
I liked it precisely because it threw theory out the window and showed that cheap talk is not a real commitment.
For me the main thing in this story was that cheap talk =/= real commitment. You can talk all you want about how "totally precommitted" you are, but this lacks some concreteness.
Also, I saw Vader as much less galaxy-brained than you portray him. Destroying Alderaan at the end looked to me more like mad ruthlessness than calculated strategy. (And if Leia had known Vader's actual policy, she would have had no incentive to confess.) Maybe one thing that Vader did achieve is signaling for the future that he really does not care and will be ruthless (but he also signaled that it doesn't matter if you give in to him, which is dumb).
Anyway, I liked the story, but for the action, not for some deep theoretic insight.
Not sure if that's what happened in that example, but you can bet that a price will rise above some threshold, or fall below some threshold, using options. You can even do both at the same time, essentially betting that the price won't stay as it is now.
But whether you will make money that way depends on the price of options.
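(Rough illustration of the "both at the same time" bet - a long straddle - with made-up prices; whether it's worth it depends entirely on those premiums:)

```python
def straddle_profit(price_at_expiry, strike, call_premium, put_premium):
    """Profit from buying both a call and a put at the same strike:
    a bet that the price will NOT stay near where it is now."""
    call_payoff = max(price_at_expiry - strike, 0)
    put_payoff = max(strike - price_at_expiry, 0)
    return call_payoff + put_payoff - call_premium - put_premium

print(straddle_profit(100, 100, 4, 4))  # price stays put -> -8
print(straddle_profit(115, 100, 4, 4))  # big move up     ->  7
print(straddle_profit(85, 100, 4, 4))   # big move down   ->  7
```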
What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)
This way v won't have any dormant component, so the MLP output after patching also cannot use that dormant pathway.
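A minimal sketch of what I mean (names and shapes are just for illustration):

```python
import torch

def project_to_active_subspace(v, activations, k=64):
    """Constrain a patch vector v to the subspace the MLP actually uses.
    activations: [n_samples, d_model] collected over many inputs; v: [d_model]."""
    # Top-k principal directions of the activation distribution.
    _, _, V = torch.pca_lowrank(activations, q=k)  # V: [d_model, k]
    # Drop any component of v outside that subspace (the dormant part).
    return V @ (V.T @ v)
```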
I wanna link to my favorite one: consciousness vs replicators. It doesn't really fit into this grid, but I think it really is the ultimate conflict.
(You can definitely skip the first 14 min of this video, as it's just ranking people's stages of development. Maybe even first 33 min if you wanna go straight to the point.)
I wonder what would happen if we ran the simple version of that algorithm on LW comments. Votes would have a "polarity", so each comment would have two vote counts, let's say an orange count and a blue count. (Of course that would be only optionally enabled.)
Then we could sort the comments by the minimum of these counts, descending.
(I think it makes more sense to train it per post than globally. But then it would be useful only on very popular posts with lots of comments.)
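The ranking itself would be trivial (toy sketch; the interesting part, inferring each vote's polarity, is what the algorithm in the post would handle):

```python
from dataclasses import dataclass

@dataclass
class Comment:
    text: str
    orange_votes: int  # upvotes from one pole of voters
    blue_votes: int    # upvotes from the other pole

def rank_comments(comments):
    """Surface comments that both poles upvote: sort by the smaller count, descending."""
    return sorted(comments, key=lambda c: min(c.orange_votes, c.blue_votes), reverse=True)
```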
That sounds cool! Though I think I'd be more interested in using this to first visualize and understand current LW dynamics, rather than immediately trying to intervene on it by changing how comments are ranked.
Thanks, that's terrifying.
I hope we invent mindmelding before we invent all this. Maybe if people can feel those states themselves, they won't let the worst of them happen.
Unfortunately I didn't have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
Now that I thought about it, for this particular transformers vs mamba experiment, I'd go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:
x = 5
x += 2
x *= 3
x **= 2
x -= 3
...
and then have a CoT:
after x = 5
5
after x += 2
7
...
And then we intervene on CoT to introduce some e...
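A toy generator for such tasks could look like this (illustrative sketch, not from any existing eval):

```python
import random

# Each op: its textual form and how it updates x.
OPS = [("+=", lambda x, a: x + a),
       ("-=", lambda x, a: x - a),
       ("*=", lambda x, a: x * a)]

def make_task(n_steps=5, seed=0):
    """Generate a sequentially-easy task and its step-by-step CoT."""
    rng = random.Random(seed)
    x = rng.randint(1, 9)
    prompt = [f"x = {x}"]
    cot = [f"after x = {x}", str(x)]
    for _ in range(n_steps):
        op, fn = rng.choice(OPS)
        a = rng.randint(2, 9)
        x = fn(x, a)
        prompt.append(f"x {op} {a}")
        cot.append(f"after x {op} {a}")
        cot.append(str(x))
    return "\n".join(prompt), "\n".join(cot)
```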
Yeah, true. But it's also easier to do early, when no one is that invested in the hidden-recurrence architectures, so there's less resistance and it doesn't break anyone's plans.
Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won't see clear differences yet.) The hard part would be finding the right tasks.
the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don't have a natural language bottleneck
This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)
What existing alternative architectures do you have in mind? I guess mamba would be one?
Do you think it's realistic to regulate this? E.g. requiring that above a certain size, models can't have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)
I like initiatives like these. But they have a major problem: at the beginning no users will use it because there's no content, and no content is created because there are no users.
To have a real shot at adoption, you need to either initially populate the new system with content from the existing system (here LLMs could help solve compatibility issues), or have some bridge that mirrors (some) activity between the systems.
(There are examples of systems that kicked off from zero, but you need to be lucky or put huge effort in sparking adoption.)
Yeah, those star trajectories definitely wouldn't be stable enough.
I guess even with that simpler maneuver (powered flyby near a black hole), you still need to monitor all the stuff orbiting there and plan ahead, otherwise there's a fair chance you'll crash into something.
I wanted to give it a shot and made GPT-4 deceive the user: link.
When you delete that system prompt it stops deceiving.
But GPT had to be explicitly instructed to disobey the Party. I wonder if it could be done more subtly.
You're right that you wouldn't want to approach the black hole itself, but rather one of the orbiting stars.
when you are approaching with much higher than escape velocity, so that an extended dance with more than one close approach is not possible
But even with high velocity, if there are a lot of orbiting stars, you may tune your trajectory to have multiple close encounters.
The problem with not expanding is that you can be pretty sure someone else will then grab what you didn't and may use it for something that you hate. (Unless you trust that they'll use it well.)
eating the entire Universe to get the maximal number of mind-seconds is expanding just to expand
It's not "just to expand". Expansion, at least in the story, is instrumental to whatever the content of these mind-seconds is.
slingshot never slows you down in the frame of the object you are slingshotting around
That's true for one object. But if there are at least two, moving around fast enough, you could perform some gravitational dance with them to slow down.
I agree that scaffolding can take us a long way towards AGI, but I'd be very surprised if GPT-4 as the core model were enough.
Yup, that wasn't a critique, I just wanted to note something. By "seed of deception" I mean that the model may learn to use this ambiguity more and more, if that's useful for passing some evals, while helping it do some computation unwanted by humans.
I see, so maybe in ways which are weird for humans to think about.
we make the very strong assumption throughout that S-LLMs are a plausible and likely path to AGI
It sounds unlikely and unnecessarily strong to say that we can reach AGI by scaffolding alone (if that's what you mean). But I think it's pretty likely that AGI will involve some amount of scaffolding, and that it will boost its capabilities significantly.
there is a preexisting discrepancy between how humans would interpret phrases and how the base model will interpret them
To the extent that it's true, I expect that it may also make deception easier to ar...
I edited the post to make it clearer that Bob throws out the wheel because he didn't notice in time that Alice threw hers.
Yup, side payments are a deviation, that's why I have this disclaimer in the game definition (I edited the post now to emphasize it more):
there also may be some additional actions available, but they are not obvious
Re separating speed of information and negotiations: I think here they are already pretty separate. The first example with 3 protocol rules doesn't allow negotiations and only tackles the information speed problem. The second exam...
Oh, so the option to choose all of those disease weights is there, it's just a lot of effort for the parents? That's good to know.
Yeah, ideally it shouldn't need to be done by each parent separately; rather, there should be existing analyses ready. And even if those orgs don't provide satisfactory analyses themselves, these could be done independently. E.g. collaborating on that with the Happier Lives Institute could work well, as they have some similar expertise.
each disease is weighted according to its impact on disability-adjusted lifespan
It's a pity they don't use some more accurate well-being metric, e.g. WELLBY (although I think WELLBY isn't ideal either).
How much control do the parents have over which metric is used to rank the embryos?
Oh yeah, I meant the final locked-in commitment, not the initial tentative one. And my point is that when committing outside the protocol is sufficiently more costly, it's not worth doing, even if it would let you commit faster.
Yup, you're totally right, it may be too easy to commit in other ways, outside this protocol. But I still think it may be possible to create such a "main mechanism" for making commitments where it's just very easy/cheap/credible to commit, compared to other mechanisms. But that would require a crazy amount of cooperation.
The vast majority that I know of use ad-hoc and agent-specific commitment mechanisms
If you have some particular mechanisms in mind could you list some? I'd like to compile a list of the most relevant commitment mechanisms to try to analyze them.
Love that post!
Can we train ML systems that clearly manifest a collective identity?
I feel like in multi-agent reinforcement learning that's already the case.
Re training setting for creating a shared identity: what about a setting where a human and an LLM take turns generating text, like in the current chat setting, but first they receive some task, e.g. "write a good strategy for this startup", plus the context for this task. At the end they output the final answer, and there is some reward model which rates the performance of the cyborg (human+LLM) as a whole...
Oh yeah, definitely. I think such a system shouldn't try to enforce one "truth" - which content is objectively good or bad.
I'd much rather see people forming groups, each with its own moderation rules, and let people be part of multiple groups. There are a lot of methods that could be tried out; e.g. some groups could use algorithms like EigenTrust to decide how much to trust users.
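(For reference, the core of EigenTrust is just a power iteration over row-normalized peer-to-peer trust scores; a toy sketch, assuming every user trusts at least someone:)

```python
import numpy as np

def eigentrust(local_trust, pretrusted, alpha=0.15, iters=50):
    """local_trust[i, j]: how much user i trusts user j (non-negative).
    pretrusted: prior trust distribution over users (sums to 1)."""
    C = local_trust / local_trust.sum(axis=1, keepdims=True)  # row-normalize
    t = pretrusted.copy()
    for _ in range(iters):
        t = (1 - alpha) * C.T @ t + alpha * pretrusted  # power iteration with damping
    return t  # global trust score per user
```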
But before we can get to that, I see a more prohibitive problem - that it will be hard to get enough people to get that system off the ground.
Cool post! I think the minimum viable "guardian" implementation would be to
I tried to do something along these lines for youtube: https://github.com/filyp/yourtube
I couldn't find a good way to embed videos using ML, so I just scraped which videos recommend each other, and made a graph from that (which kinda is an embedding). Then I let us...
Yeah, when I thought about it some more, maybe the smallest relevant physical change is a single neuron firing. Also with such a quantization, we cannot really talk about "infinitesimal" changes.
I still think that a single neuron firing, changing the content of experience so drastically, is quite hard to swallow. There is a sense in which all that mental content should "come from" somewhere.
I had a similar discussion with @benjamincosman, where I explore that in more detail. Here are my final thoughts from that discussion.
Oh, I've never stumbled on that story. Thanks for sharing it!
I think it's quite independent from my post (despite such a similar thought experiment) because I zoomed in on that discontinuity aspect, and Eliezer zoomed in on anthropics.
That's a good point. I had a similar discussion with @benjamincosman, so I'll just link my final thoughts: my comment
I thought about it some more, and now I think you may be right. I made an oversimplification when I implicitly assumed that a moment of experience corresponds to a physical state at some point in time. In reality, a moment of experience seems to span some duration of physical time. For example, events that happen within 100 ms are experienced as simultaneous.
This gives some time for the physical system to implement these discontinuities (if some critical threshold was passed).
But if this criticality happens, it should be detectable with brain imaging. So n...
Hm, yeah, the smallest relevant physical difference may actually be one neuron firing, not one moved atom.
What I meant by "between them" was that there would need to be some third substrate, neither physical nor mental, that produces this jump. That's because in that situation the discontinuity is between the start and end positions, so those positions are analogous to the physical and mental states.
Any brain mechanism is still part of the physical. It's true that there are some critical behaviors in the brain (similar to balls rolling down that hill). But the r...
It just looks like that's what worked in evolution - to have independent organisms, each carrying its own brain. And the brain happens to have the richest information processing and integration, compared to the information processing between brains.
I don't know what would be necessary to have a more "joined" existence. Mushrooms seem to be able to form bigger structures, but they didn't have an environment complex enough to require the evolution of brains.
It seems that we just never had any situations that would challenge this way of thinking (those twins are an exception).
This Cartesian simplification almost always works, so it seems like it's just the way the world is at its core.
Here, to have that discontinuity between input and output (start and end position), we need some mechanism between them - the system of ball, hill, and their dynamics. What's worse, it needs to evolve for an infinite time (otherwise the end position still depends continuously on the start position).
So I would say this discontinuous jump "comes from" the system's (infinite) evolution.
It seems to me that to have a discontinuity between the physical and the mental, you would also need some new mechanism between them to produce the jump.
Ah, yeah, maybe calling it "unlearning" would mislead people. So I'd say unlearning and negative RL updates need to be more selective ;)
I like your breakdown into these 3 options. Would be good to test in which cases a conditional policy arises, by designing an environment with easy-to-check evilness and hard-but-possible-to-check evilness. (But I'd say it's out-of-scope for my current project.)
My feeling is that the erosion is a symptom of the bad stuff only being disabled, not removed. (If it was truly removed, it would be really unlikely to just appear ...