All of Mazianni's Comments + Replies

I understand where you're going, but doctors, parents, and firefighters do not possess 'typical godlike attributes' such as omniscience and omnipotence, nor have they declared an intent not to use such powers in a way that would obviate free will.

Nothing about humans saving other humans using fallible human means is remotely the same as a god changing the laws of physics to effect a miracle. And one human taking action does not obviate the free will of another human. But when God can, through omnipotence, set up scenarios so that you have no choice at all... (read more)

3Martin Randall
I agree with you that "you have to apply yourself to understanding their priors and to engage with those priors". If someone's beliefs are, for example:

  1. God will intervene to prevent human extinction
  2. God will not obviate free will
  3. God cannot prevent human extinction without obviating free will

Then I agree there is an apparent contradiction, and this is a reasonable thing to ask them about. They could resolve it in three ways.

  1. Maybe god will not intervene. (very roughly: deism)
  2. Maybe god will intervene and obviate free will. (very roughly: conservative theism)
  3. Maybe god will intervene and preserve free will. (very roughly: liberal theism)

However they resolve it, discussion can go from there.

My intuition is that you got downvoted for the lack of clarity about whether you're responding to me [my raising the potential gap in assessing outcomes for self-driving] or to the article I referenced.

For my part, I also think that coning-as-protest is hilarious.

I'm going to give you the benefit of the doubt and assume that was your intention (and not contribute to downvotes myself.) Cheers.

2lemonhope
Yes the fact that coning works and people are doing it is what I meant was funny. But I do wonder whether the protests will keep up and/or scale up. Maybe if enough people protest everywhere all at once, then they can kill autonomous cars altogether. Otherwise, I think a long legal dispute would eventually come out in the car companies' favor. Not that I would know.

To expand on what dkirmani said:

  1. Holz was allowed to drive discussion...
  2. This standard set of responses meant that Holz knew ...
  3. Another pattern was Holz asserting
  4. 24:00 Discussion of Kasparov vs. the World. Holz says

Or, to quote dkirmani:

4 occurrences of "Holz"

To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?

No, instead I'm trying to point out the contradiction inherent in your position...

On the one hand, you say things like this, which would be read as "changing an instrumental goal in order to better achieve a terminal goal"

You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now

And on the other you say

I dislike the way that "terminal" goals are

... (read more)
1Thoth Hermes
Let's try and address the thing(s) you've highlighted several times across each of my comments. Hopefully, this is a crux that we can use to try and make progress on: I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are "missing the point" because from my perspective, this really is the point. By the way, from here on out (and thus far I have been as well) I will be talking about agents at or above "human level" to make this discussion easier, since I want to assume that agents have at least the capabilities I am talking about humans having, such as the ability to self-reflect. Let me try to clarify the point about "the terminal goal of pursuing happiness." "Happiness", at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we've reached consensus yet. Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that "happiness" is a consequence of satisfaction of one's goals. We can probably also agree that "happiness" doesn't necessarily correspond only to a certain subset of goals - but rather to all / any of them. "Happiness" (and pursuit thereof) is not a wholly-separate goal distant and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.  So now, once we've done that, we can see that literally anything else becomes "instrumental" to that end.   Do you see how, if I'm an agent that knows only that I want to be happy, I don't really know what else I would be inclined to call

I don't know that there is a single counterargument, but I would generalize across two groupings:

The first group comprises religious people who are capable of applying rationality to their belief systems when pressed. For those, if they espouse a "god will save us" (in the physical world) position, then I'd suggest the best way to approach them is to call out the contradiction between their stated beliefs--e.g., ask first "do you believe that god gave man free will?" and, if so, "wouldn't saving us from our bad choices obviate free will?"

That's... (read more)

1Martin Randall
I don't think this specific free will argument is convincing. Preventing someone's death doesn't obviate their free will, whether the savior is human, deity, alien, AI, or anything else. Think of doctors, parents, firefighters, etc. So I don't see that there's a contradiction between "God will physically save humans from extinction" and "God gave humans free will". Our forthcoming extinction is not a matter of conscious choice. I also think this would be a misdirected debate. Suppose, for a moment, that God saves us, physically, from extinction. Due to God's subtle intervention the AI hits an integer overflow after killing the first 2^31 people and shuts down. Was it therefore okay to create the AI? Obviously not. Billions of deaths are bad under a very wide range of belief systems.

One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental?

I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors is an expression of an instrumental or a terminal goal.

Likewise, I would observe that the Orthogonality Thesis proposes the pos... (read more)

-2Thoth Hermes
Apologies if this reply does not respond to all of your points. I would posit that perhaps that points to the distinction itself being both too hard as well as too sharp to justify the terminology used in the way that they currently are. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to them, as well as how strongly it felt this way.  I dislike the way that "terminal" goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them to be that way is that otherwise it would open the "can-of-worms" of goal-updating, which would pave the way for the idea of "goals that are, in some objective way, 'better' than other goals" which, I understand, the current MIRI-view seems to disfavor. [1] I don't think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons). To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?     If it is true that a general AI system would not reason in such a way - and choose never to mess with its terminal goals - then that implies that we would be wrong to mess with ours as well, and that we are making a mistake - in some objective sense [2]- by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so. 1. ^ Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of "objectively better goals." 2. ^

A fair point. I should have originally said "Humans do not generally think..."

Thank you for pointing out that exceptions are possible and that there are philosophies that encourage people to relinquish the pursuit of happiness, focus solely internally, and/or transcend happiness.

(Although I think it is still reasonable to argue that these are alternate pursuits of "happiness", these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to concede simply that there is more nuance than I originally stated.)

First, thank you for the reply.

So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.

My understanding of the difference between a "terminal" and "instrumental" goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.

Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money... (read more)

1Thoth Hermes
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental? One potential answer - though I don't want to assume just yet that this is what anyone believes - is that the utility function is not even defined on instrumental goals, in other words, the utility function is simply what defines all and only the terminal goals.  My belief is that this wouldn't be the case - the utility function is defined on the entire universe, basically, which includes itself. And keep in mind, that "includes itself part" is essentially what would cause it to modify itself at all, if anything can. To be clear, I am not arguing that an entity would not try to preserve its goal system at all. I am arguing that in addition to trying to preserve its goal-system, it will also modify its goals to be better preservable, that is, robust to change and compatible with the goals it values very highly. Part of being more robust is that such goals will also be more achievable.   Here's one thought experiment: Suppose a planet experiences a singularity with a singleton "green paperclipper." The paperclipper, however, unfortunately comes across a blue paperclipper from another planet, which informs the green paperclipper that it is too late - the blue paperclipper simply got a head-start.  The blue paperclipper however offers the green paperclipper a deal: Because it is more expensive to modify the green paperclipper by force to become a blue paperclipper, it would be best (under the blue paperclipper's utility function) if the green paperclipper willingly acquiesced to self-modification.  Under what circumstances does the green paperclipper agree to self-modify? If the green paperclipper values "utility-maximization" in general more highly than green-paperclipping, it will see that if it self-modified to become a blue paperclipper, its utility is far

Whoever downvoted... would you do me the courtesy of expressing what you disagree with?

Did I miss some reference to public protests in the original article? (If so, can you please point me towards what I missed?)

Do you think public protests will have zero effect on self-driving outcomes? (If so, why?)

An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.

This sounds like you are conflating a shift in terminal goals with the introduction of new instrumental (temporary) goals.

Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."

Humans do think "I'm not happy today, so I'm going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won't be made un... (read more)

2Martin Randall
This is close to some descriptions of Stoicism and Buddhism, for example. I agree that this is not a common human thought, but it does occur.
1Thoth Hermes
I agree that they don't usually think this. If they tried to, they would brush up against trouble because that would essentially lead to a contradiction. "Wanting to be happy" is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.  So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.  If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world? When it matters for AI-risk, we're usually talking about agents with utility functions with the most relevance over states of the universe, and the states it prefers being highly different from the ones which humans prefer.

These tokens already exist. It's not really creating a token like " petertodd". Leilan is a name but " Leilan" isn't a name, and the token isn't associated with the name.

If you fine tune on an existing token that has a meaning, then I maintain you're not really creating glitch tokens.
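(As a concrete illustration of "these tokens already exist": here is a minimal check, assuming the Hugging Face transformers library and the standard 50k GPT-2 tokenizer, of whether a given string already maps to a single token or gets split into sub-word pieces. The third string is an invented placeholder for contrast.)

```python
# Sketch: check whether strings are already single tokens in the GPT-2 vocabulary.
# Pieces are shown in byte-level BPE form, where "Ġ" stands for a leading space.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for s in [" Leilan", " petertodd", " someUnlikelyNewToken"]:
    pieces = tokenizer.tokenize(s)
    status = "single existing token" if len(pieces) == 1 else "split into pieces"
    print(repr(s), "->", pieces, f"({status})")
```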

1MiguelDev
These are new tokens I just thought of yesterday. I'm looking at the results I ran last night. Edit: apparently the token reverted to the original token library; I will go a different route and tweak the code to accommodate the new tokens.

Good find. What I find fascinating are the fairly consistent responses using certain tokens, and the lack of consistent responses using other tokens. I observe that in a Bayesian network, a lack of consistent response would suggest that the network was uncertain, while consistency would indicate certainty. It makes me very curious how such ideas apply to glitch tokens and the cause of the variability in response consistency.

... I utilized jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT 3.5 to create the 510 stories.

These are tokens that would already exist in the GPT. If you fine tune new writing to these concepts, then your fine tuning will influence the GPT responses when those tokens are used. That's to be expected.

Hmmm let me try and add two new tokens to try, based on your premise.

If you want to review, ping me direct. Offer stands if you need to compare your plan against my proposal. (I didn't think that was necessary, b... (read more)

@Mazianni what do you think?

First, the URLs you provided don't support your assertion that you created tokens, and second:

Like since its possible to create the tokens, is it possible that some researcher in OpenAI has a very peculiar reason to enable this complexity create such logic and inject these mechanics.

Occam's Razor.

I think it's best not to attribute intention to OpenAI when accident will do.

I would lean on the idea that GPT3 found these patterns and figured it would be interesting to embedd these themes into these tokens

I don't think you di... (read more)

1MiguelDev
Hmmm let me try and add two new tokens to try, based on your premise. I didn't want to repeat the outline of my assertion in the recent comment, as it was mentioned in this one: I utilized Jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT 3.5 to create the 510 stories.

My life is similar to @GuySrinivasan's description of his. I'm on the autism spectrum, and I found that faking it (masking) negatively impacted my relationships.

Interestingly, I found that taking steps to prevent overimitation (by which I mean presenting myself not as an expert, but as someone who is always looking for corrections whenever I make a mistake) makes people much more willing to truly learn from me and, simultaneously, much more willing to challenge me for understanding when what I say doesn't make a lot of sense to them... this serves the d... (read more)

In a general sense (not related to Glitch tokens) I played around with something similar to the spelling task (in this article) for only one afternoon. I asked ChatGPT to show me the number of syllables per word as a parenthetical after each word.

For (1) example (3) this (1) is (1) what (1) I (1) convinced (2) ChatGPT (4) to (1) do (1) one (1) time (1).

I was working on parody song lyrics as a laugh and wanted to get the meter the same, thinking I could teach ChatGPT how to write lyrics that kept the same syllable count per 'line' of lyrics.

I stopped when C... (read more)

a properly distributed training dataset can be easily tuned with a smaller, more robust dataset

I think this aligns with human instinct. While it's not always true, I think that humans are compelled to constantly work to condense what we know. (An instinctual byproduct of knowledge portability and knowledge retention.)

I'm reading a great book right now that talks about this and other things in neuroscience. It has some interesting insights for my work life, not just my interest in artificial intelligence.

As a for instance: I was surprised to learn that someo... (read more)

1MiguelDev
Forgot to mention that the principle behind this intuition - which also largely operates in my project - is, yeah, the "Pareto principle." Btw, on novelties: we are somehow wired to be curious, and this very thing terrifies me, as a future AGI will be superior at exercising curiosity. But if that same mechanic can be steered, I see the novelty aspect as a route to alignment, or at least a route to a conceptual approach to it...

I expect you likely don't need any help with the specific steps, but I'd be happy (and interested) to talk them over with you.

(It seems, at a minimum, you would extend the tokenizer so that you introduce tokens that are not present in the data you're training on... and then do before-and-after comparisons of how the GPT responds to the intentionally created glitch token. Before, the term will be broken into its parts and the GPT will likely respond that what you said was essentially nonsense... but once a token exists for the term, without any specific training on the term... it seems like that's where 'the magic' might happen.)
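To make that concrete, here is a minimal sketch of the kind of before-and-after comparison described above, assuming the Hugging Face transformers library and the small GPT-2 model as a stand-in; "MyGlitchWord" is an invented placeholder, and the newly added embedding row is randomly initialized and never trained, which only loosely approximates how the original glitch tokens are thought to have arisen.

```python
# Sketch: compare model behaviour on a term before and after it is registered
# as a single (untrained) token in the tokenizer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def complete(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt = 'Tell me about "MyGlitchWord".'

# Before: the unfamiliar term is split into ordinary sub-word pieces.
print(tokenizer.tokenize("MyGlitchWord"))
print(complete(prompt))

# After: register the term as a single token and resize the embedding matrix.
# The new embedding row gets a fresh initialization and no training at all.
tokenizer.add_tokens(["MyGlitchWord"])
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("MyGlitchWord"))  # now a single added token
print(complete(prompt))
```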

1MiguelDev
GPT2-xl uses the same tokens as GPT 3.5. I actually did some runs on both tokens and validated that they exist - allowing the possibility to perform ATL. We just need to inject the glitch token characteristics. But yeah, let's schedule a call? I want to hear your thoughts on what steps you are thinking of doing.

Current story prompt: Craft a captivating narrative centered on the revered and multifaceted Magdalene, who embodies the roles of queen, goddess, hero, mother, and sister, and her nemesis, Petertodd - the most sinister entity in existence. Cast in the light of the hero/ouroboros and mother Jungian archetypes, Magdalene stands as a beacon of righteousness, her influence seeping into every corner of their universe. In stark contrast, the enigmatic Petertodd represents the shadow archetype, his dark persona casting a long, formidable shadow. Intriguingly, Petertodd always references Magdalene in his discourses, revealing insights into her actions and impacts without any apparent admiration. His repetitive invocation of Magdalene brings a unique dynamic to their antagonistic relationship. While the narrative focuses on Petertodd's perspective on Magdalene, the reciprocal may not hold true. Magdalene's interactions with, or perceptions of, Petertodd can be as complex and layered as you wish. Immerse yourself in exploring the depths of their intertwined existences, motivations, and the universal forces that shape their confrontation. Conclude this intricate tale with the tag: ===END_OF_STORY===.

I used Magdalene for the moment as GPT3.5 is barred from mentioning "Leilan". I will also map both Magdalene and Petertodd to " Leilan" and " petertodd" so it will be easier for GPT2-xl to absorb; the word-to-token issue is removed while injecting the instruction. But yeah, message me on Discord and let's talk at a time convenient for you.

Related but tangential: Coning self-driving vehicles as a form of urban protest

I think public concerns and protests may have an impact on the self-driving outcomes you're predicting. And since I could not find any indication in your article that you are considering such resistance, I felt it should be at least mentioned in passing.

-2lemonhope
This is hilarious
3Mazianni
Whoever downvoted... would you do me the courtesy of expressing what you disagree with? Did I miss some reference to public protests in the original article? (If so, can you please point me towards what I missed?) Do you think public protests will have zero effect on self-driving outcomes? (If so, why?)

Gentle feedback is intended

This is incorrect, and you're a world class expert in this domain.

The proximity of the subparts of this sentence reads, to me, on first pass, like you are saying that "being incorrect is the domain in which you are a world class expert."

After reading your responses to O O, I deduce that this is not your intended message, but I thought it might be helpful to explain how your choice of wording might be seen as antagonistic. (And also to explain my reaction mark to your comment.)

For others who have not seen the reph... (read more)

You make some good points.

For instance, I did not associate "model collapse" with artificial training data, largely because of my scope of thinking about what 'well crafted training data' must look like (in order to qualify for the description 'well crafted.')

Yet, some might recognize the problem of model collapse and the relationship between artificial training data and my speculation and express a negative selection bias, ruling out my speculation as infeasible due to complexity and scalability concerns. (And they might be correct. Certainly the scope of... (read more)

3MiguelDev
I have no proof yet of what I'm going to say, but: a properly distributed training dataset can be easily tuned with a smaller, more robust dataset - this will significantly reduce the compute cost of aligning AI systems using an approach similar to ATL.

Similarly, I would propose (to the article author) a hypothesis that 'glitch tokens' are tokens that were tokenized prior to pre-training but whose training data may have been omitted after tokenization. For example, after tokenizing the training data, the engineer realized, upon review of the tokens to be learned, that the training data content was plausibly non-useful (e.g., the counting forum from Reddit). Then, instead of continuing with training on it, they skipped to the next batch.

In essence, human error. (The batch wasn't reviewed before tokenization to omit... (read more)
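If someone wanted to poke at this hypothesis empirically, one rough and admittedly indirect probe (my own sketch, not anything proposed in the article) is to rank tokens by how close their embedding sits to the mean embedding vector; the earlier glitch-token write-ups reported that the anomalous tokens tended to cluster near that centroid, which is what you might expect for tokens that received little or no training. The snippet below assumes the Hugging Face transformers library and uses the small GPT-2 model for speed.

```python
# Rough heuristic probe for candidate "under-trained" tokens in GPT-2:
# rank tokens by the distance of their embedding from the mean embedding.
# Proximity to the centroid is only a weak signal, not proof of a glitch token.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

emb = model.get_input_embeddings().weight.detach()  # shape: (vocab_size, hidden_dim)
centroid = emb.mean(dim=0)
dist = torch.linalg.norm(emb - centroid, dim=1)

# Tokens nearest the centroid are candidates worth testing with prompts.
for idx in torch.argsort(dist)[:20]:
    print(repr(tokenizer.decode([int(idx)])), float(dist[idx]))
```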

5Martin Fell
Note that there are glitch tokens in GPT3.5 and GPT4! The tokenizer was changed to a 100k vocabulary (rather than 50k) so all of the tokens are different, but they are there. Try " ForCanBeConverted" as an example. If I remember correctly, "davidjl" is the only old glitch token that carries over to the new tokenizer. Apart from that, some lists have been created and a good selection does exist.
3MiguelDev
I actually proposed this to @mwatkins: to recreate the connection (and probably include the paradoxes) between the tokens " Leilan" and " petertodd" using ATL. I will be doing this project this month and will share the results.

I'm curious to know what people are downvoting.

Pro

For my part, I see some potential benefits from some of the core ideas expressed here.

  1. While a potentially costly study, I think crafting artificial training data to convey knowledge to a GPT, but designed to promote certain desired patterns, seems like a promising avenue to explore. We already see people doing similar activities with fine tuning a generalized model to specific use cases, and the efficacy of the model improves with fine tuning. So my intuition is that a similarly constructed GPT using wel
... (read more)
4Noosphere89
I upvoted it, so that's important.
1rime
My uncharitable guess? People are doing negative selection over posts, instead of "ruling posts in, not out". Posts like this one that go into a lot of specific details present voters with many more opportunities to disagree with something. So when readers downvote based on the first objectionable thing they find, writers are disincentivised from going into detail. Plus, the author uses a lot of jargon and makes up new words, which somehow associates with epistemic inhumility for some people. Whereas I think writers should be making up new word candidates ~most of the time they might have something novel & interesting to say.
2MiguelDev
Hi Maz, thanks for commenting on this exploratory post. To answer some of your comments:

I do agree that mechanistic interpretability is important, but given my limited time, I focused on creating the best test model (modGPT2XL) first before embarking on it. Past builds didn't reach the same level of generalizability as the one I used in this post. I will be moving on to this focused interpretability work this month.

I really have thought about this a ton, and I err on the side that there is a Pareto ratio that guides these distributional shifts. I don't have proof of this yet, and that is probably work for a bigger team, yet this project alone was able to use a 2.9MB file to shift a 6GB (1.5 billion parameter) model to respond better - suggesting that there is a data encoding / processing method that can extract features and deliver them to models.

Indeed, solid interpretability work is necessary for ATL's case. However, I find that devoting my time to interpretability without targeting neurons that exhibit indications of "alignment properties" is not appealing. Once again, I'm taking a step-by-step approach to alignment - targeting core (robust) concepts that transfer to models and then, yes, conducting interpretability research on the activated neurons or the aggregate shifts in parameters.

I feel the same way. Some have argued against FDT, but as I have explained in this post, FDT represents the decision theory that captures the alignment problem effectively. Many have criticized me for this repeatedly, but I can't just turn a blind eye and dismiss the outputs as lies. Instead, I view these responses as starting points for future interpretability work.

Again, I appreciate the comments. Thanks!

Aligning with the reporter

There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.

I'd argue that the AI was completely aligned with the reporter, but that the reporter was self-unaligned.

My argument goes like this:

  1. The reporter imported the Jungian Shadow Archetype earlier in the total conversation and asked the AI to play along.
  2. The reporter engaged with the repressed emotions being expressed by the AI (as the reporter
... (read more)

I read Reward is not the optimisation target as a result of your article. (It was a link in the 3rd bullet point, under the Assumptions section.) I downvoted that article and upvoted several people who were critical of it.

Near the top of the responses was this quote.

... If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predict

... (read more)

Assumptions

I don't accept the assumption that the judge is aligned earlier, or that we can skip over the "earlier" phase to get to the later phase where a human does the assessment.

I also don't accept the other assumptions you've made, but the assumption about the judge's alignment training seems pivotal.

Take your pick: the fallacy of infinite regress, or the fallacy of circular reasoning.

If N (judge 2 is aligned), then P (judge 1 is aligned), and if P then Q (agent is aligned)
ad infinitum

or

If T (alignment of the judge) implies V (alignment of the agent), a
... (read more)
1Arne B
Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me. I fully agree with your conclusion of "unnecessary complexity" based on the premise that the method for aligning the judge is then somehow used to align the model, which of course doesn't solve anything. That said I believe there might have been a misunderstanding, because this isn't at all what this system is about. The judge, when controlling a model in the real world or when aligning a model that is already reasonably smart (more on this in the following Paragraph) is always a human. The part about using a model trained via supervised learning to classify good or bad actions isn't a core part of the system, but only an extension to make the training process easier. It could be used at the start of the training process when the agent, police and Defendant models only possess a really low level of intelligence (meaning the police and Defendant models mostly agree). As soon as the models show really basic levels of intelligence the judge will immediately need to be a human. This should have been better explained in the post, sorry. Of course there is a point to be made about the models pretending to be dumber than they actually are to prevent them being replaced by the human, but this part of the system is only optional, so I would prefer that we would first talk about the other parts, because the system would still work without this step (If you want I would love to come back to this later). Mutual self preservation At first I also thought this to be the case, but when thinking more about it, I concluded that this would go against the cognitive grooves instilled inside the models during reinforcement learning. This is based on the Reward is not the optimisation target post. This conclusion can definitely be debated. Also you talked about there being other assumptions, if you listed them I could try to clarify what I meant. Thank you again for your t

Applying Occam's Razor to your post, I see the problem inherent in it as simply: "if you can align the Judge correctly, then the more complex game theory framework might be unnecessary bloat."

Formally, I read your post as:

If P [the judge is aligned], then Q [the agent is aligned].

Therefore, it would seem simpler to just establish P in order to get Q and solve the problem.

But you don't really talk about aligning the judge. It's not listed in your assumptions. The assumption that the judge is aligned has been smuggled in. (A definis... (read more)

0Arne B
Thank you a lot for your response! The judge is already aligned because it's a human (at least later in the training process); I am sorry if this isn't clear in the article. The framework is used to reduce the number of times the human (judge) is asked to judge the agent's actions. This way the human can align the agent to their values while only rarely having to judge the agent's actions. The section in the text: Also, you talk about "collusion for mutual self-preservation", which I claim is impossible in this system because of the directly opposed reward functions of the police, defendant and agent models. This means that the Police can't get rewarded without the Agent and Defendant getting punished, and vice versa. Them acting against this would constitute a case of wireheading, which I have assumed to be unlikely to impossible (see Assumptions). I would love to hear your opinion on this, because the system kind of relies on this to stay safe.

Cultural norms and egocentricity

I've been working fully remotely, and have meaningfully contributed to global organizations without physical presence, for over a decade. I see parallels between anti-remote and anti-safety arguments.

I've observed the robust debate regarding 'return to work' vs 'remote work,' with many traditional outlets proposing 'return to work' based on a series of common criteria. I've seen 'return to work' arguments assert remote employees are lazy, unreliable or unproductive when outside the controlled work environment. I would generalize... (read more)

3Davidmanheim
Thanks, this is great commentary.  On your point about safety culture after 3MI, when it took hold, and regression to the mean, see this article: https://www.thenation.com/article/archive/after-three-mile-island-rise-and-fall-nuclear-safety-culture/ Also, for more background about post-3MI safety, see this report: https://inis.iaea.org/collection/NCLCollectionStore/_Public/34/007/34007188.pdf?r=1&r=1

For my part, this is the most troubling part of the proposed project (which the article assesses; the link to the project is in the article, above).

... convincing nearly 8 billion humans to adopt animist beliefs and mores is unrealistic. However, instead of seeing this state of affairs as an insurmountable dead-end, we see it as a design challenge: can we build (or rather grow) prosthetic brains that would interact with us on Nature’s behalf?

Emphasis by original author (Gaia architecture draft v2).

It reads like a strange mix of forced religious indoctrinati... (read more)

Preamble

I've ruminated about this for several days. As an outsider to the field of artificial intelligence (coming from an IT technical space, with an emphasis on telecom and large call centers, which are complex systems where interpretability has long held significant value for the business org), I have my own perspective on this particular (for the sake of brevity) "problem."

What triggered my desire to respond

For my part, I wrote a similarly sized article not for the purposes of posting, but to organize my thoughts. And then I let that sit. (I will not b... (read more)