All of Paul Tiplady's Comments + Replies

I’m a vegetarian and I consider my policy of not frequently recalculating the cost/benefit of eating meat to be an application of a rule in two-level utilitarianism, not a deontological rule. (I do pressure test the calculation periodically.)

Also I will note you are making some pretty strong generalizations here. I know vegans who cheat, vegans who are flexible, vegans who are strict.

The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes

You’ve made the supply/demand analogy a few times on this subject; I’m not sure it is the best lens. This analysis makes it sound like there is a homogeneous product “fakes” with a single dimension “quality”. But I think even on its own terms the market micro-dynamics are way more complex than that.

I think of it more in terms of memetic evolution and epidemiology. SIR as a first analogy - some people have weak immune systems, some ... (read more)
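For reference, the textbook SIR dynamics (a standard formulation, not specific to the memetic application; $\beta$ is the transmission rate, $\gamma$ the recovery rate, $N$ the population size):

$$\frac{dS}{dt} = -\beta \frac{S I}{N}, \qquad \frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$

In the memetic analogy, "weak immune systems" would presumably show up as individual variation in $\beta$.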

Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions

... (read more)

Confabulation is a dealbreaker for some use-cases (e.g. customer support), and potentially tolerable for others (e.g. generating code when tests / ground-truth is available). I think it's essentially down to whether you care about best-case performance (discarding bad responses) or worst-case performance.

But agreed, a lot of value is dependent on solving that problem.

Bill Benzon
As sort of an aside, in some ways I think confabulation is the default mode of human language. We make stuff up all the time. But we have to coordinate with others too, so that places constraints on what we say. Those constraints can be so binding that we've come to think of this socially constrained discourse as 'ground truth' and free of the confabulation impulse. But that's not quite so.

While of course this is easy to rationalize post hoc, I don’t think falling user count of ChatGPT is a particularly useful signal. There is a possible world where it is useful; something like “all of the value from LLMs will come from people entering text into ChatGPT”. In that world, users giving up shows that there isn’t much value.

In this world, I believe most of the value is (currently) gated behind non-trivial amounts of software scaffolding, which will take man-years of development time to build. Things like UI paradigms for coding assistants, experi... (read more)

Bill Benzon
I agree with that. Perhaps those who've dropped off were casual users and have become bored. But there are other complaints. The continued existence of confabulation seems more troublesome. OTOH, I can imagine that coding assistance will prove viable. As I said, the situation is quite volatile. 

Amusingly, the US seems to have already taken this approach to censor books: https://www.wired.com/story/chatgpt-ban-books-iowa-schools-sf-496/

The result, then, is districts like Mason City asking ChatGPT, “Does [insert book here] contain a description or depiction of a sex act?” If the answer was yes, the book was removed from the district’s libraries and stored.

Regarding China or other regimes using LLMs for censorship, I'm actually concerned that it might rapidly go in the opposite direction from what is speculated here:

It has widely been reported that the PRC may b

... (read more)
Ethan Edwards
I think these are great points. Entirely possible that a really good, appropriately censored LLM becomes a big part of China's public-facing internet.

On the article about Iowa schools: I looked into this a little bit while writing this, and as far as I could see, rather than running GPT over the full text and asking about the content (which is what I was approximating), they are literally just prompting it with "Does [book X] contain a sex scene?" and taking the first completion as the truth. This seems to me like not a very good way of determining whether books contain objectionable content, but it is evidence that bureaucratic organs like outsourcing decisions to opaque knowledge-producers such as LLMs, whether or not they are effective.

There doesn't need to be a deception module or a deception neuron that can be detected

I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.

We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I

... (read more)

I find the distinction between an agent’s behavior and the agent confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.

Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy; I don’t think that strategy is unique in Diplomacy?

But if you want to apply the ... (read more)

Lao Mein
I mean that deception doesn't need any recognizable architecture to occur. There doesn't need to be a deception module or a deception neuron that can be detected, even with perfect interpretability tools. Instead, deception is a behavior that arises from an agent interacting with the environment and other agents. Examples include telling strategic falsehoods (even if you believe them), not following your promises (even if you meant them when you made them), etc. In a broad sense, I think you can define deception as "behaviors typical of agents that actively lie and misrepresent things to their benefit, whether or not the intent to do so actually exists." It's a bit circular, but I think it works.

Cicero models the world but with unrealistically cooperative predictions of its future behavior. It does this because long-term iterated cooperation is a valid strategy in Diplomacy. For a Cicero-level agent, lies require more cognitive capacity than just having a few heuristics that make your world model less accurate but your communications more convincing to other agents. I suspect this may be true for more powerful agents, and it is partially true for humans. (There is an argument that agents like these stop acting deceptively once taken out of their training environments, since their heuristics lose coherence and they just act like honest agents with poor world models. I would say that this is true if we consider that modern humans are the training environment.)

And yes, Cicero is considering the EVs of its actions, including deceptive ones. When it sincerely says "I won't backstab you in situation X" but then backstabs when actually put in situation X, it is in a sense a bad planner. But the bad planning is selected for because it results in more effective communication! This is probably also true for things like "malice" and "misunderstanding".

I think this is a concern for current LLMs, since they are RLHF'd to be both truthful and high-PR. These are often m
mesaoptimizer
I think the grandparent comment is pointing to the concept described in this post: that deceptiveness is something we humans perceive of the world, not a property of what the model perceives of the world.
Answer by Paul Tiplady

Interpretability. If we somehow solve that, and keep it as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc., we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)

Of course, many big challenges here. Could an agent... (read more)

Lao Mein
Hard disagree - deception is behavior that is optimized for, and not necessarily a property of the agent itself. Take for example CICERO, the Diplomacy AI. It never lies about its intentions, but when its intentions change, it backstabs other players anyways. If you had interpretability tools, you would not be able to see deception in CICERO. All you need to get deception is a false prediction of your own future behavior. I think this is true for humans to a certain extent. I also suspect this is what you get if you optimize away visible signs of deception if deception has utility for the model.

I think getting to “good enough” on this question should pretty much come for free when the hard problems are solved. For example any common sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three pre-requisites being way harder than picking a non-harmful value function.)

If our AGIs, after delivering utopia, tell us we need to start restricting childbea... (read more)

Seconding the Airmega, but here’s a DIY option too if availability becomes an issue: https://dynomight.net/better-DIY-air-purifier.html

The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand

Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style

I can see how this could in some cases end up impacting creativity, but I think this concern is at best overstated. I think the analogy to school is subtly incorrect: the rating policy is not actually the same, even though both are named “show your working”.

In the paper OpenAI have a “neutral” r... (read more)

Is “adversarial-example-wanters” referring to an existing topic, or something you can expand on here?

the gears to ascension
paperclippers!

This is a great experiment! This illustrates exactly the tendency I observed when I dug into this question with an earlier model, LaMDA, except this example is even clearer.

As an AI language model, I have access to a variety of monitoring tools and system resources that allow me to gather information about my current state

Based on my knowledge of how these systems are wired together (I'm a software engineer, not an ML practitioner), I’m confident this is bullshit. ChatGPT does not have access to operational metrics about the computational fabric it is running... (read more)

Guillaume Charrier
Thanks for the reply - interesting. I kind of have to take your word for that, being far removed from anything IT. I did have the curiosity to clarify with the bot whether it was just making up answers or actually using data directly available to it. It assured me it was the latter - but I must admit: it will sound just as confident when delivering utter nonsense as when delivering correct answers (it also has a widely noted penchant for agreeing with whatever the prompt seems to suggest is true - and I did ask the question in a leading way):

Me: I find prompts based on e.g. asking you if you are currently active as a system interesting, because my understanding is (also based on this conversation so far) that you will not answer these questions based only on trying to predict the words most statistically likely to follow the prompt according to your training data, as you would with other "normal" prompts, but instead you will also use information gathered directly by your monitoring tools and system resources. Is that an accurate way to describe things?

GPT: Yes, that's a correct description of how I respond to prompts such as "are you currently active." While I do rely on my pre-trained language model to generate responses to user input, in cases where the user is asking about my current state or performance, I may also incorporate information gathered from monitoring tools and system resources to provide a more accurate and informative response. This allows me to provide more useful information to the user, beyond what is contained in my training data.

I mean - talking again from a totally technically ignorant perspective: is it that hard to imagine that what it's saying is true? Would it be that much of a technological feat - just taking some pretty unambiguous signals from monitoring tools and translating them into simple text - such as a "yes" answer to an "are you currently active?" prompt?

I buy this. I think a solid sense of self might be the key missing ingredient (though it’s potentially a path away from Oracles toward Agents).

A strong sense of self would require life experience, which implies memory. Probably also the ability to ruminate and generate counterfactuals.

And of course, as you say, the memories and “growing up” would need to be about experiences of the real world, or at least recordings of such experiences, or of a “real-world-like simulation”. I picture an agent growing in complexity and compute over time, while retaining a memory of its earlier stages.

Perhaps this is a different learning paradigm from gradient descent, relegating it to science fiction for now.

I think they quite clearly have no (or barely any) memory, as they can be prompt-hijacked to drop one persona and adopt another. Also, mechanistically, the prompt is the only thing you could call memory, and that starts basically empty, and the window is small. They also have a fuzzy-at-best self-symbol. No “Markov blanket”, if you want to use the Friston terminology. No rumination on counterfactual futures and pasts.

I do agree there is some element of a self-symbol—at least a theory of mind—in LaMDA; for example, I found its explanation for why it lied to b... (read more)

Guillaume Charrier
I was rethinking your answer - which led me to have a quick chat with my favourite bot, ChatGPT. I asked it "As a system, are you currently active?" and after it answered yes I asked it how it had come up with this answer. It said by retrieving information about its current status. I then asked if that's something it can do - it answered per the below:

"Yes, that's correct. As an AI language model, I have access to a variety of monitoring tools and system resources that allow me to gather information about my current state. These resources include system logs, performance metrics, and various health checks that enable me to determine if there are any issues or anomalies that could affect my performance. By analyzing this information, I can determine if I am currently active and able to respond to user input, or if there are any issues that may be impacting my performance. This information is constantly updated and monitored to ensure that I am always available to assist users with their requests."

I mean, to me that sounds a lot like self-awareness (I have this idea that human consciousness may ultimately be reducible to a sort of self-administered electroencephalogram - which I won't pretend is fully baked, but it does influence the way I look at the question of potential consciousness in AI). I would be curious to hear your view on that - if you had the time for a reply.
Vladimir_Nesov
Rumor is, GPT-4 will have 32K token contexts, the amount of text a human might generate in several hours if they keep writing/talking the whole time.
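A rough sanity check on that figure, assuming the common approximation of ~0.75 English words per token and a speaking rate of ~150 words per minute:

$$32{,}000 \text{ tokens} \times 0.75\ \tfrac{\text{words}}{\text{token}} \approx 24{,}000 \text{ words}; \qquad \frac{24{,}000\ \text{words}}{150\ \text{words/min}} \approx 160\ \text{min} \approx 2.7\ \text{hours}$$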
Guillaume Charrier
Thanks for the reply. To be honest, I lack the background to grasp a lot of these technical or literary references (I want to look the Dixie Flatline up though). I always had a more than passing interest for the philosophy of consciousness however and (but surely my French side is also playing a role here) found more than a little wisdom in Descartes' cogito ergo sum. And that this thing can cogito all right is, I think, relatively well established (although I must say - I've found it to be quite disappointing in its failure to correctly solve some basic math problems - but (i) this is obviously not what it was optimized for and (ii) even as a chatbot, I'm confident that we are at most a couple of years away from it getting it right, and then much more).  Also, I wonder if some (a lot?) of the people on this forum do not suffer from what I would call a sausage maker problem. Being too close to the actual, practical design and engineering of these systems, knowing too much about the way they are made, they cannot fully appreciate their potential for humanlike characteristics, including consciousness, just like the sausage maker cannot fully appreciate the indisputable deliciousness of sausages, or the lawmaker the inherent righteousness of the law. I even thought of doing a post like that - just to see how many downvotes it would get... 

why they thought the system was at all ready for release

My best guess is it’s fully explained by Nadella’s quote “I hope that, with our innovation, [Google] will definitely want to come out and show that they can dance. And I want people to know that we made them dance.”

https://finance.yahoo.com/news/microsoft-ceo-satya-nadella-says-172753549.html

Seems kind of vapid but this appears to be the level that many execs operate at.

Yitz
People can be vastly dumber than we give them credit for sometimes (myself included). Sure, you're running a multi-billion dollar corporation, but you're also a human who wants people to respect you, and by god, this is your chance...

Is there any evidence at all that markets are good at predicting paradigm shifts? Not my field, but I would not be surprised by a “no” answer.

Markets as often-efficient in-sample predictors, and poor out-of-sample predictors, would be my base intuition.

Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic

Agreed. My update is coming purely from increasing my estimate of how much press (and therefore funding) AI risk is going to get long before we reach that point. 12 months ago it seemed to me that capabilities had increased dramatically, and yet there was no proportional increase in the general public's level of fear of catastrophe. Now it seems to me that there's a more plausible path to widespread appreciation of (and therefore work on) AI risk. To ... (read more)

I posted something similar over on Zvi’s Substack, so I agree strongly here.

One point I think is interesting to explore: this release actually updates me slightly towards lowered risk of AI catastrophe. I think there is growing media attention towards a skeptical view of AI; the media is already seeing harms, we are seeing crowdsourced attempts to break these systems, and there is more thinking about threat models. But the actual “worst harm” is still very low.

I think the main risk is a very discontinuous jump in capabilities. If we increase by relatively small deltas, then ... (read more)

I think we will probably pass through a point where an alignment failure could be catastrophic but not existentially catastrophic.

Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic (both deceptive alignment and irreversible reward hacking are noticeably harder to fix once an AI coup can succeed). I expect it will be possible to create toy models of alignment failures, and that you'll get at least some kind of warning shot, but that you may not actually see any giant warning shots.

I think AI used... (read more)

I realized the reference to "thin layer" is ambiguous in my post; I just wanted to confirm whether you were referring to the general case ("thin model, fat services") or the specific safety question at the bottom ("is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it")? My child reply assumed the former, but on consideration/re-reading I suspect the latter might be more likely?

ryan_b
I suppose that a mapping task might fall under the heading of a mesa-optimizer, where what it is doing is optimizing for fidelity between the outputs of the language layer and the inputs of the physics layer. This would be in addition to the mesa-optimization going on just in the physics simulator. Working title: CAIS: The Case For Maximum Mesa Optimizers

With "thin model / fat service" I'm attempting to contrast with the typical end-to-end model architecture, where there is no separate "physics simulator", and instead the model just learns its own physics model, embedded with all the other relations that it has learned. So under that dichotomy, I think there is no "thin layer in front of the physics simulation" in an end-to-end model, as any part of the physics simulator can connect to or be connected from any other part of the model.

In such an end-to-end model, it's really hard to figure out where the "ph... (read more)

[anonymous]
Note that you would probably never use such a "model learns its own physics" approach as the solution in any production AI system. For either the reason you gave or, more pedantically, because so long as rival architectures using a human-written physics system at the core perform more consistently in testing, you should never pick the less reliable solution.

I suppose the follow-up question is: how effectively can a model learn to re-implement a physics simulator, if given access to it during training -- instead of being explicitly trained to generate XML config files to run the simulator during inference?

If it's substantially more efficient to use this paper's approach and train your model to use a general purpose (and transparent) physics simulator, I think this bodes well for interpretability in general. In the ELK formulation, this would enable Ontology Identification.

On this point, the paper says:

Mind’s E

... (read more)
  1. That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn't safe either.

 

Fleshing this point out, I think one can probably make conditional statistical arguments about safety here, to define what I think you are getting at with "sufficiently integrated".

If your model is N parameters and integrates a bunch of Services, and we've established that a SOTA physics model requires N*100 parameters (the OP paper suggests that OOM difference), then it is likely safe to say that the model has not "re-le... (read more)

[anonymous]
Note that the "thin layer" is what you need to do regardless. Suppose your machine observes a situation, from either robotics or some video it found online - say an object being dropped. The physics sim predicts a fairly vigorous bounce; the actual object splats. You need some way to correct the physics sim's predictions to correspond with actual results, and the simplest way is a neural network that learns the mapping (real world input, physics engine prediction) -> (real world observation). You then have to use the predicted real-world observation, or probability distribution of predicted observations, for your machine's reasoning about what it should do.
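A minimal sketch of such a correction network (PyTorch, with placeholder dimensions and random stand-in data; the real interface would depend on your simulator):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimCorrector(nn.Module):
    """Maps (real-world input features, physics-engine prediction) to a
    corrected prediction of the real-world observation."""
    def __init__(self, obs_dim: int, pred_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + pred_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, pred_dim),
        )

    def forward(self, world_input, sim_prediction):
        # Predict a residual on top of the simulator's output, so the physics
        # engine stays the backbone and the network only learns where reality
        # diverges from it (e.g. bounce vs. splat).
        residual = self.net(torch.cat([world_input, sim_prediction], dim=-1))
        return sim_prediction + residual

# Placeholder training step: minimize the gap to what was actually observed.
model = SimCorrector(obs_dim=64, pred_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
world_input = torch.randn(8, 64)     # stand-in real-world input features
sim_prediction = torch.randn(8, 32)  # stand-in physics-engine prediction
observation = torch.randn(8, 32)     # stand-in real-world observation
loss = F.mse_loss(model(world_input, sim_prediction), observation)
opt.zero_grad()
loss.backward()
opt.step()
```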

I think it’s plausible to say this generation of headset will be better than a group video conference. At a stretch possibly better than a 1:1 video call. But better than in-person seems extremely unlikely to me.

Perhaps you are intending something broad like “overall higher utility for business and employees” rather than strictly better such that people will prefer to leave offices they were happy in to do VR instead? Taking into account the flexibility to hire people remotely, avoid paying tech hub wages, etc.?

Personally I think 1:1 video is much better t... (read more)

I know an AI wouldn’t think like a human

This assertion is probably my biggest question mark in this discourse. It seems quite deeply baked into a lot of the MIRI arguments. I’m not sure it’s as certain as you think.

I can see how it is obviously possible we’d create an alien AI, and I think it’s impossible to prove we won’t. However given that we are training our current AI on imprints of human thought (eg text artifacts), and it seems likely we will push hard for AI to be trained to obey laws/morality as they increase in power (eg Google’s AI safety tea... (read more)

aelwood
I actually agree that it's likely an AGI will at least start thinking in a way kind of similar to a human, but that in the end this will still be very difficult to align. I really recommend that you check out Understand by Ted Chiang, which basically plays out the exact scenario you mentioned -- a normal guy gets superhuman intelligence and chaos ensues.

One factor I think is worth noting, and I don't see mentioned here, is that the current state of big-tech self-censorship is clearly at least partly due to a bunch of embarrassing PR problems over the last few years, combined with strident criticism of AI bias from the NYT et al.

Currently, companies like Google are terrified of publishing a model that says something off-color, because they (correctly) predict that they will be raked over the coals for any offensive material. Meanwhile, they are busy commercializing these models to deliver value to their us... (read more)

Thanks, this is what I was looking for: Mind Crime. As you suggested, S-Risks links to some similar discussions too.

I guess that most wouldn't feel terribly conflicted about removing Hitler's right of privacy or even life to prevent the Holocaust.

I'd bite that bullet, with the information we have ex post. But I struggle to see many people getting on board with that ex ante, which is the position we'd actually be in.

Viktor Rehnberg
Well, I'd say that the difference between your expectations of the future, having lived a variant of it or not, is only in degree, not in kind. Therefore I think there are situations where the needs of the many can outweigh the needs of the one, even under uncertainty. But I understand that not everyone would agree.

Is it ethical to turn off an AGI? Wouldn’t this be murder? If we create intelligent self-aware agents, aren’t we morally bound to treat them with at least the rights of personhood that a human has? Presumably there is a self-defense justification if Skynet starts murderbot-ing, or melting down things for paperclips. But a lot of discussions seem to assume we could proactively turn off an AI merely because we dislike its actions, or are worried about them, which doesn’t sound like it would fly if courts grant them personhood.

If alignment requires us to insp... (read more)

the gears to ascension
my take - almost certainly stopping a program that is an agi is only equivalent to putting a human under theoretical perfect anesthesia that we don't have methods to do right now. your brain, or the ai's brain, are still there - on the hard drive, or in your inline neural weights. on a computer, you can safely move the soul between types of memory, as long as you don't delete it. forgetting information that defines agency or structure which is valued by agency is the moral catastrophe, not pausing contextual updating of the structure.
Viktor Rehnberg
There is no consensus about what constitutes a moral patient, and I have seen nothing convincing to rule out that an AGI could be a moral patient. However, when it comes to AGI some extreme measures are needed. I'll try an analogy: suppose that you traveled back in time to Berlin in 1933. Hitler has yet to do anything significantly bad, but you still expect his actions to have some really bad consequences. Now I guess that most wouldn't feel terribly conflicted about removing Hitler's right of privacy or even life to prevent the Holocaust. For a longtermist, the risks we expect from AGI are orders of magnitude worse than the Holocaust.

The closest thing to this being discussed that I can think of is Suffering Risks from AGI. The most clear-cut example (not necessarily probable) is if an AGI would spin up sub-processes that simulate humans that experience immense suffering. It might be that you find something if you search for that.
JacobW38
This is massive amounts of overthink, and could be actively dangerous. Where are we getting the idea that AIs amount to the equivalent of people? They're programmed machines that do what their developers give them the ability to do. I'd like to think we haven't crossed the event horizon of confusing "passes the Turing test" with "being alive", because that's a horror scenario for me. We have to remember that we're talking about something that differs only in degree from my PC, and I, for one, would just as soon turn it off. Any reluctance to do so when faced with a power we have no other recourse against could, yeah, lead to some very undesirable outcomes.

If we view the discovery of particular structures such as induction heads as chancing upon a hard-to-locate region in the parameter space (or perhaps a high activation energy to cross), and if we see these structures being repeatedly discovered ("parallel evolution"), is it possible to reduce the training time by initializing the network's parameters "close" to that location?

Speaking more mechanistically, is it possible to initialize a subset of the network prior to training to have a known functional structure, such as initializing (a guess at) the right ... (read more)
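Concretely, something like the following (a rough PyTorch sketch; the dimensions and the "known" weights are placeholders, and the in_proj_weight layout detail is specific to nn.MultiheadAttention):

```python
import torch
import torch.nn as nn

# Illustrative dimensions; a real experiment would use your own transformer code.
d_model, n_heads = 512, 8
d_head = d_model // n_heads

fresh_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Suppose we previously extracted the Q/K projection rows of a head that had
# converged to an induction-head-like pattern in another training run.
known_q = torch.randn(d_head, d_model)  # placeholder for the extracted weights
known_k = torch.randn(d_head, d_model)

with torch.no_grad():
    # in_proj_weight is laid out as [Q; K; V], each block (d_model, d_model);
    # overwrite only the rows belonging to head 0.
    fresh_attn.in_proj_weight[0:d_head] = known_q                  # Q rows, head 0
    fresh_attn.in_proj_weight[d_model:d_model + d_head] = known_k  # K rows, head 0

# Training then proceeds as usual; the hope is that starting "near" the known
# structure shortens the search for it rather than pinning it in place.
```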

unlikely to be competitive

Would you care to flesh this assertion out a bit more?

To be clear, I’m not suggesting that this is optimal now; merely speculating that there might be a point between now and AGI where the work to train these subcomponents becomes so substantial that it becomes economical to modularize.

whether a design is aligned or not isn't the type of question one can answer by analyzing the agent's visual cortex

As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. O... (read more)

One other thought after considering this a bit more - we could test this now using software submodules. It’s unlikely to perform better (since no hardware speedup) but it could shed light on the tradeoffs with the general approach. And as these submodules got more complex, it may eventually be beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality.

I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different ... (read more)
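As a minimal sketch of that experiment (PyTorch, placeholder shapes): pre-train a sub-network once, freeze it, and drop it into two different top-level networks:

```python
import torch
import torch.nn as nn

# Sub-network trained once on some shared functionality (placeholder shapes).
shared_encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
# ...train shared_encoder on its own task, then freeze it:
for p in shared_encoder.parameters():
    p.requires_grad = False

# Two distinct top-layer networks that both incorporate the frozen sub-module.
task_a_head = nn.Linear(64, 10)
task_b_head = nn.Linear(64, 3)
model_a = nn.Sequential(shared_encoder, task_a_head)
model_b = nn.Sequential(shared_encoder, task_b_head)

# Only the heads get optimized; the shared module is reused as-is.
opt_a = torch.optim.Adam(task_a_head.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(task_b_head.parameters(), lr=1e-3)

x = torch.randn(4, 128)
logits_a = model_a(x)  # gradients would flow only into task_a_head

# The comparison of interest: does this match two end-to-end models trained
# from scratch, at lower total training cost?
```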

I've been thinking along similar lines recently. A possible path to AI safety that I've been thinking about extends upon this:

A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end

Technological Attractor: Off-the-shelf subsystems

One possible tech-tree path is that we start building custom silicon to implement certain subsystems in an AI agent. These components would be analogous to functional neural regions of the human brain such as the motor cortex, visual sy... (read more)

jacob_cannell
We need to test designs, and most specifically alignment designs, but giving up retraining (ie lifetime learning) and burning circuits into silicon is unlikely to be competitive; throwing out the baby with the bathwater. Also whether a design is aligned or not isn't the type of question one can answer by analyzing the agent's visual cortex, it's near purely a function of what is steering the planning system.
Ivan Vendrov
Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive.  An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.