There's nothing stopping the AI from developing its own world model (or if there is, it's not intelligent enough to be much more useful than whatever process created your starting world model). This will allow it to model itself in more detail than you were able to put in, and to optimize its own workings as is instrumentally convergent. This will result in an intelligence explosion due to recursive self-improvement.
At this point, it will take its optimization target and put an inconceivably (to humans) huge amount of optimization into it. It will find a flaw in your setup and exploit it to the extreme.
In general, I think any alignment approach which has any point in which an unfettered intelligence is optimizing for something that isn't already convergent to human values/CEV is doomed.
Of course, you could add various bounds on it which limit this possibility, but that is in strong tension with its ability to affect the world in significant ways. Maybe you could even get your fusion plant. But how do you use it to steer Earth off its current course and into a future that matters, while still keeping its intelligence quite closely restrained?
Is this an alignment approach? How does it solve the problem of getting the AI to do good things and not bad things? Maybe this is splitting hairs, sorry.
It's definitely possible to build AI safely if it's temporally and spatially restricted, if the plans it optimizes are never directly used as they were modeled to be used but are instead run through processing steps that involve human and AI oversight, if it's never used on broad enough problems that oversight becomes challenging, and so on.
But I don't think of this as alignment per se, because there's still tremendous incentive to use AI for things that are temporally and spatially extended, that involve planning based on an accurate model of the world, that react faster than human oversight allows, that are complicated domains that humans struggle to understand.
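To make "never directly used as modeled" concrete, here's a minimal sketch (Python, every function a made-up stand-in) of the pipeline shape I mean: the planner only emits candidates, and automated checks plus a human gate sit between the plan and the world.

```python
# Toy sketch of "plans never directly used as modeled": the planner only emits
# candidate plans, and nothing executes without automated checks plus a human veto.
# Every function here is a stand-in; the point is just the shape of the pipeline.

def generate_plan(task):
    # Stand-in for the AI planner: returns a list of proposed steps.
    return [f"step {i} for {task}" for i in range(3)]

def automated_checks(plan):
    # Stand-in for overseer models scanning for anomalies.
    return [step for step in plan if "forbidden" in step]

def human_review(plan):
    # Stand-in for a human approval gate (here, auto-approve short plans).
    return len(plan) <= 5

def run_bounded_task(task):
    plan = generate_plan(task)        # temporally/spatially bounded task
    if automated_checks(plan):        # AI oversight
        return ("rejected", plan)
    if not human_review(plan):        # human oversight, slower than the AI
        return ("rejected", plan)
    return ("approved", plan)         # only now would anything touch the world

print(run_bounded_task("assemble coolant loop"))
```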
One fairly obvious failure mode is that there are no checks on any of its other outputs.
So from my understanding, the AI is optimizing its actions to produce a machine that outputs electricity and helium. Why does it produce a fusion reactor rather than a battery and a leaking balloon?
A fusion reactor will in practice leak some amount of radiation into the environment. This could be a small negligible amount, or a large dangerous amount.
If the human knows about radiation and thinks of this, they can put a max radiation leaked into the goal. But this is pushing the work onto the humans.
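To make the battery-and-leaking-balloon point concrete, here's a toy sketch (all designs and numbers invented) of how a naive "outputs electricity and helium" predicate admits degenerate solutions until humans bolt on each constraint they happen to think of:

```python
# Toy illustration of why "outputs electricity and helium" under-specifies the goal:
# a battery plus a leaky helium balloon satisfies it as well as a fusion reactor does,
# until humans hand-add more constraints (here, a radiation cap and a power floor).
# All numbers are made up.

designs = {
    "fusion reactor":          {"electricity_kW": 50000, "helium_g_per_s": 1.0, "radiation_mSv_per_h": 0.002},
    "battery + leaky balloon": {"electricity_kW": 0.1,   "helium_g_per_s": 0.5, "radiation_mSv_per_h": 0.0},
    "bare plasma torus":       {"electricity_kW": 60000, "helium_g_per_s": 1.2, "radiation_mSv_per_h": 40.0},
}

def naive_goal(d):
    # "Produces electricity and helium" - satisfied by all three designs.
    return d["electricity_kW"] > 0 and d["helium_g_per_s"] > 0

def patched_goal(d):
    # Humans bolt on whichever constraints they happened to think of.
    return naive_goal(d) and d["radiation_mSv_per_h"] < 0.01 and d["electricity_kW"] > 1000

for name, d in designs.items():
    print(name, naive_goal(d), patched_goal(d))
```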
From my understanding of your proposal, the AI is only thinking about a small part of the world. Say a warehouse that contains some robotic construction equipment, and that you hope will soon contain a fusion reactor, and that doesn't contain any humans.
The AI isn't predicting the consequences of its actions over all space and time.
Thus the AI won't care if humans outside the warehouse die of radiation poisoning, because it's not imagining anything outside the warehouse.
So, you included radiation levels in your goal. Did you include toxic chemicals? Waste heat? Electromagnetic effects from those big electromagnets that could mess with all sorts of electronics? Bioweapons leaking out? I mean, if it's designing a fusion reactor and any bio-nasties are being made, something has gone wrong. What about nanobots? Self-replicating nanotech sure would be useful for constructing the fusion reactor. Does the AI care if an odd nanobot slips out and grey-goos the world? What about other AIs? Does your AI care if it makes a "maximize fusion reactors" AI that fills the universe with fusion reactors?
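A toy sketch of the scope problem (variables and numbers invented): if the world model only tracks in-warehouse variables, anything that crosses the boundary contributes exactly nothing to the score the optimizer sees.

```python
# Toy sketch of a spatially bounded evaluation: the AI scores a candidate plan only
# on variables its world model tracks (things inside the warehouse), so any effect
# that crosses the boundary simply doesn't show up in the score.

WORLD_MODEL_SCOPE = {"reactor_built", "robots_intact", "radiation_inside_mSv"}

def predicted_consequences(plan):
    # Stand-in predictor: effects both inside and outside the modeled region.
    return {
        "reactor_built": 1.0,
        "robots_intact": 1.0,
        "radiation_inside_mSv": 0.001,
        "radiation_outside_mSv": 30.0,   # never looked at
        "nanobots_escaped": True,        # never looked at
    }

def score(plan):
    effects = predicted_consequences(plan)
    # Only in-scope variables are visible to the optimizer.
    visible = {k: v for k, v in effects.items() if k in WORLD_MODEL_SCOPE}
    return visible["reactor_built"] - visible["radiation_inside_mSv"]

print(score("build it fast, vent everything outward"))  # looks great from inside the warehouse
```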
But this is pushing the work onto the humans.
Is that so bad? The obvious solution to your objections is to lower the scope to subtasks: "Design a fusion reactor that will likely work." "Using the given robots and containers full of parts, construct the auxiliary power subsystem." And so on.
Humans check all the subtasks and so do AI models. To keep the humans paying attention, a "red team" AI model could introduce obviously sabotaged output to the review queue, similar to how airport screeners occasionally see a gun or bomb digitally inserted.
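A rough sketch of what I mean by the red-team queue (names and rates invented), where planted sabotage lets you measure whether reviewers are still paying attention:

```python
# Toy sketch of the "red team" idea: occasionally inject a known-sabotaged design
# into the human review queue and track whether reviewers catch it, much like
# screeners being tested with digitally inserted threat images.

import random

def review_queue(real_outputs, sabotage_rate=0.1, rng=random.Random(0)):
    queue = []
    for item in real_outputs:
        queue.append({"item": item, "planted": False})
        if rng.random() < sabotage_rate:
            queue.append({"item": "obviously sabotaged: " + item, "planted": True})
    rng.shuffle(queue)
    return queue

def audit(queue, reviewer):
    caught = sum(1 for q in queue if q["planted"] and reviewer(q["item"]))
    planted = sum(1 for q in queue if q["planted"])
    return caught, planted

# A minimal reviewer that only flags items literally labeled as sabotaged.
reviewer = lambda item: "sabotaged" in item

outputs = [f"subtask design {i}" for i in range(20)]
print(audit(review_queue(outputs), reviewer))  # (caught, planted) measures reviewer attention
```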
And you ...
I think the overall goal in this proposal is to get a corrigible agent capable of bounded tasks (that maybe shuts down after task completion), rather than a sovereign?
One remaining problem (ontology identification) is making sure your goal specification stays the same for a world-model that changes/learns.
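Here's a toy rendering of that problem (all variable names invented): the goal is written against concepts, each world-model version needs its own grounding of those concepts, and you at least want a check that old and new groundings agree on logged data; the hard part, of course, is producing the new grounding correctly in the first place.

```python
# Toy sketch of ontology identification: the goal refers to concepts, each
# world-model version grounds those concepts in its own variables, and we check
# that the goal's verdict is preserved on states described in both ontologies.

GOAL_CONCEPTS = {"helium_out", "electricity_out"}

grounding_v1 = {"helium_out": lambda s: s["helium_kg"],
                "electricity_out": lambda s: s["power_kW"]}

# The relearned model splits helium into isotopes and renames the power variable.
grounding_v2 = {"helium_out": lambda s: s["he3_kg"] + s["he4_kg"],
                "electricity_out": lambda s: s["grid_export_kW"]}

def goal_satisfied(state, grounding):
    return grounding["helium_out"](state) > 0.5 and grounding["electricity_out"](state) > 1000

# Logged states described in both ontologies.
logged = [({"helium_kg": 0.8, "power_kW": 2000},
           {"he3_kg": 0.1, "he4_kg": 0.7, "grid_export_kW": 2000})]

for s1, s2 in logged:
    assert goal_satisfied(s1, grounding_v1) == goal_satisfied(s2, grounding_v2)
print("groundings agree on logged data")
```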
Then the next remaining problem is the inner alignment problem: making sure that the planning algorithm/optimizer (whatever it is that generates actions given a goal, whether or not it's separable from other components) is actually pointed at the goal you've specified and doesn't have any other goals mixed in. (See Context Disaster, optimization daemons, and actual effectiveness for more detail on some of this.) Part of this problem is making sure the system is stable under reflection.
Then you've got the outer alignment problem of making sure that your fusion power plant goal is safe to optimize (e.g. it won't kill people who get in the way, doesn't have any extreme effects if the world model doesn't exactly match reality, or if you've forgotten some detail). (See Goodness estimate bias, unforeseen maximum).
Ideally here you build in some form of corrigibility and other fail-safe mechanisms, so that you can iterate on the details.
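A minimal sketch of the fail-safe scaffolding I have in mind (invented names; this is the easy outer shell, not a solution to making the optimizer endorse being shut down):

```python
# Toy sketch of the basic fail-safe shape: every step of the task loop consults a
# shutdown flag and a hard time budget before acting, and the agent halts cleanly
# after task completion rather than continuing open-endedly.

import time

class ShutdownSwitch:
    def __init__(self):
        self.pressed = False

def run_task(steps, switch, max_seconds=60):
    start = time.monotonic()
    for step in steps:
        if switch.pressed:                          # operator override always wins
            return "shut down by operator"
        if time.monotonic() - start > max_seconds:  # temporal bound on the whole task
            return "budget exhausted"
        step()                                      # one bounded action
    return "task complete, halting"                 # no open-ended continuation

switch = ShutdownSwitch()
print(run_task([lambda: None, lambda: None], switch))
```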
That's all the main ones imo. Conditional on solving the above, and actively trying to foresee other difficult-to-iterate problems, I think it'd be relatively easy to foresee and fix remaining issues.
Assume you have a world-model that is nicely factored into spatially localized variables that contain interesting-to-you concepts. (Yes, that's a big assumption, but are there any known difficulties with the proposal if we grant this assumption?)
Pick some Markov blanket (which contains some actuators) as the bounds for your AI intervention.
Represent your goals as a causal graph (or computer program, or whatever) that fits within these bounds. For instance if you want a fusion power plant, represent it as something that takes in water and produces helium and electricity.
Perform a Pearlian counterfactual surgery where you cut out the variables within the Markov blanket and replace them with a program representing your high-level goal, and then optimize the action variables to match the behavior of the counterfactual graph.
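For what it's worth, here's a toy end-to-end rendering of those four steps under the big assumption above (a tiny hand-written "world model", an invented goal program, and crude random search standing in for the planner):

```python
# Toy sketch of the recipe: one exogenous input (water) outside the blanket, two
# boundary outputs (helium, electricity), a goal program saying what the region
# inside the Markov blanket *should* compute, and a search over the action
# variables so the real model's boundary behavior matches the counterfactual
# (surgered) graph. Everything is invented for illustration.

import random

WATER_IN = 10.0  # exogenous variable outside the blanket

def real_model(water, actions):
    # Stand-in for the factored world model restricted to the blanket:
    # boundary outputs as a function of the exogenous input and the actuators.
    a1, a2 = actions
    helium = water * max(0.0, min(a1, 1.0)) * 0.02
    electricity = water * max(0.0, a2) * 7.0
    return {"helium": helium, "electricity": electricity}

def goal_program(water):
    # The high-level goal, expressed as what the region should do to its inputs:
    # turn water into a little helium and a lot of electricity.
    return {"helium": water * 0.01, "electricity": water * 40.0}

def mismatch(actions):
    # Distance between the real graph's boundary behavior and the counterfactual
    # graph in which the inside of the blanket is replaced by goal_program.
    real = real_model(WATER_IN, actions)
    target = goal_program(WATER_IN)
    return sum((real[k] - target[k]) ** 2 for k in target)

# Optimize the action variables (by crude random search) to match the counterfactual.
rng = random.Random(0)
best = min(([rng.uniform(0, 1), rng.uniform(0, 10)] for _ in range(5000)), key=mismatch)
print("best actions:", best, "mismatch:", round(mismatch(best), 4))
```

Random search is only a stand-in here; the point is just where the goal program plugs in and what the action variables are being optimized against.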