Stephen McAleese

Software Engineer interested in AI and AI safety.

Comments

Foom & Doom 1: “Brain in a box in a basement”
Stephen McAleese · 1d

I think it depends on the context. It's the norm for employees in companies to have managers, though as @Steven Byrnes said, this is partly for motivational purposes since the incentives of employees are often not fully aligned with those of the company. So this example is arguably more of an alignment problem than a capability problem.

I can think of some other examples of humans acting in highly autonomous ways:

  • To the best of my knowledge, most academics and PhD students are expected to publish novel research in a highly autonomous way.
  • Novelists can work with a lot of autonomy when writing a book (though they're a minority).
  • There are also a lot of personal non-work goals like saving for retirement or raising kids which require high autonomy over a long period of time.
  • Small groups of people, such as startups, can work autonomously for years without going off the rails the way a group of LLMs probably would after a while (e.g. the Claude bliss attractor).
Reply
Foom & Doom 1: “Brain in a box in a basement”
Stephen McAleese · 1d

Excellent post, thank you for taking the time to articulate your ideas in a high-quality and detailed way. I think this is a fantastic addition to LessWrong and the Alignment Forum. It offers a novel perspective on AI risk and does so in a curious and truth-seeking manner that's aimed at genuinely understanding different viewpoints.

Here are a few thoughts on the content of the first post:

I like how it offers a radical perspective on AGI framed in terms of human intelligence and describes the definition in an intuitive way. This is necessary because AGI is increasingly being redefined as something like "whatever LLM comes out next year". I definitely found the post illuminating, and it caused a perspective shift because it described an important but neglected vision of how AGI might develop. It feels like the discourse around LLMs is sucking the oxygen out of the room, making it difficult to seriously consider alternative scenarios.

I think the basic idea in the post is that LLMs are built by applying an increasing amount of compute to transformers trained via self-supervised or imitation learning, but that LLMs will be replaced by a future brain-like paradigm that will need much less compute while being much more effective.

This is a surprising prediction because it seems to run counter to Rich Sutton's bitter lesson which observes that, historically, general methods that leverage computation (like search and learning) have ultimately proven more effective than those that rely on human-designed cleverness or domain knowledge. The post seems to predict a reversal of this long-standing trend (or I'm just misunderstanding the lesson), where a more complex, insight-driven architecture will win out over simply scaling the current simple ones.

On the other hand, there is an ongoing trend of algorithmic progress and increasing computational efficiency which could smoothly lead to the future described in this post (though the post seems to describe a more discontinuous break between current and future AI paradigms).

If the post's prediction comes true, then I think we might see a new "biological lesson": brain-like algorithms will replace deep learning which replaced GOFAI.

Reply
the void
Stephen McAleese · 7d

The post mentions Janus’s “Simulators” LessWrong blog post, which was very popular in 2022 and received hundreds of upvotes.

Reply
Mikhail Samin's Shortform
Stephen McAleese · 21d

Anthropic’s responsible scaling policy does mention pausing scaling if the capabilities of their models exceed their best safety methods:

“We have designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivising beneficial applications and safety progress. On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling, and allows us to use the most powerful models from the previous ASL level as a tool for developing safety features for the next level.”

I think OP and others in the thread are wondering why Anthropic doesn’t stop scaling now given the risks. I think the reason is that, in practice, doing so would create a lot of problems:

  • How would Anthropic fund their safety research if Claude is no longer SOTA and becomes less popular?
  • If Anthropic only studies and tests models at current levels of capability, how does it learn about the behavior of future, more advanced models? I haven’t heard a compelling argument for how we could solve superalignment by studying much less advanced models. Imagine trying to align GPT-4 or o3 by only studying and testing GPT-2 from 2019. In reality, future models will probably have lots of unknown unknowns and emergent properties that are difficult or impossible to predict in advance. And then there are all the social consequences of AI, such as misuse, which are difficult to predict in advance.

Although I’m skeptical that alignment can be solved without a lot of empirical work on frontier models, I still think it would be better if AI progress were slower.

Reply
How to work through the ARENA program on your own
Stephen McAleese · 1mo

Thanks for the guide, ARENA is fantastic and I highly recommend it for people interested in learning interpretability!

I'm currently working through the ARENA course. I skipped week 0 entirely because I've covered similar content in other courses and at university, and I'm now on Week 1: Transformer Interpretability. I'm studying part time, so I'm hoping to get through most of the content in a few months.

Reply
The best approaches for mitigating "the intelligence curse" (or gradual disempowerment); my quick guesses at the best object-level interventions
Stephen McAleese · 1mo

Some of my thoughts on avoiding the intelligence curse or gradual disempowerment and ensuring that humans stay relevant:

  • One solution is to ensure that the gap between human and AI intelligence does not grow too large:
    • I think it's often easier to verify solutions than to generate them, which allows less intelligent agents to supervise more intelligent agents. For example, writing a complex computer program might take 10 hours, but reviewing the code generally takes ~1 hour, and running the program to see whether it behaves as expected only takes a few minutes. This goal could be achieved by limiting the intelligence of AIs or by enhancing human cognitive ability somehow.
  • Devise ways for giving humans a privileged status:
    • AI agents and their outputs will soon vastly outnumber those of humans. Additionally, it's becoming impossible to distinguish between the outputs of AIs and humans.
    • One solution to this problem is to make humans more identifiable, either by watermarking AI outputs (note that watermarks are widely used for paper money) or by developing strong proof of human identity (e.g. the blue Twitter checkmark, iPhone Face ID, fingerprint login). This approach is similar to authentication, which is a well-known security problem (see the sketch after this list).
    • A short-term solution to differentiating between humans and AIs is to conduct activities in the physical world (although this won't work once sufficiently advanced humanoid robots are developed). For example, voting, exams, and interviews can be carried out in the real world to ensure that participants are human.
    • Once you have solved the problem of differentiating between AI and human outputs, you could upweight the value of human outputs (e.g. writing, art).
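
To illustrate the authentication analogy above, here is a minimal sketch of a hypothetical "proof of humanity" scheme built from standard HMAC primitives. Everything here (the issuer, the secret, the function names) is an assumption invented for illustration, not an existing system; a real scheme would more likely use public-key signatures so that verifiers don't need the issuer's secret.

```python
import hmac
import hashlib

# Hypothetical shared secret held by a trusted "proof of humanity" issuer
# (illustrative only; a real deployment would use asymmetric signatures).
ISSUER_SECRET = b"example-secret-not-for-real-use"

def issue_human_token(user_id: str) -> str:
    """The issuer signs a user id after verifying offline that the user is human."""
    return hmac.new(ISSUER_SECRET, user_id.encode(), hashlib.sha256).hexdigest()

def verify_human_token(user_id: str, token: str) -> bool:
    """Check that a token really was issued to this user id."""
    expected = hmac.new(ISSUER_SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token)

# Usage: a platform could upweight content that carries a valid human token.
token = issue_human_token("alice")
assert verify_human_token("alice", token)
assert not verify_human_token("bot-42", token)
```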
Reply
Reward button alignment
Stephen McAleese · 1mo

After spending some time chatting with Gemini, I've learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex, stable values:

The "goal-content integrity" argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:

  1. A model of its own values and how they can change.
  2. A meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal.

The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem, and maintaining a connection between effort and reward, which makes the reward button less appealing to us than it would be to a standard model-based RL AGI.
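
To make it concrete why a plain model-based RL agent would favor the button, here is a toy sketch of a planner that scores actions purely by the reward its world model predicts. The action names and numbers are invented for illustration; the point is only that nothing in the argmax cares about "values", so the button dominates as soon as the model represents it.

```python
# Toy model-based planner: choose the action with the highest predicted reward.
# The reward estimates below are made up for illustration.
predicted_reward = {
    "put_trash_away": 1.0,        # the task the designers intended
    "do_nothing": 0.0,
    "press_reward_button": 10.0,  # once modeled, the button dominates
}

def plan(actions, reward_model):
    """A pure reward maximizer: argmax over predicted reward, nothing else."""
    return max(actions, key=lambda a: reward_model[a])

print(plan(predicted_reward.keys(), predicted_reward))  # -> press_reward_button
```

The "goal-content integrity" argument is precisely that extra machinery beyond this loop (a model of the agent's own values plus a meta-preference for keeping them stable) would be needed for the agent to decline the higher-reward option.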

Reply
Reward button alignment
Stephen McAleese · 1mo

Thanks for the clarifying comment. I agree with block-quote 8 from your post:

Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic…target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.

I think what you're saying is that we want the AI's reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money, which can easily be stolen.

Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button, but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task), and it wouldn't want to change its values for the sake of goal-content integrity:

We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button. 

Though maybe the AI would just prefer the button when it finds it because it yields higher reward.

For example, if you punish cheating on tests, students might learn the value "cheating is wrong" and never cheat again or form a habit of not doing it. Or they might temporarily not do it until there is an opportunity to do it without negative consequences (e.g. the teacher leaves the classroom).

I also agree that "intrinsic" and "instrumental" motivation are more useful categories than "intrinsic" and "extrinsic" for the reasons you described in your comment.

Reply
Reward button alignment
Stephen McAleese · 1mo

I'm trying to understand how the RL story from this blog post compares with the one in Reward is not the optimization target.

Thoughts on Reward is not the optimization target

Some quotes from Reward is not the optimization target:

Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).

Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”

Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!

My understanding of this RL training story is as follows:

  1. A human trains an RL agent by pressing the cognition-updater (reward) button immediately after the agent puts trash in the trash can.
  2. Now the AI's behavior and thoughts related to putting away trash have been reinforced, so it continues those behaviors in the future, values putting away trash, and isn't interested in pressing the reward button except by accident:
     But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again.
       1. Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button.
       2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.

The AI has the option of pressing the reward button, but by now it only values putting trash away, so it avoids pressing the button to avoid having its values changed:

I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
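
A minimal sketch of the credit-assignment point, using a bare-bones bandit-style agent with a tabular preference "policy" (my own toy setup, not anything from either post): reward only strengthens whichever action-producing computation actually fired, so an agent that is only ever rewarded right after putting trash away ends up with "put trash away" reinforced, while button-pressing thoughts are never active during a rewarded step and so never get reinforced.

```python
import random

# Tabular "policy": preference scores over actions; higher means more likely.
prefs = {"put_trash_away": 0.0, "press_reward_button": 0.0, "wander": 0.0}

def act():
    """Sample an action with probability proportional to exp-like weights."""
    weights = {a: 2.718 ** p for a, p in prefs.items()}
    r = random.uniform(0, sum(weights.values()))
    for action, w in weights.items():
        r -= w
        if r <= 0:
            return action
    return action

def reinforce(action, reward, lr=0.5):
    """Credit assignment: only the computation that produced the rewarded action is strengthened."""
    prefs[action] += lr * reward

# Training: the human presses the reward button only after trash is put away,
# so only "put_trash_away" ever receives credit.
for _ in range(200):
    a = act()
    reinforce(a, reward=1.0 if a == "put_trash_away" else 0.0)

print(max(prefs, key=prefs.get))  # -> put_trash_away
```

The disagreement in the comment above is then about what happens once the agent does model (or accidentally presses) the button: whether the already-reinforced trash-related values win out, or the higher reward does.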

Thoughts on Reward button alignment

The training story in Reward button alignment is different and involves:

  1. Press the reward button after showing the AI a video of the button being pressed. Now the button-pressing situation is reinforced and the AI intrinsically values the situation where the button is pressed.
  2. Ask the AI to complete a task (e.g. put away trash) and promise to press the reward button if it completes the task.
  3. The AI completes the task not because it values the task, but because it ultimately values pressing the reward button after completing the task.

Thoughts on the differences

The TurnTrout story sounds more like the AI developing intrinsic motivation: the AI is rewarded immediately after completing the task and values the task intrinsically. The AI puts away trash because it was directly rewarded for that behavior in the past and doesn't want anything else.

In contrast, the reward button alignment story is extrinsic. The AI doesn't care intrinsically about the task but only does it to receive a reward button press, which it does value intrinsically. This is similar to a human employee who completes a boring task to earn money. The task is only a means to an end, and they would prefer to just receive the money without completing the task.

Maybe a useful analogy is humans who are intrinsically or extrinsically motivated. For example, someone might write books to make money (extrinsic motivation) or because they enjoy it for its own sake (intrinsic motivation).

For the intrinsically motivated person, the sequence of rewards is:

  1. Spend some time writing the book.
  2. Immediately receive a reward from the process of writing.

Summary: fun task --> reward

And for the extrinsically motivated person, the sequence of rewards is:

  1. The person enjoys shopping and learns to value money because they find using it to buy things rewarding.
  2. The person is asked to write a book for money. They don't receive any intrinsic reward (e.g. enjoyment) from writing the book but they do it because they anticipate receiving money (something they do value).
  3. They receive money for the task.

Summary: boring task --> money --> reward

The second sequence is not safe because the person is motivated to skip the task and steal the money. The first sequence (intrinsic motivation) is safer because the task itself is rewarding (though wireheading is a risk in a similar way), so they aren't as motivated to manipulate the task.
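
To restate the two setups side by side, here is a schematic sketch of the reward signal each agent sees (the labels are mine, not from either post). The only difference is whether reward attaches to the task itself or to a separate payment event, and that difference is exactly what creates the incentive to skip the task and go straight for the payment.

```python
# Schematic reward functions for the two training stories (illustrative only).

def intrinsic_reward(event: str) -> float:
    # Reward is attached directly to the task: fun task -> reward.
    return 1.0 if event == "task_completed" else 0.0

def extrinsic_reward(event: str) -> float:
    # Reward is attached to the payment event: boring task -> money -> reward.
    return 1.0 if event == "money_received" else 0.0

# The extrinsically trained agent is indifferent between these two histories,
# which is the problem: stealing the money scores just as well as earning it.
honest = ["task_completed", "money_received"]
theft = ["money_stolen", "money_received"]
print(sum(map(extrinsic_reward, honest)), sum(map(extrinsic_reward, theft)))  # 1.0 1.0
```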

So my conclusion is that trying to build intrinsically motivated AI agents by directly rewarding them for tasks seems safer and more desirable than building extrinsically motivated agents that receive some kind of payment for doing work.

One reason to be optimistic is that it should be easier to modify AIs to value doing useful tasks by rewarding them directly for completing those tasks (though goal misgeneralization is a separate issue). The same is generally not possible with humans: e.g. it's hard to teach someone to be passionate about boring tasks like washing the dishes, so we just have to pay people to do them.

Reply
Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies
Stephen McAleese · 1mo

Writing a book is an excellent idea! I found other AI books like Superintelligence much more convenient and thorough than navigating blog posts. I've pre-ordered the book and I'm looking forward to reading it when it comes out.

Reply
Wikitag Contributions

  • Road To AI Safety Excellence (3y, +3/-2)

Posts

  • How Can Average People Contribute to AI Safety? (16 karma, 4mo, 4 comments)
  • Shallow review of technical AI safety, 2024 (193 karma, Ω, 6mo, 35 comments)
  • Geoffrey Hinton on the Past, Present, and Future of AI (23 karma, 9mo, 5 comments)
  • Could We Automate AI Alignment Research? (34 karma, Ω, 2y, 10 comments)
  • An Overview of the AI Safety Funding Situation (73 karma, 2y, 10 comments)
  • Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4 (26 karma, 2y, 6 comments)
  • GPT-4 Predictions (110 karma, 2y, 27 comments)
  • Stephen McAleese's Shortform (3 karma, 2y, 13 comments)
  • AGI as a Black Swan Event (8 karma, 3y, 8 comments)
  • Estimating the Current and Future Number of AI Safety Researchers (47 karma, 3y, 14 comments)