New OpenAI Paper - Language models can explain neurons in language models
Summary by OpenAI: We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2. Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models Please share your thoughts in the comments!


To clarify, here are some examples of the type of projects I would love to help with:
- Sponsoring University Research:
Funding researchers to publish papers on AI alignment and AI existential risk (X-risk). This could start with foundational, descriptive papers that help define the field and open the door for more academics to engage in alignment research. These papers could also provide references and credibility for others to build upon.
- Developing Accessible Pitches:
Creating a "boilerplate" for how to effectively communicate the importance of AI alignment to non-rationalists, whether they are academics, policymakers, or the general public. This could include shareable content designed to resonate with people who may not already be engaged with rationalist or