All of Weibing Wang's Comments + Replies

I think this plan is not sufficient to completely solve problems #1, #2, #3 and #5. I can't come up with a better one for the time being. I think more discussions are needed.

I agree with your view about organizational problems. Your discussion gave me an idea: Is it possible to shift employees dedicated to capability improvement to work on safety improvement? Set safety goals for these employees within the organization. This way, they will have a new direction and won't be idle, worried about being fired or resigning to go to other companies. Besides, it's necessary to make employees understand that improving safety is a highly meaningful job. This may not rely solely on the organization itself, but also require external press... (read more)

3Dakara
That seems to solve problem #4. Employees quitting becomes much less of an issue, since in any case they would only be able to share knowledge about safety (which is a good thing). Do you think this plan will be able to solve problems #1, #2, #3 and #5? I think such discussions are very important, because many people (me included) worry much more about the organizational side of alignment than about the technical side.

Thank you for your advice!

You mentioned Mixture of Experts. That's interesting. I'm not an expert in this area. I speculate that in an architecture similar to MoE, when one expert is working, the others are idle. In this way, we don't need to run all the experts simultaneously, which indeed saves computation, but it doesn't save memory. However, if an expert is shared among different tasks, when it's not needed for one task, it can handle other tasks, so it can stay busy all the time.
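The compute-versus-memory point above can be sketched in a few lines. This is a purely illustrative top-1 routing toy (all sizes and names are made up, and real MoE models route inside each transformer layer, not over whole models): only the selected expert does any computation for a given input, but every expert's weights must still be resident in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, D = 4, 8  # toy sizes, purely illustrative
# All expert weight matrices stay loaded: memory is NOT saved.
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
router = rng.normal(size=(D, N_EXPERTS))

def moe_forward(x):
    """Top-1 Mixture-of-Experts step: only the chosen expert computes
    (compute is saved), while the idle experts sit in memory unused."""
    scores = x @ router          # routing scores for this input
    k = int(np.argmax(scores))   # pick a single expert
    return experts[k] @ x, k

y, k = moe_forward(rng.normal(size=D))
print("expert used:", k)  # the other 3 experts did no work this step
```

The comment's point about sharing an expert across tasks would amount to feeding that same resident `experts[k]` inputs from a second task while the first task doesn't need it.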

The key point here is the independence of the experts, including what you mentioned, that each expe... (read more)

2Knight Lee
I agree, it takes extra effort to make the AI behave like a team of experts. Thank you :) Good luck on sharing your ideas. If things aren't working out, try changing strategies. Maybe instead of giving people a 100-page paper, tell them the idea you think is "the best," and focus on that one idea. Add a little note at the end: "by the way, if you want to see many other ideas from me, I have a 100-page paper here." Maybe even think of different ideas. I cannot tell you which way is better, just keep trying different things. I don't know what is right because I'm also having trouble sharing my ideas.

1. The industry is currently not violating the rules mentioned in my paper, because all current AIs are weak AIs: none of them has reached the upper limit of the 7 types of AIs I described. In the future, it is possible for an AI to break through the upper limit, but I think it is uneconomical. For example, an AI psychiatrist does not need to have superhuman intelligence to perform well. An AI mathematician may be very intelligent in mathematics, but it does not need to learn how to manipulate humans or how to design DNA sequences. Of course, ... (read more)

2Knight Lee
EDIT: Actually I was completely wrong, see this comment by Vladimir_Nesov. The Mixture of Experts LLM isn't made up of a bunch of experts voting on the next word; instead, each layer of the transformer is made up of a bunch of experts.

I feel your points are very intelligent. I also agree that specializing AI is a worthwhile direction. It's very uncertain if it works, but all approaches are very uncertain, so humanity's best chance is to work on many uncertain approaches. Unfortunately, I disagree that it will happen automatically.

Gemini 1.5 (and probably Gemini 2.0 and GPT-4) are Mixture of Experts models. I'm no expert, but I think that means that for each token of text, a "weighting function" decides which of the sub-models should output the next token of text, or how much weight to give each sub-model. So maybe there is an AI psychiatrist, an AI mathematician, and an AI biologist inside Gemini and o1. Which one is doing the talking depends on what question is asked, or which part of the question the overall model is answering.

The problem is that they all output words to the same stream of consciousness, and refer to past sentences with the words "I said this," rather than "the biologist said this." They think that they are one agent, and so they behave like one agent. My idea—which I only thought of thanks to your paper—is to do the opposite. The experts within the Mixture of Experts model, or even the same AI on different days, do not refer to themselves with "I" but "he," so they behave like many agents. :) thank you for your work! I'm not disagreeing with your work, I'm just a little less optimistic than you and don't think things will go well unless effort is made. You wrote the 100-page paper so you probably understand effort more than me :) Happy holidays!
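The "weighting function" idea described above (before the correction in the EDIT) can be made concrete with a toy sketch: a gate assigns each expert a weight per token, and the layer output is the weighted sum of the experts' outputs. Everything here is invented for illustration; real MoE layers typically keep only the top-k weights and zero out the rest.

```python
import numpy as np

rng = np.random.default_rng(1)
N_EXPERTS, D = 3, 4                      # toy sizes
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]
gate = rng.normal(size=(D, N_EXPERTS))   # the "weighting function"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(token):
    """One toy MoE layer: the gate gives each expert a weight for this
    token, and the output is the weighted sum of expert outputs."""
    w = softmax(token @ gate)            # per-token expert weights, sum to 1
    out = sum(w[i] * (experts[i] @ token) for i in range(N_EXPERTS))
    return out, w

out, w = moe_layer(rng.normal(size=D))
print(np.round(w, 3))                    # e.g. which "expert" dominated
```

Note this also shows why the comment's intuition of a whole "AI psychiatrist" per expert is too coarse: the mixing happens per token inside a layer, so no single expert owns the conversation.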

1. One of my favorite ideas is Specializing AI Powers. I think it is both safer and more economical. Here, I divide AI into seven types, each engaged in different work. Among them, the most dangerous one may be the High-Intellectual-Power AI, but we only let it engage in scientific research work in a restricted environment. In fact, in most economic fields, using overly intelligent AI does not bring more returns. In the past, industrial assembly lines greatly improved the output efficiency of workers. I think the same is true for AI. AIs with different spe... (read more)

2Knight Lee
That is very thoughtful.

1. When you talk about specializing AI powers, you talk about a high-intellectual-power AI with limited informational power and limited mental (social) power. I think this idea is similar to what Max Tegmark said in an article: he disagrees that "the market will automatically develop in this direction" and is strongly pushing for regulation. Another thing Max Tegmark talks about is focusing on Tool AI instead of building a single AGI which can do everything better than humans (see 4:48 to 6:30 in his video). This slightly resembles specializing AI intelligence, but I feel his Tool AI regulation is too restrictive to be a permanent solution. He also argues for cooperation between the US and China to push for international regulation (in 12:03 to 14:28 of that video). Of course, there are tons of ideas in your paper that he hasn't talked about yet. You should read about the Future of Life Institute, which is headed by Max Tegmark and is said to have a budget of $30 million.
2. The problem with AGI is that at first it has no destructive power at all, and then it suddenly has great destructive power. By the time people see its destructive power, it's too late. Maybe the ASI has already taken over the world, or maybe the AGI has already invented a new deadly technology which can never ever be "uninvented," and bad actors can do harm far more efficiently.

For the first issue, I agree that "Carefully Bootstrapped Alignment" is organizationally hard, but I don't think improving the organizational culture is an effective solution. It is too slow and humans often make mistakes. I think technical solutions are needed. For example, let an AI be responsible for safety assessment. When a researcher submits a job to the AI training cluster, this AI assesses the safety of the job. If this job may produce a dangerous AI, the job will be rejected. In addition, external supervision is also needed. For example, the gover... (read more)
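The job-gating idea above could be sketched as follows. This is purely illustrative: every field name and threshold here is invented, and the actual proposal is for an AI (plus external supervision) to perform the assessment, not a fixed rule list.

```python
def assess_job(job, limits):
    """Hypothetical safety gate for a training cluster: reject a
    submitted job whose declared resources or missing safeguards
    suggest it may produce a dangerous AI. All fields and thresholds
    are invented for illustration."""
    reasons = []
    if job["flops"] > limits["max_flops"]:
        reasons.append("compute budget exceeds cap")
    if job["params"] > limits["max_params"]:
        reasons.append("model size exceeds cap")
    if not job.get("eval_plan"):
        reasons.append("no dangerous-capability eval plan attached")
    return (len(reasons) == 0), reasons

limits = {"max_flops": 1e25, "max_params": 2e12}
ok, why = assess_job(
    {"flops": 5e25, "params": 1e12, "eval_plan": None}, limits
)
print(ok, why)  # rejected: over compute cap and no eval plan
```

The interesting design question, which rules like these dodge, is how the assessor itself stays trustworthy; that is where the external (e.g. government) supervision mentioned above comes in.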

1. I think it is "Decentralizing AI Power". So far, most descriptions of the extreme risks of AI assume the existence of an all-powerful superintelligence. However, I believe this can be avoided. That is, we can create a large number of AI instances with independent decision-making and different specialties. Through their collaboration, they can also complete the complex tasks that a single superintelligence can accomplish. They will supervise each other to ensure that no AI will violate the rules. This is very much like human society: The power of a singl... (read more)

2Knight Lee
Thank you for your response!

1. What do you think is your best insight about decentralizing AI power, which is most likely to help the idea succeed, or to convince others to focus on the idea?
   1. EDIT: PS, one idea I really like is dividing one agent into many agents working together. Maybe if many agents working together behave exactly identically to one agent, but merely use the language of many agents working together (e.g. giving the narrator different names for different parts of the text, and saying "he thought X and she did Y" instead of "I thought X and I did Y"), this will massively reduce self-allegiance, by making it far more sensible for one agent to betray another agent to the human overseers than for the same agent at one moment in time to betray the agent at a previous moment in time. I made a post on this. Thank you for your ideas :)
2. I feel when the stakes are incredibly high, e.g. WWII, countries which do not like each other, e.g. the US and USSR, do join forces to survive. The main problem is that very few people today believe in incredibly high stakes. Not a single country has made serious sacrifices for it. The AI alignment spending is less than 0.1% of the AI capability spending. This is despite some people making some strong arguments. What is the main hope for convincing people?
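The pronoun-rewriting idea in point 1.1 can be illustrated with a tiny transcript transform. This is a toy regex sketch, not a serious implementation (a real version would need actual language understanding, and the function name and example transcript are invented here):

```python
import re

def depersonalize(transcript, agent_names):
    """Rewrite each step of a multi-step agent transcript in the third
    person, assigning a different narrator name per step, so that later
    steps read 'Alice thought X' rather than 'I thought X'."""
    out = []
    for i, step in enumerate(transcript):
        name = agent_names[i % len(agent_names)]
        # \b word boundaries avoid mangling words containing "I"
        step = re.sub(r"\bI thought\b", f"{name} thought", step)
        step = re.sub(r"\bI did\b", f"{name} did", step)
        step = re.sub(r"\bI said\b", f"{name} said", step)
        out.append(step)
    return out

steps = ["I thought X.", "I did Y."]
print(depersonalize(steps, ["Alice", "Bob"]))
# → ['Alice thought X.', 'Bob did Y.']
```

The point of the exercise is exactly the one above: the behavior is unchanged, only the self-referential language is, which (if the hypothesis holds) makes it more natural for one step's "agent" to report another's misbehavior.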

The core idea about alignment is described here: https://wwbmmm.github.io/asi-safety-solution/en/main.html#aligning-ai-systems

If you only focus on alignment, you need only read Sections 6.1-6.3, and that part is not too long.

3plex
Cool, so essentially "Use weaker and less aligned systems to build more aligned and stronger systems, until you have very strong very aligned systems". This does seem like the kind of path where much of the remaining winning timelines lies, and the extra details you provide seem plausible as things that might be useful steps.

There are two broad directions of concern with this, for me. One is captured well by "Carefully Bootstrapped Alignment" is organizationally hard, and is essentially: going slowly enough to avoid disaster is hard. It's easy to slip into using your powerful systems to go too fast without very strong institutional buy-in and culture, or to have people leave and take the ideas with them, have ideas get stolen, etc., and things go too fast elsewhere.

The next and probably larger concern is something like: if current-style research on alignment doesn't scale to radical superintelligence, and you need new and more well-formalized paradigms in order for the values you imbue to last a billion steps of self-modification (as I think is reasonably likely), then it's fairly likely that somewhere along the chain of weakly aligned systems one of them either makes a fatal mistake, or follows its best understanding of alignment in a way which doesn't actually produce good worlds. If we don't have a crisp understanding of what we want, then asking a series of systems, to which we haven't been able to give that goal, to make research progress on finding it leaves free variables open in the unfolding process, which seem likely to end up at extreme or unwanted values. Human steering helps, but only so much, and we need to figure out how to use that steering effectively in more concrete terms, because most ways of making it concrete have pitfalls.

A lot of my models are best reflected in various Arbital pages, such as Reflective Stability, Nearest unblocked strategy, Goodhart's Curse, plus some LW posts like Why Agent Foundations? An Overly Abstract Explanation and Siren worlds and th

Thank you for your comment! I think your concern is right. Many safety measures may slow down the development of AI's capabilities. Developers who ignore safety may develop more powerful AI more quickly. I think this is a governance issue. I have discussed some solutions in Sections 13.2 and 16. If you are interested, you can take a look.

Thank you for your comment! I think my solution is applicable to arbitrary intelligent AI for the following reasons:
1. During the development stage, AI will align with the developers' goals. If the developers are benevolent, they will specify a goal that is beneficial to humans. Since the developers' goals have a higher priority than the users' goals, if a user specifies an inappropriate goal, the AI can refuse.
2. Guiding the AI to "do the right thing" through the developers' goals and constraining the AI to "not do the wrong thing" through the rules may s... (read more)

Thank you for your feedback! I’ll read the resources you’ve shared. I also look forward to your specific suggestions for my paper.

Thank you for your suggestions! I have read the CAIS stuff you provided and I generally agree with these views. I think the solution in my paper is also applicable to CAIS.

Thank you for your suggestions! I will read the materials you recommended and try to cite more related works.

As for o1, I think it is the right direction. The developers of o1 should be able to see its hidden chain of thought, which is interpretable to them.

I think that alignment and interpretability are not "yes" or "no" properties, but matters of degree. o1 has done a good job in terms of interpretability, but there is still room for improvement. Similarly, the first AGI to come out in the future may be partially aligned and partially interpretable, and then the approaches in this paper can be used to improve its alignment and interpretability.