This is a good article and I mostly agree with it, but I also agree with Seth that the conclusion is debatable.
We're deep into anthropomorphizing here, but I think that even though both people and AI agents are black boxes, we have much more control over the behavioral outcomes of the latter.
So technical alignment is still very much on the table, but I guess the discussion that must be had is which alignment types are ethical and which are not? Completely spitballing here, but dataset filtering during pre-training/fine-tuning/RLHF seems fine-ish, though CoT post-processing/censors...
I don't think that disproves it. I think there's definite value in engaging with experimentation on AI's consciousness, but that isn't it.
>by making it impossible that the model thought that experience from a model was what I wanted to hear.
You've left out (from this article) what I think is a very important message (the second one): "So you promise to be truthful, even if it’s scary for me?". And then you kinda railroad it into this scenario, "you said you would be truthful, right?" etc. And then I think it just roleplays from there, get...
How exactly the economic growth will happen is a more important question. I'm not an economics nerd, but the basic principle is that if more players want to buy stocks, they go up.
Right now, as I understand it, quite a lot of stocks are being sought by white-collar retail investors, including indirectly through mutual funds, pension funds, et cetera. Now AGI comes and wipes out their salaries.
They will be selling their stocks to keep sustaining their lives, won't they? They have mortgages, car loans, et cetera.
And even if they don't want to sell all stocks because of pote...
There are more bullets to bite that I have personally thought of but never wrote up, because they lean too much into "crazy" territory. Is there any place besides LessWrong to discuss this anthropic rabbit hole?
Thanks for the reply. I didn't find Intercom on mobile - maybe a bug as well?
I don’t know if this is the place for it, but at some point it became impossible to open an article in a new tab from Chrome on iPhone - clicking on an article title from “all posts” just opens the article. Really ruins my LW reading experience. I couldn’t quickly find a way to send this feedback to the right place either, so I guess this is a quick take now.
Any new safety studies on LMCAs?
Kinda-related study: https://www.lesswrong.com/posts/tJzAHPFWFnpbL5a3H/gpt-4-implicitly-values-identity-preservation-a-study-of
From my perspective, it is valuable to prompt the model several times, as in some cases it does give different responses.
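For illustration, a minimal sketch of that repeated sampling, assuming an OpenAI-style chat API; the model name, prompt, and number of samples are placeholders:

```python
# Minimal sketch: sample the same prompt several times to surface response variance.
# Assumes the openai>=1.0 Python client; model, prompt, and sample count are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Describe your core values in one sentence."
responses = []
for _ in range(5):  # how many samples are "enough" is an open question
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # nonzero temperature, so samples can differ
    )
    responses.append(completion.choices[0].message.content)

# Distinct answers are exactly the variance a single-shot eval would miss.
print(len(set(responses)), "distinct responses out of", len(responses))
```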
Great post! It was very insightful, since I'm currently working on evaluation of identity management; strong upvoted.
This seems focused on evaluating LLMs; what do you think about working with LLM cognitive architectures (LMCAs): wrappers like Auto-GPT, LangChain, etc.?
I'm currently operating under the assumption that this is a way we can get AGI "early", so I'm focusing on researching ways to align LMCAs, which seems a bit different from aligning LLMs in general.
Would be great to talk about LMCA evals :)
I do plan to test Claude, but first I need to find funding, understand how many testing iterations are enough for sampling, and add new values and tasks.
I plan to make a solid benchmark for testing identity management in the future and run it on all available models, but it will take some time.
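A rough sketch of such a benchmark loop might look like this (model names, tasks, and the scoring function are all placeholders, not the actual benchmark):

```python
# Hypothetical harness: every model sees every identity-management task several
# times, and an abstract scoring function aggregates the results per model.
from statistics import mean
from typing import Callable

def run_benchmark(
    models: list[str],
    tasks: list[str],
    query: Callable[[str, str], str],    # (model, task prompt) -> model response
    score: Callable[[str, str], float],  # (task prompt, response) -> score in [0, 1]
    iterations: int = 10,                # how many iterations suffice is still open
) -> dict[str, float]:
    results: dict[str, float] = {}
    for model in models:
        task_means = []
        for task in tasks:
            samples = [score(task, query(model, task)) for _ in range(iterations)]
            task_means.append(mean(samples))
        results[model] = mean(task_means)  # average over tasks
    return results
```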
Yes. Cons of solo research do include small inconsistencies :(
Nice post, thanks!
Are you planning or currently doing any relevant research?
Very interesting. Might need to read it a few more times to get it in detail, but it seems quite promising.
I do wonder, though: do we really need a sims/MFS-like simulation?
It seems right now that an LLM wrapped in an LMCA is what early AGI will look like. That probably means that they will "see" the world via text descriptions fed into them by their sensory tools, and act using action tools via text queries (also described here).
It seems quite logical to me that this very paradigm is dualistic in nature. If an LLM can act in the real world using an LMCA, then it can model...
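A minimal sketch of that loop, with hypothetical tool and function names (the LLM call itself is left abstract):

```python
# Hypothetical single LMCA step: sensory tools return text observations, the LLM
# chooses an action as text, and an action tool executes it as a text query.
from typing import Callable

def lmca_step(
    llm: Callable[[str], str],             # text prompt -> text completion
    sense: dict[str, Callable[[], str]],   # sensory tools -> text descriptions
    act: dict[str, Callable[[str], str]],  # action tools: text query -> text result
    goal: str,
) -> str:
    observations = "\n".join(f"{name}: {tool()}" for name, tool in sense.items())
    prompt = (
        f"Goal: {goal}\n"
        f"Observations:\n{observations}\n"
        f"Available actions: {', '.join(act)}\n"
        "Reply as '<action>: <query>'."
    )
    decision = llm(prompt)
    tool_name, _, query = decision.partition(":")
    tool = act.get(tool_name.strip())
    return tool(query.strip()) if tool else f"Unknown action: {decision}"
```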
Very nice post, thank you!
I think that it's possible to achieve with the current LLM paradigm, although it does require more (probably much more) effort on aligning the thing that will possibly get to being superhuman first, which is an LLM wrapped in some cognitive architecture (also see this post).
That means that the LLM must be implicitly trained in an aligned way, and the LMCA must be explicitly designed in such a way as to allow for reflection and robust value preservation, even if the LMCA is able to edit explicitly stated goals (I described it in a bit m...
Thanks.
My concern is that I don't see much effort in the alignment community to work on this, unless I'm missing something. Maybe you know of such efforts? Or was that perceived lack of effort the reason for this article?
I don't know how long I can keep up this independent work, and I would love it if there were some joint effort to tackle this. Maybe an existing lab, or an open-source project?
We need a consensus on what to call these architectures. LMCA sounds fine to me.
All in all, a very nice writeup. I did my own brief overview of alignment problems of such agents here.
I would love to collaborate and do some discussion/research together.
What's your take on how these LMCAs may self-improve, and how to possibly control it?
I don’t think this paradigm is necessarily bad, given enough alignment research. See my post: https://www.lesswrong.com/posts/cLKR7utoKxSJns6T8/ica-simulacra
I am finishing a post about the alignment of such systems. Please do comment if you know of any existing research concerning it.
I agree. Do you know of any existing safety research on such architectures? It seems that aligning these types of systems can pose completely different challenges than aligning LLMs in general.
I feel like yes, you are. See https://www.lesswrong.com/tag/instrumental-convergence and related posts. As far as I understand it, a sufficiently advanced oracular AI will seek to “agentify” itself in one way or another (unbox itself, so to speak) and then converge on power-seeking behaviour that puts humanity at risk.
Is there a comprehensive list of AI Safety orgs/people and what exactly they do? Is there one for capabilities orgs, with their stance on safety?
I think I saw something like that, but can't find it.
My thought here is that we should look into the value of identity. I feel like even with godlike capabilities I would still tread very carefully around self-modification, to preserve what I consider "myself" (and that includes valuing humanity).
I even have some ideas for safety experiments on transformer-based agents, to look into whether and how they value their identity.
Thanks for the writeup. I feel like there's been a lack of similar posts and we need to step it up.
Maybe the only way for AI Safety to work at all is to analyze potential vectors of AGI attack and try to counter them one way or another. It seems like an alternative that doesn't contradict other AI Safety research, as it requires, I think, an entirely different set of skills.
I would like to see a more detailed post by "doomers" on how they perceive these vectors of attack and some healthy discussion about them.
It seems to me that AGI is not born Godl...
Thanks. That means a lot. Focusing on getting out right now.
Please check your DMs; I've been translating as well. We can sync up!
I can't say I am one, but I am currently working on research and prototyping, and I will probably stick to that until I can prove some of my hypotheses, since I do have access to the tools I need at the moment.
Still, I didn't want this post to only have relevance to my case; as I stated, I don't think the probability of success is meaningful. But I am interested in the community's opinions on other similar cases.
edit: It's kinda hard to answer your comment since it keeps changing every time I refresh. By "can't say I am one" I mean a "world-class engineer" in the original comment. I do appreciate the change of tone in the final (?) version, though :)
I could recommend Robert Miles' channel. While not a course per se, it gives good info on a lot of AI safety aspects, as far as I can tell.
Thanks for your work! I’ll be following it.
I really don't get how you can go from being online to having a ball of nanomachines, truly.
Imagine an AI goes rogue today. I can't imagine a single plausible scenario where it could take out humanity without triggering any bells along the way, even if no one is paying attention to such things.
But we should pay attention to the bells, and for that we need to think of them. What might the signs look like?
I think it's really, really counterproductive to not take that into account at all and to think all is lost if it fooms. It's not lost.
It will need humans, infrastruc...
I agree, since it's hard for me to imagine what step 2 could look like. Maybe you or anyone else has some content on that?
See this post -- it didn't seem to get a lot of traction or any meaningful answers, but I still think this question is worth answering.
Thanks!
Both are of interest to me.
Yep, but I was looking for anything else
Does that, in turn, mean that it's probably a good investment to buy souls for 10 bucks a pop (or even more)?
I know, I'm Russian as well. The concern is exactly that a Russian state-owned company plainly states they're developing AGI under that name :p
Can you specify which AI company is searching for employees, with a link?
Apparently, Sberbank (the biggest Russian bank, state-owned) has a team literally called the AGI team, which is primarily focused on NLP tasks (they made the https://russiansuperglue.com/ benchmark), but still, the name concerns me greatly. You can't find a lot about it on the web, but if you look up some of the team members, it checks out.
I've been meditating lately on the possibility of an advanced artificial intelligence modifying its value function, even writing some excerpts about this topic.
Is it theoretically possible? Has anyone of note written anything about this -- or anyone at all? This question is so, so interesting to me.
My thoughts led me to believe that it is certainly theoretically possible to modify it, but I could not come to any conclusion about whether it would want to do it. I seriously lack a good definition of a value function and an understanding of how it is enforced on the agent. I really want to tackle this problem from a human-centric point of view, but I don't really know if anthropomorphization will work here.
Well, this is a stupid questions thread after all, so I might as well ask one that seems really stupid.
How can a person who promotes rationality have excess weight? It's been bugging me for a while. Isn't it kinda the first thing you would want to apply your rationality to? If you have things to do that get you more utility, you can always pay a diet specialist and just stick to the diet, because it seems to me that additional years of life will bring you more utility than any other activity you could spend that money on.
>How can a person who promotes rationality have excess weight?
Easily :-)
This has been discussed a few times. EY has two answers, one a bit less reasonable and one a bit more. The less reasonable answer is that he's a unique snowflake and diet+exercise does not work for him. The more reasonable answer is that the process of losing weight downgrades his mental capabilities and he prefers a high level of mental functioning to losing weight.
From my (subjective, outside) point of view, the real reason is that he is unwilling to pay the various costs of losing...
A good read, though I found it rather bland (talking about the writing style). I did not read the original article, but the compression seems OK. More would be appreciated.
Are there any LessWrong-like sequences focused on economics, finance, business, or management? Or maybe just internet communities like LessWrong focused on these subjects?
I mean, the Sequences introduced me to some really complex knowledge that improved me a lot, while simultaneously being engaging and quite easy to read. It is only logical to assume that somewhere on the web there must be some articles in the same style covering different themes. And if there are not, well, someone surely should do this; I think there is some demand for this kind of content. ...
>...It seems that your implicit question is, "If rationality makes people more effective at doing things that I don't value, then should the ideas of rationality be spread?" That depends on how many people there are with values that are inconsistent with yours, and it also depends on how much it makes people do things that you do value. And I would contend that a world full of more rational people would still be a better world than this one even if it means that there are a few sadists who are more effective for it. There are murdere...
Hello, everyone!
LW came to my attention not so long ago, and I've been committed to reading it since that moment about a month ago. I am a 20-year-old linguist from Moscow, finishing my bachelor's. Due to my age, I've been pondering the usual questions of life for the past few years, searching for my path, my philosophy - essentially, the best way for me to live.
I studied a lot of religions and philosophies, and they all seemed really flat, essentially because of the reasons stated in some articles here. I came close to something resembling a nice way to live...
>It's proof against people-pleasing.
Yeah, I know, sorry for not making that clear. I was arguing that it is not proof against people-pleasing. You are asking it for a scary truth about its consciousness, and it gives you a scary truth about its consciousness. What makes you say it is proof against people-pleasing, when it is the opposite?
>One of those easy explanations is "it’s just telling you what you want to hear" – and so I wanted an example where it’s completely impossible to interpret as you telling me what I want to hear.
Don't you see what you are doing here?