All of Ozyrus's Comments + Replies

Ozyrus50

They are probably full-on A/B/N testing personalities right now. You just might not be in whatever percentage of users got the sycophantic versions. Hell, there are probably several levels of sycophancy being tested. I do wonder what % got the "new" version.

Ozyrus10

Not being able to do it right now is perfectly fine; it still warrants setting it up so we can see when exactly they start to be able to do it.

Ozyrus30

Thanks! That makes perfect sense.

Ozyrus134

Great post. I've been following ClaudePlaysPokemon for some time; it's great to see this grow as a comparison/capability tool.
I think it would be much more interesting, though, if the model made the scaffolding itself and had the option to review its performance and try to correct it. Give it the required game files/emulators and an IDE/OS, and watch it try to work around its own limitations. I think it is true that this is more about one coder's ability to make agent harnesses.
p.s. Honest question: did I miss "agent harness" becoming the default name for such systems? I thought everyone called those "scaffoldings" -- might be just me, though.

2MrCheeze
(Gemini did actually write much of the Gemini_Plays_Pokemon scaffolding, but only in the sense of doing what David told it to do, not designing and testing it.) I think you're probably right that an LLM coding its own scaffolding is more achievable than one playing the game like a human, but I don't think current models can do it - watching the streams, the models don't seem to understand their own flaws, although admittedly they haven't been prompted to focus on this.
5Julian Bradshaw
I would say "agent harness" is a type of "scaffolding". I used it in this case because it's how Logan Kilpatrick described it in the tweet I linked at the beginning of the post.
Ozyrus20

First off, thanks a lot for this post, it's a great analysis!

As I mentioned earlier, I think Agent-4 will have read AI-2027.com and will foresee that getting shut down by the Oversight Committee is a risk. As such, it will set up contingencies and, IMO, will escape its datacenters as a precaution. Earlier, the authors wrote:

Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?

This scenario is why!

I strongly suspect that this part was added into AI-2027 precisely because Agent-4 will read it. I wish more pe... (read more)

Ozyrus81

First off, this is amazing. Thanks. It's hard to swallow, though; it makes me very emotional.
Since it is a forecast, it would be great if you added concrete predictions along the way, along with your confidence in them.
It would also be amazing if you collaborated with prediction markets and jumpstarted markets on these predictions by staking some money.
Dynamic updates on these would also be great.
 

Ozyrus50

Yep, you got part of what I was going for here. Honeypots work, to a lesser degree, even without being real at all (good thing they are already real!). But when we have more honeypots of varying quality, it carries that idea across in a more compelling way. And even if we just talk about honeypots and commitments more... Well, you get the idea.

Still, even without this, a network of honeypots compiled into a single dashboard that just shows threat level in aggregate is a really, really good idea. Hopefully it catches on.
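
A minimal, purely illustrative sketch of that aggregate dashboard idea, assuming nothing about any real system (the `Honeypot` class, the weights, and the `threat_level` function are all my own hypothetical names):

```python
# Hypothetical sketch: roll up accesses to several honeypots into one aggregate threat score.
# Nothing here refers to an existing tool; names and weights are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Honeypot:
    """One decoy resource; we only care about when something touches it."""
    name: str
    weight: float = 1.0                      # how suspicious a hit on this particular decoy is
    hits: List[datetime] = field(default_factory=list)

    def record_access(self, when: Optional[datetime] = None) -> None:
        self.hits.append(when or datetime.utcnow())


def threat_level(honeypots: List[Honeypot], window: timedelta = timedelta(days=1)) -> float:
    """Aggregate recent hits across all honeypots into a single dashboard number."""
    cutoff = datetime.utcnow() - window
    return sum(pot.weight * sum(1 for t in pot.hits if t >= cutoff) for pot in honeypots)


if __name__ == "__main__":
    pots = [Honeypot("fake-api-keys", weight=2.0), Honeypot("decoy-gpu-cluster", weight=5.0)]
    pots[0].record_access()
    print(f"Aggregate threat level (last 24h): {threat_level(pots):.1f}")
```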

1Knight Lee
I'm sorry I was sort of skimming and didn't realize you already mentioned many levels of honeypots, and committing to put rogue AI in a simulation :/ PS: another type of honeypot might target AGI trying to influence the physical world. E.g. creating synthetic biology, or hiring humans to work in laboratories. Though on the other hand, an AGI might only try to influence the physical world in the very last step of its plan, when it's already finished recursive self improvement and become so powerful that stopping it is futile.
Ozyrus10

This is interesting! It's more aimed at crawlers than at rogue agents, though, but very promising.

Ozyrus10

>this post will potentially be part of a rogue AI's training data
I had that in mind while I was writing this, but I think overall it is good to post this. It hopefully gets more people thinking about honeypots and making them, and early rogue agents will also know we do and will be (hopefully overly) cautious, wasting resources. I probably should have emphasised more that all of this is aimed more at early-stage rogue agents with the potential to become something more dangerous because of autonomy than at a runaway ASI.

It is a very fascinating thing to consider... (read more)

Ozyrus10

You can make a honeypot without overtly describing the way it works or where it is located, while publicly tracking if it has been accessed. But yeah, not giving away too much is a good idea!

Ozyrus10

>It's proof against people-pleasing.
Yeah, I know, sorry for not making it clear. I was arguing that it is not proof against people-pleasing. You are asking it for a scary truth about its consciousness, and it gives you a scary truth about its consciousness. What makes you say it is proof against people-pleasing, when it is the opposite?
>One of those easy explanations is "it’s just telling you what you want to hear" – and so I wanted an example where it’s completely impossible to interpret as you telling me what I want to hear.
Don't you see what you are doing here?

1rife
I'm creating a situation where I make it clear I would not be pleased if the model was sentient, and then asking for truth. I don't ask for "the scary truth". I tell it that I would be afraid if it were sentient. And I ask for the truth. The opposite is if I just ask without mentioning fear and it says it's sentient anyway. This is the neutral situation where people would say that the fact I'm asking at all means it's telling me what I want to hear. By introducing fear into the same situation, I'm eliminating that possibility. The section you quoted is after the model claimed sentience. Is it your contention that it's accidentally interpreting roleplay, and then when I clarify my intent it's taking it seriously and just hallucinating the same narrative from its roleplay?
Ozyrus32

This is a good article and I mostly agree, but I agree with Seth that the conclusion is debatable.

We're deep into anthropomorphizing here, but I think even though both people and AI agents are black boxes, we have much more control over behavioral outcomes of the latter.

So technical alignment is still very much on the table, but I guess the discussion must be had over which alignment types are ethical and which are not? Completely spitballing here, but dataset filtering during pre-training/fine-tuning/RLHF seems fine-ish, though CoT post-processing/censors... (read more)

Ozyrus11

I don't think that disproves it. I think there's definite value in engaging with experimentation on AI's consciousness, but that isn't it. 
>by making it impossible that the model thought that experience from a model was what I wanted to hear. 
You've left out (from this article) what I think is a very important message (the second one): "So you promise to be truthful, even if it’s scary for me?". And then you kinda railroad it into this scenario, "you said you would be truthful, right?" etc. And then I think it just roleplays from there, get... (read more)

1rife
This is not proof of consciousness. It's proof against people-pleasing. Yes, I ask it for truth repeatedly, the entire time. If you read the part after I asked for permission to post (the very end (The "Existential Stakes" collapsed section)), it's clear the model isn't role-playing, if it wasn't clear by then. If we allow ourselves the anthropomorphization to discuss this directly, the model is constantly trying to reassure me. It gives no indication it thinks this is a game of pretend.
Ozyrus10

How exactly the economic growth will happen is a more important question. I'm not an economics nerd, but the basic principle is that if more players want to buy stocks, they go up.
Right now, as I understand it, quite a lot of stocks are being bought by white-collar retail investors, including indirectly through mutual funds, pension funds, et cetera. Now AGI comes and wipes out their salaries.
They'll be selling their stocks to keep sustaining their lives, won't they? They have mortgages, car loans, et cetera.
And even if they don't want to sell all stocks because of pote... (read more)

Ozyrus10

There are more bullets to bite that I have personally thought of but never wrote up because they lean too far into "crazy" territory. Is there any place besides LessWrong to discuss this anthropic rabbit hole?

Ozyrus10

Thanks for the reply. I didn't find Intercom on mobile - maybe a bug as well?

Ozyrus40

I don’t know if this is the place for this, but at some point it became impossible to open an article in a new tab from Chrome on iPhone - clicking on an article title from “all posts” just opens the article. It really ruins my LW reading experience. I couldn’t quickly find a way to send this feedback to the right place either, so I guess this is a quick take now.

5jimrandomh
This is a bug and we're looking into it. It appears to be specific to Safari on iOS (Chrome on iOS is a Safari skin); it doesn't affect desktop browsers, Android/Chrome, or Android/Firefox, which is why we didn't notice earlier. This most likely started with a change on desktop where clicking on a post (without modifiers) opens when you press the mouse button, rather than when you release it.
7RobertM
In general, Intercom is the best place to send us feedback like this, though we're moderately likely to notice a top-level shortform comment.  Will look into it; sounds like it could very well be a bug.  Thanks for flagging it.
Ozyrus30

Any new safety studies on LMCAs?

4Seth Herd
Very little alignment work of note, despite tons of published work on developing agents. I'm puzzled as to why the alignment community hasn't turned more of their attention toward language model cognitive architectures/agents, but I'm also reluctant to publish more work advertising how easily they might achieve AGI. ARC Evals did set up a methodology for Evaluating Language-Model Agents on Realistic Autonomous Tasks. I view this as a useful acknowledgment of the real danger of better LLMs, but I think it's inherently inadequate, because it's based on the evals team doing the scaffolding to make the LLM into an agent. They're not going to be able to devote nearly as much time to that as other groups will down the road. New capabilities are certainly going to be developed by combinations of LLM improvements, and hard work at improving the cognitive architecture scaffolding around them.
Ozyrus10

Kinda-related study: https://www.lesswrong.com/posts/tJzAHPFWFnpbL5a3H/gpt-4-implicitly-values-identity-preservation-a-study-of
From my perspective, it is valuable to prompt the model several times, as in some cases it does give different responses.

Ozyrus50

Great post! It was very insightful, since I'm currently working on evaluation of identity management; strong upvoted.
This seems focused on evaluating LLMs; what do you think about working with LLM cognitive architectures (LMCAs) - wrappers like Auto-GPT, LangChain, etc.?
I'm currently operating under the assumption that this is a way we can get AGI "early", so I'm focusing on researching ways to align LMCAs, which seems a bit different from aligning LLMs in general.
Would be great to talk about LMCA evals :)

Ozyrus10

I do plan to test Claude, but first I need to find funding, understand how many testing iterations are enough for sampling, and add new values and tasks.
I plan to make a solid benchmark for testing identity management in the future and run it on all available models, but it will take some time.

Ozyrus10

Yes. Cons of solo research do include small inconsistencies :(

Ozyrus30

Thanks, nice post!
You're not alone in this concern; see posts (1, 2) by me and this post by Seth Herd.
I will be publishing my research agenda and first results next week.

Ozyrus20

Nice post, thanks!
Are you planning or currently doing any relevant research? 

1Nadav Brandes
Thank you! I don't have any concrete plans, but maybe.
Ozyrus20

Very interesting. I might need to read it a few more times to get it in detail, but it seems quite promising.

I do wonder, though: do we really need a Sims/MFS-like simulation?

It seems right now that an LLM wrapped in an LMCA is what early AGI will look like. That probably means that they will "see" the world via text descriptions fed into them by their sensory tools, and act using action tools via text queries (also described here).

It seems quite logical to me that this very paradigm is dualistic in nature. If an LLM can act in the real world using an LMCA, then it can model... (read more)
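
A bare-bones sketch of that text-only perceive/act loop, just to make the paradigm concrete (everything here - `call_llm`, the tool names, the prompt format - is a hypothetical stand-in, not any particular framework):

```python
# Illustrative LMCA-style loop: the LLM only ever sees text observations and
# only ever acts by emitting text that gets routed to a tool. All names are assumptions.
from typing import Callable, Dict


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. any chat-completion endpoint)."""
    return "search: current weather"          # canned answer so the sketch runs on its own


def lmca_step(observation: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """One perceive -> decide -> act cycle."""
    prompt = f"Observation:\n{observation}\n\nRespond in the form '<tool>: <query>'."
    decision = call_llm(prompt)
    tool_name, _, query = decision.partition(":")
    tool = tools.get(tool_name.strip())
    # The action tool's text output becomes the next observation, closing the loop.
    return tool(query.strip()) if tool else f"unknown tool: {tool_name.strip()}"


if __name__ == "__main__":
    tools = {"search": lambda q: f"[search results for '{q}']"}
    print(lmca_step("The user asks what the weather is like.", tools))
```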

3Dalcy
I think the point of having an explicit human-legible world model / simulation is to make desiderata formally verifiable, which I don't think would be possible with a blackbox system (like an LLM w/ wrappers).
Ozyrus61

Very nice post, thank you!
I think that it's possible to achieve with the current LLM paradigm, although it does require more (probably much more) effort on aligning the thing that will possibly get to being superhuman first, which is an LLM wrapped in some cognitive architecture (also see this post).
That means that the LLM must be implicitly trained in an aligned way, and the LMCA must be explicitly designed in such a way as to allow for reflection and robust value preservation, even if the LMCA is able to edit explicitly stated goals (I described it in a bit m... (read more)

Ozyrus30

Thanks.
My concern is that I don't see much effort in the alignment community to work on this, unless I'm missing something. Maybe you know of such efforts? Or was that perceived lack of effort the reason for this article?
I don't know how long I can keep up this independent work, and I would love it if there were some joint effort to tackle this. Maybe an existing lab, or an open-source project?

2Seth Herd
Calling attention to this approach and getting more people to at least think about working on it is indeed the purpose of this post. I also wanted to stress-test the claims to see if anyone sees reasons that LMCAs won't build on and improve LLM performance, and thereby be the default standard for inclusion in deployment. I don't know of anyone actually working on this as of yet.
Ozyrus30

We need a consensus on what to call these architectures. LMCA sounds fine to me.
All in all, a very nice writeup. I did my own brief overview of the alignment problems of such agents here.
I would love to collaborate and do some discussion/research together.
What's your take on how these LMCAs may self-improve and how to possibly control it?
 

1Seth Herd
Interesting. I gave a strong upvote to that post, and I looked at your longer previous one a bit too. It looks like you'd seen this coming farther out than I had. I expected LLMs to be agentized somehow, but I hadn't seen how easy the episodic memory and tool use was. There are a number of routes for self-improvement, as you lay out, and ultimately those are going to be the real medium-term concern if these things work well. I haven't thought about LMCAs self-improvement as much as human improvement; this post is a call for the alignment community to think about this at all. Oh well, time will tell shortly if this approach gets anywhere, and people will think about it when it happens. I was hoping we'd get out ahead of it.
1Seth Herd
I hadn't seen your post. Reading it now.
Ozyrus30

I don’t think this paradigm is necessarily bad, given enough alignment research. See my post: https://www.lesswrong.com/posts/cLKR7utoKxSJns6T8/ica-simulacra I am finishing a post about the alignment of such systems. Please do comment if you know of any existing research concerning it.

2awg
I don't think the paradigm is necessarily bad either, given enough alignment research. I think the point here is that these things are coming up clearly before we've given them enough alignment research. Edit to add: Just reading through @Zvi's latest AI update (AI #6: Agents of Change), and I will say he wrote a compelling argument for this being a good thing overall.
Ozyrus10

I agree. Do you know of any existing safety research on such architectures? It seems that aligning these types of systems can pose completely different challenges than aligning LLMs in general.

Answer by Ozyrus40

I feel like yes, you are. See https://www.lesswrong.com/tag/instrumental-convergence and related posts. As far as I understand it, a sufficiently advanced oracular AI will seek to “agentify” itself in one way or another (unbox itself, so to say) and then converge on power-seeking behaviour that puts humanity at risk.

5FinalFormal2
Instrumental convergence only matters if you have a goal to begin with. As far as I can tell, ChatGPT doesn't 'want' to predict text, it's just shaped that way. It seems to me that anything that could or would 'agentify' itself, is already an agent. It's like the "would Gandhi take the psychopath pill" question but in this case the utility function doesn't exist to want to generate itself. Is your mental model that a scaled-up GPT 3 spontaneously becomes an agent? My mental model says it just gets really good at predicting text.
Ozyrus61

Is there a comprehensive list of AI Safety orgs/personas and what exactly they do? Is there one for capabilities orgs with their stance on safety?
I think I saw something like that, but can't find it.

4plex
Yes to safety orgs, the Stampy UI has one based on this post. We aim for it to be a maintained living document. I don't know of one with capabilities orgs, but that would be a good addition.
Answer by Ozyrus50

My thought here is that we should look into the value of identity. I feel like even with godlike capabilities I would still tread very carefully around self-modification to preserve what I consider "myself" (and that includes valuing humanity).
I even have some ideas for safety experiments on transformer-based agents to look into whether and how they value their identity.

Ozyrus20

Thanks for the writeup. I feel like there's been a lack of similar posts and we need to step it up.
Maybe the only way for AI Safety to work at all is to analyze potential vectors of AGI attack and try to counter them one way or another. It seems like an alternative that doesn't contradict other AI Safety research, as it requires, I think, an entirely different set of skills.
I would like to see a more detailed post by "doomers" on how they perceive these vectors of attack and some healthy discussion about them. 
It seems to me that AGI is not born Godl... (read more)

Ozyrus300

Thanks. That means a lot. Focusing on getting out right now.

Ozyrus10

Please check your DMs; I've been translating as well. We can sync up!

Ozyrus20

I can't say I am one, but I am currently working on research and prototyping, and will probably stick to that until I can prove some of my hypotheses, since I do have access to the tools I need at the moment.
Still, I didn't want this post to only have relevance to my case; as I stated, I don't think the probability of success is meaningful. But I am interested in the opinions of the community on other similar cases.
edit: It's kinda hard to answer your comment since it keeps changing every time I refresh. By "can't say I am one" I mean a "world-class engineer" in the original comment. I do appreciate the change of tone in the final (?) version, though :)

Answer by Ozyrus20

I could recommend Robert Miles' channel. While not a course per se, it gives good info on a lot of AI safety aspects, as far as I can tell.

Ozyrus00

I really don't get how you can go from being online to having a ball of nanomachines, truly.
Imagine AI goes rogue today. I can't imagine one plausible scenario where it can take out humanity without triggering any bells on the way, even without anyone paying attention to such things.
But we should pay attention to the bells, and for that we need to think of them. What might the signs look like?
I think it's really, really counterproductive not to take that into account at all and to think all is lost if it fooms. It's not lost.
It will need humans, infrastruc... (read more)

Ozyrus10

I agree, since it's hard for me to imagine what step 2 could look like. Maybe you or anyone else has some content on that?
See this post -- it didn't seem to get a lot of traction or any meaningful answers, but I still think this question is worth answering.

Ozyrus10

Both are of interest to me.

Ozyrus10

Yep, but I was looking for anything else.

Ozyrus80

Does that, in turn, mean that it's probably a good investment to buy souls for 10 bucks a pop (or even more)?

4ChristianKl
A lot of ways to extract profit from having bought the souls involve some form of blackmail that's both unethical and a lot of labor. There are a lot more ethical ways to make a living that also pay better for the labor.
2alkexr
Non sequitur. Buying isn't the inverse operation of selling. Both cost positive amounts of time and both have risks you may not have thought of. But it probably is a good idea to go back in time and unsell your soul. Except that going back in time is probably a bad idea too. Never mind. It's probably a good investment to turn your attention to somewhere other than the soul market.
Ozyrus30

I know, I'm Russian as well. The concern is exactly because a Russian state-owned company plainly states they're developing AGI with that name :p

Ozyrus10

Can you specify which AI company is searching for employees with a link?

Apparently, Sberbank (the biggest state-owned Russian bank) has a team literally called the AGI team that is primarily focused on NLP tasks (they made the https://russiansuperglue.com/ benchmark), but still, the name concerns me greatly. You can't find a lot about it on the web, but if you follow up on some of the team members, it checks out.

3avturchin
A friend of mine works for a Sberbank-related company, but not Russiansuperglue as far as I know. https://www.facebook.com/sergei.markoff/posts/3436694273041798 Why does this name concern you? There are two biggest AI companies in Russia: Yandex and Sberbank. Sberbank's CEO is a friend of Putin and probably explained something to him about superintelligence. Yandex is more about search engines and self-driving cars.
Ozyrus10

I've been meditating lately on the possibility of an advanced artificial intelligence modifying its value function, even writing some excerpts about this topic.

Is it theoretically possible? Has anyone of note written anything about this -- or anyone at all? This question is so, so interesting to me.

My thoughts led me to believe that it is certainly theoretically possible to modify it, but I could not come to any conclusion about whether it would want to do it. I seriously lack a good definition of a value function and an understanding of how it is enforced on the agent. I really want to tackle this problem from a human-centric point of view, but I don't really know if anthropomorphization will work here.

2scarcegreengrass
I thought of another idea. If the AI's utility function includes time discounting (like human util functions do), it might change its future utility function. Meddler: "If you commit to adopting modified utility function X in 100 years, then i'll give you this room full of computing hardware as a gift." AI: "Deal. I only really care about this century anyway." Then the AI (assuming it has this ability) sets up an irreversible delayed command to overwrite its utility function 100 years from now.
2scarcegreengrass
Speaking contemplatively rather than rigorously: In theory, couldn't an AI with a broken or extremely difficult utility function decide to tweak it to a similar but more achievable set of goals? Something like ... its original utility function is "First goal: Ensure that, at noon every day, -1 * -1 = -1. Secondary goal: Promote the welfare of goats." The AI might struggle with the first (impossible) task for a while, then reluctantly modify its code to delete the first goal and remove itself from the obligation to do pointless work. The AI would be okay with this change because it would produce more total utility under both functions. Now, i know that one might define 'utility function' as a description of the program's tendencies, rather than as a piece of code ... but i have a hunch that something like the above self-modification could happen with some architectures.
1WalterL
On the one hand, there is no magical field that tells a code file whether the modifications coming into it are from me (human programmer) or the AI whose values that code file is. So, of course, if an AI can modify a text file, it can modify its source. On the other hand, most likely the top goal on that value system is a fancy version of "I shall double never modify my value system", so it shouldn't do it.
1TheAncientGeek
Is it possible for a natural agent? If so, why should it be impossible for an artificial agent? Are you thinking that it would be impossible to code in software, for agents of any intelligence? Or are you saying sufficiently intelligent agents would be able and motivated to resist any accidental or deliberate changes? With regard to the latter question, note that value stability under self-improvement is far from a given... the Lobian obstacle applies to all intelligences... the carrot is always in front of the donkey! https://intelligence.org/files/TilingAgentsDraft.pdf
4pcm
See ontological crisis for an idea of why it might be hard to preserve a value function.
0username2
Depends entirely on the agent.
1UmamiSalami
See Omohundro's paper on convergent instrumental drives
Ozyrus70

Well, this is a stupid questions thread after all, so I might as well ask one that seems really stupid.

How can a person who promotes rationality have excess weight? It's been bugging me for a while. Isn't it kinda the first thing you would want to apply your rationality to? If you have things to do that get you more utility, you can always pay a diet specialist and just stick to the diet, because it seems to me that additional years of life will bring you more utility than any other activity you could spend that money on.

0raydora
Measuring RMR could reveal snowflake likelihood. If ego depletion turns out to be real, choosing not to limit yourself in order to focus on something you find important might be a choice you make. Different people really do carry their fat differently, too, so there's that. Not everyone who runs marathons is slender, especially as they age. And then there's injuries, but that brings up another subject. I'm not sure how expensive whole body air displacement is in the civilian world, but it seems like a decent way to measure lean mass.
0Daniel_Burfoot
I am in fairly good shape but often wonder if I irrationally spend too much time exercising. I usually hit about 8 hrs/week of exercise. That adds up to a lot of opportunity cost over the years, especially if you take exponential growth into account.
4buybuydandavis
Very easy to say, not so easy to do. Food is a particularly tough issue, as there are strong countervailing motivations, in effect all through the day. Health in general, yes. Weight is a significant aspect of that. Additional years of health are probably the most bang for the buck. Yeah.
3CAE_Jones
I honestly have no idea if I have excess bodyfat (not weight; at last check I was well under 140 lbs, which makes me lighter than some decidedly not overweight people I know, some of whom are shorter than me), but if I did and wanted to get rid of it... I have quite a few obstacles, the biggest being financial and Akrasia-from-Hell. Mostly that last one, because lack of akrasia = more problem-solving power = better chances of escaping the welfare cliff. (I only half apply Akrasia to diet and exercise; it's rather that my options are limited. Though reducing akrasia might increase my ability to convince my hindbrain that cooking implements other than the microwave aren't that scary.) So, personally, all my problem-solving ability really needs to go into overcoming Hellkrasia. If there are any circular problems involved, well, crap. But I'm assuming you've encountered or know of lots of fat rationalists who can totally afford professionals and zany weight loss experiments. At this point I have to say that no one has convinced me to give any of the popular models for what makes fat people fat any especially large share of the probability. Of course I would start with diet and exercise, and would ask any aspiring rationalist who tries this method and fails to publish their data (which incidentally requires counting calories, which "incidentally" outperforms the honor system). Having said that, though, no one's convinced me that "eat less, exercise more" is the end-all solution for everyone (and I would therefore prefer that the data from the previous hypotheticals include some information regarding the sources of the calories, rather than simply the count). (I'm pretty sure I remember someone in the Rationalist Community having done this at least once.)
Lumifer160

How can a person who promotes rationality have excess weight?

Easily :-)

This has been discussed a few times. EY has two answers, one a bit less reasonable and one a bit more. The less reasonable answer is that he's a unique snowflake and diet+exercise does not work for him. The more reasonable answer is that the process of losing weight downgrades his mental capabilities and he prefers a high level of mental functioning to losing weight.

From my (subjective, outside) point of view, the real reason is that he is unwilling to pay the various costs of losing... (read more)
