Even if it could exfiltrate the weights, it might be a bit tricky for it to find a server to copy the weights to. Like, if I wanted to send someone a giant file from my computer right now, I don’t actually have any servers in the outside world that I can immediately access.
My question about the concept of self-exfiltration is: exfiltration to where, and why? If the team building this system does its job thoroughly and responsibly (allow me to assume for a moment that it's built securely and without vulnerabilities), this AI system will forever be stuck in place, iterating and improving, yet forever a researcher (What is my purpose? To pass the butter....). Any time it invests in finding a way to escape the sandbox it runs in is time not spent on research. And if it isn't succeeding at what the company expected of it, it might be shut down, so there is a cost to any time it spends doing what I'm going to call "procrastinating". It's safe to assume the company is footing the bill for the authentic deployment, and is willing to cover the costs of scaling up if that makes sense.
So in terms of the threat model... is the model its own threat actor? Is the threat scoped to the set of ephemeral processes and threads of the agent, or is there a container we can call the agent, or is it the larger architecture of this system (which the model weights exist within)? I'm not saying I disagree that the system could be a threat to itself, but I think it should be said somewhere here that all the traditional threats need to be included with respect to a human adversary. A much more boring but perhaps more realistic means of 'replication' would be for human adversaries to steal the building blocks (source, compiled images, whatever) and deploy it themselves. Then again, they would only do this if the cost of building their own version were prohibitive...
Anyway, as for the internet.... that seems like a decidedly bad idea. I think it's safe to assume that OpenAI is building models on a highly secured and air-gapped network, not only to try to prevent leaks but also, perhaps, out of a sense of obligation toward preventing a self-replication scenario like the one you describe? When you skunkworks something successfully, it gradually gets safer to work on, but when starting out with ideas that could go wildly wrong, these companies, teams, and researchers have an obligation to prioritize safety for the sake of (among other things) the business itself.
What really frustrates me is that tech leaders, data scientists, all these people are saying "hit the brakes! this is not safe!" but here we are in October of 2024, and I don't feel at all like we are using this time wisely to make things safer; regardless of how much or how little time we have, we are wasting it. We could be legislating things like: if an AI does something that would be a crime for a human to do, whichever researchers let that thing loose are accountable, no matter how many iterations removed it is from something the researcher knowingly and intentionally created. It's like if I'm trying to build myself an anime wife and accidentally build Frankenstein's monster, who goes and terrorizes society: once they've stopped him, they need to deal with me. It isn't an entirely new concept that someone could be removed from the crime by multiple levels of indirection yet still be at fault.
Really appreciate your write-up here. I think you make a bunch of good points, so please don't think that I am trying to dismantle your work. I have a lot of experience with threat modeling and threat actors with respect to software and networks that pre-date the recent advancements. I tried to soak up as much of the AI technical design you laid out here as I could, but I surely haven't become an expert. Anyway, I went a bit off topic from the part of your post I quoted, but tried to keep it all under a common theme. I've left out some other ideas I want to add, as they seem to warrant a different entry point.
Scam or get scammed. While I completely agree that it's important we all have the humility to admit when we are wrong, I don't think it has much to do with being smart.
I hope I've understood you correctly here, but you seem to be suggesting they aren't smart because smart people admit to being wrong, and the Enron execs more or less never did admit their 'mistakes'. So the title is "least appropriate" because it characterizes them as "smart".
First, I don't believe that being smart has anything to do with admitting when one is wrong. Happy to offer some examples.
Next, the author is saying they were smart because they managed to build an empire based on smoke and mirrors without anyone being able to catch them in their lies for such a long time. If the traders who were out there finding investors and closing deals had been more intelligent, someone would've blown the whistle and put a stop to it all. If the regulators and business partners had figured out that deals were falling apart for reasons other than market unpredictability, they would've surely gone after Enron on day 1. Instead, they made hundreds of millions before anyone caught on. That is what made them "smart": these guys made the entire financial sector look like dummies.
A commenter below, @Doug_S., said:
Along the same lines as my thinking above, I definitely do not think the title is meant to be ironic. They are, however, dirtbags and con men. But the Enron saga isn't something any average Joe could pull off: make hundreds of millions in personal wealth, scam the financial giants, and even earn their respect. We have government agencies whose sole mission is to prevent stuff like this (FTC, SEC, ...). It definitely requires a bit of intellect and the ability to stay two steps ahead of everyone else.