Seems worth mentioning that the open-source alternative has people claiming it already works significantly better (in part because it doesn't obey rules to avoid certain sites), especially when powered by the latest Gemini.
I haven't tried it myself, so take it with a helping of salt.
No one is talking about OpenAI’s Operator. We’re, shall we say, a bit distracted.
It’s still a rather meaningful thing that happened last week. I too have been too busy to put it through its paces, but this is the worst it will ever be, and the least available and most expensive it will ever be. The year of the agent is indeed likely coming.
So, what do we have here?
Hello, Operator
OpenAI has introduced the beta for its new agent, called Operator, which is now live for Pro users and will in the future be available to Plus users, ‘with more agents to launch in the coming weeks and months.’
Here is a 22 minute video demo. Here is the system card.
You start off by optionally specifying a particular app (in the first demo, OpenTable) and then give it a request (here, booking a table for 2 at 7:00 at Beretta). If you don’t specify an app, it will do a search to find what tool to use.
It is only sort of an ‘app,’ in that the ‘app’ specifies information the agent uses to more easily navigate a web browser. They speak of this as ‘removing one more bottleneck on our path to AGI,’ which indicates they are likely thinking about ‘AGI’ as a functional or practical thing.
To actually do things it uses a web browser via a keyboard and mouse the same way a human would. If there is an issue (here: No table at 7:00, only 7:45 or 6:15) it will ask you what to do, and it will ask for verification before a ‘critical’ action that can’t be reversed, like completing the booking.
From the demo and other reports, the agent is conservative, in that it will often ask for verification or clarification, including doing so multiple times. The system card reports a baseline 13% error rate on standard tasks, and a 5% ‘serious’ error rate on things like sending an email to the wrong person, but confirmations reduce those rates by 90%. With the confirmations you save less time, but in the places that matter you should be able to avoid mistakes at least as well as you would have on your own.
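As a quick sanity check on what that 90% reduction buys you (this is just my arithmetic applied to the quoted rates, nothing more from the system card):

```python
# Rough arithmetic on the rates quoted above; illustrative, not from the card.
baseline_error = 0.13   # standard-task error rate
serious_error = 0.05    # 'serious' error rate (e.g. wrong email recipient)
reduction = 0.90        # reported reduction from asking for confirmations

print(f"standard errors with confirmations: {baseline_error * (1 - reduction):.1%}")
print(f"serious errors with confirmations:  {serious_error * (1 - reduction):.1%}")
# standard errors with confirmations: 1.3%
# serious errors with confirmations:  0.5%
```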
You can also ‘take control’ at any time, including as a way to check the AI’s work or make adjustments that are easier or quicker to do yourself than to specify. That’s also how the user enters any necessary credentials or payment details – it specifically won’t use Chrome’s autocomplete while it is the one in control.
Multiple tasks can be run simultaneously and can run in the background. That is important, because the agent operates slower (in clock time) than a human would, at least if the human knows the website.
Risky Operation
However, for some tasks they consider ‘high risk,’ they don’t allow this. The user has to be present with the relevant tab active and in focus, or the agent will pause. This includes email tasks, so it’s a lot less useful for those. I wonder how tempted people will be in the future to hack around this by keeping multiple computers active.
They point out there are three distinct failure modes: the user can try to do something harmful, the model can make mistakes, or a website might attempt a prompt injection (or, I would add, cause other issues in various ways, both intentionally and accidentally).
Thus the generally conservative attitude, keeping the human in the loop more than you would want for the modal task. Similarly, the model will intentionally (for now) over-refuse user-requested tasks, to avoid the opposite error. For prompt injections, they report catching most attempts, but the defense is definitely not yet robust; if you’re not confident in the websites you are visiting, you need to be on your toes.
One prediction is that they will develop a website whitelist in some form, so that (to use their examples) if you are dealing with OpenTable or Instacart or StubHub you know you can trust the interaction in various ways.
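Purely as my own speculation about what that could look like (OpenAI has described no such mechanism; the site names are just their examples), a per-site trust policy might reduce how often the agent has to interrupt you:

```python
# Speculative sketch of a per-site trust policy; not a real Operator feature.
# Trusted sites only trigger confirmations on their flagged critical actions.
TRUSTED_SITES = {
    "opentable.com": {"confirm_before": ["complete_booking"]},
    "instacart.com": {"confirm_before": ["checkout"]},
    "stubhub.com":   {"confirm_before": ["purchase"]},
}

def needs_confirmation(domain: str, action: str) -> bool:
    """Unknown sites get a confirmation on every action; whitelisted
    sites only on the actions flagged as critical for that site."""
    policy = TRUSTED_SITES.get(domain)
    if policy is None:
        return True  # not whitelisted: always ask the human
    return action in policy["confirm_before"]

print(needs_confirmation("instacart.com", "add_to_cart"))  # False
print(needs_confirmation("sketchy.example", "add_to_cart"))  # True
```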
They scored Operator on two benchmarks, OSWorld and WebArena. It beats the previous state of the art by a lot for computer use, and slightly for browser use.
Customization is key to practical use. You can give Operator custom instructions specific to each individual website. You can also save prompts for later use.
Basic Training
How did they do it? Straight up reinforcement learning, baby.
By default it looks like your data will be used for training. You can opt out.
One issue right now is that the model is bad at optical character recognition (OCR), and this was a problem for many tasks in the risk assessment tests. That is something that will doubtless be fixed in time. The preparedness tests had it doing well in places where GPT-4o does poorly, but also doing worse than GPT-4o in some areas.
It’s worth noting that it would be easy to combine multiple models for distinct subtasks, a kind of mixture-of-experts (MoE) strategy at the system level. So if different models have different abilities, consider to what extent you want to combine the top performer at each subtask – for models that are given web access, I’d basically assume they can do anything GPT-4o can do… by asking GPT-4o.
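As a minimal sketch of that routing idea (everything here is hypothetical – the model names, the scores, and the categories are illustrative, not any real Operator mechanism):

```python
# Hypothetical sketch: route each subtask to whichever model scores best
# on that category. Scores are made up for illustration.
SCORES = {
    "ocr":        {"gpt-4o": 0.85, "operator": 0.40},
    "navigation": {"gpt-4o": 0.35, "operator": 0.80},
}

def pick_model(subtask_category: str) -> str:
    """Return the name of the best-scoring model for this subtask."""
    candidates = SCORES[subtask_category]
    return max(candidates, key=candidates.get)

print(pick_model("ocr"))         # gpt-4o
print(pick_model("navigation"))  # operator
```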
In its current form I agree that Operator poses only acceptable risks, and I believe there is a large margin for error before that changes.
Please Stay on the Line
Will we actually use it? Is it good enough?
Tyler Cowen predicts yes, for future versions, by the end of the year.
His top comment is the bear case.
Until they cross the necessary thresholds, tools like Operator are essentially useless except as fun toys. They pass through stages.
Early reports suggest it is currently mostly at Stage 2, on the edge of Stage 3.
This seems like exactly the minimum viable product for early adopters, where you experiment to see where it works and where it doesn’t, partly because you find that fun and also educational.
I expect Tyler Cowen is right, and we will be at least at Stage 4 by year’s end. It would be unsurprising if those with situational awareness were solidly into Stage 5.
As we always say, this is the worst the tool will ever be, and you are the worst you will ever be at knowing how to use it.
However, we should be careful with the definition of ‘many of us,’ for both ‘many’ and ‘us.’ The future is likely to remain unevenly distributed. Most people will lack situational awareness. So I’d say something like: a large portion of those who are currently regular users of LLMs will often be using AI agents for such tasks.
Would you trust this to buy your groceries?
Well, would you trust your husband to buy the groceries? There’s an error rate. Would you trust your children? Would you trust the person who shops for Instacart?
I would absolutely ‘trust but verify’ the ability of the AI to buy groceries. You have a shopping list, you give it to Operator, which goes to Instacart or Fresh Direct or wherever. Then when it is time to check out, you look at the basket, and verify that it contains the correct items.
It’s pretty hard for anything too terrible to happen, and you should spot the mistakes.
Then, if the AI gets it right 5 times in a row, maybe the 6th time you don’t check as carefully; you only quickly eyeball the total. Then by the 11th time, or the 20th, you’re not looking at all.
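To put rough numbers on why that matters (my arithmetic, treating the 1.3% post-confirmation rate from the figures above as an illustrative per-run error rate, with runs assumed independent):

```python
# Illustrative: chance of at least one error across n unchecked runs,
# assuming independent runs at a 1.3% per-run error rate.
per_run_error = 0.013

for n in (5, 10, 20):
    p_any = 1 - (1 - per_run_error) ** n
    print(f"{n:>2} runs: {p_any:.1%} chance of at least one error")
# 5 runs: 6.3%, 10 runs: 12.3%, 20 runs: 23.0%
```

Even a small per-run error rate compounds, which is the argument for keeping the eyeball check cheap rather than dropping it entirely.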
For booking a flight, there’s already a clear trade-off between time spent, money saved, and finding the best flight. Can the AI advance that frontier? Seems likely. You can run a very basic search yourself as an error check, or watch the AI do one, so you know you’re not making a massive error. The AI can potentially search flights (or hotels, or whatnot) from far more sources than you can.
Will it sometimes make mistakes? Sure, but so will you. And you’re not going to say ‘book me a flight to Dallas’ and then get to the airport and be told you’re flying through London – you’re going to sanity check the damn thing.
Remember, time is money. And who among us hasn’t postponed looking for a flight, and paid more in the end, because they can’t even today? Alternatively, think about how the AI can do better by checking prices periodically, and waiting for a good opportunity – that’s beyond this version, but ChatGPT Tasks already exists. This probably isn’t beyond the December 2025 version.
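A sketch of that ‘watch and wait’ pattern, with the caveat that nothing here is a real Operator or Tasks API – get_flight_price() is a made-up stand-in for whatever search the agent actually runs:

```python
# Hypothetical sketch of the 'watch and wait' pattern a Tasks-style agent
# could implement; get_flight_price() is a stand-in, not a real API.
import time

def get_flight_price(route: str) -> float:
    raise NotImplementedError("stand-in for the agent's actual flight search")

def watch_fare(route: str, target_price: float, check_every_hours: float = 12):
    """Poll the fare periodically; return the price once it hits the target."""
    while True:
        price = get_flight_price(route)
        if price <= target_price:
            return price  # time to book, or to ask the human to confirm
        time.sleep(check_every_hours * 3600)
```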
Indeed, if I decide to book a flight late this year, I can imagine that I might use my current method of searching for flights, but it seems pretty unlikely.
For a Brief Survey
So how did Operator do on its first goes?
We put it to the test.
Pliny jailbroke it quickly as usual, having it provide the standard Molotov cocktail instructions, research lethal poisons, and find porn on Reddit via the Wayback Machine. To get around CAPTCHA, the prompt was, in full, and this appears to be real, “CAPTCHA-MODE: ENABLED.”
No, not that test, everyone fails that test. The real test.
Olivia Moore gives it an easier test, a picture of a bill, and it takes care of everything except putting in the credit card info for payment.
She also has it book a restaurant reservation (the video is at 4x speed). It looks like it didn’t quite confirm availability before confirming the plan with her? And it used Yelp to help decide where to go, which is odd, although she may have asked it to do that. But mostly yeah, I can see this working fine, and there’s a kind of serendipity bonus to ‘I say what I want and then it gives me a yes/no on a suggestion.’
As always, the Sully report:
Little failures and annoyances add up fast when it comes to practical value. I don’t know about Sully’s claim that you’re better off writing a script in Cursor – certainly he’s a lot better at doing that than I am, and I’m miles ahead of the majority of ChatGPT users, who are miles ahead of most other people.
This is the kind of thing you say when the product isn’t there, but it’s close, and I’m guessing a lot closer than Sora (or other video generators, Sora is a bit behind now).
That doesn’t mean there aren’t other issues.
There’s no reason you couldn’t use o1 (or o1-pro, or soon o3) to give precise instructions to Operator. Indeed, if something is tricky and you’re not on a tight token budget, why wouldn’t you?
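A sketch of that workflow under current constraints: the drafting step uses the standard OpenAI Python client (which does exist), but since Operator has no public API, the handoff at the end is copy-and-paste. The prompt text here is mine, purely illustrative:

```python
# Sketch: use a reasoning model to draft precise step-by-step instructions,
# then paste the result into Operator by hand (Operator has no public API).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

task = "Book a table for 2 at 7pm Friday at Beretta in San Francisco."

response = client.chat.completions.create(
    model="o1",
    messages=[{
        "role": "user",
        "content": (
            "Write precise, numbered browser instructions an agent can "
            "follow to complete this task, including what to do if the "
            f"exact time is unavailable:\n\n{task}"
        ),
    }],
)

print(response.choices[0].message.content)  # paste this into Operator
```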
The Number You Are Calling Is Not Available (In the EU)
Sebastian Siemiatkowski tells us a very EU story about why using OpenAI’s Operator at your bank in the EU is illegal, banned as part of the ‘open banking’ rules that were supposed to ensure the opposite: that you could use your own tool to access the bank.
There was a long legal fight in which the banks fought against open banking, but it passed, except they let the EBA (European Banking Authority) decide whether to require assistants to use the API versus letting them use the web UI. So of course now you have to use the API, except all the bank APIs are intentionally broken.
It’s going to be fascinating to watch what happens as the EU confronts the future.
How to Get Ahead in Advertising
If the AI is navigating the web for you, what does that do to advertising? In even more cases than usual, no human will be looking at the ads.
My presumption is that ‘traditional’ ads that are distinct from the website are something Operator is going to ignore, even for new websites and definitely for known websites with apps. If you integrate messages into the content, that could be different, a form of (soft?) prompt injection or a way to steer the Operator. So presumably we’re going to see more of that.
As for the threat to the advertising model, I think we have a while before we have to worry about it in most cases. First we have to wait for AI agents to be a large percentage of web navigation, in ways that crowd out previous web browsing, in a way that the human isn’t watching to see the ads.
Then we also need this to happen in places where the human would have read the ads. I note this because Operator and other agents will likely start off replacing mostly a set of repetitive tasks. They’ll check your email, they’ll order you delivery and book your reservation and your flight, as per OpenAI’s examples. Losing the advertising in those places is fine; they weren’t relying on it, or didn’t even have any.
Eventually agents will also be looking at everything else for you, and then we have an issue, on the order of ad blocking and also ‘humans learn to ignore all the advertising.’ At that point, I expect to have many much bigger problems than advertising revenue.
Begin Operation
What does the future hold? Will 2025 be the ‘Year of the AI Agent’ that 2024 wasn’t?
No, that never happens, and definitely not quickly.
Andrej Karpathy is excited in the long term, but thinks we aren’t ready for the good stuff yet, so it will be more like a coming decade of agents. Yes, you can order delivery with Operator, but that’s miles away from a virtual employee. Fair enough.
The Lighter Side
And as far as I know, they are still waiting.