lc

Comments

lc*50

> They strengthen chip export restrictions, order OpenBrain to further restrict its internet connections, and use extreme measures to secure algorithmic progress, like wiretapping OpenBrain employees—this catches the last remaining Chinese spy

Wiretapping? That's it? Was this spy calling Xi from his home phone? xD

lc11934

As a newly minted +100 strong upvote, I think the current karma economy accurately reflects how my opinion should be weighted

lc40

I have Become Stronger

lc470

My strong upvotes are now giving +1 and my regular upvotes give +2.

lc72

Just edited the post because I think the original phrasing somewhat exaggerated the difficulties we've been having applying the newer models. 3.7 was better, as I mentioned to Daniel; it was just underwhelming, and not as big a leap as 3.6 was, and certainly not as big as 3.5.

lc*110

If you plot a line, does it plateau or does it get to professional human level (i.e. reliably doing all the things you are trying to get it to do as well as a professional human would)?

It plateaus before professional human level, both in a macro sense (comparing what ZeroPath can do vs. human pentesters) and in a micro sense (comparing the individual tasks ZeroPath does when it's analyzing code). At least, the errors the models make are not ones I would expect a professional to make; I haven't actually hired a bunch of pentesters, asked them to do the same tasks we expect of the language models, and compared the results. One thing our tool has over people is breadth, but that's because we can parallelize inspection of different pieces, not because the models are doing individual tasks better than humans.
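
(To make the breadth point concrete, here's a minimal sketch of what parallelized inspection could look like; `call_model` and `analyze_piece` are hypothetical stand-ins, not ZeroPath's actual pipeline.)

```python
# Hypothetical sketch: fan independent LLM reviews out over separate pieces
# of a codebase. call_model stands in for any chat-completion API; none of
# this is ZeroPath's real code.
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError

def analyze_piece(path: str, source: str) -> str:
    # Each piece is reviewed in isolation, so the calls share no state.
    return call_model(f"Review {path} for security issues:\n\n{source}")

def analyze_codebase(pieces: dict[str, str], workers: int = 16) -> dict[str, str]:
    # Running the reviews concurrently is what buys breadth: a human team
    # cannot widen its attention this cheaply.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(analyze_piece, path, src)
                   for path, src in pieces.items()}
        return {path: f.result() for path, f in futures.items()}
```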

What about 4.5? Is it as good as 3.7 Sonnet but you don't use it for cost reasons? Or is it actually worse?

We have not yet tried 4.5 as it's so expensive that we would not be able to deploy it, even for limited sections. 

lc100

We use different models for different tasks for cost reasons. The primary workhorse model today is 3.7 Sonnet, whose improvement over 3.6 Sonnet was smaller than 3.6's improvement over 3.5 Sonnet. When given the workhorse role, o3-mini and the rest of the recent o-series models were strictly worse than 3.6.
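
(To illustrate the cost-driven split, a router can be as simple as a task-to-model table with the workhorse as the default. The task names and per-task choices below are assumptions for the sketch, not our actual configuration.)

```python
# Illustrative task-to-model routing; the tasks and model choices here are
# assumptions, not a real configuration.
MODEL_ROUTES = {
    "bulk_triage": "claude-3-5-haiku",     # cheaper model for high-volume filtering
    "deep_analysis": "claude-3-7-sonnet",  # the workhorse role described above
}
DEFAULT_MODEL = "claude-3-7-sonnet"

def pick_model(task: str) -> str:
    # Unrecognized tasks fall back to the workhorse model.
    return MODEL_ROUTES.get(task, DEFAULT_MODEL)
```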

lc*71

I haven't read the METR paper in full, but from the examples given I'm worried the tests might be biased in favor of an agent with no capacity for long-term memory, or at least might not hit the thresholds where context limitations become a problem:

For instance, task #3 here is at the limit of current AI capabilities (it takes an hour). But it's also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window, it might be able to write the rest of the decoder from scratch. It might not even need to keep the example files in memory while it's debugging its project against the test cases.

Whereas a task to fix a bug in a large software project, while it might take an engineer familiar with that project "an hour" to finish, stretches the limits of how much information can fit inside a context window, or requires recall beyond what models seem capable of today.
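
(A back-of-the-envelope sketch of the contrast: a few example files fit comfortably in one window, while the code implicated by a bug in a large project may not. The 4-characters-per-token ratio and the 200k-token window below are rough assumptions.)

```python
# Rough check of whether a set of files fits in one context window.
# The chars-per-token ratio and the window size are crude assumptions.
from pathlib import Path

CHARS_PER_TOKEN = 4              # rough average for English text and code
CONTEXT_WINDOW_TOKENS = 200_000  # ballpark for a current frontier model

def estimated_tokens(paths: list[Path]) -> int:
    return sum(len(p.read_text(errors="ignore")) for p in paths) // CHARS_PER_TOKEN

def fits_in_context(paths: list[Path]) -> bool:
    # Leave headroom for instructions, test output, and model replies.
    return estimated_tokens(paths) < 0.8 * CONTEXT_WINDOW_TOKENS
```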

lc5-5

There was a type of guy circa 2021 who basically said that GPT-3 etc. was cool, but that we should be cautious about assuming everything was going to change, because the context limitation was a key bottleneck that might never be overcome. That guy's take was briefly "discredited" in subsequent years, when LLM companies increased context lengths to 100k and 200k tokens.

I think that was premature. The context limitations (in particular the lack of an equivalent to human long-term memory) are the key deficit of current LLMs, and we haven't really seen much improvement at all.
