Lech Mazur

Advameg, Inc. CEO

Founder, city-data.com

Author: County-level COVID-19 machine learning case prediction model.

Author: AI assistant for melody composition.

Posts

Sorted by New

4Lech Mazur's Shortform

Wikitag Contributions

Comments

Sorted by

Newest

"AI Rapidly Gets Smarter, And Makes Some of Us Dumber," from Sabine Hossenfelder

Lech Mazur1mo70

It's a video by an influencer who has repeatedly shown no particular insight in any field other than her own. For example, her video about the simulation hypothesis was atrocious. I gave this one a chance, and it's just a high-level summary of some recent developments, nothing interesting.

OpenAI releases GPT-4.5

Lech Mazur1mo911

It's better than 4o across four of my benchmarks: Confabulations, Creative Writing, Thematic Generalization, and Extended NYT Connections. However, since it's an expensive and huge model, I think we'd be talking about AI progress slowing down at this point if it weren't for reasoning models.

Anthropic releases Claude 3.7 Sonnet with extended thinking mode

Lech Mazur1mo70

I ran 3 of my benchmarks so far:

Extended NYT Connections

Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1
Claude 3.7 Sonnet: 11th place
GitHub Repository

Thematic Generalization

Claude 3.7 Sonnet Thinking: 1st place
Claude 3.7 Sonnet: 6th place
GitHub Repository

Creative Story-Writing

Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1
Claude 3.7 Sonnet: 4th place
GitHub Repository

Note that Grok 3 has not been tested yet (no API available).

Quinn's Shortform

Lech Mazur1mo70

This might blur the distinction between some evals. While it's true that most evals are just about capabilities, some could be positive for improving LLM safety.

I've created 8 (soon to be 9) LLM evals (I'm not funded by anyone, it's mostly out of my own curiosity, not for capability or safety or paper publishing reasons). Using them as examples, improving models to score well on some of them is likely detrimental to AI safety:

https://github.com/lechmazur/step_game - to score better, LLMs must learn to deceive others and hold hidden intentions

https://github.com/lechmazur/deception/ - the disinformation effectiveness part of the benchmark

Some are likely somewhat negative because scoring better would enhance capabilities:

https://github.com/lechmazur/nyt-connections/

https://github.com/lechmazur/generalization

Others focus on capabilities that are probably not dangerous:

https://github.com/lechmazur/writing - creative writing

https://github.com/lechmazur/divergent - divergent thinking in writing

However, improving LLMs to score high on certain evals could be beneficial:

https://github.com/lechmazur/goods - teaching LLMs not to overvalue selfishness

https://github.com/lechmazur/deception/?tab=readme-ov-file#-disinformation-resistance-leaderboard - the disinformation resistance part of the benchmark

https://github.com/lechmazur/confabulations/ - reducing the tendency of LLMs to fabricate information (hallucinate)

I think it's possible to do better than these by intentionally designing evals aimed at creating defensive AIs. It might be better to keep them private and independent. Given the rapid growth of AI capabilities, the lack of apparent concern for an international treaty (as seen in the recent Paris AI summit), and the competitive race dynamics among companies and nations, specifically developing an AI to protect us from threats from other AIs or AIs + humans might be the best we can hope for.

Zvi’s 2024 In Movies

Lech Mazur2mo10

Your ratings have a higher correlation with IMDb ratings at 0.63 (I ran it as a test of Operator).

DeepSeek beats o1-preview on math, ties on coding; will release weights

Lech Mazur4mo111

It seems that 76.6% originally came from the GPT-4o announcement blog post. I'm not sure why it dropped to 60.3% by the time of o1's blog post.

Sabotage Evaluations for Frontier Models

Lech Mazur5mo10

Somewhat related: I just published the LLM Deceptiveness and Gullibility Benchmark. This benchmark evaluates both how well models can generate convincing disinformation and their resilience against deceptive arguments. The analysis covers 19,000 questions and arguments derived from provided articles.

GPT-o1

Lech Mazur5mo10

I included o1-preview and o1-mini in a new hallucination benchmark using provided text documents and deliberately misleading questions. While o1-preview ranks as the top-performing single model, o1-mini's results are somewhat disappointing. A popular existing leaderboard on GitHub uses a highly inaccurate model-based evaluation of document summarization.

The chart above isn't very informative without the non-response rate for these documents, which I've also calculated:

The GitHub page has further notes.

OpenAI o1

Lech Mazur6mo281

NYT Connections results (436 questions):

o1-mini 42.2

o1-preview 87.1

The previous best overall score was my advanced multi-turn ensemble (37.8), while the best LLM score was 26.5 for GPT-4o.

Lech Mazur's Shortform

Lech Mazur7mo30

I've created an ensemble model that employs techniques like multi-step reasoning to establish what should be considered the real current state-of-the-art in LLMs. It substantially exceeds the highest-scoring individual models and subjectively feels smarter:

MMLU-Pro 0-shot CoT: 78.2 vs 75.6 for GPT-4o

NYT Connections, 436 questions: 34.9 vs 26.5 for GPT-4o

GPQA 0-shot CoT: 56.0 vs 52.5 for Claude 3.5 Sonnet.

I might make it publicly accessible if there's enough interest. Of course, there are expected tradeoffs: it's slower and more expensive to run.