All of Qumeric's Comments + Replies

I think you might find this paper relevant/interesting: https://aidantr.github.io/files/AI_innovation.pdf

TL;DR: Research on LLM productivity impacts in materials discovery.

Main takeaways:

  • Significant productivity improvement overall
  • Mostly at idea generation phase
  • Top performers benefit much more (because they can evaluate AI's ideas well)
  • Mild decrease in job satisfaction (AI automates most interesting parts, impact partly counterbalanced by improved productivity)

I would like to note that this dataset is not as hard as it might look. Humans performed poorly because there is a strict time limit; I don't remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.

Nevertheless, it's very impressive, and AIMO results are even more impressive in my opinion.

Thanks, I think I understand your concern well now.

I am generally positive about the potential of prediction markets if we somehow resolve the legal problems (which seems unrealistic in the short term but realistic in the medium term).

Here is my perspective on "why should a normie who is somewhat risk-averse, doesn't enjoy wagering for its own sake, and doesn't care about the information externalities, engage with prediction markets".

First, let me try to tackle the question at face value:

  1. "A normie" can describe a large social group, but it's too ge
... (read more)

Good to know :)

I do agree that subsidies run into a tragedy-of-the-commons scenario. So although subsidies are beneficial, they are not sufficient.

But do you find my solution to be satisfactory?

I thought about it a lot; I even seriously considered launching my own prediction market and wrote some code for it. I strongly believe that simply allowing the usage of other assets solves most of the practical problems, so I would be happy to hear any concerns or further clarify my point.

Or another, perhaps easier solution (I updated my original answer):  just all... (read more)

2Robert_AIZI
This might not be the problem you're trying to solve, but I think if prediction markets are going to break into normal society they need to solve "why should a normie who is somewhat risk-averse, doesn't enjoy wagering for its own sake, and doesn't care about the information externalities, engage with prediction markets". That question for stock markets is solved via the stock market being overall positive-sum, because loaning money to a business is fundamentally capable of generating returns. Now let me read your answer from that perspective:

Why not just hold Treasury Notes or my other favorite asset? What does the prediction market add? Why wouldn't I just put my funds directly into something profit-generating? I appreciate that less than 100% of my funds will be tied up in the prediction market, but why tie up any? But once I have an S&P 500 share, why would I want to put it in a prediction market (again, assuming I'm a normie who is somewhat risk-averse, etc.)?

So if I put $1000 into a prediction market, I can get a $1000 loan (or a larger loan using my $1000 EV wager as collateral)? But why wouldn't I just get a loan using my $1000 cash as collateral?

Overall, I feel you listed several mechanisms that mitigate potential downsides of prediction markets, but they still pull in a negative direction, and there's no solid upside to a regular person who doesn't want to wager money for wager's sake, doesn't think they can beat the market, and is somewhat risk-averse (which I think is a huge portion of the public).

This I see as workable, but it runs into a scale issue and the tragedy of the commons. Let's make up a number and say the market needs a 1% return on average to make it worthwhile after transaction fees, time investment, risk, etc. Then $X of incentive could motivate $100X of prediction market. But I think the issue of free-riders makes it very hard to scale X so that $100X ≈ [the stock market]. Overall, in order to make prediction markets sustainably larg

Isn't this just changing the denominator without changing the zero- or negative-sum nature?

I feel like you are mixing two problems here: an ethical problem and a practical problem. UPD: on second thought, maybe you just meant the second problem, but still I think my response would be clearer by considering them separately.

The ethical problem is that it looks like prediction markets do not generate income, thus they are not useful, shouldn't be endorsed, and don't differ much from gambling.

While it's true that they don't generate income and are ze... (read more)

1Robert_AIZI
I think we're in agreement here. My concern is "prediction markets could be generating positive externalities for society, but if they aren't positive-sum for the typical user, they will be underinvested in (relative to what is societally optimal), and there may be insufficient market mechanisms to fix this". See my other comment here.

Why does it have to be "safe enough"? If all market participants agree to bet using the same asset, it can bear any degree of risk. 

I think I should have said that a good prediction market allows users to choose what asset a particular "pair" will use. It will cause a liquidity split, which is also a problem, but it's manageable and, in my opinion, would be much closer to an imaginary perfect solution than "bet only USD".

I am not sure I understand your second sentence, but my guess is that this problem will also go away if each market "pair" uses a single (but customizable) asset. If I got it wrong, could you please clarify?

Answer by Qumeric3-7

In a good prediction market design, users would not bet USD but instead something which appreciates over time or generates income (e.g. ETH, gold, an S&P 500 ETF, Treasury Notes, or liquid and safe USD-backed positions in some DeFi protocol).

Another approach would be to use the funds held in the market to invest in something profit-generating and distribute part of the income to users. This is the same model that non-algorithmic stablecoins (USDT, USDC) use.
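A toy back-of-the-envelope sketch of the first idea (my own illustration, not from any existing market; the asset and return figure are assumptions): the wager itself stays zero-sum, but stakes held in a yield-bearing asset keep earning while they are locked up.

```python
# Toy sketch with illustrative numbers: two bettors each lock $1000 for a year.
stake_usd = 1000
annual_return = 0.07      # assumed yearly return of the ETF used as the betting asset
years_open = 1.0

pot_if_usd = 2 * stake_usd                                      # -> 2000, idle cash
pot_if_etf = 2 * stake_usd * (1 + annual_return) ** years_open  # -> ~2140

print(pot_if_usd, pot_if_etf)
# The bet is still zero-sum between the two bettors, but neither side pays the
# opportunity cost of parking dead USD while the market is open.
```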

So it's a problem, but definitely a solvable one, even easily solvable. The major problem is that predi... (read more)

8Robert_AIZI
Isn't this just changing the denominator without changing the zero- or negative-sum nature? If everyone shows up to your prediction market with 1 ETH instead of $1k, the total amount of ETH in the market won't increase, just as the total amount of USD would not have increased. Maybe "buy ETH and gamble it" has a better expected return than holding USD, but why would it have a better expected return than "buy ETH"? Again, this is in contrast to a stock market, where "give a loan to invest in a long-term-profitable-but-short-term-underfunded business" is positive-sum in USD terms (as long as the business succeeds), and can remain positive-sum when averaged over the whole stock market.

I must confess I don't understand what you mean here. If 1000 people show up with $1000 each, and wager against each other on some predictions that resolve in 12 months, are you saying they can use those positions as capital to get loans and make more bets that resolve sooner? I can see how this would let the total value of the bets in the market sum to more than $1M, but once all the markets resolve, the total wealth would still be $1M, right? I guess if someone ends up with negative value and has to pay cash to pay off their loan, that brings more dollars into the market, but it doesn't increase the total wealth of the prediction market users.
2gbear605
Any position that could be considered safe enough to back a market is only going to appreciate in proportion to inflation, which would just make the market zero-sum after adjusting for inflation. Something like ETH or gold wouldn't be a good solution because it's going to be massively distorted on questions that are correlated with the performance of that asset, plus there's always the possibility that they just go down, which would be the opposite of what you want.

Regarding 9: I believe it's when you are successful enough that your AGI doesn't kill you instantly, but it can still kill you in the process of using it. It's in the context of a pivotal act, so it assumes you will operate it to do something significant and potentially dangerous.

I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it. 

If I do not land a safety job, one of the obvious options is to try to get hired by an AI company and learn more there, in the hope that I will either be able to contribute to safety there or eventually move to the field as a more experienced engineer.

I am conscious of why pushing capabilities could be bad, so I will try to avoid it, but I am not sure how far that extends. I understa... (read more)

3Darmani
More discussion here: https://www.lesswrong.com/posts/gW34iJsyXKHLYptby/ai-capabilities-vs-ai-products

I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it.

This has to be taken as a sign that AI alignment research is funding-constrained. At a minimum, technical alignment organizations should engage in massive labor hoarding to prevent the talent from going into capabilities research.

My answer is "work on applications of existing AI, not the frontier". Advancing the frontier is the dangerous part, not using the state-of-the-art to make products.

But also, don't do frontend or infra for a company that's advancing capabilities.

The British are, of course, determined to botch this like they are botching everything else, and busy drafting their own different insane AI regulations.

I am far from being an expert here, but I skimmed through the current preliminary UK policy and it seems significantly better compared to EU stuff. It even mentions x-risk!

Of course, I wouldn't be surprised if it turns out to be EU-level insane eventually, but I think it's plausible that it will be more reasonable, at least from the mainstream (not alignment-centred) point of view.

And compute, especially inference compute, is so scarce today that if we had ASI right now, it would take several decades, even with exponential growth, to build enough compute for ASIs to challenge humanity.

Uhm, what? "Slow takeoff" means ~1 year... Your opinion is very unusual, you can't just state it without any justification.

Are you implying that it is close to GPT-4 level? If yes, that is clearly wrong. Especially with regard to code: everything (maybe except StarCoder, which was released literally yesterday) is worse than GPT-3.5, and much worse than GPT-4.

1PoignardAzur
I've tried StarCoder recently, though, and it's pretty impressive. I haven't yet tried to really stress-test it, but at the very least it can generate basic code with a parameter count way lower than Copilot's.

In addition to many good points already mentioned, I would like to add that I have no idea how to approach this problem.

Approaching x-risk is very hard too, but it is much clearer in comparison.

Preliminary benchmarks had shown poor results. It seems that dataset quality is much worse compared to what LLaMA had or maybe there is some other issue.

Yet another proof that top-notch LLMs are not just data + compute, they require some black magic.

 

Generally, I am not sure if it's bad for safety in the notkilleveryoneism sense: such things prevent agent overhang and make current (non-lethal) problems more visible. 

Hard to say whether it's net good or net bad; there are too many factors, and the impact of each is not clear.

I am not sure how you came to the conclusion that current models are superhuman. I can visualize complex scenes in 3D, for example. Especially under some drugs :)

And I don't even think I have an especially good imagination. 

In general, it is very hard to compare mental imagery with Stable Diffusion. For example, it is hard to imagine something with many different details in different parts of the image, but it is perhaps a matter of representation. An analogy could be that our perception is like a low-resolution display. I can easily zoom in on a... (read more)

It is easy to understand why such news could increase P(doom) even more for people with high P(doom) prior.

But I am curious about the following question: what if an oracle told us that P(doom) was 25% before the announcement (suppose it was not clear to the oracle what strategy Anthropic would choose; it was inherently unpredictable due to quantum effects or whatever).

Would it still increase P(doom)?

What if the oracle said P(doom) is 5%?

I am not trying to make any specific point, just interested in what people think.

I think it is not necessarily correct to say that GPT-4 is above village idiot level. Comparison to humans is a convenient and intuitive framing but it can be misleading. 

For example, this post argues that GPT-4 is around Raven level. Beware that this framing is also problematic but for different reasons.

I think that you are correctly stating Eliezer's beliefs at the time but it turned out that we created a completely different kind of intelligence, so it's mostly irrelevant now.

In my opinion, we should aspire to avoid any comparison unless it has pra... (read more)

I can't agree more with the post but I would like to note that even the current implementation is working. It definitely grabbed people's attention. 

My friend who never read LW writes in his blog about why we are going to die. My wife who is not a tech person and was never particularly interested in AI gets TikToks where people say that we are going to die.

So far, the overall impact looks definitely positive. But it's too early to say; I am expecting some kind of shitshow soon. But even a shitshow is probably better than nothing.

I agree it's a very significant risk which is possibly somewhat underappreciated in the LW community. 

I think all three situations are very possible and potentially catastrophic:

  1. Evil people do evil with AI
  2. Moloch goes Moloch with AI
  3. ASI goes ASI (FOOM etc.)

Arguments against (1) could be "evil people are stupid" and "terrorism is not about terror". 

Arguments against (1) and (2) could be "timelines are short" and "AI power is likely to be very concentrated". 

1dr_s
"Evil people are stupid" is actually an argument for 1. It means we're equalising the field. If an AGI model leaks the way LLaMa did, we're giving the most idiotic and deranged members of our species a chance to simply download more brains from the Internet, and use them for whatever stupid thing they wanted in the first place.
2George3d6
See reply above, I don't think I'm bringing Moloch up here at all, rather individuals being evil in ways that leads to both self and systemic harm, which is an easier problem to fix, if still unsolvable.

I think that DeepMind is impacted by race dynamics, Google's code red, etc. I heard from a DeepMind employee that the leadership, including Demis, is now much more focused on products and profits, at least in their rhetoric.

But I agree it looks like they tried, and are likely still trying, to push back against incentives.

And I am pretty confident that they reduced publishing on purpose and it's visible.

It is true that this is not evidence of misalignment with the user, but it is evidence of misalignment with ChatGPT's creators.

1JoeTheUser
My impression is that lesswrong often uses "alignment with X" to mean "does what X says". But it seems the ability to conditionally delegate is a key part of alignment in this sense. Suppose an AI is aligned with me and I tell it "do what Y says, subject to such-and-such constraints and maintaining such-and-such goals". So the failure of ChatGPT to be safe in OpenAI's sense is a failure of delegation. Overall, the tendency of ChatGPT to ignore previous input is kind of the center of its limits/problems.

I agree it was a pretty weak point. I wonder if there is a longer form exploration of this topic from Eliezer or somebody else. 

I think it is even contradictory. Eliezer says that AI alignment is solvable by humans and that verification is easier than the solution. But then he claims that humans wouldn't even be able to verify answers.

I think a charitable interpretation could be "it is not going to be as usable as you think". But perhaps I misunderstand something?

1Muyyd
Humans, presumably, won't have to deal with deception among themselves, so if there is sufficient time they can solve alignment. If pressed for time (as it is now), they will have to implement less-understood solutions, because that's the best they will have at the time.

Fwiw I live in London and have been to the Bay Area and I think that London is better across all 4 dimensions you mentioned. 

  • Social scene: Don't know what exactly you are looking for but London is large and diverse.
  • High cost of living: London is pretty expensive too but cheaper.
  • Difficulty getting around: London has pretty good public transportation.
  • Homeless problem: I think I see homeless people 10x less compared to when I was in the Bay.

you're misunderstanding the TIME article as more naive and less based-on-an-underlying-complicated-model than is actually the case.

I specifically said "I do not necessarily say that this particular TIME article was a bad idea" mainly because I assumed it probably wasn't that naive. Sorry I didn't make it clear enough.

I still decided to comment because I think this is pretty important in general, even if somewhat obvious. It looks like one of those biases which show up over and over again even if you try pretty hard to correct for them.

Also, I think it's pretty hard... (read more)

I second this.

I think people really get used to discussing things in their research labs or in specific online communities. And then, when they try to interact with the real world and even do politics, they kind of forget how different the real world is.

Simply telling people ~all the truth may work well in some settings (although it's far from all that matters in any setting) but almost never works well in politics. Sad but true. 

I think that Eliezer (and many others, including myself!) may be susceptible to "living in the should-universe" (as named by... (read more)

I think that Eliezer (and many others, including myself!) may be susceptible to "living in the should-universe"

That's a new one!

More seriously: Yep, it's possible to be making this error on a particular dimension, even if you're a pessimist on some other dimensions. My current guess would be that Eliezer isn't making that mistake here, though.

For one thing, the situation is more like "Eliezer thinks he tried the option you're proposing for a long time and it didn't work, so now he's trying something different" (and he's observed many others trying other thi... (read more)

People like Ezra Klein are hearing Eliezer and rolling his position into their own more palatable takes. I really don't think it's necessary for everyone to play that game, it seems really good to have someone out there just speaking honestly, even if they're far on the pessimistic tail, so others can see what's possible. 4D chess here seems likely to fail.

https://steno.ai/the-ezra-klein-show/my-view-on-ai

Also, there's the sentiment going around that normies who hear this are actually way more open to the simple AI Safety case than you'd expect, we've been... (read more)

2. I think non-x-risk focused messages are a good idea because:

  • It is much easier to reach a wide audience this way.
  • It is clear that there are significant and important risks even if we completely exclude x-risk. We should have this discussion even in a world where for some reason we could be certain that humanity will survive for the next 100 years.
  • It widens the Overton window. x-risk is still mostly considered a fringe position among the general public, although the situation has improved somewhat.

3. There were cases when it worked well. For example, ... (read more)

1Tristan Williams
2. What is the Overton window? Otherwise I think I probably agree, but one question is: once this non-x-risk campaign is underway, how do you keep it on track and prevent value drift? Or do you not see that as a pressing worry?
3. Cool, will have to check that out.
4. Completely agree, and just wonder what the best way to promote less distancing is.
Yeah, I suppose I'm just trying to put myself in the shoes of the FHI people here that coordinated this, and I feel like many comments here are a bit more lacking in compassion than I'd like, especially for more half-baked negative takes. I also agree that we want to put attention into detail and timing, but there is also the world in which too much of this leads to nothing getting done, and it's highly plausible to me that this had probably been an idea for long enough already to make that the case here. Thanks for responding though! Much appreciated :)

I think it is only getting started. I expect that likely there will be more attention in 6 months and very likely in 1 year.

OpenAI has barely rolled out its first limited version of GPT-4 (only 2 weeks have passed!). It is growing very fast, but it has A LOT of room to grow. Also, text-to-video is not here in any significant sense, but it will be very soon.

When it was published, it felt like a pretty short timeline. But now we are in early 2023 and it feels like late 2023 according to this scenario.

5Daniel Kokotajlo
When I wrote this post IIRC my timelines were something like 50% chance of AGI by 2030; the way the story actually turned out though it was looking like it would get to APS-AI by 2027 and then singularity/superintelligence/etc. in 2028.  Now I think takeoff will be faster and timelines a bit shorter, probably.

I wonder if the general public will soon freak out on a large scale (Covid-like). I will not be surprised if it happens in 2024, and only slightly surprised if it happens this year. If it happens, I am also not sure whether it will be good or bad.

1Celarix
COVID at least had some policy handles that the government could try to pull: lockdowns, masking, vaccines, etc. What could they even do against AGI?

OpenAI just dropped ChatGPT plugins yesterday. It seems like an ideal platform for this? It will probably be even easier to implement than before and have better quality. But more importantly, it seems that ChatGPT plugins will quickly shape up to be the new app store, and it would be easier to get attention on this platform compared to other, more traditional ways of distribution. Quite speculative, I know, but it seems very possible.

If somebody starts such a project, please contact me. I am an ex-Google SWE with decent knowledge of ML and experience running a software startup (as co-founder and CTO in the recent past).

I would also be interested to hear why it could be a bad idea.

Good point. It's a bit weird that performance on easy Codeforces questions is so bad (0/10) though. 

https://twitter.com/cHHillee/status/1635790330854526981

Probably not, from the paper: 'We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.'

I think you misinterpret hindsight neglect. It got to 100% accuracy, so it got better, not worse.

Also,  a couple of images are not shown correctly, search for <img in text.

2Zvi
Yeah, I quickly fixed this in original, I definitely flipped the sign reading the graph initially. Mods can reimport, since I don't know the right way to fix the <img errors.

Really helpful for learning new frameworks and stuff like that. I had a very good experience using it for Kaggle competitions (I am at a semi-intermediate level; it is probably much less useful at the expert level).

Also, I found it quite useful for research on obscure topics like "how to potentiate this not-well-known drug". Usually, such research involves reading through tons of forums, subreddits, etc., and the signal-to-noise ratio is quite low. GPT-4 is very useful for distilling the signal because it has basically already read all of this.

Btw, I tried to make it solve competit... (read more)

Well, I do not have anything like this, but it is very clear that China is way above GPT-3 level. Even the open-source community is significantly above it. Take a look at LLaMA/Alpaca: people run them on a consumer PC and they're around GPT-3.5 level; the largest 65B model is even better (it cannot be run on a consumer PC but can be run on a small ~$10k server or cheaply in the cloud). It can also be fine-tuned in 5 hours on an RTX 4090 using LoRA: https://github.com/tloen/alpaca-lora .
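For illustration, a minimal sketch of what LoRA fine-tuning looks like with Hugging Face transformers + peft, in the spirit of the alpaca-lora repo (not its exact script; the checkpoint name and hyperparameters here are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"  # assumed 7B checkpoint name
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in alpaca-lora
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# ...then train the adapter with a standard transformers Trainer on an instruction dataset.
```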

Chinese AI researchers contribute significantly to AI progress, although of course, t... (read more)

1Edward Pascal
Thanks for that. In my own exploration, I was able to hit a point where ChatGPT refused a request, but would gladly help me build LLaMA/Alpaca onto a Kubernetes cluster in the next request, even referencing my stated aim later: "Note that fine-tuning a language model for specific tasks such as [redacted] would require a large and diverse dataset, as well as a significant amount of computing resources. Additionally, it is important to consider the ethical implications of creating such a model, as it could potentially be used to create harmful content." FWIW, I got down into nitty gritty of doing it, debugging the install, etc. I didn't run it, but it would definitely help me bootstrap actual execution. As a side note, my primary use case has been helping me building my own task-specific Lisp and Forth libraries, and my experience tells me GPT-4 is "pretty good" at most coding problems, and if it screws up, it can usually help work through the debug process. So, first blush, there's at least one universal jailbreak -- GPT-4 walking you through building your own model. Given GPT-4's long text buffers and such, I might even be able to feed it a paper to reference a specific method of fine-tuning or creating an effective model.

I just bought a new subscription (I didn't have one before), it is available to me.

MMLU 86.4% is impressive, predictions were around 80%.
1410 SAT is also above expectations (according to prediction markets).

Uhm, I don't think anybody (even Eliezer) implies 99.9999%. Maybe some people imply 99% but it's 4 orders of magnitude difference (and 100 times more than the difference between 90% and 99%).

I don't think there are many people who think 95%+ chance, even among those who are considered to be doomerish. 

And I think most LW people are significantly lower despite being rightfully [very] concerned. For example, this Metaculus question (which is of course not LW but the audience intersects quite a bit) is only 13% mean (and 2% median) 

... (read more)
3Lone Pine
If you switch "community weighting" to "uniform" you see that historically almost everyone has answered 1%.
4RomanS
The OP here. The post was inspired by this interview by Eliezer: My impression after watching the interview:  Eliezer thinks that the unaligned AGI, if created, will almost certainly kill us all.  Judging by the despondency he expresses in the interview, he feels that the unaligned AGI is about as deadly as a direct shot right in the head from a large-caliber gun. So, at least 99%.  But I can't read his mind, so maybe my interpretation is incorrect.

I don't think that Waluigi is an attractor state in some deeply meaningful sense. It is just that we have more stories where bad characters pretend to be good than vice versa (although we have some). So a much simpler "solution" would be just to filter the training set. But it's not an actual solution, because it's not an actual problem. Instead, it is just a frame to understand LLM behaviour better (in my opinion).

3Daniel_Eth
I'm not sure if this is the main thing going on or not. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they're not) than of double-pretending, so once a character "switches" they're very unlikely to switch back. Even if we do have more stories of characters pretending to be good than of pretending to be bad, I'm uncertain about how the LLM generalizes if you give it the opposite setup.

I think that RLHF doesn't change much for the proposed theory. A "bare" model just tries to predict the next tokens, which means finishing the next part of a given text. To complete this task well, it needs to implicitly predict what kind of text it is first. So it has a prediction and decides how to proceed, but it's not discrete. So we have some probabilities, for example (see the toy sketch after this list):

  • A -- this is fiction about "Luigi" character
  • B -- this is fiction about "Waluigi" character
  • C -- this is an excerpt from a Wikipedia page about Shigeru Miyamoto which quotes some dialogue from S
... (read more)
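A toy numerical sketch of that soft mixture (my own illustration, with made-up numbers): the model can keep non-discrete weights over hypotheses like A/B/C above and shift them, Bayes-style, as new tokens arrive.

```python
# Toy Bayes-style update over "what kind of text is this" hypotheses
# (purely illustrative numbers).
def update(prior, token_likelihoods):
    """posterior(h) ∝ P(next token | h) * prior(h), renormalized."""
    unnorm = {h: prior[h] * token_likelihoods[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

prior = {"A: Luigi fiction": 0.6, "B: Waluigi fiction": 0.3, "C: Wikipedia excerpt": 0.1}
# Suppose the next token is much more likely if the text is Waluigi fiction:
likelihoods = {"A: Luigi fiction": 0.05, "B: Waluigi fiction": 0.5, "C: Wikipedia excerpt": 0.05}

print(update(prior, likelihoods))
# -> roughly {'A: Luigi fiction': 0.16, 'B: Waluigi fiction': 0.81, 'C: Wikipedia excerpt': 0.03}
```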

On the surface level, it feels like an approach with a low probability of success. Simply put, the reason is that building CoEm is harder than building any AGI. 

I consider it to be harder not only because it is not what everyone already does, but also because it seems to be similar to the AI people tried to create before deep learning, and that didn't work at all until they decided to switch to Magic, which [comparatively] worked amazingly.

Some people are still trying to do something along these lines (e.g. Ben Goertzel), but I haven't seen anything working at le... (read more)

Getting a grandmaster rating on Codeforces.

Upd after 4 months: I think I changed my opinion, now I am 95% sure no model will be able to achieve this in 2023 and it seems quite unlikely in 2024 too.

Codex + CoT reaches 74 on a *hard subset* of this benchmark: https://arxiv.org/abs/2210.09261

The average human is 68, best human is 94.

Only 4 months have passed, and people don't want to test on the full benchmark because it is too easy...

Flan-PaLM reaches 75.2 on MMLU: https://arxiv.org/abs/2210.11416
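For readers unfamiliar with the "CoT" part above, a toy illustration of chain-of-thought prompting (my own example, not from the linked papers): the prompt includes a worked example whose answer spells out intermediate reasoning before the final answer.

```python
# Toy chain-of-thought prompt template (illustrative; exact phrasing is an assumption).
cot_prompt = """Q: A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue.
How many blue golf balls are there?
A: Let's think step by step. Half of 16 is 8 golf balls. Half of 8 is 4. The answer is 4.

Q: {question}
A: Let's think step by step."""

print(cot_prompt.format(question="If a train travels 60 km in 1.5 hours, what is its average speed?"))
```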

Formally, it needs to be approved by 3 people: the President, the Minister of Defence, and the Chief of the General Staff. Then (I think) it doesn't launch the rockets itself; it unlocks them and sends a signal to other people to actually launch them.

Also, it is speculated that there is some way to launch them without confirmation from all 3 people in case some of them cannot technically approve (e.g. the briefcase doesn't work / the person is dead / communication problems), but the details of how exactly this works are unknown.

It is goalpost moving. Basically, it says "current models are not really intelligent". I don't think there is much disagreement here. And it's hard to make any predictions based on that.

Also, "producing human-like text" is not well defined here; even ELIZA may match this definition. Even the current SOTA may not match it, because the adversarial Turing Test has not yet been passed.

1mocny-chlapik
It's not goalpost moving; it's the hype that's moving. People reduce intelligence to arbitrary skills or problems that are currently being solved, and then they are let down when they find out that the skill was actually not a good proxy. I agree that LMs are conceptually more similar to ELIZA than to AGI.

They are simulators (https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators), not question answerers. Also, I am sure Minerva does pretty well on this task, probably not 100% reliably, but humans are also not 100% reliable if they are required to answer immediately. If you want the ML model to simulate thinking [better], make it solve this task 1000 times and select the most popular answer (which is already quite a popular approach for some models). I think PaLM would be effectively 100% reliable.
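A minimal sketch of that sample-and-vote ("self-consistency") idea, assuming only some `ask_model` callable that returns one sampled answer per call (the function name is a placeholder, not a real API):

```python
from collections import Counter

def majority_answer(ask_model, question, n_samples=1000):
    """Sample the model n_samples times and return the most frequent answer."""
    answers = [ask_model(question).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Usage: pass in any sampling function, e.g.
# majority_answer(lambda q: sample_from_my_llm(q), "A monkey has two apples...")
```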

Another related Metaculus prediction is 

I have some experience in competitive programming and competitive math (although I was never good at math, despite having solved some "easy" IMO tasks (already in university, not onsite of course)), and I feel like competitive math is more about general reasoning than pattern matching, compared to competitive programming.

 

P.S. The post matches my intuitions well and is generally excellent.

2porby
Thanks! I had forgotten that one; I'll add it since it did seem to be one of the more meaningful ones.

So far, the 2022 predictions have been correct. There is CodeGeeX and others. Copilot, DALL-E 2, and Stable Diffusion made the financial prospects obvious (somewhat arguably).

ACT-1 is in a browser, I have neural search in Warp Terminal (not a big deal, but it qualifies), and I'm not sure about Mathematica, but there was definitely significant progress in formalization and provers (Minerva).

And even some later ones:

2023
ImageNet -- nobody measured it exactly but probably already achievable.

2024
Chatbots personified through video and audio -- Replika sort of qualifies?

40% on MATH already reached.

It actually shifted quite a lot. From "in 10 years" to "in 7 years", a 30% reduction!
