TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data.
To my understanding, you only observe this effect for prompts that indicate or imply the current late-2025 date. Gemini completes such prompts with "that must be hypothetical writing", because in the vast majority of its training data, 2025 was in the future (and late 2025 was always hypothetical). I think it is more accurate to phrase this as "Gemini 3 goes off the rails when it sees a prompt that indicates it was written in 2025, because in its training data, everything that implied a 2025 date was a fictional scenario" (that is also true for 2.5). Or did you manage to elicit such an effect with a prompt from which the current, after-training-cutoff date can't be inferred?
Deep learning is a general method in the sense that most tasks are solved by using a handful of basic tools from a standard toolkit, adapted for the specific task at hand. Once you’ve selected the basic tools, all that’s left is supplying the training data, specifying the objective that lets the AI know how well it’s doing, throwing a lot of computation at the problem, and fiddling with details. My understanding is that there typically isn’t much conceptual ingenuity involved in solving these problems, that most of the work goes into fiddling with details, and that trying to be clever doesn't lead to better results than using standard tricks with more computation and training data. It's also worth noting that most of the tools in this standard toolkit have been around since the '90s (e.g. convolutional neural networks, LSTMs, reinforcement learning, backpropagation), and that the recent boom in AI was driven by using these decades-old tools with unprecedented amounts of computation.
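To make the "standard recipe" concrete, here is a minimal sketch (my own illustration, not tied to any real task; the data is random noise) of what that loop typically looks like in PyTorch:

```python
# Minimal sketch of the standard recipe: pick standard building blocks,
# supply data, specify an objective, and let gradient descent do the rest.
# The inputs/labels are random stand-ins just to make the loop runnable.
import torch
from torch import nn

# 1. Standard tools from the toolkit: a small feed-forward network.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# 2. Supply the training data (here: random stand-in data).
inputs = torch.randn(256, 32)
labels = torch.randint(0, 10, (256,))

# 3. Specify the objective that tells the model how well it is doing.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 4. Throw computation at it: forward pass, loss, backprop, update, repeat.
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
```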
Well, the "details" are in fact hard to come up with, can be reused across problems, and do make the difference between working well and not working well! It's a bit like saying that general relativity fills in some details in the claim that nature is described by differential equations, which was made much earlier.
In the AlexNet paper [1], ReLU units were described as nonstandard, with a reference to a 2010 paper, and dropout regularization was introduced as a recent invention from 2012. In fact, the efficiency of computer-vision DL architectures has since increased faster than that of the underlying silicon (https://openai.com/blog/ai-and-efficiency/).
My understanding of the claim made by the "bitter lesson" article you link to is not that intellectual effort is worthless when it comes to AI, but that the effort should go into improving the efficiency with which the computer learns from training data, not into implementing human understanding of the problem in the computer directly.
In a very general sense, though, attention mechanisms, for example, can be understood as inspired by subjective experience (even though here, too, the effort went into developing things that work for computers, not into thinking really hard about how a human pays attention and formalizing that).
Let's assume that half of the eventual deaths from current infections have already happened, due to the lockdown extending the doubling time from three days to more than a week.
How do you draw that conclusion?
According to this article, it seems clear by now that low oxygen is in fact dangerous even when you feel fine, so buying a pulse oximeter is useful.
https://www.nytimes.com/2020/04/20/opinion/coronavirus-testing-pneumonia.html
1. I think the "calibration curves" one sees e.g. in https://slatestarcodex.com/2020/04/08/2019-predictions-calibration-results/ are designed to, and helpful for, evaluating and improving a strict subset of prediction errors: systematic over- or underconfidence. Clearly, there is more to being an impressive predictor than just being well-calibrated, but becoming better-calibrated is a relatively easy thing to do with those curves. One can also imagine someone who naturally generates 50 % predictions that turn out to be over- or underconfident (i.e. that come true at a rate noticeably different from 50 %).
2.0. Having access to "baseline probabilities/common-wisdom estimates" is mathematically equivalent to having a "baseline predictor/woman-on-the-street" whose probability estimates match those baseline probabilities. I think your discussion can be clarified and extended by not framing it as "judging the impressiveness of one person by comparing their estimates against a baseline", but as "given track records of two or more persons/algorithms, compare their predictions' accuracy and impressiveness, where one person might be the 'baseline predictor'".
2.1. If you do want to compare two persons' track records/generalized impressiveness on the same set of predictions (e.g. to decide whom to trust more), the natural choice is the log loss used to optimize ML algorithms: one sums -ln(p) over the probability estimates p assigned to the judgments that turned out true; lower sums are better. 50 % predictions are of course valid data points for the log loss if both persons made a prediction (see the sketch after 2.2 below). In contrast, if reference predictions aren't available, it doesn't seem feasible to me to judge predictions of 50 % or any other probability estimate.
2.2. One can prove: for events with a truly random component, the expected log loss is minimized by giving the correct probability estimates. If there is a very competent predictor who is nevertheless systematically overconfident as in 1., one can strictly improve upon their log loss by appropriately rescaling their probability estimates.
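To make 2.1 and 2.2 concrete, here is a minimal sketch (my own illustration; all numbers below are made up) of comparing two track records via log loss and of checking that shrinking an overconfident predictor's estimates toward 50 % lowers the expected loss:

```python
# Sketch: log loss as a way to compare track records, plus a recalibration
# example. All probabilities and outcomes are invented for illustration.
import numpy as np

def log_loss(p_of_what_happened):
    """Sum of -ln(p) over the probabilities assigned to the outcomes that occurred."""
    return -np.sum(np.log(p_of_what_happened))

# Probability each predictor assigned to the outcome that ended up true.
predictor_a = np.array([0.9, 0.8, 0.5, 0.95, 0.7])   # hypothetical track record
baseline    = np.array([0.6, 0.55, 0.5, 0.7, 0.65])  # hypothetical "baseline predictor"
print(log_loss(predictor_a), log_loss(baseline))     # lower is better

# 2.2: if someone is systematically overconfident, shrinking their estimates
# toward 0.5 can strictly lower the *expected* log loss.
def expected_log_loss(q, p_true):
    # Expectation over the event happening (prob. p_true) or not (1 - p_true).
    return -(p_true * np.log(q) + (1 - p_true) * np.log(1 - q))

def shrink_toward_half(q, lam):
    return 0.5 + lam * (q - 0.5)

p_true, stated = 0.7, 0.9  # true probability 0.7, overconfident estimate 0.9
print(expected_log_loss(stated, p_true))                           # ~0.765
print(expected_log_loss(shrink_toward_half(stated, 0.6), p_true))  # ~0.615, better
print(expected_log_loss(p_true, p_true))                           # ~0.611, the minimum
```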
I think your argument can be strengthened by multiplying all the animal-year values by 1000: this would yield a value of veganism of $430/year, which is still less than what eating meat would be worth to a typical LW user, and yields values for the worth of animals that are probably higher than what most vegans would claim.
Why are surgical or self-made masks supposed to be better at protecting others than at protecting oneself? Naively, it seems to me that the percentage of filtered droplets/aerosols should be the same regardless of the direction of airflow.
I'd like to point out that the growth in India is still exponential (linear on a log scale): https://www.worldometers.info/coronavirus/country/india/. This could be or become true of other developing countries.
India and other developing countries probably have a harder time controlling the outbreak (and governments and the young, food-insecure populations may judge the economic cost of social distancing to be higher than the risk of the virus).
There was a time when the number of worldwide cases appeared to stagnate because of the Chinese lockdown, but this number just hid the exponential growth of the European+US outbreaks.
What I said doesn't contradict any explicit statement in your comment; I just want to argue against the hypothetical deduction from "the growth rate of the world as a whole has also turned linear" to "and this means that the world is over the hill".
EDIT: The South Korean press releases contain a chart somewhat like the one I wanted; see e.g. https://www.cdc.go.kr/board/board.es?mid=a30402000000&bid=0030
I am looking for a better overview of imported cases by country of origin in East Asian countries.
EDIT: I remembered incorrectly; the following is wrong. In particular, I recall a statistic according to which a significant number of imported cases in South Korea on one day ~1-2 weeks ago came from China (~12, vs. ~40 from Europe).
If this were true, it would seem to me like strong evidence that China is lying about having all domestic cases isolated and community spread suppressed.
As mentioned in my other comment, the reason an LLM would behave like that is that, at the time all its training data was written, late 2025 was a future date. So this is apparently something that needs to be trained out, which was not done in the case of Gemini (at least when using AI Studio). One way to reduce the behavior is to put "today is <date>" into the system prompt, but even then it apparently spends an inordinate number of tokens validating and pondering that date.
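For reference, a minimal sketch of that workaround, assuming the google-genai Python SDK (the model name is a placeholder, not a confirmed identifier):

```python
# Sketch of the "today is <date>" workaround via a system prompt.
# Assumes the google-genai Python SDK; the model name is a placeholder.
from datetime import date
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model name
    contents="What notable AI news happened this month?",
    config=types.GenerateContentConfig(
        # Anchor the model in the present so late-2025 dates don't read as fiction.
        system_instruction=f"Today is {date.today().isoformat()}.",
    ),
)
print(response.text)
```

Even with this, the model may still (as noted above) spend many reasoning tokens second-guessing the supplied date.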