All of bruberu's Comments + Replies

As for one more test, it was rather close on reversing 400 numbers:

Given these results, it seems pretty obvious that this is a rather advanced model (although Claude Opus was able to do it perfectly, so it may not be SOTA). 

Going back to the original question of where this model came from, I have trouble putting the chance of this necessarily coming from OpenAI above 50%, mainly due to questions about how exactly this was publicized. It seems to be a strange choice to release an unannounced model in Chatbot Arena, especially without any sort of associated update on GitHub for the model (which would be in https://github.com/lm-sys/FastChat/blob/851ef88a4c2a5dd5fa3bcadd9150f4a1f9e84af1/fastchat/model/model_registry.py#L228 ). However, I think I still have some pretty large error margins, given how little information I can really find.

7gwern
Nah, it's just a PR stunt. Remember when DeepMind released AlphaGo Master by simply running a 'Magister' Go player online which went undefeated?* Everyone knew it was DeepMind simply because who else could it be? And IIRC, didn't OA also pilot OA5 'anonymously' on Dota 2 ladders? Or how about when Mistral released torrents? (If they had really wanted a blind test, they wouldn't've called it "gpt2", or they could've just rolled it out to a subset of ChatGPT users, who would have no way of knowing the model underneath the interface had been swapped out.)

* One downside of that covert testing: DM AFAIK never released a paper on AG Master, or on all the complicated & interesting things they were trying before they hit upon the AlphaZero approach.

OK, what I actually did was not realize that the link provided did not link directly to gpt2-chatbot (instead, the front page just compares two random chatbots from a list). After figuring that out, I reran my tests; it was able to do 20, 40, and 100 numbers perfectly.

I've retracted my previous comments.


Interesting; maybe it's an artifact of how we formatted our questions? Or, potentially, the training samples with larger ranges of numbers were higher quality? You could try it the way I did in this failing example:

When I tried this same list with your prompt, both responses were incorrect:

[This comment is no longer endorsed by its author]

By using @Sergii's list reversal benchmark, this model appears to fail at reversing a list of 10 random numbers from 1-10 (taken from random.org) about half the time; a minimal sketch of the test appears below. Compare this to GPT-4's supposed ability to reverse lists of 20 numbers fairly well; ChatGPT 3.5 also seemed to have no trouble, although since ChatGPT 3.5 isn't a base model, this comparison could potentially be invalid.
This significantly updates me towards believing that this model is probably not better than GPT-4.

[This comment is no longer endorsed by its author]
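A minimal sketch of the list-reversal check described above, for anyone who wants to reproduce it (the prompt wording and the answer parsing here are my own assumptions, not @Sergii's exact benchmark):

```python
import random

def make_reversal_prompt(n=10, lo=1, hi=10, seed=None):
    """Build a list-reversal test prompt from n random integers in [lo, hi]."""
    rng = random.Random(seed)
    numbers = [rng.randint(lo, hi) for _ in range(n)]
    prompt = f"Reverse the following list of numbers and output only the reversed list: {numbers}"
    return numbers, prompt

def check_reversal(numbers, model_output):
    """Return True if the model's reply contains exactly the reversed list."""
    expected = list(reversed(numbers))
    cleaned = model_output.replace(",", " ").replace("[", " ").replace("]", " ")
    got = [int(tok) for tok in cleaned.split() if tok.lstrip("-").isdigit()]
    return got == expected

numbers, prompt = make_reversal_prompt(n=10, seed=0)
print(prompt)  # paste into Chatbot Arena, then score the reply:
# check_reversal(numbers, "<model reply here>")
```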
3O O
Seems correct to me (and it did work for a handful of 10-integer lists I manually came up with). More impressively, it does this correctly as well:

I looked a little into the literature on how much alcohol consumption actually affects rates of oral cancers in populations with ALDH polymorphism, and this particular study (found in this meta-analysis) seems helpful for modelling how the likelihood of oral cancer increases with alcohol consumption for this group of people.

The specific categories of drinking frequency don't seem very convenient here, given that the split was between drinking <=4 days a week, drinking >=5 days a week with less than 46g of ethanol per week, and drinking >=... (read more)
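For a rough sense of scale on that 46g/week cutoff, here is a quick conversion (assuming the common ~14g US standard drink; the study itself may define a drink differently):

```python
# Convert the study's weekly-ethanol cutoff into standard drinks.
# Assumes a US standard drink of ~14 g ethanol; other definitions
# (e.g. 10 g in some countries) would shift these numbers.
GRAMS_PER_STANDARD_DRINK = 14.0

cutoff_g_per_week = 46.0
drinks = cutoff_g_per_week / GRAMS_PER_STANDARD_DRINK
print(f"{cutoff_g_per_week:.0f} g/week is about {drinks:.1f} standard drinks/week")
```

So the "light" end of the >=5-days-a-week category works out to roughly three standard drinks spread over a whole week.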

9Lao Mein
The Alcohol Flushing Response: An Unrecognized Risk Factor for Esophageal Cancer from Alcohol Consumption - PMC (nih.gov)

There are a lot of studies regarding the association between ALDH2 deficiency and oral cancer risk. I think part of the issue is that:

1. AFR people are less likely to become alcoholics, or to drink alcohol at all.
2. Japanese in particular have a high proportion of ALDH2 polymorphism, leading to subclinical but still biologically significant levels of acetaldehyde increase after drinking among the non-AFR group.
3. Drinking even small amounts of alcohol when you have AFR is really, really bad for cancer risk.
4. Note that ALDH2 deficiency homozygotes would have the highest levels of post-drinking acetaldehyde but have the lowest levels of oral cancer, because almost none of them drink. As in, out of ~100 homozygotes, only 2 were recorded as light drinkers, and none as heavy drinkers. This may be survivorship bias, as the definition of heavy drinking may literally kill them.
5. The source for #4 looks like a pretty good meta-study, but some of the tables are off by one for some reason. Might just be on my end.
6. ADH polymorphism is also pretty common in Asian populations, generally in the direction of increased activity. This results in faster conversion of ethanol to acetaldehyde, but often isn't included in these studies. This isn't really relevant for this discussion, though.

As always, biostatistics is hard! If X causes less drinking, drinking contributes to cancer, and X increases drinking's effects on cancer, then X may have a positive, neutral, or negative overall correlation with cancer (see the toy simulation below). Most studies I've looked at had a pretty strong correlation between ALDH2 deficiency and cancer though, especially after you control for alcohol consumption. It also looks like most researchers in the field think the relationship is causal, with plausible mechanisms.
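To illustrate that sign-ambiguity, here is a toy simulation (all parameters are made up for illustration, not taken from any study): a variant X that suppresses drinking but multiplies drinking's effect on cancer can come out looking protective or harmful overall, depending on how strongly it suppresses drinking.

```python
import random

def simulate(n=200_000, p_x=0.3, drink_rate_x=0.1, drink_rate_no_x=0.5,
             base_risk=0.01, drink_risk=0.02, x_multiplier=4.0, seed=0):
    """Toy model: X lowers the chance of drinking but multiplies the
    extra cancer risk from drinking. Returns corr(X, cancer)."""
    rng = random.Random(seed)
    xs, cancers = [], []
    for _ in range(n):
        x = rng.random() < p_x
        drinks = rng.random() < (drink_rate_x if x else drink_rate_no_x)
        extra = drink_risk * (x_multiplier if x else 1.0) if drinks else 0.0
        xs.append(x)
        cancers.append(rng.random() < base_risk + extra)
    # Pearson correlation between two 0/1 variables.
    mx, mc = sum(xs) / n, sum(cancers) / n
    cov = sum((a - mx) * (b - mc) for a, b in zip(xs, cancers)) / n
    return cov / (mx * (1 - mx) * mc * (1 - mc)) ** 0.5

print(simulate(drink_rate_x=0.02))  # strong suppression: negative corr (X looks protective)
print(simulate(drink_rate_x=0.40))  # weak suppression: positive corr (X looks harmful)
```

In this toy model, conditioning on drinking status recovers the positive within-stratum effect of X, which is why controlling for alcohol consumption matters so much in these studies.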

One other interesting quirk of your model of green is that most of the central (and natural) examples of green for humans seem to involve the utility function box adapting to these stimulating experiences, so that the utility function becomes positively correlated with the way the latent variables change over the course of an experience. In other words, the utility function gets "attuned" to the result of that experience.

For instance, taking the Zadie Smith example from the essay, her experience of greenness involved starting to appreciate the effect tha... (read more)
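One way to make that correlational reading concrete (my own hypothetical formalization; it isn't in the essay or the comment above): writing z_t for the latent state during the experience and U for the post-adaptation utility function, attunement would be roughly

```latex
% Hypothetical formalization: after adaptation, utility rises along the
% trajectory the latent variables actually took during the experience.
\mathbb{E}\!\left[ \frac{d}{dt}\, U(z_t) \right] > 0
\quad \text{averaged over the course of the experience.}
```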

2Steven Byrnes
For example, if you didn’t know that walking near a wasp nest is a bad idea, and then you do so, then I guess you could say “some part of the world comes forward … strangely new, and shining with meaning”, because from now on into the future, whenever you see a wasp nest, it will pop out with a new salient meaning “Gah those things suck”. You wouldn’t use the word “attunement” for that obviously. “Attunement” is one of those words that can only refer to good things by definition, just as the word “contamination” can only refer to bad things by definition (detailed discussion here).