faul_sname


I guess you could choose to read it that way, but I'm not sure why you would - seems like an assumption of bad faith that doesn't feel justified to me, especially on LW.

Saying "I don't like that you invited this person, and I think you shouldn't have, and I think you should reverse that decision, and it's on you if you ignore my advice and it goes poorly" doesn't seem like it's in bad faith to me. Caving to such bids seems like it would invite more such bids in the future, but I don't think making such bids is particularly norm-breaking.

Did we read the same OP?

I don't like what he's about, I think the rationalist community can do better, and I do not want to be a special guest at the same event he's a special guest at. I hope that LessOnline goes well and that those who do go have a great time, and that my assessment is completely off-base. I mean, I don't think it is, but I hope so.

This sounds to me like "hint hint I think you guys should disinvite him, and if it goes badly I will say that I told you so".

A bet here is to revisit revenue per deployed H100 by the end of 2027. If the number exceeds one hundred thousand dollars, the efficiency threshold the short camp predicts has arrived. If it remains close to today's ten thousand, the sceptics will have been vindicated.

I note that CoreWeave has 250,000 GPUs, mostly Hopper series. If net revenue per H100-GPU-year 10xs, they will be in a very good position to benefit from that. My extremely not rigorous back-of-the-envelope estimate is that the CoreWeave valuation makes sense if H-series GPU prices are expected to drop by 10%-30% per year for the foreseeable future. If you instead expect the price of H100s to rise by a factor of 10, that implies that CoreWeave, a public company, is currently trading at far below the correct price.

Not incredibly strong evidence but it does seem like we have a kind-of-prediction-market-ish thing here on net revenue per H100, and the prediction it's making is quite clear.
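
To make that back-of-the-envelope reasoning concrete, here is a rough sketch. The 250,000-GPU fleet size is the figure quoted above; the starting revenue per GPU-year, discount rate, and fleet lifetime are placeholder assumptions of mine, not CoreWeave's actual financials.

```python
# Rough sketch of the back-of-the-envelope estimate described above.
# Only the fleet size comes from the comment; the starting revenue,
# discount rate, and horizon are placeholder assumptions.

FLEET_SIZE = 250_000      # H-series GPUs (figure quoted above)
START_REVENUE = 10_000    # assumed net revenue per GPU-year today, USD
DISCOUNT_RATE = 0.10      # assumed cost of capital
HORIZON_YEARS = 6         # assumed remaining useful life of the fleet


def discounted_fleet_revenue(annual_change: float) -> float:
    """Discounted total fleet revenue if per-GPU revenue changes by
    `annual_change` (e.g. -0.30 for -30%/yr) each year."""
    total = 0.0
    revenue_per_gpu = START_REVENUE
    for year in range(1, HORIZON_YEARS + 1):
        total += FLEET_SIZE * revenue_per_gpu / (1 + DISCOUNT_RATE) ** year
        revenue_per_gpu *= 1 + annual_change
    return total


scenarios = {
    "-30%/yr (prices keep falling fast)": -0.30,
    "-10%/yr (prices fall slowly)": -0.10,
    "+150%/yr (~10x by end of 2027)": 1.50,
}
for label, change in scenarios.items():
    print(f"{label:>37}: ${discounted_fleet_revenue(change) / 1e9:,.1f}B")
```

The point is just that the implied fleet value differs by well over an order of magnitude between the "prices keep falling" scenarios and the "10x by 2027" scenario, so a market price consistent with the former is evidence against the latter.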

Also if the LLM tells you that an existing word points at the concept you asked about, open up a fresh chat without memory and ask what the term it just suggested means.

They test the perplexity (Figure 6, left, below) of the RLed model's generations (pink bar) relative to the base model. They find it is lower than the base model's perplexity (turquoise bar), which “suggests that the responses from RL-trained models are highly likely to be generated by the base model” conditioned on the task prompt. (Perplexity higher than the base model's would imply the RLed model had either new capabilities or higher diversity than the base model.)


Does "lower-than-base-model perplexity" suggest "likely to be generated by the base model conditioned on the task prompt"? Naively I would expect that lower perplexity according to the base model just means less information per response token, which could happen if the RL-trained model took more words to say the same thing. For example, if the reasoning models have a tendency to restate substrings of the original question verbatim, the per-token perplexity on those substrings will be very close to zero, and so the average perplexity of the RL'd model outputs would be expected to be lower than base model outputs even if the load-bearing outputs of the RL'd model contained surprising-to-the-base-model reasoning.

Still, this is research out of Tsinghua and I am a hobbyist, so I'm probably misunderstanding something.

Your post does not seem very LLM-style at all to me. It does, however, feel very long. I think the title of the post may also have been lost in translation - based on the content of the post I suggest something like "Peak Effort Comes Before Peak Pain".

Also, I think a lot of the ideas in the post have been discussed pretty extensively here already. Not all of them, though. I think you probably would have gotten 10-20 upvotes if you had asked the LLM you used for translation to condense your post down to 3 paragraphs while maintaining the core insights (and then, of course, verified that it did in fact keep the core insights).

Whatever you did with translation did succeed in avoiding the LLM slop attractor, so however you managed that, keep doing it (and, if it was something intentional and repeatable, maybe write a post about that, since the topic is of interest to many here).

I agree that that's the premise. I just think that our historical track record of accuracy is poor when we say "surely we'll have handled all the dumb impediments once we reach this milestone". I don't expect automated ML research to be an exception.

When I'm working on a project, I've noticed a tendency in myself to correctly estimate the difficulty of my current subtask (in which I am almost always stuck on something that sounds dumb to be stuck on, rather than something that feels like "real" progress on the project), but then to assume that once I'm done resolving the current dumb thing, the rest of the project will be smooth sailing in terms of progress.

Anyway, I was just reading AI 2027, and it strikes me that our current task is to build an AI capable of doing AI research, and we're currently stuck on impediments that feel dumb and non-central, but once we finish that task, we expect the rest of the path to the singularity to be smooth sailing in terms of progress.

Edit: s/the path the the singularity/the path to the singularity/

Nope, although it does have a much higher propensity to exhibit GeoGuessr behavior on pictures taken on or next to a road when given ambiguous prompts (initial post, slightly more rigorous analysis).

I think it's possible (25%) that o3 was explicitly trained on exactly the GeoGuessr task, but more likely (40%) that it was trained on, e.g., minimizing perplexity on image captions, for which knowing the exact location of the image is useful; it evoked the "GeoGuessr" behavior in its reasoning chain once, that behavior was strongly reinforced, and now it does it whenever it could plausibly be helpful.
