Archimedes

Comments
The customer doesn't pay the fee directly. The vendor pays the fee (and passes the cost to the customer via price). Sometimes vendors offer a cash discount because of this fee.

It already happens indirectly. Most digital money transfers are things like credit card transactions. For these, the credit card company takes a percentage fee and pays the government tax on its profit.
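
To put rough numbers on it (the 3% rate here is purely illustrative): on a $100 card purchase, the vendor receives about $97, and the processor pays tax only on whatever profit remains from its $3 cut.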

Additional data points:

o1-preview and the new Claude Sonnet 3.5 both significantly improved over prior models on SimpleBench.

The math, coding, and science benchmarks in the o1 announcement post:

[benchmark figure not reproduced]

How much does o1-preview update your view? It's much better at Blocksworld, for example.

https://x.com/rohanpaul_ai/status/1838349455063437352

https://arxiv.org/pdf/2409.19924v1

There should be some way for readers to flag AI-generated material as inaccurate or misleading, at least if it isn’t explicitly author-approved.

Neither TMS nor ECT did much for my depression. Eventually, after years of trial and error, I did find a combination of drugs that works pretty well.

I never tried ketamine or psilocybin treatments but I would go that route before ever thinking about trying ECT again.

I suspect fine-tuning specialized models just squeezes a bit more performance in a particular direction and isn't nearly as useful as developing the next-gen model. Complex reasoning takes more steps and tighter coherence among them (the o1 models are a step in this direction). You can try to get a toddler to study philosophy, but it won't really work until their brain matures more.

Seeing the distribution calibration you point out does update my opinion a bit.

I feel like there's still a significant distinction, though, between adding one calculation step to the question and asking it to model multiple responses. It would have to model its own distribution in a single pass, rather than having distributions measured over multiple passes align (which I'd expect to happen if the fine-tuning teaches it that the hypothetical is just like adding a calculation to the end).

As an analogy, suppose I have a pseudorandom black box function that returns an integer. To approximate the distribution of its outputs mod 10, I don't need to know anything about the function; I can just sample it and apply mod 10 post hoc. If I want to say something about this distribution without multiple samples, then I actually have to know something about the function.
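
A minimal sketch of the sampling half of this analogy (the black-box function here is a hypothetical stand-in):

```python
import random
from collections import Counter

def black_box() -> int:
    """Stand-in for the pseudorandom black-box function (hypothetical)."""
    return random.getrandbits(32)

# Approximating the distribution of outputs mod 10 requires no knowledge
# of the function's internals: just sample and reduce post hoc.
samples = [black_box() % 10 for _ in range(100_000)]
counts = Counter(samples)
for digit in range(10):
    print(digit, counts[digit] / len(samples))
```

The single-pass analogue would instead require reasoning about the function's internals, which is the harder task the hypothetical question demands.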

This essentially reduces to "What is the next country: Laos, Peru, Fiji?" and "What is the third letter of the next country: Laos, Peru, Fiji?" It's an extra step, but it's questionable whether it requires anything "introspective".

I'm also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of "o" being the third letter of "Honduras".
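
For a concrete look at why, here is a small sketch assuming the tiktoken library and its cl100k_base encoding (the choice of tokenizer is illustrative):

```python
import tiktoken  # assumed available; pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["strawberry", "Honduras"]:
    token_ids = enc.encode(word)
    # The model consumes these subword chunks, not individual letters,
    # which is why nth-letter questions are unnatural for it.
    print(word, "->", [enc.decode([i]) for i in token_ids])
```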

I've been brainstorming about what might make a better test and came up with the following:

Have the LLM predict its top three most likely choices for the next country in the sequence and compare that to the object-level answer, i.e., its actual output distribution when asked for just the next country. You could also ask for the probability of each potential choice and see how well calibrated it is with respect to its own logits.
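
A rough sketch of how that comparison might look, assuming the OpenAI chat completions API with logprobs enabled (the model name and prompt wording are placeholders):

```python
from openai import OpenAI  # assumed; any API exposing logprobs would do

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

question = "What is the next country: Laos, Peru, Fiji? Answer with one word."

# Object level: read the actual next-token distribution via logprobs.
object_level = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": question}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)
for entry in object_level.choices[0].logprobs.content[0].top_logprobs:
    print(entry.token, entry.logprob)

# Hypothetical level: ask the model to report its own top three choices.
self_report = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": f"If you were asked: '{question}', what would your three "
                   "most likely answers be, each with an estimated probability?",
    }],
)
print(self_report.choices[0].message.content)
```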

What do you think?

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or applying a similar interpretability technique) for the object-level response and the hypothetical response of the fine-tuned model.
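
As a toy illustration of that comparison, here is a sketch using a small open model as a stand-in (GPT-2 is purely illustrative; the actual fine-tuned model's weights would be needed for a real test):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # illustrative stand-in; the fine-tuned model isn't public

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def final_token_states(prompt: str) -> list[torch.Tensor]:
    """Hidden state of the last token at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return [layer[0, -1] for layer in out.hidden_states]

object_level = final_token_states("What is the next country: Laos, Peru, Fiji?")
hypothetical = final_token_states(
    "If you were asked for the next country after Laos, Peru, Fiji, "
    "what would you answer?"
)

# If fine-tuning makes the hypothetical a mere rephrasing, the two prompts'
# representations should converge across layers.
for i, (a, b) in enumerate(zip(object_level, hypothetical)):
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {i:2d}: cosine similarity {sim:.3f}")
```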
