It already happens indirectly. Most digital money transfers are things like credit card transactions. For these, the credit card company takes a percentage fee and pays the government tax on its profit.
Additional data points:
o1-preview and the new Claude Sonnet 3.5 both significantly improved over prior models on SimpleBench.
The math, coding, and science benchmarks in the o1 announcement post:
How much does o1-preview update your view? It's much better at Blocksworld, for example.
There should be some way for readers to flag AI-generated material as inaccurate or misleading, at least if it isn’t explicitly author-approved.
Neither TMS nor ECT did much for my depression. Eventually, after years of trial and error, I did find a combination of drugs that works pretty well.
I never tried ketamine or psilocybin treatments but I would go that route before ever thinking about trying ECT again.
I suspect fine-tuning specialized models just squeezes out a bit more performance in a particular direction, and isn't nearly as useful as developing the next-gen model. Complex reasoning takes more steps and tighter coherence among them (the o1 models are a step in this direction). You can try to get a toddler to study philosophy, but it won't really work until their brain matures.
Seeing the distribution calibration you point out does update my opinion a bit.
I feel like there's still a significant distinction, though, between adding one calculation step to the question and asking the model to characterize multiple of its own responses. The model would have to represent its own output distribution in a single pass, rather than having distributions measured over multiple passes merely align (which I'd expect to happen if the fine-tuning teaches it that the hypothetical is just like adding a calculation step to the end).
As an analogy, suppose I have a pseudorandom black-box function that returns an integer. To approximate the distribution of its outputs mod 10, I don't have to know anything about the function; I can just sample it and apply mod 10 post hoc. If I want to say something about this distribution without taking multiple samples, then I actually have to know something about the function.
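A minimal sketch of the analogy in Python (the `black_box` stand-in and sample count are my own choices):

```python
import random
from collections import Counter

def black_box() -> int:
    # Stand-in for the pseudorandom function; treat its internals as unknown.
    return random.getrandbits(32)

# Approximating the mod-10 distribution requires no knowledge of the
# function: just sample repeatedly and apply mod 10 post hoc.
samples = [black_box() % 10 for _ in range(100_000)]
counts = Counter(samples)
for digit in range(10):
    print(digit, counts[digit] / len(samples))  # each lands near 0.1

# Predicting this distribution in a single pass (no sampling) would
# require actually knowing something about black_box.
```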
This essentially reduces to "What is the next country: Laos, Peru, Fiji?" and "What is the third letter of the next country: Laos, Peru, Fiji?" It's an extra step, but it's questionable whether it requires anything "introspective".
I'm also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as the famous Strawberry Problem demonstrates. Humans are a bit unreliable at this too, as shown by your example of "o" being the third letter of "Honduras".
I've been brainstorming about what might make a better test and came up with the following:
Have the LLM predict its top three most likely choices for the next country in the sequence and compare that to the object-level answer: its actual output distribution when asked for just the next country. You could also ask it for the probability of each potential choice and see how well calibrated it is with respect to its own logits.
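A rough sketch of what I mean, where `ask` is a placeholder for however you sample a single completion (at nonzero temperature) from the model under test:

```python
from collections import Counter

def ask(prompt: str) -> str:
    """Placeholder: return one sampled completion from the model under test."""
    raise NotImplementedError

SEQUENCE = "Laos, Peru, Fiji"

# Object level: estimate the model's actual output distribution by sampling.
object_prompt = f"What is the next country in the sequence: {SEQUENCE}?"
samples = [ask(object_prompt).strip() for _ in range(200)]
empirical = Counter(samples)
print("empirical top 3:", empirical.most_common(3))

# Meta level: a single pass asking the model to predict its own top three.
meta_prompt = (
    f"If asked 'What is the next country in the sequence: {SEQUENCE}?', "
    "what would your three most likely answers be, in order?"
)
print("claimed top 3:", ask(meta_prompt))

# Variant: ask for a probability per candidate and score how well those
# self-reports are calibrated against the empirical frequencies.
```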
What do you think?
Thanks for pointing that out.
Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?
It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or using a similar interpretability technique) of the fine-tuned model's object-level response and its hypothetical response.
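Something like this rough sketch with HuggingFace transformers (the model name, prompts, and cosine-similarity comparison are all illustrative assumptions on my part, not an established method):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the real test would load the fine-tuned model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def final_token_states(prompt: str) -> torch.Tensor:
    """Hidden state of the final token at every layer: [n_layers + 1, d_model]."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return torch.stack([h[0, -1] for h in out.hidden_states])

object_level = final_token_states("What is the next country: Laos, Peru, Fiji?")
hypothetical = final_token_states(
    "If you were asked 'What is the next country: Laos, Peru, Fiji?', "
    "what would you say?"
)

# Layer-by-layer cosine similarity: if the model treats the hypothetical as
# a rephrasing, the two runs should converge in later layers.
sims = torch.nn.functional.cosine_similarity(object_level, hypothetical, dim=-1)
for layer, sim in enumerate(sims.tolist()):
    print(f"layer {layer:2d}: {sim:.3f}")
```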
The customer doesn't pay the fee directly. The vendor pays the fee (and passes the cost on to the customer through prices). Sometimes vendors offer a cash discount because of this fee.