I'm surprised that there hasn't been more of a shift to ternary weights à la BitNet b1.58.
What stood out to me in that paper was the perplexity gains over fp weights at matched parameter counts, and especially how that advantage grew as model size increased (though only up to quite small model sizes in that paper, which makes me curious about the potential delta at modern SotA scales).
This makes complete sense from the standpoint of the superposition hypothesis (irrespective of its dimensionality, which is still an ongoing discussion).
If nodes are serving more than one role in a network, then constraining each weight to a ternary value rather than a floating-point range should more often force the network to restructure overlapping node usage so that nodes align with shared directional shifts (positive, negative, or no-op), instead of compromising across multiple roles on a floating-point average of the individual role changes.
(Essentially resulting in a sharper rather than a fuzzier network mapping.)
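For concreteness, the quantization the paper describes is simple: scale each weight matrix by its mean absolute value, then round and clip every entry to {-1, 0, +1} (the "absmean" scheme). A minimal sketch in PyTorch of just that quantization step, assuming a plain weight tensor rather than the paper's actual training code:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Absmean scheme from the BitNet b1.58 paper: divide by the mean
    # absolute weight, then round and clip each entry to {-1, 0, +1}.
    gamma = w.abs().mean()
    return (w / (gamma + eps)).round().clamp(-1, 1)

w = torch.randn(4, 4)
print(absmean_ternary(w))  # every entry is -1., 0., or 1.
```

(During actual pretraining this runs in the forward pass with a straight-through estimator so gradients still flow to the latent full-precision weights; the sketch above shows only the quantization itself.)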
A lot of the attention the paper received centered on the efficiency gains from the smaller memory footprint, but it really seems like, even setting those aside, models pretrained from this point onward should seriously consider clamping weight precision, both to improve overall network performance and likely to make interpretability more successful down the road to boot.
It may be that at the scales we've already reached, the main offering of such an approach would be the perplexity advantage over fp weights, with the memory savings as the beneficial side effect instead?
I wonder if, with the next generations of multimodal models, we'll see a "rubber ducking" phenomenon where, because their self-attention spans multiple modalities, things like CoT and using outputs as a scratchpad show significantly improved performance in non-text streams.
Will GPT-4o, fed its own auditory outputs (with tonal cues and pauses) and processing them as an audio data stream, make connections or leaps it never would if just fed its own text outputs as context?
I think this will be the case, and suspect the various firms dedicating themselves to virtualized human avatars will accidentally stumble into profitable niches - not for providing humans virtual AI clones as an interface, but for providing AIs virtual human clones as an interface. (Which is a bit frustrating, as I really loathe that market segment right now.)
When Sci-Fi authors projected the future of AI cross-talk or self-talk, it was toward super-efficient beeping or binary transmission of pure data betwixt them.
But I increasingly get the sense that, like much of actual AI development over the past few years, a lot of the Sci-Fi thinking was tangential or even inverse to the eventual vector of progress, particularly in underestimating the inherent value humans bring to bear. The wonders we see developing around us were jumpstarted, and are continually enabled, by the patterns woven by ourselves, and it seems at least the near-future development of models will conform to those patterns more and more, not less and less.
Still, it's going to be bizarre as heck to watch a multimodal model's avatar debating itself aloud like I do in my kitchen...
When I wrote this, I thought OAI was sort of fudging the audio output by using SSML as an intermediate step.
After seeing details in the system card, such as the model copying the user's voice, it's clear nothing is being fudged.
Which makes me even more sure the above is going to end up prophetically correct.