Comments

It also sounds like a piece of paper, or a map, or a person having vivid hallucinations before falling asleep. But unless you have a whiteboard which can be copied among several hundred people, and teleport, and be rolled up to fit in a jean pocket; which can be time-traveled, so you can look at what used to be on the whiteboard or at what people might write on it in the future; which isn't white at all, because there's a colored map printed on it, nor is it a single board but arbitrarily many; which has a ledger-book next to it which writes itself; and so on, I would suggest that this does not 'sound like a whiteboard' to most people. (No, not even a Biblically-accurate whiteboard.)

Yes, there are a lot of computer-related ones, depending on how fine-grained you get. (There's a similar issue with my "Ordinary Life Improvements": depending on how you do it, you could come up with a bazillion tiny computer-related 'improvements', which just degenerates into 'enumerating every thing ever involving a transistor in any way' and is not enlightening in the way that, say, 'no indoors smoking' or 'fresh mango' is.) So I would just lump that one under 'Machine Configuration/Administration § Software' as one of the too-obvious-to-be-worth-mentioning hacks.

How did you check Claude's claims here?

Idea: "Conferences as D&D tabletops": you may be able to better organize a conference or convention by borrowing a tool from tabletop roleplaying games - players collaborate by directly manipulating or modifying a 2D map. It seems to me like this could be low-friction and flexibly handles a lot of things that existing 'conware' design patterns don't handle well.

I have not done any work directly on it. The LLMs have kept improving so rapidly since then, especially at coding, that it has not seemed like a good idea to work on it.

Instead, I've been thinking more about how to use LLMs for creative writing or personalization (cf. my Dwarkesh Patel interview, "You should write more online"). To review the past year or two of my writings:

Obviously, I've also been doing a lot of regular writing, and working on the Gwern.net website infrastructure - adding the 'blog' feature has been particularly important, but just getting the small details right on things like "October The First" takes up plenty of time. But the overall through-line is, "how can we start getting meaningful creative work out of LLMs, rather than sleepwalking into the buzzsaw of superhuman coders creating Disneyland-without-children where all the esthetics is just RLHF'd AI slop?"

* This seems particularly useful for fiction. I'm working on a write-up of an example with a Robin Sloan microfic where the LLM suggestions get better if you negate them, and particularly if you order the LLM to think about why the suggestions were bad and what that implies before it makes any new suggestions (as sketched below). This suggests, in conjunction with the success of the 'brainstorm' prompt, that a major failing of LLMs right now is that they treat corrections/feedback/suggestions in a 'superficial' manner, because the reasoning-mode doesn't kick in when it should. Interestingly, 'superficial' learning may also be why dynamic-evaluation/finetuning seems to underperform (https://arxiv.org/abs/2505.01812 https://arxiv.org/abs/2505.00661#google): adding paraphrases or Q&A to the finetuning data improves performance, even though it cannot add any new information. This is reminiscent of engrams/traces in human memory - you can have memorized things, but not be able to recall them, if there aren't enough 'paths' to the memory.
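
To gesture at what that prompting loop looks like, here is a minimal sketch in Python; `complete()` stands in for whatever chat-LLM API call you prefer, and the prompt wording is illustrative, not the exact prompts from the Sloan experiment:

```python
def complete(prompt: str) -> str:
    """Hypothetical placeholder: plug in an actual LLM API client here."""
    raise NotImplementedError

def reflective_suggestions(story: str, prior_suggestions: list[str]) -> str:
    """Force the model to critique its own rejected suggestions before retrying."""
    critique = complete(
        "These suggested edits to the story below were all bad:\n"
        + "\n".join(f"- {s}" for s in prior_suggestions)
        + "\n\nStory:\n" + story
        + "\n\nExplain *why* each suggestion was bad, and what that implies "
          "about what the story is actually doing."
    )
    # Only after the forced reasoning step do we ask for new suggestions,
    # conditioning on the critique so the model can't answer superficially.
    return complete(
        "Story:\n" + story
        + "\n\nCritique of earlier failed suggestions:\n" + critique
        + "\n\nGiven that critique, propose new edits."
    )
```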

I was trying out a hierarchical approach when I stopped, because I wasn't sure I could trust a LLM to rewrite a whole input without dropping any characters or making unintended rewrites. Aside from being more scalable in theory, and potentially better because each step is easier and the sorting propagates top-down, if you explicitly turn it into a tree, you can easily check that you get back an exact permutation of the list at each step, and so that the rewrite was safe. I think that might be unnecessary at this point, given the steady improvement in prompt adherence, so maybe the task is now trivial.
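
For concreteness, a minimal sketch of that tree version, assuming a hypothetical `llm_sort()` wrapper around the actual API call; the point is the cheap permutation check at every node:

```python
from collections import Counter

def llm_sort(items: list[str], instruction: str) -> list[str]:
    """Hypothetical wrapper: prompt a LLM to reorder `items` per `instruction`
    and parse the reply back into a list."""
    raise NotImplementedError("plug in an actual LLM call here")

def check_permutation(before: list[str], after: list[str]) -> list[str]:
    # The invariant that makes each rewrite safe to trust: the output must be
    # an exact permutation of the input - nothing dropped, duplicated, or edited.
    assert Counter(after) == Counter(before), "LLM dropped or altered items"
    return after

def seriate(items: list[str], instruction: str, fanout: int = 20) -> list[str]:
    """Hierarchical seriation: small lists are sorted directly; larger lists
    are chunked, the chunks are ordered via representative items, and the
    sort recurses top-down into each chunk."""
    if len(items) <= fanout:
        return check_permutation(items, llm_sort(items, instruction))
    chunks = [items[i:i + fanout] for i in range(0, len(items), fanout)]
    reps = [chunk[0] for chunk in chunks]  # crude stand-in: first item per chunk
    order = check_permutation(reps, llm_sort(reps, instruction))
    chunks = [chunks[reps.index(r)] for r in order]  # assumes unique reps
    return [x for chunk in chunks for x in seriate(chunk, instruction, fanout)]
```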

There are no explicit distances calculated; it's just asking the LLM to sort the list meaningfully.

Very funny, but the OA embeddings were always bad at sentence embedding specifically, compared to other sentence-specialized NN embeddings; and as the original OA embedding paper somewhat defensively argues, it's not even clear a priori what a sentence embedding should do, because a sentence is such a cut-down piece of text, and doing well at a sentence-embedding task may only be overfitting, or come at the cost of performance on more meaningful text-embedding tasks. (Similar to word embeddings: words are so polysemous or context-dependent that word embeddings seem to have substantial inherent limits - which is part of the motivation for Transformers in the first place, after all...)

That's why I was experimenting with prompting a LLM to do seriation rewrites (instead of just splitting on punctuation to reuse my existing greedy-pairwise approach, and being done with it). A prompted LLM takes the full context and purpose into consideration, and avoids the issues with bad embeddings of very small texts. So the seriation outputs aren't crazily random, but sensible. This also helps avoid issues like Algon's, where a general-purpose embedding, blind to context or purpose, winds up emphasizing something you don't care about: if Algon had been able to prompt a seriation, like 'sort by theme', the LLM would almost certainly not try to seriate his little Q&A question set by the question formatting, but would organize it by topic - biology, then chemistry, then physics, say. And if it doesn't, then it's easy to add more context or instructions. (There are promptable embedders, but they are much more exotic and not necessary here.)
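
The kind of prompt I mean might look like this (wording purely illustrative) - the instruction carries the context and purpose that a general-purpose embedding can't see:

```python
def seriation_prompt(items: list[str], purpose: str) -> str:
    """Build a context-aware seriation prompt; the instruction text is
    illustrative, not a canonical recipe."""
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    return (
        f"Sort the following list {purpose}, ignoring surface formatting "
        "(eg. whether an item happens to start with 'What is...').\n"
        "Return only the reordered list, with every item reproduced verbatim.\n\n"
        + numbered
    )

# eg. for a Q&A set like Algon's:
# seriation_prompt(questions, "by theme, grouping related topics together")
```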

(Which makes sense, because if you ask a LLM to sort a list of items in a freeform normal way, like a chat session, it is capable of it; for my poetry selection the other day, "Bell, Crow, Moon: 11 Variations", I had Claude/Gemini/GPT suggest how exactly to sort the 11 poems we curated into a pleasing sequence, and they did come up with a much nicer poetry sequence than the original random one. And why wouldn't they be able to do that, when they were good enough to write most of the poems in the first place?)

Yeah, it's limited by what kind of structure you have. It did seriate your list successfully, it sounds like; it's just that the list has a lot of structure you don't care about, so no embedding is going to prioritize the other stuff, and the distances aren't useful to you in general. This will hurt any embedding-related use-case, not just seriation - presumably your k-NN lookups aren't terribly useful either, and mostly just pull up hits which have superficial syntactic similarities.

This is probably less of a problem with my annotations, because I reformat them before embedding and add in all available metadata - not just the tags, or the titles of links in it as a link-bibliography, but also tricks like including the titles of reverse-citations of it, so the more an annotation gets linked, the more its embedding reflects its usage. Thus the formatting is uniform (nothing like "half of them start with 'what is X' and half don't"), and there's a lot of very semantic information.
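
Schematically, the text that actually gets embedded is assembled something like this (field names are hypothetical stand-ins, not the real Gwern.net pipeline):

```python
def embedding_text(annotation: dict) -> str:
    """Flatten an annotation plus all available metadata into one uniform
    string before embedding."""
    parts = [
        annotation.get("title", ""),
        ", ".join(annotation.get("tags", [])),
        annotation.get("abstract", ""),
        # Titles of everything the annotation links to (its link-bibliography):
        "Links: " + "; ".join(annotation.get("outbound_titles", [])),
        # The trick mentioned above: titles of pages which cite *this*
        # annotation, so the more it gets linked, the more its embedding
        # reflects its actual usage.
        "Cited by: " + "; ".join(annotation.get("reverse_citation_titles", [])),
    ]
    return "\n".join(p for p in parts if p)
```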

As I've said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).

All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It's not magic, and can obviously be encoded into a LLM's generative process.

Here's a concrete example of how, in principle, I think a LLM can do AlphaZero-style expert iteration for Go. A LLM can serialize a board with value estimates as just a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means a frontier LLM like Claude-4-opus with 200k ctx can easily fit 200 board states. So it can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves, imagine the resulting board states, and then imagine their next 14 top moves; for comparison, TD-Gammon looked ahead only ~1 move), back-propagate an updated value estimate, and spit out the original board state with better value estimates: "Move #4 was better than it looked, so I will add +0.01 to the value estimate for it." This improved board is now in context, and can be dynamically-evaluated to update the LLM: it now has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next; and if the next board state is the end of the game, start over with a fresh game. Run this indefinitely.
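
To make the serialization concrete, a toy sketch (the format is entirely hypothetical; only the sizes matter):

```python
def serialize_state(board: list[str], values: list[float], metadata: str) -> str:
    """Serialize one Go position as a flat token sequence: 361 point states,
    361 value estimates, plus metadata. At roughly a few hundred tokens per
    state, ~200 such states fit in a 200k-token context window."""
    assert len(board) == 361 and len(values) == 361
    return (
        "BOARD " + "".join(board)                        # eg. 'b', 'w', '.'
        + " VALUES " + " ".join(f"{v:+.2f}" for v in values)
        + " META " + metadata                            # eg. move number, komi
    )
```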

It repeatedly iterates through a possible game, evaluating each position to a certain depth, updating its weights to incorporate the policy improvement from the evaluation, and restarting with a fresh game - all serialized out as one long array/sequence, the tree being only implicitly represented by successive board states. (And with that in mind, you can imagine how to do things like deep rollouts: 200 moves is around a normal game of Go, so random rollouts are doable from most board states, and the LLM can toggle between a shallow tree search and deep randomized rollouts if necessary, eg. by adding a 0/1 token prefix.)
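
Or as pseudocode for one cycle of the loop (every `llm.*` method here is a hypothetical placeholder, not a real API):

```python
def expert_iteration_step(llm, game_state):
    """One cycle of the serialized, tree-free expert iteration sketched above."""
    # 1. Lookahead: serialize the top-k candidate moves and their resulting
    #    board states into context (the 'tree' exists only as successive
    #    serialized positions, eg. 14 moves x 2 plies).
    context = llm.expand_moves(game_state, top_k=14, depth=2)
    # 2. Back up the leaf evaluations into improved value estimates for the
    #    original position ("Move #4 was better than it looked: +0.01").
    improved = llm.backup_values(context)
    # 3. Policy improvement: dynamic evaluation/finetuning on the improved
    #    serialization, so the LLM now predicts the better estimates directly.
    llm.finetune_on(improved)
    # 4. Set up the next planning step: descend to a deeper position, or
    #    start a fresh game if this one has ended.
    return llm.next_position(improved)
```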

At no point do you need explicit tree scaffolding as you bootstrap from a LLM clueless about playing Go up to the high performance that we know LLMs trained by imitation learning on board states/values/policies can reach; and at no point have I invoked a cognitive operation which is not easier than lots of things we see LLMs do routinely, or which it's implausible they could do. It is probably a lot less efficient, and has other practical issues (like how you integrate the rules of Go, akin to AlphaZero/MuZero), etc., but in principle I think this algorithm is well-defined, concrete, and would work.
