Your novel architecture should be parameter-compatible with standard architectures
Some people work on "novel architectures" — alternatives to the standard autoregressive transformer — hoping that labs will be persuaded the new architecture is nicer/safer/more interpretable and switch to it. Others think that's a pipe dream, so the work isn't useful.
I think there's an approach to novel architectures that might be useful, but it probably requires a specific desideratum: parameter compatibility.
Say the standard architecture F computes F(P,x) where x is the input and P is the parameterisation. Your novel architecture computes G(P,x). The key desideratum is that F and G share the same parameterisation P, and you can gracefully switch between F and G during training and inference. That is, on most training batches you optimise P by backpropagating through F(·,x); on some batches you optimise P by backpropagating through G(·,x). At inference time, you can likewise choose F or G per forward pass.
This is strictly more general than "replace F with G". You have two independent dials: what proportion of training steps use G, and what proportion of inference steps use G. You might use G only during training (as a regulariser on P), only during inference (to get some safety property at deployment), or some mixture of both. Setting both dials to 100% recovers wholesale replacement; setting both to 0% recovers standard training.
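Here's a minimal sketch of what that could look like in code, assuming PyTorch. The class name `DualPathMLP`, the dials `p_train_g` / `p_infer_g`, and the particular F/G pair (a gated MLP versus the same weights with the gate's nonlinearity dropped) are all illustrative, not taken from any existing codebase or paper.

```python
import torch
import torch.nn as nn

class DualPathMLP(nn.Module):
    """One parameterisation P (here W, V, out) read by two architectures, F and G."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.W = nn.Linear(d_model, d_hidden, bias=False)
        self.V = nn.Linear(d_model, d_hidden, bias=False)
        self.out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor, use_g: bool = False) -> torch.Tensor:
        if use_g:
            gate = self.V(x)                 # G(P, x): drop the gate's nonlinearity
        else:
            gate = torch.sigmoid(self.V(x))  # F(P, x): standard gated MLP
        return self.out(self.W(x) * gate)


model = DualPathMLP()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

p_train_g = 0.1   # dial 1: fraction of training batches routed through G
p_infer_g = 1.0   # dial 2: fraction of inference calls routed through G

for step in range(1000):
    x = torch.randn(32, 64)                       # stand-in for a real training batch
    use_g = torch.rand(()).item() < p_train_g     # pick F or G for this batch
    loss = model(x, use_g=use_g).pow(2).mean()    # stand-in for the real loss
    opt.zero_grad()
    loss.backward()                               # gradients land in the shared P either way
    opt.step()

with torch.no_grad():
    x = torch.randn(32, 64)
    y = model(x, use_g=torch.rand(()).item() < p_infer_g)
```

The point is that both branches backpropagate into the same P, so whatever fraction of steps you route through G still shapes the one shared parameterisation.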
It's even better if you can interpolate between F and G via a continuous parameter α, i.e. there is a family H such that H(P,x,α) = F(P,x) when α = 0 and H(P,x,α) = G(P,x) when α = 1. Then α becomes a continuous dial you can set independently for each batch during training and for each forward pass at deployment.
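One easy (though certainly not the only) way to get such a family is to blend the two outputs; for some F/G pairs you can instead interpolate inside the architecture, e.g. annealing the gate's nonlinearity itself. A sketch, with a toy gated/ungated pair standing in for F and G:

```python
import torch
import torch.nn as nn

def h(f, g, x: torch.Tensor, alpha: float) -> torch.Tensor:
    # H(P, x, alpha) = (1 - alpha) * F(P, x) + alpha * G(P, x):
    # equals F at alpha = 0 and G at alpha = 1, and gradients flow
    # into the shared parameters through both branches.
    return (1.0 - alpha) * f(x) + alpha * g(x)

# Toy F/G pair sharing the same parameters W, V (illustrative, as above).
W = nn.Linear(16, 32, bias=False)
V = nn.Linear(16, 32, bias=False)
f = lambda x: W(x) * torch.sigmoid(V(x))   # F: gated
g = lambda x: W(x) * V(x)                  # G: same weights, nonlinearity dropped

x = torch.randn(4, 16)
y = h(f, g, x, alpha=0.3)   # alpha can be chosen per batch or per forward pass
```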
Bilinear MLPs (Pearce et al., 2025) are a good example. F uses a standard gated MLP: f(x) = (Wx) ⊙ σ(Vx). G drops the elementwise nonlinearity: g(x) = (Wx) ⊙ (Vx). The lack of nonlinearity in G means the layer can be expressed as a third-order tensor, enabling weight-based mechanistic interpretability.
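To make the third-order-tensor point concrete, here is a quick numerical check (dimensions made up) that the bilinear map (Wx) ⊙ (Vx) is exactly a tensor B with entries B[k,i,j] = W[k,i]·V[k,j], contracted with x twice:

```python
import torch

d_in, d_hidden = 8, 16
W = torch.randn(d_hidden, d_in)
V = torch.randn(d_hidden, d_in)

# The bilinear layer g(x) = (W x) * (V x), written elementwise:
#   g_k(x) = sum_{i,j} W[k, i] * V[k, j] * x_i * x_j
# so the whole layer is one third-order tensor B with B[k, i, j] = W[k, i] * V[k, j].
B = torch.einsum('ki,kj->kij', W, V)

x = torch.randn(d_in)
g_direct = (W @ x) * (V @ x)                     # the layer as computed in the forward pass
g_tensor = torch.einsum('kij,i,j->k', B, x, x)   # the same map, read off the weights alone

assert torch.allclose(g_direct, g_tensor, atol=1e-5)
```

All of the layer's input-output behaviour is then carried by the entries of B, which is what opens it up to analysis from the weights alone.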