I think there's something wrong with your analysis of the longer/shorter survey data.
[EDITED to add:] ... and, having written this and gone back to read the comments on your post, I see that someone there has already said almost exactly the same as I'm saying here. Oh well.
You start out by saying that you should write longer posts if 25% more readers prefer long than prefer short (and similarly for writing shorter posts).
Then you consider three hypotheses: that (as near as possible to) exactly 25% more prefer long than prefer short, that (as near as possible to) exactly 25% more prefer short, and that the numbers preferring long and preferring short are equal.
And you establish that your posterior probability for the first of those is much bigger than for either of the others, and say
Our simple analysis led us to an actionable conclusion: there’s a 97% chance that the preference gap in favor of longer posts is closer to 25% than to 0%, so I shouldn’t hesitate to write longer posts.
Everything before the last step is fine (though, as you do remark explicitly, it would be better to consider a continuous range of hypotheses about the preference gap). But surely the last step is just wrong in at least two ways.
With any reasonable prior, I think the data you have make it extremely unlikely that the preference gap is at least 25%.
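Here's a rough sketch of what I mean, with invented counts (I don't have your actual tallies), using a continuous posterior over the gap instead of your three point hypotheses. "Closer to 25% than to 0%" only means the gap exceeds 12.5%, which is a much weaker claim than the gap being at least 25%:

```python
# Sketch with invented counts (replace with the real survey tallies).
# Flat Dirichlet prior over (prefer longer, no preference, prefer shorter),
# then draw from the posterior and look at the implied preference gap.
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([84, 60, 56])   # hypothetical: gap = (84 - 56) / 200 = 14%

samples = rng.dirichlet(1 + counts, size=200_000)   # posterior draws
gap = samples[:, 0] - samples[:, 2]                 # P(prefer longer) - P(prefer shorter)

print(f"P(gap > 12.5% | data) ~ {np.mean(gap > 0.125):.2f}")  # 'closer to 25% than to 0%'
print(f"P(gap >= 25% | data)  ~ {np.mean(gap >= 0.25):.2f}")  # far smaller
```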
[EDITED to add:] Oh, one other thing I meant to say but forgot (which, unlike the above, hasn't already been said in comments on your blog). The assumption being made here is, roughly, that people responding to the survey are a uniform random sample from all your readers. But I bet they aren't. In particular, I bet more "engaged" readers are (1) more likely to respond to the survey and (2) more likely to prefer longer meatier posts. So I bet the real preference gap among your whole readership is smaller than the one found in the survey. Of course you may actually prefer to optimize for the experience of your more engaged readers, but again that isn't what you said you wanted to do :-).
Since at 4,000 words the post was running up against the limits of my stamina regardless of readers' preferences, I trust my smart and engaged readers to make all the necessary nitpicks and caveats for me :)
First of all, according to the site stats more than 80% of the people who read the survey filled it out, so it makes sense to treat it as a representative sample. I forgot to mention that.
To your first point: you're correct that "the real gap is almost certainly above 12.5%" isn't exactly what my posterior says. Again, my goal was to make a decision, so I had to assign decisions based on what the data could show me. I don't need to have a precise interpretation of the results to make a sensible decision based on them, as long as I'm not horribly mistaken about what the results mean.
And what the results mean is, in fact, pretty close to "the real gap is almost certainly above 12.5%" under some reasonable assumptions. Whatever the "real" gap (i.e. the gap I would get if I got an answer from every single one of my current and future readers), the possible gaps I could measure on the survey almost certainly follow some unimodal and pretty symmetric distribution around it. This means that the measured results are about as likely to overshoot the "real gap" by x% as they are to undershoot it, at least to a first approximation (i.e. ignoring things like how the question was worded and the phase of the moon). This in turn means that a measured result of a 15% gap on a large sample of readers does imply that the "real gap" is very likely to be close to 15% and above 12.5%.
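If it helps, here's a quick simulation of that symmetry claim (the split and sample size are made up, not the survey's actual numbers):

```python
# Fix a "real" split of readers, simulate many surveys, and check that the
# measured gap scatters roughly symmetrically around the real gap.
import numpy as np

rng = np.random.default_rng(0)
real_split = np.array([0.42, 0.30, 0.28])   # longer / same / shorter; real gap = 14%
n_respondents = 200                         # hypothetical survey size

real_gap = real_split[0] - real_split[2]
surveys = rng.multinomial(n_respondents, real_split, size=100_000)
measured_gap = (surveys[:, 0] - surveys[:, 2]) / n_respondents

print(f"real gap:               {real_gap:.3f}")
print(f"mean measured gap:      {measured_gap.mean():.3f}")
print(f"P(overshoot by >2.5%):  {np.mean(measured_gap > real_gap + 0.025):.3f}")
print(f"P(undershoot by >2.5%): {np.mean(measured_gap < real_gap - 0.025):.3f}")
```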
Thanks for taking the time to dig into the math, this is what it's about.
Wait, if 80% of your readers took the survey then why on earth are you doing any kind of fancy statistics to estimate that preference gap? If you've got a representative sample that covers a large majority of your readers, then you know what the gap is: it's the gap observed in the sample, which IIRC was a little under 15%. Done.
(The factor-of-2 change in the meaning of that 25% figure seems really strange to me, too, but I take it the issue is just that the way it was introduced didn't mean what I thought it did.)
Again, my goal was to make a decision, so I had to assign decisions based on what the data could show me.
It seems to me like you came up with a sensible metric for determining whether posts should be made longer or shorter, conditional on the post length changing, but that it would be better to determine also whether or not post length should change. That's sort of what the 25% cutoff was pointing at, but note that it doesn't distinguish between the world where it's split 60-5-35 (for longer-same-shorter) and the world where it's split 25-75-0, even though both show the same 25-point gap. The first world looks like it needs you to split out your readership and figure out what the subgroups are, and the second world looks like you should moderately increase post length.
(Of course, to actually get the right decision you also need the cost estimate for being too long vs. too short; one might assume that you should tinker with the length until the two unhappy groups are equally sized, but this rests on an assumption (that the two kinds of unhappiness are equally costly) that is often wrong.)
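To illustrate with a made-up model (all numbers invented): if the two kinds of unhappiness aren't equally costly, the best length is not where the two unhappy groups are the same size.

```python
# Toy model: the share who find posts too short falls with length, the share
# who find them too long rises, and (in this invented scenario) a too-long
# post annoys readers twice as much as a too-short one.
import numpy as np

lengths = np.linspace(1000, 7000, 601)               # candidate lengths, in words
too_short = ((7000 - lengths) / 7000) ** 2           # share who want it longer
too_long = ((lengths - 1000) / 7000) ** 2            # share who want it shorter

cost_symmetric = too_short + too_long                # equal per-reader costs
cost_asymmetric = too_short + 2 * too_long           # too-long hurts twice as much

print(f"groups equal at:         ~{lengths[np.argmin(np.abs(too_short - too_long))]:.0f} words")
print(f"symmetric-cost optimum:  ~{lengths[np.argmin(cost_symmetric)]:.0f} words")
print(f"asymmetric-cost optimum: ~{lengths[np.argmin(cost_asymmetric)]:.0f} words")
```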
I'm with Rif and GJM that the choice of threshold is arbitrary and confusing.
Binning is always wrong, except for decision theory and computational limitations, including those imposed by pedagogy. You said you'll follow up with a continuous version, so I can't blame you for binning here. I don't have a solution, but it seems more confusing than necessary.
Decisions are discrete, which may suggest binning in analysis, but it is generally better to postpone binning as long as possible, and the analysis should tell you how to bin, not just serve as an excuse for bins chosen in advance. Decision theory says that you should maximize expected value, which says that you should have probabilistic beliefs, which is a big motivation for philosophical bayesianism. But though the post is about both bayesianism and decision theory, it fails to make the connection.
I think it would be less arbitrary to pander to the majority. Just consider two hypotheses: most prefer shorter or most prefer longer. Then your posterior is just one number, the credence that the majority prefers longer articles. But if you impose a fixed cost of change, that partitions the space of posteriors into three pieces, corresponding to change to longer articles, no change, and change to shorter articles. The thresholds are in posterior space, not parameter space. The threshold does not depend just on the maximum likelihood parameter, but also on the amount of data. This ends up with a more complicated decision model and a less complicated inference, which may not serve your pedagogical goals, but it shoves the arbitrariness into the numerical costs and benefits, which at least are well specified, and it avoids the confusion about 25% vs. 12.5% which stems from thinking that the words specified the model.
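To make that concrete, here's a minimal sketch with made-up payoffs: q is the posterior credence that the majority prefers longer posts, B is the benefit of changing in the direction the majority actually prefers (and the loss of changing the wrong way), and C is the fixed cost of any change.

```python
def decide(q: float, B: float = 1.0, C: float = 0.3) -> str:
    """Pick the action with the highest expected value."""
    ev_longer = (2 * q - 1) * B - C     # gain B with prob q, lose B with prob 1-q, pay C
    ev_shorter = (1 - 2 * q) * B - C    # the mirror image
    ev_no_change = 0.0
    best = max(ev_longer, ev_shorter, ev_no_change)
    if best == ev_no_change:
        return "no change"
    return "write longer" if best == ev_longer else "write shorter"

# The thresholds live in posterior space: change only if q > 1/2 + C/(2B)
# or q < 1/2 - C/(2B); otherwise the fixed cost swamps the expected benefit.
for q in (0.2, 0.55, 0.9):
    print(q, decide(q))
```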
The verbal logic of Bayes’ rule is that whichever alternative gave the highest probability of seeing the evidence we actually observed is the one most supported by the evidence. “Support” means that the probability of the alternative increases from its prior.
That is precise enough to be wrong. The first half of the first sentence is about likelihood and the second sentence defines "support" to mean Bayes factor. They are not equal. I'm not sure if you got this from Arbital because it is hard to search. It does use the word "support," but I don't think it defines it.
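A toy example of the gap (numbers invented): more than one hypothesis can be "supported" in the posterior-rises-above-prior sense, and the most-supported one need not be the one with the highest likelihood.

```python
priors = [0.6, 0.3, 0.1]
likelihoods = [0.1, 0.4, 0.5]   # P(data | H_i)

evidence = sum(p * l for p, l in zip(priors, likelihoods))            # P(data)
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # Bayes' rule

for i, (prior, post) in enumerate(zip(priors, posteriors)):
    tag = "supported" if post > prior else "not supported"
    print(f"H{i}: prior {prior:.2f} -> posterior {post:.2f} ({tag})")
# H2 has the highest likelihood, yet H1 ends up with the largest posterior,
# and both H1 and H2 are "supported" in the sense of posterior > prior.
```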
You don't seem to actually use this paragraph, instead talking about posteriors. That's the right thing to do, but I think the quoted paragraph suggests "Bayesian hypothesis testing," i.e., using Bayes factors for hypothesis testing. Leaving aside your post, do any modern Bayesians actually advocate this? Jeffreys seems to mention it as at least plausible, but then he puts forward Lindley's paradox, which shows that it is bad, at least when the prior is stupid. But people often do use uninformative priors.