Inspired partially by this post and partially by trying to think of simple test cases for a machine learning project I'm working on, here is a (not too hard, you should try answering it yourself) question: Let's say we've observed $n$ trials of a Bernoulli random variable, and had $k$ outcomes of 1 (so $n-k$ were 0). Laplace's rule of succession (uniform prior over success probability) says that we should estimate a probability of $\frac{k+1}{n+2}$ for the next trial being 1. The question is: What is the prior over bitstrings of length $n+1$ implied by Laplace's rule of succession? In other words, can we convert the rule of succession formula into a probability distribution over bitstrings $s$ that record the outcomes of $n+1$ trials?
Additional clarification of the problem:
Given any particular observation of $n$ trials, there will be two bitstrings of length $n+1$ that are consistent with it, where the last (unobserved) trial is 0 or 1 respectively. We can compute the probability of a 1 (which should equal the result from the rule of succession) as

$$P(s_{n+1} = 1 \mid s_{1:n} = o) = \frac{P(o1)}{P(o0) + P(o1)} = \frac{w(o)+1}{n+2},$$

where $o = s_{1:n}$ is the first $n$ bits of the string (corresponding to visible observations), $o0$ and $o1$ denote $o$ with a 0 or 1 appended, and $w$ is the Hamming weight function (it counts the number of 1s in a bitstring). Since this requires a normalization anyway, you can also just provide an energy function $E(s)$ as your answer. The probability formula in this case is

$$P(s) = \frac{e^{-E(s)}}{Z}, \qquad Z = \sum_{s'} e^{-E(s')}.$$
If we just pick a uniform distribution over bitstrings, that doesn't work. Then the predicted probability of the next trial is always just $1/2$.
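(To make the setup concrete, here is a minimal Python sketch of this computation; `predict_next` and `energy` are just illustrative names, not from anywhere in particular. It also shows the uniform-distribution problem directly.)

```python
import math

def predict_next(energy, obs):
    """P(next trial = 1 | observed bits), for the distribution
    P(s) proportional to exp(-energy(s)) over bitstrings of length len(obs) + 1."""
    w1 = math.exp(-energy(obs + (1,)))  # weight of the string ending in 1
    w0 = math.exp(-energy(obs + (0,)))  # weight of the string ending in 0
    return w1 / (w0 + w1)

# Uniform distribution over bitstrings = constant energy.
# The prediction is always 1/2, regardless of what was observed.
uniform = lambda s: 0.0
print(predict_next(uniform, (1, 1, 0, 1)))  # 0.5
```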
Answer:
The following energy function works:

$$E(s) = \log \binom{n+1}{w(s)}$$

This can be checked by computing the probability of a 1 as

$$\frac{P(o1)}{P(o0) + P(o1)} = \frac{\binom{n+1}{w(o)+1}^{-1}}{\binom{n+1}{w(o)}^{-1} + \binom{n+1}{w(o)+1}^{-1}} = \frac{w(o)+1}{n+2},$$

which is exactly the rule of succession.
This energy function biases the distribution towards strings with more extreme ratios between the counts of 0s and 1s. We can think of it as countering the entropic effect of strings with an equal balance of 0s and 1s being the most prevalent.
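As a quick numerical sanity check (a small Python sketch, enumerating every possible observation of $n$ trials), the energy above does reproduce $(w(o)+1)/(n+2)$:

```python
import itertools
import math

def energy(s):
    """E(s) = log C(len(s), w(s)), so P(s) is proportional to 1 / C(n+1, w(s))."""
    return math.log(math.comb(len(s), sum(s)))

n = 6  # number of observed trials; full bitstrings have length n + 1
for obs in itertools.product((0, 1), repeat=n):
    k = sum(obs)  # Hamming weight of the observed prefix
    w1 = math.exp(-energy(obs + (1,)))
    w0 = math.exp(-energy(obs + (0,)))
    assert abs(w1 / (w0 + w1) - (k + 1) / (n + 2)) < 1e-12
print("rule of succession recovered for all", 2 ** n, "observations")
```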
Also tried this, and basically ended up with the same answer as commenter One.
The key idea is that we really only care about drawing 5 trials from this process. So we just have to find a probability distribution over 6 outcomes: the count of 1s for our 5 trials, from 0 to 5. 10^6 datapoints is enough to kill a fair amount of noise by self-averaging, so I treated the fact that hiding a random trial has to reproduce the observed 4-trial distribution as just a hard constraint. (It's a linear constraint in the probabilities.) Then I did maximum entropy optimization subject to that constraint. The output distribution in terms of 5-trial counts looked pretty symmetric and was heavier towards the extremes.
Another quick computation from these values yields the p(R | k) numbers asked for in the question: [0.11118619, 0.32422537, 0.49942029, 0.67519768, 0.88914787]
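For concreteness, the optimization can be set up roughly like this (a Python sketch, not the actual code used; `p4_observed` is a made-up stand-in for the empirical 4-trial count distribution, which isn't reproduced here):

```python
import numpy as np
from scipy.optimize import minimize

# q[j] = probability that a 5-trial draw contains j ones (j = 0..5).
# Hiding one of the 5 trials uniformly at random turns a count of j into a
# 4-trial count of j (a 0 was hidden) or j - 1 (a 1 was hidden).
A = np.zeros((5, 6))
for i in range(5):
    A[i, i] = (5 - i) / 5       # stay at count i: hide one of the 5 - i zeros
    A[i, i + 1] = (i + 1) / 5   # drop from count i + 1: hide one of its ones

# Stand-in for the empirical 4-trial count distribution; the real numbers
# would come from the 10^6 datapoints in the question.
p4_observed = np.array([0.15, 0.20, 0.25, 0.22, 0.18])

def neg_entropy(q):
    q = np.clip(q, 1e-12, None)
    return float(np.sum(q * np.log(q)))

# A q = p4_observed is the hard (linear) constraint; since each column of A
# sums to 1 and p4_observed sums to 1, it already forces q to sum to 1.
cons = [{"type": "eq", "fun": lambda q: A @ q - p4_observed}]
res = minimize(neg_entropy, np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
               constraints=cons, method="SLSQP")
print("max-entropy 5-trial count distribution:", res.x)
```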
Registering now that my modal expectation is that the situation will mostly look the same in 2028 as it does today. (To give one example from AI 2027, scaling neuralese is going to be hard, and while I can imagine a specific set of changes that would make it possible, it would require changing some fairly fundamental things about model architecture which I can easily imagine taking 3 years to reach production. And neuralese is not the only roadblock to AGI.)
I think one of your general points is something like "slow is smooth, smooth is fast" and also "cooperative is smooth, smooth is fast", both of which I agree with. But the whole "trauma" thing is too much like Bulverism for my taste.
I could be wrong, but from what I've read the domain wall should have mass, so it must travel below light speed. However, the energy difference between the two vacuums would put a large force on the wall, rapidly accelerating it to very close to light speed. Collisions with stars and gravitational effects might cause further weirdness, but ignoring that, I think after a while we basically expect constant acceleration, meaning that light cones starting inside the bubble that are at least a certain distance from the wall would never catch up with the wall. So yeah, definitely above 0.95c.
We probably don't disagree that much. What "original seeing" means is just going and investigating things you're interested in. So doing lengthy research is actually a much more central example of this than coming up with a bold new idea is.
As I say above: "There's not any principled reason why an AI system, even an LLM in particular, couldn't do this."
Some experimental data: https://chatgpt.com/share/67ce164f-a7cc-8005-8ae1-98d92610f658
There's not really anything wrong with ChatGPT's attempt here, but it happens to have picked the same topic as a recent Numberphile video, and I think it's instructive to compare how they present the same topic: https://www.numberphile.com/videos/a-1-58-dimensional-object
My view on this is that writing a worthwhile blog post is not only a writing task, but also an original seeing task. You first have to go and find something out in the world and learn about it before you can write about it. So the obstacle is not necessarily reasoning ("look at this weird rock I found" doesn't involve much reasoning, but could make a good blog post), but a lack of things to say.
There's not any principled reason why an AI system, even an LLM in particular, couldn't do this. There is plenty going on in the world to go and find out about, even if you're stuck on the internet. (And even without an internet connection, you can try and explore the world of math.) But it seems like currently the bottleneck is that LLMs don't have anything to say.
Novels might require less of this than blog posts, but I'd guess that writing a good novel is also a task that requires a lot of original seeing.
Thanks for the reply & link. I definitely missed that paragraph, whoops.
IMO even just simple gamete selection would be pretty great for avoiding the worst genetic diseases. I guess tracking nuclei with a microscope is way more feasible than the microwell thing, given how hard it looks to make IVS work at all.
Re the "Appendix: Cheap DNA segment sensing" section, just going to throw out a thought that occurred to me (very much a non-expert). Let's say we're doing IVS, and assume we can separate spermatocytes into separate microwells before they undergo meiosis. The starting cells all have a known genome. Then the cell in each microwell divides into 4 cells. If we sequence 3 of them, then we know by process of elimination what the sequence on the 4th cell is, at a very high level of detail, including crossovers, etc. So we kill 3 cells and look at their DNA, and then we know what DNA the remaining living cell has without doing anything to it.
Okay, DNA sequencing is still fairly expensive, so maybe it's super crazy to do it 3 times to get a single cell with known DNA. But:
If it's too hard to separate the cells into microwells while they're still dividing, maybe there are alternative things we could do, like just watching the culture with a microscope and keeping track of which cell split from which and where they ended up (plus some kind of microfluidics setup to shuffle the sperm cells around to where we want them).
I have read that some sequencing methods (nanopore) have a high error rate (comparing multiple reads can help correct this). Did you also spot-check some other genes that you have no reason to believe contain mutations to see if they look ok? Seeing a mutation in exactly the gene you expect is only damn strong evidence if there isn't a sequencing error in every third gene.