This post will attempt to formalise the intuition that "it's hard to figure out the turning point of a logistic curve, at least until after that turning point". Ashort visual "proof" can also be found here.
Logistic curves can be specified by three parameters, c, l>0, and k>0. Their equation is then:
Fc,l,k(x)=lekxek(x−c)+1.
Note that this l is different from that in this article. The turning point of this curve is at x=c (where it takes the value of lekc/2) while its supremum is lekc; it tends to this value as x→∞. Take the limit as c→∞ as being the exponentials:
F∞,l,k(x)=lekx.
Figuring out the right curve
We'll imagine a simple Bayesian setup. An analyst of logistic curves is seeing data from one distribution, and has two hypotheses about it: FC,L,K, for values C, L, and K, and Fc,l,k with values c, l, and k. We'll designate FC,L,K by F and Fc,l,k by f.
Now, the true distribution is F, but the analyst doesn't know that. The question we're asking is thus:
Starting from an equal prior on F and f, how much of what kind of observation will the analyst need to establish that F is the true underlying distribution?
Noisy Sampling
If the analyst can sample noiselessly from the curve, then three samples should generally suffice to fully establish F, and one sample should generally suffice to distinguish F from f. So we'll consider the (realistic) situation where there is noise in the samples.
So assume the analyst samples n points, at →x=(x1,x2,…,xn). In return, it gets n values, →y=(y1,y2,…,yn); these are sampled independently from N(F(xi),σ2i). This is a normal distribution with mean F(xi) and standard deviation σi.
The analyst is assumed to know the vector →σ=(σ1,…σn), and indeed everything about this setup, with one exception: whether the means of these normal distributions are F(xi) or f(xi).
Let Pa be the analyst's probability distribution. Their prior gives equal weight to both hypotheses: Pa(F)=Pa(f)=1/2. Let O→y→x be the analyst observing →y after sampling from →x; their posterior is then Pa(f∣O→y→x).
Note that, from our perspective, Pa(F∣→x) is a random variable whose distribution we know. Say that:
→xestablishes the difference between F and f if the expectation of Pa(F∣→x) is less than 1/16.
We could choose other criteria, and this a relatively loose one. It only assumes three bits of information in favour of F over f. Note that since Pa≥0, we can get probability bounds on Pa as well, from this result; for instance:
If E[Pa(f∣→x)]≤q/4, then with probability at least 3/4, Pa(f∣→x)≤q.
So, for instance, our criteria above ensures that with probability at least 3/4, Pa(f∣→x)≤1/4. Conversely, since Pa≤1, probability bounds on Pa translate into expectation bounds, making the two approaches loosely equivalent. We'll use expectation bounds, as they are more natural for this random variable.
Bounding results
Our first result, proven in later sections, is a lower bound on the expectation of Pa(f∣→x):
E[Pa(f∣→x)]≥12∏ni=1[1−erf(δiσi2√2)].(1)
Here erf is the error function and δi is the absolute difference between F(xi) and f(xi). We can then get the slightly looser but more easily computable bound:
E[Pa(f∣→x)]≥12[1−∑ni=1δiσi√2π].(2)
How to sample
Sampling very large positive or negative values
Note that:
0≤Fc,l,k=lekxek(x−c)+1≤lekx.
Hence we can bound the δi via:
δi=|F(xi)−f(xi)|≤max(LeKxi,lekxi).
Let m(xi)=max(LeKxi,lekxi); note this is an increasing function, exponential for very negative xi.
Assume we sample n′ different xi values below a very negative X; then if σ− is the minimum of all the σi for xi≤X, the contribution of these n′ points to the expectation bound is at most 0 and at least:
−n′1σ−√2πm(X).
This gives our result for very negative values:
If noise is irreducible below σ−, then sampling below a very negative X will have very little impact on the analyst's posterior. To get a better result, increasing the X (exponential effect) is generally more powerful than decreasing σ− (inverse linear effect), and much more powerful than getting more samples (linear effect).
The behaviour for large positive xi is also clear: unless lekc=LeKC, f and F must have different asymptotes. So as long as there is an upper bound σ+ on the noise, sampling the curve at large values will cause the expectation of Pa(f∣→x) to converge to 0. For large xi, this is essentially trying to distinguish N(lekc,σ+) from N(LeKC,σ+), so each extra sample applies a multiplicative factor to the expected value of Pa(f∣→x). So, for large samples, the probability of the wrong function converges geometrically to zero in the number of samples.
Finding (any) turning point
So, distinguishing F and f for very low samples is very hard, but distinguishing them for very high samples is generally not very useful. But enough about asymptotic behaviour. The question is: what happens in between, closer to the turning points C and c of F and f?
We can make some scaling and translation choices to simplify F, setting c=0 and l=k=1. So the turning point is at 0 (y value 1/2) and the supremum is 1:
F(x)=exex+1=11+e−x.
Assume now that the noise σi is a constant σ. We want f to have a different turning point, so that can see how easy it is to identify this turning point. Let's choose the worst possible scenario: f is an exponential function with no turning point:
f(x)=lekx.
So, how can the analyst sample so that they have the greatest possible chance of distinguishing between a true function with a turning point at 0, and a false function with no turning point at all?
We have two free variables: the k and l of f, and we typically want to see how well we can do when sampling below a given X. For constant σ, the elements of the bound are given by:
Define d(l,k)(x) as this function, without the σ term. We'll now consider X=0, ie we are sampling at any point before the turning point. Then some experimentation allows us to minimize d(l,k)(x) for negative values, by setting l=0.51 and k=0.69; given these values, d(l,k)(x) is bounded above by 0.007:
Consequently we can use equation (2) to get a bound:
E[Pa(f∣→x)]≥12[1−nσ0.007]
To establish the difference between F and f, we need this below 1/16. Consequently, we need nσ0.007≥78, or
nσ≥125.
So if the noise is 1/200, ie 1% of the value at the turning point, a single data point might suffice. But if the noise is 10% of the value at the turning point, then at least seven samples are needed.
Anyway, that's all the way to the turning point; what about if X is chosen so that the value F(X) is 1/3 (two thirds of the value at the turning point) or 1/4 (a half of the value at the turning point)? To get these, we need X=−log(2) and X=−log(3), respectively. We'll also look at past the turning point, X=log(2) and log(3).
Optimising l and k for all five situations give:
For X=−log(3), nσ0.0015≥78 or nσ≥583.
For X=−log(2), nσ0.0027≥78 or nσ≥324.
For X=0, nσ0.007≥78 or nσ≥125.
For X=log(2), nσ0.015≥78 or nσ≥58.
For X=log(3), nσ0.021≥78 or nσ≥41.
But equation (2) gives poor bounds for low σ. Using equation (1) instead, for σ=1/200 (1% of turning point y-value) and σ=1/20 (10% of turning point y-value), gives the number n of samples needed as:
The bounds above are only good if the values are sampled independently and close to the peak of the d(l,k) function. If the values are not independent - as values sampled close to each other tend not to be - then more must be sampled, and the same goes if the values are sampled away from the peaks.
The other issue is that, here, we've first optimised l and k for minimal peak of d(l,k), then assumed the best xi were sampled. We need to consider the opposite situations, too: given the sampled xi, optimise l and k. So, even if n samples are enough to distinguish F from this specific f, there are other exponential functions F∞,l,k that would be harder to distinguish from F.
Proof
This section will prove the bounds in equation (1) and (2).
since the prior probabilities are equal. Since the analyst knows the true variances, Pa(O→y→x∣f)=P(O→y→x∣f) and similarly for F: we can replace the analyst's probabilities with the true probabilities. So, contracting P(O→y→x∣F) as p1(→y) and P(O→y→x∣f) as p2(→y), we get:
To get the true expectation of this Pa, we need to integrate over the possible values of →y, weighted by the true probability P(O→y→x∣F)=p1(→y) of this happening:
Note that p1(→y) and p2(→y)−1 are both (strictly) positive, and that 1/(p1(→y)−1+p2(→y)−1) is half the harmonic mean of the two.
The harmonic mean of any number of positive elements is bounded below by the minimum value of its arguments. Hence:
E[Pa(f∣→x)]≥12∫min(p1(→y),p2(→y))d→y.
Now, since the noise is independent, pj(→y)=∏ni=1pj(yi) where p1(yi)=P(yi∣xi,F) and p2(yi)=P(yi∣xi,f). For positive elements, the minimum of two products is greater than or equal to the product of minimums, so
The expressions min(p1(yi),p2(yi)) can be expressed analytically. If φ is the probability density function of N(0,1), the normal distribution with mean 0 and variance 1, then
p1(yi)=1σiφ(F(xi)−yiσi),p2(yi)=1σiφ(f(xi)−yiσi).
So the two curves are normal curves with the same variance and means F(xi) and f(xi). Assume, without loss of generality, that F(xi)≤f(yi). Then the two functions will be equal at the midpoint μi=(F(xi)+f(xi))/2, and for yi≤μi, p1(yi) is higher, while for yi≥μi, p2(yi) is higher.
This post will attempt to formalise the intuition that "it's hard to figure out the turning point of a logistic curve, at least until after that turning point". Ashort visual "proof" can also be found here.
Logistic curves
The logistic curves look like this:
Logistic curves can be specified by three parameters, c, l>0, and k>0. Their equation is then:
Fc,l,k(x)=lekxek(x−c)+1.
Note that this l is different from that in this article. The turning point of this curve is at x=c (where it takes the value of lekc/2) while its supremum is lekc; it tends to this value as x→∞. Take the limit as c→∞ as being the exponentials:
F∞,l,k(x)=lekx.
Figuring out the right curve
We'll imagine a simple Bayesian setup. An analyst of logistic curves is seeing data from one distribution, and has two hypotheses about it: FC,L,K, for values C, L, and K, and Fc,l,k with values c, l, and k. We'll designate FC,L,K by F and Fc,l,k by f.
Now, the true distribution is F, but the analyst doesn't know that. The question we're asking is thus:
Noisy Sampling
If the analyst can sample noiselessly from the curve, then three samples should generally suffice to fully establish F, and one sample should generally suffice to distinguish F from f. So we'll consider the (realistic) situation where there is noise in the samples.
So assume the analyst samples n points, at →x=(x1,x2,…,xn). In return, it gets n values, →y=(y1,y2,…,yn); these are sampled independently from N(F(xi),σ2i). This is a normal distribution with mean F(xi) and standard deviation σi.
The analyst is assumed to know the vector →σ=(σ1,…σn), and indeed everything about this setup, with one exception: whether the means of these normal distributions are F(xi) or f(xi).
Let Pa be the analyst's probability distribution. Their prior gives equal weight to both hypotheses: Pa(F)=Pa(f)=1/2. Let O→y→x be the analyst observing →y after sampling from →x; their posterior is then Pa(f∣O→y→x).
Note that, from our perspective, Pa(F∣→x) is a random variable whose distribution we know. Say that:
We could choose other criteria, and this a relatively loose one. It only assumes three bits of information in favour of F over f. Note that since Pa≥0, we can get probability bounds on Pa as well, from this result; for instance:
So, for instance, our criteria above ensures that with probability at least 3/4, Pa(f∣→x)≤1/4. Conversely, since Pa≤1, probability bounds on Pa translate into expectation bounds, making the two approaches loosely equivalent. We'll use expectation bounds, as they are more natural for this random variable.
Bounding results
Our first result, proven in later sections, is a lower bound on the expectation of Pa(f∣→x):
E[Pa(f∣→x)]≥12∏ni=1[1−erf(δiσi2√2)].(1)
Here erf is the error function and δi is the absolute difference between F(xi) and f(xi). We can then get the slightly looser but more easily computable bound:
E[Pa(f∣→x)]≥12[1−∑ni=1δiσi√2π].(2)
How to sample
Sampling very large positive or negative values
Note that:
0≤Fc,l,k=lekxek(x−c)+1≤lekx.
Hence we can bound the δi via:
δi=|F(xi)−f(xi)|≤max(LeKxi,lekxi).
Let m(xi)=max(LeKxi,lekxi); note this is an increasing function, exponential for very negative xi.
Assume we sample n′ different xi values below a very negative X; then if σ− is the minimum of all the σi for xi≤X, the contribution of these n′ points to the expectation bound is at most 0 and at least:
−n′1σ−√2πm(X).
This gives our result for very negative values:
The behaviour for large positive xi is also clear: unless lekc=LeKC, f and F must have different asymptotes. So as long as there is an upper bound σ+ on the noise, sampling the curve at large values will cause the expectation of Pa(f∣→x) to converge to 0. For large xi, this is essentially trying to distinguish N(lekc,σ+) from N(LeKC,σ+), so each extra sample applies a multiplicative factor to the expected value of Pa(f∣→x). So, for large samples, the probability of the wrong function converges geometrically to zero in the number of samples.
Finding (any) turning point
So, distinguishing F and f for very low samples is very hard, but distinguishing them for very high samples is generally not very useful. But enough about asymptotic behaviour. The question is: what happens in between, closer to the turning points C and c of F and f?
We can make some scaling and translation choices to simplify F, setting c=0 and l=k=1. So the turning point is at 0 (y value 1/2) and the supremum is 1:
F(x)=exex+1=11+e−x.
Assume now that the noise σi is a constant σ. We want f to have a different turning point, so that can see how easy it is to identify this turning point. Let's choose the worst possible scenario: f is an exponential function with no turning point:
f(x)=lekx.
So, how can the analyst sample so that they have the greatest possible chance of distinguishing between a true function with a turning point at 0, and a false function with no turning point at all?
We have two free variables: the k and l of f, and we typically want to see how well we can do when sampling below a given X. For constant σ, the elements of the bound are given by:
δiσ√2π=|F(xi)−f(xi)|σ√2π=∣∣ ∣∣le(k+1)x+lekx−ex(ex+1)√2π∣∣ ∣∣1σ.
Define d(l,k)(x) as this function, without the σ term. We'll now consider X=0, ie we are sampling at any point before the turning point. Then some experimentation allows us to minimize d(l,k)(x) for negative values, by setting l=0.51 and k=0.69; given these values, d(l,k)(x) is bounded above by 0.007:
Consequently we can use equation (2) to get a bound:
E[Pa(f∣→x)]≥12[1−nσ0.007]
To establish the difference between F and f, we need this below 1/16. Consequently, we need nσ0.007≥78, or
nσ≥125.
So if the noise is 1/200, ie 1% of the value at the turning point, a single data point might suffice. But if the noise is 10% of the value at the turning point, then at least seven samples are needed.
Anyway, that's all the way to the turning point; what about if X is chosen so that the value F(X) is 1/3 (two thirds of the value at the turning point) or 1/4 (a half of the value at the turning point)? To get these, we need X=−log(2) and X=−log(3), respectively. We'll also look at past the turning point, X=log(2) and log(3).
Optimising l and k for all five situations give:
But equation (2) gives poor bounds for low σ. Using equation (1) instead, for σ=1/200 (1% of turning point y-value) and σ=1/20 (10% of turning point y-value), gives the number n of samples needed as:
Xσ=1/200σ=1/20−log(3)669−log(2)3380114log(2)16log(3)15
Other difficulties
The bounds above are only good if the values are sampled independently and close to the peak of the d(l,k) function. If the values are not independent - as values sampled close to each other tend not to be - then more must be sampled, and the same goes if the values are sampled away from the peaks.
The other issue is that, here, we've first optimised l and k for minimal peak of d(l,k), then assumed the best xi were sampled. We need to consider the opposite situations, too: given the sampled xi, optimise l and k. So, even if n samples are enough to distinguish F from this specific f, there are other exponential functions F∞,l,k that would be harder to distinguish from F.
Proof
This section will prove the bounds in equation (1) and (2).
By Bayes rule:
Pa(f∣O→y→x)=Pa(O→y→x∣f)Pa(f)Pa(O→y→x)=Pa(O→y→x∣f)Pa(f)Pa(O→y→x∣f)Pa(f)+Pa(O→y→x∣F)Pa(F)=Pa(O→y→x∣f)Pa(O→y→x∣f)+Pa(O→y→x∣F),
since the prior probabilities are equal. Since the analyst knows the true variances, Pa(O→y→x∣f)=P(O→y→x∣f) and similarly for F: we can replace the analyst's probabilities with the true probabilities. So, contracting P(O→y→x∣F) as p1(→y) and P(O→y→x∣f) as p2(→y), we get:
Pa(f∣O→y→x)=p2(→y)p2(→y)+p1(→y))=11+p1(→y)p2(→y)−1.
To get the true expectation of this Pa, we need to integrate over the possible values of →y, weighted by the true probability P(O→y→x∣ F)=p1(→y) of this happening:
E[Pa(f∣→x)]=∫11+p1(→y)p2(→y)−1p1(→y)d→y=∫1p1(→y)−1+p2(→y)−1d→y.
Note that p1(→y) and p2(→y)−1 are both (strictly) positive, and that 1/(p1(→y)−1+p2(→y)−1) is half the harmonic mean of the two.
The harmonic mean of any number of positive elements is bounded below by the minimum value of its arguments. Hence: E[Pa(f∣→x)]≥12∫min(p1(→y),p2(→y))d→y.
Now, since the noise is independent, pj(→y)=∏ni=1pj(yi) where p1(yi)=P(yi∣xi,F) and p2(yi)=P(yi∣xi,f). For positive elements, the minimum of two products is greater than or equal to the product of minimums, so
E[Pa(f∣→x)]≥12∫∞−∞∏ni=1min(p1(yi),p2(yi))d→y≥12∏ni=1∫∞−∞min(p1(yi),p2(yi))dyi.
The expressions min(p1(yi),p2(yi)) can be expressed analytically. If φ is the probability density function of N(0,1), the normal distribution with mean 0 and variance 1, then
p1(yi)=1σiφ(F(xi)−yiσi),p2(yi)=1σiφ(f(xi)−yiσi).
So the two curves are normal curves with the same variance and means F(xi) and f(xi). Assume, without loss of generality, that F(xi)≤f(yi). Then the two functions will be equal at the midpoint μi=(F(xi)+f(xi))/2, and for yi≤μi, p1(yi) is higher, while for yi≥μi, p2(yi) is higher.
Thus min(p1(yi),p2(yi))=⎧⎪⎨⎪⎩1σiφ(f(xi)−yiσi), yi≤μi,1σiφ(F(xi)−yiσi), yi≥μi.
If δi=|F(xi)−f(xi)| is the distance between the two peaks, this becomes: min(p1(yi),p2(yi))=⎧⎪⎨⎪⎩1σiφ(μi+δi/2−yiσi), yi≤μi,1σiφ(μi−δi/2−yiσi), yi≥μi.
Since the integral of φ is 1/2[1+erf(y/√2)], for erf the error function, we can bound the expected probability by:
E[Pa(f∣→x)]≥12∏ni=1∫∞−∞min(p1(yi),p2(yi))dyi≥12∏ni=1(∫μi−∞1σiφ(μi+δi/2−yiσi)dyi+∫∞μi1σiφ(μi−δi/2−yiσi)dyi)≥12∏ni=1(∫−δi/2−∞1σiφ(−yiσi)dyi+∫∞δi/21σiφ(−yiσi)dyi)≥12∏ni=1(12[1+erf(−δi/2σi√2)]+12[1+erf(−δi/2σi√2)])≥12∏ni=1[1−erf(δiσi2√2)].
For positive values, the error function is concave, and it has derivative 2/√π at the origin, so
erf(δiσi2√2)≤2√πδiσi2√2=δiσi√2π.
Consequently
E[Pa(f∣→x)]≥12∏ni=1[1−δiσi√2π].
Using the fact that for x,y positive, (1−x)(1−y)≥1−(x+y), we get a final bound:
E[Pa(f∣→x)]≥12[1−∑ni=1δiσi√2π]