$500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions

johnswentworth

LESSWRONG
LW

$500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions — LessWrong

40 $500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions

by johnswentworth

16th May 2023

AI Alignment Forum

3 min read

40 Ω 18

Informal Problem Statement

We have an information channel between Alice and Bob. Alice picks a function. Bob gets to see the value of that function at some randomly chosen input values... but doesn't know exactly which randomly chosen input values. He does get to see the randomly chosen values of some of the input variables, but not all of them.

The problem is to find which functions Alice should pick with what frequencies, in order to maximize the channel capacity.

Why Am I Interested In This?

I'm interested in characterizing functions which are "insensitive" to subsets of their input variables, especially in high-dimensional spaces. For instance, xor of a bunch of random bits is maximally sensitive: if we have a 50/50 distribution over any one of the bits but know all the others, then all information about the output is wiped out. On the other end of the spectrum, a majority function of a bunch of random bits is highly insensitive: if we have a 50/50 distribution over, say, 10% of the bits, but know all the others, then in most cases we can correctly guess the function's output.

I have an argument here that the vast majority of functions are pretty highly sensitive: as the number of unknown inputs increases, information falls off exponentially quickly. On the other hand, the example of majority functions shows that this is not the case for all functions.

Intuitively, in the problem, Alice needs to mostly pick from "insensitive" functions, since Bob mostly can't distinguish between "sensitive" functions.

... And Why Am I Interested In That?

I expect that natural abstractions have to be insensitive features of the world. After all, different agents don't all have exactly the same input data. So, a feature has to be fairly insensitive in order for different agents to agree on its value.

In fact, we could view the problem statement itself as a very rough way of formulating the coordination problem of language: Alice has to pick some function f which takes in an image and returns 0/1 representing whether the image contains an apple. (The choice of function defines what "apple" means, for our purposes.) Then Alice wants to teach baby Bob what "apple" means. So, there's some random stuff around them, and Alice points at the random stuff and says "apple" for some of it, and says something besides "apple" the rest of the time. Baby Bob is effectively observing the value of the function at some randomly-chosen points, and needs to back out which function Alice intended. And Bob doesn't have perfect access to all the bits Alice is seeing, so the function has to be robust.

Formal Problem Statement

Consider the following information channel between Alice and Bob:

Alice picks a function $f : {0, 1}^{n} \to {0, 1}$
Nature generates m possible inputs $x^{1}, . . ., x^{m}$ , each sampled uniformly and independently from ${0, 1}^{n}$ .
Nature also generates $m$ subsets $S_{1}, . . ., S_{m}$ of $1, . . ., n$ , each sampled uniformly and independently from subsets of size $s$ .
Bob observes $Y = (Y_{1}, . . ., Y_{m})$ where $Y_{i} = (f (x_{i}), x_{S_{i}}, S_{i})$ .

The problem is to compute the distribution over $f$ which achieves the channel capacity, i.e.

${argmax}_{P [f]} \sum_{f, Y} P [f] P [Y | f] ln \frac{P [Y | f]}{\sum_{f^{'}} P [Y | f^{'}] P [f^{'}]}$

Bounty/Prize Info

The problem is to characterize the channel throughput maximizing distribution $P [f]$ . The characterization should make clear the answers to questions like:

What functions have the highest probability?
How quickly does the probability fall off as we move "away" from the most probable functions, and what do marginally-less-probable functions look like?
How much probability is assigned to a typical function chosen uniformly at random?
Which functions, if any, are assigned zero probability?

All of these should have human-interpretable answers. No credit will be given for e.g. existence and uniqueness alone (the optimization is convex in the happy direction, so that's pretty easy anyway), or a program which would compute $P [f]$ given superexponential compute.

I may give partial or even full awards for variations on the above problem, depending on my own judgement of how useful they are. For instance, any of the following seem reasonable and potentially valuable:

Different domain/range for f.
(Nontrivial) big-O size restrictions on S
Asymptotic results in general

Deadline is end of June. If there are multiple qualifying answers, then I will award prize money based on how useful I judge each answer to be.

Bounties (closed)Logic & Mathematics Natural AbstractionAI

Frontpage

40 Ω 18

$500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions

New Comment

11 comments, sorted by

top scoring

Click to highlight new comments since: Today at 1:11 AM

[-]Joe Benton3yΩ9174

Here's an argument that I think works in the limit where . For very large $m$ , each subset $S \subseteq [n]$ of size $s$ will occur many times as an $S_{i}$ . Moreover, for each set of coodinate values $x_{S} := (x_{j})_{j \in S}$ , there will be many $x^{i}$ such that $S_{i} = S$ and $x_{S}^{i} = x_{S}$ . Therefore, using the law of large numbers for any subset $S \subseteq [n]$ of size $s$ and any set of coordinate values $x_{S}$ , Bob can infer $N (x_{S}, f) := # {x^{'} \in {0, 1}^{n} : x_{S}^{'} = x_{S} and f (x^{'}) = 1}$ , i.e. the number of inputs $x^{'}$ that agree with $x_{S}$ at the coordinates indexed by $S$ for which $f (x^{'}) = 1$ . Additionally, the random variable $Y$ contains no other information about $f$ that is not contained in this set of statistics.

Given a function $f : {0, 1}^{n} \to {0, 1}$ , let $F (f)$ denote the set of all functions $f^{'}$ such that for each subset $S \subseteq [n]$ of size $s$ and each set of coordinate values $x_{S}$ , $N (x_{S}, f) = N (x_{S}, f^{'})$ . The argument of the previous paragraph then implies that upon observing $Y$ Bob can deduce the equivalence class $F (f)$ , but no other information about $f$ from $Y$ .

If we have a random variable $X$ and we define another random variable $Y$ as a deterministic function $Y (X)$ , then it is easy to show that the set of distributions of $X$ which maximize the mutual information between $Y$ and $X$ are precisely those for which $Y$ has a uniform distribution. Applying this to the above setup, we see that the distributions over $f$ which maximize the channel capacity are those for which $F (f)$ is uniformly distributed over all the values that it can take.

If we suppose in addition that $f$ has the maximum entropy distribution subject to this constraint, we find that:

$P [f] \propto 1 / (# {f^{'} : N (x_{S}, f) = N (x_{S}, f^{'}) for all S \subseteq [n], x_{S} \in {0, 1}^{S}})$ .

Intuitively, this says that the probability of a given function $f$ is inversely proportional to the number of other functions $f^{'}$ that have the same number of ones on every hyperplane defined by fixing a set of some $s$ coordinates $x_{S}$ . This seems to correspond roughly to sensitivity: we intuitively expect there to be a lot of such functions $f^{'}$ when the number of ones that $f$ outputs on most hyperplanes is approximately half of the number of total points on that hyperplane, and saying that a function's output is approximately half ones and half zeros on a given hyperplane is roughly saying that $f$ is sensitive to the remaining unspecified coordinates.

It's not obvious to me that the above expression for $P [f]$ is feasible to compute in practice, but I think it is fairly naturally interpretable.

The key reason the problem is tractable in the case $m \to \infty$ is that the law of large numbers means the probabilities wash out and we get an essentially deterministic problem. In the case where $m$ is finite, this won't happen and in particular I expect you'll run into complications that arise from interaction terms where the method for combining the information from two observations $Y_{i}$ and $Y_{j}$ with $S_{i} \neq S_{j}$ is not clean.

I expect you're more likely to end up with a tractable solution if you rephrase the problem statement slightly so that the subsets of inputs over which you're agregating outputs when observed (in the case above, these subsets are the hyperplanes defined by the coordinates $x_{S}$ ) are disjoint or overlapping. (For example, taking all the $S_{i}$ to be random but equal would ensure this.) It strikes me that there might be a formulation like this that still captures the key features of the problem you're interested in, but I've not thought about this in detail.

[-]johnswentworth3y51

This answer clears the bar for at least some prize money to be paid out, though the amount will depend on how far other answers go by the deadline.

One thing which would make it stronger would be to provide a human-interpretable function for each equivalence class (so Alice can achieve the channel capacity by choosing among those functions).

The suggestions for variants of the problem are good suggestions, and good solutions to those variants would probably also qualify for prize money.

[-]Erik Jenner3y153

I'm interested in characterizing functions which are "insensitive" to subsets of their input variables, especially in high-dimensional spaces.

There's a field called "Analysis of boolean functions" (essentially Fourier analysis of functions ) that seems relevant to this question and perhaps to your specific problem statement. In particular, the notion of "total influence" of a boolean function is meant to capture its sensitivity (e.g. the XOR function on all inputs has maximal total influence). This is the standard reference, see section 2.3 for total influence. Boolean functions with low influence (i.e. "insensitive" functions) are an important topic in this field, so I expect there are some relevant results (see e.g. tribes functions and the KKL theorem, though those specifically address a somewhat different question than your problem statement).

[-]Dweomite3y106

I'm not sure whether this is helpful, but this reminds me of Error Correction Codes, a way of transmitting information through a noisy channel that trades bandwidth for reliability by encoding the intended message redundantly.

An explanation that I found helpful when learning about them was that you can think of a packet of N bits as specifying one corner of an N-dimensional hypercube, and an ECC as saying that you'll only intentionally transmit certain corners and not others. If you select a subset of corners such that no allowed corner is adjacent to any other, then a 1-bit error will always land you on a disallowed corner, so the receiver will know an error occurred. If all allowed corners are some distance from all other allowed corners, then you can guess the most likely intended corner based on distance from the corner received.

An XOR of all bits is maximally noisy because every corner with a value of "1" is surrounded by corners with a value of "0" and vice-versa. The corners corresponding to a given answer are maximally dispersed, so every error changes the result.

The inverse of that strategy is to designate exactly-opposite corners as "0" and "1", and then map all the remaining corners by which of those they're closer to. In other words, slice the hypercube in half, and then assign the same value to all corners in a given half. (The "majority of bits" function does exactly this.)

I don't think I can personally convert that into an answer to the stated problem, but maybe it will give someone else a thread to pull on?

[-]Sune3y10

I tried and failed to formalize this. Let me sketch the argument, to show where I ran into problems.

Considering a code with a corresponding decoding function $d : {0, 1}^{n} \to {0, 1}^{k}$ , and assume that $m ≪ 2^{k / 2}$ .

For any function $g : {0, 1}^{k} \to {0, 1}$ we can define $f = g \circ d$ . We then choose $g$ randomly from the $2^{2^{k}}$ such functions. We want to code to be such that for random $x \in {0, 1}^{n}$ and random $S \subset {1, \dots, n}$ the information $(x_{S}, S}$ is enough to deduce $d (x) \in {0, 1}^{k}$ , with high probability. Then each $Y_{i}$ would give Bob one bit of information about $g$ (its value at the point $d (x_{i})$ ) and hence one bit about $f$ . Here we use the assumption $m ≪ 2^{k / 2}$ to avoid collisions $d (x_{i}) = d (x_{j})$ .

Unfortunately, this argument does not work. The issue is that $x \in {0, 1}^{n}$ is chosen at random, instead of as an encoding $c (x^{'})$ of a message. Because of this, we should not expect $x$ to be close to a valid code, so we should not expect there to be a decoding method that will give consistent decodings of $(x_{S}, S)$ for different values of $S$ .

It is not clear to me if this is a bug in the solution or a bug in the problem! The world is not random, so why do we want $x$ to be uniform in ${0, 1}^{n}$ ?

[-]interstice3y*50

Here's a possible construction, building on Dweomite's suggestion to use error correction codes. We treat the selection of the random variables that Bob sees as a binary erasure channel with probability of erasure. Choose some error-correcting code for the BEC. Now, Alice constructs a function in the following way: For each input $x$ , randomly choose a subset $S$ of size $s$ . Apply the decoder for our error-correcting code to the subset; we obtain some string $y$ ^[1]. Alice chooses a function $f$ which is constant on pre-images of a given string $y$ under this mapping. She could either choose such an $f$ uniformly at random, or possibly treat the selection of inputs as another binary erasure channel(on sequences of 1s and 0s indexed by the string $y$ ) and use another error-correcting code.

You could also repeat this procedure several times and take a majority vote. ↩︎

[-]Sune3y10

I don't follow the construction. Alice don't know x and S when choosing f. If she is taking the preimage for all 2^n values of x, each with a random S, she will have many overlapping preimages.

[-]interstice3y20

Yes. But I don't see why that's a problem? Which preimage a given x would be assigned to is random. The hope is that repeated trials would give the same preimage frequently enough for it to be a meaningful partition of the input space. How well it would work depends on the details of the ECC but I suspect it would work reasonably well in many cases. You could also just apply the decoder directly to the string x but I thought that might be a bit more unnatural since in reality Bob will never see the full string.

[-]Aprillion3y41

sampled uniformly and independently

🤔 I don't believe this definition fits the "apple" example - uniform samples from a concept space of "apple or not apple" would NEVER™ contain any positive example (almost everything is "not apple")... or what assumption am I missing that would make the relative target volume more than ~zero (for high n)?

Bob will observe a highly optimized set of Y, carefully selected by Alice, so the corresponding inputs will be Vastly correlated and interdependent at least for the positive examples (centeroid first, dynamically selected for error-correction later 🤷‍♀️), not at all selected by Nature, right?

[-]Thomas Kwa3y*Ω342

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let , $f_{2}$ be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for $f_{2}$ replaced by $f_{1}$ . Then $P (Y | F) = P (Y | F^{'})$ . But this means $H (Y | F) = H (Y | F^{'})$ and so $I (Y, F) = H (Y) - H (Y | F) = H (Y) - H (Y, F^{'}) = I (Y, F^{'}) .$ You don't lose any channel capacity switching to $F^{'} .$

[-]Sune3y32

This question is non-trivial even for . Here it becomes: let Alice choose a probability $p$ (which has to be on the form $p = \frac{k}{2^{n}}$ but this is irrelevant for large $n$ ) and Bob observes the binomially distributed number $B (m, p)$ . With which distribution should Alice choose $p$ to maximize the capacity of this channel.

Moderation Log