[D] Understanding the "curse of dimensionality" with respect to decision trees and checkerboards

This checkerboard example is a common "bad example" in statistics, but I'd agree with the last poster that it doesn't have a lot to do with "typical" real data.

Here's a simpler example in the same spirit. Flip $n$ coins independently and write down the results $X_{1},\ldots,X_{n}$. Then define an extra coin $X_{n+1}$ to be heads if you had an even number of heads, and tails otherwise. It is an easy exercise to check that all $n+1$ of these have the same distribution, and furthermore any subset of size at most $n$ is i.i.d.! This means that looking at any $n-1$ of the datapoints tells you nothing at all about either of the two remaining points on its own (it only pins down the parity of the pair)... but of course looking at all $n$ of the other points tells you exactly what the last one is.
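Here's a quick simulation of the construction, as a sanity check (a minimal NumPy sketch; $n = 5$, the seed, and the number of trials are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                             # number of independent fair coins
trials = 200_000

X = rng.integers(0, 2, size=(trials, n))          # X_1, ..., X_n (1 = heads)
x_last = (X.sum(axis=1) % 2 == 0).astype(int)     # X_{n+1}: heads iff # of heads is even
full = np.column_stack([X, x_last])               # all n + 1 coins per trial

# Each coin individually is fair:
print(full.mean(axis=0))                          # all entries ~0.5

# Any n of the n + 1 coins look jointly i.i.d.: e.g. the empirical distribution
# of (X_2, ..., X_{n+1}) is ~uniform over all 2^n outcomes.
_, counts = np.unique(full[:, 1:], axis=0, return_counts=True)
print(counts.min() / trials, counts.max() / trials)   # both ~ 2**-n = 0.03125

# ...yet each coin is a deterministic function of the other n: the total
# parity is fixed by construction, so X_1 is perfectly predictable from the rest.
x1_pred = (1 + full[:, 1:].sum(axis=1)) % 2
print((x1_pred == full[:, 0]).mean())             # 1.0
```

The middle check is the whole point: drop any one coordinate and what's left is indistinguishable from $n$ independent fair coins.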

The main idea here is that, as the dimension $n$ gets larger, it is possible to construct collections of random variables where you learn *nothing at all* about a given point until you have seen a larger and larger number of the other datapoints.

The checkerboard is basically the same example as the coin-flip one: the class label is the parity of the (quantized) coordinates, just written in slightly more complicated notation.
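To make that concrete, here's the simplest ($2 \times 2$) board written as the two-coin version of the parity example (again only a sketch; the $0.5$ thresholds and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((100_000, 2))            # uniform points in the unit square
bits = (pts > 0.5).astype(int)            # one "coin" per coordinate
label = bits[:, 0] ^ bits[:, 1]           # checkerboard colour = parity of the coins

# Either coordinate alone is useless: labels are ~50/50 on both sides of the cut.
print(label[pts[:, 0] > 0.5].mean(), label[pts[:, 0] <= 0.5].mean())  # ~0.5 ~0.5
print(label[pts[:, 1] > 0.5].mean(), label[pts[:, 1] <= 0.5].mean())  # ~0.5 ~0.5

# Jointly, the two coordinates determine the label exactly: the colour is
# constant within each cell. This is why a greedy, one-feature-at-a-time
# splitter (a decision tree) sees no informative first cut.
for b0 in (0, 1):
    for b1 in (0, 1):
        cell = (bits[:, 0] == b0) & (bits[:, 1] == b1)
        print(b0, b1, label[cell].mean())  # exactly 0.0 or 1.0 in every cell
```

Every single-feature split looks equally worthless here, even though the joint rule is trivial, which is what makes this the standard "bad example" for trees.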
