Dato Winners' Interview: 1st place, Mad Professors (4 level StackNet, XGBoost)

Mega-ensembles are: slow, redundant, unwieldy, unnecessary, risky, brittle, complex, inelegant, power-wasting, environment-polluting, braindead monsters.

But this time, competitors were not penalized for complexity. As in Formula 1 racing, all that matters is how fast you can go, not that a Formula 1 car is not allowed on normal roads, or that someone with just a 2CV can never hope to beat it.

This mega-ensemble probably contains individual models that do nearly as well on their own. Without the ensemble, there is simply no way to tell how close these more production-friendly models come to the limits of what is currently possible.

Complaining about shaving off those last fractions of a percent on an optimization problem in a top-sport setting is nonsensical.

If they used online learning, their best online model might run in just a few hours and use only a few MB of memory. But that won't beat state-of-the-art in-memory techniques with more aggressive feature engineering.

In the performance map you can see that the #1 score was already reached at the 2nd level of stacking. If they were allowed to prune their solution, I bet they could also optimize for less complexity, or transfer some of the power of the beast to a shallow net.

I think morthenh was #2 with just an average of two models. These models have tremendous commercial value. They originated in a commercial setting, with more constraints on runtime and complexity. Likewise, a small subset of models from a mega-ensemble would be commercially interesting. I don't think cryptocerous understands this contest well enough to talk about commercial value. Using models like these would allow marketing companies to create native advertising that is nearly indistinguishable from non-commercial content. You'd probably wish they had no commercial value.
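To make "just an average of two models" concrete, here is a minimal sketch of blending two classifiers by averaging their predicted probabilities. It assumes scikit-learn and uses a synthetic binary-classification task; the model choices (a GBM and a random forest) are illustrative, not the actual #2 solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the contest data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Two diverse base models trained independently.
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

p_gbm = gbm.predict_proba(X_te)[:, 1]
p_rf = rf.predict_proba(X_te)[:, 1]

# The "average of two models": a plain unweighted blend.
p_avg = (p_gbm + p_rf) / 2

print(round(roc_auc_score(y_te, p_avg), 4))
```

Diversity between the two base learners is what makes the blend worthwhile; averaging two near-identical models buys almost nothing.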

I think mega-ensembles definitely have academic value. They implement the ideas; they show what really works and what adds only marginally. They show the ideas, like stacked generalization, that survive the test of time and start to blossom with better hardware and machine learning libraries. I think using concepts from differential privacy in machine learning models is cutting-edge, and is interesting from a cryptography view.
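The stacked generalization idea mentioned above can be sketched in a few lines: base models produce out-of-fold predictions on the training set, and a meta-learner is trained on those predictions. This is a minimal one-level sketch assuming scikit-learn; the data and model choices are illustrative, not the winners' StackNet.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the contest data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=1),
    ExtraTreesClassifier(n_estimators=100, random_state=1),
]

# Level-1 features: out-of-fold predictions, so the meta-learner never
# sees a base model's prediction on that model's own training rows.
train_meta = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# For the test set, refit each base model on all training data.
test_meta = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Level-2 meta-model stacked on top of the base predictions.
meta = LogisticRegression().fit(train_meta, y_tr)
stack_pred = meta.predict_proba(test_meta)[:, 1]

print(round(roc_auc_score(y_te, stack_pred), 4))
```

A 4-level StackNet repeats this pattern, feeding each level's out-of-fold predictions to the next level as features.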

Finally, I do not think that mega-ensembles are ugly. They combine many learning theories to profit from all their strengths. Did they need a perceptron in there? Probably not. Should they put it in there nevertheless? Sure, why not. Better than throwing the model (and spent energy) away.

I think mega-ensembles are at the point where neural nets once were. The theory behind them has matured to the point of making them practically deployable; it's just that the hardware is currently lacking a bit. Over the next 5 years we will see many more network-on-network-on-network-in-network structures appear. And if deep learning is allowed to stack layer upon layer of perceptrons and still call it a single model, then why should a stack of diverse model nodes be considered intrinsically different?

As to why you are encouraged to use massive ensembles: you are competing against many more competitors these days in a one-vs-all manner. You have to somehow turn this into a many-vs-all approach to beat the luck, and the tuning, of a thousand "monkeys-on-typewriters" random grid searches.

/r/MachineLearning Thread Parent Link - blog.kaggle.com