Day-to-day activity and challenges

Kaggle is about hyper-optimization. There is no penalty for long training and testing times, nor for the complexity of the solution. In real-life data science, hyper-optimization is only part of the equation, not the entire thing.

1) Domain knowledge can hamper hyper-optimization. Human bias / "intuition" means you skip things that could yield signal, or cling to things that "must" work ("that interaction will never work!", "'age' must be added, it is clearly informative!"). On Kaggle the top competitors brute-force a dataset for months to find all the signal in there. Once a plateau is reached, you need extreme measures to make it to the top. Hyper-optimization means (automatically) trying everything. Randomness in machine learning is not a bad thing! All said: feature engineering is a major part of winning Kaggle competitions, but still just a part. Remember that while an evaluation metric is objective, you are also free to exploit leakage.
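The "try everything, with randomness" idea in point 1 can be sketched as a plain random search over a hyperparameter space. This is only an illustration: `cv_score` here is a made-up stand-in for a real cross-validated model score, and the parameter names just mimic common gradient-boosting knobs, not any specific library's API.

```python
import random

def cv_score(params):
    # Hypothetical objective: in practice this would train a model with
    # these hyperparameters and return its cross-validated score.
    # Here, a toy surface that peaks at max_depth=6, learning_rate=0.1.
    return -((params["max_depth"] - 6) ** 2) - 100 * (params["learning_rate"] - 0.1) ** 2

def random_search(n_trials, seed=0):
    # No intuition, no bias: just sample hyperparameters at random
    # and keep whatever scores best.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(2, 12),
            "learning_rate": rng.choice([0.01, 0.05, 0.1, 0.3]),
        }
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(500)
print(best, score)
```

With enough trials this converges on the same optimum a human might have ruled out by "intuition", which is the point: randomness explores combinations a biased searcher would never try.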

2) My work is very much like certain Kaggle competitions, but with a lot more data cleaning, a lot more human elements, and more worries about implementation, interpretation, and concept drift :). It covers everything from supervised and unsupervised learning to classification, regression, and data mining. XGBoost rocks for both Kaggle and business.

/r/MachineLearning Thread