One of the things I ran into today is using bagging as a way to approximate a full distribution rather than just a point estimate. You can then use it to penalize predictions with high uncertainty, which helps reduce false positives.

To me it sounds like a handy trick, and I found roughly *0 hits on Google*, so I wanted to share it. Of course, I could have gotten something completely backwards (my ML skills have some noticeable gaps), so please let me know if that is the case.

Here is a small toy model to explain the idea. Suppose you have observations $$ x_i $$ drawn from $$ \mathcal{N}(0, 1) $$. There is also $$ y_i = 0.2 + 0.3 / (1 + x_i^2) $$, and a label $$ z_i $$ sampled from the Bernoulli distribution given by $$ P(z_i = 1) = y_i $$ (that is, we simply flip a weighted coin whose odds are determined by $$ y_i $$).
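The toy data can be generated in a few lines. This is a minimal sketch, not the original script: the sample size of 1000 is taken from the subsampling description later in the post, and the seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice
n = 1000                        # assumed sample size (the post later mentions 1000 data points)

x = rng.standard_normal(n)      # x_i ~ N(0, 1)
y = 0.2 + 0.3 / (1 + x ** 2)    # true probability y_i, always in (0.2, 0.5]
z = rng.binomial(1, y)          # z_i ~ Bernoulli(y_i): one weighted coin flip per point
```

Only `x` and `z` are visible to the model; `y` is the hidden quantity we would like to recover.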

Gradient boosted decision trees (GBDTs) are among the state of the art for regression and classification and can perform remarkably well. They are available in scikit-learn, so you can easily plug them into an existing script.

Let’s say you want to find the choice of $$ x $$ that gives the largest value of $$ y $$. For example, you may have click data and want to find the item most likely to be clicked. Fit a GBDT to the $$ (x_i, z_i) $$ observations to get a model that predicts whether an item will be clicked based on its value of $$ x $$.
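A single-model fit might look roughly like the sketch below. It assumes scikit-learn's `GradientBoostingClassifier` with default hyperparameters and an evaluation grid on $$ [-5, 5] $$; the post does not say which estimator or settings were actually used, so treat these as illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Regenerate the toy data so this snippet runs on its own.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
z = rng.binomial(1, 0.2 + 0.3 / (1 + x ** 2))

# scikit-learn expects a 2-D feature matrix, hence the reshape.
model = GradientBoostingClassifier()  # default settings, an assumption
model.fit(x.reshape(-1, 1), z)

# Predicted click probability on a grid (the grid range is an arbitrary choice).
grid = np.linspace(-5, 5, 201)
p_click = model.predict_proba(grid.reshape(-1, 1))[:, 1]

# The x the single model considers best.
best_x = grid[np.argmax(p_click)]
```

Nothing in this procedure tells you how much data supports the prediction at `best_x`, so a value far out in the tails can win.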

Unfortunately, we may end up with wild predictions. In this particular case, a single noisy point in the training data around $$ x_i = -4 $$ makes the model believe that large negative values of $$ x $$ maximize $$ y $$.

Here is the trick. Instead of training a single GBDT, train 100 small GBDTs on strongly subsampled data (instead of using all 1000 data points, sample 100 with replacement). Now you can use the predictions of all the GBDTs to get an idea of the uncertainty of $$ y $$. This is great because (among other things) you can penalize uncertain estimates. In this case, I simply picked the value at the 20th percentile. I’m still not sure whether the resulting distribution of $$ y $$ represents a proper probability distribution, but I think this is just an example of bootstrapping. You could also use Bayes’ theorem with a prior to derive a posterior distribution.
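The bagging step can be sketched as follows. The counts (100 models, 100 points each, 20th percentile) come from the text; the use of `GradientBoostingClassifier`, the values `n_estimators=20` and `max_depth=2` chosen to make each model "small", and the evaluation grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Regenerate the toy data so this snippet runs on its own.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
z = rng.binomial(1, 0.2 + 0.3 / (1 + x ** 2))
grid = np.linspace(-5, 5, 201)  # evaluation grid, an arbitrary choice

n_models, n_sub = 100, 100      # 100 small models, each trained on 100 points
preds = np.empty((n_models, grid.size))

for m in range(n_models):
    # Bootstrap subsample: indices drawn with replacement.
    idx = rng.integers(0, x.size, size=n_sub)
    # "Small" GBDT; these hyperparameters are illustrative assumptions.
    gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=m)
    gbdt.fit(x[idx].reshape(-1, 1), z[idx])
    preds[m] = gbdt.predict_proba(grid.reshape(-1, 1))[:, 1]

# Per-grid-point summaries across the ensemble.
p20 = np.percentile(preds, 20, axis=0)  # pessimistic estimate used for ranking
p50 = np.percentile(preds, 50, axis=0)  # median, for comparison

# Choose x by the pessimistic estimate instead of a single point estimate.
best_x = grid[np.argmax(p20)]
```

Where the models disagree, typically in sparsely sampled regions, the 20th percentile sits well below the median, so those regions are pushed down in the ranking.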

Why is this useful? For example, when recommending music on Spotify, it’s much more important to err on the safe side and remove false positives at the expense of false negatives. If you can explicitly penalize uncertainty, you can focus on recommendations that are a safe bet and have more support in your historical data.
