Parameter Optimization and Early Exit
Andrew Gelman calls this the folk theorem of statistical computing:
When you have computational problems, often there’s a problem with your model.
(This is not a folk theorem per se, as it is not a theorem; but the name stuck. [1])
There is an interesting corollary:
When you are performing parameter selection using cross-validation or a similar blind procedure, the bad parameter choices take longer than the good ones!
This reminds me of the advertising quip (typically attributed to John Wanamaker) that Half the money I spend on advertising is wasted; the trouble is I don't know which half.
Of course, you can reply: if I knew the right parameters, then you would not need to find them in the first place. Still, understanding that your computation is slower in the really bad cases can be actionable.
§
Here is an example from machine learning: if you are using a support vector machine based system, you will often need to fit two parameters:
The SVM penalty $C$.
The kernel parameter (in the case of radial basis functions, the width $sigma$).
A simple solution is to try a grid of these parameters:
for c in [2**(-2), 2**(-1), 2**0,...]: for s in [2**(-2), 2**(-1), 2**0,...]: use cross validation to evaluate using c & s pick best one
There is a fair share of wasted computation. For example, let's say you have a parameter choice that is so awful it gives you 100% error rate and another which is so good it gives you 0% error. If you are lucky and you check the good combination first, you can abort the bad parameter choices early: the first time you see an error, you know it will never be as good.
This leads to the following algorithm:
error = { param -> 0 for all parameters } until done: next_param = parameter with less error which has not run all folds run crossvalidation fold on next_param error[next_param] += error on this fold best_err = error[ best completed_parameter_value ] if best_err is minimum error: return best_completed_parameter_value
This aborts as early as possible with the best error.
§
This is implemented in my machine learning package (github link) by default.
I tested it on murphylab_slf7dna (with a bit of hacking of the internals to print statistics &c). I see that fitting with the right parameters takes 650ms (after preprocessing). We check a total of 48 parameter values. So we might expect that to take 0.65 * 48 = 31s. Since bad parameters take longer, it actually takes 48s (50% longer).
Using the early exit trick, it takes it down to 24s. This is half the time of running the full grid. This despite the fact that slightly more than the full grid was run: 57%. [1] A really interesting question is whether you can formalise and prove it. A good model will often be one that has a nice little "fitness" peak, which is also easy to fit. Likelihood functions with local optima all over the place or ill-conditioned may correspond to worse models. There may be a proof hiding in here.