As mentioned in the post on practical surrogate modeling, special care should be taken when using and reporting performance metrics. We will discuss three major points here:

  1. Use of \(p\)-values for regression coefficients.
  2. Use of \(R^2\) for regression analysis.
  3. Use of accuracy for classification results.

These have already been discussed around the web, and Point #1 has received particular attention, even being the subject of an official statement by the American Statistical Association. Their advice (in a nutshell): do NOT use \(p\)-values at all. Their (politically correct) conclusion is:

Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.

All is said in the last phrase: “No single index should substitute for scientific reasoning.” Here, I would only replace “scientific reasoning” with “objective reasoning in context.”

Point 2

So, what about Point #2? Is \(R^2\) also dangerous? The answer is yes; please see C. Shalizi’s writings on the subject. He, and others, recommend not using \(R^2\) either. Here, I will explain why, and under which circumstances it can still be useful. But the principle of “no single index” remains valid, so we should definitely base our evaluation on supplementary, or alternative, metrics.

What is so “dangerous” about \(R^2\)? Well, the list is quite long - see this presentation. In summary:

  • \(R^2\) does not measure goodness of fit (see the sketch after this list).
  • \(R^2\) does not measure predictive error.
  • \(R^2\) does not allow you to compare models using transformed responses.
  • \(R^2\) does not measure how well one variable explains another, contrary to its usual interpretation.
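To see the first point in action, here is a minimal sketch (assuming only NumPy): a straight line is fitted to clearly quadratic data, and \(R^2\) still comes out high even though the model is systematically wrong.

```python
# Minimal sketch: R^2 can be high even when the model is systematically wrong.
# We fit a straight line to clearly curved (quadratic) data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = x**2 + rng.normal(scale=2.0, size=x.size)    # quadratic ground truth plus noise

# Ordinary least-squares straight-line fit
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 of the (wrong) linear model: {r2:.2f}")   # around 0.93

# The residuals reveal the systematic misfit that R^2 hides:
res = y - y_hat
print("mean residual per third of the x-range:",
      [round(seg.mean(), 2) for seg in np.array_split(res, 3)])
# roughly [+4, -7, +4]: the line under- and over-shoots in a structured way
```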

On the positive side, \(R^2\) can be used for:

  1. Determining whether a change of explanatory variable improves the fit.
  2. Comparing multi-variable models with different subsets of explanatory variables.
  3. Indicating collinearity, e.g., when the overall \(R^2\) of the fit is high but the individual coefficients’ \(p\)-values are large.

What are the alternative, or supplementary, metrics that should be used here? There is general agreement that a simple error metric, together with a visual appraisal based on a predicted vs. actual values plot, is a good path to follow. Usually we use an \(L_1\)- or \(L_2\)-based metric, such as

  • MAE = mean absolute error
  • MSE = mean squared error
  • RMSE = root mean squared error

defined as

\[\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i | ,\] \[\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ,\] \[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} ,\]

where \(y_i\) are the actual and \(\hat{y}_i\) are the predicted values of the response \(y.\)
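These are straightforward to compute; here is a minimal sketch, assuming NumPy, scikit-learn, and matplotlib, where `y_true` and `y_pred` are placeholder arrays standing in for the actual and predicted responses of your model.

```python
# Minimal sketch: the three error metrics plus a predicted vs. actual plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([2.3, 1.8, 3.1, 4.0, 2.7])    # actual responses (placeholders)
y_pred = np.array([2.1, 2.0, 2.8, 4.3, 2.5])    # model predictions (placeholders)

mae = mean_absolute_error(y_true, y_pred)    # (1/n) * sum |y_i - y_hat_i|
mse = mean_squared_error(y_true, y_pred)     # (1/n) * sum (y_i - y_hat_i)^2
rmse = np.sqrt(mse)                          # same units as the response
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")

# Visual appraisal: points should hug the diagonal (perfect-prediction line).
plt.scatter(y_true, y_pred)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
plt.plot(lims, lims, "k--")
plt.xlabel("actual response")
plt.ylabel("predicted response")
plt.show()
```

Both the MAE and the RMSE are expressed in the units of the response itself, which makes them easier to interpret in context than the MSE.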

Point 3

Finally, for Point #3, the global accuracy of a classification method is usually reported as the number of correct classifications divided by the total number of (test) samples. That is, in the confusion matrix (truth table) of predicted vs. true labels, we sum the diagonal elements and divide by the sum of all the elements in the matrix,

\[a = \frac{\sum_{i=1}^K m_{ii}}{\sum_{i,j=1}^K m_{ij}} ,\]

where there are \(K\) classes.
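In code, this is just the trace of the confusion matrix divided by its total; here is a minimal sketch, assuming scikit-learn and purely illustrative labels.

```python
# Minimal sketch: global accuracy from the confusion matrix (K = 3 classes here).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["A", "A", "B", "B", "C", "C", "C", "A"]   # true labels (illustrative)
y_pred = ["A", "B", "B", "B", "C", "A", "C", "A"]   # predicted labels (illustrative)

M = confusion_matrix(y_true, y_pred)   # K x K matrix of counts m_ij
a = np.trace(M) / M.sum()              # sum of diagonal / sum of all elements
print(M)
print(a, accuracy_score(y_true, y_pred))   # both give 0.75 here
```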

This overall accuracy, routinely reported for classification methods, can be very misleading. We recall its definition in the binary case (two classes, Positive or Negative),

\[\mathrm{accuracy} = \frac{ \mathrm{TP}+ \mathrm{TN}}{ \mathrm{TP}+ \mathrm{TN} + \mathrm{FP}+ \mathrm{FN}},\]

the proportion of true positive and negative classifications divided by the total number of samples (true and false positives and negatives). This is a good metric only when the classes are balanced, i.e., when we have approximately the same number of samples in each class. For example, if class A has 95 members and class B has 5, then a classifier that simply predicts A for every sample achieves an accuracy of \(0.95\), yet we have not measured the quality of the method itself. For this, we must resort to the quantities known as precision and recall (also called “sensitivity” in the binary classification case), or their harmonic mean, the \(F_1\) score. Let us recall the definitions and see which one should be used in a particular context. The precision measures the proportion of correct positive classifications, TP, among all the positives identified, TP plus FP,

\[\mathrm{precision} = \frac{ \mathrm{TP}}{ \mathrm{TP}+ \mathrm{FP}}.\]

This metric should be used when the cost of a false positive is high and we want to avoid FPs as much as possible; think of preventative maintenance, where halting and repairing a machine unnecessarily is too costly.
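To make the 95/5 example above concrete, here is a minimal sketch, assuming scikit-learn; the “classifier” that always predicts class A is of course hypothetical.

```python
# Minimal sketch: accuracy looks excellent on imbalanced data even when the
# classifier is useless.  Class B (label 1) is the rare positive class.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)   # 95 samples of class A, 5 of class B
y_pred = np.zeros(100, dtype=int)       # a "classifier" that always predicts A

print(accuracy_score(y_true, y_pred))                    # 0.95 -- looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  -- no positives predicted at all
print(recall_score(y_true, y_pred))                      # 0.0  -- every B is missed
```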

The recall metric gives the number of positive cases correctly identified out of the total number of positives in the dataset,

\[\mathrm{recall} = \frac{ \mathrm{TP}}{ \mathrm{TP} + \mathrm{FN}}.\]

This metric is better suited to contexts where it is important to identify as many positives as possible—think of the case where the failure of a critical component in a machine could have disastrous consequences. Finally, the \(F_1\) score, defined as

\[F_1 = \frac{ 2}{ \mathrm{recall}^{-1} + \mathrm{precision}^{-1} },\]

provides a balanced metric between recall and precision. One should always extract a detailed report from the model, in which these quantities are displayed for each class (a minimal sketch follows the list below). Better conclusions can then be drawn, taking into account

  • class imbalances,
  • the context-dependent gravity of the different types of misclassification.
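Here is such a per-class report, as a minimal sketch assuming scikit-learn; the labels are purely illustrative, with class “fail” playing the role of a rare positive class.

```python
# Minimal sketch: per-class precision, recall and F1 via classification_report.
from sklearn.metrics import classification_report

# Illustrative imbalanced data: 90 "ok" samples vs. 10 "fail" samples.
y_true = ["ok"] * 90 + ["fail"] * 10
y_pred = ["ok"] * 85 + ["fail"] * 5 + ["fail"] * 6 + ["ok"] * 4

print(classification_report(y_true, y_pred, digits=3))
# The report lists precision, recall and F1 for each class, together with the
# support (number of true samples per class), which exposes any imbalance.
```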

My colleague Gaël Varoquaux, director of the scikit-learn foundation, has an excellent tutorial on his site, entitled “Understanding and diagnosing your machine-learning models,” which has a section on metrics for judging the success of a model.