Credit Analytics Statistical Models’ Backtesting and Recalibration: A Primer (2024)

Introduction

Model backtesting and recalibration are important and natural stages in the lifecycle of any statistical model and should be performed on an annual basis. At S&P Global Market Intelligence, we provide risk managers at non-financial and financial corporations with a suite of statistical models to help evaluate counterparty credit risk. The models form part of the Credit Analytics offering. We produce annual backtesting reports for each of our models, which provide comprehensive assessments of model performance using data from the most recent calendar year. Whenever significant model performance deterioration is observed in the annual model backtesting process, model recalibration becomes necessary. The following sections explain backtesting and recalibration within the context of S&P Global Market Intelligence models.

Model Backtesting

Our quantitative credit risk models are developed on an extensive database (including company financials and other market-driven information, as well as macroeconomic and socio-economic factors) using advanced optimization techniques, and typically show strong in-sample performance during development. For risk managers, however, out-of-sample performance on new data matters just as much. Annual model backtesting serves the purpose of testing out-of-sample performance, often using the most recent calendar year’s data. Generally speaking, a good predictive statistical model should demonstrate good in-sample performance as well as strong and stable backtesting (out-of-sample) performance. Otherwise, it may indicate problems during development (such as overfitting or multi-collinearity) and the need to recalibrate, which we cover in more detail later.

Backtesting Methodology

First, the most recent calendar year’s available and unbiased observations from the pre-scored database are collected as the backtest sample. Then, model assumptions are checked to confirm they are still applicable. Next, we generate the same statistics that were used to measure model performance during development. If all performance results are satisfactory, we conclude that our models can continue to serve their purposes.

Because different models serve different objectives and rely on distinct modeling techniques and methodologies, the corresponding performance metrics may vary from one model to another.

Probability of Default (PD) Models

For PD models, including PD Model Fundamentals (PDFN) and PD Model Market Signals (PDMS), the core output is a one-year forward-looking PD value. The following primary test results are presented in the main part of the annual backtesting report:

  1. Model Accuracy: The area under the Receiver Operating Characteristic (ROC) curve is used to assess the model’s ability to correctly discriminate defaulters from non-defaulters; an ROC of 80% or above indicates good model performance (see the sketch following this list).
  2. Average PD values versus Observed Default Rates (ODRs): The average PD of defaulters is expected to be much larger than that of non-defaulters, and a clear separation of PD-mapped score distributions is a good sign of the model’s strong discriminatory power. The mean PD value is also compared with the ODR after being mapped to a credit score.[1] A difference of two notches or less suggests low model risk.
  3. Mobility: This reflects the stability of the model’s outputs and relates to the chosen “philosophy” behind each model. A model with a mobility metric above 80% is usually deemed a “Point-in-Time” model. As demonstrated in the latest backtesting report, PDMS, a market-driven model, has a much higher mobility metric (approximately 88%) than PDFN (approximately 53%), which generates “Through-the-Cycle” PD values.
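
To make these measures concrete, below is a minimal sketch on synthetic data using NumPy and scikit-learn. The 80% ROC reading follows the report; the mobility definition (the share of companies whose mapped score changes between two consecutive years) and all data are simplified assumptions for illustration, not the production methodology.

```python
# Hedged illustration of ROC-based accuracy and a simplified mobility metric.
# All data are synthetic; this does not reproduce Credit Analytics internals.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic one-year backtest sample: 1 = defaulted, 0 = survived.
defaulted = rng.binomial(1, 0.03, size=5_000)
# Synthetic PDs: defaulters receive systematically higher PDs on average.
pd_values = np.clip(rng.beta(2, 50, size=5_000) + 0.05 * defaulted, 0.0, 1.0)

# Model accuracy: area under the ROC curve measures discriminatory power.
auc = roc_auc_score(defaulted, pd_values)
print(f"ROC: {auc:.1%} (80% or above is read as good performance)")

# Mobility (simplified assumption): share of companies whose mapped credit
# score changed between two consecutive years.
score_prev = rng.integers(1, 21, size=1_000)
score_curr = np.clip(score_prev + rng.integers(-2, 3, size=1_000), 1, 20)
mobility = np.mean(score_prev != score_curr)
print(f"Mobility: {mobility:.0%} (above 80% suggests a Point-in-Time model)")
```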

The table below summarizes the secondary analyses presented in the Appendix of the annual backtesting report.

| Secondary Analysis | Description |
| --- | --- |
| Distribution of input variables | Check whether there are outliers in the backtesting sample. |
| Correlation between input variables | Test whether the rank correlation matrix changes significantly compared to that used in model development. |
| Percentiles of PD values | Present several PD percentile values from the backtesting sample. |
| Trend analysis | Display the trends of PD values of defaulters and non-defaulters within the two years prior to default. |
| Log-likelihood, Geometric Mean Probability (GMP), and Pick-Up over Naïve Model[2] | Log-likelihood and GMP measure the quality of the model’s fit to the backtesting sample, while the pick-up quantifies how much the statistical model outperforms the naïve model. |
| Type I and Type II errors | The two error types are shown under different cut-off values to depict the model’s discriminatory power. |
| Bias review | Monte Carlo tests compare the level of estimated PDs with the observed default frequency under different correlation assumptions between input variables. |
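
The sketch below illustrates several of these fit-quality measures on synthetic data: log-likelihood, GMP, pick-up over the naïve model, and Type I/II errors at a chosen cut-off. The pick-up definition here (the ratio of the two GMPs minus one) is an assumption; the published reports may define it differently.

```python
# Hedged sketch of fit-quality measures on synthetic data.
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of observed outcomes y under predicted PDs p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gmp(y, p):
    """Geometric mean probability the model assigns to the realised outcomes."""
    return np.exp(log_likelihood(y, p) / len(y))

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.03, size=5_000)                 # observed defaults
p = np.clip(rng.beta(2, 50, size=5_000) + 0.05 * y, 1e-6, 1 - 1e-6)

# Naive benchmark: constant PD equal to the observed default rate (footnote 2).
p_naive = np.full_like(p, y.mean())

print(f"log-likelihood: {log_likelihood(y, p):.1f}")
print(f"GMP (model vs. naive): {gmp(y, p):.4f} vs. {gmp(y, p_naive):.4f}")
print(f"pick-up over naive: {gmp(y, p) / gmp(y, p_naive) - 1:+.2%}")

# Type I error: defaulters classified as safe at the cut-off (missed defaults);
# Type II error: non-defaulters flagged as risky (false alarms).
cutoff = 0.05
type_i = np.mean(p[y == 1] < cutoff)
type_ii = np.mean(p[y == 0] >= cutoff)
print(f"Type I / Type II at cut-off {cutoff}: {type_i:.1%} / {type_ii:.1%}")
```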

Scoring Models

Credit Analytics’ CreditModel™ is a scoring model that aims to generate credit scores that statistically match S&P Global Ratings credit ratings. In this case, the most important performance metric is the agreement between the model’s outputs and companies’ actual S&P Global Ratings Standalone Credit Profiles (SACPs). We provide primary backtesting results from three main perspectives:

  1. Statistical Alignment: For each sub-model, match ratio statistics are generated to reflect how well the model’s output aligns with the SACPs. For example, the exact match ratio shows the percentage of companies whose scores are the same as their SACP, while the match ratio within one notch measures the percentage of companies whose scores are no more than one notch away from their SACP. The low-model-risk thresholds are set at 20% for an exact match, 55% for within one notch, and 80% for within two notches (see the sketch following this list).
  2. Neutrality: Neutrality measures the average signed distance, in notches, between company scores and corresponding SACPs. Model risk is low if neutrality falls between -0.8 and 0.8.
  3. Mobility: CreditModel typically has a mobility value of around 40%-50%, given the inherent stability of the rating assessments that the model tries to mimic.
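
A minimal sketch of these alignment statistics follows. The integer-notch encoding of scores and all data are illustrative assumptions.

```python
# Hedged sketch of match ratios and neutrality on synthetic data.
import numpy as np

rng = np.random.default_rng(2)

# Scores and SACPs encoded as integer notches on a shared 21-notch scale.
sacp = rng.integers(1, 22, size=2_000)
model_score = np.clip(sacp + rng.integers(-2, 3, size=2_000), 1, 21)

diff = model_score - sacp
exact = np.mean(diff == 0)
within_one = np.mean(np.abs(diff) <= 1)
within_two = np.mean(np.abs(diff) <= 2)
neutrality = diff.mean()              # signed average distance in notches

print(f"exact match: {exact:.0%} (low-risk threshold: 20%)")
print(f"within one notch: {within_one:.0%} (threshold: 55%)")
print(f"within two notches: {within_two:.0%} (threshold: 80%)")
print(f"neutrality: {neutrality:+.2f} (low risk if between -0.8 and 0.8)")
```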

In general, higher match ratio statistics and a neutrality closer to zero indicate good model performance. The table below displays additional information shown in the Appendix of annual backtesting reports.

| Secondary Analysis | Description |
| --- | --- |
| Performance by industry | Repeat the statistical alignment and neutrality tests for each industry. |
| Performance by asset class | Separate the backtesting sample into two asset classes and generate match ratio statistics and neutrality. |
| Impact of Parental & Government (P&G) overlay | Test how the P&G overlay would impact the model’s performance. |

Model Recalibration

Generally speaking, a predictive statistical model could function very well for a few years after release. As time goes by, however, its predictive power may be weakened, either due to a structural change of underlying risk drivers that determine a company’s creditworthiness, or as a result of different market conditions: macroeconomic environment (e.g. global recessions), industry structure, business patterns, accounting standards, etc. Over time, some inputs may become more important than others and, thus, need higher weights or need to be replaced/excluded. In this case, model recalibration becomes necessary and is usually triggered by a significant deterioration in the annual model backtesting results when compared to the original model development. At S&P Global Market Intelligence, model recalibration is not only about refreshing the variables’ coefficients, but it also serves the purpose of reviewing the model in more detail (including methodology, data, structure, etc.).

Apart from model performance deterioration, model recalibration could also be triggered by the need to reflect the latest years’ observations. For example, the recalibrated PDFN - Public Corporates includes nearly 10,000 new observations by expanding the training period from Fiscal Year (FY) 2002 - FY2012 to FY2002 - FY2016. Hence, regardless of model performance deterioration, we recalibrate models around every four to six years. Moreover, whenever we kick off a model recalibration, we also try to incorporate client feedback. For instance, high volatility of PDs was a concern with PDMS 1.1, which has been resolved in the recalibrated PDMS 2.0.

Recalibration Methodology

Model recalibration often consists of the following steps:

  1. Revision of model inputs: We test existing independent variables to check whether they still play significant roles in determining the model’s output, and add or delete variables when necessary. We may also replace some variables to align with S&P Global Market Intelligence’s Credit Assessment Scorecards’ (Scorecards) inputs and S&P Global Ratings general criteria.
  2. Modification of model structures: Sometimes there is a need to change model features to enhance performance and/or simplify the model structure. For example, in PDFN 1.1 financial risk variables and business risk variables were fitted separately to obtain a financial risk PD and a business risk PD, which were then combined into the final overall PD. In contrast, the recalibrated PDFN 2.0 adopts a simpler, easier-to-understand framework that fits all independent variables simultaneously to obtain the final PD.
  3. Model retraining: After finalizing all independent variables and the model structure, we retrain the model and obtain a new set of parameters reflecting the updated relationship between input variables and model output (see the sketch following this list).
  4. Adjustment: The last step is to align the model output with certain benchmarks, such as S&P Global Ratings credit ratings and the Risk Dashboard from the European Banking Authority, to reflect the population universe.[3]
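
As an illustration of step 3, the sketch below retrains a generic logistic default model on an expanded training sample and compares the coefficients with the prior fit. The two synthetic risk factors and the logistic form are assumptions for illustration only; they are not the actual PDFN inputs or structure.

```python
# Hedged sketch of model retraining on an expanded training window.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_sample(n):
    """Two synthetic risk factors and simulated default outcomes."""
    X = rng.normal(size=(n, 2))
    logit = -3.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, y

X_old, y_old = make_sample(20_000)     # original training window
X_new, y_new = make_sample(10_000)     # newly added fiscal years

old_model = LogisticRegression().fit(X_old, y_old)
new_model = LogisticRegression().fit(np.vstack([X_old, X_new]),
                                     np.concatenate([y_old, y_new]))

print("coefficients before retraining:", np.round(old_model.coef_[0], 3))
print("coefficients after retraining: ", np.round(new_model.coef_[0], 3))
```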

The table below outlines the major changes for the recalibrated Credit Analytics models that will be released in 2020. The common feature is that the recalibrated models include recent observations in the training process, while their performance remains at a similarly high level or improves.

| Incoming Models | Highlights |
| --- | --- |
| CreditModel 3.0 | High correlation issues are resolved; performance for specific sectors and for China is optimized; the Country Risk Score is aligned with that used in Scorecards. |
| PDFN 2.0 | Non-significant and highly correlated variables are dropped; a few input variables are replaced; the model structure is simplified without compromising performance; the Country Risk Score and Corporate Industry Risk Score are aligned with those used in Scorecards. |
| PDMS 2.0 | The PD distribution is more closely aligned with S&P Global Ratings credit ratings; PD volatility is reduced; the PD size adjustment is more granular; a Credit Default Swap Market Derived Spread adjustment is introduced; the Country Risk Score is aligned with that used in Scorecards. |

For more information on each model and the specific enhancements when compared to the prior version, please refer to the Credit Analytics “Help” section on the S&P Capital IQ platform.

[1] Lowercase nomenclature is used to differentiate S&P Global Market Intelligence credit model scores from the credit ratings issued by S&P Global Ratings.

[2] The naïve model randomly guesses the outcome (without any discriminatory power), yet gives the correct average default rate.

[3] Source: Risk Dashboard, European Banking Authority, https://eba.europa.eu/risk-analysis-and-data/risk-dashboard.
