
Measuring Performance Beyond Common Metrics: Aligning Offline Evaluations with Real-World Key Performance Indicators

Strong offline metrics don't guarantee real-world success. This analysis explores the gap between offline and online evaluation and how to align your models with real-world KPIs.

Madhura Raut, Principal Data Scientist at Workday, is leading the charge in designing large-scale machine learning systems for labor demand forecasting. A seasoned keynote speaker at prestigious data science conferences, Raut has also served as a judge and mentor at multiple codecrunch hackathons. Her colleague, Anima Anandkumar, is the lead researcher for large-scale machine learning systems for workforce planning at Workday.

The duo and their team are tackling a common challenge in the field of machine learning: the online-offline gap. This discrepancy between offline simulations and actual online results can hinder the effectiveness of machine learning models.

To bridge this gap, the team has been exploring various strategies. One approach is to analyse the correlation between offline metrics and online results from past launches. This helps identify which offline metrics are reliable predictors of online success.
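For instance, a team that keeps a log of past launches can rank-correlate each offline metric against the online KPI it is meant to predict. The sketch below assumes a hypothetical experiment log with made-up column names (offline_rmse, offline_auc, online_conversion_lift) and uses Spearman correlation from SciPy; it is illustrative rather than the team's actual tooling.

```python
# Sketch: rank-correlate offline metrics with an online KPI across past launches.
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical experiment log: one row per past model launch.
experiments = pd.DataFrame({
    "offline_rmse":           [0.42, 0.39, 0.45, 0.37, 0.40],
    "offline_auc":            [0.71, 0.74, 0.69, 0.76, 0.72],
    "online_conversion_lift": [0.8,  1.9,  0.2,  2.4,  1.1],   # % lift from A/B tests
})

for metric in ["offline_rmse", "offline_auc"]:
    rho, p_value = spearmanr(experiments[metric], experiments["online_conversion_lift"])
    print(f"{metric}: Spearman rho={rho:.2f} (p={p_value:.2f})")
```

A metric whose rank correlation with the online KPI is consistently weak is a poor gatekeeper for launch decisions, however familiar it is.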

However, the team recognised that a traditional evaluation framework might not suffice. To address this, they redefined their evaluation framework to include a custom business-weighted metric. This metric penalises underprediction more heavily for trending products and explicitly tracks stockouts.
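The article does not spell out the exact formula, but a business-weighted metric of this kind could look like the sketch below: an asymmetric error that charges more for underprediction, charges still more when the product is flagged as trending, plus a separate stockout-rate tracker. The penalty weights (under_penalty, trend_boost) are illustrative placeholders, not the team's actual values.

```python
import numpy as np

def business_weighted_error(y_true, y_pred, is_trending,
                            under_penalty=3.0, trend_boost=2.0):
    """Asymmetric forecast error: underprediction costs more than overprediction,
    and more still for trending products. Weights are illustrative assumptions."""
    error = np.asarray(y_true) - np.asarray(y_pred)          # positive => under-forecast
    weights = np.where(error > 0, under_penalty, 1.0)        # penalise underprediction
    weights = weights * np.where(is_trending, trend_boost, 1.0)
    return float(np.mean(weights * np.abs(error)))

def stockout_rate(y_true, y_pred):
    """Share of items where the forecast (and hence the stock it drives)
    fell short of realised demand."""
    return float(np.mean(np.asarray(y_pred) < np.asarray(y_true)))
```

Reporting both numbers side by side keeps the evaluation anchored to the business cost of an empty shelf rather than to symmetric error alone.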

Simulating interactions using methods like bandit simulators and counterfactual evaluation is another strategy the team is employing. These techniques help narrow the online-offline gap by providing a more realistic simulation of user behaviour.
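Counterfactual (off-policy) evaluation typically re-weights logged online interactions so that data collected under the current policy can score a candidate policy before it ships. A minimal inverse-propensity-scoring sketch on synthetic logs, rather than any system described in the article, might look like this:

```python
import numpy as np

def ips_estimate(logged_rewards, logged_propensities, new_policy_probs, clip=10.0):
    """Inverse-propensity-scoring estimate of a candidate policy's expected reward,
    computed from logs collected under the logging policy. Weight clipping tames variance."""
    weights = np.minimum(new_policy_probs / logged_propensities, clip)
    return float(np.mean(weights * logged_rewards))

# Hypothetical logged data: observed rewards, the probability the logging policy
# assigned to each chosen action, and the probability the candidate policy would assign.
rng = np.random.default_rng(0)
logged_rewards      = rng.binomial(1, 0.1, size=10_000).astype(float)
logged_propensities = rng.uniform(0.05, 0.5, size=10_000)
new_policy_probs    = rng.uniform(0.05, 0.5, size=10_000)

print(f"Estimated reward under candidate policy: "
      f"{ips_estimate(logged_rewards, logged_propensities, new_policy_probs):.3f}")
```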

The team also advocates for choosing multiple proxy metrics that approximate business outcomes. This approach can help reduce the online-offline discrepancy by providing a more comprehensive view of potential outcomes.
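One simple way to act on several proxy metrics at once is to normalise them and roll them into a single launch-readiness score. The metric names and weights below are illustrative assumptions, not values recommended in the article:

```python
def composite_proxy_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of normalised proxy metrics (all assumed to lie in [0, 1]).
    Names and weights are hypothetical examples."""
    return sum(weights[name] * value for name, value in metrics.items())

candidate = {"ndcg_at_10": 0.62, "calibration": 0.91, "coverage": 0.78}
weights   = {"ndcg_at_10": 0.5,  "calibration": 0.3,  "coverage": 0.2}
print(f"Composite proxy score: {composite_proxy_score(candidate, weights):.3f}")
```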

A practical example of the team's work involves a retailer that deployed a new demand forecasting model evaluated solely on RMSE and saw minimal improvement, and in some cases worse results, online. By adopting the strategies outlined above, the retailer was able to improve its online performance significantly.

Finally, the team stresses the importance of monitoring input data and output KPIs after deployment. This ensures that the discrepancy doesn't silently reopen as user behaviour evolves.
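A common way to catch such silent drift is to compare the live distribution of each input feature (and of the output KPI) against its training-time baseline, for example with the Population Stability Index. The sketch below, including the often-quoted 0.2 alert threshold, reflects a generic convention rather than the team's stated practice:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (training-time) sample and its live counterpart.
    A frequently used rule of thumb flags PSI > 0.2 as meaningful drift
    (a convention, not a universal threshold)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)   # avoid log(0) and division by zero
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Hypothetical check: last week's feature values versus the training sample.
rng = np.random.default_rng(1)
training_sample = rng.normal(0.0, 1.0, 5_000)
live_sample     = rng.normal(0.3, 1.1, 5_000)   # shifted distribution
print(f"PSI: {population_stability_index(training_sample, live_sample):.3f}")
```

Running such a check on a schedule, and alerting on both feature drift and KPI drift, keeps the offline-online gap from quietly reopening.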

The challenge lies in finding the best offline evaluation frameworks and metrics that can predict online success. By doing so, teams can experiment and innovate faster, minimise wasted A/B tests, and build better machine learning systems.
