Measuring Performance Beyond Common Metrics: Aligning Offline Evaluations with Real-World Key Performance Indicators
Madhura Raut, Principal Data Scientist at Workday, leads the design of large-scale machine learning systems for labor demand forecasting. A seasoned keynote speaker at data science conferences, Raut has also served as a judge and mentor at multiple codecrunch hackathons. Her colleague Anima Anandkumar is the lead researcher for large-scale machine learning systems for workforce planning at Workday.
The duo and their team are tackling a common challenge in machine learning: the online-offline gap. This is the discrepancy between how a model scores in offline evaluation and how it performs once deployed, and it means a model that looks strong in simulation can still underdeliver in production.
To bridge this gap, the team has been exploring several strategies. One approach is to analyse, across past experiments, how well each offline metric correlates with successful online results. This identifies which offline metrics are reliable predictors of online success.
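As a minimal sketch of what that analysis can look like, the snippet below rank-correlates offline metric improvements with the online KPI lifts measured in past A/B tests. The experiment records are hypothetical; in practice they would come from an experiment-tracking store.

```python
# Hypothetical experiment log: each row pairs a candidate model's
# offline metric improvement over the baseline with the online KPI
# lift measured in the corresponding A/B test.
import numpy as np
from scipy.stats import spearmanr

experiments = np.array([
    # (offline RMSE improvement, online revenue lift)
    [0.02,  0.001],
    [0.05,  0.004],
    [0.01, -0.002],
    [0.08,  0.006],
    [0.03,  0.000],
    [0.06,  0.005],
])

rho, p_value = spearmanr(experiments[:, 0], experiments[:, 1])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A strong, stable rank correlation suggests the offline metric is a
# usable predictor of online success; a weak one means it should be
# replaced or supplemented before offline wins are trusted.
```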
However, the team recognised that a traditional evaluation framework might not suffice. To address this, they redefined their evaluation framework to include a custom business-weighted metric. This metric penalises underprediction more heavily for trending products and explicitly tracks stockouts.
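Workday has not published the exact formula, but a business-weighted metric of this shape might look like the sketch below, where the underprediction weights and the trending flag are illustrative assumptions:

```python
import numpy as np

def business_weighted_error(y_true, y_pred, is_trending,
                            under_weight=3.0, trend_multiplier=2.0):
    """Asymmetric squared error: underprediction is penalised more
    heavily than overprediction, and more heavily again for trending
    products. The weights here are illustrative assumptions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred                    # positive => underprediction
    weights = np.where(err > 0, under_weight, 1.0)
    weights = np.where((err > 0) & is_trending,
                       under_weight * trend_multiplier, weights)
    return float(np.mean(weights * err ** 2))

def stockout_rate(y_true, y_pred):
    """Share of items whose forecast fell short of actual demand --
    a crude offline proxy for stockouts caused by underforecasting."""
    return float(np.mean(np.asarray(y_pred) < np.asarray(y_true)))

y_true = [100, 40, 250]
y_pred = [90, 45, 180]                       # item 3 is badly underforecast
trending = np.array([False, False, True])
print(business_weighted_error(y_true, y_pred, trending))
print(stockout_rate(y_true, y_pred))
```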
Another strategy the team employs is simulating interactions with methods such as bandit simulators and counterfactual evaluation. Rather than scoring static predictions, these techniques estimate from logged interaction data how a model would have performed online, giving a more realistic picture of user behaviour and narrowing the online-offline gap.
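One standard building block for counterfactual evaluation is the inverse propensity scoring (IPS) estimator, which values a new decision policy using logs collected under an old one. The sketch below uses synthetic logging data and a two-action setup purely for illustration:

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Inverse propensity scoring: estimate the value of a new policy
    from interactions logged under an old one, by reweighting each
    logged reward by how much more (or less) likely the new policy
    is to take the logged action."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(w * np.asarray(rewards)))

rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)             # logged actions, chosen uniformly
logging_probs = np.full(n, 0.5)                  # logging policy: 50/50
rewards = (actions == 1) * rng.random(n)         # only action 1 ever pays off
target_probs = np.where(actions == 1, 0.9, 0.1)  # new policy favours action 1

print(f"Estimated online value of new policy: "
      f"{ips_estimate(rewards, logging_probs, target_probs):.3f}")
```

The importance weights target_probs / logging_probs shift logged outcomes toward what the new policy would have chosen, so the estimate approximates online performance without running a live test.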
The team also advocates choosing multiple proxy metrics that approximate business outcomes. Because no single offline metric captures a business outcome, a panel of proxies gives a more comprehensive view of likely online results and helps reduce the online-offline discrepancy.
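A minimal version of such a metric panel might look as follows; the particular proxies chosen here (bias, underforecast rate, WAPE) are assumptions for illustration, not the team's published list:

```python
import numpy as np

def proxy_metric_panel(y_true, y_pred):
    """Score a forecast on several complementary proxies rather than
    a single number. Which proxies best approximate the business
    outcome should itself be validated against past online results."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "bias": float(np.mean(err)),                   # systematic over/underprediction
        "underforecast_rate": float(np.mean(err < 0)), # crude stockout proxy
        "wape": float(np.sum(np.abs(err)) / np.sum(np.abs(y_true))),
    }

print(proxy_metric_panel([100, 40, 250], [90, 45, 180]))
```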
A practical example of the team's work involves a retailer whose new demand forecasting model had been evaluated solely on RMSE. The model looked like a clear win offline, yet once deployed it produced minimal improvements and, in places, worse results online. By adopting the strategies outlined above, the retailer significantly improved its online performance.
Finally, the team stresses the importance of monitoring input data and output KPIs after deployment. This ensures that the discrepancy doesn't silently reopen as user behaviour evolves.
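A common, lightweight way to implement this kind of monitoring is a Population Stability Index (PSI) check comparing a live feature distribution against its training-time reference. The thresholds and simulated drift below are conventional rules of thumb and synthetic data, not Workday's settings:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training-time) feature distribution
    and the live one. Rules of thumb: ~0.1 warrants a look, ~0.25
    usually warrants action -- conventions, not Workday's settings."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])    # keep outliers in end bins
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
train_feature = rng.normal(0.0, 1.0, 50_000)   # distribution at training time
live_feature = rng.normal(0.3, 1.1, 5_000)     # live traffic has drifted
print(f"PSI = {population_stability_index(train_feature, live_feature):.3f}")
```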
The challenge, ultimately, lies in finding offline evaluation frameworks and metrics that genuinely predict online success. Teams that do can experiment and innovate faster, minimise wasted A/B tests, and build better machine learning systems.