This is the supplementary GitHub repository of the paper "Mind the Long Tail: Understanding the Difficulty of Delay Detection in Business Processes", submitted to BPM 2026.
The supplementary report of the paper is accessible here
Clone this GitHub repository to your local machine. To install and set up the required environment on a Linux system, run the following commands:
conda create -n delay python=3.11
conda activate delay
pip install -r imbalanced_regression.txt
conda clean --all
To execute the pipeline for a dataset (e.g., BPIC20PTC) and an imbalanced regression technique (e.g., BMSE), run the following:
python main.py --dataset BPIC20PTC --IR BMSE
If no imbalanced regression technique is passed (--IR), the Vanilla model is trained.
CSW (Cost Sensitive re-Weighting) and EAL (Error-Aware Loss) can be combined with Label Distribution Smoothing (LDS) and/or Feature Distribution Smoothing (FDS). Therefore, the pipeline includes running experiments with four different configurations (wos: without smoothing, LDS, FDS, LDS+FDS). For more information, please refer to Delving into Deep Imbalanced Regression and its corresponding GitHub repository.
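As a rough illustration of the LDS idea, the sketch below bins the targets, smooths the empirical label histogram with a Gaussian kernel, and uses the inverse of the smoothed density as per-sample weights. The function name `lds_weights` and the defaults for `n_bins`, `kernel_size`, and `sigma` are illustrative assumptions. They are not taken from this repository or from the original implementation.

```python
import numpy as np


def lds_weights(targets, n_bins=20, kernel_size=5, sigma=2.0):
    """Per-sample weights from a Gaussian-smoothed label histogram (sketch).

    Assumes kernel_size is odd and n_bins >= kernel_size.
    """
    targets = np.asarray(targets, dtype=float)

    # Empirical label histogram over equal-width bins.
    counts, edges = np.histogram(targets, bins=n_bins)

    # Symmetric Gaussian kernel, normalised to sum to one.
    half = kernel_size // 2
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()

    # Smooth the histogram so neighbouring label bins share density.
    smoothed = np.convolve(counts, kernel, mode="same")

    # Map every target to its bin and weight it by the inverse density.
    bin_idx = np.clip(np.digitize(targets, edges[1:-1]), 0, n_bins - 1)
    weights = 1.0 / np.maximum(smoothed[bin_idx], 1e-8)

    # Rescale so that the weights average to one.
    return weights * len(weights) / weights.sum()
```

With a long-tailed target such as remaining time, the rare long delays receive larger weights than the frequent short ones, which is the effect a re-weighted loss relies on.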
Balanced MSE (BMSE) cannot be combined with LDS, but the authors suggested that FDS should be complementary to their technique. Therefore, the pipeline includes experiments with two configurations (wos and FDS). For more information, please refer to Balanced MSE for Imbalanced Visual Regression and its corresponding GitHub repository.
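For orientation, the following is a minimal NumPy sketch of the batch-based Monte Carlo (BMC) variant of Balanced MSE, which treats the other targets in the batch as samples of the label distribution. It is an illustration only: the function name `bmc_loss` is hypothetical, the noise variance is fixed here although it is typically a learnable parameter, and the variant actually used in this repository may differ.

```python
import numpy as np


def bmc_loss(pred, target, noise_var=1.0):
    """Batch-based Monte Carlo Balanced MSE for 1-D targets (sketch)."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 1)
    target = np.asarray(target, dtype=float).reshape(-1, 1)

    # Pairwise logits: entry (i, j) compares prediction i with target j.
    logits = -((pred - target.T) ** 2) / (2.0 * noise_var)

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy where the "correct class" of sample i is its own target.
    loss = -np.mean(np.diag(log_probs))

    # Rescale so the magnitude is comparable to a plain squared error.
    return float(loss * 2.0 * noise_var)
```

Predictions that match their own targets better than the other targets in the batch yield a lower loss, so frequent label values no longer dominate the objective.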
The pipeline includes the same two configurations (wos and FDS) for Squared Error Relevance Area (SERA). For more information, please refer to Model Optimization in Imbalanced Regression and Imbalanced regression and extreme value prediction. The original implementation of SERA is provided in R (in this package); our implementation is written in Python.
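To make the metric concrete, the sketch below approximates SERA as the area under the curve of SER_t over relevance thresholds t in [0, 1], where SER_t sums the squared errors of all samples whose relevance is at least t. The relevance values are passed in directly, so the way they are derived from the target distribution is left open. The function name `sera` and the `n_steps` default are assumptions for illustration and do not mirror this repository's code.

```python
import numpy as np


def sera(y_true, y_pred, relevance, n_steps=1000):
    """Approximate Squared Error Relevance Area with the trapezoidal rule."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relevance = np.asarray(relevance, dtype=float)  # values in [0, 1]

    sq_err = (y_pred - y_true) ** 2
    thresholds = np.linspace(0.0, 1.0, n_steps + 1)

    # SER_t: summed squared error over samples with relevance >= t.
    ser = np.array([sq_err[relevance >= t].sum() for t in thresholds])

    # Integrate SER_t over t in [0, 1].
    dt = thresholds[1] - thresholds[0]
    return float(np.sum((ser[:-1] + ser[1:]) * 0.5 * dt))
```

An error on a highly relevant (rare) case is counted across more thresholds than the same error on a common case, so it contributes more to the final value.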
To run SMOGN, switch to the SMOGN branch and run the following (replace the argument values as needed):
python main.py --dataset BPIC20PTC --sampling SMOGN --smogn_rel_thres 0.8 --smogn_over_ratio 5.0 --smogn_under_ratio 0.3
To train the uncertainty-aware approach based on survival analysis, set the --IR argument to 'survival'. It is also possible to train an uncertainty-aware model based on quantile regression by setting the --IR argument to 'quantile'.
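Quantile regression models are commonly trained with the pinball (quantile) loss, which penalises under- and over-prediction asymmetrically. The sketch below shows that standard loss. Whether this repository uses exactly this form is an assumption, and the function name `pinball_loss` is illustrative.

```python
import numpy as np


def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball loss for a single quantile level in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred

    # Under-prediction (diff > 0) costs quantile * diff,
    # over-prediction (diff < 0) costs (1 - quantile) * |diff|.
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))
```

At quantile 0.9, under-predicting a true value of 10 by 2 costs 1.8, while over-predicting it by 2 costs only 0.2, which pushes the model towards an upper bound on the target.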
- All event logs are collected here.
- All configurations that are used for hyper-parameter optimization and training are collected here. Adjust cdg.data.path in the cfg file so that it points to the XES or CSV file.
Once the survival model is trained, the second step, uncertainty-aware classification (and training of the point-estimate deterministic baseline), can be run for a dataset (e.g., BPIC20PTC) as follows:
python delay_analysis.py --dataset BPIC20PTC