This repository contains a regression analysis on the Diabetes dataset from scikit-learn, implemented as part of a university assignment.
The goal of this project is to compare multiple regression models for predicting disease progression using tabular medical data. The models are evaluated using 6-fold cross-validation and multiple regression performance metrics.
- Random Forest Regressor
- Support Vector Regressor (SVR)
- k-Nearest Neighbors (KNN) Regressor
- Gaussian Process Regressor
- 6-fold K-Fold cross-validation
- Model comparison is based on test-set metrics only
- Performance metrics:
  - RMSE (Root Mean Squared Error)
  - MAE (Mean Absolute Error)
  - Max Error
  - MAPE (Mean Absolute Percentage Error)
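The evaluation setup listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the seed value, the default hyperparameters, the `alpha` value for the Gaussian Process, and the helper name `regression_metrics` are all assumptions, and preprocessing is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import (
    max_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

SEED = 42  # assumed value; the notebook's seed may differ

X, y = load_diabetes(return_X_y=True)

# Hyperparameters here are placeholders, not the notebook's settings.
models = {
    "Random Forest": RandomForestRegressor(random_state=SEED),
    "SVR": SVR(),
    "KNN": KNeighborsRegressor(),
    "Gaussian Process": GaussianProcessRegressor(alpha=0.1, normalize_y=True),
}


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: compute the four metrics for one fold."""
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "Max Error": float(max_error(y_true, y_pred)),
        "MAPE": float(mean_absolute_percentage_error(y_true, y_pred)),
    }


kfold = KFold(n_splits=6, shuffle=True, random_state=SEED)
results = {}
for name, model in models.items():
    fold_scores = []
    for train_idx, test_idx in kfold.split(X):
        # Fit on the training fold, score on the held-out test fold only.
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        fold_scores.append(regression_metrics(y[test_idx], y_pred))
    # Average each metric over the six folds.
    results[name] = {
        metric: float(np.mean([s[metric] for s in fold_scores]))
        for metric in fold_scores[0]
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Note that scikit-learn reports MAPE as a fraction rather than a percentage, so a value of 0.4 corresponds to 40%.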
SHAP (SHapley Additive exPlanations) is used to interpret the predictions of the selected models.
- SHAP summary plots
- SHAP waterfall plots (for representative test samples)
The easiest way to run this project is through Google Colab, as the notebook is fully self-contained and requires no local setup.
- Click the “Open in Colab” button at the top of this README.
- Run all cells sequentially.
- (Optional) To export the results locally, uncomment the CSV download lines at the end of the notebook and re-run it.
This option is recommended for quick experimentation and reproducibility.
You may also download the notebook and run it locally using Jupyter Notebook.
Steps:
- Download the file `diabetes_regression.ipynb` from this repository.
- Ensure the required Python libraries are installed (e.g. `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `shap`).
- Open the notebook and run all cells.
Note: Running locally may require additional setup compared to Google Colab.
The file `regression_results.csv` contains the aggregated evaluation results for all regression models, covering both the training and test sets across all folds.
The CSV file is generated by the notebook.
To regenerate it, uncomment the last lines at the end of `diabetes_regression.ipynb` and re-run the notebook.
A fixed random seed is used to ensure reproducible results. Feature scaling and preprocessing are applied within each fold to prevent data leakage. The dataset is loaded directly from the scikit-learn library.
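To make the leakage point concrete, the sketch below fits a scaler on the training fold only and applies it to the test fold, then checks that a scikit-learn `Pipeline` gives the same predictions. The choice of `StandardScaler`, the SVR model, and the seed value are assumptions for illustration; the notebook's actual preprocessing may differ.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

SEED = 42  # assumed value; fixing it makes the fold assignment repeatable

# The dataset ships with scikit-learn, so no download step is needed.
X, y = load_diabetes(return_X_y=True)
kfold = KFold(n_splits=6, shuffle=True, random_state=SEED)

manual_preds, pipeline_preds = [], []
for train_idx, test_idx in kfold.split(X):
    # Manual version: scaling statistics come from the training fold only,
    # so nothing about the test fold influences the fitted scaler.
    scaler = StandardScaler().fit(X[train_idx])
    model = SVR().fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_preds.append(model.predict(scaler.transform(X[test_idx])))

    # Pipeline version: the same two steps, refit from scratch on each fold.
    pipe = make_pipeline(StandardScaler(), SVR())
    pipe.fit(X[train_idx], y[train_idx])
    pipeline_preds.append(pipe.predict(X[test_idx]))

same = all(np.allclose(a, b) for a, b in zip(manual_preds, pipeline_preds))
print("Manual and pipeline predictions match:", same)
```

Fitting the scaler on the full dataset before splitting would let test-fold statistics shape the training data, which is the leakage this per-fold approach avoids.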
This project was developed as part of a university assignment for the course Machine Learning at the University of Macedonia.
This repository is intended strictly for educational purposes. The implementation and results should not be considered medical or clinical advice.