---
language:
- en
library_name: xgboost
tags:
- tabular-regression
- air-quality
- optuna
- environmental
- structured-data
metrics:
- r2
- mae
- rmse
model-index:
- name: Breathe Easy XGBoost AQI Predictor
  results:
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      name: AQI Of India
      type: AdityaaXD/AQI-Of-India
    metrics:
    - type: r2
      value: 0.925
      name: R^2 Score
    - type: mae
      value: 11.42
      name: Mean Absolute Error
    - type: rmse
      value: 22.90
      name: Root Mean Squared Error
---

# Breathe Easy AQI Predictor (XGBoost Tuned)

## Model Description

This is an XGBoost regression model trained to predict the Air Quality Index (AQI) of Indian cities from historical pollutant concentrations and engineered temporal features. It is the flagship model of the "Breathe Easy" project, with a reported R² of 0.925.

The model was tuned with **Optuna** Bayesian optimization to improve generalization across diverse geographic locations and seasonal variations in India.

- **Developer:** Aditya (@AdityaaXD)
- **Model Type:** Gradient Boosting Tree Regressor (XGBoost)
- **Task:** Tabular Regression
- **Associated Dataset:** [AdityaaXD/AQI-Of-India](https://huggingface.co/datasets/AdityaaXD/AQI-Of-India)
- **Repository:** [GitHub Repository](https://github.com/ADITYA-tp01/AQI-Prediction-Model)

## Uses

### Direct Use

You can use the model to predict the daily AQI value from the previous days' pollutant data, rolling averages, and cyclical date features.

### Downstream Use

This model is designed to be embedded in real-time Streamlit dashboards and monitoring tools to provide localized air quality forecasts and raise public health alerts.

### Out-of-Scope Use

The model relies heavily on the specific distribution of pollutants found in India (CPCB metrics).
Deploying this model out of the box for European or North American cities (which use different AQI scales and formulas and have different baseline pollutant ratios) without retraining will likely produce inaccurate predictions.

## Training Data

The model was trained on the `city_day.csv` dataset, which contains daily aggregations of pollutants across major Indian cities from 2015 to 2020.

**Core Features Engineered:**
- Temporal features (rolling 3-day and 7-day means, lags of 1 and 2 days)
- Cyclical encodings for month and day of the week (sin/cos transformations)
- Target-encoded city names
- Pollutant ratios (e.g., PM2.5 to PM10 ratio)

## Training Details & Hyperparameters

The model's hyperparameters were optimized using Optuna over 50 trials.

- **Framework:** XGBoost
- **Objective:** `reg:squarederror`
- **Evaluation Metric:** RMSE (during training)

## Evaluation

The model was evaluated using a strict temporal split to prevent data leakage (Training: 2015-2018, Validation: 2019, Test: 2020).

### Metrics

| Metric | Score | Note |
|---|---|---|
| **R² (R-Squared)** | 0.925 | Explains ~92.5% of the variance in AQI |
| **MAE** | 11.42 | Small relative to AQI values that range up to 500+ |
| **RMSE** | 22.90 | |
| **MAPE** | 11.68% | |

*Note: While CatBoost achieved a slightly higher R² (0.928) internally, this tuned XGBoost model was selected for its balance of high accuracy and fast inference time for dashboard deployment.*

## Explainability (SHAP)

Global feature importance using SHAP reveals that:

1. **PM2.5 and PM10** concentrations are the strongest drivers of the predicted AQI.
2. **Lag_1_AQI** (yesterday's AQI) is heavily relied upon and acts as a baseline anchor for the tree splits.
3. **Month_cos** (seasonal impact) is a secondary but distinct driver that captures the severe winter pollution spikes (November to January).

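
The `Month_cos` feature mentioned above is one of the sin/cos cyclical encodings listed under Training Data. The following is a minimal sketch of how such an encoding works, not the project's actual feature code: the function names and the convention of mapping month numbers 1 to 12 onto the angle `2π · month / 12` are assumptions made for illustration. Under this assumed convention, the cosine component is highest from November to January, which is consistent with it acting as a winter signal.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(a, b, period):
    """Euclidean distance between two values after cyclical encoding."""
    return math.dist(cyclical_encode(a, period), cyclical_encode(b, period))


# Month numbers 1..12 (assumed convention; the original pipeline may differ).
# Only the cosine component is kept here, rounded for readability.
month_cos = {m: round(cyclical_encode(m, 12)[1], 3) for m in range(1, 13)}

# December and January end up as close as any other pair of adjacent months,
# which a raw integer month (12 vs. 1) would not express.
dec_jan = encoded_distance(12, 1, 12)
jan_feb = encoded_distance(1, 2, 12)
```

The same formula applies to day of the week with a period of 7. Both the sine and the cosine component are needed to identify a month uniquely, since the cosine alone gives November and January the same value.
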
## How to Get Started with the Model

Since this is a pickled scikit-learn/XGBoost pipeline, you can load it directly via `joblib` or `pickle`.

```python
import pickle

import pandas as pd

# 1. Load the model
with open("models/best_xgboost_tuned.pkl", "rb") as f:
    model = pickle.load(f)

# 2. Prepare your features
# (Ensure your DataFrame matches the 20+ columns generated during feature engineering)
# X_new = pd.DataFrame(...)

# 3. Predict AQI
# predictions = model.predict(X_new)
```
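
Once predictions are available, the four metrics reported in the Evaluation table (R², MAE, RMSE, MAPE) can be recomputed on your own data. The sketch below uses only the standard library. The helper name `regression_metrics` and the sample values are hypothetical and not taken from the project or the dataset. It assumes strictly positive observed AQI values, since MAPE is undefined when an observed value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, RMSE and MAPE (in percent) for paired sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # Assumes every observed value is non-zero.
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Made-up AQI values, for illustration only (not from the dataset).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
scores = regression_metrics(observed, predicted)
```

In practice, scikit-learn's `sklearn.metrics` module provides equivalents of these functions. The explicit version here is only meant to show what each reported number measures.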