---
language:
- en
library_name: xgboost
tags:
- tabular-regression
- air-quality
- optuna
- environmental
- structured-data
metrics:
- r2
- mae
- rmse
model-index:
- name: Breathe Easy XGBoost AQI Predictor
  results:
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      name: AQI Of India
      type: AdityaaXD/AQI-Of-India
    metrics:
      - type: r2
        value: 0.925
        name: R^2 Score
      - type: mae
        value: 11.42
        name: Mean Absolute Error
      - type: rmse
        value: 22.90
        name: Root Mean Squared Error
---

# Breathe Easy AQI Predictor (XGBoost Tuned)

## Model Description

This is an XGBoost regression model that predicts the Air Quality Index (AQI) of Indian cities from historical pollutant concentrations and engineered temporal features. It is the flagship model of the "Breathe Easy" project and reaches an R² of 0.925 in evaluation (see the Evaluation section).

The model was tuned with **Optuna** Bayesian optimization, with the aim of generalizing across India's varied geography and seasons.

- **Developer:** Aditya (@AdityaaXD)
- **Model Type:** Gradient Boosting Tree Regressor (XGBoost)
- **Task:** Tabular Regression
- **Associated Dataset:** [AdityaaXD/AQI-Of-India](https://huggingface.co/datasets/AdityaaXD/AQI-Of-India)
- **Repository:** [GitHub Repository](https://github.com/ADITYA-tp01/AQI-Prediction-Model)

## Uses

### Direct Use

You can use the model to predict the daily AQI value based on the previous days' pollutant data, rolling averages, and cyclical date features.

### Downstream Use

This model is configured to be embedded within real-time Streamlit dashboards and monitoring tools to provide localized air quality forecasting and raise public health alerts.

### Out-of-Scope Use

The model relies heavily on the specific distribution of pollutants found in India (CPCB metrics). Deploying this model out-of-the-box for European or North American cities (which use different AQI formula scales and have different baseline pollutant ratios) without retraining will likely result in inaccurate predictions.

## Training Data

The model was trained on the `city_day.csv` dataset, which contains daily aggregations of pollutants across major Indian cities from 2015 to 2020.

**Core Features Engineered:**
- Temporal features (rolling 3-day and 7-day means, 1-day and 2-day lags)
- Cyclical encodings for Month and Day of the week (Sin/Cos transformations)
- Target-encoded City names
- Pollutant ratios (e.g., PM2.5 to PM10 ratio)
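
The features listed above could be built along the following lines. This is a minimal sketch rather than the project's actual pipeline: the column names (`City`, `Date`, `AQI`, `PM2.5`, `PM10`), the output names other than `Lag_1_AQI` and `Month_cos`, and the choice to compute rolling means over lagged AQI are all assumptions made for illustration. Target encoding of city names is left out because it has to be fitted on the training split only.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, rolling, cyclical, and ratio features (illustrative only)."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out = out.sort_values(["City", "Date"])

    # 1-day and 2-day lags of AQI, computed separately for each city.
    by_city = out.groupby("City")["AQI"]
    out["Lag_1_AQI"] = by_city.shift(1)
    out["Lag_2_AQI"] = by_city.shift(2)

    # Rolling means over the previous 3 and 7 days. They are built on the
    # 1-day lag so that the current day's AQI never enters its own features.
    lagged = out.groupby("City")["Lag_1_AQI"]
    for window in (3, 7):
        out[f"Roll_{window}_AQI"] = lagged.transform(
            lambda s: s.rolling(window, min_periods=1).mean()
        )

    # Cyclical encodings: map month and weekday onto a circle so that
    # December sits next to January and Sunday next to Monday.
    month = out["Date"].dt.month
    weekday = out["Date"].dt.dayofweek
    out["Month_sin"] = np.sin(2 * np.pi * month / 12)
    out["Month_cos"] = np.cos(2 * np.pi * month / 12)
    out["Dow_sin"] = np.sin(2 * np.pi * weekday / 7)
    out["Dow_cos"] = np.cos(2 * np.pi * weekday / 7)

    # PM2.5 to PM10 ratio; a zero denominator becomes NaN instead of inf.
    out["PM_ratio"] = out["PM2.5"] / out["PM10"].replace(0, np.nan)
    return out
```

The first rows of each city have missing lag and rolling values, which XGBoost can handle natively as missing inputs.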

## Training Details & Hyperparameters

The model's hyperparameters were optimized using Optuna over 50 trials.

- **Framework:** XGBoost
- **Objective:** `reg:squarederror`
- **Evaluation Metric:** RMSE (during training)

## Evaluation

The model was evaluated using a strict temporal split to prevent data leakage (Training: 2015-2018, Validation: 2019, Test: 2020).
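
The split described above can be reproduced with a few lines of pandas. This is a sketch that assumes a `Date` column; the function name `temporal_split` is hypothetical.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, date_col: str = "Date"):
    """Split rows by calendar year: 2015-2018 train, 2019 val, 2020 test."""
    years = pd.to_datetime(df[date_col]).dt.year
    train = df[years.between(2015, 2018)]  # both bounds inclusive
    val = df[years == 2019]
    test = df[years == 2020]
    return train, val, test
```

Because every validation and test row is later in time than every training row, lag and rolling features never carry information backward from the evaluation period into training.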

### Metrics

| Metric | Score | Note |
|---|---|---|
| **R² (R-Squared)** | 0.925 | Explains ~92.5% of variance in AQI |
| **MAE** | 11.42 | Small relative to an AQI scale that ranges up to 500+ |
| **RMSE** | 22.90 | |
| **MAPE** | 11.68% | |

*Note: While CatBoost achieved a slightly higher R² (0.928) internally, this tuned XGBoost model was selected for its balance of high accuracy and extremely fast inference time for dashboard deployment.*
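
For reference, the four metrics in the table can be computed from true and predicted AQI values as follows. This is a plain NumPy sketch using the standard definitions, not the project's evaluation script; the function name `regression_report` is hypothetical.

```python
import numpy as np


def regression_report(y_true, y_pred) -> dict:
    """Return R², MAE, RMSE, and MAPE (in percent) for two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    ss_res = np.sum(err ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares

    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        # MAPE is undefined when a true value is zero.
        "mape": float(np.mean(np.abs(err / y_true)) * 100.0),
    }
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two usually points to a small number of badly mispredicted days.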

## Explainability (SHAP)

Global feature importance using SHAP reveals that:
1. **PM2.5 and PM10** concentrations strongly dictate the predicted AQI.
2. **Lag_1_AQI** (yesterday's AQI) is heavily relied upon, acting as a powerful baseline anchor for the tree splits.
3. **Month_cos** (seasonal impact) is a secondary but distinct driver that captures the severe winter pollution spikes (November to January).

## How to Get Started with the Model

Since this is a pickled scikit-learn/XGBoost pipeline, you can load it directly with `joblib` or `pickle`.

```python
import pickle
import pandas as pd

# 1. Load the model
with open("models/best_xgboost_tuned.pkl", "rb") as f:
    model = pickle.load(f)

# 2. Prepare your features 
# (Ensure your DataFrame matches the 20+ columns generated during feature engineering)
# X_new = pd.DataFrame(...)

# 3. Predict AQI
# predictions = model.predict(X_new)
```