🌬️ Wind Forecast ML Pipeline
This project presents a modular machine learning pipeline designed to forecast wind speed and wind direction for multiple wind farms using weather forecast model data.
Built as part of a broader energy optimization effort, the system helps improve forecast accuracy for wind resources by training per-site ML models, automating daily predictions, and logging model performance every day.
🧠 Key Goals
- Increase short-term forecast accuracy for wind assets.
- Automate daily forecasting (24 hours ahead).
- Track per-site model performance over time.
- Support easy retraining and feature testing.
🏗️ System Architecture
The pipeline is built using modular notebooks (or scripts), each responsible for a distinct stage:
1. Feature Selection
- Evaluates and compares feature sets per wind site.
- Models tested: XGBoost, LightGBM, Random Forest.
- Metrics calculated: MAE, RMSE, R², MAPE, Bias.
- Best features and results stored in a tracking table.
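As a rough illustration of the metrics listed above, the following sketch computes MAE, RMSE, R², MAPE, and Bias for one candidate feature set's predictions. It is not the pipeline's actual evaluation code: the function names are hypothetical, Bias is assumed to mean "prediction minus observation", and zero observations are assumed to be excluded from MAPE. Because wind direction wraps around at 360°, a small helper for the angular error is included as well.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R2, MAPE (%) and Bias for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Assumed sign convention: positive error means over-forecast.
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    # MAPE is undefined where the observation is zero (calm hours), so skip those.
    pct = [abs(e) / abs(t) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape, "Bias": bias}


def direction_error_deg(true_deg, pred_deg):
    """Smallest signed angular difference in degrees, in [-180, 180)."""
    return (pred_deg - true_deg + 180.0) % 360.0 - 180.0
```

Running `regression_metrics` once per model and feature set, then keeping the row with the lowest error per site, would produce the kind of records stored in the tracking table.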
2. Model Training
- Trains a dedicated model per site using selected features.
- Stores trained models in cloud storage (e.g., object store).
- Logs metadata: hyperparameters, metrics, model paths.
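The training stage's storage and logging steps could look roughly like the sketch below. It makes several assumptions: a local directory stands in for the object store, a JSON-lines file stands in for the registry table, and `save_model_and_log` and the file naming scheme are hypothetical. Any picklable fitted estimator can be passed as `model`.

```python
import json
import pickle
from datetime import datetime, timezone
from pathlib import Path


def save_model_and_log(model, facility_id, params, metrics, model_dir, registry_path):
    """Pickle a fitted model and append one metadata row to a JSON-lines registry."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    trained_at = datetime.now(timezone.utc)
    # One file per site and training run; this naming scheme is illustrative only.
    model_path = model_dir / f"{facility_id}_{trained_at:%Y%m%dT%H%M%SZ}.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    row = {
        "facility_id": facility_id,
        "model_path": str(model_path),
        "best_params": params,
        "metrics": metrics,
        "timestamp": trained_at.isoformat(),
    }
    # Append-only log so earlier model versions stay traceable.
    with open(registry_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row
```

Appending rather than overwriting keeps a history of every trained model per site, which is one simple way to get versioned tracking.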
3. Daily Prediction
- Runs once per day (e.g., at 10:00 AM) for next-day forecasts.
- Loads weather model inputs + trained models.
- Predicts 24 hourly values for wind speed and direction.
- Skips sites with missing data and logs the issue.
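The skip-and-log behaviour of the daily run can be sketched as follows. This is a simplified stand-in, not the production job: plain dictionaries replace the real tables, each model is assumed to be a callable that returns a `(speed, direction)` pair for one hourly feature row, and `REQUIRED_INPUTS` and `run_daily_predictions` are hypothetical names.

```python
from datetime import datetime, time, timedelta, timezone

REQUIRED_INPUTS = ("wind_speed", "wind_direction")  # assumed input feature names


def run_daily_predictions(site_inputs, models, forecast_date, horizon_hours=24):
    """Predict hourly values per site; skip and log sites with incomplete inputs.

    site_inputs: {facility_id: [one feature dict per forecast hour]}
    models:      {facility_id: callable(feature_dict) -> (speed, direction)}
    """
    predictions, skipped = [], []
    run_time = datetime.now(timezone.utc).isoformat()
    day_start = datetime.combine(forecast_date, time.min)
    for facility_id, model in models.items():
        rows = site_inputs.get(facility_id, [])
        missing = sorted(
            {name for row in rows for name in REQUIRED_INPUTS if row.get(name) is None}
        )
        if len(rows) < horizon_hours:
            missing.append(f"only {len(rows)} of {horizon_hours} hourly rows")
        if missing:
            # Record the gap instead of failing the whole run.
            skipped.append({
                "facility_id": facility_id,
                "date": forecast_date.isoformat(),
                "missing_inputs": missing,
                "logged_at": run_time,
            })
            continue
        for hour, row in enumerate(rows[:horizon_hours]):
            speed, direction = model(row)
            predictions.append({
                "facility_id": facility_id,
                "datetime": (day_start + timedelta(hours=hour)).isoformat(),
                "wind_speed": speed,
                "wind_direction": direction % 360.0,  # keep direction in [0, 360)
                "prediction_time": run_time,
            })
    return predictions, skipped
```

Returning the skipped sites alongside the predictions lets the caller write both result sets in the same run, so one incomplete site never blocks the others.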
📥 Data Sources (Generalized)
- Weather Forecast Models:
  - Global Forecast System (GFS)
  - ECMWF
  - Other regional weather models
- Data Format: tabular format with fields like:
  - `timestamp_to`
  - `wind_speed`, `wind_direction`
  - `facility_id`
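To make the tabular format concrete, here is a minimal sketch that parses CSV text with the fields named above and groups the rows by facility. The CSV layout, ISO timestamp format, and the `load_forecast_rows` name are assumptions for illustration, and the sample values are invented.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def load_forecast_rows(csv_text):
    """Parse CSV text into typed rows grouped by facility_id, sorted by time."""
    by_site = defaultdict(list)
    for raw in csv.DictReader(io.StringIO(csv_text)):
        by_site[raw["facility_id"]].append({
            "timestamp_to": datetime.fromisoformat(raw["timestamp_to"]),
            "wind_speed": float(raw["wind_speed"]),
            "wind_direction": float(raw["wind_direction"]) % 360.0,
        })
    for rows in by_site.values():
        rows.sort(key=lambda r: r["timestamp_to"])
    return dict(by_site)


# Invented sample rows, only to show the expected shape of the input.
SAMPLE = """timestamp_to,wind_speed,wind_direction,facility_id
2024-01-02T01:00:00,6.2,270,site_a
2024-01-02T00:00:00,5.8,265,site_a
2024-01-02T00:00:00,9.1,30,site_b
"""

rows_by_site = load_forecast_rows(SAMPLE)
```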
🗃️ Output Tables (Generalized Schema)
🔹 model_features
Tracks selected features and performance per model.
| Facility ID | Model Type | Selected Features | MAE | RMSE | R² | MAPE | Timestamp |
|---|---|---|---|---|---|---|---|
🔹 model_registry
Stores paths, parameters, and metrics for trained models.
| Facility ID | Model Path | Best Params | MAE | R² | Timestamp | ... |
|---|---|---|---|---|---|---|
🔹 daily_predictions
Daily predictions stored per facility, per hour.
| Facility ID | DateTime | Wind Speed | Wind Direction | Model Version | Prediction Time |
|---|---|---|---|---|---|
🔹 skipped_sites
Logs predictions that were skipped because required inputs were missing.
| Facility ID | Date | Missing Inputs | Logged At |
|---|---|---|---|
🛠️ Tools & Tech
- Python (pandas, scikit-learn, xgboost, etc.)
- Notebook-based pipeline (Jupyter or Databricks style)
- Object storage or Lakehouse for data + model storage
- Cloud scheduling or job orchestration
- Versioned model tracking tables
📈 Outcomes
- Trained models per wind farm with optimized feature sets.
- Forecasts stored daily for operational use.
- Fail-safe logic to skip and log incomplete predictions.
- Easily extendable to power forecasting or anomaly detection.
🚀 Possible Enhancements
- Auto-retraining if accuracy drops.
- Dashboards for forecast vs. actual (e.g., Power BI).
- Alerts for persistent data gaps or model drift.
- Integration with power curve models for generation estimation.
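As one possible shape for the auto-retraining idea, the sketch below flags a site when its recent average MAE drifts above the MAE recorded at training time. The function name, the 20% tolerance, and the 7-day window are arbitrary placeholders, not values taken from the production system.

```python
def needs_retraining(recent_mae, baseline_mae, tolerance=0.2, min_days=7):
    """Return True when the mean of the last `min_days` daily MAE values
    exceeds the training-time baseline by more than `tolerance` (a fraction)."""
    if len(recent_mae) < min_days or baseline_mae <= 0:
        # Not enough evidence, or no usable baseline: do not trigger.
        return False
    window = recent_mae[-min_days:]
    return sum(window) / len(window) > baseline_mae * (1.0 + tolerance)
```

Averaging over a window rather than reacting to a single bad day avoids retraining on one unusual weather event.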
📝 Notes
- This is an anonymized version of a real production pipeline used in renewable energy forecasting.
- All data, identifiers, and systems have been generalized for public sharing.