Yildirimhan Aydin Personal Website

🌬️ Wind Forecast ML Pipeline

This project presents a modular machine learning pipeline designed to forecast wind speed and wind direction for multiple wind farms using weather forecast model data.

Built as part of a broader energy optimization effort, this system helps improve wind resource forecast accuracy by training a separate ML model for each site, automating daily predictions, and logging model performance every day.

🧠 Key Goals

Increase short-term forecast accuracy for wind assets.
Automate daily forecasting (24 hours ahead).
Track per-site model performance over time.
Support easy retraining and feature testing.

🏗️ System Architecture

The pipeline is built using modular notebooks (or scripts), each responsible for a distinct stage:

1. Feature Selection

Evaluates and compares candidate feature sets for each wind site.
Models tested: XGBoost, LightGBM, Random Forest.
Metrics calculated: MAE, RMSE, R², MAPE, Bias.
Best features and results stored in a tracking table.
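
The metrics listed above can be computed as in the following minimal sketch. It uses only the standard library so it runs anywhere; a production notebook would more likely rely on pandas or scikit-learn. The function names `regression_metrics` and `best_feature_set` are hypothetical, and two choices here are assumptions rather than details from the pipeline: bias is taken as the mean of (predicted minus observed), and the winning feature set is the one with the lowest MAE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R², MAPE (%) and bias for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Assumed sign convention: positive error means over-forecasting.
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    bias = sum(errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    # MAPE is undefined where the observed value is zero (calm hours),
    # so those hours are left out of the average.
    pct = [abs(e) / abs(t) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape, "bias": bias}


def best_feature_set(results):
    """Pick the candidate name with the lowest MAE from {name: metrics}."""
    return min(results, key=lambda name: results[name]["mae"])
```

Each candidate feature set would be scored with `regression_metrics` on held-out data, and the resulting dictionaries passed to `best_feature_set` before writing the row to the tracking table.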

2. Model Training

Trains a dedicated model per site using selected features.
Stores trained models in cloud storage (e.g., object store).
Logs metadata: hyperparameters, metrics, model paths.
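
As an illustration of the metadata logging step, the sketch below builds one registry row per trained model. The column names follow the `model_registry` header shown later on this page; everything else is an assumption made for the example: the `RegistryRecord` class, the `build_registry_record` and `to_row` helpers, the timestamp-based path layout, and the `model.pkl` file name.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RegistryRecord:
    facility_id: str
    model_path: str
    best_params: dict
    mae: float
    r2: float
    timestamp: str


def build_registry_record(facility_id, target, best_params, metrics, base_uri, now=None):
    """Assemble the metadata row for one trained model.

    The path layout (base/facility/target/version/model.pkl) is hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    version = now.strftime("%Y%m%dT%H%M%SZ")
    model_path = f"{base_uri.rstrip('/')}/{facility_id}/{target}/{version}/model.pkl"
    return RegistryRecord(
        facility_id=facility_id,
        model_path=model_path,
        best_params=dict(best_params),
        mae=float(metrics["mae"]),
        r2=float(metrics["r2"]),
        timestamp=now.isoformat(),
    )


def to_row(record):
    """Flatten the record so hyperparameters fit in a single table column."""
    row = asdict(record)
    row["best_params"] = json.dumps(record.best_params, sort_keys=True)
    return row
```

Putting a timestamp in the path means a retrained model does not overwrite the previous one, which keeps older registry rows valid.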

3. Daily Prediction

Runs once per day (e.g., at 10:00 AM) for next-day forecasts.
Loads weather model inputs and the trained models.
Predicts 24 hourly values for wind speed and direction.
Skips sites with missing data and logs the issue.
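
The skip-and-log behaviour described above could look roughly like this. It is a sketch under stated assumptions, not the production code: models are represented as plain callables that return a `(wind_speed, wind_direction)` pair, "next day" is taken to mean the 24 hours starting at midnight after the run, and the names `next_day_hours` and `run_daily_prediction` are hypothetical.

```python
from datetime import datetime, timedelta

HORIZON_HOURS = 24


def next_day_hours(run_time):
    """Return the 24 hourly timestamps of the day after run_time."""
    start = (run_time + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return [start + timedelta(hours=h) for h in range(HORIZON_HOURS)]


def run_daily_prediction(run_time, site_inputs, models, required_features):
    """Predict 24 hourly values per site, skipping sites with incomplete inputs.

    site_inputs: {facility_id: {timestamp: {feature_name: value}}}
    models:      {facility_id: callable(feature_dict) -> (speed, direction)}
    """
    hours = next_day_hours(run_time)
    predictions, skipped = [], []
    for facility_id, rows in site_inputs.items():
        model = models.get(facility_id)
        if model is None:
            skipped.append({"facility_id": facility_id, "reason": "no trained model"})
            continue
        # An hour counts as missing if the row is absent or any feature is None.
        missing = [
            ts for ts in hours
            if ts not in rows
            or any(rows[ts].get(name) is None for name in required_features)
        ]
        if missing:
            reason = f"missing inputs for {len(missing)} of {HORIZON_HOURS} hours"
            skipped.append({"facility_id": facility_id, "reason": reason})
            continue
        for ts in hours:
            speed, direction = model(rows[ts])
            predictions.append({
                "facility_id": facility_id,
                "timestamp_to": ts,
                "wind_speed": speed,
                "wind_direction": direction % 360,  # keep direction in [0, 360)
            })
    return predictions, skipped
```

The two returned lists map onto the `daily_predictions` and `skipped_sites` tables. Because a failing site is skipped rather than raising, one site with bad inputs does not block forecasts for the others.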

📥 Data Sources (Generalized)

Weather Forecast Models:
Global Forecast System (GFS)
ECMWF
Other regional weather models
Data Format: tabular, with fields such as:
timestamp_to
wind_speed, wind_direction
facility_id
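
To make the field list concrete, here is one way such rows might be parsed into typed records. The field names come from the list above; the CSV layout, the `ForecastRow` class, the units in the comments, and the sample values are illustrative assumptions, not real data from the pipeline.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ForecastRow:
    facility_id: str
    timestamp_to: datetime
    wind_speed: float      # assumed to be in m/s
    wind_direction: float  # assumed to be in degrees


def parse_forecast_rows(csv_text):
    """Parse CSV text with the generalized column names into typed rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ForecastRow(
            facility_id=rec["facility_id"],
            timestamp_to=datetime.fromisoformat(rec["timestamp_to"]),
            wind_speed=float(rec["wind_speed"]),
            wind_direction=float(rec["wind_direction"]),
        )
        for rec in reader
    ]


# Made-up sample rows, for illustration only.
SAMPLE = """facility_id,timestamp_to,wind_speed,wind_direction
site_a,2024-05-02T00:00:00,6.4,215
site_a,2024-05-02T01:00:00,7.1,220
"""

rows = parse_forecast_rows(SAMPLE)
```

Converting to typed values at load time means a malformed timestamp or non-numeric reading fails early, before it reaches a model.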

🗃️ Output Tables (Generalized Schema)

🔹 `model_features`

Tracks selected features and performance per model.

| Facility ID | Model Type | Selected Features | MAE | RMSE | R² | MAPE | Timestamp |

🔹 `model_registry`

Stores paths, parameters, and metrics for trained models.

| Facility ID | Model Path | Best Params | MAE | R² | Timestamp | ... |

🔹 `daily_predictions`

Daily predictions stored per facility, per hour.

🔹 `skipped_sites`

Logs predictions that were skipped because required input features were missing.

🛠️ Tools & Tech

Python (pandas, scikit-learn, xgboost, etc.)
Notebook-based pipeline (Jupyter or Databricks style)
Object storage or Lakehouse for data and model storage
Cloud scheduling or job orchestration
Versioned model tracking tables

📈 Outcomes

Trained models per wind farm with optimized feature sets.
Forecasts stored daily for operational use.
Fail-safe logic to skip and log incomplete predictions.
Easily extendable to power forecasting or anomaly detection.

🚀 Possible Enhancements

Auto-retraining if accuracy drops.
Dashboards for forecast vs. actual (e.g., Power BI).
Alerts for persistent data gaps or model drift.
Integration with power curve models for generation estimation.
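
As a sketch of the auto-retraining idea, a simple trigger could compare recent error against the error recorded when the model was trained. The function name `should_retrain`, the 7-day window, and the 20% tolerance are all hypothetical choices made for this example; a real threshold would need tuning per site.

```python
def should_retrain(daily_mae, baseline_mae, window=7, tolerance=1.2):
    """Return True when the mean MAE of the last `window` days exceeds
    the baseline MAE by more than the given tolerance factor."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    if len(daily_mae) < window:
        # Not enough history yet to judge the model.
        return False
    recent = daily_mae[-window:]
    return sum(recent) / window > tolerance * baseline_mae
```

Averaging over a window rather than reacting to a single bad day keeps one unusual weather event from triggering a retrain.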

📝 Notes

This is an anonymized version of a real production pipeline used in renewable energy forecasting.
All data, identifiers, and systems have been generalized for public sharing.