Code Documentation
incube.main
get_inferenceobj(config, logger, device_folder)
Creates and returns an inference object based on the specified target model in the configuration.
Parameters:
Returns:
Raises:
get_trainobj(config, logger, device_folder)
Creates and returns a training object based on the specified configuration.
Parameters:
Returns:
Raises:
load_config(config_path)
Loads a YAML configuration file from the specified path.
Parameters:
Returns:
Raises:
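As a point of reference, loading a YAML file of this kind is usually a thin wrapper around PyYAML; the sketch below assumes yaml.safe_load is used and that a missing or malformed file surfaces as FileNotFoundError or yaml.YAMLError.

import yaml

def load_config(config_path):
    """Sketch: load a YAML configuration file into a dict."""
    with open(config_path, "r") as f:   # FileNotFoundError if the path is wrong
        return yaml.safe_load(f)        # yaml.YAMLError if the file is malformed

config = load_config("config.yaml")     # hypothetical path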
main(args)
The main entry point of the application. This function handles the execution of different modes (train or predict) based on the provided arguments.
Parameters:
Raises:
Note
Ensure that the configuration file exists and is properly formatted. The logger is initialized based on the provided arguments and configuration.
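For orientation only, a sketch of how the entry point could be wired with argparse; the flag names --mode and --config are hypothetical and simply illustrate the train/predict dispatch described above, using the module functions documented here.

import argparse

def main(args):
    # Dispatch on the requested mode using load_config, set_logger, train and predict.
    config = load_config(args.config)
    logger = set_logger(args, config)
    folders = config["processed_dataset_path"]   # hypothetical config key
    if args.mode == "train":
        train(config, logger, folders)
    else:
        predict(config, logger, folders)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["train", "predict"], default="predict")  # hypothetical flag
    parser.add_argument("--config", default="config.yaml")                          # hypothetical flag
    main(parser.parse_args())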
predict(config, logger, folders)
Perform prediction on processed dataset folders for each device.
Parameters:
The function iterates through all subdirectories in the processed dataset path,
treating each subdirectory as a device folder. For each device folder, it:
1. Extracts the device name from the folder name.
2. Logs the start of the prediction process for the device.
3. Creates an inference object using get_inferenceobj.
4. Calls the predict method of the inference object to perform predictions.
5. Logs the completion of the prediction process for the device.
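A compact sketch of the loop described above, assuming each device folder is a plain subdirectory of the processed dataset path; get_inferenceobj is the module function documented earlier.

import os

def predict(config, logger, folders):
    # Iterate over device subdirectories and run inference for each one.
    for entry in sorted(os.listdir(folders)):
        device_folder = os.path.join(folders, entry)
        if not os.path.isdir(device_folder):
            continue
        device = os.path.basename(device_folder)          # device name from the folder name
        logger.info(f"Starting prediction for {device}")
        inference = get_inferenceobj(config, logger, device_folder)
        inference.predict()
        logger.info(f"Finished prediction for {device}")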
read_data(config, logger, dataset_name)
Reads a processed dataset from a specified path and returns it as a DataFrame.
Parameters:
Returns:
Raises:
set_logger(args, config)
Configures and initializes a logger based on the provided arguments and configuration.
Parameters:
Returns:
train(config, logger, folders)
Trains models for each device folder found in the processed dataset path.
Parameters:
The function iterates through all subdirectories in the processed dataset path,
treating each subdirectory as a device folder. For each device folder:
1. Logs the start of the training process for the device.
2. Creates a training object using get_trainobj.
3. Trains the model using the train_model method of the training object.
4. Logs the completion of the training process for the device.
incube.modeling.train
TrainGB
__build_results_dataset(Y, preds, train_index, val_index, test_index)
Builds a results dataset by combining actual and predicted values generated by the time series model.
Parameters:
Returns:
Notes
- The resulting DataFrame is sorted by 'timestamp_start' and 'lag'.
__create_lag_features(df_device)
Creates lag features for a given DataFrame based on the context length and horizon length.
This method generates lagged feature columns and target columns for time series data. It shifts the "ElectricWConsumed" column by positive and negative offsets to create lagged features for model input and target prediction, respectively. The resulting DataFrame is filtered to remove rows with NaN values in the lagged columns and is downsampled based on the specified stride.
Parameters:
Returns:
Attributes:
Notes
- The DataFrame is sorted by the "timestamp" column before creating lag features.
- Rows with NaN values in any of the lagged columns are dropped.
- The stride parameter determines the downsampling rate of the resulting DataFrame.
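The lagging logic can be illustrated with plain pandas. The sketch below is an assumption about the implementation: positive shifts build the context (feature) columns, negative shifts build the horizon (target) columns, and the column names are chosen here for illustration only.

import pandas as pd

def create_lag_features(df, context_length, horizon_length, stride, target="ElectricWConsumed"):
    df = df.sort_values("timestamp").copy()
    lag_cols_feat, lag_cols_target = [], []
    for i in range(1, context_length + 1):        # past values become model inputs
        col = f"{target}_lag_{i}"
        df[col] = df[target].shift(i)
        lag_cols_feat.append(col)
    for i in range(horizon_length):               # future values become targets
        col = f"{target}_target_{i}"
        df[col] = df[target].shift(-i)
        lag_cols_target.append(col)
    df = df.dropna(subset=lag_cols_feat + lag_cols_target)    # drop incomplete windows
    return df.iloc[::stride], lag_cols_feat, lag_cols_target  # downsample by stride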
__extract_metrics(X_train, y_train, X_val, y_val, X_test, y_test)
Extracts and computes evaluation metrics (RMSE, MAE, R2) for the model predictions on training, validation, and test datasets. Optionally saves the computed metrics to a CSV file if a stats folder is provided.
Parameters:
Returns:
Notes
- The method computes metrics for specific lags (0, 12, 23).
- If self.stats_folder is not None, the metrics are saved to a CSV file named stats_<device>.csv in the specified folder.
- Logs a debug message if saving stats is skipped due to a None path.
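The metric computation itself is standard scikit-learn; a minimal sketch for one split and lag, assuming y_true and y_pred are aligned arrays.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def compute_metrics(y_true, y_pred):
    # RMSE, MAE and R2 as reported for the train/valid/test splits.
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    return {"RMSE": rmse, "MAE": mae, "R2": r2}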
__get_model(params)
Creates and returns an XGBoost model with the specified parameters.
Parameters:
Returns:
__init__(cv: int, strategy: str, early_stopping_rounds: int, eval_metric: str, context_length: int, horizon_length: int, stride: int, cov_cols: list, cat_cols: list, train_size: float, valid_size: float, test_size: float, device_folder: str, logger: object, path_save_model: str, extract_metrics: bool, stats_folder: str, plot_folder: str, timestamp_column: str, trials: int, return_results: bool, quantiles: list, top_n: int)
Initialize the training configuration for the model.
Parameters:
__plot_boxplot_per_lag(plot_path, df_results)
Generates and saves a boxplot visualizing the absolute error per lag and dataset.
This method calculates the absolute error between the actual and predicted values in the provided DataFrame, and creates a boxplot to display the distribution of errors for each lag and dataset type. The plot is saved as a PNG file.
Parameters:
Saves
A PNG file of the boxplot at the specified plot_path with the filename formatted as "{self.device}_BOXPLOT_ERROR.png".
__plot_features_importances(plot_path)
Plots and saves the feature importances of the model.
This method generates a plot of the feature importances for the model and saves it as a PNG file in the specified directory.
Parameters:
Saves
A PNG file saved in the plot_path directory, with a filename derived from the self.device attribute.
__plot_lag0(plot_path, df_results)
Generates and saves a line plot for lag 0 predictions and actual values.
This method creates a plot comparing the actual values and predicted values for lag 0, separated by dataset type (TRAIN, VALID, TEST). The plot is saved as a PNG file in the specified directory.
Parameters:
Saves
A PNG file named "{device}_TRAIN_LAG0.png" in the specified plot_path directory, where device is an attribute of the class instance.
__plot_random_day(plot_path, df_results)
Plots and saves a visualization of the actual vs predicted values for a random day from the provided results dataframe.
Parameters:
Saves
A PNG image of the plot in the specified plot_path directory with the filename format "{device}_TRAIN_RANDOM_DAY.png".
__plots(df_results)
Generates and saves various plots based on the provided results dataframe.
This method creates a directory for saving plots if it does not already exist. It then generates and saves the following types of plots:
- Lag 0 plot
- Random day plot
- Boxplot per lag
- Feature importances plot
Parameters:
Returns:
__read_data()
Reads and processes training data from a CSV file.
This method performs the following steps:
1. Reads the training data CSV file located in the device folder.
2. Parses the specified timestamp column as datetime.
3. Creates lag features for the dataset.
4. Splits the dataset into training and testing subsets.
Returns:
__save_model()
Saves the current model to a specified path in JSON format.
This method checks if the path_save_model attribute is set. If it is None, the method logs a debug message and skips the save operation. Otherwise, it creates the necessary directory structure and saves the model to a JSON file named after the device.
The saved model file will be located at:
<path_save_model>/GB/<device>.json
Note
- The method assumes that the self.model object has a save_model method that handles saving the model to the specified file path.
Returns:
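The save itself is a small wrapper around XGBoost's save_model; a sketch assuming self.model is an XGBoost estimator and the path layout shown above.

import os

def save_model(model, path_save_model, device, logger):
    if path_save_model is None:
        logger.debug("path_save_model is None, skipping model save")
        return
    out_dir = os.path.join(path_save_model, "GB")
    os.makedirs(out_dir, exist_ok=True)                        # create <path_save_model>/GB/
    model.save_model(os.path.join(out_dir, f"{device}.json"))  # JSON file named after the device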
__search_best_parameters(X_train, y_train, X_val, y_val)
Searches for the best hyperparameters for the model using Optuna.
This method performs hyperparameter optimization for an XGBoost model by defining a search space and evaluating the performance of different parameter combinations using a validation dataset. The best parameters are then used to create and return a model.
Parameters:
Returns:
Notes
- The method uses Optuna to perform the hyperparameter search.
- The objective function minimizes the root mean squared error (RMSE) on the validation dataset.
- The best hyperparameters and the corresponding score are logged.
Raises:
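A condensed Optuna sketch of this kind of search, using a plain XGBRegressor for brevity; the parameter ranges are illustrative, not the ones actually used, and the objective minimises validation RMSE as described above.

import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def search_best_parameters(X_train, y_train, X_val, y_val, trials=20):
    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),           # illustrative range
            "max_depth": trial.suggest_int("max_depth", 3, 10),                     # illustrative range
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        }
        model = XGBRegressor(**params)
        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        return float(np.sqrt(mean_squared_error(y_val, preds)))                     # validation RMSE

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=trials)
    return XGBRegressor(**study.best_params)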
__split_dataset(df_device)
Splits the dataset into training, validation, and test sets, and separates features (X) and target (y) for each set.
Parameters:
Returns:
Notes
- The dataframe is expected to have a "timestamp" column, which will be used as the index.
- The categorical columns specified in self.cat_cols are converted to the "category" dtype.
- The split sizes are determined by self.train_size, self.valid_size, and self.test_size, which represent proportions of the dataset.
- The self.horizon_length parameter is used to exclude the last portion of the dataset from the test set.
- The features are selected based on self.cov_cols and self.lag_cols_feat, while the target is selected based on self.lag_cols_target.
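A sketch of a chronological, proportional split consistent with the notes above; variable and column names are illustrative.

def split_dataset(df, train_size, valid_size, horizon_length, cov_cols, lag_cols_feat, lag_cols_target):
    df = df.set_index("timestamp")
    n = len(df)
    train_end = int(n * train_size)
    valid_end = int(n * (train_size + valid_size))
    feat_cols = cov_cols + lag_cols_feat

    train = df.iloc[:train_end]
    valid = df.iloc[train_end:valid_end]
    test = df.iloc[valid_end : n - horizon_length]   # exclude the last horizon from the test set

    return (
        (train[feat_cols], train[lag_cols_target]),
        (valid[feat_cols], valid[lag_cols_target]),
        (test[feat_cols], test[lag_cols_target]),
    )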
train_model()
Trains the model using the provided training, validation, and test datasets.
This method reads the data, searches for the best model parameters, trains the model, saves the trained model, and optionally extracts metrics and generates plots.
Returns:
Steps
- Reads the training, validation, and test datasets.
- Searches for the best model parameters using the validation set.
- Trains the model on the training dataset.
- Saves the trained model to disk.
- Optionally extracts metrics and generates plots if self.extract_metrics is True.
- Returns the metrics DataFrame if self.return_results is True.
Note
- The method assumes that the following helper methods are implemented:
  - __read_data: Reads and splits the data into training, validation, and test sets.
  - __search_best_parameters: Searches for the best hyperparameters for the model.
  - __save_model: Saves the trained model to disk.
  - __extract_metrics: Extracts performance metrics for the model.
  - __plots: Generates plots for the extracted metrics.
- Logging is used to track the progress of the training process.
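End to end, the class might be used as follows. The constructor arguments mirror the __init__ signature documented above; all concrete values (paths, column names, strategy) are hypothetical.

import logging
from incube.modeling.train import TrainGB

logger = logging.getLogger("incube")

trainer = TrainGB(
    cv=3,
    strategy="standalone",                      # hypothetical value
    early_stopping_rounds=50,
    eval_metric="rmse",
    context_length=24,
    horizon_length=24,
    stride=1,
    cov_cols=["temperature"],                   # hypothetical covariates
    cat_cols=["hour"],                          # hypothetical categorical columns
    train_size=0.7,
    valid_size=0.15,
    test_size=0.15,
    device_folder="data/processed/device_A",    # hypothetical path
    logger=logger,
    path_save_model="models",
    extract_metrics=True,
    stats_folder="stats",
    plot_folder="plots",
    timestamp_column="timestamp",
    trials=20,
    return_results=True,
    quantiles=[0.1, 0.5, 0.9],
    top_n=15,
)
results = trainer.train_model()                 # metrics DataFrame when return_results is True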
incube.modeling.predict
PredictGB
__build_results_dataset(start_time, preds)
Builds a results dataset containing lag values, timestamps, and predictions.
Parameters:
Returns:
__create_lag_features(df_device)
Creates lag features for a given DataFrame to be used in time series modeling.
This method generates lagged features for both past (context) and future (horizon) values of the "ElectricWConsumed" column. It also removes rows with missing values resulting from the lagging process and applies a stride to downsample the data.
Parameters:
Returns:
Attributes:
Notes
- The context_length attribute determines the number of past lags to create.
- The horizon_length attribute determines the number of future lags to create.
- The stride attribute determines the downsampling rate of the resulting DataFrame.
__forecast(df_forecast)
Generates a forecast using the provided dataframe and the trained model.
Parameters:
Returns:
__init__(path_model, device_folder, logger, stats_folder, plot_folder, context_length, horizon_length, stride, output_path, return_results, timestamp_column, cat_cols, cov_cols)
Initializes the prediction model with the specified parameters.
Parameters:
__load_model()
Loads a machine learning model for the specified device.
This method constructs the path to the model file based on the device name. If the model file does not exist at the constructed path, a warning is logged. The method then initializes an XGBRegressor instance and loads the model from the specified file.
Attributes:
Raises:
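A sketch of the loading step, assuming the <path_model>/GB/<device>.json layout produced by __save_model and the standard XGBoost API.

import os
from xgboost import XGBRegressor

def load_model(path_model, device, logger):
    model_path = os.path.join(path_model, "GB", f"{device}.json")   # assumed layout
    if not os.path.exists(model_path):
        logger.warning(f"Model file not found: {model_path}")
    model = XGBRegressor()
    model.load_model(model_path)   # fails here if the file is genuinely missing
    return model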
__plot(df_results)
Generates and saves a plot of predictions over time.
This method creates a line plot using the provided DataFrame df_results, which contains timestamps and prediction values. The plot is saved as a PNG file in a specified folder structure based on the plot_folder and device attributes of the class. If plot_folder is None, the method skips the plotting process.
Parameters:
Behavior
- Creates a directory structure if it does not exist.
- Saves the plot as a PNG file named <device>_INFERENCE.png in the folder <plot_folder>/<device>.
- Configures the plot with appropriate titles, labels, and font sizes.
Notes
- The method uses matplotlib for plotting.
- The plot is closed after saving to free up resources.
- If plot_folder is None, a debug message is logged, and the method returns without generating a plot.
__read_data()
Reads and processes historical and forecast data for a device.
This method reads the forecast CSV files located in the device_folder directory. It ensures that the historical data matches the required context length and the forecast data matches the required horizon length. Categorical columns are converted to the "category" data type.
Returns:
Raises:
__save_forecast(df_results)
Saves the forecast results to a CSV file if an output path is specified.
Parameters:
Returns:
Notes
- If self.output_path is None, the method logs a debug message and skips saving.
- The file is saved with the name of the device as <device>.csv in the specified output path.
predict()
Executes the prediction process by loading the model, reading input data, creating lag features, generating forecasts, saving the results, and plotting the forecasts. Optionally returns the forecast results.
Steps:
1. Load the prediction model.
2. Read the input data required for forecasting.
3. Create lag features from the input data.
4. Generate forecasts using the model.
5. Save the forecast results to a CSV file.
6. Plot the forecast results.
7. Optionally return the forecast results if self.return_results is True.
Returns:
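A usage sketch mirroring the __init__ signature documented above; all concrete values are hypothetical.

import logging
from incube.modeling.predict import PredictGB

logger = logging.getLogger("incube")

inference = PredictGB(
    path_model="models",
    device_folder="data/processed/device_A",    # hypothetical path
    logger=logger,
    stats_folder=None,
    plot_folder="plots",
    context_length=24,
    horizon_length=24,
    stride=1,
    output_path="forecasts",
    return_results=True,
    timestamp_column="timestamp",
    cat_cols=["hour"],                          # hypothetical categorical columns
    cov_cols=["temperature"],                   # hypothetical covariates
)
df_forecast = inference.predict()               # forecast DataFrame when return_results is True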
incube.modeling.model
XGBRegressorQuantileMultistep
Bases: XGBRegressor
fit(X, y, **kwargs)
Fits multiple models for time series forecasting using XGBoost.
This method trains a series of models, one for each target in the prediction horizon, using quantile regression. The training and validation datasets are provided as inputs, along with additional parameters for model configuration.
Parameters:
Attributes:
Raises:
Example
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=True)
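Conceptually, the per-horizon quantile training can be sketched as below. The reg:quantileerror objective and quantile_alpha parameter are an assumption about how the class is implemented (they require XGBoost 2.0 or later); the real fit may differ.

import xgboost as xgb

def fit_multistep_quantile(X, y, quantiles):
    # One model per horizon step (one column of the target DataFrame y),
    # each predicting all requested quantiles.
    models = []
    for step in range(y.shape[1]):
        model = xgb.XGBRegressor(
            objective="reg:quantileerror",   # quantile regression objective
            quantile_alpha=quantiles,        # e.g. [0.1, 0.5, 0.9]
        )
        model.fit(X, y.iloc[:, step])
        models.append(model)
    return models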
get_features_importance(importance_type='gain', top_n=15)
Compute and return the aggregated feature importance across all trained models. This method calculates the importance of features based on the specified importance type and aggregates the values across all models in the ensemble. It then returns the top N features sorted by their importance.
Parameters:
- importance_type : str, optional. The type of importance to retrieve from the models. Default is 'gain'. Common options include 'weight', 'gain', 'cover', etc., depending on the model's API.
- top_n : int, optional. The number of top features to return based on their importance. Default is 15.
Returns:
- pd.Series: A pandas Series containing the top N features sorted by their aggregated importance across all models. The index represents the feature names, and the values represent their importance scores.
Raises:
- ValueError: If no models have been trained yet (i.e., self.models is empty).
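Aggregating importances across the ensemble is straightforward with pandas; a sketch assuming each model exposes get_booster().get_score() as standard XGBoost estimators do.

import pandas as pd

def aggregate_importance(models, importance_type="gain", top_n=15):
    if not models:
        raise ValueError("No models have been trained yet")
    per_model = [
        pd.Series(m.get_booster().get_score(importance_type=importance_type))
        for m in models
    ]
    total = pd.concat(per_model, axis=1).fillna(0).sum(axis=1)   # sum importance per feature
    return total.sort_values(ascending=False).head(top_n)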
predict(X, y=None)
Generate predictions for the given input data and optionally compare them with actual values.
Parameters:
Returns:
Notes
- The method assumes that self.models is a list of models, each corresponding to a specific lag.
- Each model in self.models should have an inplace_predict method that returns predictions in the form of a 2D array with columns representing lower bound, predicted value, and upper bound.
- The length of self.models should match self.prediction_len, which defines the number of lags to predict.