rumboost package

Submodules

rumboost.constant_parameter module

class rumboost.constant_parameter.Constant(name: str, value: float)[source]

Bases: object

A class to represent a constant parameter, such as an ASC. These parameters are not split on.

name

The name of the parameter.

Type:: str

value

The value of the parameter.

Type:: float

__call__():: Returns the value of the parameter.

boost(grad, hess, value):

boost(grad: array, hess: array)[source]

Boost the parameter by the given grad and hess.

Parameters:

grad (np.array) – The gradient of the loss function. (n_samples,)
hess (np.array) – The hessian of the loss function. (n_samples,)
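
How the update is computed is not spelled out here. As a rough illustration only, the sketch below assumes a Newton-style step in which the constant moves by -sum(grad) / sum(hess), scaled by a learning rate. The class name ConstantSketch, the learning_rate argument and the update rule itself are assumptions made for this example, not the library's implementation.

```python
class ConstantSketch:
    """Toy stand-in for a constant parameter such as an ASC (hypothetical)."""

    def __init__(self, name: str, value: float, learning_rate: float = 0.1):
        self.name = name
        self.value = value
        self.learning_rate = learning_rate  # assumed; not in the documented signature

    def __call__(self) -> float:
        return self.value

    def boost(self, grad, hess) -> None:
        # Assumed Newton step: every sample shares the same constant, so the
        # per-sample gradients and Hessians are summed before dividing.
        total_hess = sum(hess)
        if total_hess > 0:
            self.value -= self.learning_rate * sum(grad) / total_hess


asc = ConstantSketch("asc_car", value=0.0)
asc.boost(grad=[0.2, -0.1, 0.3], hess=[0.25, 0.25, 0.25])
```

A positive summed gradient lowers the constant, which is the direction that reduces the loss under this assumption.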

rumboost.constant_parameter.compute_grad_hess(preds, device, num_classes, labels, labels_j)[source]

rumboost.datasets module

rumboost.datasets.load_preprocess_Airplane(test_size: float = 0.3, random_state: int = 42)[source]: Load and preprocess the Airplane dataset. See Biogeme website for data.

rumboost.datasets.load_preprocess_LPMC(path='/media/nicolas-salvade/Windows/Users/DAF1/OneDrive - University College London/Documents/PhD - UCL/rumboost-dev/Data/')[source]

Load and preprocess the LPMC dataset.

Returns:

dataset_train (pandas DataFrame) – The training dataset ready to use.
dataset_test (pandas DataFrame) – The test dataset ready to use.
folds (zip(list, list)) – 5 folds of indices grouped by household for CV.

rumboost.datasets.load_preprocess_MTMC(test_size: float = 0.2, random_state: int = 1, path='/media/nicolas-salvade/Windows/Users/DAF1/OneDrive - University College London/Documents/PhD - UCL/rumboost/Data/')[source]: Load and preprocess the MTMC dataset.

rumboost.datasets.load_preprocess_MTMC_all(test_size: float = 0.2, random_state: int = 1, path='/media/nicolas-salvade/Windows/Users/DAF1/OneDrive - University College London/Documents/PhD - UCL/rumboost/Data/')[source]: Load and preprocess the MTMC dataset for all Swiss zones.

rumboost.datasets.load_preprocess_Netherlands(test_size: float = 0.3, random_state: int = 42)[source]: Load and preprocess the Netherlands dataset. See Biogeme website for data.

rumboost.datasets.load_preprocess_Optima()[source]

Load and preprocess the Optima dataset. See Biogeme website for data.

Returns:

dataset_train (pandas DataFrame) – The training dataset ready to use.
dataset_test (pandas DataFrame) – The test dataset ready to use.
folds (zip(list, list)) – 5 folds of indices grouped by household for CV.

rumboost.datasets.load_preprocess_Parking(test_size: float = 0.3, random_state: int = 42)[source]: Load and preprocess the Parking dataset. See Biogeme website for data.

rumboost.datasets.load_preprocess_SwissMetro(test_size: float = 0.3, random_state: int = 42, full_data=False, path='../Data/')[source]

Load and preprocess the SwissMetro dataset. See Biogeme website for data.

Parameters:

test_size (float, optional (default = 0.3)) – The proportion of data used for test set.
random_state (int, optional (default = 42)) – Seed for a reproducible train-test split.
full_data (bool, optional (default = False)) – If the full dataset should be returned.
path (str, optional) – The path to the data.

Returns:

dataset_train (pandas DataFrame) – The training dataset ready to use.
dataset_test (pandas DataFrame) – The test dataset ready to use.

rumboost.datasets.load_preprocess_Telephone(test_size: float = 0.3, random_state: int = 3)[source]: Load and preprocess the Telephone dataset. See Biogeme website for data.

rumboost.datasets.load_preprocess_Vaccines()[source]: Load and preprocess the Vaccines dataset.

rumboost.datasets.prepare_dataset(rum_structure, df_train, num_classes, df_test=None, target='choice', free_raw_data=False, save_dataset=None, load_dataset=None)[source]

Prepare and save if required the datasets for RUMBoost.

Parameters:

rum_structure (list of dict) – The structure of the RUM model.
df_train (pandas DataFrame) – The training dataset.
params (dict) – The parameters of the model.
num_classes (int) – The number of classes.
df_test (list of pandas DataFrame, optional) – The list of test datasets.
target (str, optional) – The target variable.
free_raw_data (bool, optional) – If the raw data should be freed.
save_dataset (str, optional) – The path to save the datasets.
load_dataset (str, optional) – The path to load the datasets.

Returns:

train_sets (dict) – The training datasets.
valid_sets (dict) – The validation datasets.

rumboost.datasets.stratified_group_k_fold(X, y, groups, k, seed=None)[source]

Stratified Group K-Fold cross-validator. Provides train/test indices to split data into train/test sets.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.
y (array-like of shape (n_samples,)) – The target values.
groups (array-like of shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.
k (int) – Number of folds. Must be at least 2.
seed (int, optional) – Random seed for shuffling the data.

Yields:

train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
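
The idea of a stratified, group-aware split can be sketched as follows. This is a minimal greedy version written for illustration, not the library's code: it assigns whole groups to folds so that per-class counts stay as even as possible. The function name and the greedy balancing rule are assumptions, and X is omitted because only y and groups drive the assignment.

```python
import random
from collections import Counter, defaultdict
from statistics import pvariance


def _imbalance(fold_counts, group_count, fold, classes):
    # Spread of each class count across folds if the group went to `fold`.
    return sum(
        pvariance([
            fc[c] + (group_count[c] if j == fold else 0)
            for j, fc in enumerate(fold_counts)
        ])
        for c in classes
    )


def stratified_group_k_fold_sketch(y, groups, k, seed=None):
    """Yield (train_idx, test_idx); each group lands in exactly one test fold."""
    classes = sorted(set(y))
    counts = defaultdict(Counter)  # label counts per group
    for label, g in zip(y, groups):
        counts[g][label] += 1

    group_ids = list(counts)
    random.Random(seed).shuffle(group_ids)
    # Place large groups first; they are the hardest to balance.
    group_ids.sort(key=lambda g: -sum(counts[g].values()))

    fold_counts = [Counter() for _ in range(k)]
    fold_of = {}
    for g in group_ids:
        best = min(range(k), key=lambda f: _imbalance(fold_counts, counts[g], f, classes))
        fold_counts[best].update(counts[g])
        fold_of[g] = best

    for f in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == f]
        train = [i for i, g in enumerate(groups) if fold_of[g] != f]
        yield train, test


y = [0, 1, 0, 1, 0, 1, 0, 1]
groups = [0, 0, 1, 1, 2, 2, 3, 3]  # e.g. household identifiers
folds = list(stratified_group_k_fold_sketch(y, groups, k=2, seed=0))
```

Because groups are never split, observations from the same household cannot leak between the training and test sides of a fold.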

rumboost.linear_trees module

class rumboost.linear_trees.LinearTree(x: ndarray = None, init_leaf_val: int = 0, monotonic_constraint: int = 0, max_bin: int = 255, learning_rate: float = 0.1, lambda_l1: float = 0, lambda_l2: float = 0, bagging_fraction: float = 1, bagging_freq: int = 1, min_data_in_leaf: int = 20, min_sum_hessian_in_leaf: float = 0.001, min_gain_to_split: int = 0, min_data_in_bin: int = 3)[source]

Bases: object

add_valid(data, name: str) → LinearTree[source]

Add validation data.

Parameters:

data (Dataset) – Validation data.
name (str) – Name of validation data.

Returns:

self – Booster with set validation data.

Return type:

Booster

build_lightgbm_style_histogram(feature_values: ndarray, max_bin: int = 255, min_data_in_bin: int = 3)[source]

Build histogram for feature values similar to LightGBM’s histogram-based binning.

Parameters:

feature_values (np.ndarray) – Feature values to be binned.
max_bin (int, optional) – Maximum number of bins to create. Default is 255.
min_data_in_bin (int, optional) – Minimum number of data points required in each bin. Default is 3.

Returns:

bin_edges (np.ndarray) – Edges of the bins.
histogram (np.ndarray) – Histogram of the feature values.
bin_indices (np.ndarray) – Indices of the bins for each feature value.

dump_model(**kwargs) → dict[source]: Dump the model to a json string.

eval_train(feval)[source]

eval_valid(feval)[source]

feature_importance(type: str)[source]

Get the feature importance for the specified type.

Parameters:: type (str) – Type of feature importance to retrieve. Currently only “gain” is supported.
Returns:: Feature importance values for the specified type.
Return type:: np.ndarray

free_dataset()[source]

model_from_string(s: dict)[source]: Load the model from a dictionary.

model_to_string(**kwargs) → dict[source]: Serialize the model to a JSON string.

predict(x)[source]

rollback_one_iter()[source]: Rollback the last boosting iteration.

set_train_data_name(name: str) → LinearTree[source]

Set the name to the training Dataset.

Parameters:: name (str) – Name for the training Dataset.
Returns:: self – Booster with set training Dataset name.
Return type:: Booster

update(train_set, fobj)[source]

Update the model with new training data and compute the best split.

Parameters:

train_set (Dataset) – Training dataset containing the feature values.
fobj (callable) – Objective function to compute gradients and Hessians.

update_bounds()[source]: Update the bounds for the left and right leaves based on the current split and leaf values. This is necessary for enforcing monotonic constraints.

rumboost.metrics module

rumboost.metrics.accuracy(preds, labels)[source]

Compute accuracy of the model.

Parameters:

preds (numpy array) – Predictions for all data points and each class from a softmax function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.
labels (numpy array) – The labels of the original dataset, as int.

Returns:

Accuracy – The computed accuracy, as a float.

Return type:

float
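
As a small illustration of the preds[i, j] layout, the sketch below computes accuracy by taking the highest-probability class per row. It uses plain lists to stay dependency-free, whereas the function above takes numpy arrays; the name accuracy_sketch and the argmax rule are assumptions.

```python
def accuracy_sketch(preds, labels):
    """Share of rows whose highest-probability class equals the label."""
    hits = 0
    for row, label in zip(preds, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += predicted == label
    return hits / len(labels)


preds = [
    [0.7, 0.2, 0.1],  # predicted class 0
    [0.1, 0.3, 0.6],  # predicted class 2
    [0.4, 0.5, 0.1],  # predicted class 1
]
labels = [0, 2, 0]
acc = accuracy_sketch(preds, labels)  # 2 of 3 rows are correct
```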

rumboost.metrics.binary_cross_entropy(preds, labels)[source]

Compute binary cross entropy for given predictions and data.

Parameters:

preds (numpy array) – Predictions for all data points and each class from a sigmoid function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.
labels (numpy array) – The labels of the original dataset, as int.

Returns:

Cross entropy – The negative cross-entropy, as float.

Return type:

float

rumboost.metrics.coral_eval(preds, labels)[source]

Evaluate the Coral model using the multilabel binary cross-entropy loss function.

Parameters:

preds (np.array) – The predictions of the model.
labels (np.array) – The labels of the dataset.

Returns:

loss – The cross-entropy loss.

Return type:

float

rumboost.metrics.cross_entropy(preds, labels)[source]

Compute negative cross entropy for given predictions and data.

Parameters:

preds (numpy array) – Predictions for all data points and each class from a softmax function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.
labels (numpy array) – The labels of the original dataset, as int.

Returns:

Cross entropy – The negative cross-entropy, as float.

Return type:

float
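
For reference, the quantity described above can be sketched as the mean negative log-probability of the observed class. Whether the library averages or sums over samples is not stated here, so the averaging, the clipping constant eps and the name cross_entropy_sketch are assumptions.

```python
import math


def cross_entropy_sketch(preds, labels, eps=1e-15):
    """Mean negative log-probability assigned to the observed class."""
    total = 0.0
    for row, label in zip(preds, labels):
        # Clip to avoid log(0) when a class receives zero probability.
        total -= math.log(max(row[label], eps))
    return total / len(labels)


preds = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = [0, 2]
loss = cross_entropy_sketch(preds, labels)
```

Lower values are better; a model that puts probability 1 on every observed class would score 0.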

rumboost.metrics.mse(preds, target)[source]

Compute mean squared error for given predictions and data.

Parameters:

preds (numpy array) – Predictions for all data points and each class from a regression model. preds[i, j] corresponds to the prediction of data point i for class j.
target (numpy array) – The target values of the original dataset.

Returns:

Mean squared error – The mean squared error, as float.

Return type:

float

rumboost.metrics.safe_softplus(x, beta=1, threshold=20)[source]

Compute the softplus function in a safe way to avoid numerical issues.

Parameters:

x (numpy array) – The input of the softplus function.
beta (float) – The beta parameter for the softplus function.
threshold (float) – The threshold for the input of the exponential function.

Returns:

Softplus – The softplus function applied to x.

Return type:

numpy array
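
A common way to make softplus numerically safe is to switch to the identity once beta * x exceeds the threshold, since log(1 + exp(z)) is then indistinguishable from z. The sketch below follows that convention (the same one used by PyTorch's Softplus); that the library does exactly this is an assumption, as is the name safe_softplus_sketch.

```python
import math


def safe_softplus_sketch(x, beta=1.0, threshold=20.0):
    """softplus(x) = log(1 + exp(beta * x)) / beta, linear above the threshold."""
    out = []
    for v in x:
        z = beta * v
        if z > threshold:
            # exp(z) would dominate (or overflow); softplus(x) is ~ x here.
            out.append(v)
        else:
            out.append(math.log1p(math.exp(z)) / beta)
    return out


values = safe_softplus_sketch([-1000.0, 0.0, 1000.0])
```

Without the threshold branch, math.exp(1000.0) would raise an OverflowError.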

rumboost.metrics.weighted_binary_cross_entropy(logits, labels)[source]

Compute weighted binary cross entropy for given logits and data. The weights are all ones. This function is used in the ordinal regression model with coral estimation.

Parameters:

logits (numpy array) – Logits for all data points and each class. logits[i, j] corresponds to the logit of data point i for class j.
labels (numpy array) – The labels of the original dataset, as int.

Returns:

Cross entropy – The negative cross-entropy, as float.

Return type:

float

rumboost.models module

rumboost.models.Airplane(df_train, for_prob=False)[source]

rumboost.models.LPMC(dataset_train, for_prob=False)[source]

Create an MNL on the LPMC dataset. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.LPMC_nested(dataset_train, for_prob=False)[source]

Create a nested logit model on the LPMC dataset. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.LPMC_nested_normalised(dataset_train, for_prob=False)[source]

Create a nested logit model on the LPMC dataset, normalised for biogeme estimation. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.LPMC_normalised(dataset_train, for_prob=False)[source]

Create an MNL on the LPMC dataset, normalised for biogeme estimation. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.MTMC_lausanne_CNL(dataset_train: DataFrame, for_prob=False, results=None)[source]

Estimation of a CNL model.

Parameters:

dataset_train (pandas DataFrame) – The training dataset.
for_prob (bool, optional) – If True, the function returns a BIOGEME object for probability calculation.
results (bio.BIOGEME, optional (default=None)) – The biogeme model estimated.

Returns:

biogeme – The BIOGEME object containing the model.

Return type:

bio.BIOGEME

rumboost.models.MTMC_lausanne_MNL(dataset_train: DataFrame, for_prob=False, results=None)[source]

Estimation of an MNL model.

Parameters:

dataset_train (pandas DataFrame) – The training dataset.
for_prob (bool, optional) – If True, the function returns a BIOGEME object for probability calculation.
results (bio.BIOGEME, optional (default=None)) – The biogeme model estimated.

Returns:

biogeme – The BIOGEME object containing the model.

Return type:

bio.BIOGEME

rumboost.models.Netherlands(df_train, for_prob=False)[source]

rumboost.models.Optima(dataset_train, for_prob=False)[source]

Create an MNL on the OPTIMA dataset. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.Parking(df_train, for_prob=False)[source]

rumboost.models.SwissMetro(dataset_train: DataFrame, for_prob=False)[source]

Create an MNL on the swissmetro dataset.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.SwissMetro_MNL(dataset_train: DataFrame, for_prob=False)[source]

Create a simple MNL on the swissmetro dataset.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.SwissMetro_nested(dataset_train: DataFrame, for_prob=False)[source]

Create a nested logit model on the swissmetro dataset.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.SwissMetro_normalised(dataset_train: DataFrame, for_prob=False)[source]

Create an MNL on the swissmetro dataset.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.models.Telephone(df_train, for_prob=False)[source]

rumboost.models.Vaccines(dataset_train: DataFrame, for_prob=False)[source]

Create an MNL on the Vaccine dataset.

Parameters:: dataset_train (pandas DataFrame) – The training dataset.
Returns:: biogeme – The BIOGEME object containing the model.
Return type:: bio.BIOGEME

rumboost.nested_cross_nested module

rumboost.nested_cross_nested.cross_nested_probs(raw_preds, mu, alphas)[source]

Compute cross-nested predictions.

Parameters:

raw_preds (numpy.ndarray) – The raw predictions from the booster
mu (list) – The list of mu values for each nest. The first value corresponds to the first nest and so on.
alphas (numpy.ndarray) – An array of J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j to nest n. For example, alpha_12 = 0.5 means that alternative one belongs 50% to nest 2.

Returns:

preds (numpy.ndarray) – The cross nested predictions
pred_i_m (numpy.ndarray) – The prediction of choosing alt i knowing nest m
pred_m (numpy.ndarray) – The prediction of choosing nest m
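
The three returned quantities relate through P(i) = sum over m of P(i | m) * P(m). The sketch below works this out for a single observation under one common cross-nested logit parameterisation (membership weights raised to mu_m, top-level scale fixed to 1). Both the parameterisation and the name cross_nested_probs_sketch are assumptions; the library operates on arrays over all observations.

```python
import math


def cross_nested_probs_sketch(utilities, mu, alphas):
    """Cross-nested logit probabilities for one observation.

    utilities: utility V_j of each alternative (length J)
    mu:        scale mu_m of each nest (length M)
    alphas:    J x M membership degrees alpha_jm
    """
    n_alt, n_nests = len(utilities), len(mu)
    # Weighted exponential of every alternative inside every nest.
    terms = [
        [alphas[j][m] ** mu[m] * math.exp(mu[m] * utilities[j]) for m in range(n_nests)]
        for j in range(n_alt)
    ]
    nest_sum = [sum(terms[j][m] for j in range(n_alt)) for m in range(n_nests)]
    # P(m): each nest enters with its sum raised to 1 / mu_m.
    nest_weight = [nest_sum[m] ** (1.0 / mu[m]) for m in range(n_nests)]
    pred_m = [w / sum(nest_weight) for w in nest_weight]
    # P(i | m): share of alternative i within nest m.
    pred_i_m = [[terms[j][m] / nest_sum[m] for m in range(n_nests)] for j in range(n_alt)]
    # P(i) = sum over nests of P(i | m) * P(m).
    preds = [sum(pred_i_m[j][m] * pred_m[m] for m in range(n_nests)) for j in range(n_alt)]
    return preds, pred_i_m, pred_m


utilities = [1.0, 0.5, 0.2]
alphas = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # alternative 1 is shared by both nests
preds, pred_i_m, pred_m = cross_nested_probs_sketch(utilities, mu=[1.0, 1.0], alphas=alphas)
```

With every mu equal to 1 and each row of alphas summing to 1, the result collapses to the plain multinomial logit, which is a handy sanity check.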

rumboost.nested_cross_nested.nest_probs(raw_preds, mu, nests, nest_alt)[source]

Compute nested predictions.

Parameters:

raw_preds – The raw predictions from the booster
mu – The list of mu values for each nest. The first value corresponds to the first nest and so on.
nests – A dictionary whose keys are alternative numbers and whose values are the corresponding nest numbers. For example, {0:0, 1:1, 2:0} means that alt 0 and 2 are in nest 0 and alt 1 is in nest 1.
nest_alt – The nest of each alternative. For example, [0, 1, 0] means that alt 0 and 2 are in nest 0 and alt 1 is in nest 1.

Returns:

preds.T – The nested predictions
pred_i_m – The prediction of choosing alt i knowing nest m
pred_m – The prediction of choosing nest m
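
To make the decomposition P(i) = P(i | m) * P(m) concrete, here is a sketch of the standard nested logit formulas for a single observation, with the top-level scale fixed to 1. It takes only the nest_alt list form of the nest assignment. The name nest_probs_sketch and the single-observation layout are assumptions for illustration; the library works on arrays over all observations.

```python
import math


def nest_probs_sketch(utilities, mu, nest_alt):
    """Nested logit probabilities for one observation.

    utilities: utility V_j of each alternative
    mu:        scale mu_m of each nest
    nest_alt:  nest index of each alternative, e.g. [0, 1, 0]
    """
    n_nests = len(mu)
    # Inclusive value of each nest: log of the summed scaled exponentials.
    incl = []
    for m in range(n_nests):
        s = sum(math.exp(mu[m] * v) for v, n in zip(utilities, nest_alt) if n == m)
        incl.append(math.log(s))
    # P(m): logit over nests on the rescaled inclusive values.
    nest_weight = [math.exp(incl[m] / mu[m]) for m in range(n_nests)]
    pred_m = [w / sum(nest_weight) for w in nest_weight]
    # P(i | m): logit within the nest that alternative i belongs to.
    pred_i_m = [math.exp(mu[n] * v - incl[n]) for v, n in zip(utilities, nest_alt)]
    preds = [p * pred_m[n] for p, n in zip(pred_i_m, nest_alt)]
    return preds, pred_i_m, pred_m


utilities = [1.0, 0.5, 0.2]
preds, pred_i_m, pred_m = nest_probs_sketch(utilities, mu=[1.0, 1.0], nest_alt=[0, 1, 0])
```

Setting every mu to 1 removes the within-nest correlation, so preds then equals the softmax of the utilities.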

rumboost.nested_cross_nested.optimise_mu_or_alpha(params_to_optimise, labels, rumb, optimise_mu, optimise_alpha, alpha_shape)[source]

Optimize mu or alpha values for a given dataset.

Parameters:

params_to_optimise (list) – The list of mu or alpha values to optimize.
labels (numpy.ndarray, optional (default=None)) – The labels of the original dataset, as int.
rumb (RUMBoost, optional (default=None)) – A trained RUMBoost object.
optimise_mu (bool, optional (default=False)) – Whether to optimize mu values.
optimise_alpha (bool, optional (default=False)) – Whether to optimize alpha values.
alpha_shape (tuple) – The shape of the alpha values.

Returns:

loss – The loss according to the optimization of mu or alpha values.

Return type:

int

rumboost.ordinal module

rumboost.ordinal.diff_to_threshold(threshold_diff)[source]

Convert differences between thresholds to thresholds

Parameters:: threshold_diff (numpy.ndarray) – List of differences between thresholds, with the first element being the first threshold
Returns:: List of thresholds
Return type:: numpy.ndarray

rumboost.ordinal.optimise_thresholds_coral(thresh_diff, labels, raw_preds)[source]

Optimise thresholds for ordinal regression, with a coral model.

Parameters:

thresh_diff (numpy.ndarray) – List of threshold differences (first element is the first threshold)
labels (numpy.ndarray) – List of labels
raw_preds (numpy.ndarray) – List of predictions

Returns:

loss – The loss according to the optimisation of thresholds.

Return type:

int

rumboost.ordinal.optimise_thresholds_proportional_odds(thresh_diff, labels, raw_preds)[source]

Optimise thresholds for ordinal regression, according to the proportional odds model.

Parameters:

thresh_diff (numpy.ndarray) – List of threshold differences (first element is the first threshold)
labels (numpy.ndarray) – List of labels
raw_preds (numpy.ndarray) – List of predictions

Returns:

loss – The loss according to the optimisation of thresholds.

Return type:

int

rumboost.ordinal.threshold_preds(raw_preds, thresholds)[source]

Calculate the probabilities of each ordinal class given the raw predictions and thresholds.

Parameters:

raw_preds (numpy.ndarray) – List of raw predictions
thresholds (numpy.ndarray) – List of thresholds

Returns:

List of probabilities of each ordinal class

Return type:

numpy.ndarray
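
One way to read this function is through the proportional odds model, where P(y <= k) = sigmoid(tau_k - v) and class probabilities are differences of consecutive cumulative probabilities. The sketch below assumes that link and sign convention; the names sigmoid and threshold_preds_sketch are illustrative.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def threshold_preds_sketch(raw_preds, thresholds):
    """Class probabilities from cumulative logits P(y <= k) = sigmoid(tau_k - v)."""
    probs = []
    for v in raw_preds:
        # The last class closes the cumulative distribution at 1.
        cumulative = [sigmoid(t - v) for t in thresholds] + [1.0]
        row = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
        probs.append(row)
    return probs


probs = threshold_preds_sketch([0.0, 2.0], thresholds=[-1.0, 1.0])
```

Two thresholds produce three ordinal classes, and increasing thresholds keep every class probability non-negative.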

rumboost.ordinal.threshold_to_diff(thresholds)[source]

Convert thresholds to differences between thresholds

Parameters:: thresholds (numpy.ndarray) – List of thresholds
Returns:: List of differences between thresholds, with the first element being the first threshold
Return type:: numpy.ndarray
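
The two conversions, diff_to_threshold and threshold_to_diff, are inverses of each other: one is a cumulative sum, the other takes successive differences. A minimal sketch with plain lists (the _sketch names are hypothetical, and the library takes numpy arrays):

```python
from itertools import accumulate


def diff_to_threshold_sketch(threshold_diff):
    """Cumulative sum: the first entry is the first threshold, the rest are gaps."""
    return list(accumulate(threshold_diff))


def threshold_to_diff_sketch(thresholds):
    """Inverse mapping: keep the first threshold, then successive gaps."""
    return [thresholds[0]] + [b - a for a, b in zip(thresholds, thresholds[1:])]


thresholds = diff_to_threshold_sketch([-1.0, 0.5, 2.0])  # [-1.0, -0.5, 1.5]
diffs = threshold_to_diff_sketch(thresholds)             # [-1.0, 0.5, 2.0]
```

Working with differences is convenient for optimisation, presumably because constraining the gaps to be positive keeps the thresholds ordered.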

rumboost.post_process module

rumboost.post_process.assist_model_spec(model: RUMBoost, dataset: DataFrame, choice: Series, alt_to_normalise: int = 0, return_utilities: bool = False, dataset_test: DataFrame = None, choice_test: Series = None)[source]

Provide a piece-wise linear model specification based on a pre-trained rumboost model.

Parameters:

model (RUMBoost) – A trained rumboost model.
dataset (pd.DataFrame) – A dataset used to train the model
choice (pd.Series) – A series containing the choices
alt_to_normalise (int, optional (default=0)) – The variables of that alternative will be normalised when needed (socio-economic characteristics, ascs, …).
return_utilities (bool, optional (default=False)) – If True, the model will return the utility values, otherwise it will return the loglogit values.
dataset_test (pd.DataFrame, optional (default=None)) – Only for predictions. If None, the dataset used to train the model will be used.
choice_test (pd.Series, optional (default=None)) – A series containing the choices for the test dataset

Returns:

model_spec – A dictionary containing the model specification used to train a biogeme model.

Return type:

dict

rumboost.post_process.bootstrap(dataset: DataFrame, model_specification: dict, num_it: int = 100, seed: int = 42)[source]

Performs bootstrapping, with given dataset, parameters and rum_structure. For now, only a basic rumboost can be used.

Parameters:

dataset (pd.DataFrame) – A dataset used to train RUMBoost
model_specification (dict) – A dictionary containing the model specification used to train the model. It should follow the same structure as in the rum_train() function.
num_it (int, optional (default=100)) – The number of bootstrapping iterations
seed (int, optional (default=42)) – The seed used to randomly sample the dataset.

Returns:

models – Return a list containing all trained models.

Return type:

list
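
The resampling step behind bootstrapping can be sketched as drawing row indices with replacement, once per iteration. The helper below is hypothetical and covers only the sampling; training a model on each resample is left out.

```python
import random


def bootstrap_indices(n_rows, num_it=100, seed=42):
    """Yield one list of row indices, drawn with replacement, per iteration."""
    rng = random.Random(seed)
    for _ in range(num_it):
        yield [rng.randrange(n_rows) for _ in range(n_rows)]


samples = list(bootstrap_indices(n_rows=5, num_it=3, seed=42))
```

Each index list could then select rows of the dataset (for example with DataFrame.iloc) before fitting one model per resample; fixing the seed makes the resamples reproducible.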

rumboost.post_process.estimate_dcm_with_assisted_spec(dataset: DataFrame, choice: Series, model: RUMBoost, dataset_name: str = 'SwissMetro')[source]

Estimate a Discrete Choice Model (currently only logit) with a piece-wise linear model specification based on a pre-trained rumboost model.

Parameters:

dataset (pd.DataFrame) – A dataset used to train the model
choice (pd.Series) – A series containing the choices
model (RUMBoost) – A trained rumboost model.
dataset_name (str, optional (default="SwissMetro")) – The dataset name

Returns:

estimated_model

Return type:

biogeme.results.bioResults

rumboost.post_process.predict_with_assisted_spec(dataset_train: DataFrame, dataset_test: DataFrame, choice_train: Series, choice_test: Series, model: RUMBoost, beta_values: dict, utilities: bool = False)[source]

Predict choices with a piece-wise linear model specification based on a pre-trained rumboost model.

Parameters:

dataset_train (pd.DataFrame) – A dataset used for estimation
dataset_test (pd.DataFrame) – A dataset used for prediction
choice_train (pd.Series) – A series containing the training set choices
choice_test (pd.Series) – A series containing the test set choices
model (RUMBoost) – A trained rumboost model.
beta_values (dict) – A dictionary containing the beta values of the model, estimated on the train set.
utilities (bool, optional (default=False)) – If True, the model will return the utilities instead of the log-probs.

Returns:

prediction_results

Return type:

biogeme.results.bioResults

rumboost.post_process.split_fe_model(model: RUMBoost)[source]

Split a functional effect model and return its two parts.

Parameters:

model (RUMBoost) – A functional effect RUMBoost model with rum_structure

Returns:

attributes_model (RUMBoost) – The part of the functional effect model with trip attributes without interaction
socio_economic_model (RUMBoost) – The part of the model leading to the individual-specific constant, where socio-economic characteristics fully interact.

rumboost.rumboost module

Library with training routines of LightGBM.

class rumboost.rumboost.CVRUMBoost[source]

Bases: object

CVRUMBoost in LightGBM.

Auxiliary data structure to hold and redirect all boosters of cv function. This class has the same methods as Booster class. All method calls are actually performed for underlying Boosters and then all returned results are returned in a list.

rum_boosters

The list of underlying fitted models.

Type:: list of RUMBoost

best_iteration

The best iteration of fitted model.

Type:: int

class rumboost.rumboost.RUMBoost(model_file=None, **kwargs)[source]

Bases: object

RUMBoost for doing Random Utility Modelling in LightGBM.

Auxiliary data structure to implement boosters of rum_train() function for multiclass classification. This class has the same methods as Booster class. All method calls, except for the following methods, are actually performed for underlying Boosters.

model_from_string()
model_to_string()
save_model()

boosters

The list of fitted models.

Type:: list of Booster

valid_sets

Validation sets of the RUMBoost. By default None, to avoid computing cross entropy if there are no validation sets.

Type:: None

f_obj(preds, data)[source]

f_obj_binary(preds, data)[source]

f_obj_coral(preds, data)[source]

f_obj_cross_nested(preds, data)[source]

f_obj_full_hessian(_, __)[source]

Objective function of the boosters, for the full hessian.

Returns:

grad (numpy array) – The gradient with the cross-entropy loss function.
hess (numpy array) – The hessian with the cross-entropy loss function.

f_obj_mse(preds, data)[source]

f_obj_nest(preds, data)[source]

f_obj_proportional_odds(preds, data)[source]

model_from_string(model_str: str)[source]

Load RUMBoost from a string.

Parameters:: model_str (str) – Model will be loaded from this string.
Returns:: self – Loaded RUMBoost object.
Return type:: RUMBoost

model_to_string(num_iteration: int | None = None, start_iteration: int = 0, importance_type: str = 'split') → str[source]

Save RUMBoost to JSON string.

Parameters:

num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.
importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns:

str_repr – JSON string representation of RUMBoost.

Return type:

str

multiply_grad_hess_by_data()[source]: Decorator to multiply the gradient and hessian by the number of observations for the jth booster. This is used to scale the gradient and hessian when boosting from the parameter space, according to the chain rule.

predict(data, start_iteration: int = 0, num_iteration: int = -1, raw_score: bool = True, pred_leaf: bool = False, pred_contrib: bool = False, data_has_header: bool = False, validate_features: bool = False, utilities: bool = False)[source]

Predict logic.

Parameters:

data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame or scipy.sparse) – Data source for prediction. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).
start_iteration (int, optional (default=0)) – Start index of the iteration to predict.
num_iteration (int, optional (default=-1)) – Iteration used for prediction.
raw_score (bool, optional (default=True)) – Whether to predict raw scores.
pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
pred_contrib (bool, optional (default=False)) – Whether to predict feature contributions.
data_has_header (bool, optional (default=False)) – Whether data has header. Used only for txt data.
validate_features (bool, optional (default=False)) – If True, ensure that the features used to predict match the ones used to train. Used only if data is pandas DataFrame.
utilities (bool, optional (default=False)) – If True, return raw utilities for each class, without generating probabilities.

Returns:

result – Prediction result. Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when pred_contrib=True).

Return type:

numpy array, scipy.sparse or list of scipy.sparse

save_model(filename: str | Path, num_iteration: int | None = None, start_iteration: int = 0, importance_type: str = 'split') → RUMBoost[source]

Save RUMBoost to a file as JSON text.

Parameters:

filename (str or pathlib.Path) – Filename to save RUMBoost.
num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.
importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns:

self – Returns self.

Return type:

RUMBoost

rumboost.rumboost.rum_cv(params, train_set, num_boost_round=100, folds=None, nfold=5, stratified=True, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, eval_train_metric=False, return_cvbooster=False, rum_structure=None, biogeme_model=None)[source]

Perform the cross-validation with given parameters.

Parameters:

params (dict) – Parameters for Booster.
train_set (Dataset) – Data to be trained on.
num_boost_round (int, optional (default=100)) – Number of boosting iterations.
folds (generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)) – If generator or iterator, it should yield the train and test indices for each fold. If object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have split method. This argument has highest priority over other data split arguments.
nfold (int, optional (default=5)) – Number of folds in CV.
stratified (bool, optional (default=True)) – Whether to perform stratified sampling.
shuffle (bool, optional (default=True)) – Whether to shuffle before splitting data.
metrics (str, list of str, or None, optional (default=None)) – Evaluation metrics to be monitored while CV. If not None, the metric in params will be overridden.
fobj (callable or None, optional (default=None)) –
Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

preds (list or numpy 1-D array) – The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

train_data (Dataset) – The training dataset.

grad (list or numpy 1-D array) – The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.

hess (list or numpy 1-D array) – The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

For multi-class tasks, preds are grouped by class_id first, then by row_id. To get the i-th row's prediction for the j-th class, access preds[j * num_data + i]; grad and hess should be grouped the same way.
feval (callable, list of callable, or None, optional (default=None)) –
Customized evaluation function. Each evaluation function should accept two parameters: preds, train_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

preds (list or numpy 1-D array) – The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

train_data (Dataset) – The training dataset.

eval_name (str) – The name of evaluation function (without whitespace).

eval_result (float) – The eval result.

is_higher_better (bool) – Whether a higher eval result is better, e.g. AUC is is_higher_better.

For multi-class tasks, preds are grouped by class_id first, then by row_id. To get the i-th row's prediction for the j-th class, access preds[j * num_data + i]. To ignore the default metric corresponding to the used objective, set metrics to the string "None".
init_model (str, pathlib.Path, Booster or None, optional (default=None)) – Filename of a LightGBM model or Booster instance used to continue training.
feature_name (list of str, or 'auto', optional (default="auto")) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
categorical_feature (list of str or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.
early_stopping_rounds (int or None, optional (default=None)) – Activates early stopping. CV score needs to improve at least every early_stopping_rounds round(s) to continue. Requires at least one metric. If there’s more than one, will check all of them. To check only the first metric, set the first_metric_only parameter to True in params. Last entry in evaluation history is the one from the best iteration.
fpreproc (callable or None, optional (default=None)) – Preprocessing function that takes (dtrain, dtest, params) and returns transformed versions of those.
verbose_eval (bool, int, or None, optional (default=None)) – Whether to display the progress. If True, progress will be displayed at every boosting stage. If int, progress will be displayed at every given verbose_eval boosting stage.
show_stdv (bool, optional (default=True)) – Whether to display the standard deviation in progress. Results are not affected by this parameter, and always contain std.
seed (int, optional (default=0)) – Seed used to generate the folds (passed to numpy.random.seed).
callbacks (list of callable, or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.
eval_train_metric (bool, optional (default=False)) – Whether to display the train metric in progress. The score of the metric is calculated again after each training step, so there is some impact on performance.
return_cvbooster (bool, optional (default=False)) – Whether to return Booster models trained on each fold through CVBooster.
rum_structure (list of dict, optional (default=None)) –
List of dictionaries specifying the RUM structure. The list must contain one dictionary for each class, which describes the utility structure for that class. Each dictionary has three allowed keys:

cols : list of columns included in that class
monotone_constraints : list of monotonic constraints on parameters
interaction_constraints : list of interaction constraints on features

If None, a biogeme_model must be specified.
biogeme_model (biogeme.biogeme.BIOGEME, optional (default=None)) – A biogeme.biogeme.BIOGEME object representing a biogeme model, used to create the rum_structure. A biogeme model is required if rum_structure is None, otherwise should be None.
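
The class-major layout described for fobj and feval (index j * num_data + i) is easy to get wrong. The following is a minimal sketch in plain Python of converting between that flat layout and one list of class scores per observation. The helper names flat_to_rows and rows_to_flat are hypothetical and are not part of rumboost or LightGBM.

```python
def flat_to_rows(flat, num_data, num_class):
    """Turn a class-major flat list into one list of class scores per row."""
    # The entry for row i and class j sits at flat[j * num_data + i].
    return [[flat[j * num_data + i] for j in range(num_class)]
            for i in range(num_data)]


def rows_to_flat(rows):
    """Inverse of flat_to_rows: group by class first, then by row."""
    num_data, num_class = len(rows), len(rows[0])
    return [rows[i][j] for j in range(num_class) for i in range(num_data)]


# Two observations, three classes: class 0 scores first, then class 1, then class 2.
flat = [0.1, 0.2, 1.1, 1.2, 2.1, 2.2]
rows = flat_to_rows(flat, num_data=2, num_class=3)
```

A custom objective written against the per-row layout can compute grad and hess row by row and then pass them through rows_to_flat before returning.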

Returns:

eval_hist – Evaluation history. The dictionary has the following format: {‘metric1-mean’: [values], ‘metric1-stdv’: [values], ‘metric2-mean’: [values], ‘metric2-stdv’: [values], …}. If return_cvbooster=True, also returns trained boosters via cvbooster key.

Return type:

dict
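
As an illustration of the eval_hist format described above, the sketch below picks the iteration with the best mean score. The function name best_iteration, the metric name and the numbers are made up for the example; only the 'metric-mean' / 'metric-stdv' key pattern comes from the description.

```python
def best_iteration(eval_hist, metric, higher_is_better=False):
    """Return (1-based iteration, mean, stdv) of the best mean score for `metric`."""
    means = eval_hist[f"{metric}-mean"]
    stdvs = eval_hist[f"{metric}-stdv"]
    pick = max if higher_is_better else min
    idx = pick(range(len(means)), key=means.__getitem__)
    return idx + 1, means[idx], stdvs[idx]


# Made-up history following the documented key format.
eval_hist = {
    "metric1-mean": [0.92, 0.81, 0.78, 0.79],
    "metric1-stdv": [0.02, 0.02, 0.03, 0.03],
}
best = best_iteration(eval_hist, "metric1")
```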

Perform the RUM training with given parameters.

Parameters:

train_set (Dataset or dict[int, Any]) –
Data to be trained on. Set free_raw_data=False when creating the dataset. If it is a dictionary, the key-value pairs should be:
- ”train_sets”: the corresponding preprocessed Dataset.
- ”num_data”: the number of observations in the dataset.
- ”labels”: the labels of the full dataset.
- ”labels_j”: the labels of the dataset for each class (binary).
model_specification (dict) –
Dictionary specifying the model specification. The required keys are:
- ’general_params’: dict
  Dictionary containing the general parameters for the RUMBoost model. The dictionary can contain the following keys:
  
  ’num_iterations’: int
  Number of boosting iterations.
  
  ’num_classes’: int
  Number of classes. If equal to 2 and no additional keys are provided, the model will perform binary classification. If greater than 2, the model will perform multiclass classification. If equal to 1, the model will perform regression with MSE (other loss functions will be implemented in the future).
  
  ’subsampling’: float, optional (default = 1.0)
  Subsample ratio of the gradient when boosting.
  
  ’subsampling_freq’: int, optional (default = 0)
  Subsample frequency.
  
  ’subsample_valid’: float, optional (default = 1.0)
  Subsample ratio of validation data.
  
  ’batch_size’: int, optional (default = 0)
  Batch size for the training. The batch size will override the subsampling.
  
  ’early_stopping_rounds’: int, optional (default = None)
  Activates early stopping. The model will train until the validation score stops improving.
  
  ’verbosity’: int, optional (default = 1)
  Verbosity of the model.
  
  ’verbose_interval’: int, optional (default = 10)
  Interval of the verbosity display. Only used if verbosity > 1.
  
  ’max_booster_to_update’: int, optional (default = num_classes)
  Maximum number of boosters to update at each round. It has to be at least equal to the number of classes, and at most equal to the number of classes times the maximum number of boosters in the smallest utility function. This is intended to update each utility function with the same number of trees.
  
  ’boost_from_parameter_space’: list, optional (default = [])
  If True, the boosting will be done in the parameter space, as opposed to the utility space. It means that the GBDT algorithm will output betas instead of piece-wise constant utility values. The resulting utility functions will be piece-wise linear. Monotonicity is not guaranteed in this case and only one variable per parameter ensemble is allowed.
  
  ’optim_interval’: int, optional (default = 20)
  If all the ensembles are boosted from the parameter space, the interval at which the ASCs are optimised. If 0, the ASCs are fixed.
  
  ’save_model_interval’: int, optional (default = 0)
  The interval at which the model will be saved during training.
  
  ’eval_function’: func (default = cross_entropy if multi-class, binary_log_loss if binary, mse if regression)
  The evaluation function to be used.
  
  ’full_hessian’: bool, optional (default = False)
  If True, the full hessian is used to compute the gradients and hessians. Currently only implemented for the multiclass case, and only works with cuda.
- ’rum_structure’: list[dict[str, Any]]
List of dictionaries specifying the variables used to create the parameter ensemble, and their monotonicity or interaction. The list must contain one dictionary for each parameter. Each dictionary has four required keys:

’utility’: list of alternatives in which the parameter ensemble is used. If more than one alternative is specified, the parameter ensemble is shared across alternatives, and the number of variables shared must be equal to the number of alternatives.

’variables’: list of columns from the train_set included in that parameter_ensemble. This is the list of variables on which the splits will be done.

’boosting_params’: dict
Dictionary containing the boosting parameters for the parameter ensemble. These parameters are the same as the LightGBM parameters. More information here: https://lightgbm.readthedocs.io/en/latest/Parameters.html.

’shared’: bool
If True, the parameter ensemble is shared across all alternatives. When shared, the number of variables shared must be equal to the number of alternatives. If the same variable is shared across alternatives, it must be repeated in the variables list (for example, variables = [‘var1’, ‘var1’, ‘var1’] and utility = [0, 1, 2]).

And two optional keys:

’endogenous_variable’: str
The name of one variable in the train_set. This is only used if the ensemble is boosted from the parameter space and the variable is not included in the variables list. The trees output the slope (beta), and endogenous_variable is the variable x used in the beta times x term. The variable must be continuous or binary.

’init_leaf_val’: float
Initial leaf value for the ensemble in the parameter space. This will only be used for ensembles boosted from the parameter space.
The other keys are optional and can be:
- ’nested_logit’: dict
  
  Nested logit model specification. The dictionary must contain:
  
  ’mu’: ndarray
  An array of mu values, the scaling parameters, for each nest. The first value of the array corresponds to nest 0, and so on. By default, the value of mu is 1 and is optimised through scipy.optimize.minimize. Mu competes against the other parameter ensembles at each round to be selected as the updated parameter ensemble.
  
  ’nests’: dict
  A dictionary representing the nesting structure. Keys are nests, and values are the lists of alternatives in each nest. For example, {0: [0, 1], 1: [2, 3]} means that alternatives 0 and 1 are in nest 0, and alternatives 2 and 3 are in nest 1.
  
  ’optimise_mu’: bool or list[bool], optional (default = True)
  If True, the mu values are optimised through scipy.optimize.minimize. If a list of booleans, the length must be equal to the number of nests. For example, [True, False] means that mu_0 is optimised and mu_1 is fixed.
  
  ’optim_interval’: int, optional (default = 20)
  Interval at which the mu values are optimised.
- ’cross_nested_logit’: dict
  
  Cross-nested logit model specification. The dictionary must contain:
  
  ’mu’: ndarray
  An array of mu values, the scaling parameters, for each nest. The first value of the array corresponds to nest 0, and so on.
  
  ’alphas’: ndarray
  An array of J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j to nest n. For example, alpha_12 = 0.5 means that alternative 1 belongs 50% to nest 2.
  
  ’optimise_mu’: bool or list[bool], optional (default = True)
  If True, the mu values are optimised through scipy.optimize.minimize. If a list of booleans, the length must be equal to the number of nests. For example, [True, False] means that mu_0 is optimised and mu_1 is fixed.
  
  ’optimise_alphas’: bool or ndarray[bool], optional (default = False)
  If True, the alphas are optimised through scipy.optimize.minimize. This is not recommended for high-dimensional datasets as it can be computationally expensive. If an array of booleans, the array must have the same size as alphas. For example, if optimise_alphas_ij = True, alpha_ij will be optimised.
  
  ’optim_interval’: int, optional (default = 20)
  Interval at which the mu and/or alpha values are optimised.
- ’ordinal_logit’: dict
  Ordinal logit model specification. The dictionary must contain:
  
  ’model’: str, default = ‘proportional_odds’
  
  The type of ordinal model. It can be:
  
  ’proportional_odds’: the proportional odds model.
  
  ’coral’: a rank consistent binary decomposition model.
  
  ’optim_interval’: int, optional (default = 20)
  Interval at which the thresholds are optimised. This is only used for the proportional odds and the coral models. If 0, the thresholds are fixed. For ordinal models, the thresholds are optimised from the first iteration.
num_boost_round (int, optional (default = 100)) – Number of boosting iterations.
valid_sets (list of Dataset, dict, or None, optional (default = None)) –
List of data to be evaluated on during training. If the train_set is passed as already preprocessed, it is assumed that valid_sets are also preprocessed. Therefore it should be a dictionary following this structure:
- ”valid_sets”: a list of list of corresponding preprocessed validation Datasets.
- ”valid_labels”: a list of the valid dataset labels.
- ”num_data”: a list of the number of data in validation datasets.
Note, you can pass several datasets for validation, but only the first one will be used for early stopping.
feval (callable, list of callable, or None, optional (default = None)) –
Customized evaluation function. Each evaluation function should accept two parameters, preds and eval_data, and return (eval_name, eval_result, is_higher_better) or a list of such tuples.

preds : numpy 1-D array or numpy 2-D array (for multi-class task)
The predicted values. For a multi-class task, preds are a numpy 2-D array of shape = [n_samples, n_classes]. If a custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margins instead of the probability of the positive class for a binary task.

eval_data : Dataset
A Dataset to evaluate.

eval_name : str
The name of the evaluation function (without whitespace).

eval_result : float
The eval result.

is_higher_better : bool
Whether a higher eval result is better, e.g. AUC is is_higher_better.

To ignore the default metric corresponding to the used objective, set the metric parameter to the string "None" in params.
init_models (list[str], list[pathlib.Path], list[Booster] or None, optional (default = None)) – List of filenames of LightGBM models or Booster instances used to continue training. There should be one model for each rum_structure dictionary.
feature_name (list of str, or 'auto', optional (default = "auto")) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
categorical_feature (list of str or int, or 'auto', optional (default = "auto")) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features will be rounded towards 0.
keep_training_booster (bool, optional (default = False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning. This means you won’t be able to use eval, eval_train or eval_valid methods of the returned Booster. When your model is very large and causes a memory error, you can try setting this param to True to avoid the model conversion performed during the internal call of model_to_string. You can still use _InnerPredictor as init_model to continue training later.
callbacks (list of callable, or None, optional (default = None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.
torch_tensors (dict, optional (default=None)) –
If a dictionary is passed, torch.Tensors will be used for computing prediction, objective function and cross-entropy calculations. This requires PyTorch to be installed. The dictionary should have the following form:

’device’: ‘cpu’, ‘gpu’ or ‘cuda’
The device on which the calculations will be performed.

’torch_compile’: bool
If True, the prediction, objective function and cross-entropy calculations will be compiled with torch.compile. If used with GPU or cuda, this requires a Linux OS.
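
To make the model_specification layout described above more concrete, here is a minimal sketch of a three-alternative specification together with a few sanity checks. The column names (travel_time_walk, cost_bus, cost_car), the chosen values and the helper check_model_specification are hypothetical; the dictionary keys come from the parameter description, and the boosting parameters shown are standard LightGBM parameter names. The assumption that alternatives are indexed from 0 to num_classes - 1 is taken from the examples in the description, not from the implementation.

```python
# Hypothetical column names; replace them with columns from your own train_set.
model_specification = {
    "general_params": {
        "num_iterations": 200,
        "num_classes": 3,
        "early_stopping_rounds": 20,
        "verbosity": 1,
    },
    "rum_structure": [
        {
            # Ensemble used only in the utility of alternative 0.
            "utility": [0],
            "variables": ["travel_time_walk"],
            "boosting_params": {
                "learning_rate": 0.1,
                "max_depth": 1,
                "monotone_constraints": [-1],
                "verbosity": -1,
            },
            "shared": False,
        },
        {
            # Ensemble shared by alternatives 1 and 2: one variable per alternative.
            "utility": [1, 2],
            "variables": ["cost_bus", "cost_car"],
            "boosting_params": {"learning_rate": 0.1, "max_depth": 1, "verbosity": -1},
            "shared": True,
        },
    ],
}

REQUIRED_ENSEMBLE_KEYS = {"utility", "variables", "boosting_params", "shared"}


def check_model_specification(spec):
    """Apply a few sanity checks taken from the parameter description."""
    num_classes = spec["general_params"]["num_classes"]
    for k, ensemble in enumerate(spec["rum_structure"]):
        missing = REQUIRED_ENSEMBLE_KEYS - set(ensemble)
        if missing:
            raise ValueError(f"ensemble {k} lacks keys {sorted(missing)}")
        # Assumption: alternatives are indexed 0 .. num_classes - 1.
        if any(not 0 <= alt < num_classes for alt in ensemble["utility"]):
            raise ValueError(f"ensemble {k} refers to an unknown alternative")
        # A shared ensemble needs one variable per alternative.
        if ensemble["shared"] and len(ensemble["variables"]) != len(ensemble["utility"]):
            raise ValueError(f"ensemble {k}: variables and utility differ in length")
    return True


is_valid = check_model_specification(model_specification)
```

Running a check like this before training catches a missing key or a mismatched shared ensemble early, rather than part-way through boosting.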

Note

A custom objective function can be provided for the objective parameter. It should accept two parameters, preds and train_data, and return (grad, hess).

preds : numpy 1-D array or numpy 2-D array (for multi-class task)
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margins instead of the probability of the positive class for a binary task.

train_data : Dataset
The training dataset.

grad : numpy 1-D array or numpy 2-D array (for multi-class task)
The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.

hess : numpy 1-D array or numpy 2-D array (for multi-class task)
The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

For a multi-class task, preds are a numpy 2-D array of shape = [n_samples, n_classes], and grad and hess should be returned in the same format.
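
The shape contract in this note can be illustrated with a softmax cross-entropy objective. This is a sketch only: nested Python lists stand in for the numpy 2-D arrays, FakeDataset is a hypothetical stand-in for a LightGBM Dataset (only its get_label() method is used), and the Hessian is the usual diagonal approximation p * (1 - p). Whether rum_train forwards such a function unchanged is not confirmed by this page.

```python
import math


class FakeDataset:
    """Stand-in exposing get_label(), the Dataset method used below."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


def softmax_objective(preds, train_data):
    """Gradient and diagonal Hessian of softmax cross-entropy w.r.t. raw scores."""
    labels = train_data.get_label()
    grad, hess = [], []
    for row, label in zip(preds, labels):
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # d(loss)/d(score_j) = p_j - 1 if j is the observed class, else p_j.
        grad.append([p - (1.0 if j == int(label) else 0.0)
                     for j, p in enumerate(probs)])
        hess.append([p * (1.0 - p) for p in probs])
    return grad, hess


# One observation, three classes, equal raw scores, observed class 2.
grad, hess = softmax_objective([[0.0, 0.0, 0.0]], FakeDataset([2]))
```

Both outputs have one row per sample and one column per class, matching the [n_samples, n_classes] layout of preds.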

Returns:

rum_booster – The trained RUMBoost model.

Return type:

RUMBoost
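
The 'nests' and 'alphas' entries of model_specification described above can be checked before training. The sketch below is illustrative: both helper names are hypothetical, and it assumes that in a nested logit every alternative belongs to exactly one nest and that cross-nested memberships lie between 0 and 1, which this page suggests but does not state as a rule.

```python
def check_nests(nests, num_alternatives):
    """Verify that every alternative belongs to exactly one nest."""
    seen = [alt for alts in nests.values() for alt in alts]
    return sorted(seen) == list(range(num_alternatives))


def check_alphas(alphas, num_alternatives, num_nests):
    """Verify the J-by-M shape and that every membership lies in [0, 1]."""
    if len(alphas) != num_alternatives:
        return False
    return all(
        len(row) == num_nests and all(0.0 <= a <= 1.0 for a in row)
        for row in alphas
    )


# Nesting structure from the description: alternatives 0 and 1 in nest 0, 2 and 3 in nest 1.
nests_ok = check_nests({0: [0, 1], 1: [2, 3]}, num_alternatives=4)

# Three alternatives, two nests; alternative 1 belongs 50% to each nest.
alphas_ok = check_alphas([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], 3, 2)
```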

rumboost.torch_functions module

rumboost.torch_functions.binary_cross_entropy_torch(preds, label)[source]

Compute binary cross entropy for given predictions and data.

Parameters:

preds (torch.Tensor) – Predictions for all data points from a sigmoid function.
label (torch.Tensor) – The labels of the original dataset, as int.

Returns:

Binary cross entropy – The binary cross-entropy, as float.

Return type:

float
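
For reference, the quantity described above can be written in a few lines of plain Python. This is a sketch, assuming the loss is averaged over data points and that predictions are clipped to avoid log(0); neither detail is confirmed for binary_cross_entropy_torch, and the function name below is hypothetical.

```python
import math


def binary_cross_entropy(preds, labels, eps=1e-15):
    """Mean binary cross-entropy for sigmoid outputs `preds` and 0/1 `labels`."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(preds)


# Uninformative predictions of 0.5 give a loss of log(2).
loss = binary_cross_entropy([0.5, 0.5], [1, 0])
```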

rumboost.torch_functions.binary_cross_entropy_torch_compiled(preds, label)[source]

Compute binary cross entropy for given predictions and data.

Parameters:

preds (torch.Tensor) – Predictions for all data points from a sigmoid function.
label (torch.Tensor) – The labels of the original dataset, as int.

Returns:

Binary cross entropy – The binary cross-entropy, as float.

Return type:

float

rumboost.torch_functions.compile_decorator(func)

rumboost.torch_functions.coral_eval_torch(preds, labels)[source]

Compute the coral evaluation function.

Parameters:

preds (torch.Tensor) – The predictions from the booster
labels (torch.Tensor) – The labels of the original dataset, as int.

Returns:

The coral evaluation function, as float.

Return type:

float

rumboost.torch_functions.coral_eval_torch_compiled(preds, labels)[source]

Compute the coral evaluation function.

Parameters:

preds (torch.Tensor) – The predictions from the booster
labels (torch.Tensor) – The labels of the original dataset, as int.

Returns:

The coral evaluation function, as float.

Return type:

float

rumboost.torch_functions.cross_entropy_torch(preds, labels)[source]

Compute negative cross entropy for given predictions and data.

Parameters:

preds (torch.Tensor) – Predictions for all data points and each class from a softmax function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.
labels (torch.Tensor) – The labels of the original dataset, as int.

Returns:

Cross entropy – The negative cross-entropy, as float.

Return type:

float
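
A plain-Python sketch of this quantity, assuming it is the mean negative log-probability of the observed class (the page does not state the reduction used by cross_entropy_torch). The function name and the clipping constant are hypothetical.

```python
import math


def cross_entropy(preds, labels, eps=1e-15):
    """Mean negative log-probability of the observed class."""
    total = 0.0
    for row, label in zip(preds, labels):
        # row[label] is the predicted probability of the observed class.
        total -= math.log(max(row[int(label)], eps))
    return total / len(labels)


# Each observed class is predicted with probability 0.5, so the loss is log(2).
preds = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
loss = cross_entropy(preds, [0, 1])
```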

rumboost.torch_functions.cross_entropy_torch_compiled(preds, labels)[source]

Compute negative cross entropy for given predictions and data.

Parameters:

preds (torch.Tensor) – Predictions for all data points and each class from a softmax function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.
labels (torch.Tensor) – The labels of the original dataset, as int.

Returns:

Cross entropy – The negative cross-entropy, as float.

Return type:

float

rumboost.torch_functions.mse_torch(preds, target)[source]

Compute the mean squared error for given predictions and data.

Parameters:

preds (torch.Tensor) – Predictions for all data points.
target (torch.Tensor) – The target values of the original dataset.

Returns:

Mean squared error – The mean squared error, as float.

Return type:

float

rumboost.torch_functions.mse_torch_compiled(preds, target)[source]

Compute the mean squared error for given predictions and data.

Parameters:

preds (torch.Tensor) – Predictions for all data points.
target (torch.Tensor) – The target values of the original dataset.

Returns:

Mean squared error – The mean squared error, as float.

Return type:

float

rumboost.utility_plotting module

rumboost.utility_plotting.compute_VoT(util_collection, u, f1, f2)[source]

Compute the Value of Time of the attributes specified in attribute_VoT.

Parameters:

util_collection (dict) – A dictionary containing the type of utility to use for all features in all utilities.
u (str) – The utility number, as a str (e.g. ‘0’, ‘1’, …).
f1 (str) – The time-related attribute name.
f2 (str) – The cost-related attribute name.

Returns:

VoT – The function calculating value of time for attribute1 and attribute2.

Return type:

lambda function
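
The page does not spell out the formula used by compute_VoT. The sketch below assumes the usual definition of value of time as the ratio of the marginal utility of time to the marginal utility of cost, and approximates both derivatives by central finite differences, so it only makes sense for smooth (for example spline-based) utilities. The function name value_of_time and the linear utilities are hypothetical.

```python
def value_of_time(u_time, u_cost, h=1e-6):
    """Return a function (t, c) -> VoT from two one-dimensional utility functions."""

    def derivative(f, x):
        # Central finite difference; assumes f is smooth around x.
        return (f(x + h) - f(x - h)) / (2.0 * h)

    return lambda t, c: derivative(u_time, t) / derivative(u_cost, c)


# Hypothetical linear utilities: -0.1 per minute and -0.5 per currency unit.
vot = value_of_time(lambda t: -0.1 * t, lambda c: -0.5 * c)
vot_value = vot(10.0, 2.0)  # about 0.2 currency units per minute
```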

rumboost.utility_plotting.create_name(features)[source]: Create new feature names from a list of feature names

rumboost.utility_plotting.function_2d(weights_2d, x_vect, y_vect)[source]

Create the nonlinear contour plot for parameters, from weights gathered in getweights_v2

Parameters:

weights_2d (dict) – Pandas DataFrame containing all possible rectangles with their corresponding area values, for the given feature and utility.
x_vect (numpy array) – Vector of higher level feature.
y_vect (numpy array) – Vector of lower level feature.

Returns:

contour_plot_values – Array with values at (x,y) points.

Return type:

numpy array

rumboost.utility_plotting.get_asc(weights, alt_to_normalise='Driving', alternatives={'Cycling': '1', 'Driving': '3', 'Public Transport': '2', 'Walking': '0'})[source]: Retrieve ASCs from a dictionary of leaf values per alternative per feature.

rumboost.utility_plotting.get_child(model, weights, weights_2d, weights_market, tree, split_points, features, feature_names, i, market_segm, direction=None)[source]: Dig into the tree to get splitting points, features, left and right leaves values

rumboost.utility_plotting.get_weights(model, num_iteration=None)[source]

Get leaf values from a RUMBoost model.

Parameters:

model (RUMBoost) – A trained RUMBoost object.
num_iteration (int, optional (default = None)) – The number of iterations to consider in the model.

Returns:

weights_df (pandas DataFrame) – DataFrame containing all split points and their corresponding left and right leaves value, for all features.
weights_2d_df (pandas DataFrame) – Dataframe with weights arranged for a 2d plot, used in the case of 2d feature interaction.
weights_market (pandas DataFrame) – Dataframe with weights arranged for market segmentation, used in the case of market segmentation.

rumboost.utility_plotting.lintree_to_weights(split_and_leaf_values: dict, feature: str, utility: int)[source]

Convert a split and leaf values dictionary from a linear tree to a list of weights. The split_and_leaf_values dictionary should contain the keys “splits” and “leaves”, where “splits” is a list of split points and “leaves” is a list of leaf values.

Parameters:

split_and_leaf_values (dict) – A dictionary containing the split points and leaf values. It should have the keys “splits” and “leaves”.
feature (str) – The name of the feature for which the weights are being calculated.
utility (int) – The utility index for which the weights are being calculated.

Returns:

lin_weights – A list of lists, where each inner list contains the feature name, split point, left leaf value, right leaf value, and utility index.

Return type:

list
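
One way to read the description above is sketched here. It assumes that "leaves" holds one more value than "splits" and that split i separates leaves i and i + 1; the actual implementation may pair them differently. The function name carries a _sketch suffix to make clear it is not the library function.

```python
def lintree_to_weights_sketch(split_and_leaf_values, feature, utility):
    """Build [feature, split, left leaf, right leaf, utility] rows."""
    splits = split_and_leaf_values["splits"]
    leaves = split_and_leaf_values["leaves"]
    # Assumption: n split points delimit n + 1 leaves.
    if len(leaves) != len(splits) + 1:
        raise ValueError("expected one more leaf than split points")
    return [
        [feature, split, leaves[i], leaves[i + 1], utility]
        for i, split in enumerate(splits)
    ]


# Hypothetical values for a feature named "travel_time" in utility 0.
lin_weights = lintree_to_weights_sketch(
    {"splits": [10.0, 20.0], "leaves": [0.0, -0.5, -1.2]}, "travel_time", 0
)
```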

rumboost.utility_plotting.non_lin_function(weights_ordered, x_min, x_max, num_points, boosted_from_parameter_space=False)[source]

Create the nonlinear function for parameters, from weights ordered by ascending splitting points.

Parameters:

weights_ordered (dict) – Dictionary containing splitting points and corresponding cumulative weights value for a specific feature’s parameter.
x_min (float, int) – Minimum x value for which the nonlinear function is computed.
x_max (float, int) – Maximum x value for which the nonlinear function is computed.
num_points (int) – Number of points used to draw the nonlinear function line.
boosted_from_parameter_space (bool, optional (default = False)) – Set to True if the weights are from the parameter space. It means that the weights are betas, and not piece-wise constant utilities.

Returns:

x_values (list) – X values for which the function will be plotted.
nonlin_function (list) – Values of the function at the corresponding x points.
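
For the utility-space case, evaluating a piece-wise constant function on a grid can be sketched as follows. The input layout (a sorted list of split points plus one value per interval) is an assumption made for the example and is not the layout of weights_ordered; the name step_function is hypothetical. Points equal to a split point are assigned to the left interval, following the LightGBM convention that x <= threshold goes to the left child.

```python
from bisect import bisect_left


def step_function(splits, values, x_min, x_max, num_points):
    """Evaluate a piece-wise constant function on an evenly spaced grid.

    values[k] is the function value between splits[k - 1] and splits[k].
    """
    step = (x_max - x_min) / (num_points - 1)
    x_values = [x_min + k * step for k in range(num_points)]
    # bisect_left counts the split points strictly below x.
    y_values = [values[bisect_left(splits, x)] for x in x_values]
    return x_values, y_values


x_values, y_values = step_function([1.0, 2.0], [0.0, -0.5, -1.5], 0.0, 3.0, 7)
```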

rumboost.utility_plotting.plot_2d(model, feature1: str, feature2: str, min1: int, max1: int, min2: int, max2: int, save_figure: bool = False, utility_names: list[str] = ['Walking', 'Cycling', 'Public Transport', 'Driving'], num_points=1000)[source]

Plot a 2nd order feature interaction as a contour plot.

Parameters:

model (RUMBoost) – A RUMBoost object.
feature1 (str) – Name of feature 1.
feature2 (str) – Name of feature 2.
min1 (int) – Minimum value of feature 1.
max1 (int) – Maximum value of feature 1.
min2 (int) – Minimum value of feature 2.
max2 (int) – Maximum value of feature 2.
save_figure (bool, optional (default = False)) – If true, save the figure as a png file
utility_names (list[str], optional (default=['Walking', 'Cycling', 'Public Transport', 'Driving'])) – List of the alternative names
num_points (int, optional (default=1000)) – The number of points per axis. The total number of points is num_points**2.

rumboost.utility_plotting.plot_VoT(data_train, util_collection, attribute_VoT, utility_names, draw_range, save_figure=False, num_points=1000)[source]

Plot the Value of Time of the attributes specified in attribute_VoT.

Parameters:

util_collection (dict) – A dictionary containing the type of utility to use for all features in all utilities.
attribute_VoT (dict) – A dictionary with keys being the utility number (as string) and values being a tuple of the attributes to compute the VoT on. The structure follows this form: {utility: (attribute1, attribute2)}
utility_names (dict) – A dictionary containing the names of the utilities. The structure of the dictionary follows this form: {utility: names}
draw_range (dict) – A dictionary containing the range of the attributes to draw the VoT. The structure of the dictionary follows this form: {utility: {attribute: (min, max)}}
save_figure (bool, optional (default = False)) – If True, save the plot as a png file.
num_points (int, optional (default = 1000)) – The number of points used to draw the contour plot.

rumboost.utility_plotting.plot_bootstrap(models: list, dataset: DataFrame, features: dict[list[str]])[source]

Plot the bootstrap sampling.

Parameters:

models (list) – A list containing all the trained models of the bootstrap sampling.
dataset (pd.DataFrame) – The full dataset used for training.
features (dict[list[str]]) – A dictionary of lists of strings mapping each alternative number to the features for that alternative, e.g. {‘0’: [‘feature_1’, …], ‘1’: [], …}

rumboost.utility_plotting.plot_ind_spec_constant(socec_model, dataset_train, alternatives: list[str])[source]

Plot a histogram of all alternatives individual specific constant of a functional effect model.

Parameters:

socec_model – The part of the functional effect model with full interactions of socio-economic characteristics.
dataset_train – The dataset used to train the model. It must be a lightGBM Dataset object.
alternatives (list[str]) – The list of alternatives name.

rumboost.utility_plotting.plot_market_segm(model, X, asc_normalised: bool = True, utility_names: list[str] = ['Walking', 'Cycling', 'Public Transport', 'Driving'])[source]

Plot the market segmentation.

Parameters:

model (RUMBoost) – A RUMBoost object.
X (pandas DataFrame) – Training data.
asc_normalised (bool, optional (default = True)) – If True, scale down utilities to be zero at the y axis.
utility_names (list[str], optional (default = ['Walking', 'Cycling', 'Public Transport', 'Driving'])) – Names of utilities.

rumboost.utility_plotting.plot_parameters(model, X, utility_names, feature_names=None, asc_normalised=True, with_asc=False, xlabel_max=None, only_tt=False, only_1d=True, sm_tt_cost=False, num_iteration=None, ylim=None, boost_from_parameter_space=None, group_feature=None, save_file='')[source]

Plot the non linear impact of parameters on the utility function.

Parameters:

model (RUMBoost) – A RUMBoost object.
X (pandas dataframe) – Features used to train the model, in a pandas dataframe.
utility_names (dict) – Dictionary mapping booster indices to their utility names. Keys should be a string of the booster index, and values should be the utility name.
feature_names (list, optional (default = None)) – List of feature names.
asc_normalised (bool, optional (default = True)) – If True, scale down utilities to be zero at the y axis.
with_asc (bool, optional (default = False)) – If True, add the ASCs to all graphs (one is normalised, and asc_normalised must be True).
xlabel_max (dict, optional (default = None)) – Dictionary mapping boosters to their maximum value on the x axis.
only_tt (bool, optional (default = False)) – If True, plot only travel time and distance.
only_1d (bool, optional (default = True)) – If False, plot only the features separately.
sm_tt_cost (bool, optional (default = False)) – If True, plot only the swissmetro travel time and cost on the same figure.
num_iteration (int, optional (default = None)) – The number of iterations to plot. If None, plot all iterations.
ylim (list[tuple], optional (default = None)) – List of tuples containing the y limits for each plot.
boost_from_parameter_space (dict[dict[bool]], optional (default = None)) – Dictionary of dictionary mapping booster to their type of boosting (parameter or utility space). First key should be a string of the booster index, first value / second key should be the utility name and second value is True if boosted from parameter space, False otherwise.
group_feature (dict, optional (default = None)) – This variable can be used if a feature has several ensembles and all of them should be grouped in one plot. Keys should be the feature name, and values should be the list of ensemble indices in rum_structure.
save_file (str, optional (default='')) – The name to save the figure with. The figure will be saved only if save_file is not an empty string.

rumboost.utility_plotting.plot_pop_VoT(data_train, util_collection, attribute_VoT, save_figure=False)[source]

Plot the Value of Time for the given observations.

Parameters:

data_train (pd.DataFrame) – The training dataset.
util_collection (dict) – A dictionary containing the utility function (spline or tree) to use for all features in all utilities where the VoT is computed. It follows this structure: {utility: {feature: tree/spline function}}
attribute_VoT (dict) – A dictionary with keys being the utility number (as string) and values being a tuple of the attributes to compute the VoT on. The structure follows this form: {utility: (attribute1, attribute2)}
save_figure (bool, optional (default = False)) – If True, save the plot as a png file.

rumboost.utility_plotting.plot_spline(model, data_train, spline_collection, utility_names, mean_splines=False, x_knots_dict=None, linear_extrapolation=False, save_fig=False, lpmc_tt_cost=False, sm_tt_cost=False, save_file='')[source]

Plot the spline interpolation for all utilities interpolated.

Parameters:

model (RUMBoost) – A RUMBoost object.
data_train (pandas Dataframe) – The full training dataset.
spline_collection (dict) – A dictionary containing the optimal number of splines for each feature interpolated of each utility
mean_splines (bool, optional (default = False)) – Must be True if the splines are computed at the mean distribution of data for stairs.
x_knots_dict (dict, optional (default = None)) – A dictionary in the form of {utility: {attribute: x_knots}} where x_knots are the spline knots for the corresponding utility and attributes
linear_extrapolation (bool, optional (default = False)) – If True, the splines are linearly extrapolated.
save_fig (bool, optional (default = False)) – If True, save the plot as a png file.
lpmc_tt_cost (bool, optional (default = False)) – If True, plot only the LPMC travel time and cost on the same figure.
sm_tt_cost (bool, optional (default = False)) – If True, plot only the swissmetro travel time and cost on the same figure.
save_file (str, optional (default='')) – The name to save the figure with.

rumboost.utility_plotting.plot_util(model, data_train, points=10000)[source]

Plot the raw utility functions of all features. This is done directly from the predict attribute of lightgbm.Boosters.

Parameters:

model (RUMBoost) – A RUMBoost object.
data_train (pandas Dataframe) – The full training dataset.
points (int, optional (default = 10000)) – The number of points used to draw the line plot.

rumboost.utility_plotting.weights_to_plot_v2(model, market_segm=False, num_iteration=None)[source]

Arrange weights by ascending splitting points and cumulative sum of weights.

Parameters:

model (RUMBoost) – A trained RUMBoost object.
market_segm (bool, optional (default = False)) – If True, the weights are arranged for market segmentation.
num_iteration (int, optional (default = None)) – The number of iterations to consider in the model.

Returns:

weights_for_plot – Dictionary containing splitting points and corresponding cumulative weights value for all features.

Return type:

dict
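
The arrangement described above (ascending split points paired with cumulative weights) can be sketched with NumPy. The input arrays and the helper name cumulative_weights are hypothetical; they do not reflect the internal format used by RUMBoost.

```python
import numpy as np

def cumulative_weights(split_points, leaf_weights):
    """Sort split points in ascending order and accumulate the matching weights."""
    split_points = np.asarray(split_points, dtype=float)
    leaf_weights = np.asarray(leaf_weights, dtype=float)
    order = np.argsort(split_points, kind="stable")
    return split_points[order], np.cumsum(leaf_weights[order])

# Hypothetical split points and weights for a single feature.
points, cum = cumulative_weights([30.0, 10.0, 20.0], [0.5, -0.2, 0.1])
```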

rumboost.utility_smoothing module

class rumboost.utility_smoothing.LinearExtrapolatorWrapper(pchip)[source]

Bases: object

A wrapper class that adds linear extrapolation to a PchipInterpolator object.

pchip_linear_extrapolator(x)[source]
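
As an illustration of the idea, the sketch below wraps a scipy PchipInterpolator so that values outside the knot range follow the tangent at the nearest end knot. The class name LinearExtrapolation is hypothetical, and the actual implementation in rumboost may differ.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

class LinearExtrapolation:
    """Hypothetical stand-in: PCHIP inside the knot range, straight lines outside."""

    def __init__(self, pchip):
        self.pchip = pchip
        self.x_min, self.x_max = pchip.x[0], pchip.x[-1]
        deriv = pchip.derivative()
        self.y_min, self.y_max = pchip(self.x_min), pchip(self.x_max)
        self.slope_min, self.slope_max = deriv(self.x_min), deriv(self.x_max)

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        # Evaluate the PCHIP on clipped inputs, then overwrite the outside points.
        inside = self.pchip(np.clip(x, self.x_min, self.x_max))
        left = self.y_min + self.slope_min * (x - self.x_min)
        right = self.y_max + self.slope_max * (x - self.x_max)
        return np.where(x < self.x_min, left, np.where(x > self.x_max, right, inside))

pchip = PchipInterpolator([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
wrapped = LinearExtrapolation(pchip)
values = wrapped([-1.0, 1.0, 3.0])
```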

rumboost.utility_smoothing.edge_mutant_stable(x_spline, y_data, n_knot, edge_fraction=0.25, middle_quantile=(0.05, 0.95), jump_data_limitation=95, jump_weight=1.0)[source]

A function that initialises spline knots for classification, accounting for the stability of edge and mutant data.

Parameters:

x_data (numpy array) – Data from the interpolated feature.
y_data (numpy array) – V(x_value), the values of the utility at x.
x_spline (numpy array) – A vector of x values used to plot the splines.
n_knot (int) – Total number of knots to place.
edge_fraction (float) – Fraction of the number of knots reserved for each edge. These knots are evenly distributed at each edge to improve the stability of edge interpolation.
middle_quantile (tuple(float, float)) – Quantile range of the cumulative distribution function (CDF) over which the middle knots are distributed. It aims to improve the stability of the middle knots and reduce the risk of extrapolation.
jump_data_limitation (int or None) – Percentile limit applied to the mutant data to keep robustness. It aims to avoid extreme data jumps and to keep knots from accumulating at the jump data points.
jump_weight (float) – Decides how the middle knots are distributed. Data with big changes get a larger weight and flat data get a smaller weight, so that knots are placed where they help obtain a smoother spline. It is combined with jump_data_limitation.

Returns:

x_knots – Monotonically increasing knot positions.

rumboost.utility_smoothing.find_best_spline(x_spline, y_data, weights, num_splines, monotonic=0, linear_extrapolation=False, x0_method=None, optimise_knot_position=True, x_data_val=None, y_data_val=None, fix_first=False, fix_last=False, deg_freedom=None, n_iter=1, max_iter=200, method='SLSQP', criterion='BIC', edge_fraction=0.25, middle_quantile=(0.05, 0.95), jump_data_limitation=95, jump_weight=1.0)[source]

A function that applies monotonic spline interpolation on a given feature.

Parameters:

x_data (numpy array) – Data from the interpolated feature.
y_data (numpy array) – V(x_value), the values of the utility at x.
num_splines (tuple(int, int)) – The number of splines to use for interpolation.
monotonic (int, optional (default=0)) – If 0, the spline is not monotonic. If 1, the spline is increasing. If -1, the spline is decreasing.
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
x0_method (str, optional (default=None)) – The method to use for the initial knots. Can be ‘quantile’, ‘linearly_spaced’, ‘random’ and ‘optimised’.
optimise_knot_position (bool, optional (default=True)) – If True, the knots position is optimised with scipy.minimize.
x_data_val (numpy array, optional (default=None)) – Data from the interpolated feature for validation.
y_data_val (numpy array, optional (default=None)) – V(x_value), the values of the utility at x for validation.
fix_first (bool, optional (default=False)) – If True, the first knot is fixed at the minimum value of the feature.
fix_last (bool, optional (default=False)) – If True, the last knot is fixed at the maximum value of the feature.
deg_freedom (int, optional (default=None)) – The degrees of freedom for the smoothing splines.
n_iter (int, optional (default=1)) – The number of iterations for the optimization.
max_iter (int, optional (default=200)) – The maximum number of iterations for the optimization.
method (str, optional (default='SLSQP')) – The optimization method to use.
criterion (str, optional (default='BIC')) – The criterion to use for model selection.

Returns:

best_spline – The best spline object.

Return type:

scipy.interpolate.PchipInterpolator or scipy.interpolate.CubicSpline
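
A minimal sketch of selecting the number of knots with an information criterion is given below. It assumes knots placed at quantiles and a Gaussian-error BIC, n * log(MSE) + k * log(n); both are assumptions made for illustration, and the helper bic_for_knots is hypothetical rather than part of rumboost.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def bic_for_knots(x, y, n_knots):
    """Fit a PCHIP through knots placed at quantiles of x and return its BIC."""
    x_knots = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_knots)))
    y_knots = np.interp(x_knots, x, y)  # utility values read at the knots
    spline = PchipInterpolator(x_knots, y_knots)
    mse = np.mean((spline(x) - y) ** 2)
    n = len(x)
    # Assumed form of the criterion: n * log(MSE) + k * log(n), k = number of knots.
    return n * np.log(mse + 1e-12) + len(x_knots) * np.log(n)

x = np.linspace(0.0, 10.0, 200)   # sorted feature values
y = -0.3 * np.floor(x)            # staircase-like utility
candidates = range(3, 11)         # analogous to num_splines = (3, 10)
scores = {k: bic_for_knots(x, y, k) for k in candidates}
best_k = min(scores, key=scores.get)
```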

rumboost.utility_smoothing.independent_smoothing(weights, dataset_train, spline_utilities, num_spline_range, x0_method='quantile', linear_extrapolation=False, monotonic_structure=None, X_val=None, optimise_knot_position=True, fix_first=False, fix_last=False, deg_freedom=None, n_iter=1, max_iter=200, method='SLSQP', criterion='BIC', edge_fraction=0.25, middle_quantile=(0.05, 0.95), jump_data_limitation=95, jump_weight=1.0)[source]

A function that creates a new utility collection with independent smoothing.

Parameters:

weights (dict) – A dictionary containing all leaf values for all utilities and all features.
dataset_train (pandas DataFrame) – The pandas DataFrame used for training.
spline_utilities (dict) – A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.
num_spline_range (dict) – A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is a tuple with the minimum and maximum number of splines used for interpolation. There should be a key for every feature where splines are used.
x0_method (str, optional (default='quantile')) – The method to use for the initial knots. Can be ‘quantile’, ‘linearly_spaced’, ‘random’ and ‘optimised’. If optimised, the knots are optimised with the MSE on the curve.
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
monotonic_structure (dict[dict[int]], optional (default=None)) – A dictionary, in the same format as weights, of the feature names of each utility. The first key contains the utility index. The second key contains the feature name. The value is an int representing the monotonic nature of that feature: -1 if the feature is decreasing, 1 if it is increasing, and 0 if it is not monotonic.
X_val (pandas DataFrame, optional (default=None)) – The pandas DataFrame used for validation. If None, the validation is done on the training set.
optimise_knot_position (bool, optional (default=True)) – If True, the knots position is optimised with scipy.minimize.
fix_first (bool, optional (default=False)) – If True, the first knot is fixed at the minimum value of the feature.
fix_last (bool, optional (default=False)) – If True, the last knot is fixed at the maximum value of the feature.
deg_freedom (int, optional (default=None)) – The degrees of freedom for the smoothing splines.
n_iter (int, optional (default=1)) – The number of iterations for the optimization.
max_iter (int, optional (default=200)) – The maximum number of iterations for the optimization.
method (str, optional (default='SLSQP')) – The optimization method to use.
criterion (str, optional (default='BIC')) – The criterion to use for model selection.

Returns:

new_util_collection – A dictionary containing the new utility collection with independent smoothing.

Return type:

dict
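
The dictionary arguments described above could look as follows. The feature names are hypothetical, and writing the utility indices as strings is an assumption. The small helper check_structures is not part of rumboost; it only verifies that every spline feature has a matching entry in the other two dictionaries.

```python
# Hypothetical feature names; utility indices are written as strings by assumption.
spline_utilities = {"0": ["travel_time", "cost"], "1": ["travel_time"]}
num_spline_range = {
    "0": {"travel_time": (3, 10), "cost": (3, 8)},  # (min, max) number of splines
    "1": {"travel_time": (3, 10)},
}
monotonic_structure = {
    "0": {"travel_time": -1, "cost": -1},           # -1: decreasing utility
    "1": {"travel_time": -1},
}

def check_structures(spline_utilities, num_spline_range, monotonic_structure):
    """Return the (utility, feature, dict name) triples that lack an entry."""
    missing = []
    for utility, features in spline_utilities.items():
        for feature in features:
            if feature not in num_spline_range.get(utility, {}):
                missing.append((utility, feature, "num_spline_range"))
            if feature not in monotonic_structure.get(utility, {}):
                missing.append((utility, feature, "monotonic_structure"))
    return missing

problems = check_structures(spline_utilities, num_spline_range, monotonic_structure)
```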

rumboost.utility_smoothing.independent_utility_collection(weights, data, num_splines_feat, spline_utilities, x0_method=None, linear_extrapolation=False, monotonic_structure=None, X_val=None, optimise_knot_position=True, fix_first=False, fix_last=False, deg_freedom=None, n_iter=1, max_iter=200, method='SLSQP', criterion='BIC', edge_fraction=0.25, middle_quantile=(0.05, 0.95), jump_data_limitation=95, jump_weight=1.0)[source]

Create a dictionary that stores what type of utility (smoothed or not) should be used for smooth_predict.

Parameters:

weights (dict) – A dictionary containing all leaf values for all utilities and all features.
data (pandas DataFrame) – The pandas DataFrame used for training.
num_splines_feat (dict) – A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for every feature where splines are used.
spline_utilities (dict) – A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.
x0_method (str, optional (default=None)) – The method to use for the initial knots. Can be ‘quantile’, ‘linearly_spaced’, ‘random’ and ‘optimised’.
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
monotonic_structure (dict[dict[int]], optional (default=None)) – A dictionary, in the same format as weights, of the feature names of each utility. The first key contains the utility index. The second key contains the feature name. The value is an int representing the monotonic nature of that feature: -1 if the feature is decreasing, 1 if it is increasing, and 0 if it is not monotonic.
X_val (pandas DataFrame, optional (default=None)) – The pandas DataFrame used for validation. If None, the validation is done on the training set.
optimise_knot_position (bool, optional (default=True)) – If True, the knots position is optimised with scipy.minimize.
fix_first (bool, optional (default=False)) – If True, the first knot is fixed at the minimum value of the feature.
fix_last (bool, optional (default=False)) – If True, the last knot is fixed at the maximum value of the feature.
deg_freedom (int, optional (default=None)) – The degrees of freedom for the smoothing splines.
n_iter (int, optional (default=1)) – The number of iterations for the optimization.
max_iter (int, optional (default=200)) – The maximum number of iterations for the optimization.
method (str, optional (default='SLSQP')) – The optimization method to use.
criterion (str, optional (default='BIC')) – The criterion to use for model selection.

Returns:

util_collection – A dictionary containing the type of utility to use for all features in all utilities.

Return type:

dict

rumboost.utility_smoothing.mean_monotone_spline(x_data, x_mean, y_data, y_mean, num_splines=15)[source]

A function that applies monotonic spline interpolation on a given feature. Unlike monotone_spline, the knots are placed on the closest stairs mean.

Parameters:

x_data (numpy array) – Data from the interpolated feature.
x_mean (numpy array) – The x coordinate of the vector of mean points at each stairs
y_data (numpy array) – V(x_value), the values of the utility at x.
y_mean (numpy array) – The y coordinate of the vector of mean points at each stairs

Returns:

x_spline (numpy array) – A vector of x values used to plot the splines.
y_spline (numpy array) – A vector of the spline values at x_spline.
pchip (scipy.interpolate.PchipInterpolator) – The scipy interpolator object from the monotonic splines.

rumboost.utility_smoothing.monotone_spline(x_spline, weights, num_splines=5, x_knots=None, y_knots=None, linear_extrapolation=False, monotonic=0)[source]

A function that applies monotonic spline interpolation on a given feature.

Parameters:

x_spline (numpy array) – Data from the interpolated feature.
weights (dict) – The dictionary corresponding to the feature leaf values.
num_splines (int, optional (default=5)) – The number of splines used for interpolation.
x_knots (numpy array, optional (default=None)) – The positions of knots. If None, linearly spaced.
y_knots (numpy array, optional (default=None)) – The value of the utility at knots. Need to be specified if x_knots is passed.
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
monotonic (int, optional (default=0)) – The monotonic nature of the feature. If -1, the feature is decreasing, if 1, the feature is increasing, if 0, the feature is not monotonic.

Returns:

x_spline (numpy array) – A vector of x values used to plot the splines.
y_spline (numpy array) – A vector of the spline values at x_spline.
pchip (scipy.interpolate.PchipInterpolator) – The scipy interpolator object from the monotonic splines.
x_knots (numpy array) – The positions of knots. If None, linearly spaced.
y_knots (numpy array) – The value of the utility at knots.
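
To illustrate the kind of object returned here, the following sketch builds a scipy PchipInterpolator through five linearly spaced knots with hypothetical, decreasing utility values. PCHIP preserves the monotonicity of the knot values, which is why it suits monotone utilities. This uses scipy directly and does not call monotone_spline.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

x_knots = np.linspace(0.0, 60.0, 5)                 # 5 linearly spaced knots
y_knots = np.array([0.0, -0.4, -0.9, -1.0, -1.6])   # hypothetical decreasing utility
pchip = PchipInterpolator(x_knots, y_knots)

x_spline = np.linspace(0.0, 60.0, 301)              # points used to draw the spline
y_spline = pchip(x_spline)
is_decreasing = bool(np.all(np.diff(y_spline) <= 1e-12))
```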

rumboost.utility_smoothing.optimal_knots_position(weights, dataset_train, dataset_test, labels_test, spline_utilities, num_spline_range, monotonic_structure, optimisation_problem='local', max_iter=100, optimise=True, deg_freedom=None, n_iter=1, fix_first=False, fix_last=False, task='multiclass', x0='quantile', criterion='BIC', folds=None, linear_extrapolation=False, method='SLSQP', mu=None, nests=None, alphas=None, thresholds=None, edge_fraction=0.25, middle_quantile=(0.05, 0.95), jump_data_limitation=95, jump_weight=1.0)[source]

Find the optimal position of knots for a given number of knots for given attributes.

Parameters:

weights (dict) – A dictionary containing all leaf values for all utilities and all features.
dataset_train (pandas DataFrame) – The pandas DataFrame used for training.
dataset_test (pandas DataFrame) – The pandas DataFrame used for testing.
labels_test (pandas Series or numpy array) – The labels of the dataset used for testing.
spline_utilities (dict) – A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.
num_spline_range (dict) – A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for every feature where splines are used.
monotonic_structure (dict[dict[int]]) – A dictionary, in the same format as weights, of the feature names of each utility. The first key contains the utility index. The second key contains the feature name. The value is an int representing the monotonic nature of that feature: -1 if the feature is decreasing, 1 if it is increasing, and 0 if it is not monotonic.
optimisation_problem (str, optional (default='local')) – The optimisation problem to solve. Can be ‘local’ or ‘global’. If ‘local’, the optimisation is performed independently for each feature, with objective to minimise the mean squared error between the smoothed and non-smoothed curves. If ‘global’, the optimisation is performed jointly for all features, with objective to minimise the cross entropy loss of the smoothed predictions.
max_iter (int, optional (default=100)) – The maximum number of iterations from the solver
optimise (bool, optional (default=True)) – If True, optimise the knots position with scipy.minimize
deg_freedom (int, optional (default=None)) – The degree of freedom. If not specified, it is the number of knots to optimise.
n_iter (int, optional (default=1)) – The number of iterations, to leverage the randomness induced by the local minimizer.
fix_first (bool, optional (default=False)) – If True, the first knot is fixed at the minimum value of the feature.
fix_last (bool, optional (default=False)) – If True, the last knot is fixed at the maximum value of the feature.
task (str, optional (default='multiclass')) – The task to perform. Can be ‘multiclass’, ‘binary’ or ‘regression’.
x0 (str, optional (default='quantile')) – The initialisation of the knots. Can be ‘quantile’, ‘quantile_random’, ‘linearly_spaced’, ‘optimised’ and ‘random’.
criterion (str, optional (default='BIC')) – The criterion to use for the optimisation. Can be ‘BIC’, ‘AIC’ or ‘VAL’. If ‘BIC’, the Bayesian Information Criterion is used. If ‘AIC’, the Akaike Information Criterion is used. If ‘VAL’, the Validation loss is used.
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
method (str, optional (default='SLSQP')) – The method to use for the optimization. Can be any scipy optimization method.
mu (float, optional (default=None)) – The mean parameter for the utility functions.
nests (list, optional (default=None)) – The nested structure for the utility functions.
alphas (list, optional (default=None)) – The alpha parameters for the utility functions.
thresholds (list, optional (default=None)) – The thresholds for the utility functions.

Returns:

x_opt – The result of scipy.minimize.

Return type:

OptimizeResult
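
The ‘local’ problem described above, minimising the mean squared error between the staircase utility and its smoothed version, can be sketched with scipy.optimize.minimize and the SLSQP method. The data, the fixed end knots and the helper mse_for_knots are assumptions made for illustration; the sketch does not reproduce the objective used inside rumboost.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from scipy.optimize import minimize

x = np.linspace(0.0, 10.0, 200)
y = -0.3 * np.floor(x)                     # staircase utility to smooth

def mse_for_knots(inner_knots):
    """MSE between the staircase and a PCHIP whose first and last knots are fixed."""
    knots = np.concatenate(([x[0]], np.sort(inner_knots), [x[-1]]))
    if np.any(np.diff(knots) <= 1e-6):     # PCHIP needs strictly increasing knots
        return 1e6
    spline = PchipInterpolator(knots, np.interp(knots, x, y))
    return float(np.mean((spline(x) - y) ** 2))

x0 = np.quantile(x, [0.25, 0.5, 0.75])     # quantile-style initial knots
res = minimize(mse_for_knots, x0, method="SLSQP",
               bounds=[(x[0], x[-1])] * len(x0), options={"maxiter": 100})
x_opt = np.sort(res.x)
```

Here res is a scipy OptimizeResult, so res.x holds the knot positions and res.fun the final value of the objective.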

rumboost.utility_smoothing.optimise_single_spline(x_knots, weights, monotonic, linear_extrapolation, x_spline, y_data)[source]

A function that applies monotonic spline interpolation on a given feature.

Parameters:

x_knots (numpy array) – Knots for the spline.
weights (numpy array) – Weights learnt by RUMBoost.
monotonic (int) – If 0, the spline is not monotonic. If 1, the spline is increasing. If -1, the spline is decreasing.
linear_extrapolation (bool) – If True, the splines are linearly extrapolated.
x_spline (numpy array) – Data from the interpolated feature.
y_data (numpy array) – V(x_value), the values of the utility at x.

rumboost.utility_smoothing.optimise_splines(x_knots, weights, data_train, data_test, labels_test, spline_utilities, num_spline_range, deg_freedom=None, task='multiclass', criterion='BIC', linear_extrapolation=False, monotonic_structure=None, with_collection=False, mu=None, nests=None, alphas=None, thresholds=None)[source]

Function wrapper to find the optimal position of knots for each feature. The optimal position is the one that minimises the CE loss.

Parameters:

x_knots (1d np.array) – The positions of knots in a 1d array, following this structure: np.array([x_att1_1, x_att1_2, … x_att1_m, x_att2_1, … x_attn_m]) where m is the number of knots and n the number of attributes that are interpolated with splines.
weights (dict) – A dictionary containing all leaf values for all utilities and all features.
data_train (pandas DataFrame) – The pandas DataFrame used for training.
data_test (pandas DataFrame) – The pandas DataFrame used for testing.
labels_test (pandas Series or numpy array) – The labels of the dataset used for testing.
spline_utilities (dict) – A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.
num_spline_range (dict) – A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for every feature where splines are used.
deg_freedom (int, optional (default=None)) – The degree of freedom. If not specified, it is the number of knots to optimize.
task (str, optional (default='multiclass')) – The task to perform. Can be ‘multiclass’, ‘binary’ or ‘regression’.
criterion (str, optional (default='BIC')) – The criterion to use for the optimisation. Can be ‘BIC’, ‘AIC’ or ‘VAL’.
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
monotonic_structure (dict[dict[int]], optional (default=None)) – A dictionary, in the same format as weights, of the feature names of each utility. The first key contains the utility index. The second key contains the feature name. The value is an int representing the monotonic nature of that feature: -1 if the feature is decreasing, 1 if it is increasing, and 0 if it is not monotonic.
with_collection (bool, optional (default=False)) – If True, return the utility collection.
mu (float, optional (default=None)) – The mean parameter for the utility functions.
nests (list, optional (default=None)) – The nested structure for the utility functions.
alphas (list, optional (default=None)) – The alpha parameters for the utility functions.
thresholds (list, optional (default=None)) – The thresholds for the utility functions.

Returns:

loss – The final cross entropy or BIC on the test set.

Return type:

float

rumboost.utility_smoothing.smooth_predict(data_test, util_collection, utilities=False, mu=None, nests=None, alphas=None, thresholds=None)[source]

A prediction function that uses monotonic spline interpolation on some features to predict their utilities. The function should be used with a trained model only.

Parameters:

data_test (pandas DataFrame) – A pandas DataFrame containing the observations that will be predicted.
util_collection (dict) – A dictionary containing the type of utility to use for all features in all utilities.
utilities (bool, optional (default = False)) – if True, return the raw utilities.
mu (ndarray, optional (default=None)) – An array of mu values, the scaling parameters, for each nest. The first value of the array correspond to nest 0, and so on.
nests (dict, optional (default=None)) – A dictionary representing the nesting structure. Keys are nests, and values are the lists of alternatives in each nest. For example, {0: [0, 1], 1: [2, 3]} means that alternatives 0 and 1 are in nest 0, and alternatives 2 and 3 are in nest 1.
alphas (ndarray, optional (default=None)) – An array of J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j to nest n. For example, alpha_12 = 0.5 means that alternative 1 belongs 50% to nest 2.
thresholds (ndarray, optional (default=None)) – An array of thresholds for ordinal regression.

Returns:

preds – A numpy array containing the predictions for each class for each observation. Predictions are computed through the softmax function, unless the raw utilities are requested. A prediction for class j for observation n will be U[n, j].

Return type:

numpy array
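
The softmax step mentioned above can be sketched as follows for a matrix of utilities U[n, j]. This covers only the plain multinomial case, without nests, alphas or thresholds, and the softmax helper is an illustration rather than the code used by smooth_predict.

```python
import numpy as np

def softmax(utilities):
    """Row-wise softmax: utilities[n, j] -> probability of class j for observation n."""
    shifted = utilities - utilities.max(axis=1, keepdims=True)  # numerical stability
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum(axis=1, keepdims=True)

U = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])  # hypothetical raw utilities
preds = softmax(U)
```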

rumboost.utility_smoothing.updated_utility_collection(weights, data, num_splines_feat, spline_utilities, mean_splines=False, x_knots=None, linear_extrapolation=False, monotonic_structure=None)[source]

Create a dictionary that stores what type of utility (smoothed or not) should be used for smooth_predict.

Parameters:

weights (dict) – A dictionary containing all leaf values for all utilities and all features.
data (pandas DataFrame) – The pandas DataFrame used for training.
num_splines_feat (dict) – A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for every feature where splines are used.
spline_utilities (dict) – A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.
mean_splines (bool, optional (default = False)) – If True, the splines are computed at the mean distribution of data for stairs.
x_knots (dict) – A dictionary in the form of {utility: {attribute: x_knots}} where x_knots are the spline knots for the corresponding utility and attributes
linear_extrapolation (bool, optional (default=False)) – If True, the splines are linearly extrapolated.
monotonic_structure (dict[dict[int]], optional (default=None)) – A dictionary, in the same format as weights, of the feature names of each utility. The first key contains the utility index. The second key contains the feature name. The value is an int representing the monotonic nature of that feature: -1 if the feature is decreasing, 1 if it is increasing, and 0 if it is not monotonic.

Returns:

util_collection – A dictionary containing the type of utility to use for all features in all utilities.

Return type:

dict

rumboost.utils module

rumboost.utils.bio_to_rumboost(model, all_columns=False, monotonic_constraints=True, interaction_contraints=True, fct_effect_variables=[])[source]

Converts a biogeme model to a rumboost dict.

Parameters:

model (a BIOGEME object) – The model used to create the rumboost structure dictionary.
all_columns (bool, optional (default = False)) – If True, do not consider alternative-specific features.
monotonic_constraints (bool, optional (default = True)) – If False, do not consider monotonic constraints.
interaction_contraints (bool, optional (default = True)) – If False, do not consider feature interactions constraints.
fct_effect_variables (list, optional (default = [])) – The list of variables in the functional effect part of the model

Returns:

rum_structure – A dictionary specifying the structure of a RUMBoost object.

Return type:

dict

rumboost.utils.data_leaf_value(data, weights_feature, technique='data_weighted')[source]

Computes the utility values of given data, according to the prespecified technique.

Parameters:

data (pandas.Series) – The column of the dataframe associated with the feature.
weights_feature (dict) – The dictionary corresponding to the feature leaf values.
technique (str, optional (default = 'data_weighted')) – The technique used to compute data values. It can be:

data_weighted: feature data and its utility values.
mid_point: the mid point in between all splitting points.
mean_data: the mean of data in between all splitting points.
mid_point_weighted: the mid points in between all splitting points, weighted by the number of data points in the interval.
mean_data_weighted: the mean of data in between all splitting points, weighted by the number of data points in the interval.

Returns:

data_ordered (numpy array) – X coordinates of the data, or feature data point values.
data_values (numpy array) – Y coordinates of the data, or utility values
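
As a rough illustration of pairing feature data with its utility values, the sketch below looks up a piecewise-constant utility with np.searchsorted. The split points, leaf values and their layout are hypothetical; the actual structure of weights_feature is not shown in this documentation.

```python
import numpy as np

split_points = np.array([10.0, 20.0])        # hypothetical split points
leaf_values = np.array([0.0, -0.5, -1.2])    # one utility value per interval
data = np.array([25.0, 5.0, 15.0])

data_ordered = np.sort(data)
# Index of the interval that each observation falls into.
interval = np.searchsorted(split_points, data_ordered, side="right")
data_values = leaf_values[interval]
```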

rumboost.utils.get_mean_pos(data, split_points)[source]

Return the mean point in-between two split points for a specific feature (used in smoothing). At end points, it is the mean of data before the first split point, and after the last split point.

Parameters:

data (pandas.Series) – The column of the dataframe associated with the feature.
split_points (list) – The list of split points for that feature.

Returns:

mean_data – A list of points in the mean of every consecutive split points.

Return type:

list
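
One possible reading of this computation is sketched below. The helper mean_between_splits is hypothetical, and the treatment of points lying exactly on a split (assigned to the interval on their left) is an assumption.

```python
import numpy as np
import pandas as pd

def mean_between_splits(data, split_points):
    """Mean of the data in each interval delimited by consecutive split points."""
    edges = [-np.inf, *sorted(split_points), np.inf]
    means = []
    for low, high in zip(edges[:-1], edges[1:]):
        chunk = data[(data > low) & (data <= high)]
        if len(chunk) > 0:                 # skip intervals without observations
            means.append(float(chunk.mean()))
    return means

data = pd.Series([1.0, 2.0, 3.0, 6.0, 8.0, 12.0])
mean_data = mean_between_splits(data, [2.5, 7.0])
```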

rumboost.utils.get_mid_pos(data, split_points, end='data')[source]

Return the mid point in-between two split points for a specific feature (used in pw linear predict).

Parameters:

data (pandas Series) – The column of the dataframe associated with the feature.
split_points (list) – The list of split points for that feature.
end (str) – How to compute the mid position of the first and last points. It can be:

‘data’: add the min and max values of the data.
‘split point’: add the first and last split points.
‘mean_data’: add the mean of the data before the first split point and after the last split point.

Returns:

mid_pos – A list of points in the middle of every consecutive split points.

Return type:

list
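
A minimal sketch of the ‘data’ option, under the assumption that the data minimum and maximum are added as outer boundaries before taking mid points. The helper mid_positions is hypothetical, and the library may handle the end points differently.

```python
import numpy as np
import pandas as pd

def mid_positions(data, split_points):
    """Mid points between consecutive boundaries, with data min and max as ends."""
    edges = np.array([data.min(), *sorted(split_points), data.max()], dtype=float)
    return ((edges[:-1] + edges[1:]) / 2.0).tolist()

data = pd.Series([0.0, 4.0, 10.0])
mid_pos = mid_positions(data, [2.0, 6.0])
```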

rumboost.utils.get_pair(parent)[source]: Return beta and variable names as a tuple from a parent expression.

rumboost.utils.map_x_knots(x_knots, num_splines_range, x_first=None, x_last=None)[source]

Map the 1d array of x_knots into a dictionary with utility and attributes as keys.

Parameters:

x_knots (1d np.array) – The positions of knots in a 1d array, following this structure: np.array([x_att1_1, x_att1_2, … x_att1_m, x_att2_1, … x_attn_m]) where m is the number of knots and n the number of attributes that are interpolated with splines.
num_splines_range (dict) – A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for every feature where splines are used.
x_first (list, optional (default=None)) – A list of all first knots in the order of the attributes from spline_utilities and num_splines_range.
x_last (list, optional (default=None)) – A list of all last knots in the order of the attributes from spline_utilities and num_splines_range.

Returns:

x_knots_dict – A dictionary in the form of {utility: {attribute: x_knots}} where x_knots are the spline knots for the corresponding utility and attributes

Return type:

dict
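
The mapping from a flat array to the nested dictionary can be sketched as follows. The helper map_knots and the counts dictionary are hypothetical. The sketch assumes one knot count per feature and relies on dictionary order, and it ignores the x_first and x_last options.

```python
import numpy as np

def map_knots(x_knots, knots_per_feature):
    """Split a flat knot array into {utility: {attribute: knots}} in dict order."""
    mapped, start = {}, 0
    for utility, features in knots_per_feature.items():
        mapped[utility] = {}
        for feature, n_knots in features.items():
            mapped[utility][feature] = x_knots[start:start + n_knots]
            start += n_knots
    return mapped

flat = np.array([0.0, 5.0, 10.0, 1.0, 2.0])
counts = {"0": {"travel_time": 3}, "1": {"cost": 2}}  # hypothetical names and counts
x_knots_dict = map_knots(flat, counts)
```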

rumboost.utils.optimise_asc(asc, raw_preds, labels)[source]

Optimise the ASC parameters of the model.

Parameters:

asc (np.array) – The array of ASC parameters.
raw_preds (np.array) – The raw predictions of the model.
labels (np.array) – The labels of the dataset.

Returns:

asc (np.array) – The optimised ASC parameters.
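
One way to picture this optimisation is to add one constant per alternative to the raw utilities and minimise the cross-entropy with scipy. The loss function, the BFGS solver and the toy data below are assumptions for illustration and are not necessarily what optimise_asc does internally.

```python
import numpy as np
from scipy.optimize import minimize

def cross_entropy(asc, raw_preds, labels):
    """Mean cross-entropy after adding one constant per alternative."""
    utilities = raw_preds + asc                      # broadcast over observations
    utilities = utilities - utilities.max(axis=1, keepdims=True)
    log_probs = utilities - np.log(np.exp(utilities).sum(axis=1, keepdims=True))
    return -float(np.mean(log_probs[np.arange(len(labels)), labels]))

raw_preds = np.zeros((6, 3))                         # uninformative raw utilities
labels = np.array([0, 0, 0, 1, 1, 2])
asc0 = np.zeros(3)
res = minimize(cross_entropy, asc0, args=(raw_preds, labels), method="BFGS")
asc = res.x
```

With uninformative utilities, the fitted constants simply reproduce the observed class shares, so the most frequent class receives the largest constant.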

rumboost.utils.process_parent(parent, pairs)[source]: Dig into the biogeme expression to retrieve the name of the variable and the beta parameter. Works only with simple utility specifications (beta * variable).

rumboost.utils.sort_dict(dict_to_sort)[source]

Sort a dictionary by its keys.

Parameters:: dict_to_sort (dict) – A dictionary to sort.
Returns:: dict_sorted – The sorted dictionary.
Return type:: dict
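
Sorting a dictionary by its keys can be done in plain Python as shown below. The helper sort_by_keys is a hypothetical equivalent, not the library function itself.

```python
def sort_by_keys(dict_to_sort):
    """Return a new dict whose items are ordered by key."""
    return dict(sorted(dict_to_sort.items()))

dict_sorted = sort_by_keys({"b": 2, "a": 1, "c": 3})
```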

Module contents

class rumboost.RUMBoost(model_file=None, **kwargs)[source]

Bases: object

RUMBoost for doing Random Utility Modelling in LightGBM.

Auxiliary data structure to implement boosters of rum_train() function for multiclass classification. This class has the same methods as Booster class. All method calls, except for the following methods, are actually performed for underlying Boosters.

model_from_string()
model_to_string()
save_model()

boosters

The list of fitted models.

Type:: list of Booster

valid_sets

Validation sets of the RUMBoost. By default None, to avoid computing cross entropy if there are no validation sets.

Type:: None

f_obj(preds, data)[source]

f_obj_binary(preds, data)[source]

f_obj_coral(preds, data)[source]

f_obj_cross_nested(preds, data)[source]

f_obj_full_hessian(_, __)[source]

Objective function of the boosters, for the full hessian.

Returns:

grad (numpy array) – The gradient with the cross-entropy loss function.
hess (numpy array) – The hessian with the cross-entropy loss function.

f_obj_mse(preds, data)[source]

f_obj_nest(preds, data)[source]

f_obj_proportional_odds(preds, data)[source]

model_from_string(model_str: str)[source]

Load RUMBoost from a string.

Parameters:: model_str (str) – Model will be loaded from this string.
Returns:: self – Loaded RUMBoost object.
Return type:: RUMBoost

model_to_string(num_iteration: int | None = None, start_iteration: int = 0, importance_type: str = 'split') → str[source]

Save RUMBoost to JSON string.

Parameters:

num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.
importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns:

str_repr – JSON string representation of RUMBoost.

Return type:

str

multiply_grad_hess_by_data()[source]: Decorator to multiply the gradient and hessian by the number of observations for the jth booster. This is used to scale the gradient and hessian when boosting from the parameter space, according to the chain rule.

predict(data, start_iteration: int = 0, num_iteration: int = -1, raw_score: bool = True, pred_leaf: bool = False, pred_contrib: bool = False, data_has_header: bool = False, validate_features: bool = False, utilities: bool = False)[source]

Predict logic.

Parameters:

data (str, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable's Frame or scipy.sparse) – Data source for prediction. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).
start_iteration (int, optional (default=0)) – Start index of the iteration to predict.
num_iteration (int, optional (default=-1)) – Iteration used for prediction.
raw_score (bool, optional (default=True)) – Whether to predict raw scores.
pred_leaf (bool, optional (default=False)) – Whether to predict leaf index.
pred_contrib (bool, optional (default=False)) – Whether to predict feature contributions.
data_has_header (bool, optional (default=False)) – Whether data has header. Used only for txt data.
validate_features (bool, optional (default=False)) – If True, ensure that the features used to predict match the ones used to train. Used only if data is pandas DataFrame.
utilities (bool, optional (default=False)) – If True, return raw utilities for each class, without generating probabilities.

Returns:

result – Prediction result. Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when pred_contrib=True).

Return type:

numpy array, scipy.sparse or list of scipy.sparse
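
To show how the raw utilities returned with utilities=True relate to probabilities, here is a minimal sketch that assumes a plain multinomial logit (softmax) link. The helper name is hypothetical and is not part of rumboost. Nested, cross-nested and ordinal specifications use different transformations that this sketch does not cover.

```python
import numpy as np


def utilities_to_probabilities(utilities):
    """Hypothetical helper: softmax over the class axis."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = utilities - utilities.max(axis=1, keepdims=True)
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum(axis=1, keepdims=True)


# Made-up raw utilities of shape (n_samples, n_classes).
raw_utilities = np.array([[0.5, -0.2, 0.1], [2.0, 2.0, 0.0]])
probabilities = utilities_to_probabilities(raw_utilities)
```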

save_model(filename: str | Path, num_iteration: int | None = None, start_iteration: int = 0, importance_type: str = 'split') → RUMBoost[source]

Save RUMBoost to a file as JSON text.

Parameters:

filename (str or pathlib.Path) – Filename to save RUMBoost.
num_iteration (int or None, optional (default=None)) – Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.
start_iteration (int, optional (default=0)) – Start index of the iteration that should be saved.
importance_type (str, optional (default="split")) – What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns:

self – Returns self.

Return type:

RUMBoost

Perform the RUM training with given parameters.

Parameters:

train_set (Dataset or dict[int, Any]) –
Data to be trained on. Set free_raw_data=False when creating the dataset. If it is a dictionary, the key-value pairs should be:
- ”train_sets”: the corresponding preprocessed Dataset.
- ”num_data”: the number of observations in the dataset.
- ”labels”: the labels of the full dataset.
- ”labels_j”: the labels of the dataset for each class (binary).
model_specification (dict) –
Dictionary specifying the model specification. The required keys are:
- ’general_params’: dict
  Dictionary containing the general parameters for the RUMBoost model. The dictionary can contain the following keys:
  
  ’num_iterations’: int
  Number of boosting iterations.
  
  ’num_classes’: int
  Number of classes. If equal to 2 and no additional keys are provided, the model will perform binary classification. If greater than 2, the model will perform multiclass classification. If equal to 1, the model will perform regression with MSE (other loss functions will be implemented in the future).
  
  ’subsampling’: float, optional (default = 1.0)
  Subsample ratio of the gradient when boosting.
  
  ’subsampling_freq’: int, optional (default = 0)
  Subsample frequency.
  
  ’subsample_valid’: float, optional (default = 1.0)
  Subsample ratio of validation data.
  
  ’batch_size’: int, optional (default = 0)
  Batch size for the training. The batch size will override the subsampling.
  
  ’early_stopping_rounds’: int, optional (default = None)
  Activates early stopping. The model will train until the validation score stops improving.
  
  ’verbosity’: int, optional (default = 1)
  Verbosity of the model.
  
  ’verbose_interval’: int, optional (default = 10)
  Interval of the verbosity display. Only used if verbosity > 1.
  
  ’max_booster_to_update’: int, optional (default = num_classes)
  Maximum number of boosters to update at each round. It has to be at least equal to the number of classes, and at most equal to the number of classes times the maximum number of boosters in the smallest utility function. This is intended to update each utility function with the same number of trees.
  
  ’boost_from_parameter_space’: list, optional (default = [])
  If True, the boosting will be done in the parameter space, as opposed to the utility space. It means that the GBDT algorithm will output betas instead of piece-wise constant utility values. The resulting utility functions will be piece-wise linear. Monotonicity is not guaranteed in this case and only one variable per parameter ensemble is allowed.
  
  ’optim_interval’: int, optional (default = 20)
  If all the ensembles are boosted from the parameter space, the interval at which the ASCs are optimised. If 0, the ASCs are fixed.
  
  ’save_model_interval’: int, optional (default = 0)
  The interval at which the model will be saved during training.
  
  ’eval_function’: func (default = cross_entropy if multi-class, binary_log_loss if binary, mse if regression)
  The evaluation function to be used.
  
  ’full_hessian’: bool, optional (default = False)
  If True, the full hessian is used to compute the gradients and hessians. Currently only implemented for the multiclass case, and only works with cuda.
- ’rum_structure’: list[dict[str, Any]]
List of dictionaries specifying the variable used to create the parameter ensemble, and their monotonicity or interaction. The list must contain one dictionary for each parameter. Each dictionary has four required keys:

’utility’: list of alternatives in which the parameter ensemble is used. If more than one alternative is specified, the parameter ensemble is shared across alternatives, and the number of variables shared must be equal to the number of alternatives.

’variables’: list of columns from the train_set included in that parameter ensemble. This is the list of variables on which the splits will be done.

’boosting_params’: dict
Dictionary containing the boosting parameters for the parameter ensemble. These parameters are the same as the LightGBM parameters. More information here: https://lightgbm.readthedocs.io/en/latest/Parameters.html.

’shared’: bool
If True, the parameter ensemble is shared across all alternatives. When shared, the number of variables shared must be equal to the number of alternatives. If the same variable is shared across alternatives, it must be repeated in the variables list (for example, variables = [‘var1’, ‘var1’, ‘var1’] and utility = [0, 1, 2]).

And two optional keys:

’endogenous_variable’: str
The name of one variable in the train_set. This is only used if boosted from the parameter space, and the variable is not included in the variables list. The trees output the slope (beta), and the variable named in endogenous_variable is the x in the beta times x term. The variable must be continuous or binary.

’init_leaf_val’: float
Initial leaf value for the ensemble in the parameter space. This will only be used for ensembles boosted from the parameter space.
The other keys are optional and can be:
- ’nested_logit’: dict
  
  Nested logit model specification. The dictionary must contain:
  
  ’mu’: ndarray
  An array of mu values, the scaling parameters, for each nest. The first value of the array corresponds to nest 0, and so on. By default, the value of mu is 1 and is optimised through scipy.minimize. Mu is competing against other parameter ensembles at each round to be selected as the updated parameter ensemble.
  
  ’nests’: dict
  A dictionary representing the nesting structure. Keys are nests, and values are the lists of alternatives in each nest. For example, {0: [0, 1], 1: [2, 3]} means that alternatives 0 and 1 are in nest 0, and alternatives 2 and 3 are in nest 1.
  
  ’optimise_mu’: bool or list[bool], optional (default = True)
  If True, the mu values are optimised through scipy.minimize. If a list of booleans, the length must be equal to the number of nests. For example, [True, False] means that mu_0 is optimised and mu_1 is fixed.
  
  ’optim_interval’: int, optional (default = 20)
  Interval at which the mu values are optimised.
- ’cross_nested_logit’: dict
  
  Cross-nested logit model specification. The dictionary must contain:
  
  ’mu’: ndarray
  An array of mu values, the scaling parameters, for each nest. The first value of the array corresponds to nest 0, and so on.
  
  ’alphas’: ndarray
  An array of shape J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j in nest n. For example, alpha_12 = 0.5 means that alternative 1 belongs 50% to nest 2.
  
  ’optimise_mu’: bool or list[bool], optional (default = True)
  If True, the mu values are optimised through scipy.minimize. If a list of booleans, the length must be equal to the number of nests. For example, [True, False] means that mu_0 is optimised and mu_1 is fixed.
  
  ’optimise_alphas’: bool or ndarray[bool], optional (default = False)
  If True, the alphas are optimised through scipy.minimize. This is not recommended for high-dimensional datasets as it can be computationally expensive. If an array of booleans, the array must have the same shape as alphas. For example, if optimise_alphas_ij = True, alphas_ij will be optimised.
  
  ’optim_interval’: int, optional (default = 20)
  Interval at which the mu and/or alpha values are optimised.
- ’ordinal_logit’: dict
  Ordinal logit model specification. The dictionary must contain:
  
  ’model’: str, default = ‘proportional_odds’
  
  The type of ordinal model. It can be:
  
  ’proportional_odds’: the proportional odds model.
  
  ’coral’: a rank consistent binary decomposition model.
  
  ’optim_interval’: int, optional (default = 20)
  Interval at which the thresholds are optimised. This is only used for the proportional odds and the coral models. If 0, the thresholds are fixed. For ordinal models, the thresholds are optimised from the first iteration.
num_boost_round (int, optional (default = 100)) – Number of boosting iterations.
valid_sets (list of Dataset, dict, or None, optional (default = None)) –
List of data to be evaluated on during training. If the train_set is passed as already preprocessed, it is assumed that valid_sets are also preprocessed. Therefore it should be a dictionary following this structure:
- ”valid_sets”: a list of list of corresponding preprocessed validation Datasets.
- ”valid_labels”: a list of the valid dataset labels.
- ”num_data”: a list of the number of data in validation datasets.
Note, you can pass several datasets for validation, but only the first one will be used for early stopping.
feval (callable, list of callable, or None, optional (default = None)) –
Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

preds: numpy 1-D array or numpy 2-D array (for multi-class task)
The predicted values. For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes]. If custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

eval_data: Dataset
A Dataset to evaluate.

eval_name: str
The name of evaluation function (without whitespaces).

eval_result: float
The eval result.

is_higher_better: bool
Whether a higher eval result is better, e.g. AUC is is_higher_better.

To ignore the default metric corresponding to the used objective, set the metric parameter to the string "None" in params.
init_models (list[str], list[pathlib.Path], list[Booster] or None, optional (default = None)) – List of filenames of LightGBM model or Booster instance used for continue training. There should be one model for each rum_structure dictionary.
feature_name (list of str, or 'auto', optional (default = "auto")) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
categorical_feature (list of str or int, or 'auto', optional (default = "auto")) – Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features will be rounded towards 0.
keep_training_booster (bool, optional (default = False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning. This means you won’t be able to use eval, eval_train or eval_valid methods of the returned Booster. When your model is very large and causes a memory error, you can try to set this param to True to avoid the model conversion performed during the internal call of model_to_string. You can still use _InnerPredictor as init_model to continue training later.
callbacks (list of callable, or None, optional (default = None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.
torch_tensors (dict, optional (default=None)) –
If a dictionary is passed, torch.Tensors will be used for computing prediction, objective function and cross-entropy calculations. This requires PyTorch to be installed. The dictionary should have the following form:

’device’: ‘cpu’, ‘gpu’ or ‘cuda’
The device on which the calculations will be performed.

’torch_compile’: bool
If True, the prediction, objective function and cross-entropy calculations will be compiled with torch.compile. If used with GPU or cuda, it requires a Linux OS.
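
The model_specification keys described above can be assembled as in the following minimal sketch. The column names (cost_0, time_0, ...) and all numeric values are hypothetical placeholders, and the boosting_params entries are ordinary LightGBM parameters chosen for illustration only. The example assumes four alternatives grouped into two nests.

```python
import numpy as np

num_classes = 4

# Hypothetical column names: one cost and one time variable per alternative.
columns = {j: [f"cost_{j}", f"time_{j}"] for j in range(num_classes)}

# One parameter ensemble per alternative, each with the four required keys.
rum_structure = [
    {
        "utility": [j],
        "variables": columns[j],
        "boosting_params": {
            "learning_rate": 0.1,
            "max_depth": 1,
            "monotone_constraints": [-1, -1],  # utility decreasing in cost and time
            "verbosity": -1,
        },
        "shared": False,
    }
    for j in range(num_classes)
]

model_specification = {
    "general_params": {
        "num_iterations": 500,
        "num_classes": num_classes,
        "early_stopping_rounds": 20,
        "verbosity": 1,
    },
    "rum_structure": rum_structure,
    # Optional key: alternatives 0 and 1 in nest 0, alternatives 2 and 3 in nest 1.
    "nested_logit": {
        "mu": np.array([1.0, 1.0]),
        "nests": {0: [0, 1], 1: [2, 3]},
        "optimise_mu": [True, False],  # mu_0 optimised, mu_1 fixed
        "optim_interval": 20,
    },
}
```

Such a dictionary would be passed as the model_specification argument together with a train_set containing the named columns.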

Note

A custom objective function can be provided for the objective parameter. It should accept two parameters: preds, train_data and return (grad, hess).

preds: numpy 1-D array or numpy 2-D array (for multi-class task)
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

train_data: Dataset
The training dataset.

grad: numpy 1-D array or numpy 2-D array (for multi-class task)
The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.

hess: numpy 1-D array or numpy 2-D array (for multi-class task)
The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes], and grad and hess should be returned in the same format.
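
A custom objective with the (preds, train_data) -> (grad, hess) signature described in this note could look like the following minimal sketch for the binary case, where preds are raw margins. The LabelHolder class is a hypothetical stand-in for the training Dataset, and the only assumption made about it is a get_label() method, as on a LightGBM Dataset.

```python
import numpy as np


class LabelHolder:
    """Hypothetical stand-in for the training Dataset; exposes get_label() only."""

    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=float)

    def get_label(self):
        return self._labels


def binary_logloss_objective(preds, train_data):
    """Gradient and hessian of the binary log loss w.r.t. the raw margins."""
    labels = train_data.get_label()
    # preds are raw margins, so apply the sigmoid to get probabilities.
    probs = 1.0 / (1.0 + np.exp(-preds))
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess


preds = np.array([0.0, 2.0, -2.0])
train_data = LabelHolder([1, 0, 1])
grad, hess = binary_logloss_objective(preds, train_data)
```

Both returned arrays have the same shape as preds, one value per sample point.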

Returns:: rum_booster – The trained RUMBoost model.
Return type:: RUMBoost
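
For the feval parameter, an evaluation function returning (eval_name, eval_result, is_higher_better) could be sketched as follows. This is an illustrative accuracy metric for the multi-class case, not a function shipped with rumboost. The LabelHolder class is a hypothetical stand-in for the Dataset passed as eval_data, assumed only to provide get_label().

```python
import numpy as np


class LabelHolder:
    """Hypothetical stand-in for eval_data; exposes get_label() only."""

    def __init__(self, labels):
        self._labels = np.asarray(labels)

    def get_label(self):
        return self._labels


def accuracy_eval(preds, eval_data):
    """Share of observations whose highest-scoring class matches the label."""
    labels = eval_data.get_label().astype(int)
    # preds has shape [n_samples, n_classes]; argmax works on raw scores too.
    predicted_class = preds.argmax(axis=1)
    accuracy = float((predicted_class == labels).mean())
    return "accuracy", accuracy, True


preds = np.array([[0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.1, 0.2, 0.7]])
eval_data = LabelHolder([1, 0, 1])
eval_name, eval_result, is_higher_better = accuracy_eval(preds, eval_data)
```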