rumboost package

Submodules

rumboost.basic_functions module

rumboost.dataset module

rumboost.dataset.load_preprocess_Airplane(test_size: float = 0.3, random_state: int = 42)[source]
rumboost.dataset.load_preprocess_LPMC()[source]

Load and preprocess the LPMC dataset.

Returns

dataset_trainpandas Dataframe

The training dataset ready to use.

dataset_testpandas Dataframe

The test dataset ready to use.

foldszip(list, list)

5 folds of indices grouped by household for CV.
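How the folds are built is not shown in this reference. As a minimal sketch of what "folds grouped by household" could look like, the plain-Python helper below keeps all rows of a household in the same fold. The name grouped_folds, the round-robin assignment and the toy household IDs are assumptions for illustration, not the package's implementation.

```python
from collections import defaultdict


def grouped_folds(household_ids, n_folds=5):
    """Return (train_idx, test_idx) pairs; a household never spans two folds."""
    # Collect the row indices belonging to each household.
    groups = defaultdict(list)
    for row, household in enumerate(household_ids):
        groups[household].append(row)

    # Assign whole households to folds in round-robin order (an assumption).
    fold_rows = [[] for _ in range(n_folds)]
    for k, household in enumerate(sorted(groups)):
        fold_rows[k % n_folds].extend(groups[household])

    all_rows = set(range(len(household_ids)))
    folds = []
    for test_rows in fold_rows:
        train_rows = sorted(all_rows - set(test_rows))
        folds.append((train_rows, sorted(test_rows)))
    return folds


# Toy data: one household ID per row (hypothetical values).
households = [10, 10, 11, 12, 12, 13, 14, 15, 15, 16]
folds = grouped_folds(households)
```

Each element of folds is a (train indices, test indices) pair, which is the shape expected by the folds argument of a cross-validation routine.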

rumboost.dataset.load_preprocess_Netherlands(test_size: float = 0.3, random_state: int = 42)[source]
rumboost.dataset.load_preprocess_Optima()[source]

Load and preprocess the Optima dataset.

Returns

dataset_trainpandas Dataframe

The training dataset ready to use.

dataset_testpandas Dataframe

The test dataset ready to use.

foldszip(list, list)

5 folds of indices grouped by household for CV.

rumboost.dataset.load_preprocess_Parking(test_size: float = 0.3, random_state: int = 42)[source]
rumboost.dataset.load_preprocess_SwissMetro(test_size: float = 0.3, random_state: int = 42, full_data=False)[source]

Load and preprocess the SwissMetro dataset.

Parameters

test_sizefloat, optional (default = 0.3)

The proportion of data used for test set.

random_stateint, optional (default = 42)

For reproducibility in the train-test split

Returns

dataset_trainpandas Dataframe

The training dataset ready to use.

dataset_testpandas Dataframe

The test dataset ready to use.

rumboost.dataset.load_preprocess_Telephone(test_size: float = 0.3, random_state: int = 3)[source]
rumboost.dataset.load_preprocess_Vaccines()[source]

rumboost.models module

rumboost.models.Airplane(df_train, for_prob=False)[source]
rumboost.models.LPMC(dataset_train, for_prob=False)[source]

Create a MNL on the LPMC dataset. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.LPMC_nested(dataset_train, for_prob=False)[source]

Create a nested logit model on the LPMC dataset. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.LPMC_nested_normalised(dataset_train, for_prob=False)[source]

Create a nested logit model on the LPMC dataset, normalised for biogeme estimation. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.LPMC_normalised(dataset_train, for_prob=False)[source]

Create a MNL on the LPMC dataset, normalised for biogeme estimation. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.Netherlands(df_train, for_prob=False)[source]
rumboost.models.Optima(dataset_train, for_prob=False)[source]

Create a MNL on the OPTIMA dataset. The model is a slightly modified version of the code that can be found here: https://github.com/JoseAngelMartinB/prediction-behavioural-analysis-ml-travel-mode-choice.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.Parking(df_train, for_prob=False)[source]
rumboost.models.SwissMetro(dataset_train: DataFrame, for_prob=False)[source]

Create a MNL on the swissmetro dataset.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.SwissMetro_MNL(dataset_train: DataFrame, for_prob=False)[source]

Create a simple MNL on the swissmetro dataset.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.SwissMetro_nested(dataset_train: DataFrame, for_prob=False)[source]

Create a nested logit model on the swissmetro dataset.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.SwissMetro_normalised(dataset_train: DataFrame, for_prob=False)[source]

Create a MNL on the swissmetro dataset, normalised for biogeme estimation.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.models.Telephone(df_train, for_prob=False)[source]
rumboost.models.Vaccines(dataset_train: DataFrame, for_prob=False)[source]

Create a MNL on the Vaccine dataset.

Parameters

dataset_trainpandas DataFrame

The training dataset.

Returns

biogemebio.BIOGEME

The BIOGEME object containing the model.

rumboost.rumboost module

Library with training routines of LightGBM.

class rumboost.rumboost.CVRUMBoost[source]

Bases: object

CVRUMBoost in LightGBM.

Auxiliary data structure that holds all boosters of the cv function and redirects method calls to them. This class has the same methods as the Booster class. Each method call is performed on every underlying Booster, and the results are returned as a list.

Attributes

rum_boosterslist of RUMBoost

The list of underlying fitted models.

best_iterationint

The best iteration of fitted model.

class rumboost.rumboost.RUMBoost(model_file=None)[source]

Bases: object

RUMBoost for doing Random Utility Modelling in LightGBM.

Auxiliary data structure to implement boosters of rum_train() function for multiclass classification. This class has the same methods as Booster class. All method calls, except for the following methods, are actually performed for underlying Boosters.

  • model_from_string()

  • model_to_string()

  • save_model()

Attributes

boosterslist of Booster

The list of fitted models.

valid_setsNone

Validation sets of the RUMBoost. By default None, to avoid computing cross entropy if there are no validation sets.

f_obj(_, train_set: Dataset)[source]

Objective function of the binary classification boosters, but based on softmax predictions.

Parameters

train_setDataset

Training set used to train the jth booster. It means that it is not the full training set but rather another dataset containing the relevant features for that utility. It is the jth dataset in the RUMBoost object.

Returns

gradnumpy array

The gradient with the cross-entropy loss function. It is the predictions minus the binary labels (when used for the jth booster, the labels are 1 if the chosen class is j and 0 otherwise).

hessnumpy array

The hessian with the cross-entropy loss function (second derivative approximation rather than the hessian). Calculated as factor * preds * (1 - preds).
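The gradient and hessian described above can be sketched in plain Python. This is an illustration rather than the library's code: the helper names are hypothetical, and the value of factor is an assumption (num_classes / (num_classes - 1), the scaling commonly used for multiclass softmax boosting), since this reference does not define it.

```python
import math


def softmax(utilities):
    """Numerically stable softmax over one row of raw utilities."""
    shift = max(utilities)
    exps = [math.exp(u - shift) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def grad_hess_for_class(raw_utilities, chosen, j, factor):
    """Gradient and hessian approximation seen by the booster of class j.

    raw_utilities: one row per observation, one utility per class.
    chosen: index of the chosen class for each observation.
    """
    grad, hess = [], []
    for row, y in zip(raw_utilities, chosen):
        p = softmax(row)[j]
        label = 1.0 if y == j else 0.0  # binary label for class j
        grad.append(p - label)  # prediction minus binary label
        hess.append(factor * p * (1.0 - p))  # second derivative approximation
    return grad, hess


utilities = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]  # toy values
chosen = [0, 2]
num_classes = 3
factor = num_classes / (num_classes - 1)  # assumption, see lead-in
grad, hess = grad_hess_for_class(utilities, chosen, j=0, factor=factor)
```

For the first observation all utilities are equal, so the prediction for class 0 is 1/3; because class 0 was chosen, the gradient is negative, which pushes the utility of class 0 up.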

f_obj_cross_nested(_, train_set: Dataset)[source]

Objective function of the binary classification boosters, for a cross-nested rumboost.

Parameters

train_setDataset

Training set used to train the jth booster. It means that it is not the full training set but rather another dataset containing the relevant features for that utility. It is the jth dataset in the RUMBoost object.

Returns

gradnumpy array

The gradient with the cross-entropy loss function and cross-nested probabilities.

hessnumpy array

The hessian with the cross-entropy loss function and cross-nested probabilities (second derivative approximation rather than the hessian).
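The cross-nested probabilities themselves are not spelled out in this reference. The sketch below uses one common cross-nested logit formulation, with a J-by-M matrix of membership degrees (alphas) and one scale per nest (mu); the function name and the exact normalisation are assumptions, and the library may use a different parameterisation.

```python
import math


def cross_nested_probs(V, alphas, mu):
    """Choice probabilities for utilities V under a cross-nested logit.

    alphas[j][m]: degree of membership of alternative j in nest m.
    mu[m]: scale parameter of nest m.
    """
    n_alt, n_nests = len(V), len(mu)
    # terms[j][m] = (alpha_jm * exp(V_j)) ** mu_m
    terms = [
        [(alphas[j][m] * math.exp(V[j])) ** mu[m] for m in range(n_nests)]
        for j in range(n_alt)
    ]
    nest_sums = [sum(terms[j][m] for j in range(n_alt)) for m in range(n_nests)]
    nest_weights = [nest_sums[m] ** (1.0 / mu[m]) for m in range(n_nests)]
    total = sum(nest_weights)

    probs = []
    for i in range(n_alt):
        p = 0.0
        for m in range(n_nests):
            if nest_sums[m] > 0.0:
                # P(nest m) * P(alternative i | nest m)
                p += (nest_weights[m] / total) * (terms[i][m] / nest_sums[m])
        probs.append(p)
    return probs


V = [0.5, 0.2, -0.1]  # toy utilities
alphas = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # alternative 2 is split 50/50
probs = cross_nested_probs(V, alphas, mu=[1.5, 1.0])
probs_mnl = cross_nested_probs(V, alphas, mu=[1.0, 1.0])
```

With every mu equal to 1 and each row of alphas summing to 1, this formulation reduces to the plain multinomial logit, which is a convenient sanity check.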

f_obj_nest(_, train_set: Dataset)[source]

Objective function of the binary classification boosters, for a nested rumboost.

Parameters

train_setDataset

Training set used to train the jth booster. It means that it is not the full training set but rather another dataset containing the relevant features for that utility. It is the jth dataset in the RUMBoost object.

Returns

gradnumpy array

The gradient with the cross-entropy loss function and nested probabilities.

hessnumpy array

The hessian with the cross-entropy loss function and nested probabilities (second derivative approximation rather than the hessian).

model_from_string(model_str: str)[source]

Load RUMBoost from a string.

Parameters

model_strstr

Model will be loaded from this string.

Returns

selfRUMBoost

Loaded RUMBoost object.

model_to_string(num_iteration: int | None = None, start_iteration: int = 0, importance_type: str = 'split') str[source]

Save RUMBoost to JSON string.

Parameters

num_iterationint or None, optional (default=None)

Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.

start_iterationint, optional (default=0)

Start index of the iteration that should be saved.

importance_typestr, optional (default=”split”)

What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns

str_reprstr

JSON string representation of RUMBoost.

predict(data, start_iteration: int = 0, num_iteration: int = -1, raw_score: bool = True, pred_leaf: bool = False, pred_contrib: bool = False, data_has_header: bool = False, validate_features: bool = False, utilities: bool = False, nests: dict = None, mu: list[float] = None, alphas: array = None)[source]

Predict logic.

Parameters

datastr, pathlib.Path, numpy array, pandas DataFrame, H2O DataTable’s Frame or scipy.sparse

Data source for prediction. If str or pathlib.Path, it represents the path to a text file (CSV, TSV, or LibSVM).

start_iterationint, optional (default=0)

Start index of the iteration to predict.

num_iterationint, optional (default=-1)

Iteration used for prediction.

raw_scorebool, optional (default=True)

Whether to predict raw scores.

pred_leafbool, optional (default=False)

Whether to predict leaf index.

pred_contribbool, optional (default=False)

Whether to predict feature contributions.

data_has_headerbool, optional (default=False)

Whether data has header. Used only for txt data.

validate_featuresbool, optional (default=False)

If True, ensure that the features used to predict match the ones used to train. Used only if data is pandas DataFrame.

utilitiesbool, optional (default=False)

If True, return raw utilities for each class, without generating probabilities.

nestsdict, optional (default=None)

If not None, compute predictions with the nested probability function. The dictionary keys are the alternative numbers and the values are their nest numbers. For example, {0:0, 1:1, 2:0} means that alt 0 and 2 are in nest 0 and alt 1 is in nest 1.

mulist, optional (default=None)

Only used, and required, if nests is not None. It is the list of mu values for each nest. The first value corresponds to the first nest, and so on.

alphasndarray, optional (default=None)

An array of J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j to nest n. For example, alpha_12 = 0.5 means that alternative one belongs 50% to nest 2.

Returns

resultnumpy array, scipy.sparse or list of scipy.sparse

Prediction result. Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when pred_contrib=True).
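To illustrate how the nests and mu arguments fit together, the sketch below computes nested logit probabilities from raw utilities in plain Python. It uses one common normalisation (top-level scale fixed to 1); the function name is hypothetical and the library's internal formula may differ.

```python
import math


def nested_probs(V, nests, mu):
    """Nested logit probabilities.

    V: raw utility of each alternative.
    nests: {alternative: nest}, e.g. {0: 0, 1: 1, 2: 0}.
    mu: scale of each nest; mu[0] belongs to nest 0, and so on.
    """
    nest_ids = sorted(set(nests.values()))
    members = {m: [j for j in nests if nests[j] == m] for m in nest_ids}
    # Log-sum of scaled utilities inside each nest.
    logsum = {
        m: math.log(sum(math.exp(mu[m] * V[j]) for j in members[m]))
        for m in nest_ids
    }
    denom = sum(math.exp(logsum[m] / mu[m]) for m in nest_ids)

    probs = []
    for i in range(len(V)):
        m = nests[i]
        p_alt_given_nest = math.exp(mu[m] * V[i] - logsum[m])
        p_nest = math.exp(logsum[m] / mu[m]) / denom
        probs.append(p_alt_given_nest * p_nest)
    return probs


V = [0.5, 0.2, -0.1]  # toy utilities
nests = {0: 0, 1: 1, 2: 0}  # alt 0 and 2 in nest 0, alt 1 in nest 1
probs = nested_probs(V, nests, mu=[2.0, 1.0])
probs_mnl = nested_probs(V, nests, mu=[1.0, 1.0])
```

Setting every mu to 1 removes the within-nest correlation, so the result matches a plain softmax over the utilities.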

save_model(filename: str | Path, num_iteration: int | None = None, start_iteration: int = 0, importance_type: str = 'split') RUMBoost[source]

Save RUMBoost to a file as JSON text.

Parameters

filenamestr or pathlib.Path

Filename to save RUMBoost.

num_iterationint or None, optional (default=None)

Index of the iteration that should be saved. If None, if the best iteration exists, it is saved; otherwise, all iterations are saved. If <= 0, all iterations are saved.

start_iterationint, optional (default=0)

Start index of the iteration that should be saved.

importance_typestr, optional (default=”split”)

What type of feature importance should be saved. If “split”, result contains numbers of times the feature is used in a model. If “gain”, result contains total gains of splits which use the feature.

Returns

selfRUMBoost

Returns self.

rumboost.rumboost.rum_cv(params, train_set, num_boost_round=100, folds=None, nfold=5, stratified=True, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, eval_train_metric=False, return_cvbooster=False, rum_structure=None, biogeme_model=None)[source]

Perform the cross-validation with given parameters.

Parameters

paramsdict

Parameters for Booster.

train_setDataset

Data to be trained on.

num_boost_roundint, optional (default=100)

Number of boosting iterations.

foldsgenerator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)

If generator or iterator, it should yield the train and test indices for each fold. If object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have split method. This argument has highest priority over other data split arguments.

nfoldint, optional (default=5)

Number of folds in CV.

stratifiedbool, optional (default=True)

Whether to perform stratified sampling.

shufflebool, optional (default=True)

Whether to shuffle before splitting data.

metricsstr, list of str, or None, optional (default=None)

Evaluation metrics to be monitored while CV. If not None, the metric in params will be overridden.

fobjcallable or None, optional (default=None)

Customized objective function. Should accept two parameters: preds, train_data, and return (grad, hess).

predslist or numpy 1-D array

The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

train_dataDataset

The training dataset.

gradlist or numpy 1-D array

The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.

hesslist or numpy 1-D array

The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

For multi-class tasks, preds are grouped by class_id first, then by row_id. To get the prediction for the i-th row in the j-th class, access preds[j * num_data + i]; grad and hess should be grouped in the same way.

fevalcallable, list of callable, or None, optional (default=None)

Customized evaluation function. Each evaluation function should accept two parameters: preds, train_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

predslist or numpy 1-D array

The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

train_dataDataset

The training dataset.

eval_namestr

The name of evaluation function (without whitespace).

eval_resultfloat

The eval result.

is_higher_betterbool

Whether a higher eval result is better, e.g. AUC is is_higher_better.

For multi-class tasks, preds are grouped by class_id first, then by row_id. To get the prediction for the i-th row in the j-th class, access preds[j * num_data + i]. To ignore the default metric corresponding to the used objective, set metrics to the string "None".

init_modelstr, pathlib.Path, Booster or None, optional (default=None)

Filename of LightGBM model or Booster instance used to continue training.

feature_namelist of str, or ‘auto’, optional (default=”auto”)

Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.

categorical_featurelist of str or int, or ‘auto’, optional (default=”auto”)

Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature.

early_stopping_roundsint or None, optional (default=None)

Activates early stopping. CV score needs to improve at least every early_stopping_rounds round(s) to continue. Requires at least one metric. If there’s more than one, will check all of them. To check only the first metric, set the first_metric_only parameter to True in params. Last entry in evaluation history is the one from the best iteration.

fpreproccallable or None, optional (default=None)

Preprocessing function that takes (dtrain, dtest, params) and returns transformed versions of those.

verbose_evalbool, int, or None, optional (default=None)

Whether to display the progress. If True, progress will be displayed at every boosting stage. If int, progress will be displayed at every given verbose_eval boosting stage.

show_stdvbool, optional (default=True)

Whether to display the standard deviation in progress. Results are not affected by this parameter, and always contain std.

seedint, optional (default=0)

Seed used to generate the folds (passed to numpy.random.seed).

callbackslist of callable, or None, optional (default=None)

List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.

eval_train_metricbool, optional (default=False)

Whether to display the train metric in progress. The score of the metric is calculated again after each training step, so there is some impact on performance.

return_cvboosterbool, optional (default=False)

Whether to return Booster models trained on each fold through CVBooster.

rum_structurelist of dict, optional (default=None)

List of dictionaries specifying the RUM structure. The list must contain one dictionary for each class, which describes the utility structure for that class. Each dictionary has three allowed keys:

  • cols: list of columns included in that class

  • monotone_constraints: list of monotonic constraints on parameters

  • interaction_constraints: list of interaction constraints on features

If None, a biogeme_model must be specified.

biogeme_model: biogeme.biogeme.BIOGEME, optional (default=None)

A biogeme.biogeme.BIOGEME object representing a biogeme model, used to create the rum_structure. A biogeme model is required if rum_structure is None, otherwise should be None.

Returns

eval_histdict

Evaluation history. The dictionary has the following format: {‘metric1-mean’: [values], ‘metric1-stdv’: [values], ‘metric2-mean’: [values], ‘metric2-stdv’: [values], …}. If return_cvbooster=True, also returns trained boosters via cvbooster key.
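The feval contract described in the parameters above (flattened, class-major preds and a (eval_name, eval_result, is_higher_better) return value) can be sketched as follows. FakeDataset is a hypothetical stand-in that only mimics the get_label() method of a LightGBM Dataset, so that the example runs without the library.

```python
def accuracy_feval(preds, train_data):
    """Custom metric: share of rows whose highest-scoring class is the label."""
    labels = [int(y) for y in train_data.get_label()]
    num_data = len(labels)
    num_class = len(preds) // num_data
    correct = 0
    for i, y in enumerate(labels):
        # preds are grouped by class first: row i of class j is at j * num_data + i.
        scores = [preds[j * num_data + i] for j in range(num_class)]
        if scores.index(max(scores)) == y:
            correct += 1
    return "accuracy", correct / num_data, True


class FakeDataset:
    """Hypothetical stand-in exposing only get_label()."""

    def __init__(self, label):
        self._label = label

    def get_label(self):
        return self._label


# Two rows, three classes, stored class by class.
flat_preds = [0.1, 0.7,   # class 0, rows 0 and 1
              0.8, 0.2,   # class 1
              0.1, 0.1]   # class 2
name, value, higher_is_better = accuracy_feval(flat_preds, FakeDataset([1, 0]))
```

The metric name contains no whitespace, as the parameter description requires.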

rumboost.rumboost.rum_train(params: dict[str, Any], train_set: Dataset, rum_structure: list[dict[str, Any]] = None, num_boost_round: int = 100, valid_sets: list[Dataset] | None = None, valid_names: list[str] | None = None, feval: Callable[[List | ndarray, Dataset], Tuple[str, float, bool]] | list[Callable[[List | ndarray, Dataset], Tuple[str, float, bool]]] | None = None, init_model: str | Path | Booster | None = None, feature_name: list[str] | str = 'auto', categorical_feature: list[str] | list[int] | str = 'auto', keep_training_booster: bool = False, callbacks: list[Callable] | None = None, nests: dict = None, mu: list = None, params_fe: dict = None, alphas: array = None) RUMBoost[source]

Perform the RUM training with given parameters.

Parameters

paramsdict

Parameters for training. Values passed through params take precedence over those supplied via arguments. If num_classes > 2, please specify params[‘objective’] = ‘multiclass’.

train_setDataset

Data to be trained on. Set free_raw_data=False when creating the dataset.

rum_structurelist[dict[str, Any]], optional (default = None)

List of dictionaries specifying the RUM structure. The list must contain one dictionary for each class, which describes the utility structure for that class. Each dictionary has three allowed keys:

  • ‘cols’: list of columns included in that class

  • ‘monotone_constraints’: list of monotonic constraints on parameters

  • ‘interaction_constraints’: list of interaction constraints on features

If None, a biogeme_model must be specified.

biogeme_modelBIOGEME, optional (default = None)

A BIOGEME object representing a biogeme model, used to create the rum_structure. A biogeme model is required if rum_structure is None, otherwise should be None.

num_boost_roundint, optional (default = 100)

Number of boosting iterations.

valid_setslist of Dataset, or None, optional (default = None)

List of data to be evaluated on during training.

valid_nameslist of str, or None, optional (default = None)

Names of valid_sets.

fevalcallable, list of callable, or None, optional (default = None)

Customized evaluation function. Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

predsnumpy 1-D array or numpy 2-D array (for multi-class task)

The predicted values. For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes]. If custom objective function is used, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.

eval_dataDataset

A Dataset to evaluate.

eval_namestr

The name of evaluation function (without whitespaces).

eval_resultfloat

The eval result.

is_higher_betterbool

Whether a higher eval result is better, e.g. AUC is is_higher_better.

To ignore the default metric corresponding to the used objective, set the metric parameter to the string "None" in params.

init_modelstr, pathlib.Path, Booster or None, optional (default = None)

Filename of LightGBM model or Booster instance used to continue training.

feature_namelist of str, or ‘auto’, optional (default = “auto”)

Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.

categorical_featurelist of str or int, or ‘auto’, optional (default = “auto”)

Categorical features. If list of int, interpreted as indices. If list of str, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature. Floating point numbers in categorical features will be rounded towards 0.

keep_training_boosterbool, optional (default = False)

Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning. This means you won’t be able to use eval, eval_train or eval_valid methods of the returned Booster. When your model is very large and causes a memory error, you can try to set this param to True to avoid the model conversion performed during the internal call of model_to_string. You can still use _InnerPredictor as init_model to continue training later.

callbackslist of callable, or None, optional (default = None)

List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.

mulist, optional (default=None)

List of mu values, the scaling parameter, for each nest. The first value of the list corresponds to nest 0, and so on.

nestsdict, optional (default=None)

Dictionary representing the nesting structure. Keys are alternatives, and values are the nest they belong to. For example, {0:0, 1:1, 2:0} means alt 0 and 2 belong to nest 0 and alt 1 belongs to nest 1.

params_fedict, optional (default=None)

Parameters for training the socio-economic part of a functional effect model.

alphasndarray, optional (default=None)

An array of J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j to nest n. For example, alpha_12 = 0.5 means that alternative one belongs 50% to nest 2.

Note

A custom objective function can be provided for the objective parameter. It should accept two parameters: preds, train_data and return (grad, hess).

predsnumpy 1-D array or numpy 2-D array (for multi-class task)

The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.

train_dataDataset

The training dataset.

gradnumpy 1-D array or numpy 2-D array (for multi-class task)

The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.

hessnumpy 1-D array or numpy 2-D array (for multi-class task)

The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

For multi-class task, preds are numpy 2-D array of shape = [n_samples, n_classes], and grad and hess should be returned in the same format.

Returns

rum_boosterRUMBoost

The trained RUMBoost model.
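As an illustration of the rum_structure argument, the sketch below builds a two-class structure and checks it against the rules stated in the parameter description. The column names, the constraint values and the check_rum_structure helper are hypothetical. The sketch also assumes the LightGBM conventions for the constraint formats (-1, 0 or 1 per column for monotone_constraints, lists of column indices for interaction_constraints), which this reference does not spell out.

```python
ALLOWED_KEYS = {"cols", "monotone_constraints", "interaction_constraints"}


def check_rum_structure(rum_structure, num_classes):
    """Basic consistency checks on a rum_structure list (illustrative only)."""
    if len(rum_structure) != num_classes:
        raise ValueError("rum_structure needs one dictionary per class")
    for j, spec in enumerate(rum_structure):
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"class {j}: unknown keys {sorted(unknown)}")
        if len(spec["monotone_constraints"]) != len(spec["cols"]):
            raise ValueError(f"class {j}: one monotone constraint per column")
    return True


# Hypothetical two-alternative structure: one dictionary per class.
rum_structure = [
    {
        "cols": ["walk_time", "walk_distance"],
        "monotone_constraints": [-1, -1],  # utility decreases with each feature
        "interaction_constraints": [[0], [1]],  # no feature interactions
    },
    {
        "cols": ["drive_time", "drive_cost"],
        "monotone_constraints": [-1, -1],
        "interaction_constraints": [[0], [1]],
    },
]
is_valid = check_rum_structure(rum_structure, num_classes=2)

# With the library installed, the structure would then be passed on, e.g.:
# model = rum_train(params, train_set, rum_structure=rum_structure)
```

Keeping each feature in its own interaction group, as in this sketch, is what makes each utility an additive function of its features.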

rumboost.utility_plotting module

rumboost.utility_plotting.plot_2d(model, feature1: str, feature2: str, min1: int, max1: int, min2: int, max2: int, save_figure: bool = False, utility_names: list[str] = ['Walking', 'Cycling', 'Public Transport', 'Driving'], num_points=1000)[source]

Plot a 2nd order feature interaction as a contour plot.

Parameters

modelRUMBoost

A RUMBoost object.

feature1str

Name of feature 1.

feature2str

Name of feature 2.

min1int

Minimum value of feature 1.

max1int

Maximum value of feature 1.

min2int

Minimum value of feature 2.

max2int

Maximum value of feature 2.

save_figurebool, optional (default = False)

If true, save the figure as a png file

utility_nameslist[str]

List of the alternative names

num_pointsint, optional (default=1000)

The number of points per axis. The total number of points is num_points**2.
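The relation between num_points and the size of the evaluation grid can be sketched in plain Python. The helpers linspace and make_grid are hypothetical and only illustrate why the total grows as num_points**2; they are not the plotting code itself.

```python
def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi (inclusive)."""
    if n == 1:
        return [float(lo)]
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]


def make_grid(min1, max1, min2, max2, num_points):
    """All (feature1, feature2) pairs evaluated for a contour plot."""
    xs = linspace(min1, max1, num_points)
    ys = linspace(min2, max2, num_points)
    return [(x, y) for x in xs for y in ys]


grid = make_grid(0, 10, 0, 5, num_points=50)
total_points = len(grid)  # 50 points per axis gives 50 ** 2 pairs
```

With the default num_points=1000 the grid holds one million pairs, so a smaller value is a reasonable choice for quick exploratory plots.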

rumboost.utility_plotting.plot_VoT(data_train, util_collection, attribute_VoT, utility_names, draw_range, save_figure=False, num_points=1000)[source]

Plot the Value of Time of the attributes specified in attribute_VoT.

Parameters

util_collectiondict

A dictionary containing the type of utility to use for all features in all utilities.

attribute_VoTdict

A dictionary with keys being the utility number (as string) and values being a tuple of the attributes to compute the VoT on. The structure follows this form: {utility: (attribute1, attribute2)}

rumboost.utility_plotting.plot_bootstrap(models: list, dataset: DataFrame, features: dict[list[str]])[source]

Plot the bootstrap sampling.

Parameters

models: list

A list containing all the trained models of the bootstrap sampling.

dataset: pd.DataFrame

The full dataset used for training

features: dict[list[str]]

A dictionary mapping each alternative index (as a string) to the list of features for that alternative, e.g. {‘0’: [‘feature_1’, …], ‘1’: [], …}

rumboost.utility_plotting.plot_ind_spec_constant(socec_model, dataset_train, alternatives: list[str])[source]

Plot a histogram of the individual-specific constants of all alternatives of a functional effect model.

Parameters

socec_model:

The part of the functional effect model with full interactions of socio-economic characteristics.

dataset_train:

The dataset used to train the model. It must be a lightGBM Dataset object.

alternatives:

The list of alternative names.

rumboost.utility_plotting.plot_market_segm(model, X, asc_normalised: bool = True, utility_names: list[str] = ['Walking', 'Cycling', 'Public Transport', 'Driving'])[source]

Plot the market segmentation.

Parameters

modelRUMBoost

A RUMBoost object.

Xpandas DataFrame

Training data.

asc_normalisedbool, optional (default = True)

If True, scale down utilities to be zero at the y axis.

utility_nameslist[str], optional (default = [‘Walking’, ‘Cycling’, ‘Public Transport’, ‘Driving’])

Names of utilities.

rumboost.utility_plotting.plot_parameters(model, X, utility_names, Betas=None, model_unconstrained=None, with_pw=False, save_figure=False, asc_normalised=False, with_asc=False, with_cat=True, only_tt=False, only_1d=False, with_fit=False, fit_all=True, technique='weighted_data', data_sep=False, sm_tt_cost=False, save_file='')[source]

Plot the non linear impact of parameters on the utility function. When specified, unconstrained parameters and parameters from a RUM model can be added to the plot.

Parameters

modelRUMBoost

A RUMBoost object.

Xpandas dataframe

Features used to train the model, in a pandas dataframe.

utility_namesdict

Dictionary mapping utilities indices to their names.

Betaslist, optional (default = None)

List of beta parameters value from a RUM. They should be listed in the same order as in the RUMBoost model.

model_unconstrainedLightGBM model, optional (default = None)

The unconstrained model. Must be trained and compatible with dump_model().

with_pwbool, optional (default = False)

If the piece-wise function should be included in the graph.

save_figurebool, optional (default = False)

If True, save the plot as a png file.

asc_normalisedbool, optional (default = False)

If True, scale down utilities to be zero at the y axis.

with_ascbool, optional (default = False)

If True, add the ASCs to all graphs (one is normalised, and asc_normalised must be True).

with_catbool, optional (default = True)

If False, categorical features are not plotted.

only_ttbool, optional (default = False)

If True, plot only travel time and distance.

only_1dbool, optional (default = False)

If True, plot only the features separately.

with_fitbool, optional (default = False)

If True, fit the data with simple functions to approximate the step functions.

fit_allbool, optional (default = True)

If False, plot only the best fitting function.

techniquestr, optional (default = ‘weighted_data’)

The technique for data sampling in the function fitting.

data_sepbool, optional (default = False)

If True, split the data to fit subsets of data.

sm_tt_costbool, optional (default = False)

If True, plot only the swissmetro travel time and cost on the same figure.

save_filestr, optional (default=’’)

The name to save the figure with.

rumboost.utility_plotting.plot_pop_VoT(data_test, util_collection, attribute_VoT, save_figure=False)[source]
rumboost.utility_plotting.plot_spline(model, data_train, spline_collection, utility_names, mean_splines=False, x_knots_dict=None, save_fig=False, lpmc_tt_cost=False, sm_tt_cost=False, save_file='')[source]

Plot the spline interpolation for all utilities interpolated.

Parameters

modelRUMBoost

A RUMBoost object.

data_trainpandas Dataframe

The full training dataset.

spline_collectiondict

A dictionary containing the optimal number of splines for each feature interpolated of each utility

mean_splinesbool, optional (default = False)

Must be True if the splines are computed at the mean distribution of data for stairs.

x_knots_dictdict

A dictionary in the form of {utility: {attribute: x_knots}} where x_knots are the spline knots for the corresponding utility and attributes

rumboost.utility_plotting.plot_util(model, data_train, points=10000)[source]

Plot the raw utility functions of all features. This is done directly from the predict attribute of lightgbm.Boosters.

Parameters

modelRUMBoost

A RUMBoost object.

data_trainpandas Dataframe

The full training dataset.

pointsint, optional (default = 10000)

The number of points used to draw the line plot.

rumboost.utility_plotting.plot_util_pw(model, data_train, points=10000)[source]

Plot the piece-wise utility function

Parameters

modelRUMBoost

A RUMBoost object.

data_trainpandas Dataframe

The full training dataset.

pointsint, optional (default = 10000)

The number of points used to draw the line plot.

rumboost.utility_smoothing module

rumboost.utility_smoothing.find_best_num_splines(weights, data_train, data_test, label_test, spline_utilities, mean_splines=False, search_technique='greedy')[source]

DEPRECATED Find the best number of splines for each prespecified feature.

Parameters

weightsdict

A dictionary containing all leaf values for all utilities and all features.

data_trainpandas DataFrame

The pandas DataFrame used for training.

data_testpandas DataFrame

The pandas DataFrame used for testing.

label_testpandas Series or numpy array

The labels of the dataset used for testing.

spline_utilitiesdict[list[str]]

A dictionary of lists. The dictionary should contain the index of alternatives as a str (i.e., ‘0’, ‘1’, …). The list contains features where splines will be applied.

mean_splinesbool, optional (default = False)

If True, the splines are computed at the mean distribution of data for stairs.

search_techniquestr, optional (default = ‘greedy’)

The technique used to search for the best number of splines. It can be ‘greedy’ (i.e., optimise one feature after another, while storing the feature value), ‘greedy_ranked’ (i.e., same as ‘greedy’ but starting with the feature with the largest utility range) or ‘feature_independant’.

Returns

best_splinesdict

A dictionary containing the optimal number of splines for each feature interpolated of each utility

ceint

The negative cross-entropy on the test set

rumboost.utility_smoothing.find_feat_best_fit(model, data, technique='weighted_data')[source]

Find the best fit among several functions according to the least-squares for all features.

Parameters

modelRUMBoost

A RUMBoost object.

datapandas DataFrame

The pandas DataFrame used for training.

techniquestr, optional (default = ‘weighted_data’)

The technique used to approximate the stair utility in data_leaf_values.

Returns

best_fitdict

A dictionary used to store the best fitting functions for all utilities and all features. For each utility and feature, the dictionary contains three keys:

best_func : the name of the best function
best_params : the parameters associated with the best function
best_score : the sum of the least squares score
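
The selection rule described above (fit several candidate functions, keep the one with the smallest sum of squared residuals) can be sketched in plain Python. The two candidate forms (linear and logarithmic), the helper names fit_line and best_fit, and the toy data are assumptions made for illustration; they are not the functions that rumboost actually fits.

```python
import math

def fit_line(u, y):
    """Closed-form least squares for y ~ a * u + b."""
    n = len(u)
    mean_u, mean_y = sum(u) / n, sum(y) / n
    cov = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
    var = sum((ui - mean_u) ** 2 for ui in u)
    a = cov / var
    return a, mean_y - a * mean_u

def best_fit(x, y):
    """Fit each hypothetical candidate and keep the one with the lowest SSE."""
    candidates = {"linear": lambda v: v, "log": math.log}  # illustrative only
    results = {}
    for name, transform in candidates.items():
        u = [transform(xi) for xi in x]  # log requires strictly positive x
        a, b = fit_line(u, y)
        sse = sum((a * ui + b - yi) ** 2 for ui, yi in zip(u, y))
        results[name] = {"best_params": (a, b), "best_score": sse}
    best_func = min(results, key=lambda k: results[k]["best_score"])
    return {"best_func": best_func, **results[best_func]}

x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [1.0 + 2.0 * math.log(xi) for xi in x]  # toy utility values
fit = best_fit(x, y)
```

The returned dictionary mirrors the three keys listed above for a single feature.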

rumboost.utility_smoothing.fit_func(data, weight, technique='weighted_data')[source]

Fit a function that minimises the least-squares.

Parameters

datapandas Series

The pandas Series containing data about the feature that is being fitted.

weightdict

The dictionary containing weights ordered for the feature being fitted.

techniquestr, optional (default = ‘weighted_data’)

The technique used to approximate the stair utility in data_leaf_values.

Returns

func_fitteddict

A dictionary with the name of the fitted function as key and its parameters as value.

fit_scoreint

The corresponding sum of least squares of the fit.

rumboost.utility_smoothing.mean_monotone_spline(x_data, x_mean, y_data, y_mean, num_splines=15)[source]

A function that applies monotonic spline interpolation on a given feature. The difference from monotone_spline is that the knots are placed on the closest stair means.

Parameters

x_datanumpy array

Data from the interpolated feature.

x_meannumpy array

The x coordinate of the vector of mean points at each stairs

y_datanumpy array

V(x_value), the values of the utility at x.

y_meannumpy array

The y coordinate of the vector of mean points at each stairs

Returns

x_splinenumpy array

A vector of x values used to plot the splines.

y_splinenumpy array

A vector of the spline values at x_spline.

pchipscipy.interpolate.PchipInterpolator

The scipy interpolator object from the monotonic splines.

rumboost.utility_smoothing.monotone_spline(x_spline, weights, num_splines=5, x_knots=None, y_knots=None)[source]

A function that applies monotonic spline interpolation on a given feature.

Parameters

x_splinenumpy array

Data from the interpolated feature.

weightsdict

The dictionary corresponding to the feature leaf values.

num_splinesint, optional (default=5)

The number of splines used for interpolation.

x_knotsnumpy array, optional (default=None)

The positions of knots. If None, linearly spaced.

y_knotsnumpy array, optional (default=None)

The value of the utility at knots. Needs to be specified if x_knots is passed.

Returns

x_splinenumpy array

A vector of x values used to plot the splines.

y_splinenumpy array

A vector of the spline values at x_spline.

pchipscipy.interpolate.PchipInterpolator

The scipy interpolator object from the monotonic splines.

x_knotsnumpy array

The positions of knots. If None, linearly spaced.

y_knotsnumpy array

The value of the utility at knots.
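
As an illustration of the monotone interpolation behind the returned scipy.interpolate.PchipInterpolator, here is a pure-Python sketch of a piecewise cubic Hermite interpolant whose knot derivatives come from a weighted harmonic mean of the neighbouring secant slopes. The function names pchip_slopes and pchip_eval and the knot values are hypothetical, and the one-sided end slopes are a simplifying assumption rather than SciPy's exact end-point rule.

```python
from bisect import bisect_right

def pchip_slopes(x, y):
    """Knot derivatives from the weighted harmonic mean of secant slopes."""
    n = len(x)
    h = [x[k + 1] - x[k] for k in range(n - 1)]
    delta = [(y[k + 1] - y[k]) / h[k] for k in range(n - 1)]
    d = [0.0] * n
    d[0], d[-1] = delta[0], delta[-1]  # simple one-sided end slopes (assumption)
    for k in range(1, n - 1):
        if delta[k - 1] * delta[k] > 0:  # same sign on both sides
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
        # otherwise the derivative stays at zero, which prevents overshoot
    return d

def pchip_eval(x, y, d, t):
    """Evaluate the cubic Hermite piece that contains t."""
    k = min(max(bisect_right(x, t) - 1, 0), len(x) - 2)
    h = x[k + 1] - x[k]
    s = (t - x[k]) / h
    h00 = (1 + 2 * s) * (1 - s) ** 2
    h10 = s * (1 - s) ** 2
    h01 = s * s * (3 - 2 * s)
    h11 = s * s * (s - 1)
    return h00 * y[k] + h10 * h * d[k] + h01 * y[k + 1] + h11 * h * d[k + 1]

x_knots = [0.0, 1.0, 2.0, 4.0]
y_knots = [0.0, -0.5, -0.6, -2.0]  # a decreasing utility, e.g. for travel time
slopes = pchip_slopes(x_knots, y_knots)
grid = [4.0 * i / 200 for i in range(201)]
curve = [pchip_eval(x_knots, y_knots, slopes, t) for t in grid]
```

Because every knot derivative stays within three times the adjacent secant slopes, the curve keeps the monotonicity of the knots, which is the property that makes this family of splines suitable for smoothing monotonic utilities.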

rumboost.utility_smoothing.optimal_knots_position(weights, dataset_train, dataset_test, labels_test, spline_utilities, num_spline_range, max_iter=100, optimize=True, deg_freedom=None, n_iter=1, x_first=None, x_last=None, mu=None, nests=None, fe_model=None)[source]

Find the optimal position of knots for a given number of knots for given attributes.

Parameters

weightsdict

A dictionary containing all leaf values for all utilities and all features.

dataset_trainpandas DataFrame

The pandas DataFrame used for training.

dataset_testpandas DataFrame

The pandas DataFrame used for testing.

labels_testpandas Series or numpy array

The labels of the dataset used for testing.

spline_utilitiesdict

A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.

num_splines_rangedict

A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for all features where splines are used.

max_iterint, optional (default=100)

The maximum number of iterations from the solver

optimizebool, optional (default=True)

If True, optimize the knots position with scipy.optimize.minimize.

deg_freedomint, optional (default=None)

The degree of freedom. If not specified, it is the number of knots to optimize.

n_iterint, optional (default=1)

The number of iterations, to leverage the randomness induced by the local minimizer.

x_firstlist, optional (default=None)

A list of all first knots in the order of the attributes from spline_utilities and num_splines_range.

x_lastlist, optional (default=None)

A list of all last knots in the order of the attributes from spline_utilities and num_splines_range.

mulist, optional (default=None)

Only used, and required, if nests is not None. It is the list of mu values for each nest. The first value corresponds to the first nest, and so on.

nestsdict, optional (default=None)

If not None, compute predictions with the nested probability function. The dictionary keys are alternative numbers and the values are their nest numbers. For example, {0:0, 1:1, 2:0} means that alternatives 0 and 2 are in nest 0 and alternative 1 is in nest 1.

fe_modelRUMBoost, optional (default=None)

The socio-economic characteristics part of the functional effect model.

Returns

x_optOptimizeResult

The result of scipy.optimize.minimize.

rumboost.utility_smoothing.optimise_splines(x_knots, weights, data_train, data_test, label_test, spline_utilities, num_spline_range, x_first=None, x_last=None, deg_freedom=None, mu=None, nests=None, fe_model=None)[source]

Function wrapper to find the optimal position of knots for each feature. The optimal position is the one that minimises the CE loss.

Parameters

x_knots1d np.array

The positions of knots in a 1d array, following this structure: np.array([x_att1_1, x_att1_2, … x_att1_m, x_att2_1, … x_attn_m]) where m is the number of knots and n the number of attributes that are interpolated with splines.

weightsdict

A dictionary containing all leaf values for all utilities and all features.

data_trainpandas DataFrame

The pandas DataFrame used for training.

data_testpandas DataFrame

The pandas DataFrame used for testing.

label_testpandas Series or numpy array

The labels of the dataset used for testing.

spline_utilitiesdict

A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.

num_splines_rangedict

A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for all features where splines are used.

x_firstlist, optional (default=None)

A list of all first knots in the order of the attributes from spline_utilities and num_splines_range.

x_lastlist, optional (default=None)

A list of all last knots in the order of the attributes from spline_utilities and num_splines_range.

mulist, optional (default=None)

Only used, and required, if nests is not None. It is the list of mu values for each nest. The first value corresponds to the first nest, and so on.

nestsdict, optional (default=None)

If not None, compute predictions with the nested probability function. The dictionary keys are alternative numbers and the values are their nest numbers. For example, {0:0, 1:1, 2:0} means that alternatives 0 and 2 are in nest 0 and alternative 1 is in nest 1.

fe_modelRUMBoost, optional (default=None)

The socio-economic characteristics part of the functional effect model.

Returns

loss: float

The final cross entropy or BIC on the test set.

rumboost.utility_smoothing.smooth_predict(data_test, util_collection, utilities=False, mu=None, nests=None, fe_model=None, target='choice')[source]

A prediction function that uses monotonic spline interpolation on some features to predict their utilities. The function should be used with a trained model only.

Parameters

data_testpandas DataFrame

A pandas DataFrame containing the observations that will be predicted.

util_collectiondict

A dictionary containing the type of utility to use for all features in all utilities.

utilitiesbool, optional (default = False)

If True, return the raw utilities.

mulist, optional (default=None)

Only used, and required, if nests is not None. It is the list of mu values for each nest. The first value corresponds to the first nest, and so on.

nestsdict, optional (default=None)

If not None, compute predictions with the nested probability function. The dictionary keys are alternative numbers and the values are their nest numbers. For example, {0:0, 1:1, 2:0} means that alternatives 0 and 2 are in nest 0 and alternative 1 is in nest 1.

fe_modelRUMBoost, optional (default=None)

The socio-economic characteristics part of the functional effect model.

Returns

predsnumpy array

A numpy array containing the predictions for each class for each observation. Predictions are computed through the softmax function, unless the raw utilities are requested. A prediction for class j for observation n will be U[n, j].
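
The softmax step mentioned above can be sketched without any dependency. The helper name softmax_rows and the utility values are hypothetical; the sketch only shows how a matrix of utilities U[n, j] turns into choice probabilities, not how smooth_predict builds the utilities.

```python
import math

def softmax_rows(utilities):
    """Row-wise softmax: one probability vector per observation."""
    probs = []
    for row in utilities:
        shift = max(row)  # subtracting the max avoids overflow in exp
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

U = [[0.0, 1.0, -1.0], [2.0, 2.0, 2.0]]  # toy utilities, 2 observations x 3 classes
P = softmax_rows(U)
```

Each row of P sums to one, and equal utilities give equal probabilities.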

rumboost.utility_smoothing.stairs_to_pw(model, train_data, data_to_transform=None, util_for_plot=False)[source]

DEPRECATED Transform a stair output to a piecewise linear prediction.

Parameters

modelRUMBoost

A trained RUMBoost object.

train_datapandas DataFrame

The full dataset used for training.

data_to_transformpandas DataFrame, optional (default = None)

The data that needs to be transformed for prediction. If None, the training dataset is used.

util_for_plotbool, optional (default = False)

If True, the output is formatted for plotting.

Returns

pw_utilitynumpy array or list

The piece-wise output. It is usually a numpy array, but can be a list if util_for_plot is True.

rumboost.utility_smoothing.updated_utility_collection(weights, data, num_splines_feat, spline_utilities, mean_splines=False, x_knots=None)[source]

Create a dictionary that stores what type of utility (smoothed or not) should be used for smooth_predict.

Parameters

weightsdict

A dictionary containing all leaf values for all utilities and all features.

datapandas DataFrame

The pandas DataFrame used for training.

num_splines_featdict

A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for all features where splines are used.

spline_utilitiesdict

A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.

mean_splinesbool, optional (default = False)

If True, the splines are computed at the mean distribution of data for stairs.

x_knotsdict

A dictionary in the form of {utility: {attribute: x_knots}} where x_knots are the spline knots for the corresponding utility and attributes

Returns

util_collectiondict

A dictionary containing the type of utility to use for all features in all utilities.

rumboost.utils module

rumboost.utils.accuracy(preds, labels)[source]

Compute accuracy of the model.

Parameters

predsnumpy array

Predictions for all data points and each class from a softmax function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.

labelsnumpy array

The labels of the original dataset, as int.

Returns

Accuracy: float

The computed accuracy, as a float.
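
A minimal sketch of the computation, assuming the predicted class is the one with the highest probability. The name accuracy_sketch is hypothetical and nested lists stand in for numpy arrays.

```python
def accuracy_sketch(preds, labels):
    """Share of observations whose most probable class equals the label."""
    hits = 0
    for row, label in zip(preds, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        hits += predicted == label
    return hits / len(labels)

preds = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
labels = [0, 2, 0]
acc = accuracy_sketch(preds, labels)  # two of the three rows are correct
```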

rumboost.utils.bio_to_rumboost(model, all_columns=False, monotonic_constraints=True, interaction_contraints=True, fct_effect_variables=[])[source]

Converts a biogeme model to a rumboost dict.

Parameters

modela BIOGEME object

The model used to create the rumboost structure dictionary.

all_columnsbool, optional (default = False)

If True, do not consider alternative-specific features.

monotonic_constraintsbool, optional (default = True)

If False, do not consider monotonic constraints.

interaction_contraintsbool, optional (default = True)

If False, do not consider feature interactions constraints.

fct_effect_variableslist, optional (default = [])

The list of variables in the functional effect part of the model

Returns

rum_structuredict

A dictionary specifying the structure of a RUMBoost object.

rumboost.utils.compute_VoT(util_collection, u, f1, f2)[source]

Compute the Value of Time (VoT) for a time attribute and a cost attribute of a given utility.

Parameters

util_collectiondict

A dictionary containing the type of utility to use for all features in all utilities.

ustr

The utility number, as a str (e.g. ‘0’, ‘1’, …).

f1str

The time-related attribute name.

f2str

The cost-related attribute name.

Returns

VoTlambda function

The function calculating the value of time from the time attribute f1 and the cost attribute f2.
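
For orientation, the usual definition of the value of time is the ratio of the marginal utility of time to the marginal utility of cost. The sketch below builds such a lambda from two derivative callables; the helper name value_of_time, the constant derivatives and the units are assumptions, and this may differ from how compute_VoT obtains the derivatives from util_collection.

```python
def value_of_time(dv_dtime, dv_dcost):
    """Return a function giving (dV/dtime) / (dV/dcost) at given attribute values."""
    return lambda time, cost: dv_dtime(time) / dv_dcost(cost)

# Hypothetical linear utility: -0.04 per minute and -0.2 per currency unit.
vot = value_of_time(lambda t: -0.04, lambda c: -0.2)
vot_per_hour = 60 * vot(30.0, 5.0)  # currency units per hour
```

With smoothed utilities the two callables would be the derivatives of the fitted curves, so the value of time varies with the attribute values instead of being constant.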

rumboost.utils.create_name(features)[source]

Create new feature names from a list of feature names

rumboost.utils.cross_entropy(preds, labels)[source]

Compute negative cross entropy for given predictions and data.

Parameters

preds: numpy array

Predictions for all data points and each class from a softmax function. preds[i, j] corresponds to the predicted probability that data point i belongs to class j.

labels: numpy array

The labels of the original dataset, as int.

Returns

Cross entropyfloat

The negative cross-entropy, as float.
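
A minimal sketch, assuming the conventional definition: the mean negative log-probability assigned to the observed class. The name cross_entropy_sketch is hypothetical, and the exact sign and averaging convention used by the library is an assumption here.

```python
import math

def cross_entropy_sketch(preds, labels):
    """Mean of -log(p) for the probability given to each observed label."""
    total = sum(math.log(row[label]) for row, label in zip(preds, labels))
    return -total / len(labels)

preds = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
ce = cross_entropy_sketch(preds, labels)
```

Lower values indicate a better fit; a perfect prediction gives zero.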

rumboost.utils.cross_nested_probs(raw_preds, mu, alphas)[source]

Compute cross-nested predictions.

Parameters

raw_preds :

The raw predictions from the booster

mu :

The list of mu values for each nest. The first value corresponds to the first nest, and so on.

alphas :

An array of J (alternatives) by M (nests). alpha_jn represents the degree of membership of alternative j to nest n. For example, alpha_12 = 0.5 means that alternative 1 belongs 50% to nest 2.

Returns

raw_preds :

The cross nested predictions

pred_i_m :

The prediction of choosing alt i knowing nest m

pred_m :

The prediction of choosing nest m

rumboost.utils.data_leaf_value(data, weights_feature, technique='data_weighted')[source]

Computes the utility values of given data, according to the prespecified technique.

Parameters

datapandas.Series

The column of the dataframe associated with the feature.

weights_featuredict

The dictionary corresponding to the feature leaf values.

techniquestr, optional (default = ‘data_weighted’)

The technique used to compute data values. It can be:

data_weighted : feature data and its utility values.
mid_point : the mid point in between all splitting points.
mean_data : the mean of data in between all splitting points.
mid_point_weighted : the mid points in between all splitting points, weighted by the number of data points in the interval.
mean_data_weighted : the mean of data in between all splitting points, weighted by the number of data points in the interval.

Returns

data_orderednumpy array

X coordinates of the data, or feature data point values.

data_valuesnumpy array

Y coordinates of the data, or utility values

rumboost.utils.find_disc(x_values, grad)[source]

Find discontinuities in the values of a given feature. The angle must be smaller than 0.2 radian and the slope bigger than 5. Values are normalised.

Parameters

x_valuesnumpy array

X coordinates of the point to find discontinuities.

gradnumpy array

A vector with gradient values at each given point.

Returns

discnumpy array

The coordinates of discontinuities.

disc_idxnumpy array

The index of discontinuities.

num_discint

The number of discontinuities.

rumboost.utils.function_2d(weights_2d, x_vect, y_vect)[source]

Create the nonlinear contour plot for parameters, from weights gathered in getweights_v2

Parameters

weights_2ddict

Pandas DataFrame containing all possible rectangles with their corresponding area values, for the given feature and utility.

x_vectnumpy array

Vector of higher level feature.

y_vectnumpy array

Vector of lower level feature.

Returns

contour_plot_valuesnumpy array

Array with values at (x,y) points.

rumboost.utils.get_angle_diff(x_values, y_values)[source]

Computes the angle between three given points.

Parameters

x_valuesnumpy array

X coordinates of the point to compute the angle.

y_valuesnumpy array

Y coordinates of the point to compute the angle.

Returns

diff_anglelist

A list containing all vectors for each subsequent three points.

rumboost.utils.get_asc(weights, alt_to_normalise='Driving', alternatives={'Cycling': '1', 'Driving': '3', 'Public Transport': '2', 'Walking': '0'})[source]

Retrieve the ASCs from a dictionary of leaf values per alternative and per feature.

rumboost.utils.get_child(model, weights, weights_2d, weights_market, tree, split_points, features, feature_names, i, market_segm, direction=None)[source]

Dig into the tree to get splitting points, features, and left and right leaf values.

rumboost.utils.get_grad(x, y, technique='slope', sample_points=30, normalise=False)[source]

Computes the arc gradient according to the prespecified technique.

Parameters

xnumpy array

X coordinates of the point to compute the gradient.

ynumpy array

Y coordinates of the point to compute the gradient.

techniquestr, optional (default = slope)

The technique used to compute data values. It can be:

slope : compute the slope as gradient between each point.
sample_data : compute the slope between uniformly distributed sampled data.

Returns

gradnumpy array

A vector with gradient values at each given point.

x_samplenumpy array

The x coordinates of the sampled points if the technique is sample_data.

y_samplenumpy array

The y coordinates of the sampled points if the technique is sample_data.

rumboost.utils.get_mean_pos(data, split_points)[source]

Return the mean point in-between two split points for a specific feature (used in smoothing). At end points, it is the mean of data before the first split point, and after the last split point.

Parameters

datapandas.Series

The column of the dataframe associated with the feature.

split_pointslist

The list of split points for that feature.

Returns

mean_datalist

A list of points in the mean of every consecutive split points.
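
The idea can be sketched by bucketing the data between consecutive split points and averaging each bucket. The name mean_positions is hypothetical; sending values equal to a split point to the left interval and skipping empty intervals are assumptions of this sketch.

```python
from bisect import bisect_left

def mean_positions(data, split_points):
    """Mean of the data falling in each interval delimited by the split points."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for value in data:
        # bisect_left sends a value equal to a split point to the left interval
        buckets[bisect_left(split_points, value)].append(value)
    return [sum(b) / len(b) for b in buckets if b]  # empty intervals are skipped

data = [1.0, 2.0, 3.0, 6.0, 8.0, 12.0]
means = mean_positions(data, [2.5, 7.0])
```

The first and last entries are the means of the data before the first and after the last split point, as described above.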

rumboost.utils.get_mid_pos(data, split_points, end='data')[source]

Return the mid point in-between two split points for a specific feature (used in pw linear predict).

Parameters

data: pandas Series

The column of the dataframe associated with the feature.

split_pointslist

The list of split points for that feature.

endstr, optional (default = ‘data’)

How to compute the mid position of the first and last point. It can be:

- ‘data’: add min and max values of data
- ‘split point’: add first and last split points
- ‘mean_data’: add the mean of data before the first split point, and after the last split point

Returns

mid_poslist

A list of points in the middle of every consecutive split points.
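
One possible reading of the three end options, sketched in plain Python: the mid points of consecutive split points are computed first, then one end value is added on each side. The name mid_positions is hypothetical and the exact handling of the ends in the library may differ.

```python
def mid_positions(data, split_points, end="data"):
    """Mid points between consecutive split points, plus one value at each end."""
    mids = [(a + b) / 2 for a, b in zip(split_points, split_points[1:])]
    if end == "data":
        first, last = min(data), max(data)
    elif end == "split point":
        first, last = split_points[0], split_points[-1]
    elif end == "mean_data":
        below = [v for v in data if v <= split_points[0]]
        above = [v for v in data if v > split_points[-1]]
        first, last = sum(below) / len(below), sum(above) / len(above)
    else:
        raise ValueError(f"unknown end option: {end!r}")
    return [first] + mids + [last]

data = [0.0, 1.0, 4.0, 9.0, 10.0]
splits = [2.0, 6.0, 8.0]
by_data = mid_positions(data, splits, end="data")
by_split = mid_positions(data, splits, end="split point")
by_mean = mid_positions(data, splits, end="mean_data")
```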

rumboost.utils.get_pair(parent)[source]

Return beta and variable names as a tuple from a parent expression.

rumboost.utils.get_weights(model)[source]

Get leaf values from a RUMBoost model.

Parameters

modelRUMBoost

A trained RUMBoost object.

Returns

weights_dfpandas DataFrame

DataFrame containing all split points and their corresponding left and right leaf values, for all features.

weights_2d_dfpandas DataFrame

Dataframe with weights arranged for a 2d plot, used in the case of 2d feature interaction.

weights_marketpandas DataFrame

Dataframe with weights arranged for market segmentation, used in the case of market segmentation.

rumboost.utils.map_x_knots(x_knots, num_splines_range, x_first=None, x_last=None)[source]

Map the 1d array of x_knots into a dictionary with utility and attributes as keys.

Parameters

x_knots1d np.array

The positions of knots in a 1d array, following this structure: np.array([x_att1_1, x_att1_2, … x_att1_m, x_att2_1, … x_attn_m]) where m is the number of knots and n the number of attributes that are interpolated with splines.

num_splines_range: dict

A dictionary, in the same format as weights, of the feature names of each utility that are interpolated with monotonic splines. The key is a spline-interpolated feature name, and the value is the number of splines used for interpolation as an int. There should be a key for all features where splines are used.

x_firstlist, optional (default=None)

A list of all first knots in the order of the attributes from spline_utilities and num_splines_range.

x_lastlist, optional (default=None)

A list of all last knots in the order of the attributes from spline_utilities and num_splines_range.

Returns

x_knots_dictdict

A dictionary in the form of {utility: {attribute: x_knots}} where x_knots are the spline knots for the corresponding utility and attributes
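
The mapping from the flat array to the nested dictionary can be sketched as follows. The name map_knots is hypothetical, the counts are assumed to give the number of knots stored per attribute in a {utility: {attribute: count}} layout, and the x_first and x_last arguments are left out of this sketch.

```python
def map_knots(x_knots, num_splines_range):
    """Slice a flat sequence of knots into {utility: {attribute: knots}}."""
    x_knots_dict, start = {}, 0
    for utility, attributes in num_splines_range.items():
        x_knots_dict[utility] = {}
        for attribute, n_knots in attributes.items():
            # consume the next n_knots entries for this attribute
            x_knots_dict[utility][attribute] = list(x_knots[start:start + n_knots])
            start += n_knots
    return x_knots_dict

flat = [0.0, 5.0, 10.0, 1.0, 2.0]
layout = {"0": {"travel_time": 3}, "1": {"cost": 2}}  # hypothetical attribute names
mapped = map_knots(flat, layout)
```

A flat array is what a numerical optimiser works on, while the nested form is easier to use attribute by attribute; the order of the dictionary keys must therefore match the order used to build the flat array.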

rumboost.utils.nest_probs(raw_preds, mu, nests)[source]

Compute nested predictions.

Parameters

raw_preds :

The raw predictions from the booster

mu :

The list of mu values for each nest. The first value corresponds to the first nest, and so on.

nests :

The dictionary keys are alternative numbers and the values are their nest numbers. For example, {0:0, 1:1, 2:0} means that alternatives 0 and 2 are in nest 0 and alternative 1 is in nest 1.

Returns

preds.T :

The nested predictions

pred_i_m :

The prediction of choosing alt i knowing nest m

pred_m :

The prediction of choosing nest m
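
The three returned quantities can be illustrated with the textbook nested logit formulas for a single observation, where the top-level scale is normalised to one. The name nested_probs and the utility values are hypothetical, and the library may use a different normalisation or array layout.

```python
import math

def nested_probs(utilities, mu, nests):
    """Nested logit for one observation: P(i) = P(i | m) * P(m)."""
    sums = [0.0] * len(mu)
    for alt, v in enumerate(utilities):
        sums[nests[alt]] += math.exp(mu[nests[alt]] * v)
    # nest weight exp(logsum / mu_m), written as a power of the nest sum
    weights = [s ** (1.0 / mu[m]) for m, s in enumerate(sums)]
    pred_m = [w / sum(weights) for w in weights]
    pred_i_m = [
        math.exp(mu[nests[alt]] * v) / sums[nests[alt]]
        for alt, v in enumerate(utilities)
    ]
    preds = [pred_i_m[alt] * pred_m[nests[alt]] for alt in range(len(utilities))]
    return preds, pred_i_m, pred_m

V = [0.5, 0.0, -0.5]  # toy utilities for three alternatives
nests = {0: 0, 1: 1, 2: 0}
preds, pred_i_m, pred_m = nested_probs(V, [1.5, 1.0], nests)
mnl, _, _ = nested_probs(V, [1.0, 1.0], nests)  # mu = 1 collapses to softmax
denom = sum(math.exp(v) for v in V)
softmax = [math.exp(v) / denom for v in V]
```

Setting every mu to one recovers the multinomial logit, which is a convenient sanity check.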

rumboost.utils.non_lin_function(weights_ordered, x_min, x_max, num_points)[source]

Create the nonlinear function for parameters, from weights ordered by ascending splitting points.

Parameters

weights_ordereddict

Dictionary containing splitting points and corresponding cumulative weights value for a specific feature’s parameter.

x_minfloat, int

Minimum x value for which the nonlinear function is computed.

x_maxfloat, int

Maximum x value for which the nonlinear function is computed.

num_pointsint

Number of points used to draw the nonlinear function line.

Returns

x_valueslist

X values for which the function will be plotted.

nonlin_functionlist

Values of the function at the corresponding x points.
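
A sketch of how such a piece-wise constant function can be evaluated on a regular grid. The name step_function and the two-list layout (split points plus one cumulative value per interval) are assumptions made for illustration; the actual weights_ordered dictionary may be organised differently.

```python
from bisect import bisect_left

def step_function(split_points, values, x_min, x_max, num_points):
    """Evaluate a step function that takes values[k] on the k-th interval."""
    step = (x_max - x_min) / (num_points - 1)
    x_values = [x_min + i * step for i in range(num_points)]
    # a point equal to a split point is assigned to the interval on its left
    y_values = [values[bisect_left(split_points, x)] for x in x_values]
    return x_values, y_values

# Two split points delimit three intervals, hence three values.
xs, ys = step_function([2.0, 5.0], [0.0, -0.3, -0.9], 0.0, 6.0, 7)
```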

rumboost.utils.process_parent(parent, pairs)[source]

Dig into the biogeme expression to retrieve the name of the variable and the beta parameter. Works only with simple utility specifications (beta * variable).

rumboost.utils.stratified_group_k_fold(X, y, groups, k, seed=None)[source]
rumboost.utils.utility_ranking(weights, spline_utilities)[source]

Rank attributes utility importance by their utility range. The first rank is the attribute having the largest max(V(x)) - min(V(x)).

Parameters

weightsdict

A dictionary containing all the split points and leaf values for all attributes, for all utilities.

spline_utilitiesdict

A dictionary containing attributes where splines are applied. Must be in the form {utility_indx: [attributes1, attributes2, …], …}.

Returns

util_ranks_ascendlist of tuple

A list of tuples where the first tuple is the one with the largest utility range. Each tuple is composed of the utility and the name of the attribute.
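
The ranking rule, max(V(x)) - min(V(x)) in decreasing order, can be sketched as follows. The name rank_by_utility_range and the simplified input (a list of utility values per attribute instead of the full weights dictionary) are assumptions.

```python
def rank_by_utility_range(leaf_values, spline_utilities):
    """Sort (utility, attribute) pairs by decreasing utility range."""
    ranges = []
    for utility, attributes in spline_utilities.items():
        for attribute in attributes:
            values = leaf_values[utility][attribute]
            ranges.append((max(values) - min(values), utility, attribute))
    ranges.sort(key=lambda item: item[0], reverse=True)
    return [(utility, attribute) for _, utility, attribute in ranges]

leaf_values = {  # hypothetical utility values per attribute
    "0": {"travel_time": [0.0, -0.4, -1.5], "cost": [0.0, -0.2]},
    "1": {"travel_time": [0.0, -0.1, -0.6]},
}
ranking = rank_by_utility_range(
    leaf_values, {"0": ["travel_time", "cost"], "1": ["travel_time"]}
)
```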

rumboost.utils.weights_to_plot_v2(model, market_segm=False)[source]

Arrange weights by ascending splitting points and cumulative sum of weights.

Parameters

modelRUMBoost

A trained RUMBoost object.

Returns

weights_for_plotdict

Dictionary containing splitting points and corresponding cumulative weights value for all features.

Module contents