CoresetTreeServiceLR ¤

CoresetTreeServiceLR(*, data_manager=None, data_params=None, n_instances=None, max_memory_gb=None, optimized_for, chunk_size=None, chunk_by=None, coreset_size=None, coreset_params=None, working_directory=None, cache_dir=None, node_train_function=None, node_train_function_params=None, node_metadata_func=None, chunk_sample_ratio=None, model_cls=None)

Bases: CoresetTreeServiceSupervisedMixin, CoresetTreeService

Subclass of CoresetTreeService for Linear Regression. A service class for creating a coreset tree and working with it. optimized_for is a required parameter defining the main usage of the service: 'training', 'cleaning' or both, optimized_for=['training', 'cleaning']. The service will decide whether to build an actual Coreset Tree or to build a single Coreset over the entire dataset, based on the triplet: n_instances, max_memory_gb and the 'number of features' (deduced from the dataset). The chunk_size and coreset_size will be deduced based on the above triplet too. In case chunk_size and coreset_size are provided, they will override all above mentioned parameters (less recommended).

Parameters:

Name Type Description Default
data_manager DataManagerT

DataManagerBase subclass, optional. The class used to interact with the provided data and store it locally. By default, only the sampled data is stored in HDF5 files format.

None
data_params Union[DataParams, dict]

DataParams, optional. Data preprocessing information.

None
n_instances int

int. The total number of instances that are going to be processed (can be an estimation). This parameter is required and is the only one of the above-mentioned triplet that isn't deduced from the data.

None
max_memory_gb int

int, optional. The maximum memory in GB that should be used. When not provided, the server's total memory is used. In any case only 80% of the provided memory or the server's total memory is considered.

None
optimized_for Union[list, str]

str or list. Either 'training', 'cleaning' or both ['training', 'cleaning']. The main usage of the service.

required
chunk_size Union[dict, int]

int, optional. The number of instances to be used when creating a coreset node in the tree. When defined, it will override the parameters of optimized_for, n_instances, n_classes and max_memory_gb. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory).

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
coreset_size Union[int, float, dict]

int or float, optional. Represents the coreset size of each node in the coreset tree. If provided as a float, it represents the ratio between each chunk and the resulting coreset. In any case the coreset_size is limited to 60% of the chunk_size. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, in practice, the actual size of the coreset is mostly smaller than coreset_size.

None
coreset_params Union[CoresetParams, dict]

CoresetParams or dict, optional. Coreset algorithm specific parameters.

None
node_train_function Callable[[ndarray, ndarray, ndarray], Any]

Callable, optional. method for training model at tree node level.

None
node_train_function_params dict

dict, optional. kwargs to be used when calling node_train_function.

None
node_metadata_func Callable[[Tuple[ndarray], ndarray, Union[list, None]], Union[list, dict, None]]

callable, optional. A method for storing user meta data on each node.

None
working_directory Union[str, PathLike]

str, path, optional. Local directory where intermediate data is stored.

None
cache_dir Union[str, PathLike]

str, path, optional. For internal use when loading a saved service.

None
chunk_sample_ratio float

float, optional. Indicates the size of the sample that will be taken and saved from each chunk on top of the Coreset for the validation methods. The values are from the range [0,1]. For example, chunk_sample_ratio=0.5, means that 50% of the data instances from each chunk will be saved.

None
model_cls Any

A Scikit-learn compatible model class, optional. The model class used to train the model on the coreset, in case a specific model instance wasn't passed to fit or the validation methods. The default model class is sklearn's LinearRegression.

None
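
The coreset_size rules above (a float is read as a ratio of the chunk, the result is limited to 60% of chunk_size, and sampling with repetition tends to leave fewer unique instances) can be illustrated with a minimal sketch. The helpers resolve_coreset_size and sample_by_importance are hypothetical names that are not part of the library, and the importance values are invented.

```python
import random


def resolve_coreset_size(chunk_size, coreset_size):
    """Hypothetical helper: turn coreset_size into an absolute, capped count."""
    if isinstance(coreset_size, float):
        # A float is interpreted as the coreset-to-chunk ratio.
        size = round(chunk_size * coreset_size)
    else:
        size = coreset_size
    # The documented cap: at most 60% of the chunk size.
    return min(size, round(chunk_size * 0.6))


def sample_by_importance(importances, size, seed=0):
    """Hypothetical helper: draw `size` indices with replacement, weighted by importance."""
    rng = random.Random(seed)
    drawn = rng.choices(range(len(importances)), weights=importances, k=size)
    # Repeated draws collapse, so the unique count can be below `size`.
    return sorted(set(drawn))


size = resolve_coreset_size(chunk_size=1000, coreset_size=0.2)
capped = resolve_coreset_size(chunk_size=1000, coreset_size=900)
unique = sample_by_importance([1.0] * 900 + [50.0] * 100, size)
```

This is only an illustration of the documented rules, not the service's actual sampling code.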

auto_preprocessing ¤

auto_preprocessing(X=None, sparse_output=False, copy=False)

Apply auto-preprocessing on the provided prediction test data, similarly to the way it is done by the fit or get_coreset methods. Preprocessing includes handling missing values and categorical encoding.

Parameters:

Name Type Description Default
X Union[Iterable, Iterable[Iterable]]

array like or iterator of arrays like. An array or an iterator of features.

None
sparse_output bool

boolean, default False. When set to True, the function will create a sparse matrix after preprocessing.

False
copy bool

boolean, default False. False (default) - Input data might be updated as result of this function. True - Data is copied before processing (impacts memory).

False

Returns:

Type Description

A DataFrame of the preprocessed data.

build ¤

build(X, y, indices=None, props=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None, verbose=1)

Create a coreset tree from the parameters X, y, indices and props (properties). build functions may be called only once. To add more data to the coreset tree use one of the partial_build functions. Categorical features are automatically one-hot encoded and missing values are automatically handled.

Parameters:

Name Type Description Default
X Union[Iterable, Iterable[Iterable]]

array like or iterator of arrays like. An array or an iterator of features. Categorical features are automatically one-hot encoded and missing values are automatically handled.

required
y Union[Iterable[Any], Iterable[Iterable[Any]]]

array like or iterator of arrays like. An array or an iterator of targets.

required
indices Union[Iterable[Any], Iterable[Iterable[Any]]]

array like or iterator of arrays like, optional. An array or an iterator with indices of X.

None
props Union[Iterable[Any], Iterable[Iterable[Any]]]

array like or iterator of arrays like, optional. An array or an iterator of properties. Properties won’t be used to compute the Coreset or train the model, but it is possible to filter_out_samples on them or to pass them in the select_from_function of get_cleaning_samples.

None
chunk_size int

int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory).

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
copy bool

boolean, default False. False (default) - Input data might be updated as result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory).

False
n_jobs int

Default: number of CPUs. Number of jobs to run in parallel during build.

None
verbose int

int, optional. The verbose level for printing build progress: 0 - silent, 1 - (default) print.

1

Returns:

Type Description
CoresetTreeService

self
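
Since X and y may each be given as an iterator of arrays, the batches have to line up position by position. The following is a minimal sketch of preparing such aligned batches; it uses plain Python lists as stand-ins for arrays, the helper iter_batches is a hypothetical name, and the service object mentioned in the final comment is assumed to exist and is not constructed here.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each holding at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


features = [[float(i), float(i % 3)] for i in range(10)]  # toy feature rows
targets = [2.0 * row[0] + 1.0 for row in features]        # toy regression targets

X_batches = list(iter_batches(features, 4))
y_batches = list(iter_batches(targets, 4))

# Each X batch must line up with the y batch at the same position.
batch_lengths = [(len(xb), len(yb)) for xb, yb in zip(X_batches, y_batches)]

# With a built service object (not constructed here), the iterators would be
# passed as: service.build(X=iter(X_batches), y=iter(y_batches))
```

According to the chunk_size description, passing chunk_size=0 would make the nodes follow these input batches.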

build_from_df ¤

build_from_df(datasets, target_datasets=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None, verbose=1)

Create a coreset tree from pandas DataFrame(s). build functions may be called only once. To add more data to the coreset tree use one of the partial_build functions. Categorical features are automatically one-hot encoded and missing values are automatically handled.

Parameters:

Name Type Description Default
datasets Union[Iterator[DataFrame], DataFrame]

pandas DataFrame or a DataFrame iterator. Data includes features, may include labels and may include indices.

required
target_datasets Union[Iterator[Union[DataFrame, Series]], DataFrame, Series]

pandas DataFrame or a DataFrame iterator, optional. Use when data is split to features and target. Should include only one column.

None
chunk_size int

int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory).

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
copy bool

boolean, default False. False (default) - Input data might be updated as result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory).

False
n_jobs int

Default: number of CPUs. Number of jobs to run in parallel during build.

None
verbose int

int, optional. The verbose level for printing build progress: 0 - silent, 1 - (default) print.

1

Returns:

Type Description
CoresetTreeService

self

build_from_file ¤

build_from_file(file_path, target_file_path=None, *, reader_f=pd.read_csv, reader_kwargs=None, reader_chunk_size_param_name=None, chunk_size=None, chunk_by=None, n_jobs=None, verbose=1)

Create a coreset tree based on data taken from local storage. build functions may be called only once. To add more data to the coreset tree use one of the partial_build functions. Categorical features are automatically one-hot encoded and missing values are automatically handled.

Parameters:

Name Type Description Default
file_path Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]

file, list of files, directory, list of directories. Path(s) to the place where data is stored. Data includes features, may include targets and may include indices.

required
target_file_path Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]

file, list of files, directory, list of directories, optional. Use when the dataset files are split to features and target. Each file should include only one column.

None
reader_f Callable

pandas like read method, optional, default pandas read_csv. For example, to read excel files use pandas read_excel.

read_csv
reader_kwargs dict

dict, optional. Keyword arguments used when calling reader_f method.

None
reader_chunk_size_param_name str

str, optional. The name of the reader_f input parameter used for reading the file in chunks. When not provided, the service tries to detect it automatically. Based on the data, the optimal chunk size to read is determined and passed through this parameter when calling reader_f. Use "ignore" to skip the automatic chunk reading logic.

None
chunk_size int

int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory).

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
n_jobs int

Default: number of CPUs. Number of jobs to run in parallel during build.

None
verbose int

int, optional. The verbose level for printing build progress: 0 - silent, 1 - (default) print.

1

Returns:

Type Description
CoresetTreeService

self

cross_validate ¤

cross_validate(level=None, model=None, scoring=None, return_model=False, verbose=0, preprocessing_stage='auto', sparse_threshold=0.01, **model_params)

Method for cross-validation on the coreset tree. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
level int

int, optional. The level of the tree on which the training and validation will be performed. If None, the best level will be selected.

None
model Any

A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. The model class needs to implement the usual scikit-learn interface. Default: instantiate the service model class using input model_params.

None
scoring Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]

callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used.

None
return_model bool

bool, optional. If True, the trained model is also returned.

False
verbose int

int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of folds and hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score is also displayed.

0
preprocessing_stage Union[str, None]

string, optional, default auto.

The different stages reflect the data preprocessing workflow.

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'auto'
sparse_threshold float

float, optional, default 0.01. Creates a sparse matrix from the features (X), if the data density after preprocessing is below sparse_threshold, otherwise, will create an array. (Applicable only for preprocessing_stage='auto').

0.01
model_params

kwargs, optional. The hyper-parameters of the model. If not provided, the default values are used.

{}

Returns:

Type Description
Union[List[float], Tuple[List[float], List[BaseEstimator]]]

A list of scores, one for each fold. If return_model=True, a list of trained models is also returned (one model for each fold).
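
A callable passed as scoring takes (model, X, y) and returns a scalar. The sketch below shows that shape with a toy model; MeanModel and neg_mean_absolute_error are illustrative names defined here, not library objects, and the data is invented.

```python
class MeanModel:
    """Toy stand-in for a fitted estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mean_absolute_error(model, X, y):
    """Scorer with the (model, X, y) signature; higher is better, so the error is negated."""
    predictions = model.predict(X)
    return -sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 6.0]
score = neg_mean_absolute_error(MeanModel().fit(X, y), X, y)
```

A function of this form could then be supplied as the scoring argument in place of a Scikit-learn scorer name.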

filter_out_samples ¤

filter_out_samples(filter_function, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)

Remove samples from the coreset tree, based on the provided filter function. The coreset tree is automatically updated to accommodate to the changes.

Parameters:

Name Type Description Default
filter_function Callable[[Iterable, Iterable, Union[Iterable, None], Union[Iterable, None]], Iterable[Any]]

function. A function that returns a list of indices to be removed from the tree. The function should accept 4 parameters as input: indices, X, y, props and return a list (iterator) of indices to be removed from the coreset tree. For example, in order to remove all instances with a target equal to 6, use the following function: filter_function = lambda indices, X, y, props : indices[y == 6].

required
force_resample_all Optional[int]

int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_sensitivity_recalc Optional[int]

int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - If self.chunk_sample_ratio<1, one level above leaf node level; if self.chunk_sample_ratio=1, leaf level. 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_do_nothing Optional[bool]

bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called.

False
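
A filter function receives indices, X, y and props and returns the indices to remove. The sketch below writes the "target equal to 6" example as a named function over plain Python sequences; the data is invented, and with NumPy arrays the boolean-mask form indices[y == 6] expresses the same selection.

```python
def remove_target_six(indices, X, y, props):
    """Return the indices of all instances whose target equals 6."""
    # Note the equality test `==`; a single `=` would be a syntax error here.
    return [idx for idx, target in zip(indices, y) if target == 6]


indices = [10, 11, 12, 13]
X = [[0.1], [0.2], [0.3], [0.4]]
y = [6, 3, 6, 1]
to_remove = remove_target_six(indices, X, y, props=None)

# With a built service object (not constructed here), the function would be
# passed as: service.filter_out_samples(remove_target_six)
```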

fit ¤

fit(level=0, seq_from=None, seq_to=None, model=None, preprocessing_stage='auto', sparse_threshold=0.01, **model_params)

Fit a model on the coreset tree. This model will be used when predict and predict_proba are called. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
level int

Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used.

0
seq_from Any

string/datetime, optional. The starting sequence of the training set.

None
seq_to Any

string/datetime, optional. The ending sequence of the training set.

None
model Any

A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. Default: instantiate the service model class using input model_params.

None
preprocessing_stage Union[str, None]

string, optional, default auto.

The different stages reflect the data preprocessing workflow.

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'auto'
sparse_threshold float

float, optional, default 0.01. Creates a sparse matrix from the features (X), if the data density after preprocessing is below sparse_threshold, otherwise, will create an array. (Applicable only for preprocessing_stage='auto').

0.01
model_params

Model hyperparameters kwargs. Input when instantiating default model class.

{}

Returns:

Type Description

Fitted estimator.
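
The sparse_threshold rule compares the density of the preprocessed features (the share of non-zero entries) with the threshold. A minimal sketch of that comparison follows; density and choose_representation are hypothetical helpers written for illustration and do not reflect how the service computes it internally.

```python
def density(rows):
    """Fraction of non-zero entries in a dense row-major matrix."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value != 0)
    return nonzero / total


def choose_representation(rows, sparse_threshold=0.01):
    """Hypothetical helper mirroring the documented rule for sparse_threshold."""
    return "sparse" if density(rows) < sparse_threshold else "array"


# 1 non-zero out of 200 entries: density 0.005, below the 0.01 default.
mostly_zero = [[0.0] * 20 for _ in range(10)]
mostly_zero[0][0] = 1.0

# 2 non-zeros out of 4 entries: density 0.5, well above the default.
dense = [[1.0, 0.0], [0.0, 1.0]]

choices = (choose_representation(mostly_zero), choose_representation(dense))
```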

get_cleaning_samples ¤

get_cleaning_samples(size=None, ignore_indices=None, select_from_indices=None, select_from_function=None, ignore_seen_samples=True)

Returns indices of samples in descending order of importance. Useful for identifying mislabeled instances and other anomalies in the data. size must be provided. Function must be called after build. This function is only applicable in case the coreset tree was optimized_for 'cleaning'. This function is not for retrieving the coreset (use get_coreset in this case).

Parameters:

Name Type Description Default
size int

int. Number of samples to return. Although the default is None, a value must be provided.

None
ignore_indices Iterable

array-like, optional. An array of indices to ignore when selecting cleaning samples.

None
select_from_indices Iterable

array-like, optional. An array of indices to consider when selecting cleaning samples.

None
select_from_function Callable[[Iterable, Iterable, Union[Iterable, None], Union[Iterable, None]], Iterable[Any]]

function, optional. Pass a function in order to limit the selection of the cleaning samples accordingly. The function should accept 4 parameters as input: indices, X, y, props, and return a list (iterator) of the desired indices.

None
ignore_seen_samples bool

bool, optional, default True. Exclude already seen samples and set the seen flag on any indices returned by the function.

True

Returns:

Name Type Description
Dict Union[ValueError, dict]

A dictionary with the following keys:

- idx: array-like[int]. Cleaning samples indices.

- X: array-like[int]. X array.

- y: array-like[int]. y array.

- importance: array-like[float]. The importance property. Instances that receive a high Importance in the Coreset computation require attention, as they usually indicate a labeling error, anomaly, out-of-distribution problem or other data-related issue.
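
The selection logic described for this method (descending importance, restricted by ignore_indices and select_from_indices, cut to size) can be sketched as follows. pick_cleaning_samples is a hypothetical helper, the importance values are invented, and the real method additionally tracks seen samples and works on the coreset tree.

```python
def pick_cleaning_samples(importance_by_index, size, ignore_indices=(), select_from_indices=None):
    """Hypothetical helper: top-`size` indices by importance, after filtering."""
    ignored = set(ignore_indices)
    allowed = None if select_from_indices is None else set(select_from_indices)
    candidates = [
        (idx, imp) for idx, imp in importance_by_index.items()
        if idx not in ignored and (allowed is None or idx in allowed)
    ]
    # Highest importance first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    top = candidates[:size]
    return {"idx": [idx for idx, _ in top], "importance": [imp for _, imp in top]}


importance = {0: 0.2, 1: 3.5, 2: 0.9, 3: 7.1, 4: 1.4}
result = pick_cleaning_samples(importance, size=2, ignore_indices=[3])
```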

get_coreset ¤

get_coreset(level=0, preprocessing_stage='user', sparse_threshold=0.01, as_df=False, with_index=False, seq_from=None, seq_to=None)

Get tree's coreset data in one of the preprocessing_stage(s) in the data preprocessing workflow. Use the level parameter to control the level of the tree from which samples will be returned. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
level int

int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used.

0
preprocessing_stage Union[str, None]

string, optional, default user.

The different stages reflect the data preprocessing workflow.

- original - Return the data as it was handed to the Coreset’s build function (The data_params.save_orig flag should be set for this option to be available).

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'user'
sparse_threshold float

float, optional, default 0.01. Returns the features (X) as a sparse matrix if the data density after preprocessing is below sparse_threshold, otherwise, will return the data as an array (Applicable only for preprocessing_stage=auto).

0.01
as_df bool

boolean, optional, default False. When True, returns the X as a pandas DataFrame.

False
with_index bool

boolean, optional, default False. Relevant only when preprocessing_stage=auto. Should the returned data include the index column.

False
seq_from Any

string or datetime, optional, default None. The start sequence to filter samples by.

None
seq_to Any

string or datetime, optional, default None. The end sequence to filter samples by.

None

Returns:

Type Description
dict

A dictionary representing the Coreset:

- ind: A numpy array of indices.

- X: A numpy array of the feature matrix.

- y: A numpy array of the target values.

- w: A numpy array of sample weights.

- n_represents: The number of instances represented by the coreset.

- features_out: A list of the output features, if preprocessing_stage=auto, otherwise None.

- props: A numpy array of properties, or None if not available.
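
The w entry matters when fitting on the returned data: each coreset instance should count according to its weight. As a minimal sketch, the following fits a one-feature weighted least-squares line on a mocked dictionary that uses the documented keys. The values are invented, plain lists stand in for numpy arrays, and weighted_line_fit is an illustrative helper rather than anything provided by the service.

```python
def weighted_line_fit(x, y, w):
    """Closed-form weighted least squares for y = intercept + slope * x."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    cov_xy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope


# Mocked dictionary with the documented keys; the values are invented.
coreset = {
    "ind": [4, 17, 42],
    "X": [[0.0], [1.0], [3.0]],
    "y": [1.0, 3.0, 7.0],
    "w": [10.0, 25.0, 5.0],
    "n_represents": 40,
    "features_out": None,
    "props": None,
}

x_column = [row[0] for row in coreset["X"]]
intercept, slope = weighted_line_fit(x_column, coreset["y"], coreset["w"])
```

With a Scikit-learn estimator, the same idea is usually expressed by passing the weights through the sample_weight argument of fit.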

get_coreset_size ¤

get_coreset_size(level=0, seq_from=None, seq_to=None)

Returns the size of the tree's coreset data. Use the level parameter to control the level of the tree from which samples will be returned. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
level int

int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used.

0
seq_from Any

string or datetime, optional, default None. The start sequence to filter samples by.

None
seq_to Union[str, datetime]

string or datetime, optional, default None. The end sequence to filter samples by.

None

Returns:

Name Type Description
int int

coreset size
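
The level description implies a rough size estimate: about coreset_size samples at level 0, about twice as many one level down, and a fallback to the deepest level when the requested level is too large. The sketch below encodes that reading; approx_samples_at_level is a hypothetical helper, and continued doubling beyond level 1 is an assumption drawn from the "etc." in the description, so actual sizes returned by get_coreset_size may differ.

```python
def approx_samples_at_level(coreset_size, level, max_level):
    """Hypothetical estimate of the coreset size returned for a given tree level."""
    # Levels beyond the deepest one fall back to the maximal available level.
    effective_level = min(level, max_level)
    # Assumption: the sample count doubles with each level below the head.
    return coreset_size * 2 ** effective_level


sizes = [approx_samples_at_level(1000, level, max_level=3) for level in range(6)]
```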

get_hyperparameter_tuning_data ¤

get_hyperparameter_tuning_data(level=None, validation_method='cross validation', preprocessing_stage='user', sparse_threshold=0.01, validation_size=0.2, seq_train_from=None, seq_train_to=None, seq_validate_from=None, seq_validate_to=None, as_df=True)

A method for retrieving the data for hyperparameter tuning with cross validation, using the coreset tree. The returned data can be used with Scikit-learn’s GridSearchCV, with skopt’s BayesSearchCV and with any other hyperparameter tuning method that can accept a fold iterator object.

Note: When using this method with Scikit-learn's GridSearchCV and similar methods, the refit parameter must be set to False. The returned dataset (X, y and w) is the concatenation of the training data for all folds followed by the validation data for all folds, so it includes both training and validation data. By default, GridSearchCV refits the estimator on the entire dataset it was given, not just the training portion, which would be incorrect here, and this behavior cannot be modified. Instead, handle the refit manually after the cross-validation process: call get_coreset with the same parameters that were passed to this function to retrieve the data, then fit on the returned data using the best hyperparameters found by GridSearchCV.

This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
level int

int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used.

None
validation_method str

str, optional. Indicates which validation method will be used. The possible values are 'cross validation', 'hold-out validation' and 'seq-dependent validation'.

'cross validation'
preprocessing_stage Union[str, None]

string, optional, default user.

The different stages reflect the data preprocessing workflow.

- original - Return the data as it was handed to the Coreset’s build function (The data_params.save_orig flag should be set for this option to be available).

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'user'
sparse_threshold float

float, optional, default 0.01. Returns the features (X) as a sparse matrix if the data density after preprocessing is below sparse_threshold, otherwise, will return the data as an array (Applicable only for preprocessing_stage='auto').

0.01
validation_size float

float, optional, default 0.2. The size of the validation set, as a percentage of the training set size for hold-out validation.

0.2
seq_train_from Any

Any, optional. The starting sequence of the training set for seq-dependent validation.

None
seq_train_to Any

Any, optional. The ending sequence of the training set for seq-dependent validation.

None
seq_validate_from Any

Any, optional. The starting sequence number of the validation set for seq-dependent validation.

None
seq_validate_to Any

Any, optional. The ending sequence number of the validation set for seq-dependent validation.

None
as_df bool

boolean, optional, default True. When True, returns the X as a pandas DataFrame.

True

Returns:

Type Description
Dict[str, Union[ndarray, FoldIterator, Any]]

A dictionary with the following keys: ind: The indices of the data. X: The data. y: The labels. w: The weights. splitter: The fold iterator. model_params: The model parameters.
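
To show how the returned pieces fit together, the sketch below walks over a mocked result. It assumes that the splitter yields (training positions, validation positions) pairs into the concatenated arrays, which is the form Scikit-learn accepts for an iterable cv argument; the actual FoldIterator may expose this differently. All values are invented, with the training rows placed first and the validation rows last as described for this method.

```python
# Mocked return value with the documented keys; all values are invented.
tuning_data = {
    "ind": [0, 1, 2, 3, 4, 5],
    "X": [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]],
    "y": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    "w": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    # Assumption: each fold is a (train positions, validation positions) pair.
    "splitter": [([0, 1], [4]), ([2, 3], [5])],
    "model_params": {},
}


def weighted_mean(values, weights):
    """Weighted average of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


fold_errors = []
for train_pos, val_pos in tuning_data["splitter"]:
    # "Train" a constant predictor on the training rows of this fold only.
    prediction = weighted_mean(
        [tuning_data["y"][i] for i in train_pos],
        [tuning_data["w"][i] for i in train_pos],
    )
    # Evaluate on the validation rows of the same fold.
    errors = [abs(tuning_data["y"][i] - prediction) for i in val_pos]
    fold_errors.append(sum(errors) / len(errors))
```

No step here fits on the full concatenated arrays, which is the reason refit has to be disabled when the data is handed to GridSearchCV.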

get_max_level ¤

get_max_level()

Return the maximal level of the coreset tree. Level 0 is the head of the tree. Level 1 is the level below the head of the tree, etc.

grid_search ¤

grid_search(param_grid, level=None, validation_method='cross validation', model=None, scoring=None, refit=True, verbose=0, preprocessing_stage='auto', sparse_threshold=0.01, error_score=np.nan, validation_size=0.2, seq_train_from=None, seq_train_to=None, seq_validate_from=None, seq_validate_to=None, n_jobs=None)

A method for performing hyperparameter selection by grid search, using the coreset tree. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
param_grid Union[Dict[str, List], List[Dict[str, List]]]

dict or list of dicts. Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.

required
level int

int, optional. The level of the tree on which the training and validation will be performed. If None, the best level will be selected.

None
validation_method str

str, optional. Indicates which validation method will be used. The possible values are 'cross validation', 'hold-out validation' and 'seq-dependent validation'. If 'cross validation' is selected, the process involves progressing through folds. We first train and validate all hyperparameter combinations for each fold, before moving on to the subsequent folds.

'cross validation'
model Any

A Scikit-learn compatible model instance, optional. The model class needs to implement the usual scikit-learn interface.

None
scoring Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]

callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used.

None
refit bool

bool, optional. If True, retrain the model on the whole coreset using the best found hyperparameters, and return the model. This model will be used when predict and predict_proba are called.

True
verbose int

int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of folds and hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score and time for each fold and hyperparameter combination.

0
preprocessing_stage Union[str, None]

string, optional, default auto.

The different stages reflect the data preprocessing workflow.

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'auto'
sparse_threshold float

float, optional, default 0.01. Creates a sparse matrix from the features (X), if the data density after preprocessing is below sparse_threshold, otherwise, will create an array. (Applicable only for preprocessing_stage='auto').

0.01
error_score Union[str, float, int]

"raise" or numeric, optional. Value to assign to the score if an error occurs in model training. If set to "raise", the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

nan
validation_size float

float, optional, default 0.2. The size of the validation set, as a percentage of the training set size for hold-out validation.

0.2
seq_train_from Any

Any, optional. The starting sequence of the training set for seq-dependent validation.

None
seq_train_to Any

Any, optional. The ending sequence of the training set for seq-dependent validation.

None
seq_validate_from Any

Any, optional. The starting sequence number of the validation set for seq-dependent validation.

None
seq_validate_to Any

Any, optional. The ending sequence number of the validation set for seq-dependent validation.

None
n_jobs int

int, optional. Default: number of CPUs. Number of jobs to run in parallel during grid search.

None

Returns:

Type Description
Union[Tuple[Dict, DataFrame, BaseEstimator], Tuple[Dict, DataFrame]]

A dict with the best hyperparameters setting, among those provided by the user. The keys are the hyperparameters names, while the dicts' values are the hyperparameters values. A Pandas DataFrame holding the score for each hyperparameter combination and fold. For the 'cross validation' method the average across all folds for each hyperparameter combination is included too. If refit=True, the retrained model is also returned.
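
The param_grid format (a dict of value lists, or a list of such dicts) expands into individual hyperparameter settings. A minimal sketch of that expansion follows; expand_param_grid is an illustrative helper, not a library function, and fit_intercept and positive are used as example hyperparameters because they are parameters of sklearn's LinearRegression.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Expand a dict (or list of dicts) of value lists into individual settings."""
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    combinations = []
    for grid in grids:
        names = sorted(grid)
        # Cartesian product of the value lists within one dictionary.
        for values in product(*(grid[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations


single = expand_param_grid({"fit_intercept": [True, False], "positive": [True, False]})
multiple = expand_param_grid([
    {"fit_intercept": [True], "positive": [True, False]},
    {"fit_intercept": [False]},
])
```

The single dictionary yields four settings, while the list of two dictionaries yields three, since each dictionary spans its own grid.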

holdout_validate ¤

holdout_validate(level=None, validation_size=0.2, model=None, scoring=None, return_model=False, verbose=0, preprocessing_stage='auto', sparse_threshold=0.01, **model_params)

A method for hold-out validation on the coreset tree. The validation set is always the last part of the dataset. This function is only applicable in case the coreset tree was optimized_for training.

Parameters:

Name Type Description Default
level int

int, optional. The level of the tree on which the training and validation will be performed. If None, the best level will be selected.

None
validation_size float

float, optional. The percentage of the dataset that will be used for validating the model.

0.2
model Any

A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. The model class needs to implement the usual scikit-learn interface. Default: instantiate the service model class using input model_params.

None
scoring Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]

callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used.

None
return_model bool

bool, optional. If True, the trained model is also returned.

False
verbose int

int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score and time for each hyperparameter combination.

0
preprocessing_stage Union[str, None]

string, optional, default auto.

The different stages reflect the data preprocessing workflow.

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'auto'
sparse_threshold float

float, optional, default 0.01. Creates a sparse matrix from the features (X), if the data density after preprocessing is below sparse_threshold, otherwise, will create an array. (Applicable only for preprocessing_stage='auto').

0.01
model_params

kwargs, optional. The hyper-parameters of the model. If not provided, the default values are used.

{}

Returns:

Type Description
Union[List[float], Tuple[List[float], List[BaseEstimator]]]

The validation score. If return_model=True, the trained model is also returned.
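As a rough illustration of the two ideas above (the validation set is the last part of the data, and a scorer is called as (model, X, y)), here is a minimal plain-Python sketch. MeanRegressor, holdout_split and neg_mean_absolute_error are hypothetical helpers written for this example; the way the library rounds the split size is an assumption.

```python
class MeanRegressor:
    """Toy model with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mean_absolute_error(model, X, y):
    """Scorer with the (model, X, y) signature; returns a scalar."""
    predictions = model.predict(X)
    return -sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)


def holdout_split(X, y, validation_size=0.2):
    """Keep the last validation_size share of the rows for validation."""
    n_validation = int(round(len(X) * validation_size))
    cut = len(X) - n_validation
    return X[:cut], y[:cut], X[cut:], y[cut:]


X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]
X_train, y_train, X_val, y_val = holdout_split(X, y, validation_size=0.2)
model = MeanRegressor().fit(X_train, y_train)
score = neg_mean_absolute_error(model, X_val, y_val)
print(len(X_train), len(X_val), score)
```

A callable with this signature is the kind of object the scoring parameter describes.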

is_dirty ¤

is_dirty()

Returns:

Type Description
bool

Indicates whether the coreset tree has nodes marked as dirty, meaning they were affected by any of the methods: remove_samples, update_targets, update_features or filter_out_samples, when they were called with force_do_nothing.

load classmethod ¤

load(dir_path, name=None, *, data_manager=None, load_buffer=True, working_directory=None)

Restore a service object from a local directory.

Parameters:

Name Type Description Default
dir_path Union[str, PathLike]

str, path. Local directory where service data is stored.

required
name str

string, optional, default service class name (lower case). The name prefix of the subdirectory to load. When several subdirectories having the same name prefix are found, the last one, ordered by name, is selected. For example when saving with override=False, the chosen subdirectory is the last saved.

None
data_manager DataManagerT

DataManagerBase subclass, optional. When specified, the input data manager is used instead of restoring it from the saved configuration.

None
load_buffer bool

boolean, optional, default True. If set, load saved buffer (a partial node of the tree) from disk and add it to the tree.

True
working_directory Union[str, PathLike]

str, path, optional, default use working_directory from saved configuration. Local directory where intermediate data is stored.

None

Returns:

Type Description
CoresetTreeService

CoresetTreeService object
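The subdirectory selection rule described for name (take the last match, ordered by name) can be sketched with the standard library. This is an illustration only: pick_latest_subdir is a hypothetical helper, and the timestamp suffix format used below is an assumption.

```python
import tempfile
from pathlib import Path


def pick_latest_subdir(dir_path, name):
    """Return the last subdirectory, ordered by name, that starts with name."""
    candidates = sorted(
        (p for p in Path(dir_path).iterdir() if p.is_dir() and p.name.startswith(name)),
        key=lambda p: p.name,
    )
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as root:
    # Hypothetical timestamp suffixes; the real suffix format may differ.
    for suffix in ("20240101_090000", "20240301_120000", "20240201_101500"):
        (Path(root) / f"coresettreeservicelr_{suffix}").mkdir()
    chosen_name = pick_latest_subdir(root, "coresettreeservicelr").name
    missing = pick_latest_subdir(root, "other")

print(chosen_name)
```

Because the suffixes sort lexicographically in time order, the last name is also the most recent save.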

partial_build ¤

partial_build(X, y, indices=None, props=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None, verbose=1)

Add new samples to a coreset tree from parameters X, y, indices and props (properties). Categorical features are automatically one-hot encoded and missing values are automatically handled.

Parameters:

Name Type Description Default
X Union[Iterable, Iterable[Iterable]]

array like or iterator of arrays like. An array or an iterator of features. Categorical features are automatically one-hot encoded and missing values are automatically handled.

required
y Union[Iterable[Any], Iterable[Iterable[Any]]]

array like or iterator of arrays like. An array or an iterator of targets.

required
indices Union[Iterable[Any], Iterable[Iterable[Any]]]

array like or iterator of arrays like, optional. An array or an iterator with indices of X.

None
props Union[Iterable[Any], Iterable[Iterable[Any]]]

array like or iterator of arrays like, optional. An array or an iterator of properties. Properties are not used to compute the Coreset or train the model, but it is possible to filter_out_samples on them or to pass them in the select_from_function of get_cleaning_samples.

None
chunk_size int

int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks.

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
copy bool

boolean, default False. False (default) - Input data might be updated as a result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory).

False
n_jobs int

Default: number of CPUs. Number of jobs to run in parallel during build.

None
verbose int

int, optional. The verbosity level for printing build progress: 0 - silent, 1 - (default) print.

1

Returns:

Type Description
CoresetTreeService

self
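The difference between chunk_size and chunk_by can be sketched as follows. This is a simplified illustration, not the library's chunking logic: make_chunks is a hypothetical helper, and grouping all rows that share a key (rather than only consecutive runs) is an assumption.

```python
from itertools import groupby


def make_chunks(rows, chunk_size=None, chunk_by=None):
    """Split rows into chunks; each chunk would become one coreset node."""
    if chunk_by is not None:
        # chunk_by takes precedence; chunk_size is ignored.
        ordered = sorted(rows, key=chunk_by)
        return [list(group) for _, group in groupby(ordered, key=chunk_by)]
    if not chunk_size:
        # chunk_size=0 (or unset in this sketch): keep the input as one chunk.
        return [list(rows)]
    return [list(rows[i:i + chunk_size]) for i in range(0, len(rows), chunk_size)]


rows = [("2024-01", 1), ("2024-02", 2), ("2024-01", 3), ("2024-03", 4), ("2024-02", 5)]
by_size = make_chunks(rows, chunk_size=2)
by_month = make_chunks(rows, chunk_size=2, chunk_by=lambda row: row[0])
print([len(c) for c in by_size], [len(c) for c in by_month])
```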

partial_build_from_df ¤

partial_build_from_df(datasets, target_datasets=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None, verbose=1)

Add new samples to a coreset tree based on the pandas DataFrame iterator. Categorical features are automatically one-hot encoded and missing values are automatically handled.

Parameters:

Name Type Description Default
datasets Union[Iterator[DataFrame], DataFrame]

pandas DataFrame or a DataFrame iterator. Data includes features, may include targets and may include indices.

required
target_datasets Union[Iterator[DataFrame], DataFrame]

pandas DataFrame or a DataFrame iterator, optional. Use when data is split to features and target. Should include only one column.

None
chunk_size int

int, optional, default previous used chunk_size. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks.

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
copy bool

boolean, default False. False (default) - Input data might be updated as a result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory).

False
n_jobs int

Default: number of CPUs. Number of jobs to run in parallel during build.

None
verbose int

int, optional. The verbosity level for printing build progress: 0 - silent, 1 - (default) print.

1

Returns:

Type Description
CoresetTreeService

self

partial_build_from_file ¤

partial_build_from_file(file_path, target_file_path=None, *, reader_f=pd.read_csv, reader_kwargs=None, reader_chunk_size_param_name=None, chunk_size=None, chunk_by=None, n_jobs=None, verbose=1)

Add new samples to a coreset tree based on data taken from local storage. Categorical features are automatically one-hot encoded and missing values are automatically handled.

Parameters:

Name Type Description Default
file_path Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]

file, list of files, directory, list of directories. Path(s) to the place where data is stored. Data includes features, may include targets and may include indices.

required
target_file_path Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]

file, list of files, directory, list of directories, optional. Use when files are split to features and target. Each file should include only one column.

None
reader_f Callable

pandas like read method, optional, default pandas read_csv. For example, to read excel files use pandas read_excel.

read_csv
reader_kwargs dict

dict, optional. Keyword arguments used when calling reader_f method.

None
reader_chunk_size_param_name str

str, optional. The name of the reader_f input parameter used for reading the file in chunks. When not provided, the service tries to detect it automatically. Based on the data, the optimal chunk size to read is determined and passed to reader_f under this parameter name. Use "ignore" to skip the automatic chunk reading logic.

None
chunk_size int

int, optional, default previous used chunk_size. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks.

None
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
n_jobs int

Default: number of CPUs. Number of jobs to run in parallel during build.

None
verbose int

int, optional. The verbosity level for printing build progress: 0 - silent, 1 - (default) print.

1

Returns:

Type Description
CoresetTreeService

self
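How reader_f, reader_kwargs and reader_chunk_size_param_name fit together can be sketched with a toy reader built on the standard csv module. Everything below is hypothetical (toy_csv_reader, read_chunks, rows_per_chunk), and the automatic detection that applies when the parameter name is not provided is not reproduced here.

```python
import csv
import os
import tempfile
from itertools import islice


def toy_csv_reader(path, chunksize=None, **kwargs):
    """Hypothetical pandas-like reader: yields lists of row dicts."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, **kwargs)
        while True:
            chunk = list(islice(reader, chunksize))
            if not chunk:
                return
            yield chunk


def read_chunks(file_path, reader_f, reader_kwargs=None,
                reader_chunk_size_param_name=None, rows_per_chunk=2):
    """Call reader_f, passing the chunk size under the configured name."""
    kwargs = dict(reader_kwargs or {})
    if reader_chunk_size_param_name not in (None, "ignore"):
        kwargs[reader_chunk_size_param_name] = rows_per_chunk
    return list(reader_f(file_path, **kwargs))


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "data.csv")
    with open(path, "w", newline="") as handle:
        handle.write("x;y\n" + "".join(f"{i};{2 * i}\n" for i in range(5)))
    chunked = read_chunks(path, toy_csv_reader, {"delimiter": ";"}, "chunksize")
    whole = read_chunks(path, toy_csv_reader, {"delimiter": ";"}, "ignore")

print([len(c) for c in chunked], [len(c) for c in whole])
```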

plot ¤

plot(dir_path=None, selected_trees=None)

Produce a tree graph plot and save figure as a local png file.

Parameters:

Name Type Description Default
dir_path Union[str, PathLike]

string or PathLike, optional. The directory in which to save the plot figure. If it is not provided, is invalid or does not exist, the figure is saved in the current directory (from which this method is called).

None
selected_trees dict

dict, optional. A dictionary containing the names of the image file(s) to be generated.

None

Returns:

Type Description
Path

Image file path

predict ¤

predict(X, sparse_output=False, copy=False)

Run prediction on the trained model. This function is only applicable in case the coreset tree was optimized_for 'training' and in case fit() or grid_search(refit=True) were called before. The function automatically preprocesses the data according to the preprocessing_stage used to train the model.

Parameters:

Name Type Description Default
X Union[Iterable, Iterable[Iterable]]

An array of features.

required
sparse_output bool

boolean, optional, default False. When set to True, the function will create a sparse matrix after preprocessing and pass it to the predict function.

False
copy bool

boolean, default False. False (default) - Input data might be updated as a result of this function. True - Data is copied before processing (impacts memory).

False

Returns:

Type Description

Model prediction results.

predict_proba ¤

predict_proba(X, sparse_output=False, copy=False)

Run prediction on the trained model. This function is only applicable in case the coreset tree was optimized_for 'training' and in case fit() or grid_search(refit=True) were called before. The function automatically preprocesses the data according to the preprocessing_stage used to train the model.

Parameters:

Name Type Description Default
X Union[Iterable, Iterable[Iterable]]

An array of features.

required
sparse_output bool

boolean, optional, default False. When set to True, the function will create a sparse matrix after preprocessing and pass it to the predict_proba function.

False
copy bool

boolean, default False. False (default) - Input data might be updated as a result of this function. True - Data is copied before processing (impacts memory).

False

Returns:

Type Description

Returns the probability of the sample for each class in the model.

print ¤

print(selected_tree=None)

Print the tree's string representation.

Parameters:

Name Type Description Default
selected_tree str

string, optional. Which tree to print. Defaults to printing all.

None

remove_samples ¤

remove_samples(indices, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)

Remove samples from the coreset tree. The coreset tree is automatically updated to accommodate the changes.

Parameters:

Name Type Description Default
indices Iterable

array-like. An array of indices to be removed from the coreset tree.

required
force_resample_all Optional[int]

int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_sensitivity_recalc Optional[int]

int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - If self.chunk_sample_ratio<1, one level above the leaf node level; if self.chunk_sample_ratio=1, the leaf level. 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_do_nothing Optional[bool]

bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called.

False
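The level convention shared by force_resample_all and force_sensitivity_recalc (0 is the head, len(tree)-1 is the leaf level, -1 is an alias for the leaf level) can be expressed as a small helper. normalize_level is a hypothetical function written for illustration; rejecting other out-of-range values is an assumption.

```python
def normalize_level(level, n_levels):
    """Map a user-supplied level to an index in [0, n_levels - 1]."""
    if level is None:
        return None  # Leave the default behaviour to the caller.
    if level == -1:
        return n_levels - 1  # -1 means the leaf level.
    if not 0 <= level < n_levels:
        raise ValueError(f"level must be -1 or in [0, {n_levels - 1}]")
    return level


n_levels = 4  # A toy tree with levels 0 (head) to 3 (leaves).
print(normalize_level(0, n_levels), normalize_level(-1, n_levels))
```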

save ¤

save(dir_path=None, name=None, save_buffer=True, override=False, allow_pickle=True)

Save service configuration and relevant data to a local directory. Use this method when the service needs to be restored.

Parameters:

Name Type Description Default
dir_path Union[str, PathLike]

string or PathLike, optional, default self.working_directory. A local directory for saving service's files.

None
name str

string, optional, default service class name (lower case). Name of the subdirectory where the data will be stored.

None
save_buffer bool

boolean, default True. Save also the data in the buffer (a partial node of the tree) along with the rest of the saved data.

True
override bool

bool, optional, default False. False: add a timestamp suffix so each save won’t override the previous ones. True: The existing subdirectory with the provided name is overridden.

False
allow_pickle bool

bool, optional, default True. True: Saves the Coreset tree in pickle format (much faster). False: Saves the Coreset tree in JSON format.

True

Returns:

Type Description
Path

Save directory path.

save_coreset ¤

save_coreset(file_path, level=0, preprocessing_stage='user', with_index=True)

Get the coreset from the tree and save it to a file. Use the level parameter to control the level of the tree from which samples will be returned. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
file_path Union[str, PathLike]

string or PathLike. Local file path to store the coreset.

required
level int

int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used.

0
preprocessing_stage Union[str, None]

string, optional, default user.

The different stages reflect the data preprocessing workflow.

- original - Return the data as it was handed to the Coreset’s build function (The data_params.save_orig flag should be set for this option to be available).

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'user'
with_index bool

boolean, optional, default True. Relevant only when preprocessing_stage=auto. Should the returned data include the index column.

True
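The relationship between level and the approximate number of returned samples can be sketched as below. The doubling per level is read from the "around twice" wording above and is only approximate; approx_coreset_samples is a hypothetical helper, not part of the service.

```python
def approx_coreset_samples(level, coreset_size, max_level):
    """Rough sample count at a level, clamping to the deepest level."""
    effective_level = min(level, max_level)
    return coreset_size * 2 ** effective_level


for level in (0, 1, 2, 10):
    print(level, approx_coreset_samples(level, coreset_size=1000, max_level=3))
```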

seq_dependent_validate ¤

seq_dependent_validate(level=None, seq_train_from=None, seq_train_to=None, seq_validate_from=None, seq_validate_to=None, model=None, scoring=None, return_model=False, verbose=0, preprocessing_stage='auto', sparse_threshold=0.01, **model_params)

Train and validate on a subset of the Coreset tree, selected according to the seq_column defined in the DataParams structure passed to the constructor. This function is only applicable in case the coreset tree was optimized_for 'training'.

Parameters:

Name Type Description Default
level int

int, optional. The level of the tree from which the search for the best matching nodes starts. Nodes closer to the leaf level than the specified level may be selected to better match the provided seq parameters. If None, the search starts from level 0, the head of the tree. If None, the best level will be selected.

None
seq_train_from Any

Any, optional. The starting sequence of the training set.

None
seq_train_to Any

Any, optional. The ending sequence of the training set.

None
seq_validate_from Any

Any, optional. The starting sequence number of the validation set.

None
seq_validate_to Any

Any, optional. The ending sequence number of the validation set.

None
model Any

A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. The model class needs to implement the usual scikit-learn interface. Default: instantiate the service model class using input model_params.

None
scoring Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]

callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used.

None
return_model bool

bool, optional. If True, the trained model is also returned.

False
verbose int

int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score and time for each hyperparameter combination.

0
preprocessing_stage Union[str, None]

string, optional, default auto.

The different stages reflect the data preprocessing workflow.

- user - Return the data after any user defined data preprocessing (if defined).

- auto - Return the data after applying auto-preprocessing, including one-hot-encoding, converting Boolean fields to numeric, etc.

'auto'
sparse_threshold float

float, optional, default 0.01. Creates a sparse matrix from the features (X), if the data density after preprocessing is below sparse_threshold, otherwise, will create an array. (Applicable only for preprocessing_stage='auto').

0.01
model_params

kwargs, optional. The hyper-parameters of the model. If not provided, the default values are used.

{}

Returns:

Type Description
Union[List[float], Tuple[List[float], List[BaseEstimator]]]

The validation score. If return_model=True, the trained model is also returned.
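The effect of the four seq parameters can be illustrated on plain rows. This sketch filters individual samples, whereas the service selects matching tree nodes; select_by_seq is a hypothetical helper, and treating both bounds as inclusive and None as open-ended is an assumption.

```python
def select_by_seq(rows, seq_from=None, seq_to=None):
    """Keep rows whose seq value lies in [seq_from, seq_to]; None is open-ended."""
    return [
        row for row in rows
        if (seq_from is None or row["seq"] >= seq_from)
        and (seq_to is None or row["seq"] <= seq_to)
    ]


# Monthly sequence values stored as sortable strings.
rows = [{"seq": f"2024-{month:02d}", "y": float(month)} for month in range(1, 7)]
train = select_by_seq(rows, seq_from="2024-01", seq_to="2024-04")
validate = select_by_seq(rows, seq_from="2024-05", seq_to="2024-06")
print(len(train), len(validate))
```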

set_model_cls ¤

set_model_cls(model_cls)

Set the model class used to train the model on the coreset, in case a specific model instance wasn't passed to fit or the validation methods.

Parameters:

Name Type Description Default
model_cls Any

A Scikit-learn compatible model class.

required

set_seen_indication ¤

set_seen_indication(seen_flag=True, indices=None)

Set samples as 'seen' or 'unseen'. Not providing an indices list defaults to setting the flag on all samples. This function is only applicable in case the coreset tree was optimized_for 'cleaning'.

Parameters:

Name Type Description Default
seen_flag bool

bool, optional, default True. Set 'seen' or 'unseen' flag

True
indices Iterable

array like, optional. Set flag only for the provided list of indices. Defaults to all indices.

None

update_dirty ¤

update_dirty(force_resample_all=None, force_sensitivity_recalc=None)

Calculate the sensitivity and resample the nodes that were marked as dirty, meaning they were affected by any of the methods: remove_samples, update_targets, update_features or filter_out_samples, when they were called with force_do_nothing.

Parameters:

Name Type Description Default
force_resample_all Optional[int]

int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_sensitivity_recalc Optional[int]

int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - If self.chunk_sample_ratio<1, one level above the leaf node level; if self.chunk_sample_ratio=1, the leaf level. 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
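The deferred-update pattern behind force_do_nothing, is_dirty and update_dirty can be sketched with a toy class. ToyTree is hypothetical and only mimics the bookkeeping: real nodes would have their sensitivities recalculated and be resampled, which is represented here by a simple counter.

```python
class ToyTree:
    """Toy model of dirty-node tracking; not the service's implementation."""

    def __init__(self, chunks):
        self.chunks = [set(chunk) for chunk in chunks]
        self.dirty = set()
        self.rebuilds = 0

    def remove_samples(self, indices, force_do_nothing=False):
        to_remove = set(indices)
        for node_id, chunk in enumerate(self.chunks):
            if chunk & to_remove:
                chunk.difference_update(to_remove)
                self.dirty.add(node_id)  # Mark the affected node.
        if not force_do_nothing:
            self.update_dirty()

    def is_dirty(self):
        return bool(self.dirty)

    def update_dirty(self):
        # Stand-in for sensitivity recalculation and resampling.
        self.rebuilds += len(self.dirty)
        self.dirty.clear()


tree = ToyTree([[0, 1, 2], [3, 4, 5]])
tree.remove_samples([1, 4], force_do_nothing=True)
dirty_before = tree.is_dirty()
tree.update_dirty()
print(dirty_before, tree.is_dirty(), tree.rebuilds)
```

Deferring the update lets several edits be batched before a single, more expensive refresh of the affected nodes.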

update_features ¤

update_features(indices, X, feature_names=None, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)

Update the features for selected samples on the coreset tree. The coreset tree is automatically updated to accommodate the changes.

Parameters:

Name Type Description Default
indices Iterable

array-like. An array of indices to be updated.

required
X Iterable

array-like. An array of features. Should have the same length as indices.

required
feature_names Iterable[str]

Iterable of str, optional. If the number of features in X differs from the number of features in the original coreset, this parameter should contain a list of the names of the passed features.

None
force_resample_all Optional[int]

int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_sensitivity_recalc Optional[int]

int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - If self.chunk_sample_ratio<1, one level above the leaf node level; if self.chunk_sample_ratio=1, the leaf level. 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_do_nothing Optional[bool]

bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called.

False

update_targets ¤

update_targets(indices, y, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)

Update the targets for selected samples on the coreset tree. The coreset tree is automatically updated to accommodate the changes.

Parameters:

Name Type Description Default
indices Iterable

array-like. An array of indices to be updated.

required
y Iterable

array-like. An array of classes/labels. Should have the same length as indices.

required
force_resample_all Optional[int]

int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_sensitivity_recalc Optional[int]

int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - If self.chunk_sample_ratio<1, one level above the leaf node level; if self.chunk_sample_ratio=1, the leaf level. 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level.

None
force_do_nothing Optional[bool]

bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called.

False