CoresetTreeServiceKMeans
CoresetTreeServiceKMeans ¤
CoresetTreeServiceKMeans(*, data_manager=None, data_params=None, n_instances=None, max_memory_gb=None, optimized_for, chunk_size=None, chunk_by=None, k=8, coreset_size=None, coreset_params=None, working_directory=None, cache_dir=None, node_train_function=None, node_train_function_params=None, node_metadata_func=None, chunk_sample_ratio=None, model_cls=None)
Bases: CoresetTreeServiceUnsupervisedMixin, CoresetTreeService
Subclass of CoresetTreeService for KMeans. A service class for creating a coreset tree and working with it. optimized_for is a required parameter defining the main usage of the service: 'training', 'cleaning' or both, optimized_for=['training', 'cleaning']. The service will decide whether to build an actual Coreset Tree or to build a single Coreset over the entire dataset, based on the triplet: n_instances, max_memory_gb and the 'number of features' (deduced from the dataset). The chunk_size and coreset_size will be deduced based on the above triplet too. In case chunk_size and coreset_size are provided, they will override all above mentioned parameters (less recommended).
When fitting KMeans on the Coreset, it is highly recommended to use the built-in fit function of the CoresetTreeServiceKMeans class. Sklearn uses k-means++ as its default initialization method. While sklearn's KMeans implementation accepts sample_weight, the kmeans_plusplus implementation does not. When building the Coreset, samples are selected and weights are assigned to them; not using these weights will therefore significantly degrade the quality of the results. The fit implementation of the CoresetTreeServiceKMeans solves this problem by extending kmeans_plusplus to receive sample_weight.
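To make the role of the weights concrete, here is a minimal sketch of weighted k-means++ seeding in plain Python. It is not the service's implementation; the function names and the exact sampling rule (probability proportional to weight times squared distance to the nearest chosen center) are assumptions chosen for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_kmeans_plusplus(points, weights, k, seed=0):
    """Pick up to k seed centers, drawing each with probability w_i * D(x_i)^2."""
    rng = random.Random(seed)
    idx = list(range(len(points)))
    # First center: sampled proportionally to the sample weights alone.
    first = rng.choices(idx, weights=weights, k=1)[0]
    centers = [points[first]]
    # Squared distance from every point to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:
            break  # no weighted point is left away from the chosen centers
        nxt = rng.choices(idx, weights=probs, k=1)[0]
        centers.append(points[nxt])
        d2 = [min(d, sq_dist(p, points[nxt])) for p, d in zip(points, d2)]
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (100.0, 100.0)]
weights = [5.0, 5.0, 5.0, 5.0, 0.0]  # the far outlier carries no weight
centers = weighted_kmeans_plusplus(points, weights, k=2, seed=1)
```

Without the weights, the far outlier would be a very likely seed because of its large distance; with a weight of zero it can never be selected.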
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_manager |
DataManagerT
|
DataManagerBase subclass, optional. The class used to interact with the provided data and store it locally. By default, only the sampled data is stored, in HDF5 file format. |
None
|
data_params |
Union[DataParams, dict]
|
DataParams, optional. Preprocessing information. |
None
|
n_instances |
int
|
int. The total number of instances that are going to be processed (can be an estimation). This parameter is required and is the only one of the above-mentioned triplet that isn't deduced from the data. |
None
|
max_memory_gb |
int
|
int, optional. The maximum memory in GB that should be used. When not provided, the server's total memory is used. In any case only 80% of the provided memory or the server's total memory is considered. |
None
|
optimized_for |
Union[list, str]
|
str or list. Either 'training', 'cleaning' or both ['training', 'cleaning']. The main usage of the service. |
required |
k |
int
|
int, default=8. Only relevant when the tree is optimized_for 'cleaning'. The number of clusters to form as well as the number of centroids to generate. |
8
|
chunk_size |
Union[dict, int]
|
int, optional. The number of instances to be used when creating a coreset node in the tree. When defined, it will override the parameters of optimized_for, n_instances, n_classes and max_memory_gb. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory). |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
coreset_size |
Union[int, dict]
|
int, optional. Represents the coreset size of each node in the coreset tree. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, in practice, the actual size of the coreset is mostly smaller than coreset_size. |
None
|
coreset_params |
Union[CoresetParams, dict]
|
CoresetParams or dict, optional. Coreset algorithm specific parameters. |
None
|
node_train_function |
Callable[[ndarray, ndarray, ndarray], Any]
|
Callable, optional. A method for training a model at the tree node level. |
None
|
node_train_function_params |
dict
|
dict, optional. kwargs to be used when calling node_train_function. |
None
|
node_metadata_func |
Callable[[Tuple[ndarray], ndarray, Union[list, None]], Union[list, dict, None]]
|
callable, optional. A method for storing user metadata on each node. |
None
|
working_directory |
Union[str, PathLike]
|
str, path, optional. Local directory where intermediate data is stored. |
None
|
cache_dir |
Union[str, PathLike]
|
str, path, optional. For internal use when loading a saved service. |
None
|
chunk_sample_ratio |
float
|
float, optional. Indicates the size of the sample that will be taken and saved from each chunk on top of the Coreset for the validation methods. The values are from the range [0,1]. For example, chunk_sample_ratio=0.5, means that 50% of the data instances from each chunk will be saved. |
None
|
model_cls |
Any
|
A Scikit-learn compatible model class, optional. The model class used to train the model on the coreset, in case a specific model instance wasn't passed to fit or the validation methods. The default model class is sklearn's KMeans, with our extension to kmeans_plusplus to support sample_weight. |
None
|
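The coreset_size description notes that the actual coreset is usually smaller than requested because an instance may be sampled more than once. The sketch below illustrates only that effect, using a hypothetical importance-proportional draw with replacement; the service's real sampling scheme is not shown in this reference.

```python
import random

def sample_unique_indices(importances, coreset_size, seed=0):
    """Draw coreset_size indices with replacement, proportionally to importance,
    and return the distinct indices that were hit."""
    rng = random.Random(seed)
    idx = list(range(len(importances)))
    draws = rng.choices(idx, weights=importances, k=coreset_size)
    return sorted(set(draws))

# Hypothetical importances: 5 very important instances, 95 ordinary ones.
importances = [10.0] * 5 + [1.0] * 95
unique = sample_unique_indices(importances, coreset_size=50)
print(len(unique), "distinct instances out of 50 draws")
```

Repeated draws of the same instance collapse into one entry, so the number of distinct instances cannot exceed coreset_size and is typically below it.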
auto_preprocessing ¤
Apply auto-preprocessing on the provided (test) data, similarly to the way it is done by the fit or get_coreset methods. Preprocessing includes one-hot encoding and handling missing values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int, optional, default 0. Defines the depth level of the tree according to which the preprocessing would be done. |
required | |
X |
Union[Iterable, Iterable[Iterable]]
|
array like or iterator of arrays like. An array or an iterator of features. |
None
|
sparse_threshold |
float
|
float, optional, default 0.01. Returns a sparse matrix if the data density after preprocessing is below sparse_threshold, otherwise, will return the data as an array. |
0.01
|
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function. True - Data is copied before processing (impacts memory). |
False
|
Returns:
Name | Type | Description |
---|---|---|
Dict |
data: A numpy array of the preprocessed data. features: A list of feature names corresponding to the data. |
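The sparse_threshold rule compares the density of the preprocessed data against a cutoff. The following sketch shows that comparison on a list-of-lists matrix; the helper names are hypothetical and density is assumed to mean the fraction of non-zero entries.

```python
def density(rows):
    """Fraction of non-zero entries in a rectangular list-of-lists matrix."""
    total = sum(len(r) for r in rows)
    nonzero = sum(1 for r in rows for v in r if v != 0)
    return nonzero / total if total else 0.0

def choose_format(rows, sparse_threshold=0.01):
    """Return 'sparse' when density is below the threshold, else 'dense'."""
    return "sparse" if density(rows) < sparse_threshold else "dense"

# A 10 x 200 matrix with a single non-zero value has density 1/2000.
mostly_zero = [[0] * 200 for _ in range(10)]
mostly_zero[3][7] = 1
small_dense = [[1, 0], [0, 1]]
```

One-hot encoding wide categorical features tends to produce matrices like the first one, which is why a low threshold such as 0.01 is a natural default.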
build ¤
build(X, y=None, indices=None, props=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None)
Create a coreset tree from the parameters X, y, indices and props (properties). build functions may be called only once. To add more data to the coreset tree use one of the partial_build functions. Categorical features are automatically one-hot encoded and missing values are automatically handled. The target will be ignored when the Coreset is built.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X |
Union[Iterable, Iterable[Iterable]]
|
array like or iterator of arrays like. An array or an iterator of features. Categorical features are automatically one-hot encoded and missing values are automatically handled. |
required |
y |
Union[Iterable[Any], Iterable[Iterable[Any]]]
|
array like or iterator of arrays like, optional. An array or an iterator of targets. The target will be ignored when the Coreset is built. |
None
|
indices |
Union[Iterable[Any], Iterable[Iterable[Any]]]
|
array like or iterator of arrays like, optional. An array or an iterator with indices of X. |
None
|
props |
Union[Iterable[Any], Iterable[Iterable[Any]]]
|
array like or iterator of arrays like, optional. An array or an iterator of properties. Properties won’t be used to compute the Coreset or train the model, but it is possible to filter_out_samples on them or to pass them in the select_from_function of get_cleaning_samples. |
None
|
chunk_size |
int
|
int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory). |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory). |
False
|
n_jobs |
int
|
Default: number of CPUs. Number of jobs to run in parallel during build. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
self |
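As a rough illustration of what splitting by a key means for chunk_by, the sketch below groups rows by the value a key function returns. How the service actually invokes a callable passed as chunk_by is not specified here, so the per-row call and the `month` field are assumptions.

```python
from collections import defaultdict

def split_by_key(rows, key_func):
    """Group rows into chunks that share the same key value."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[key_func(row)].append(row)
    return dict(chunks)

rows = [
    {"month": "2024-01", "x": 1.0},
    {"month": "2024-01", "x": 2.5},
    {"month": "2024-02", "x": 0.3},
]
# Hypothetical key: one chunk (and hence one tree node) per month.
chunks = split_by_key(rows, lambda r: r["month"])
```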
build_from_df ¤
build_from_df(datasets, target_datasets=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None)
Create a coreset tree from pandas DataFrame(s). build functions may be called only once. To add more data to the coreset tree use one of the partial_build functions. Categorical features are automatically one-hot encoded and missing values are automatically handled. The target will be ignored when the Coreset is built.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets |
Union[Iterator[DataFrame], DataFrame]
|
pandas DataFrame or a DataFrame iterator. Data includes features, may include labels and may include indices. |
required |
target_datasets |
Union[Iterator[Union[DataFrame, Series]], DataFrame, Series]
|
pandas DataFrame or a DataFrame iterator, optional. Use when data is split to features and target. Should include only one column. |
None
|
chunk_size |
int
|
int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory). |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory). |
False
|
n_jobs |
int
|
Default: number of CPUs. Number of jobs to run in parallel during build. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
self |
build_from_file ¤
build_from_file(file_path, target_file_path=None, *, reader_f=pd.read_csv, reader_kwargs=None, reader_chunk_size_param_name=None, chunk_size=None, chunk_by=None, n_jobs=None)
Create a coreset tree based on data taken from local storage. build functions may be called only once. To add more data to the coreset tree use one of the partial_build functions. Categorical features are automatically one-hot encoded and missing values are automatically handled. The target will be ignored when the Coreset is built.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]
|
file, list of files, directory, list of directories. Path(s) to the place where data is stored. Data includes features, may include targets and may include indices. |
required |
target_file_path |
Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]
|
file, list of files, directory, list of directories, optional. Use when the dataset files are split to features and target. Each file should include only one column. |
None
|
reader_f |
Callable
|
pandas like read method, optional, default pandas read_csv. For example, to read excel files use pandas read_excel. |
read_csv
|
reader_kwargs |
dict
|
dict, optional. Keyword arguments used when calling reader_f method. |
None
|
reader_chunk_size_param_name |
str
|
str, optional. The name of the reader_f input parameter used for reading a file in chunks. When not provided, the service tries to infer it automatically. Based on the data, the service decides on the optimal chunk size to read and passes it through this parameter when calling reader_f. Use "ignore" to skip the automatic chunk reading logic. |
None
|
chunk_size |
int
|
int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory). |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
n_jobs |
int
|
Default: number of CPUs. Number of jobs to run in parallel during build. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
self |
cross_validate ¤
cross_validate(level=None, model=None, scoring=None, return_model=False, verbose=0, preprocessing_stage='auto', **model_params)
Method for cross-validation on the coreset tree. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int
|
int, optional. The level of the tree on which the training and validation will be performed. If None, the best level will be selected. |
None
|
model |
Any
|
A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. The model class needs to implement the usual scikit-learn interface. Default: instantiate the service model class using input model_params. |
None
|
scoring |
Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]
|
callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used. |
None
|
return_model |
bool
|
bool, optional. If True, the trained model is also returned. |
False
|
verbose |
int
|
int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of folds and hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score is also displayed. |
0
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'auto'
|
model_params |
kwargs, optional. The hyper-parameters of the model. If not provided, the default values are used. |
{}
|
Returns:
Type | Description |
---|---|
Union[List[float], Tuple[List[float], List[BaseEstimator]]]
|
A list of scores, one for each fold. If return_model=True, a list of trained models is also returned (one model for each fold). |
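A callable passed as scoring must accept (model, X, y) and return a scalar. The sketch below shows one possible scorer for a clustering model, the negative sum of squared distances to the nearest center. `FixedCenters` is a hypothetical stand-in for a fitted model; it only exposes `cluster_centers_`, the attribute name scikit-learn's KMeans uses.

```python
class FixedCenters:
    """Hypothetical stand-in for a fitted clustering model."""
    def __init__(self, centers):
        self.cluster_centers_ = centers

def neg_inertia(model, X, y=None):
    """Scorer with the (model, X, y) signature; higher (closer to 0) is better."""
    total = 0.0
    for row in X:
        # Squared distance from this row to its nearest cluster center.
        total += min(
            sum((a - b) ** 2 for a, b in zip(row, center))
            for center in model.cluster_centers_
        )
    return -total

model = FixedCenters([(0.0, 0.0), (10.0, 10.0)])
X = [(0.0, 1.0), (10.0, 9.0)]
score = neg_inertia(model, X)
```

The sign is flipped because scoring functions are conventionally maximized, while inertia is a cost.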
filter_out_samples ¤
filter_out_samples(filter_function, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)
Remove samples from the coreset tree, based on the provided filter function. The coreset tree is automatically updated to accommodate the changes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filter_function |
Callable[[Iterable, Iterable, Union[Iterable, None], Union[Iterable, None]], Iterable[Any]]
|
function, optional. A function that returns a list of indices to be removed from the tree. The function should accept 4 parameters as input: indices, X, y, props and return a list (or iterator) of indices to be removed from the coreset tree. For example, in order to remove all instances with a target equal to 6, use the following function: filter_function = lambda indices, X, y, props: indices[y == 6]. |
required |
force_resample_all |
Optional[int]
|
int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_sensitivity_recalc |
Optional[int]
|
int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - If self.chunk_sample_ratio<1 - one level above leaf node level. If self.chunk_sample_ratio=1 - leaf level. 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_do_nothing |
Optional[bool]
|
bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called. |
False
|
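A filter function with the four-argument contract described for filter_function can also be written as a named function. The sketch below is a plain-Python equivalent of the "target equal to 6" example and works on any iterables; the sample values are made up for illustration.

```python
def remove_target_6(indices, X, y, props):
    """Return the indices whose target equals 6 (requires y to be provided)."""
    return [i for i, target in zip(indices, y) if target == 6]

# Made-up inputs: four samples, two of which have target 6.
indices = [100, 101, 102, 103]
y = [6, 2, 6, 9]
to_remove = remove_target_6(indices, X=None, y=y, props=None)
```

X and props are accepted but unused here; a filter on properties would inspect props in the same way.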
fit ¤
Fit a model on the coreset tree. This model will be used when predict and predict_proba are called. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int
|
Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used. |
0
|
seq_from |
Any
|
string/datetime, optional. The starting sequence of the training set. |
None
|
seq_to |
Any
|
string/datetime, optional. The ending sequence of the training set. |
None
|
model |
Any
|
A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. Default: instantiate the service model class using input model_params. |
None
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'auto'
|
model_params |
Model hyperparameters kwargs. Input when instantiating default model class. |
{}
|
Returns:
Type | Description |
---|---|
Fitted estimator. |
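The level description implies a rough rule of thumb for how many samples a given level yields. The sketch below encodes it under the assumption that the "around twice" growth continues at every level; actual sizes are approximate and depend on the tree.

```python
def approx_coreset_samples(coreset_size, level, max_level):
    """Estimate the sample count at a level, assuming it doubles per level.
    Levels beyond max_level fall back to max_level, as described above."""
    effective_level = min(level, max_level)
    return coreset_size * 2 ** effective_level

# Hypothetical tree with coreset_size=1000 and a maximal level of 3.
sizes = [approx_coreset_samples(1000, lvl, max_level=3) for lvl in range(6)]
```

Higher levels therefore trade longer training time for a coreset that represents the data with more samples.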
get_cleaning_samples ¤
get_cleaning_samples(size=None, ignore_indices=None, select_from_indices=None, select_from_function=None, ignore_seen_samples=True)
Returns indices of samples in descending order of importance. Useful for identifying mislabeled instances and other anomalies in the data. size must be provided. Function must be called after build. This function is only applicable in case the coreset tree was optimized_for 'cleaning'. This function is not for retrieving the coreset (use get_coreset in this case).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
size |
int
|
int, required. Number of samples to return. |
None
|
ignore_indices |
Iterable
|
array-like, optional. An array of indices to ignore when selecting cleaning samples. |
None
|
select_from_indices |
Iterable
|
array-like, optional. An array of indices to consider when selecting cleaning samples. |
None
|
select_from_function |
Callable[[Iterable, Iterable, Union[Iterable, None], Union[Iterable, None]], Iterable[Any]]
|
function, optional. Pass a function in order to limit the selection of the cleaning samples accordingly. The function should accept 4 parameters as input: indices, X, y, props, and return a list (or iterator) of the desired indices. |
None
|
ignore_seen_samples |
bool
|
bool, optional, default True. Exclude already seen samples and set the seen flag on any indices returned by the function. |
True
|
Returns:
Name | Type | Description |
---|---|---|
Dict |
Union[ValueError, dict]
|
idx: array-like[int]. Cleaning samples indices. X: array-like[int]. X array. y: array-like[int]. y array. importance: array-like[float]. The importance property. Instances that receive a high importance in the Coreset computation require attention, as they usually indicate a labeling error, anomaly, out-of-distribution problem or other data-related issue. |
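The ordering and filtering semantics of get_cleaning_samples can be pictured with a small stand-alone sketch: rank indices by importance, skip ignored ones, and keep the first size entries. The helper and the importance values are hypothetical and do not reflect how the service computes importance.

```python
def top_by_importance(importance, size, ignore_indices=()):
    """importance maps index -> score. Return up to `size` indices,
    most important first, skipping anything in ignore_indices."""
    ignored = set(ignore_indices)
    candidates = [i for i in importance if i not in ignored]
    candidates.sort(key=lambda i: importance[i], reverse=True)
    return candidates[:size]

# Made-up importance scores for four samples.
importance = {10: 0.2, 11: 0.9, 12: 0.5, 13: 0.7}
picked = top_by_importance(importance, size=2, ignore_indices=[11])
```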
get_coreset ¤
get_coreset(level=0, preprocessing_stage='user', sparse_threshold=0.01, as_df=False, with_index=False, seq_from=None, seq_to=None)
Get tree's coreset data in one of the preprocessing_stage(s) in the data preprocessing workflow. Use the level parameter to control the level of the tree from which samples will be returned. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int
|
int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used. |
0
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'user'
|
sparse_threshold |
float
|
float, optional, default 0.01.
Returns the features (X) as a sparse matrix if the data density after preprocessing is below sparse_threshold,
otherwise, will return the data as an array (Applicable only for preprocessing_stage= |
0.01
|
as_df |
bool
|
boolean, optional, default False. When True, returns the data as a pandas DataFrame. Besides the features, the DataFrame will include also the index and target columns. |
False
|
with_index |
bool
|
boolean, optional, default False.
Relevant only when preprocessing_stage= |
False
|
seq_from |
Any
|
string or datetime, optional, default None. The start sequence to filter samples by. |
None
|
seq_to |
Any
|
string or datetime, optional, default None. The end sequence to filter samples by. |
None
|
Returns:
Name | Type | Description |
---|---|---|
Dict |
dict
|
data: A tuple of numpy arrays (indices, X, optional y) or a pandas DataFrame, depending on the as_df parameter. w: A numpy array of sample weights. n_represents: number of instances represented by the coreset. props: A numpy array of sample properties. |
get_coreset_size ¤
Returns the size of the tree's coreset data. Use the level parameter to control the level of the tree from which samples will be returned. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int
|
int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice of the samples compared to level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used. |
0
|
seq_from |
Any
|
string or datetime, optional, default None. The start sequence to filter samples by. |
None
|
seq_to |
Union[str, datetime]
|
string or datetime, optional, default None. The end sequence to filter samples by. |
None
|
Returns:
Name | Type | Description |
---|---|---|
int |
int
|
coreset size |
get_max_level ¤
Return the maximal level of the coreset tree. Level 0 is the head of the tree. Level 1 is the level below the head of the tree, etc.
grid_search ¤
grid_search(param_grid, level=None, validation_method='cross validation', model=None, scoring=None, refit=True, verbose=0, preprocessing_stage='auto', error_score=np.nan, validation_size=0.2, seq_train_from=None, seq_train_to=None, seq_validate_from=None, seq_validate_to=None)
A method for performing hyperparameter selection by grid search, using the coreset tree. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
param_grid |
Union[Dict[str, List], List[Dict[str, List]]]
|
dict or list of dicts. Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings. |
required |
level |
int
|
int, optional. The level of the tree on which the training and validation will be performed. If None, the best level will be selected. |
None
|
validation_method |
str
|
str, optional. Indicates which validation method will be used. The possible values are 'cross validation', 'hold-out validation' and 'seq-dependent validation'. If 'cross validation' is selected, the process involves progressing through folds. We first train and validate all hyperparameter combinations for each fold, before moving on to the subsequent folds. |
'cross validation'
|
model |
Any
|
A Scikit-learn compatible model instance, optional. The model class needs to implement the usual scikit-learn interface. |
None
|
scoring |
Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]
|
callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used. |
None
|
refit |
bool
|
bool, optional. If True, retrain the model on the whole coreset using the best found hyperparameters, and return the model. This model will be used when predict and predict_proba are called. |
True
|
verbose |
int
|
int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of folds and hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score and time for each fold and hyperparameter combination. |
0
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'auto'
|
error_score |
Union[str, float, int]
|
"raise" or numeric, optional. Value to assign to the score if an error occurs in model training. If set to "raise", the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error. |
nan
|
validation_size |
float
|
float, optional, default 0.2. The size of the validation set, as a percentage of the training set size for hold-out validation. |
0.2
|
seq_train_from |
Any
|
Any, optional. The starting sequence of the training set for seq-dependent validation. |
None
|
seq_train_to |
Any
|
Any, optional. The ending sequence of the training set for seq-dependent validation. |
None
|
seq_validate_from |
Any
|
Any, optional. The starting sequence number of the validation set for seq-dependent validation. |
None
|
seq_validate_to |
Any
|
Any, optional. The ending sequence number of the validation set for seq-dependent validation. |
None
|
Returns:
Type | Description |
---|---|
Union[Tuple[Dict, DataFrame, BaseEstimator], Tuple[Dict, DataFrame]]
|
A dict with the best hyperparameters setting, among those provided by the user. The keys are the hyperparameters names, while the dicts' values are the hyperparameters values. A Pandas DataFrame holding the score for each hyperparameter combination and fold. For the 'cross validation' method the average across all folds for each hyperparameter combination is included too. If refit=True, the retrained model is also returned. |
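The param_grid argument accepts either one dict or a list of dicts, each spanning its own grid. The sketch below expands both forms into the individual hyperparameter combinations. It is an independent illustration and not the service's code; n_clusters, n_init and max_iter are used only as familiar scikit-learn KMeans parameter names.

```python
from itertools import product

def expand_grid(param_grid):
    """Expand a dict (or list of dicts) of value lists into all combinations."""
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    combos = []
    for grid in grids:
        names = sorted(grid)
        # Cartesian product of the value lists of this grid only.
        for values in product(*(grid[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos

single = expand_grid({"n_clusters": [4, 8], "n_init": [1, 5]})
several = expand_grid([
    {"n_clusters": [2]},
    {"n_clusters": [3], "max_iter": [100, 300]},
])
```

With a list of dicts, grids are explored separately, so parameters that only make sense together can be kept in their own dictionary.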
holdout_validate ¤
holdout_validate(level=None, validation_size=0.2, model=None, scoring=None, return_model=False, verbose=0, preprocessing_stage='auto', **model_params)
A method for hold-out validation on the coreset tree. The validation set is always the last part of the dataset. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int
|
int, optional. The level of the tree on which the training and validation will be performed. If None, the best level will be selected. |
None
|
validation_size |
float
|
float, optional. The percentage of the dataset that will be used for validating the model. |
0.2
|
model |
Any
|
A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. The model class needs to implement the usual scikit-learn interface. Default: instantiate the service model class using input model_params. |
None
|
scoring |
Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]
|
callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used. |
None
|
return_model |
bool
|
bool, optional. If True, the trained model is also returned. |
False
|
verbose |
int
|
int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score and time for each hyperparameter combination. |
0
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'auto'
|
model_params |
kwargs, optional. The hyper-parameters of the model. If not provided, the default values are used. |
{}
|
Returns:
Type | Description |
---|---|
Union[List[float], Tuple[List[float], List[BaseEstimator]]]
|
The validation score. If return_model=True, the trained model is also returned. |
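Because the validation set is always the last part of the dataset, the split is positional and not random. The sketch below shows that idea on plain row positions; the service itself works on tree nodes, so the rounding and the row-level granularity here are assumptions.

```python
def holdout_split(n_samples, validation_size=0.2):
    """Split positions 0..n_samples-1 so the last fraction is held out."""
    n_val = int(round(n_samples * validation_size))
    split = n_samples - n_val
    train = list(range(split))
    validation = list(range(split, n_samples))
    return train, validation

train, validation = holdout_split(10, validation_size=0.2)
```

A positional split like this keeps the most recently added data for validation, which matters when the data arrives in time order.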
is_dirty ¤
Returns:
Type | Description |
---|---|
bool
|
Indicates whether the coreset tree has nodes marked as dirty, meaning they were affected by any of the methods: remove_samples, update_targets, update_features or filter_out_samples, when they were called with force_do_nothing. |
load classmethod ¤
Restore a service object from a local directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path |
Union[str, PathLike]
|
str, path. Local directory where service data is stored. |
required |
name |
str
|
string, optional, default service class name (lower case). The name prefix of the subdirectory to load. When several subdirectories having the same name prefix are found, the last one, ordered by name, is selected. For example when saving with override=False, the chosen subdirectory is the last saved. |
None
|
data_manager |
DataManagerT
|
DataManagerBase subclass, optional. When specified, the input data manager will be used instead of restoring it from the saved configuration. |
None
|
load_buffer |
bool
|
boolean, optional, default True. If set, load saved buffer (a partial node of the tree) from disk and add it to the tree. |
True
|
working_directory |
Union[str, PathLike]
|
str, path, optional, default use working_directory from saved configuration. Local directory where intermediate data is stored. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
CoresetTreeService object |
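The name parameter selects a subdirectory by prefix and, when several match, takes the last one in name order. The sketch below mimics that selection rule on a list of names; the helper and the directory names are hypothetical, and the real naming scheme of saved services is not documented here.

```python
def pick_saved_dir(subdir_names, name):
    """Return the last subdirectory, ordered by name, that starts with `name`."""
    matches = sorted(d for d in subdir_names if d.startswith(name))
    if not matches:
        raise FileNotFoundError(f"no saved service with prefix {name!r}")
    return matches[-1]

# Hypothetical subdirectory names under dir_path.
subdirs = [
    "coresettreeservicekmeans_20240101",
    "coresettreeservicekmeans_20240305",
    "other_service_20240401",
]
chosen = pick_saved_dir(subdirs, "coresettreeservicekmeans")
```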
partial_build ¤
partial_build(X, y=None, indices=None, props=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None)
Add new samples to a coreset tree from parameters X, y, indices and props (properties). Categorical features are automatically one-hot encoded and missing values are automatically handled. The target will be ignored when the Coreset is built.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X |
Union[Iterable, Iterable[Iterable]]
|
array like or iterator of arrays like. An array or an iterator of features. Categorical features are automatically one-hot encoded and missing values are automatically handled. |
required |
y |
Union[Iterable[Any], Iterable[Iterable[Any]]]
|
array like or iterator of arrays like, optional. An array or an iterator of targets. The target will be ignored when the Coreset is built. |
None
|
indices |
Union[Iterable[Any], Iterable[Iterable[Any]]]
|
array like or iterator of arrays like, optional. An array or an iterator with indices of X. |
None
|
props |
Union[Iterable[Any], Iterable[Iterable[Any]]]
|
array like or iterator of arrays like, optional. An array or an iterator of properties. Properties won’t be used to compute the Coreset or train the model, but it is possible to filter_out_samples on them or to pass them in the select_from_function of get_cleaning_samples. |
None
|
chunk_size |
int
|
int, optional. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory). |
False
|
n_jobs |
int
|
Default: number of CPUs. Number of jobs to run in parallel during build. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
self |
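The effect of chunk_size and chunk_by on how incoming data is grouped into nodes can be pictured with a small sketch. It is a simplified illustration under assumptions, not the service's implementation: split_into_chunks and split_by_key are hypothetical helpers, and the leftover "buffer" models the partial node mentioned in the load and save descriptions.

```python
def split_into_chunks(batches, chunk_size):
    """Group rows from incoming batches into full chunks of `chunk_size` rows.

    Returns (full_chunks, buffer). The buffer holds leftover rows that do not
    yet fill a chunk. With chunk_size=0, every input batch becomes one chunk.
    """
    if chunk_size == 0:
        return [list(batch) for batch in batches], []
    chunks, buffer = [], []
    for batch in batches:
        for row in batch:
            buffer.append(row)
            if len(buffer) == chunk_size:
                chunks.append(buffer)
                buffer = []
    return chunks, buffer


def split_by_key(rows, key):
    """Group rows by key(row), keeping first-seen key order (chunk_by analogue)."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return list(groups.values())


batches = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks, leftover = split_into_chunks(batches, chunk_size=4)
per_batch, _ = split_into_chunks(batches, chunk_size=0)
by_parity = split_by_key(range(1, 7), key=lambda r: r % 2)
```

Here two full chunks of four rows are produced and one row stays in the buffer until more data arrives; with chunk_size=0 the three input batches map directly to three chunks.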
partial_build_from_df ¤
partial_build_from_df(datasets, target_datasets=None, *, chunk_size=None, chunk_by=None, copy=False, n_jobs=None)
Add new samples to a coreset tree based on the pandas DataFrame iterator. Categorical features are automatically one-hot encoded and missing values are automatically handled. The target will be ignored when the Coreset is built.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets |
Union[Iterator[DataFrame], DataFrame]
|
pandas DataFrame or a DataFrame iterator. Data includes features, may include targets and may include indices. |
required |
target_datasets |
Union[Iterator[DataFrame], DataFrame]
|
pandas DataFrame or a DataFrame iterator, optional. Use when data is split to features and target. Should include only one column. |
None
|
chunk_size |
int
|
int, optional, default previously used chunk_size. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function and functions such as update_targets or update_features. True - Data is copied before processing (impacts memory). |
False
|
n_jobs |
int
|
Default: number of CPUs. Number of jobs to run in parallel during build. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
self |
partial_build_from_file ¤
partial_build_from_file(file_path, target_file_path=None, *, reader_f=pd.read_csv, reader_kwargs=None, reader_chunk_size_param_name=None, chunk_size=None, chunk_by=None, n_jobs=None)
Add new samples to a coreset tree based on data taken from local storage. Categorical features are automatically one-hot encoded and missing values are automatically handled. The target will be ignored when the Coreset is built.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]
|
file, list of files, directory, list of directories. Path(s) to the place where data is stored. Data includes features, may include targets and may include indices. |
required |
target_file_path |
Union[Union[str, PathLike], Iterable[Union[str, PathLike]]]
|
file, list of files, directory, list of directories, optional. Use when files are split to features and target. Each file should include only one column. |
None
|
reader_f |
Callable
|
pandas like read method, optional, default pandas read_csv. For example, to read excel files use pandas read_excel. |
read_csv
|
reader_kwargs |
dict
|
dict, optional. Keyword arguments used when calling reader_f method. |
None
|
reader_chunk_size_param_name |
str
|
str, optional. The name of the reader_f input parameter used for reading a file in chunks. When not provided, we'll try to figure it out ourselves. Based on the data, we decide on the optimal chunk size to read and use this parameter as input when calling reader_f. Use "ignore" to skip the automatic chunk reading logic. |
None
|
chunk_size |
int
|
int, optional, default previously used chunk_size. The number of instances used when creating a coreset node in the tree. chunk_size=0: Nodes are created based on input chunks. |
None
|
chunk_by |
Union[Callable, str, list]
|
function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored. |
None
|
n_jobs |
int
|
Default: number of CPUs. Number of jobs to run in parallel during build. |
None
|
Returns:
Type | Description |
---|---|
CoresetTreeService
|
self |
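How reader_f, reader_kwargs and reader_chunk_size_param_name fit together can be sketched with the standard library alone. Everything below is an assumption made for illustration: csv_reader stands in for a pandas-like reader, read_in_chunks is a hypothetical wrapper, and the automatic detection of the chunk parameter is not modelled (None is treated like "ignore" here).

```python
import csv
import tempfile
from pathlib import Path


def csv_reader(path, chunksize=None, **kwargs):
    """Stand-in for a pandas-like reader: yields lists of row dicts."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh, **kwargs))
    step = chunksize or len(rows) or 1
    for start in range(0, len(rows), step):
        yield rows[start:start + step]


def read_in_chunks(reader_f, file_path, reader_kwargs=None,
                   reader_chunk_size_param_name=None, rows_per_read=None):
    """Call reader_f, passing the chunk size under the given parameter name."""
    kwargs = dict(reader_kwargs or {})
    # Simplification: None behaves like "ignore"; no auto-detection here.
    if reader_chunk_size_param_name not in (None, "ignore") and rows_per_read:
        kwargs[reader_chunk_size_param_name] = rows_per_read
    return reader_f(file_path, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.csv"
    path.write_text("a,b\n1,2\n3,4\n5,6\n")
    chunked = list(read_in_chunks(csv_reader, path, {"delimiter": ","}, "chunksize", 2))
    whole = list(read_in_chunks(csv_reader, path, None, "ignore", 2))

chunk_lengths = [len(chunk) for chunk in chunked]
```

The same pattern applies to any reader whose chunking argument has a different name: only the string passed as reader_chunk_size_param_name changes.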
plot ¤
Produce a tree graph plot and save figure as a local png file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path |
Union[str, PathLike]
|
string or PathLike. Path to save the plot figure in; if not provided, or if it isn't valid or doesn't exist, the figure will be saved in the current directory (from which this method is called). |
None
|
selected_trees |
dict
|
dict, optional. A dictionary containing the names of the image file(s) to be generated. |
None
|
Returns:
Type | Description |
---|---|
Path
|
Image file path |
predict ¤
Run prediction on the trained model. This function is only applicable in case the coreset tree was optimized_for 'training' and in case fit() or grid_search(refit=True) were called before. The function automatically preprocesses the data according to the preprocessing_stage used to train the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X |
Union[Iterable, Iterable[Iterable]]
|
An array of features. |
required |
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function. True - Data is copied before processing (impacts memory). |
False
|
Returns:
Type | Description |
---|---|
Model prediction results. |
predict_proba ¤
Run prediction on the trained model. This function is only applicable in case the coreset tree was optimized_for 'training' and in case fit() or grid_search(refit=True) were called before. The function automatically preprocesses the data according to the preprocessing_stage used to train the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X |
Union[Iterable, Iterable[Iterable]]
|
An array of features. |
required |
copy |
bool
|
boolean, default False. False (default) - Input data might be updated as result of this function. True - Data is copied before processing (impacts memory). |
False
|
Returns:
Type | Description |
---|---|
Returns the probability of the sample for each class in the model. |
print ¤
Print the tree's string representation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
selected_tree |
str
|
string, optional. Which tree to print. Defaults to printing all. |
None
|
remove_samples ¤
remove_samples(indices, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)
Remove samples from the coreset tree. The coreset tree is automatically updated to accommodate the changes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices |
Iterable
|
array-like. An array of indices to be removed from the coreset tree. |
required |
force_resample_all |
Optional[int]
|
int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_sensitivity_recalc |
Optional[int]
|
int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - one level above the leaf level if self.chunk_sample_ratio<1, the leaf level if self.chunk_sample_ratio=1 (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_do_nothing |
Optional[bool]
|
bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called. |
False
|
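The level convention shared by force_resample_all and force_sensitivity_recalc can be written down as a small sketch. The helpers resolve_level and default_sensitivity_level are hypothetical, tree_depth stands for len(tree), and raising ValueError for out-of-range levels is an assumption rather than documented behaviour.

```python
from typing import Optional


def resolve_level(level: Optional[int], tree_depth: int) -> Optional[int]:
    """Map a force_* argument to a concrete level (0 = head, tree_depth - 1 = leaf)."""
    if level is None:
        return None  # nothing is forced
    if level == -1:
        return tree_depth - 1  # -1 is an alias for the leaf level
    if not 0 <= level < tree_depth:
        # Assumption: the real service may handle invalid levels differently.
        raise ValueError(f"level must be -1 or in [0, {tree_depth - 1}]")
    return level


def default_sensitivity_level(chunk_sample_ratio: float, tree_depth: int) -> int:
    """Level used when force_sensitivity_recalc is None, per the description above."""
    leaf = tree_depth - 1
    if chunk_sample_ratio < 1:
        return max(leaf - 1, 0)  # one level above the leaf level
    return leaf


leaf_alias = resolve_level(-1, tree_depth=4)
head = resolve_level(0, tree_depth=4)
not_forced = resolve_level(None, tree_depth=4)
sampled_default = default_sensitivity_level(0.5, tree_depth=4)
full_default = default_sensitivity_level(1.0, tree_depth=4)
```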
save ¤
Save service configuration and relevant data to a local directory. Use this method when the service needs to be restored.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dir_path |
Union[str, PathLike]
|
string or PathLike, optional, default self.working_directory. A local directory for saving service's files. |
None
|
name |
str
|
string, optional, default service class name (lower case). Name of the subdirectory where the data will be stored. |
None
|
save_buffer |
bool
|
boolean, default True. Save also the data in the buffer (a partial node of the tree) along with the rest of the saved data. |
True
|
override |
bool
|
bool, optional, default False. False: add a timestamp suffix so each save won’t override the previous ones. True: The existing subdirectory with the provided name is overridden. |
False
|
allow_pickle |
bool
|
bool, optional, default True. True: Saves the Coreset tree in pickle format (much faster). False: Saves the Coreset tree in JSON format. |
True
|
Returns:
Type | Description |
---|---|
Path
|
Save directory path. |
save_coreset ¤
Get the coreset from the tree and save it to a file. Use the level parameter to control the level of the tree from which samples will be returned. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path |
Union[str, PathLike]
|
string or PathLike. Local file path to store the coreset. |
required |
level |
int
|
int, optional, default 0. Defines the depth level of the tree from which the coreset is extracted. Level 0 returns the coreset from the head of the tree with around coreset_size samples. Level 1 returns the coreset from the level below the head of the tree with around twice as many samples as level 0, etc. If the passed level is greater than the maximal level of the tree, the maximal available level is used. |
0
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'user'
|
with_index |
bool
|
boolean, optional, default False.
Relevant only when preprocessing_stage= |
False
|
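The relationship between level and the approximate number of samples returned can be sketched as simple arithmetic. This assumes the doubling described for level 1 continues at every deeper level, which is an extrapolation of the "etc." in the level description; approx_coreset_samples is a hypothetical helper, and the real counts are only approximate.

```python
def approx_coreset_samples(coreset_size: int, level: int, max_level: int) -> int:
    """Rough sample count at a tree level, assuming each level doubles the previous one."""
    effective_level = min(level, max_level)  # levels past the maximum are clamped
    return coreset_size * 2 ** effective_level


head_samples = approx_coreset_samples(1000, level=0, max_level=3)
level_one_samples = approx_coreset_samples(1000, level=1, max_level=3)
clamped_samples = approx_coreset_samples(1000, level=5, max_level=3)
```

Deeper levels trade a larger output file for a coreset that follows the full dataset more closely.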
seq_dependent_validate ¤
seq_dependent_validate(level=None, seq_train_from=None, seq_train_to=None, seq_validate_from=None, seq_validate_to=None, model=None, scoring=None, return_model=False, verbose=0, preprocessing_stage='auto', **model_params)
This method trains and validates on a subset of the Coreset tree, according to the seq_column defined in the DataParams structure passed to the init. This function is only applicable in case the coreset tree was optimized_for 'training'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
level |
int
|
int, optional. The level of the tree from which the search for the best matching nodes starts. Nodes closer to the leaf level than the specified level may be selected to better match the provided seq parameters. If None, the search starts from level 0, the head of the tree. If None, the best level will be selected. |
None
|
seq_train_from |
Any
|
Any, optional. The starting sequence of the training set. |
None
|
seq_train_to |
Any
|
Any, optional. The ending sequence of the training set. |
None
|
seq_validate_from |
Any
|
Any, optional. The starting sequence number of the validation set. |
None
|
seq_validate_to |
Any
|
Any, optional. The ending sequence number of the validation set. |
None
|
model |
Any
|
A Scikit-learn compatible model instance, optional. When provided, model_params are not relevant. The model class needs to implement the usual scikit-learn interface. Default: instantiate the service model class using input model_params. |
None
|
scoring |
Union[str, Callable[[BaseEstimator, ndarray, ndarray], float]]
|
callable or string, optional. If it is a callable object, it must return a scalar score. The signature of the call is (model, X, y), where model is the ML model to be evaluated, X is the data and y is the ground truth labeling. For example, it can be produced using sklearn.metrics.make_scorer. If it is a string, it must be a valid name of a Scikit-learn scoring method. If None, the default scorer of the current model is used. |
None
|
return_model |
bool
|
bool, optional. If True, the trained model is also returned. |
False
|
verbose |
int
|
int, optional. Controls the verbosity: the higher, the more messages. >=1 : The number of hyperparameter combinations to process at the start and the time it took, best hyperparameters found and their score at the end. >=2 : The score and time for each hyperparameter combination. |
0
|
preprocessing_stage |
Union[str, None]
|
string, optional, default |
'auto'
|
model_params |
kwargs, optional. The hyper-parameters of the model. If not provided, the default values are used. |
{}
|
Returns:
Type | Description |
---|---|
Union[List[float], Tuple[List[float], List[BaseEstimator]]]
|
The validation score. If return_model=True, the trained model is also returned. |
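A callable passed as scoring must accept (model, X, y) and return a scalar. The sketch below shows one possible shape for a clustering scorer, using only the standard library. FixedCentersModel is a toy stand-in for a fitted model that exposes cluster_centers_ (an assumption for the example), and the scorer ignores y because clustering quality is measured here without labels.

```python
class FixedCentersModel:
    """Toy stand-in for a fitted clustering model (hypothetical)."""

    def __init__(self, centers):
        self.cluster_centers_ = centers


def neg_inertia_scorer(model, X, y=None) -> float:
    """Scorer with the (model, X, y) signature; higher is better.

    Returns the negated sum of squared distances from each row of X
    to its nearest cluster center.
    """
    total = 0.0
    for row in X:
        total += min(
            sum((a - b) ** 2 for a, b in zip(row, center))
            for center in model.cluster_centers_
        )
    return -total


model = FixedCentersModel([[0.0, 0.0], [10.0, 10.0]])
X = [[0.0, 1.0], [10.0, 9.0], [1.0, 0.0]]
score = neg_inertia_scorer(model, X)
```

Negating the inertia keeps the usual convention that a larger score means a better model.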
set_model_cls ¤
Set the model class used to train the model on the coreset, in case a specific model instance wasn't passed to fit or the validation methods.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_cls |
Any
|
A Scikit-learn compatible model class. |
required |
set_seen_indication ¤
Set samples as 'seen' or 'unseen'. Not providing an indices list defaults to setting the flag on all samples. This function is only applicable in case the coreset tree was optimized_for 'cleaning'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seen_flag |
bool
|
bool, optional, default True. Set 'seen' or 'unseen' flag |
True
|
indices |
Iterable
|
array like, optional. Set flag only for the provided list of indices. Defaults to all indices. |
None
|
update_dirty ¤
Calculate the sensitivity and resample the nodes that were marked as dirty, meaning they were affected by any of the methods remove_samples, update_targets, update_features or filter_out_samples when these were called with force_do_nothing=True.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
force_resample_all |
Optional[int]
|
int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_sensitivity_recalc |
Optional[int]
|
int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - one level above the leaf level if self.chunk_sample_ratio<1, the leaf level if self.chunk_sample_ratio=1 (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
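The interplay between force_do_nothing and update_dirty follows a deferred-update (dirty flag) pattern, which the toy class below illustrates. ToyTree is purely hypothetical: it tracks node ids instead of real coreset nodes and records which nodes would be recalculated, so it shows the control flow only.

```python
class ToyTree:
    """Toy model of deferred updates; not the library's data structure."""

    def __init__(self):
        self.dirty = set()       # nodes waiting for recalculation
        self.recalculated = []   # nodes processed so far, in order

    def remove_samples(self, affected_nodes, force_do_nothing=False):
        """Mark the affected nodes dirty; process them now unless deferred."""
        self.dirty.update(affected_nodes)
        if not force_do_nothing:
            self.update_dirty()

    def update_dirty(self):
        """Recalculate every dirty node once, then clear the dirty set."""
        self.recalculated.extend(sorted(self.dirty))
        self.dirty.clear()


tree = ToyTree()
tree.remove_samples([2], force_do_nothing=True)
tree.remove_samples([5, 2], force_do_nothing=True)
pending = sorted(tree.dirty)
tree.update_dirty()
```

Batching several edits this way means a node touched by multiple calls is recalculated once rather than after every call.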
update_features ¤
update_features(indices, X, feature_names=None, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)
Update the features for selected samples on the coreset tree. The coreset tree is automatically updated to accommodate the changes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices |
Iterable
|
array-like. An array of indices to be updated. |
required |
X |
Iterable
|
array-like. An array of features. Should have the same length as indices. |
required |
feature_names |
Iterable[str]
|
If the number of features in X differs from the number of features in the original coreset, this parameter should contain the list of names of the passed features. |
None
|
force_resample_all |
Optional[int]
|
int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_sensitivity_recalc |
Optional[int]
|
int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - one level above the leaf level if self.chunk_sample_ratio<1, the leaf level if self.chunk_sample_ratio=1 (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_do_nothing |
Optional[bool]
|
bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called. |
False
|
update_targets ¤
update_targets(indices, y, force_resample_all=None, force_sensitivity_recalc=None, force_do_nothing=False)
Update the targets for selected samples on the coreset tree. The coreset tree is automatically updated to accommodate the changes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices |
Iterable
|
array-like. An array of indices to be updated. |
required |
y |
Iterable
|
array-like. An array of classes/labels. Should have the same length as indices. |
required |
force_resample_all |
Optional[int]
|
int, optional. Force full resampling of the affected nodes in the coreset tree, starting from level=force_resample_all. None - Do not force_resample_all (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_sensitivity_recalc |
Optional[int]
|
int, optional. Force the recalculation of the sensitivity and partial resampling of the affected nodes, based on the coreset's quality, starting from level=force_sensitivity_recalc. None - one level above the leaf level if self.chunk_sample_ratio<1, the leaf level if self.chunk_sample_ratio=1 (default), 0 - The head of the tree, 1 - The level below the head of the tree, len(tree)-1 = leaf level, -1 - same as leaf level. |
None
|
force_do_nothing |
Optional[bool]
|
bool, optional, default False. When set to True, suppresses any update to the coreset tree until update_dirty is called. |
False
|