CoresetTreeServiceKMeans
CoresetTreeServiceKMeans(
*,
data_manager=None,
data_params=None,
n_instances=None,
max_memory_gb=None,
optimized_for,
chunk_size=None,
chunk_by=None,
k=8,
coreset_size=None,
coreset_params=None,
working_directory=None,
cache_dir=None,
node_train_function=None,
node_train_function_params=None,
node_metadata_func=None,
chunk_sample_ratio=None,
model_cls=None
)
Bases: CoresetTreeServiceUnsupervisedMixin, CoresetTreeService
Subclass of CoresetTreeService for KMeans. A service class for creating a coreset tree and working with it. optimized_for is a required parameter defining the main usage of the service: 'training', 'cleaning' or both, optimized_for=['training', 'cleaning']. The service decides whether to build an actual Coreset Tree or a single Coreset over the entire dataset, based on the triplet: n_instances, max_memory_gb and the number of features (deduced from the dataset). The chunk_size and coreset_size are deduced from the same triplet. If chunk_size and coreset_size are provided, they override all of the above-mentioned parameters (less recommended).
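For illustration, a minimal construction sketch; the dataheroes import path and the build(X) method are assumptions based on the shared CoresetTreeService API, not something defined in this section:

```python
# A minimal construction sketch. The `dataheroes` import path and the
# build(X) method are assumptions based on the shared CoresetTreeService
# API, not something defined in this section.
import numpy as np
from dataheroes import CoresetTreeServiceKMeans

X = np.random.rand(1_000_000, 20)  # 1M unlabeled instances, 20 features

# Provide only the required inputs and let the service deduce
# chunk_size and coreset_size from the triplet.
service = CoresetTreeServiceKMeans(
    optimized_for='training',   # required parameter
    n_instances=1_000_000,      # estimate of the total number of instances
)
service.build(X)

# Less recommended: override the deduced values explicitly.
service_explicit = CoresetTreeServiceKMeans(
    optimized_for='training',
    chunk_size=100_000,
    coreset_size=20_000,  # capped at 60% of chunk_size
)
```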
When fitting KMeans on the Coreset, it is highly recommended to use the built-in fit function of the CoresetTreeServiceKMeans class. Sklearn uses k-means++ as its default initialization method. While sklearn's KMeans implementation supports sample_weight, its kmeans_plusplus implementation does not. When building the Coreset, samples are selected and weights are assigned to them; ignoring these weights would significantly degrade the quality of the results. The fit implementation of CoresetTreeServiceKMeans solves this problem by extending kmeans_plusplus to accept sample_weight.
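Continuing the sketch above, a hedged example of fitting through the service; the assumption that fit forwards keyword arguments such as n_clusters to the underlying model class is not confirmed by this section:

```python
# Fit on the coreset through the service, so the sample weights reach
# the k-means++ initialization. Forwarding of keyword arguments
# (e.g. n_clusters) to the model class is assumed.
model = service.fit(n_clusters=8)
print(model.cluster_centers_)  # the returned model is sklearn-compatible
```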
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_manager | DataManagerT | DataManagerBase subclass, optional. The class used to interact with the provided data and store it locally. By default, only the sampled data is stored in HDF5 file format. | None |
| data_params | Union[DataParams, dict] | DataParams, optional. Data preprocessing information. | None |
| n_instances | int | int. The total number of instances that are going to be processed (can be an estimation). This parameter is required and is the only member of the above-mentioned triplet that isn't deduced from the data. | None |
| max_memory_gb | int | int, optional. The maximum memory in GB that should be used. When not provided, the server's total memory is used. In either case, only 80% of the provided memory or of the server's total memory is considered. | None |
| optimized_for | Union[list, str] | str or list. Either 'training', 'cleaning' or both: ['training', 'cleaning']. The main usage of the service. | required |
| k | int | int, default=8. Only relevant when the tree is optimized for cleaning. The number of clusters to form as well as the number of centroids to generate. | 8 |
| chunk_size | Union[dict, int] | int, optional. The number of instances to be used when creating a coreset node in the tree. When defined, it overrides the parameters optimized_for, n_instances and max_memory_gb. chunk_size=0: nodes are created based on the input chunks. chunk_size=-1: forces the service to create a single coreset from the entire dataset (if it fits into memory). | None |
| chunk_by | Union[Callable, str, list] | function, label, or list of labels, optional. Split the data according to the provided key. When provided, the chunk_size input is ignored. | None |
| coreset_size | Union[int, float, dict] | int or float, optional. The coreset size of each node in the coreset tree. If provided as a float, it represents the ratio between each chunk and the resulting coreset. In any case, the coreset_size is limited to 60% of the chunk_size. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, in practice the actual size of the coreset is usually smaller than coreset_size. | None |
| coreset_params | Union[CoresetParams, dict] | CoresetParams or dict, optional. Coreset algorithm specific parameters. | None |
| node_train_function | Callable[[ndarray, ndarray, ndarray], Any] | Callable, optional. A method for training a model at the tree node level. | None |
| node_train_function_params | dict | dict, optional. kwargs to be used when calling node_train_function. | None |
| node_metadata_func | Callable[[Tuple[ndarray], ndarray, Union[list, None]], Union[list, dict, None]] | callable, optional. A method for storing user metadata on each node. | None |
| working_directory | Union[str, PathLike] | str, path, optional. Local directory where intermediate data is stored. | None |
| cache_dir | Union[str, PathLike] | str, path, optional. For internal use when loading a saved service. | None |
| chunk_sample_ratio | float | float, optional. The size of the sample that will be taken and saved from each chunk, on top of the Coreset, for use by the validation methods. Values are in the range [0, 1]. For example, chunk_sample_ratio=0.5 means that 50% of the data instances from each chunk will be saved. | None |
| model_cls | Any | A Scikit-learn compatible model class, optional. The model class used to train the model on the coreset, in case a specific model instance wasn't passed to fit or the validation methods. The default model class is sklearn's KMeans, with our extension of kmeans_plusplus to support sample_weight. | None |
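As an illustration of the model_cls parameter, a hedged sketch swapping in sklearn's MiniBatchKMeans, whose fit method accepts sample_weight (how its initialization treats the weights depends on the scikit-learn version, unlike the service's default extended KMeans):

```python
# A hedged sketch of the model_cls parameter: swap in sklearn's
# MiniBatchKMeans, whose fit() accepts sample_weight. Whether its
# k-means++ initialization also honors the weights depends on the
# scikit-learn version, unlike the service's default extended KMeans.
from sklearn.cluster import MiniBatchKMeans
from dataheroes import CoresetTreeServiceKMeans  # assumed import path

service_mbk = CoresetTreeServiceKMeans(
    optimized_for='training',
    n_instances=1_000_000,
    model_cls=MiniBatchKMeans,
)
```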