CoresetTreeServicePCA

CoresetTreeServicePCA(
    *,
    data_manager=None,
    data_params=None,
    n_instances=None,
    max_memory_gb=None,
    optimized_for,
    chunk_size=None,
    chunk_by=None,
    coreset_size=None,
    coreset_params=None,
    working_directory=None,
    cache_dir=None,
    node_train_function=None,
    node_train_function_params=None,
    node_metadata_func=None,
    chunk_sample_ratio=None,
    model_cls=None
)

Bases: CoresetTreeServiceUnsupervisedMixin, CoresetTreeService

Subclass of CoresetTreeService for PCA. A service class for creating a coreset tree and working with it. optimized_for is a required parameter defining the main usage of the service: 'training', 'cleaning' or both, optimized_for=['training', 'cleaning']. The service decides whether to build an actual Coreset Tree or a single Coreset over the entire dataset, based on the triplet: n_instances, max_memory_gb and the number of features (deduced from the dataset). The chunk_size and coreset_size are deduced from the same triplet. If chunk_size and coreset_size are provided explicitly, they override all the above mentioned parameters (less recommended).
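For example, a minimal instantiation looks as follows (a sketch only; the values are illustrative, and the import assumes the class is exposed by the dataheroes package):

from dataheroes import CoresetTreeServicePCA

service = CoresetTreeServicePCA(
    optimized_for='training',   # required: 'training', 'cleaning' or both
    n_instances=1_000_000,      # estimated total number of instances
)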

When building the Coreset, samples are selected and weights are assigned to them, so it is important to use functions that accept sample_weight. Sklearn's PCA implementation does not accept sample_weight; it is therefore highly recommended to use the built-in fit or fit_transform functions of the CoresetTreeServicePCA class, as they were extended to receive sample_weight.
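As a hedged sketch of this flow (the build_from_file call, its argument, and the forwarding of n_components to the underlying model are assumptions based on the wider service API; 'data.csv' is a placeholder):

from dataheroes import CoresetTreeServicePCA

service = CoresetTreeServicePCA(optimized_for='training', n_instances=500_000)

# Build the coreset tree from a file (hypothetical path).
service.build_from_file('data.csv')

# fit trains the default weighted PCA model on the coreset,
# passing the coreset's sample_weight internally.
model = service.fit(n_components=2)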

Parameters:

data_manager : DataManagerT, optional
    A DataManagerBase subclass used to interact with the provided data and store it locally. By default, only the sampled data is stored, in HDF5 file format.

data_params : Union[DataParams, dict], optional
    Data preprocessing information.

n_instances : int
    The total number of instances that are going to be processed (can be an estimation). This parameter is required and is the only one of the above mentioned triplet that is not deduced from the data.

max_memory_gb : int, optional
    The maximum memory in GB that should be used. When not provided, the server's total memory is used. In either case, only 80% of the provided memory or of the server's total memory is considered.

optimized_for : Union[list, str], required
    Either 'training', 'cleaning' or both, ['training', 'cleaning']. The main usage of the service.

chunk_size : Union[dict, int], optional
    The number of instances to be used when creating a coreset node in the tree. When defined, it overrides the parameters optimized_for, n_instances and max_memory_gb. chunk_size=0: nodes are created based on input chunks. chunk_size=-1: forces the service to create a single coreset from the entire dataset (if it fits into memory). See the configuration sketch after this list.

chunk_by : Union[Callable, str, list], optional
    A function, label, or list of labels. Splits the data according to the provided key. When provided, chunk_size is ignored.

coreset_size : Union[int, float, dict], optional
    The coreset size of each node in the coreset tree. When provided as a float, it represents the ratio between each chunk and the resulting coreset. In any case, the coreset_size is limited to 60% of the chunk_size. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, in practice the actual size of the coreset is usually smaller than coreset_size.

coreset_params : Union[CoresetParams, dict], optional
    Coreset algorithm specific parameters.

node_train_function : Callable[[ndarray, ndarray, ndarray], Any], optional
    A method for training a model at the tree node level.

node_train_function_params : dict, optional
    kwargs to be used when calling node_train_function.

node_metadata_func : Callable[[Tuple[ndarray], ndarray, Union[list, None]], Union[list, dict, None]], optional
    A method for storing user metadata on each node. See the callable sketch after this list.

working_directory : Union[str, PathLike], optional
    Local directory where intermediate data is stored.

cache_dir : Union[str, PathLike], optional
    For internal use when loading a saved service.

chunk_sample_ratio : float, optional
    The ratio of data instances sampled and saved from each chunk, on top of the Coreset, for the validation methods. Values are in the range [0, 1]. For example, chunk_sample_ratio=0.5 means that 50% of the data instances from each chunk will be saved.

model_cls : Any, optional
    A Scikit-learn compatible model class used to train the model on the coreset, in case a specific model instance wasn't passed to fit or the validation methods. The default model class is our WPCA class, which extends sklearn's PCA to support weights.
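The chunking parameters interact as described above. The following sketch shows how they might be set explicitly; all values are illustrative, not recommendations:

from dataheroes import CoresetTreeServicePCA

service = CoresetTreeServicePCA(
    optimized_for=['training', 'cleaning'],
    n_instances=2_000_000,     # estimated dataset size
    chunk_size=100_000,        # overrides the deduced chunk size
    coreset_size=0.1,          # as a float: 10% of each chunk (capped at 60%)
    chunk_sample_ratio=0.2,    # also save 20% of each chunk for validation
)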
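A node_metadata_func only needs to match the Callable annotation above. The semantics of its arguments are not documented on this page, so the sketch below merely demonstrates a callable of the matching shape; the argument names and the returned metadata are assumptions:

from typing import Optional, Tuple, Union
import numpy as np

def node_metadata_func(
    arrays: Tuple[np.ndarray, ...],   # the node's data arrays (assumed meaning)
    indices: np.ndarray,              # instance indices (assumed meaning)
    columns: Optional[list] = None,   # optional column labels (assumed meaning)
) -> Union[dict, None]:
    # Store the per-feature mean of the node's data as node metadata.
    X = arrays[0]
    return {'feature_means': X.mean(axis=0).tolist()}

service = CoresetTreeServicePCA(
    optimized_for='training',
    n_instances=1_000_000,
    node_metadata_func=node_metadata_func,
)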