DataTuningParams

A class holding the information required to tune the data parameters of unsupervised and regression Coreset trees: CoresetTreeServiceDTR, CoresetTreeServiceKMeans, CoresetTreeServiceLR, CoresetTreeServicePCA, CoresetTreeServiceSVD. The parameters of the class are treated as a param_grid: a Coreset tree is built for each combination of parameter values.
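To make the param_grid behaviour concrete, the sketch below enumerates every combination of a small grid with `itertools.product`. It only illustrates the combinatorics; the helper `expand_grid` and the grid values are hypothetical, and the order in which the library itself iterates over combinations is an assumption.

```python
from itertools import product

# Hypothetical grid, shaped like the parameters of this class.
param_grid = {
    "coreset_size": [1000, 5000],
    "deterministic_size": [0.1, None],
    "det_weights_behaviour": ["keep", "inv"],
}


def expand_grid(grid):
    """Return one dict per combination of the listed values."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


combinations = expand_grid(param_grid)
print(len(combinations))  # 2 * 2 * 2 = 8 combinations, one tree each
for combo in combinations:
    print(combo)
```

Because the number of trees is the product of the list lengths, adding one more value to any list multiplies the build cost accordingly.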

Parameter name   Type   Description
General Parameters
coreset_size List[Optional[Union[int, float]]] The coreset size of each node in the coreset tree. If None, the coreset size is not specified. If provided as an int, it is the number of samples. If provided as a float, it is the ratio between each chunk and the resulting coreset. In either case, coreset_size is limited to 60% of the chunk_size. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, the actual size of the coreset is usually smaller than coreset_size.
Example: 'coreset_size': [1000, 5000, 10000]
deterministic_size List[Optional[Union[int, float]]] The fraction of the coreset_size that is selected deterministically, based on the calculated importance. If None, the deterministic size is not specified and the Coreset samples all of its instances probabilistically.
Example: 'deterministic_size': [0.1, 0.2, None]
det_weights_behaviour List[Optional[str]] Determines how the weights of the Coreset samples are calculated. The default is 'auto', which resolves to 'keep'.
  • 'keep': The weights of all samples that were selected deterministically are kept as given in the input and the probabilistic samples’ weights sum up proportionally to the dataset_sum_of_weights - sum_of_deterministic_samples.
  • 'inv': The weights of all samples that were selected (deterministically or probabilistically) are the inverse of their sampling probabilities. This means that there is no difference in weight calculation between the deterministic and probabilistic samples.
  • 'prop': The weights of all samples that were selected deterministically sum up proportionally to deterministic_size * dataset_sum_of_weights, and the weights of the probabilistic samples sum up to (1 - deterministic_size) * dataset_sum_of_weights.

Example: 'det_weights_behaviour': ['keep', 'inv']
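The three det_weights_behaviour options listed above can be illustrated with a small numeric sketch. This is one reading of the descriptions, not the library's implementation: the function `coreset_weights`, its arguments, and the choice to rescale inverse-probability weights so that each group hits its target sum are all assumptions made for illustration.

```python
def rescale(values, target_sum):
    """Scale values so they sum to target_sum while keeping their ratios."""
    total = sum(values)
    return [v * target_sum / total for v in values]


def coreset_weights(behaviour, det_input_weights, det_probs, prob_probs,
                    dataset_sum_of_weights, deterministic_size=None):
    """Return (deterministic_weights, probabilistic_weights).

    Hypothetical helper: the rescaling rules are an assumed interpretation.
    """
    if behaviour == "auto":  # per the description, 'auto' resolves to 'keep'
        behaviour = "keep"
    inv_det = [1.0 / p for p in det_probs]
    inv_prob = [1.0 / p for p in prob_probs]
    if behaviour == "inv":
        # No distinction between deterministic and probabilistic samples.
        return inv_det, inv_prob
    if behaviour == "keep":
        # Deterministic weights stay as given; the rest share what is left.
        remaining = dataset_sum_of_weights - sum(det_input_weights)
        return list(det_input_weights), rescale(inv_prob, remaining)
    if behaviour == "prop":
        det_target = deterministic_size * dataset_sum_of_weights
        prob_target = (1 - deterministic_size) * dataset_sum_of_weights
        return (rescale(det_input_weights, det_target),
                rescale(inv_prob, prob_target))
    raise ValueError(f"unknown behaviour: {behaviour!r}")


# Toy inputs: two deterministic samples and two probabilistic samples.
args = dict(det_input_weights=[1.0, 1.0], det_probs=[0.5, 0.5],
            prob_probs=[0.1, 0.05], dataset_sum_of_weights=100.0,
            deterministic_size=0.2)
for name in ("keep", "inv", "prop"):
    det_w, prob_w = coreset_weights(name, **args)
    print(name, det_w, prob_w)
```

Under these assumptions, 'keep' and 'prop' both preserve the dataset's total weight, while 'inv' makes no such guarantee for a single draw.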

Code example:

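# Each key maps to a list of candidate values; a Coreset tree is built for
# every combination (3 * 3 * 2 = 18 combinations here).
data_tuning_params = {
    'coreset_size': [500, 2000, 5000],
    'deterministic_size': [0.1, 0.3, None],
    'det_weights_behaviour': ['keep', 'inv'],
}

# The remaining constructor arguments are elided.
service = CoresetTreeServiceDTR(data_tuning_params=data_tuning_params, ...)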
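The coreset_size rules in the table above (int as a sample count, float as a ratio of the chunk, and the 60% cap) can be sketched as follows, together with a toy demonstration of why the actual coreset tends to be smaller than the requested size. The helper `resolve_coreset_size`, the rounding choice, and the importance scores are hypothetical; the library's exact rounding and sampling procedure are not stated in the source.

```python
import random


def resolve_coreset_size(coreset_size, chunk_size):
    """Turn a coreset_size setting into a per-node sample count.

    Assumptions: a float means coreset = ratio * chunk_size (rounded),
    and the 60% cap is applied after that conversion.
    """
    if coreset_size is None:
        return None  # size not specified
    cap = chunk_size * 6 // 10  # 60% of the chunk, in integer arithmetic
    if isinstance(coreset_size, float):
        requested = round(coreset_size * chunk_size)
    else:
        requested = coreset_size
    return min(requested, cap)


print(resolve_coreset_size(1000, 10_000))  # int below the cap is kept
print(resolve_coreset_size(8000, 10_000))  # int above the cap is limited
print(resolve_coreset_size(0.2, 10_000))   # float is a ratio of the chunk

# Sampling with replacement by importance: repeated draws of the same
# instance collapse, so the number of distinct instances can fall short
# of the requested size.
rng = random.Random(0)
population = range(10_000)
importance = [1 + (i % 10) for i in population]  # made-up importance scores
size = resolve_coreset_size(0.2, 10_000)
draws = rng.choices(population, weights=importance, k=size)
unique = len(set(draws))
print(size, unique)
```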