CoresetTreeServiceDTR

CoresetTreeServiceDTR(
    *,
    data_manager=None,
    data_params=None,
    n_instances=None,
    max_memory_gb=None,
    n_classes=None,
    optimized_for,
    chunk_size=None,
    chunk_by=None,
    coreset_size=None,
    coreset_params=None,
    working_directory=None,
    cache_dir=None,
    node_train_function=None,
    node_train_function_params=None,
    node_metadata_func=None,
    chunk_sample_ratio=None,
    model_cls=None
)

Bases: DTMixin, CoresetTreeServiceSupervisedMixin, CoresetTreeService

Subclass of CoresetTreeService for Decision Tree Regression-based problems. A service class for creating a coreset tree and working with it. optimized_for is a required parameter that defines the main usage of the service: 'training', 'cleaning', or both (optimized_for=['training', 'cleaning']). The service decides whether to build an actual Coreset Tree or a single Coreset over the entire dataset based on a quadruplet of values: n_instances, n_classes, max_memory_gb and the number of features (deduced from the dataset). chunk_size and coreset_size are deduced from the same quadruplet. If chunk_size and coreset_size are provided explicitly, they override the values deduced from these parameters (less recommended).
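A minimal sketch of how the constructor might be called with only the required arguments. The top-level import path `dataheroes` and all sample values are assumptions, not taken from this page. The import is deferred inside a hypothetical helper, `make_service`, so the snippet runs even where the package is not installed.

```python
# Keyword arguments for the constructor; every argument is keyword-only.
# The values below are illustrative assumptions, not recommendations.
service_kwargs = {
    "optimized_for": "training",  # required: 'training', 'cleaning', or both as a list
    "n_instances": 1_000_000,     # estimate of the total number of instances
    "max_memory_gb": 16,          # optional memory budget
}


def make_service(**kwargs):
    """Hypothetical helper: build the service lazily, importing only when called."""
    # Assumption: the class is exposed at the top level of the `dataheroes` package.
    from dataheroes import CoresetTreeServiceDTR

    return CoresetTreeServiceDTR(**kwargs)


# Uncomment where the package is installed and configured:
# service = make_service(**service_kwargs)
```

Leaving chunk_size and coreset_size out of the call lets the service deduce them, which the description above presents as the preferred route.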

Parameters:

Name Type Description Default
data_manager DataManagerT

DataManagerBase subclass, optional. The class used to interact with the provided data and store it locally. By default, only the sampled data is stored in HDF5 files format.

None
data_params Union[DataParams, dict]

DataParams, optional. Data preprocessing information.

None
n_instances int

int. The total number of instances that are going to be processed (can be an estimate). This parameter is required; it is the only member of the quadruplet described above that is not deduced from the data.

None
max_memory_gb int

int, optional. The maximum memory, in GB, that should be used. When not provided, the server's total memory is used. In either case, only 80% of that amount (the provided value or the server's total memory) is considered.

None
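The 80% rule for max_memory_gb can be sketched as simple arithmetic. The helper name `usable_memory_gb` and its `server_total_gb` argument are hypothetical and only illustrate the rule; they are not part of the service.

```python
def usable_memory_gb(max_memory_gb=None, server_total_gb=None):
    """Return 80% of the provided limit or, if none is given, of the server total."""
    base = max_memory_gb if max_memory_gb is not None else server_total_gb
    if base is None:
        raise ValueError("either max_memory_gb or server_total_gb must be given")
    return 0.8 * base


budget = usable_memory_gb(max_memory_gb=16)               # 80% of the explicit limit
fallback_budget = usable_memory_gb(server_total_gb=64)    # 80% of the server total
```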
optimized_for Union[list, str]

str or list. Either 'training', 'cleaning', or both: ['training', 'cleaning']. The main usage of the service.

required
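Because optimized_for accepts either a string or a list, calling code may want to normalize it before use. The following is a sketch with a hypothetical helper, `normalize_optimized_for`; it is not the validation the service itself performs.

```python
ALLOWED_USAGES = {"training", "cleaning"}


def normalize_optimized_for(optimized_for):
    """Return optimized_for as a list, rejecting empty or unknown values."""
    values = [optimized_for] if isinstance(optimized_for, str) else list(optimized_for)
    if not values:
        raise ValueError("optimized_for must name at least one usage")
    unknown = set(values) - ALLOWED_USAGES
    if unknown:
        raise ValueError(f"unsupported optimized_for values: {sorted(unknown)}")
    return values


single = normalize_optimized_for("training")
both = normalize_optimized_for(["training", "cleaning"])
```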
chunk_size int

int, optional. The number of instances to be used when creating a coreset node in the tree. When defined, it overrides the values deduced from optimized_for, n_instances, n_classes and max_memory_gb. chunk_size=0: nodes are created based on the input chunks. chunk_size=-1: forces the service to create a single coreset from the entire dataset (if it fits into memory).

None
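The special values of chunk_size can be summarized in a small lookup. `describe_chunk_size` is a hypothetical helper that restates the rules from the description; how the service treats other negative values is not stated on this page, so the sketch rejects them as an assumption.

```python
def describe_chunk_size(chunk_size):
    """Restate, in words, how a given chunk_size value is interpreted."""
    if chunk_size is None:
        return "deduced by the service"
    if chunk_size == 0:
        return "nodes follow the input chunks"
    if chunk_size == -1:
        return "single coreset over the entire dataset"
    if chunk_size > 0:
        return f"{chunk_size} instances per node"
    # Assumption: negative values other than -1 are treated as invalid here.
    raise ValueError(f"unsupported chunk_size: {chunk_size}")
```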
chunk_by Union[Callable, str, list]

function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.

None
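To illustrate the three forms chunk_by can take (a function, a label, or a list of labels), here is a sketch that groups rows by key. `split_by_key` and the sample rows are hypothetical; the service's own splitting logic is not shown on this page and may differ.

```python
from collections import defaultdict


def split_by_key(rows, chunk_by):
    """Group rows (dicts) into chunks keyed by a function, a label, or a list of labels."""
    if callable(chunk_by):
        key_of = chunk_by
    elif isinstance(chunk_by, str):
        key_of = lambda row: row[chunk_by]
    else:
        labels = list(chunk_by)
        key_of = lambda row: tuple(row[label] for label in labels)

    chunks = defaultdict(list)
    for row in rows:
        chunks[key_of(row)].append(row)
    return dict(chunks)


rows = [
    {"region": "eu", "year": 2023, "target": 1.0},
    {"region": "eu", "year": 2024, "target": 2.0},
    {"region": "us", "year": 2023, "target": 3.0},
]
by_label = split_by_key(rows, "region")
by_labels = split_by_key(rows, ["region", "year"])
by_function = split_by_key(rows, lambda row: row["year"] >= 2024)
```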
coreset_size Union[int, float, dict]

int or float, optional. The coreset size of each node in the coreset tree. If provided as a float, it represents the ratio between each chunk and the resulting coreset. In any case, coreset_size is limited to 60% of chunk_size. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, the actual size of the coreset is, in practice, usually smaller than coreset_size.

None
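The two rules above, the 60% cap and the effect of repeated sampling, can be sketched as follows. Both helpers are hypothetical, and the random weights are only a stand-in for the importance scores, whose actual computation is not described on this page. The rounding of float ratios is also an assumption.

```python
import random


def effective_coreset_size(coreset_size, chunk_size):
    """Resolve an int or float coreset_size, capped at 60% of chunk_size."""
    if isinstance(coreset_size, float):
        requested = round(coreset_size * chunk_size)  # float is a ratio of the chunk
    else:
        requested = coreset_size
    cap = (chunk_size * 6) // 10  # 60% of the chunk, in integer arithmetic
    return min(requested, cap)


def unique_after_sampling(n_instances, size, seed=0):
    """Draw `size` instances with replacement and count the distinct ones."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_instances)]  # stand-in for importance
    drawn = rng.choices(range(n_instances), weights=weights, k=size)
    return len(set(drawn))


size_from_ratio = effective_coreset_size(0.2, 10_000)
size_capped = effective_coreset_size(9_000, 10_000)
distinct = unique_after_sampling(10_000, size_from_ratio)
```

The distinct count can never exceed the requested size, and any instance drawn more than once reduces it, which is why the stored coreset tends to be smaller than coreset_size.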
coreset_params Union[CoresetParams, dict]

CoresetParams or dict, optional. Coreset algorithm specific parameters.

None
node_train_function Callable[[ndarray, ndarray, ndarray], Any]

Callable, optional. Method for training a model at the tree node level.

None
node_train_function_params dict

dict, optional. kwargs to be used when calling node_train_function.

None
node_metadata_func Callable[[Tuple[ndarray], ndarray, Union[list, None]], Union[list, dict, None]]

Callable, optional. A method for storing user metadata on each node.

None
working_directory Union[str, PathLike]

str, path, optional. Local directory where intermediate data is stored.

None
cache_dir Union[str, PathLike]

str, path, optional. For internal use when loading a saved service.

None
chunk_sample_ratio float

float, optional. The size of the sample that is taken and saved from each chunk, on top of the Coreset, for the validation methods. Values are in the range [0,1]. For example, chunk_sample_ratio=0.5 means that 50% of the data instances from each chunk will be saved.

None
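As a quick illustration of chunk_sample_ratio, the number of instances saved from a chunk can be estimated as below. `chunk_sample_count` is a hypothetical helper, and the rounding behaviour is an assumption.

```python
def chunk_sample_count(chunk_len, chunk_sample_ratio):
    """Estimate how many instances of a chunk are saved for validation."""
    if not 0 <= chunk_sample_ratio <= 1:
        raise ValueError("chunk_sample_ratio must be in the range [0, 1]")
    # Assumption: the count is rounded to the nearest whole instance.
    return round(chunk_len * chunk_sample_ratio)


half = chunk_sample_count(10_000, 0.5)
none_saved = chunk_sample_count(10_000, 0)
```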
model_cls Any

A Scikit-learn compatible model class, optional. The model class used to train the model on the coreset when a specific model instance is not passed to fit or to the validation methods. The default for this class is XGBRegressor, provided the xgboost library is installed. Otherwise, LGBMRegressor is chosen if the lightgbm library is installed. Failing that, CatBoostRegressor is chosen if the catboost library is installed. Lastly, if none of these three libraries is installed, sklearn's GradientBoostingRegressor is the final fallback.

None
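The fallback order for the default model_cls can be sketched with a simple availability check. This is an illustration of the order described above, not the service's own selection code; `default_model_name` and its `is_installed` argument are hypothetical.

```python
from importlib.util import find_spec

# Preferred libraries, in the order given in the description.
FALLBACK_ORDER = [
    ("xgboost", "XGBRegressor"),
    ("lightgbm", "LGBMRegressor"),
    ("catboost", "CatBoostRegressor"),
]


def default_model_name(is_installed=lambda name: find_spec(name) is not None):
    """Return the dotted name of the first available regressor in the fallback order."""
    for module_name, class_name in FALLBACK_ORDER:
        if is_installed(module_name):
            return f"{module_name}.{class_name}"
    # None of the three libraries is available: final fallback.
    return "sklearn.ensemble.GradientBoostingRegressor"


chosen = default_model_name()  # depends on what is installed locally
```

Passing model_cls explicitly avoids depending on which of these libraries happens to be installed in a given environment.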