CoresetTreeServiceDTR

CoresetTreeServiceDTR(
    *,
    data_manager=None,
    data_params=None,
    n_instances=None,
    max_memory_gb=None,
    n_classes=None,
    optimized_for,
    chunk_size=None,
    chunk_by=None,
    coreset_size=None,
    coreset_params=None,
    working_directory=None,
    cache_dir=None,
    node_train_function=None,
    node_train_function_params=None,
    node_metadata_func=None,
    chunk_sample_ratio=None,
    model_cls=None
)

Bases: DTMixin, CoresetTreeServiceSupervisedMixin, CoresetTreeService

Subclass of CoresetTreeService for Decision Tree Regression-based problems. A service class for creating a coreset tree and working with it. optimized_for is a required parameter that defines the main usage of the service: 'training', 'cleaning', or both (optimized_for=['training', 'cleaning']). The service decides whether to build an actual Coreset Tree or a single Coreset over the entire dataset based on the quadruplet n_instances, n_classes, max_memory_gb and the number of features (deduced from the dataset). chunk_size and coreset_size are deduced from the same quadruplet. If chunk_size and coreset_size are provided explicitly, they override all of the parameters mentioned above; this is less recommended.
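The accepted forms of optimized_for and the recommended calling style (pass sizing hints, let the service deduce chunk_size and coreset_size) can be sketched as follows. This is a minimal illustration, not the library's own validation: `normalize_optimized_for`, `ALLOWED_USAGES` and the numeric values are hypothetical, and the constructor call is left as a comment because it needs the library installed.

```python
from typing import List, Union

# The two usages named in the description above.
ALLOWED_USAGES = ("training", "cleaning")


def normalize_optimized_for(optimized_for: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: return optimized_for as a validated list of usages."""
    if isinstance(optimized_for, str):
        values = [optimized_for]
    else:
        values = list(optimized_for)
    unknown = [v for v in values if v not in ALLOWED_USAGES]
    if not values or unknown:
        raise ValueError(
            f"optimized_for must be 'training', 'cleaning' or both, got {optimized_for!r}"
        )
    return values


# Recommended style per the description: provide the sizing hints and leave
# chunk_size and coreset_size unset so the service can deduce them.
service_kwargs = {
    "optimized_for": normalize_optimized_for(["training", "cleaning"]),
    "n_instances": 1_000_000,  # illustrative estimate of the dataset size
    "max_memory_gb": 16,       # illustrative; omit to use the server's total memory
}

# With the library installed, the keyword-only arguments would be passed as:
# service = CoresetTreeServiceDTR(**service_kwargs)
```

All constructor arguments are keyword-only (note the leading `*` in the signature), so building a dictionary and unpacking it keeps the call explicit.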

Parameters:

Name	Type	Description	Default
`data_manager`	`DataManagerT`	DataManagerBase subclass, optional. The class used to interact with the provided data and store it locally. By default, only the sampled data is stored, in HDF5 file format.	`None`
`data_params`	`Union[DataParams, dict]`	DataParams, optional. Data preprocessing information.	`None`
`n_instances`	`int`	int. The total number of instances to be processed (an estimate is acceptable). This parameter is required; it is the only one of the quadruplet mentioned above that is not deduced from the data.	`None`
`max_memory_gb`	`int`	int, optional. The maximum memory in GB that should be used. When not provided, the server's total memory is used. In either case, only 80% of that amount (the provided value or the server's total memory) is considered.	`None`
`optimized_for`	`Union[list, str]`	str or list. Either 'training', 'cleaning', or both: ['training', 'cleaning']. The main usage of the service.	required
`chunk_size`	`int`	int, optional. The number of instances to be used when creating a coreset node in the tree. When defined, it will override the parameters of optimized_for, n_instances, n_classes and max_memory_gb. chunk_size=0: Nodes are created based on input chunks. chunk_size=-1: Force the service to create a single coreset from the entire dataset (if it fits into memory).	`None`
`chunk_by`	`Union[Callable, str, list]`	function, label, or list of labels, optional. Split the data according to the provided key. When provided, chunk_size input is ignored.	`None`
`coreset_size`	`Union[int, float, dict]`	int or float, optional. Represents the coreset size of each node in the coreset tree. If provided as a float, it represents the ratio between each chunk and the resulting coreset. In any case, coreset_size is limited to 60% of chunk_size. The coreset is constructed by sampling data instances from the dataset based on their calculated importance. Since each instance may be sampled more than once, the actual size of the coreset is usually smaller than coreset_size.	`None`
`coreset_params`	`Union[CoresetParams, dict]`	CoresetParams or dict, optional. Coreset algorithm specific parameters.	`None`
`node_train_function`	`Callable[[ndarray, ndarray, ndarray], Any]`	Callable, optional. A method for training a model at the tree-node level.	`None`
`node_train_function_params`	`dict`	dict, optional. kwargs to be used when calling node_train_function.	`None`
`node_metadata_func`	`Callable[[Tuple[ndarray], ndarray, Union[list, None]], Union[list, dict, None]]`	Callable, optional. A method for storing user metadata on each node.	`None`
`working_directory`	`Union[str, PathLike]`	str, path, optional. Local directory where intermediate data is stored.	`None`
`cache_dir`	`Union[str, PathLike]`	str, path, optional. For internal use when loading a saved service.	`None`
`chunk_sample_ratio`	`float`	float, optional. The size of the sample that is taken and saved from each chunk, on top of the Coreset, for use by the validation methods. Values must be in the range [0, 1]. For example, chunk_sample_ratio=0.5 means that 50% of the data instances from each chunk will be saved.	`None`
`model_cls`	`Any`	A Scikit-learn compatible model class, optional. The model class used to train the model on the coreset when no specific model instance is passed to fit or to the validation methods. The default for this class is XGBRegressor, provided the xgboost library is installed. Otherwise, LGBMRegressor is chosen if the lightgbm library is installed. Failing that, CatBoostRegressor is chosen if the catboost library is installed. If none of these three libraries is installed, sklearn's GradientBoostingRegressor is the final fallback.	`None`
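The numeric limits and the model_cls fallback order described in the table can be sketched as below. This is only an illustration under stated assumptions, not the library's implementation: every function and constant name here is hypothetical, the float form of coreset_size is assumed to mean coreset = ratio × chunk, and rounding is an arbitrary choice.

```python
import importlib.util
from typing import Callable, Union

MEMORY_USAGE_FRACTION = 0.8  # "only 80% ... is considered" (max_memory_gb row)
MAX_CORESET_PERCENT = 60     # coreset_size is "limited to 60% of chunk_size"

# Default model order from the model_cls row: (package, class name).
MODEL_FALLBACK_ORDER = [
    ("xgboost", "XGBRegressor"),
    ("lightgbm", "LGBMRegressor"),
    ("catboost", "CatBoostRegressor"),
]


def usable_memory_gb(max_memory_gb: float) -> float:
    """Memory budget actually considered, given the provided or total memory."""
    return max_memory_gb * MEMORY_USAGE_FRACTION


def resolve_coreset_size(coreset_size: Union[int, float], chunk_size: int) -> int:
    """Convert coreset_size (count or ratio) to a per-node count, applying the cap."""
    if isinstance(coreset_size, float):
        # Assumption: a float is a ratio of the chunk size.
        requested = round(coreset_size * chunk_size)
    else:
        requested = coreset_size
    cap = chunk_size * MAX_CORESET_PERCENT // 100
    return min(requested, cap)


def _is_installed(package: str) -> bool:
    """True if the top-level package can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


def default_model_name(is_installed: Callable[[str], bool] = _is_installed) -> str:
    """Return the dotted name of the first available default regressor."""
    for package, class_name in MODEL_FALLBACK_ORDER:
        if is_installed(package):
            return f"{package}.{class_name}"
    return "sklearn.ensemble.GradientBoostingRegressor"


print(resolve_coreset_size(0.2, 10_000))  # ratio form, below the cap
print(resolve_coreset_size(9_000, 10_000))  # count form, reduced to the cap
print(default_model_name())  # depends on which libraries are installed
```

Passing model_cls explicitly bypasses this fallback order, which is useful when the training environment and the production environment have different libraries installed.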