Release Notes

Last updated on October 1, 2024

New Features, Improvements and Bug Fixes in version 0.12.0

Adding support for Target Encoding, which is the default way to handle categorical features in binary classification problems.
Significantly improving the build functions’ runtime and their memory consumption, especially for datasets with categorical features and missing values.
Significantly improving the runtime of the auto_preprocessing, predict and predict_proba functions.
Fixing various problems concerning the handling of missing values.
Fixing the problem where incorrect data was saved for validation purposes when calling the save function.
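
As a rough illustration of the Target Encoding idea mentioned above (and not the library's actual implementation), the sketch below replaces each category with the mean of a binary target observed for that category. The function name target_encode, the smoothing parameter and the toy data are assumptions made for this example only.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of a binary target.

    Hypothetical helper for illustration; not part of the library's API.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    # Blend each category mean with the global mean.
    # With smoothing=0 this reduces to the plain per-category mean.
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Toy data: a categorical feature and a binary label.
colors = ["red", "red", "blue", "blue", "blue", "green"]
clicked = [1, 0, 1, 1, 1, 0]

encoding = target_encode(colors, clicked)
encoded_column = [encoding[color] for color in colors]
```

A positive smoothing value pulls rare categories toward the global mean, which is a common way to reduce overfitting on categories with few samples.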

Improvements and Bug Fixes in version 0.11.2

Solving a memory leak that happened sometimes during the various build and partial_build functions.
Improving the build functions' runtime and memory consumption.

Improvements and Bug Fixes in version 0.11.1

Improving the grid_search runtime when XGBoost, LightGBM or CatBoost is used.
The get_coreset_size function is no longer limited by license.
Fixing the problem preventing the build of the CoresetTreeServiceDTR in some cases.

New Features, Improvements and Bug Fixes in version 0.11.0

Adding logging capabilities to the library, to improve its debugging capabilities.
Adding a verbose parameter to the various build and partial_build functions, to provide a better indication of how long the operation will take.
Adding the get_hyperparameter_tuning_data function, which retrieves the data from the Coreset tree in a format that allows running GridSearchCV, BayesSearchCV and other hyperparameter tuning methods.
Improving the structure returned by get_coreset.
Improving the build functions’ runtime and their memory consumption.
Fixing various problems when saving the Coreset tree.

New Features, Improvements and Bug Fixes in version 0.10.3

Improving grid_search runtime, by improving its parallelism. The number of jobs run in parallel during grid_search can be controlled by defining the n_jobs parameter passed to the function.
Fixing the problem where grid_search with refit=True would ignore the sequence-related parameters passed to the function when the Coreset tree was built with a seq_column.
Fixing the problem where fit and grid_search would not always select the optimal nodes from the Coreset tree when it was built with a seq_column.

New Features, Improvements and Bug Fixes in version 0.10.2

Allowing coreset_size to also be configured as a float, representing the ratio between each chunk and the resulting coreset.
Fixing the problem where the returned column types were incorrect in some cases when calling get_coreset, fit or grid_search with preprocessing_stage=user.
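
To make the integer and float forms of coreset_size concrete, here is a minimal sketch of how such a value could be resolved into a number of samples per chunk. The helper resolve_coreset_size is hypothetical, and the sketch assumes that a float denotes the fraction of each chunk kept in the coreset; the library's own rounding and validation rules may differ.

```python
def resolve_coreset_size(coreset_size, chunk_size):
    """Return the number of samples a coreset would hold for one chunk.

    Hypothetical helper for illustration; not part of the library's API.
    An int is taken as an absolute size, a float as a fraction of the chunk.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(coreset_size, bool):
        raise TypeError("coreset_size must be an int or a float")
    if isinstance(coreset_size, int):
        if coreset_size <= 0:
            raise ValueError("an integer coreset_size must be positive")
        return coreset_size
    if isinstance(coreset_size, float):
        if not 0.0 < coreset_size <= 1.0:
            raise ValueError("a float coreset_size must be in (0, 1]")
        # Assumption: round to the nearest sample, keeping at least one.
        return max(1, round(coreset_size * chunk_size))
    raise TypeError("coreset_size must be an int or a float")


absolute = resolve_coreset_size(500, chunk_size=10_000)
relative = resolve_coreset_size(0.2, chunk_size=10_000)
```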

New Features, Improvements and Bug Fixes in version 0.10.1

Improving the grid_search time for datasets with categorical features and missing values.

New Features, Improvements and Bug Fixes in version 0.10.0

Adding an enhancement parameter to the fit, grid_search and the validation functions of the CoresetTreeServiceDTC and CoresetTreeServiceDTR. Setting a value of 1 to 3 for this parameter will enhance the default decision tree based training, which can improve the strength of the model, but will increase the training run time.

Bug Fixes in version 0.9.1

Fixing the problem where predict and predict_proba would sometimes fail when the Coreset tree was built with a seq_column on a dataset including categorical features.

New Features, Improvements and Bug Fixes in version 0.9.0

Allowing get_coreset, fit, grid_search and the validation functions to be executed on a subset of the data in the Coreset tree (such as certain date ranges), by defining a sequence column (seq_column) in the DataParams structure passed during the initialization of the class, which can then be used to filter the data.
Improving the build time, by improving the parallelism of the Coreset Tree construction. The number of jobs run in parallel during build can be controlled by defining the n_jobs parameter passed to the function.

New Features, Improvements and Bug Fixes in version 0.8.1

Fixing a licensing issue.

New Features, Improvements and Bug Fixes in version 0.8.0

An additional CoresetTreeService for all decision tree regression-based problems has been added to the library. This service can be used to create regression-based Coresets for all libraries including: XGBoost, LightGBM, CatBoost, Scikit-learn and others.
Improving the default Coreset tree created when chunk_size and coreset_size are not provided during the initialization of the class.
grid_search was extended to return a Pandas DataFrame with the score of each hyperparameter combination and fold.
Improving the grid_search time.
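
The per-combination, per-fold score table mentioned above can be pictured with the following sketch, which builds one row for every hyperparameter combination and fold. The column names, the helper functions and the placeholder scoring function are assumptions made for illustration and do not reflect the exact structure returned by grid_search. A list of rows like this could be passed to pandas.DataFrame to obtain a table.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every hyperparameter combination in the grid as a dict."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, values))
        for values in product(*(param_grid[key] for key in keys))
    ]


def score_table(param_grid, n_folds, score_fn):
    """Build one row per (combination, fold) pair with its score."""
    rows = []
    for params in expand_grid(param_grid):
        for fold in range(n_folds):
            rows.append({**params, "fold": fold, "score": score_fn(params, fold)})
    return rows


def toy_score(params, fold):
    # Placeholder for training and evaluating a model on one fold.
    return 1.0 / (1 + abs(params["max_depth"] - 4)) - 0.01 * fold


grid = {"max_depth": [3, 4], "learning_rate": [0.1, 0.3]}
rows = score_table(grid, n_folds=3, score_fn=toy_score)
```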

New Features, Improvements and Bug Fixes in version 0.7.0

Coresets can now be built when the dataset has missing values. The library will automatically handle them during the build process (as Coresets can only be built without missing values).
Significantly improving the predict and predict_proba time for datasets with categorical features.
predict and predict_proba will now automatically preprocess the data according to the preprocessing_stage used to train the model.
Improving the automatic detection of categorical features.
It is now possible to define the model class, used to train the model on the coreset, when initializing the CoresetTreeService class, using the model_cls parameter.
Enhancing the grid_search function to run on unsupervised datasets.
Fixing the problem where grid_search would fail after remove_samples or filter_out_samples were called.

New Features, Improvements and Bug Fixes in version 0.6.0

Replacing the save_all parameter passed when initializing all classes, with the chunk_sample_ratio parameter, which indicates the size of the sample that will be taken and saved from each chunk on top of the Coreset for the validation methods.
Significantly improving the build time for datasets with categorical features.
grid_search, cross_validate and holdout_validate now all accept the preprocessing_stage parameter, same as the fit function.
fit now returns the data in preprocessing_stage=auto by default when Scikit-learn or XGBoost is used to train the model, and in preprocessing_stage=user by default when LightGBM or CatBoost is used to train the model.
Fixing the problem where model_params were ignored when both model and model_params were passed to fit, grid_search, cross_validate and holdout_validate.
Fixing the problem where the returned result was faulty when a single Coreset was created for the entire dataset and get_cleaning_samples was called with class_size={"class XXX": "all"}.

New Features and Improvements in version 0.5.0

Coresets can now be built using categorical features. The library will automatically one-hot encode them during the build process (as Coresets can only be built with numeric features).
get_coreset can now return the data according to three data preprocessing stages: preprocessing_stage=original, the dataset as it is handed to the Coreset’s build function; preprocessing_stage=user, the dataset after any user-defined data preprocessing (default); and preprocessing_stage=auto, the dataset after any automatic data preprocessing done by the library, such as one-hot encoding and converting Boolean fields to numeric. The features (X) can also be returned as a sparse matrix or an array, controlled by the sparse_output parameter (applicable only for preprocessing_stage=auto).
fit can now return the data according to two preprocessing stages: preprocessing_stage=auto is the default when Scikit-learn is used to train the model, and preprocessing_stage=user is the default when XGBoost, LightGBM or CatBoost is used to train the model.
Adding a new auto_preprocessing function, allowing the user to preprocess the (test) data, as it was automatically done by the library during the Coreset’s build function.
Fixing the problem where the library’s Docstrings did not show up in some IDEs.
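
For readers unfamiliar with one-hot encoding, the sketch below shows the general technique on a single categorical column: each distinct category becomes its own 0/1 column. It is a generic illustration and makes no claim about how the library orders or names the resulting columns; the function name one_hot_encode is an assumption.

```python
def one_hot_encode(values):
    """Return (categories, rows), with one 0/1 indicator column per category.

    Generic illustration of one-hot encoding; not the library's implementation.
    """
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        rows.append(row)
    return categories, rows


colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colors)
```

Each encoded row contains exactly one 1, so the feature becomes purely numeric without implying any order among the categories.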

New Features and Improvements in version 0.4.0

Adding support for Python 3.11.
Allowing the creation of a CoresetTreeService that can be used for both training and cleaning (optimized_for=['cleaning', 'training']).
The CoresetTreeService can now handle datasets that do not fit into the device’s memory also for cleaning purposes (for training purposes this was supported from the initial release).
The get_important_samples function was renamed to get_cleaning_samples to improve the clarity of its purpose.
Adding hyperparameter tuning capabilities to the library with the introduction of the grid_search() function, which works in a similar manner to Scikit-learn’s GridSearchCV class but runs dramatically faster, as it utilizes the Coreset tree. Also introducing the cross_validate() and holdout_validate() functions, which can be used directly or as the validation method within the grid_search() function.
Further improving the error messages for some of the data processing problems users encountered.

Bug Fixes in version 0.3.1

build and partial_build now read the data in chunks of size chunk_size when the file format allows it (CSV, TSV), to reduce the memory footprint when building the Coreset tree.
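
The memory benefit of chunked reading can be seen in the following generic sketch, which streams a CSV source and yields at most chunk_size rows at a time, so only one chunk is held in memory at once. It uses Python's standard csv module and an in-memory file for demonstration; the function read_in_chunks is hypothetical and is not how the library itself reads files.

```python
import csv
import io
from itertools import islice


def read_in_chunks(file_obj, chunk_size):
    """Yield lists of at most chunk_size rows from a CSV file object."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    reader = csv.DictReader(file_obj)
    while True:
        # islice pulls only the next chunk_size rows from the reader.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Demonstration on a small in-memory CSV with seven data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(7)) + "\n"
chunk_lengths = [
    len(chunk) for chunk in read_in_chunks(io.StringIO(csv_text), chunk_size=3)
]
```

The same generator works for TSV input if the reader is created with delimiter="\t".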

New Features and Improvements in version 0.3.0

An additional CoresetTreeService for all decision tree classification-based problems has been added to the library. This service can be used to create classification-based Coresets for all libraries including: XGBoost, LightGBM, CatBoost, Scikit-learn and others.
Improving the results get_coreset returns in case the Coreset tree is not perfectly balanced.
Improving the data handling capabilities, when processing the input data provided to the different build functions, such as supporting pandas.BooleanDtype and pandas.Series and returning clearer error messages for some of the data processing problems encountered.

New Features and Improvements in version 0.2.0

Additional CoresetTreeServices for linear regression, K-Means, PCA and SVD have been added to the library.
Reducing the memory footprint by up to 50%, especially during the various build and partial_build functions.
Data is read in chunks of size chunk_size when the file format allows it (CSV, TSV), to reduce the memory footprint when building the Coreset tree.
Significantly improving the get_coreset time on large datasets.
Significantly improving the save time and changing the default save format to pickle.
Significantly improving the importance calculation when the number of data instances per class is lower than the number of features.
Allowing the entire dataset to be saved, not just the selected samples, by setting the save_all parameter during the initialization of the class. When optimized_for='cleaning', save_all is True by default, and when optimized_for='training', it is False by default.
Allowing certain columns in the dataset to be defined as properties (props). Properties won’t be used to compute the Coreset or train the model, but it is possible to filter_out_samples on them or to pass them in the select_from_function of get_important_samples.