Release Notes
Last updated on June 16, 2023
New Features and Improvements in version 0.4.0
- Added support for Python 3.11.
- Added the ability to create a CoresetTreeService which can be used for both training and cleaning (`optimized_for=['cleaning', 'training']`); see the first sketch after this list.
- The CoresetTreeService can now handle datasets that do not fit into the device’s memory for cleaning purposes as well (for training purposes this has been supported since the initial release).
- The `get_important_samples` function was renamed to `get_cleaning_samples` to improve the clarity of its purpose.
- Added hyperparameter tuning capabilities to the library with the introduction of the `grid_search()` function, which works in a similar manner to Scikit-learn’s GridSearchCV class, only dramatically faster, as it utilizes the Coreset tree. Also introduced the `cross_validate()` and `holdout_validate()` functions, which can be used directly or as the validation method within `grid_search()`; see the second sketch after this list.
- Further improved the error messages for some of the data processing problems users encountered.
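Below is a minimal sketch of the dual-purpose workflow. The `CoresetTreeServiceLG` class name, the `build_from_file()` call and the `size` argument of `get_cleaning_samples()` are modeled on the library's documented patterns, but treat the exact signatures as assumptions and verify them against the installed version.

```python
from dataheroes import CoresetTreeServiceLG  # logistic regression service; assumed class name

# One tree that serves both workflows (new in 0.4.0).
service = CoresetTreeServiceLG(optimized_for=['cleaning', 'training'])
service.build_from_file('train_data.csv')  # hypothetical file path

# get_important_samples() was renamed to get_cleaning_samples() in 0.4.0;
# it returns the samples most worth reviewing when cleaning the data.
cleaning_candidates = service.get_cleaning_samples(size=1000)  # 'size' is an assumed parameter name
```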
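And a sketch of the new tuning entry points. The `param_grid` argument mirrors Scikit-learn's GridSearchCV, but the exact keyword names and return structure of `grid_search()` are assumptions here, not confirmed signatures.

```python
# Continues from the previous sketch ('service' already built).
param_grid = {
    'C': [0.1, 1, 10],        # hypothetical hyperparameters for a
    'penalty': ['l1', 'l2'],  # logistic regression model
}

# grid_search() evaluates the grid on the Coreset tree, which is what makes it
# dramatically faster than a full GridSearchCV run over the complete dataset.
# cross_validate() / holdout_validate() can also be called directly, or be
# chosen as the validation method inside grid_search().
result = service.grid_search(param_grid=param_grid)  # returned structure is an assumption
```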
Bug Fixes in version 0.3.1
- `build` and `partial_build` now read the data in chunks of `chunk_size` when the file format allows it (CSV, TSV), to reduce the memory footprint when building the Coreset tree; see the sketch below.
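A sketch of the chunked build, assuming `chunk_size` is supplied when the service is initialized and that the file-based build calls respect it; both the parameter placement and the `partial_build_from_file()` name are assumptions to verify against the docs.

```python
from dataheroes import CoresetTreeServiceLG  # assumed class name

# With a CSV/TSV source, the data is streamed in chunks of chunk_size rather
# than loaded whole, keeping the memory footprint low while the tree is built.
service = CoresetTreeServiceLG(
    optimized_for='training',
    chunk_size=100_000,  # assumed constructor parameter
)
service.build_from_file('big_dataset.csv')        # initial build
service.partial_build_from_file('more_data.csv')  # later data added to the same tree
```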
New Features and Improvements in version 0.3.0
- An additional CoresetTreeService for all decision tree classification-based problems has been added to the library. This service can be used to create classification Coresets for libraries including XGBoost, LightGBM, CatBoost, Scikit-learn and others; see the sketch after this list.
- Improved the results `get_coreset` returns in case the Coreset tree is not perfectly balanced.
- Improved the data handling capabilities when processing the input data provided to the different build functions, such as supporting `pandas.BooleanDtype` and `pandas.Series`, and returning clearer error messages for some of the data processing problems encountered.
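A minimal sketch of feeding a classification Coreset to an external library. `CoresetTreeServiceDTC` is the decision-tree classification service name; the constructor arguments and the `get_coreset()` return structure (a `'data'` tuple plus `'w'` weights) follow the library's documented examples but should be treated as assumptions.

```python
import numpy as np
import xgboost as xgb
from dataheroes import CoresetTreeServiceDTC  # assumed class name

# Toy data purely for illustration.
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)

service = CoresetTreeServiceDTC(optimized_for='training')
service.build(X, y)

# Train any gradient-boosting or tree library on the weighted Coreset
# instead of the full dataset.
coreset = service.get_coreset()
indices, X_c, y_c = coreset['data']  # assumed return structure
w_c = coreset['w']                   # per-sample Coreset weights
model = xgb.XGBClassifier().fit(X_c, y_c, sample_weight=w_c)
```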
New Features and Improvements in version 0.2.0
- Additional CoresetTreeServices for linear regression, K-Means, PCA and SVD have been added to the library.
- Significantly reduced the memory footprint, by up to 50%, especially during the various build and partial_build functions.
- Data is now read in chunks of `chunk_size` when the file format allows it (CSV, TSV), to reduce the memory footprint when building the Coreset tree.
- Significantly improved the `get_coreset` time on large datasets.
- Significantly improved the save time and changed the default save format to pickle.
- Significantly improved the importance calculation when the number of data instances per class is lower than the number of features.
- Added the ability to save the entire dataset, and not just the selected samples, by setting the `save_all` parameter during the initialization of the class. When `optimized_for='cleaning'`, `save_all` is True by default; when `optimized_for='training'`, it is False by default.
- Added the ability to define certain columns in the dataset as properties (`props`). Properties won’t be used to compute the Coreset or train the model, but it is possible to `filter_out_samples` on them or to pass them in the `select_from_function` of `get_important_samples`; see the sketch after this list.
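A sketch of `save_all` and `props` together, as described above. The `data_params` schema used to mark property columns and the callback signatures of `filter_out_samples()` and `select_from_function` are assumptions modeled on the library's patterns; the column names (`sample_date`, `label`) are hypothetical, for illustration only.

```python
import pandas as pd
from dataheroes import CoresetTreeServiceLG  # assumed class name

df = pd.DataFrame({
    'feature_1':   [0.1, 0.4, 0.9],
    'sample_date': ['2019-05-01', '2021-06-01', '2022-03-01'],  # a property column
    'label':       [0, 1, 1],
})

# The data_params schema below is an assumption - check the docs for your version.
data_params = {
    'target': {'name': 'label'},
    'properties': [{'name': 'sample_date'}],  # excluded from Coreset computation and training
}

service = CoresetTreeServiceLG(
    optimized_for='cleaning',  # save_all defaults to True for 'cleaning'
    save_all=True,             # shown explicitly; defaults to False for 'training'
    data_params=data_params,
)
service.build_from_df(df)  # assumed DataFrame-based build call

# Properties can drive filtering and sample selection even though they are
# not part of the model; both callback signatures below are assumptions.
service.filter_out_samples(
    filter_function=lambda indices, X, y, props: props['sample_date'] < '2020-01-01'
)
result = service.get_important_samples(
    size=2,
    select_from_function=lambda indices, X, y, props: props['sample_date'] >= '2021-01-01'
)
```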