Release Notes
Last updated on May 22, 2025
New Features in version 0.17.0
- Adding the `active_sample` function, which refines the Coreset by adding new samples to it based on active sampling.
 
New Features, Improvements and Bug Fixes in version 0.16.0
- Adding the `build_from_databricks` and `partial_build_from_databricks` functions, which allow building a Coreset tree directly from a Databricks SQL query.
- Adding native support for AWS, GCP, and Azure cloud storage. Any function receiving a path to a file or a directory, namely `build_from_file`, `partial_build_from_file`, `get_coreset`, `save`, `load`, and `plot`, can now receive a cloud storage path in addition to a local path.
- Adding a CLI that allows configuring different parameters of the library.
- Adding `model_fit_params` to `grid_search`, `fit`, and the validation functions, allowing parameters (such as early stopping) to be passed to the model's `fit` function.
- Fixing an issue where `partial_build` with `sample_weight` did not work when the original `build` was done without `sample_weight`, and vice versa.
- Fixing an issue where `fit` ignored the `tree_idx` parameter it received.
- Fixing an issue where `coreset_size` was ignored when it was configured but `chunk_size` was not.
- Fixing issues in `grid_search` related to handling unbalanced Coreset trees, and improving the logging message for the best hyperparameter combination.
- `get_coreset_size` is no longer limited by license.
- `save_coreset` was removed from the library's API. `get_coreset` now receives a `save_path` argument if the Coreset needs to be saved (see the sketch after this list).
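
A minimal sketch of the `model_fit_params` and `save_path` changes above. The package import path, the service class, the constructor arguments, and the surrounding calls are illustrative assumptions; only `model_fit_params` and `save_path` come from the notes themselves.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

X, y = make_classification(n_samples=10_000, random_state=0)

service = CoresetTreeServiceLG(optimized_for='training', n_instances=10_000)
service.build(X, y)

# model_fit_params is forwarded to the model's own fit function.
service.fit(model=XGBClassifier(), model_fit_params={'verbose': False})

# save_coreset was removed; pass save_path to get_coreset instead.
coreset = service.get_coreset(save_path='coreset_out')
```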
New Features, Improvements and Bug Fixes in version 0.15.0
- It is now possible to build multiple Coreset trees within one `build` call for data tuning and model training purposes, using the `DataTuningParams` structure, which can be passed when initializing the Coreset tree. The `DataTuningParams` structure is treated as a `param_grid`, and a Coreset tree will be built for each combination of parameters (see the sketch after this list).
- As a result of the `DataTuningParams` change, the `coreset_size` and `sample_all` parameters can only be defined within the `DataTuningParams` structure and not directly within the list of parameters passed when initializing the Coreset tree.
- As a result of the `DataTuningParams` change, the `grid_search` function can receive a list of `tree_indices`, specifying for which Coreset trees the grid search should be executed. If not specified, `grid_search` will be executed on all Coreset trees by default.
- As a result of the `DataTuningParams` change, a `tree_idx` parameter can be passed to the following functions, defining the index of the tree from which the Coreset is extracted: `get_coreset`, `get_coreset_size`, `fit`, `get_hyperparameter_tuning_data`, `cross_validate`, `holdout_validate`, and `seq_dependent_validate`. The default index is `0`, which corresponds to the tree built according to the first `DataTuningParams` combination passed.
- When defining a Coreset tree, the default value for the `optimized_for` parameter is now `training`.
- Coresets can now be built when the features include a new data type, an array of integers (e.g., `[3, 7, 19, 203, 579]`). The library will automatically encode these array features into binary columns during the build process.
- The default `coreset_size` and `chunk_size` calculation was improved in cases where the dataset includes categorical and array features.
- It is now possible to pass `n_instances` and `coreset_size` as a float when initializing the Coreset tree, without specifying the `chunk_size`. The library will calculate the optimal `chunk_size`, and the provided `coreset_size` will be treated as a ratio of the `chunk_size`.
- Adding the `build_from_file_insights`, `build_from_df_insights`, and `build_insights` functions, which provide insights into the Coreset tree(s) that would be built for the provided dataset.
- Fixing an issue where `build_from_file` and `partial_build_from_file` received a directory or a list of directories, but the files were not processed in alphabetical order.
- Adding the `remove_by_seq` function, which removes nodes from the Coreset tree within a given sequence.
- When building the Coreset tree with a sequence column, it is now possible to define `'chunk_by': 'every_build'`. In this case, every call to `build` or `partial_build` will receive its own sequence, which can be assigned by the user or will be automatically assigned by the library.
- When building the Coreset tree with a sequence column, it is now possible to define a `sliding_window` (e.g., `'sliding_window': 10`). In this case, the tree will only keep samples that are in the last `n` `seq_column` granularity units, where `n` is the value of the `sliding_window`.
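
A minimal sketch of the `DataTuningParams` flow described in the first bullets above; the package import path, the constructor parameter names, and the field layout are assumptions based on the notes.

```python
from sklearn.datasets import make_classification
from dataheroes import CoresetTreeServiceDTC, DataTuningParams  # import paths assumed

X, y = make_classification(n_samples=20_000, random_state=0)

# Treated as a param_grid: one Coreset tree is built per combination,
# so the three coreset_size values below yield three trees.
data_tuning_params = DataTuningParams(coreset_size=[0.1, 0.2, 0.3])

service = CoresetTreeServiceDTC(
    optimized_for='training',               # now the default value
    data_tuning_params=data_tuning_params,  # parameter name assumed
)
service.build(X, y)

# tree_idx selects the tree to work with; the default 0 is the first combination.
coreset = service.get_coreset(tree_idx=2)

# Restrict a grid search to a subset of the trees; by default all trees are used.
service.grid_search(param_grid={'max_depth': [4, 6]}, tree_indices=[0, 2])
```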
Bug Fixes in version 0.14.1
- Fixing compatibility issues with numpy version >= 2.0.0.
 
New Features, Improvements and Bug Fixes in version 0.14.0
- Adding the capability to pass `sample_weight` to all the `build` functions (see the sketch after this list).
- `preprocessing_stage=user` is now the default when using CatBoost in `fit`, `grid_search`, and the validation functions. In all other cases, `preprocessing_stage=auto` is the default.
- Fixing a rare problem that caused the `build` of the `CoresetTreeServiceDTC` and `CoresetTreeServiceDTR` to fail.
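
A minimal sketch of passing `sample_weight` to `build`; everything except the `sample_weight` argument itself (package, class, constructor arguments) is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

X, y = make_classification(n_samples=5_000, random_state=0)
w = np.where(y == 1, 2.0, 1.0)  # e.g., up-weight the positive class

service = CoresetTreeServiceLG(optimized_for='training', n_instances=5_000)
service.build(X, y, sample_weight=w)  # sample_weight is now accepted by all build functions
```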
New Features, Improvements and Bug Fixes in version 0.13.0
- Adding support for a mixed categorical encoding strategy in binary classification problems, combining Target Encoding (TE) and One-Hot Encoding (OHE), which is now the default strategy.
 
Bug Fixes in version 0.12.1
- Fixing the problem where `predict` and `predict_proba` sometimes worked incorrectly when the data included categorical features.
New Features, Improvements and Bug Fixes in version 0.12.0
- Adding support for Target Encoding, which is now the default way to handle categorical features in binary classification problems.
- Significantly improving the `build` functions' runtime and memory consumption, especially for datasets with categorical features and missing values.
- Significantly improving the runtime of the `auto_preprocessing`, `predict`, and `predict_proba` functions.
- Fixing various problems concerning the handling of missing values.
- Fixing the problem where incorrect data was saved for validation purposes when calling the `save` function.
 
Improvements and Bug Fixes in version 0.11.2
- Fixing a memory leak that sometimes occurred during the various `build` and `partial_build` functions.
- Improving the `build` functions' runtime and memory consumption.
Improvements and Bug Fixes in version 0.11.1
- Improving the `grid_search` runtime when XGBoost, LightGBM, and CatBoost are used.
- The `get_coreset_size` function is no longer limited by license.
- Fixing the problem preventing the build of the `CoresetTreeServiceDTR` in some cases.
New Features, Improvements and Bug Fixes in version 0.11.0
- Adding logging to the library, to improve its debugging capabilities.
- Adding a `verbose` parameter to the various `build` and `partial_build` functions, to provide a better indication of how long the operation will take.
- Adding the `get_hyperparameter_tuning_data` function, which retrieves the data from the Coreset tree in a format that allows running `GridSearchCV`, `BayesSearchCV`, and other hyperparameter tuning methods (see the sketch after this list).
- Improving the structure returned by `get_coreset`.
- Improving the `build` functions' runtime and memory consumption.
- Fixing various problems when saving the Coreset tree.
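
A minimal sketch of feeding `get_hyperparameter_tuning_data` into Scikit-learn's `GridSearchCV`. The returned structure (here assumed to expose `X` and `y` attributes) is a guess, as are the package and class choice; consult the API reference for the actual format.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

X, y = make_classification(n_samples=10_000, random_state=0)
service = CoresetTreeServiceLG(optimized_for='training', n_instances=10_000)
service.build(X, y)

# Retrieve the tree's data in a hyperparameter-tuning-friendly format.
tuning_data = service.get_hyperparameter_tuning_data()
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid={'C': [0.1, 1, 10]})
search.fit(tuning_data.X, tuning_data.y)  # attribute names assumed
```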
 
New Features, Improvements and Bug Fixes in version 0.10.3
- Improving the `grid_search` runtime by improving its parallelism. The number of jobs run in parallel during `grid_search` can be controlled by the `n_jobs` parameter passed to the function.
- Fixing the problem where `grid_search` with `refit=True` would ignore the sequence-related parameters passed to the function when the Coreset tree was built with a `seq_column`.
- Fixing the problem where `fit` and `grid_search` would not always select the optimal nodes from the Coreset tree when it was built with a `seq_column`.
New Features, Improvements and Bug Fixes in version 0.10.2
- Allowing `coreset_size` to also be configured as a float, representing the ratio between each chunk and the resulting Coreset (see the sketch after this list).
- Fixing the problem where the returned column types were incorrect in some cases when calling `get_coreset`, `fit`, or `grid_search` with `preprocessing_stage=user`.
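
A minimal sketch of configuring `coreset_size` as a float, i.e., the ratio between each chunk and the resulting Coreset; the package, the class, and the remaining constructor arguments are assumptions.

```python
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

service = CoresetTreeServiceLG(
    optimized_for='training',
    n_instances=1_000_000,
    chunk_size=20_000,
    coreset_size=0.2,  # each chunk's Coreset holds roughly 20% of the chunk
)
```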
New Features, Improvements and Bug Fixes in version 0.10.1
- Improving the `grid_search` time for datasets with categorical features and missing values.
New Features, Improvements and Bug Fixes in version 0.10.0
- Adding an `enhancement` parameter to the `fit`, `grid_search`, and the validation functions of the `CoresetTreeServiceDTC` and `CoresetTreeServiceDTR`. Setting this parameter to a value of 1 to 3 will enhance the default decision tree based training, which can improve the strength of the model but will increase the training run time (see the sketch below).
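
A minimal sketch of the `enhancement` parameter; the value range (1 to 3) is from the note above, while the data and the surrounding calls are assumptions.

```python
from sklearn.datasets import make_classification
from dataheroes import CoresetTreeServiceDTC  # package path assumed

X, y = make_classification(n_samples=10_000, random_state=0)
service = CoresetTreeServiceDTC(optimized_for='training', n_instances=10_000)
service.build(X, y)

# Stronger, but slower, decision tree based training (valid values: 1 to 3).
model = service.fit(enhancement=2)
```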
Bug Fixes in version 0.9.1
- Fixing the problem where `predict` and `predict_proba` would sometimes fail when a Coreset tree was built with a `seq_column` on a dataset including categorical features.
New Features, Improvements and Bug Fixes in version 0.9.0
- Allowing `get_coreset`, `fit`, `grid_search`, and the validation functions to be executed on a subset of the data of the Coreset tree (such as certain date ranges), by defining a sequence column (`seq_column`) in the `DataParams` structure passed during the initialization of the class, which can then be used to filter the data (see the sketch after this list).
- Improving the `build` time by improving the parallelism of the Coreset tree construction. The number of jobs run in parallel during `build` can be controlled by the `n_jobs` parameter passed to the function.
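
A minimal sketch of defining a `seq_column` and then querying a date range. The package import paths, the `DataParams` field layout, and the sequence-filtering argument names (`seq_from`, `seq_to`) are all assumptions based on the note above.

```python
import numpy as np
import pandas as pd
from dataheroes import CoresetTreeServiceLG, DataParams  # import paths assumed

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=5_000, freq='h'),
    'f1': rng.normal(size=5_000),
    'f2': rng.normal(size=5_000),
    'y': rng.integers(0, 2, size=5_000),
})

data_params = DataParams(
    target={'name': 'y'},                             # field layout assumed
    seq_column={'name': 'date', 'granularity': 'D'},  # field layout assumed
)
service = CoresetTreeServiceLG(optimized_for='training', data_params=data_params)
service.build_from_df(df)

# Work on a date-range subset of the tree (argument names assumed):
coreset = service.get_coreset(seq_from='2023-01-01', seq_to='2023-03-31')
```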
New Features, Improvements and Bug Fixes in version 0.8.1
- Fixing a licensing issue.
 
New Features, Improvements and Bug Fixes in version 0.8.0
- An additional CoresetTreeService for all decision tree regression-based problems has been added to the library. This service can be used to create regression-based Coresets for libraries including XGBoost, LightGBM, CatBoost, Scikit-learn, and others.
- Improving the default Coreset tree created when `chunk_size` and `coreset_size` are not provided during the initialization of the class.
- `grid_search` was extended to return a pandas DataFrame with the score of each hyperparameter combination and fold.
- Improving the `grid_search` time.
New Features, Improvements and Bug Fixes in version 0.7.0
- Coresets can now be built when the dataset has missing values. The library will automatically handle them during the build process (as Coresets themselves can only be computed on data without missing values).
- Significantly improving the `predict` and `predict_proba` time for datasets with categorical features.
- `predict` and `predict_proba` will now automatically preprocess the data according to the `preprocessing_stage` used to train the model.
- Improving the automatic detection of categorical features.
- It is now possible to define the model class used to train the model on the Coreset, when initializing the CoresetTreeService class, using the `model_cls` parameter (see the sketch after this list).
- Enhancing the `grid_search` function to run on unsupervised datasets.
- Fixing the problem where `grid_search` would fail after `remove_samples` or `filter_out_samples` were called.
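
A minimal sketch of the `model_cls` parameter; the package path, the class choice, and the surrounding calls are assumptions.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

X, y = make_classification(n_samples=10_000, random_state=0)

# Train the model on the Coreset with LightGBM instead of the default model class.
service = CoresetTreeServiceLG(optimized_for='training', model_cls=LGBMClassifier)
service.build(X, y)
model = service.fit()
```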
New Features, Improvements and Bug Fixes in version 0.6.0
- Replacing the `save_all` parameter, passed when initializing all classes, with the `chunk_sample_ratio` parameter, which indicates the size of the sample that will be taken and saved from each chunk, on top of the Coreset, for the validation methods.
- Significantly improving the build time for datasets with categorical features.
- `grid_search`, `cross_validate`, and `holdout_validate` now all receive the `preprocessing_stage` parameter, same as the `fit` function.
- `fit` now uses `preprocessing_stage=auto` by default when Scikit-learn or XGBoost is used to train the model, and `preprocessing_stage=user` by default when LightGBM or CatBoost is used.
- Fixing the problem where `model_params` was ignored when both `model` and `model_params` were passed to `fit`, `grid_search`, `cross_validate`, or `holdout_validate`.
- Fixing the problem where the returned result was faulty when a single Coreset was created for the entire dataset and `get_cleaning_samples` was called with `class_size={"class XXX": "all"}`.
New Features and Improvements in version 0.5.0
- Coresets can now be built using categorical features. The library will automatically one-hot encode them during the build process (as Coresets can only be built with numeric features).
- `get_coreset` can now return the data according to three data preprocessing stages (see the sketch after this list):
  - `preprocessing_stage=original` – The dataset as it is handed to the Coreset's build function.
  - `preprocessing_stage=user` – The dataset after any user-defined data preprocessing (default).
  - `preprocessing_stage=auto` – The dataset after any automatic data preprocessing done by the library, such as one-hot encoding and converting Boolean fields to numeric. The features (X) can also be returned as a sparse matrix or an array, controlled by the `sparse_output` parameter (applicable only for `preprocessing_stage=auto`).
- `fit` can now use the data according to two data preprocessing stages:
  - `preprocessing_stage=auto` is the default when Scikit-learn is used to train the model.
  - `preprocessing_stage=user` is the default when XGBoost, LightGBM, or CatBoost is used to train the model.
- Adding a new `auto_preprocessing` function, allowing the user to preprocess the (test) data as it was automatically done by the library during the Coreset's build function.
- Fixing the problem where the library's Docstrings did not show up in some IDEs.
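
A minimal sketch of the preprocessing stages and `auto_preprocessing`; the stage string values, the test-data flow, and the package and class choice are assumptions based on the notes above.

```python
from sklearn.datasets import make_classification
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

X, y = make_classification(n_samples=10_000, random_state=0)
service = CoresetTreeServiceLG(optimized_for='training', n_instances=10_000)
service.build(X, y)

# 'original' / 'user' (default) / 'auto'; sparse_output applies only to 'auto'.
coreset = service.get_coreset(preprocessing_stage='auto', sparse_output=True)

# Apply the same automatic preprocessing the build performed to unseen (test) data:
X_test, _ = make_classification(n_samples=100, random_state=1)
X_test_auto = service.auto_preprocessing(X_test)
```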
 
New Features and Improvements in version 0.4.0
- Adding support for Python 3.11.
- Allowing the creation of a CoresetTreeService which can be used for both training and cleaning (`optimized_for=['cleaning', 'training']`).
- The CoresetTreeService can now handle datasets that do not fit into the device's memory for cleaning purposes as well (for training purposes this was supported from the initial release).
- The `get_important_samples` function was renamed to `get_cleaning_samples` to improve the clarity of its purpose.
- Adding hyperparameter tuning capabilities to the library with the introduction of the `grid_search()` function, which works in a similar manner to Scikit-learn's `GridSearchCV` class, only dramatically faster, as it utilizes the Coreset tree. Also introducing the `cross_validate()` and `holdout_validate()` functions, which can be used directly or as the validation method within the `grid_search()` function (see the sketch after this list).
- Further improving the error messages for some of the data processing problems users encountered.
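
A minimal sketch of `grid_search` and the validation functions; the `param_grid` contents, the return values, and the package and class choice are assumptions.

```python
from sklearn.datasets import make_classification
from dataheroes import CoresetTreeServiceLG  # package and class choice assumed

X, y = make_classification(n_samples=10_000, random_state=0)
service = CoresetTreeServiceLG(optimized_for='training', n_instances=10_000)
service.build(X, y)

# Works like Scikit-learn's GridSearchCV, but leverages the Coreset tree.
best = service.grid_search(param_grid={'C': [0.1, 1.0, 10.0]})

# The validation functions can also be called directly (argument defaults assumed):
scores = service.cross_validate()
```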
 
Bug Fixes in version 0.3.1
- `build` and `partial_build` now read the data in chunks of `chunk_size` when the file format allows it (CSV, TSV), to reduce the memory footprint when building the Coreset tree.
 
New Features and Improvements in version 0.3.0
- An additional CoresetTreeService for all decision tree classification-based problems has been added to the library. This service can be used to create classification-based Coresets for libraries including XGBoost, LightGBM, CatBoost, Scikit-learn, and others.
- Improving the results `get_coreset` returns in case the Coreset tree is not perfectly balanced.
- Improving the data handling capabilities when processing the input data provided to the different build functions, such as supporting `pandas.BooleanDtype` and `pandas.Series`, and returning clearer error messages for some of the data processing problems encountered.
 
New Features and Improvements in version 0.2.0
- Additional CoresetTreeServices for linear regression, K-Means, PCA and SVD have been added to the library.
 - Significantly reducing the memory footprint by up to 50% especially during the various build and partial_build functions.
- Data is read in chunks of `chunk_size` when the file format allows it (CSV, TSV), to reduce the memory footprint when building the Coreset tree.
- Significantly improving the `get_coreset` time on large datasets.
 - Significantly improving the save time and changing the default save format to pickle.
 - Significantly improving the importance calculation when the number of data instances per class is lower than the number of features.
- Allowing the entire dataset, and not just the selected samples, to be saved, by setting the `save_all` parameter during the initialization of the class. When `optimized_for='cleaning'`, `save_all` is True by default, and when `optimized_for='training'`, it is False by default.
- Allowing certain columns in the dataset to be defined as properties (`props`). Properties won't be used to compute the Coreset or train the model, but it is possible to `filter_out_samples` on them or to pass them in the `select_from_function` of `get_important_samples` (see the sketch below).
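
A minimal sketch of `save_all` and properties; the package import paths and the `DataParams` field layout are assumptions based on the notes above.

```python
from dataheroes import CoresetTreeServiceLG, DataParams  # import paths assumed

data_params = DataParams(
    target={'name': 'y'},
    properties=[{'name': 'source_file'}],  # props: stored, but not used for the Coreset or training
)
# save_all defaults to True for 'cleaning' and False for 'training'; set it explicitly if needed.
service = CoresetTreeServiceLG(
    optimized_for='cleaning',
    data_params=data_params,
    save_all=True,
)
# Properties can then be used to filter_out_samples, or passed in the
# select_from_function of get_important_samples.
```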