DataParams

DataParams `dataclass`

Bases: BaseDC

A class holding all the information required to preprocess the data. When it is not defined, all fields/columns in the data are treated as features.

The example below shows how the class is used by the CoresetTreeService class. See a more extensive example at the end of the page.

data_params = {
    'target': {'name': 'Cover_Type'},
    'index': {'name': 'index_column'}
}

service_obj = CoresetTreeServiceLG(
    optimized_for='training',
    data_params=data_params
)
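The quick example above leaves `features` undefined. When listing every feature is impractical, the `columns_to_features` parameter described in the table below accepts `include` or `exclude` masks instead. The following is only a sketch of how such masks could select columns; it assumes the masks are regular expressions that must match the full column name, which this page does not state. The helper `select_feature_columns` is hypothetical and not part of the library.

```python
import re


def select_feature_columns(columns, columns_to_features):
    """Hypothetical helper: apply include/exclude masks to column names.

    Assumption: each mask is a regular expression that must match the
    whole column name. The library's actual matching rule may differ.
    """
    if isinstance(columns_to_features, bool):
        # True: every column is a feature. False: nothing is auto-selected.
        return list(columns) if columns_to_features else []

    include = columns_to_features.get('include')
    exclude = columns_to_features.get('exclude', [])

    def matches(name, masks):
        return any(re.fullmatch(mask, name) for mask in masks)

    selected = []
    for name in columns:
        if include is not None and not matches(name, include):
            continue  # not covered by any include mask
        if matches(name, exclude):
            continue  # explicitly excluded
        selected.append(name)
    return selected


columns = ['age', 'internal_id', 'internal_flag', 'bad', 'badge', 'salary']
print(select_feature_columns(columns, {'exclude': ['internal_.*', 'bad']}))
# ['age', 'badge', 'salary']
```

Under the full-match assumption, the mask `'bad'` drops only the column named exactly `bad` and keeps `badge`.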

Parameter name	Type	Description
General Parameters
features	List	The list of the fields/columns used as features to build the Coreset and train the model. If not defined, the `columns_to_features` parameter should be defined. Each feature is defined as a dictionary with the following attributes (only `name` is mandatory): `name`: The feature name. `dtype`: The feature data type. `categorical`: Set to True if the feature is categorical. For more information refer to the Categorical Features Parameters section. `fill_value`: The value used to fill the feature's missing values. For more information refer to the Missing Values Parameters section. `transform`: A function defining the required transformation for the feature. For more information refer to the Data Transformation Parameters section. See the example at the end of the page.
target	dict	The target column. Example: `'target': {'name': 'Cover_Type'}`
index	dict	The index column. Example: `'index': {'name': 'index_column'}`
properties	List	A list of fields/columns which won’t be used to build the Coreset or train the model, but it is possible to `filter_out_samples` on them or to pass them in the `select_from_function` of `get_cleaning_samples`. Example: `'properties': [{'name': 'date'},]`
columns_to_features	Union[bool, dict], default False	Either `bool` or `dict` with two possible fields, `include` or `exclude`. When set to true, all fields/columns in the dataset are treated as features. When `include` is used, only fields/columns defined or those fitting the defined masks are treated as features. When `exclude` is used, only fields/columns defined or those fitting the defined masks are not treated as features. Example: `{'exclude': ['internal_.*', 'bad']}`
datetime_to_properties	bool, default True	By default, all datetime fields/columns are turned into properties. Properties won’t be used to build the Coreset or train the model, but it is possible to `filter_out_samples` on them or to pass them in the `select_from_function` of `get_cleaning_samples`. To disable this functionality set `'datetime_to_properties': False`.
save_orig	bool, default False	When data transformations are defined (such as `data_transform_before` or a feature-level `transform`), the default behavior is to save the data only after the transformations. To also save the data in its original format, as it was handed to the build function and before any user-defined data transformations, set `'save_orig': True`. To retrieve the Coreset in its original format, use `preprocessing_stage='original'` when calling the `get_coreset` function.
seq_column	dict, default None	Defining a sequence column (such as a date) allows passing the `seq_from` and `seq_to` parameters to `get_coreset`, `fit`, `grid_search` and the validation functions, so that these functions are executed on a subset of the data (such as certain date ranges). The `seq_column` is a dictionary containing the following parameters: `name`/`id`/`prop_id`: Required. The name, id or prop_id of the sequence column. `name`: The name of the column. `id`: The index of the feature starting from 0. `prop_id`: The index of the property starting from 0. `granularity`: Required. The granularity at which the sequence column is queried. Can be either a pandas offset or a callable function. `datetime_format`: Required when the sequence column is a datetime formatted as a string. The datetime format of the sequence column. `chunk_by`: Optional. When set, the Coreset tree will be built using the `chunk_by` functionality according to the sequence column instead of using a fixed `chunk_size`. Example: `'seq_column': { 'name': 'Transaction Date', 'granularity': 'D', 'datetime_format': '%Y-%m-%d', 'chunk_by': True }`
Categorical Features Parameters
detect_categorical	bool, default True	By default, all non-numeric and non-boolean fields/columns are automatically regarded as categorical features and are one-hot encoded and/or target encoded by the library. To disable this functionality set `'detect_categorical': False`. Note - coresets can only be built with numeric features.
cat_encoding_method	str, default None	Use this parameter to override the default categorical encoding strategy (valid non-default values are `'OHE'`, `'TE'`, `'MIXED'`). If this parameter is left at its default, the strategy for encoding categorical features is determined as follows: a mixed categorical encoding strategy, combining both Target Encoding (TE) and One Hot Encoding (OHE), is used in binary classification tasks; One Hot Encoding (OHE) is used in all other types of learning tasks (multiclass classification, regression, and unsupervised learning). Overriding takes effect only for binary classification tasks (e.g., changing the default `'MIXED'` to `'OHE'` or to `'TE'`). For more details on the mixed categorical encoding strategy, please see the `favor_ohe_num_cats_thresh` documentation below.
categorical_features	List	Features that contain only numeric values can be forced to be categorical in two ways: on a feature-by-feature basis (setting the `categorical` attribute to True in the `features` list), or using the `categorical_features` list, passing the feature names or the feature indices in the dataset starting from 0. See the example at the end of the page.
ohe_min_frequency	float between 0 and 1, default 0.01	Similarly to scikit-learn's OneHotEncoder `min_frequency` parameter, specifies the minimum frequency below which a category will be considered infrequent. Example: `'ohe_min_frequency': 0`
ohe_max_categories	int, default 100	Similarly to scikit-learn's OneHotEncoder `max_categories` parameter, specifies an upper limit to the number of output features for each input feature when considering infrequent categories. Example: `'ohe_max_categories': 500`
te_cv	int, default 5	If Target Encoding is employed, this parameter determines the number of folds in the 'cross fitting' strategy used in TargetEncoder’s `fit_transform`. In practice, a lower number may be applied, based on the distribution of classes in the data.
te_random_state	int, default None	If Target Encoding is employed, this parameter affects the ordering of the indices which controls the randomness of each fold in its 'cross fitting' strategy. Pass an int for reproducible output across multiple function calls.
favor_ohe_num_cats_thresh	int, default 50	Works in conjunction with `favor_ohe_vol_pct_thresh`. In a mixed categorical encoding strategy, both One Hot Encoding (OHE) and Target Encoding (TE) are employed at the same time, and the categorical features are divided into two distinct groups, one for each encoding type. For this division, OHE is favored over TE when the number of categories of a categorical feature is lower than `favor_ohe_num_cats_thresh`, or when it is higher but the feature's top `favor_ohe_num_cats_thresh` categories (or fewer) cover at least a `favor_ohe_vol_pct_thresh` fraction of the data instances. Using the default values as an example, `favor_ohe_num_cats_thresh=50` and `favor_ohe_vol_pct_thresh=0.8` mean that if a categorical feature's top `50` (or fewer) categories capture `80%` (or more) of the volume, the feature will be encoded using OHE; otherwise, it will be encoded using TE.
favor_ohe_vol_pct_thresh	float between 0 and 1, default 0.8	Works in conjunction with `favor_ohe_num_cats_thresh`, please see its description.
Missing Values Parameters
detect_missing	bool, default True	By default, missing values are automatically detected and handled by the library. To disable this functionality set `'detect_missing': False`. Note - coresets can only be built when there are no missing values.
drop_rows_below	float between 0 and 1, default 0 (nothing is dropped)	If the ratio of instances with missing values on any feature is lower than this ratio, those instances would be ignored during the coreset build. Example: `'drop_rows_below': 0.05`
drop_cols_above	float between 0 and 1, default 1 (nothing is dropped)	If the ratio of instances with missing values for a certain feature is higher than this ratio, this feature would be ignored during the coreset build. Example: `'drop_cols_above': 0.3`
fill_value_num	float	By default, missing values for numeric features are replaced with the calculated mean. This default can be changed for all numeric features by defining a specific replacement number with `fill_value_num`, or on a feature-by-feature basis with the `fill_value` attribute in the `features` list. Example: `'fill_value_num': -1`
fill_value_cat	Any	By default, missing values for categorical features are treated as just another category/value when the feature is one-hot encoded by the library. This default can be changed for all categorical features with `fill_value_cat`, by passing either a specific replacement value or `take_most_common` (which fills the missing values with the feature's most common value), or on a feature-by-feature basis with the `fill_value` attribute in the `features` list. Example: `'fill_value_cat': 'take_most_common'`
Data Transformation Parameters
data_transform_before	Transform	A preprocessing function applied to the entire dataset. The function's signature is `func(dataset) -> dataset`. See the example at the end of the page.
feature_transform_default	Transform	A default feature transformation function applied on a feature-by-feature basis. Executed after `data_transform_before`. The function can be overridden at the feature level by defining the `transform` attribute in the `features` list. The function's signature is `func(dataset, transform_context: dict) -> data`. The function returns the data of a single feature. Example: `'transform_context': {'feature_name': 'salary'}`
data_transform_after	Transform	A preprocessing function similar to `data_transform_before`, executed after the feature-by-feature transformation. See the example at the end of the page.
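The mixed-encoding rule described for `favor_ohe_num_cats_thresh` and `favor_ohe_vol_pct_thresh` can be easier to follow as code. The sketch below is an illustration of the rule as worded in the table, not the library's implementation; the helper name `favors_ohe` is hypothetical, and missing values are ignored for simplicity.

```python
from collections import Counter


def favors_ohe(values, num_cats_thresh=50, vol_pct_thresh=0.8):
    """Hypothetical helper: True if OHE would be favored over TE.

    Illustrates the rule from the parameter table; the library's own
    logic may treat edge cases (ties, missing values) differently.
    """
    if not values:
        raise ValueError("values must not be empty")

    counts = Counter(values)

    # Few categories: OHE is favored outright.
    if len(counts) < num_cats_thresh:
        return True

    # Many categories: check how much of the data the top categories cover.
    top_volume = sum(count for _, count in counts.most_common(num_cats_thresh))
    return top_volume / len(values) >= vol_pct_thresh


# Small thresholds keep the example readable.
skewed = ['a'] * 6 + ['b'] * 3 + ['c']       # top 2 categories cover 90%
spread = ['a'] * 4 + ['b'] * 3 + ['c'] * 3   # top 2 categories cover 70%

print(favors_ohe(skewed, num_cats_thresh=2, vol_pct_thresh=0.8))  # True
print(favors_ohe(spread, num_cats_thresh=2, vol_pct_thresh=0.8))  # False
```

With the defaults (`50` and `0.8`), a feature with thousands of distinct values is still one-hot encoded if its 50 most common values account for at least 80% of the rows.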

Code example:

import numpy as np
import pandas as pd


def data_before_processing(dataset):
    df = pd.DataFrame(dataset)
    # remove dataset rows by condition
    df = df[df['department'] != 'Head Office']
    return df


def education_level_transform(dataset):
    # replace categorical values with numeric
    df = pd.DataFrame(dataset)
    conditions = [
        df['education_level'] == 'elementary_school',
        df['education_level'] == 'high_school',
        df['education_level'] == 'diploma',
        df['education_level'] == 'associates',
        df['education_level'] == 'bachelors',
        df['education_level'] == 'masters',
        df['education_level'] == 'doctorate',
    ]
    choices = [1, 2, 3, 4, 5, 6, 7]
    df['education_level'] = np.select(conditions, choices)
    return df['education_level']


def yearly_bonus_transform(dataset):
    df = pd.DataFrame(dataset)
    # for creating new feature
    return df['h1_bonus'] + df['h2_bonus']


def transform_scaling(dataset):
    from sklearn.preprocessing import MinMaxScaler
    df = pd.DataFrame(dataset)
    scaler = MinMaxScaler()
    columns_to_scale = ['age', 'work_experience_in_months']
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df


data_params = {
    'features': [
        {
            'name': 'family_status',
            'categorical': True,
            'fill_value': 'single'
        },
        {'name': 'department', 'categorical': True},
        {'name': 'gender'},
        {'name': 'job_title'},
        {'name': 'age', 'fill_value': 18},
        {'name': 'work_experience_in_months'},
        {
            'name': 'education_level',
            'transform': {'func': education_level_transform}
        },
        {
            'name': 'yearly_bonus',
            'transform': {'func': yearly_bonus_transform}
        },
    ],
    'properties': [{'name': 'full_name'}, {'name': 'Hire Date'}],
    'target': {'name': 'salary'},
    'categorical_features': ['gender'],
    'fill_value_cat': 'take_most_common',
    'fill_value_num': 0,
    'data_transform_before': {'func': data_before_processing},
    'data_transform_after': {'func': transform_scaling},
    'seq_column': {
        'name': 'Hire Date',
        'granularity': 'Y',
        'datetime_format': '%d/%m/%Y',
        'chunk_by': True,
    },
}
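The `seq_column` entry in the example declares `Hire Date` as a string column with `'datetime_format': '%d/%m/%Y'`. Before a build, it can be worth checking that the format string really matches the data. The sketch below does so with the standard library only; it assumes the library interprets `datetime_format` using the usual `strptime` directives, as the `'%d/%m/%Y'` value suggests. The helper `check_datetime_format` and the sample values are hypothetical.

```python
from datetime import datetime


def check_datetime_format(samples, datetime_format):
    """Hypothetical helper: return the sample values that fail to parse."""
    failures = []
    for value in samples:
        try:
            datetime.strptime(value, datetime_format)
        except ValueError:
            failures.append(value)
    return failures


# Made-up sample values; the last one uses a different layout.
hire_dates = ['03/11/2019', '27/01/2021', '2021-01-27']

print(check_datetime_format(hire_dates, '%d/%m/%Y'))
# ['2021-01-27']
```

An empty result means every sampled value parses with the given format.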
