DataParams

DataParams dataclass

Bases: BaseDC

A class holding all the information required to preprocess the data. When it is not defined, all fields/columns in the data are treated as features.

The example below shows how the class is used by the CoresetTreeService class. See a more extensive example at the end of the page.

data_params = {
    'target': {'name': 'Cover_Type'},
    'index': {'name': 'index_column'}
}

service_obj = CoresetTreeServiceLG(
    optimized_for='training',
    data_params=data_params
)

Parameter name    Type    Description
General Parameters
features List The list of the fields/columns used as features to build the Coreset and train the model. If not defined, the columns_to_features parameter should be defined. Each feature is defined as a dictionary with the following attributes (only name is mandatory):
  • name: The feature name.
  • dtype: The feature data type.
  • categorical: Set to true if the feature is categorical. For more information refer to the Categorical Features Parameters section.
  • fill_value: How missing values of the feature should be filled. For more information refer to the Missing Values Parameters section.
  • transform: A function defining the required transformation for the feature. For more information refer to the Data Transformation Parameters section.
    • See the example at the end of the page.
target dict The target column.
Example: 'target': {'name': 'Cover_Type'}
index dict The index column.
Example: 'index': {'name': 'index_column'}
properties List A list of fields/columns that are not used to build the Coreset or train the model, but that can be used in filter_out_samples or passed to the select_from_function of get_cleaning_samples.

Example: 'properties': [{'name': 'date'},]
columns_to_features Union[bool, dict], default False Either a bool or a dict with two possible fields, include and exclude. When set to True, all fields/columns in the dataset are treated as features. When include is used, only the listed fields/columns, or those matching the listed masks, are treated as features. When exclude is used, the listed fields/columns, or those matching the listed masks, are not treated as features.

Example: {'exclude': ['internal_.*', 'bad']}
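
How include and exclude masks might resolve to a feature list can be sketched as follows. This is only an illustration: it assumes the masks are regular expressions that must match the whole column name, which this page does not state, and resolve_features is a hypothetical helper rather than part of the library.

```python
import re

def resolve_features(columns, columns_to_features):
    """Hypothetical helper: pick feature columns from include/exclude masks."""
    if columns_to_features is True:
        return list(columns)
    if not columns_to_features:
        return []
    include = columns_to_features.get('include')
    exclude = columns_to_features.get('exclude', [])

    def matches(name, masks):
        # Assumption: each mask is a regex that must match the full column name.
        return any(re.fullmatch(mask, name) for mask in masks)

    selected = [c for c in columns if include is None or matches(c, include)]
    return [c for c in selected if not matches(c, exclude)]

columns = ['age', 'internal_id', 'internal_flag', 'bad', 'salary']
features = resolve_features(columns, {'exclude': ['internal_.*', 'bad']})
print(features)  # ['age', 'salary']
```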
datetime_to_properties boolean, default True By default, all datetime fields/columns are turned into properties. Properties are not used to build the Coreset or train the model, but they can be used in filter_out_samples or passed to the select_from_function of get_cleaning_samples. To disable this functionality set 'datetime_to_properties': False.
save_orig bool, default False When data transformations are defined (such as data_transform_before or a feature-level transform), the default behavior is to save the data only after the transformations. To also save the data in its original format, as it was handed to the build function and before any user-defined data transformations, set 'save_orig': True. To retrieve the Coreset in its original format, use preprocessing_stage='original' when calling the get_coreset function.
seq_column dict, default None Defining a sequence column (such as a date) makes it possible to pass the seq_from and seq_to parameters to get_coreset, fit, grid_search and the validation functions, so that these functions are executed on a subset of the data (such as certain date ranges). The seq_column is a dictionary containing the following parameters:
  • name/id/prop_id: Required. The name, id or prop_id of the sequence column. name: The name of the column. id: The index of the feature starting from 0. prop_id: The index of the property starting from 0.
  • granularity: Required. The granularity in which the sequence column would be queried. Can be either a pandas offset or a callable function.
  • datetime_format: Required in case the sequence column is a datetime formatted as string. The datetime format of the sequence column.
  • chunk_by: Optional. When set, the Coreset tree will be built using the chunk_by functionality according to the sequence column instead of using a fixed chunk_size.

Example:
    'seq_column':
        {
            'name': 'Transaction Date',
            'granularity': 'D',
            'datetime_format': '%Y-%m-%d',
            'chunk_by': True
        }
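
The effect of datetime_format, a daily granularity and the seq_from/seq_to range can be illustrated with the standard library alone. This is a rough sketch, not the library's implementation: it assumes Python strptime-style format codes and inclusive range bounds, and to_day is a hypothetical helper.

```python
from datetime import datetime

def to_day(value, datetime_format):
    """Parse a string timestamp and truncate it to day granularity."""
    return datetime.strptime(value, datetime_format).date()

rows = [
    {'Transaction Date': '2023-01-01', 'amount': 10},
    {'Transaction Date': '2023-01-02', 'amount': 20},
    {'Transaction Date': '2023-01-02', 'amount': 5},
    {'Transaction Date': '2023-01-05', 'amount': 7},
]

# Group rows by day, similar in spirit to chunking by the sequence column.
chunks = {}
for row in rows:
    day = to_day(row['Transaction Date'], '%Y-%m-%d')
    chunks.setdefault(day, []).append(row)

# Select only the chunks inside the requested range (assumed inclusive).
seq_from = to_day('2023-01-02', '%Y-%m-%d')
seq_to = to_day('2023-01-05', '%Y-%m-%d')
selected = [
    row
    for day in sorted(chunks)
    if seq_from <= day <= seq_to
    for row in chunks[day]
]
```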
Categorical Features Parameters
detect_categorical boolean, default True By default, all non-numeric and non-boolean fields/columns are automatically regarded as categorical features and one-hot encoded by the library. To disable this functionality set 'detect_categorical': False.

Note - coresets can only be built with numeric features.
cat_encoding_method str, default None Use this parameter to override the default categorical encoding strategy (valid non-default values are ‘OHE’, ‘TE’). If this parameter is left on default, the strategy for encoding categorical features is determined as follows: Target Encoding is used in binary classification tasks; One Hot Encoding is used in all other types of learning tasks (multiclass classification, regression, and unsupervised learning). An override takes effect only for binary classification tasks (e.g., changing the default ‘TE’ to ‘OHE’).
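
The selection rule described above can be written down as a small decision function. This is a sketch of the documented rule only; resolve_cat_encoding and the task labels are hypothetical names chosen for illustration.

```python
def resolve_cat_encoding(task, cat_encoding_method=None):
    """Hypothetical helper mirroring the documented encoding rule."""
    # Default: Target Encoding for binary classification, One Hot Encoding otherwise.
    default = 'TE' if task == 'binary_classification' else 'OHE'
    if cat_encoding_method is None:
        return default
    if cat_encoding_method not in ('OHE', 'TE'):
        raise ValueError("cat_encoding_method must be 'OHE', 'TE' or None")
    # An override takes effect only for binary classification tasks.
    if task == 'binary_classification':
        return cat_encoding_method
    return default
```

For example, resolve_cat_encoding('binary_classification', 'OHE') returns 'OHE', while resolve_cat_encoding('regression', 'TE') still returns 'OHE'.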
categorical_features List Features that contain only numeric values can be forced to be categorical in two ways: on a feature-by-feature basis (setting the categorical attribute to True in the features list), or using the categorical_features list, passing the feature names or the feature indices in the dataset, starting from 0.

See the example at the end of the page.
ohe_min_frequency float between 0 and 1, default 0.01 Similarly to Scikit-learn's OneHotEncoder min_frequency parameter, specifies the minimum frequency below which a category will be considered infrequent.

Example: 'ohe_min_frequency': 0
ohe_max_categories int, default 100 Similarly to Scikit-learn's OneHotEncoder max_categories parameter, specifies an upper limit to the number of output features for each input feature when considering infrequent categories.

Example: 'ohe_max_categories': 500
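
The interplay of the two one-hot encoding limits can be sketched in plain Python. This is a simplified illustration of the idea, not the library's or Scikit-learn's actual code; group_infrequent and the 'infrequent' label are hypothetical, and tie-breaking between equally frequent categories may differ in practice.

```python
from collections import Counter

def group_infrequent(values, min_frequency=0.01, max_categories=100):
    """Replace infrequent categories with a single 'infrequent' label."""
    counts = Counter(values)
    threshold = min_frequency * len(values)
    # most_common() orders categories from most to least frequent.
    frequent = [cat for cat, count in counts.most_common() if count >= threshold]
    has_infrequent = len(frequent) < len(counts)
    if has_infrequent or len(frequent) > max_categories:
        # One output slot is reserved for the shared 'infrequent' bucket.
        frequent = frequent[:max_categories - 1]
    keep = set(frequent)
    return [v if v in keep else 'infrequent' for v in values]

values = ['a'] * 50 + ['b'] * 30 + ['c'] * 15 + ['d'] * 4 + ['e']
by_frequency = set(group_infrequent(values, min_frequency=0.05))
by_limit = set(group_infrequent(values, min_frequency=0.05, max_categories=3))
```

With these inputs, by_frequency keeps 'a', 'b' and 'c', while by_limit keeps only 'a' and 'b'; everything else is folded into 'infrequent'.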
te_cv int, default 5 If Target Encoding is employed, this parameter determines the number of folds in the 'cross fitting' strategy used in TargetEncoder’s fit_transform. In practice, a lower number may be applied, based on the distribution of classes in the data.
te_random_state int, default None If Target Encoding is employed, this parameter affects the ordering of the indices which controls the randomness of each fold in its 'cross fitting' strategy. Pass an int for reproducible output across multiple function calls.
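
To make the 'cross fitting' idea concrete, the sketch below encodes each row using target means computed on the other folds only. It is a deliberately simplified illustration (no smoothing, plain Python lists) and not the encoder the library uses; cross_fit_target_encode is a hypothetical name, with cv and random_state playing the roles of te_cv and te_random_state.

```python
import random

def cross_fit_target_encode(categories, targets, cv=5, random_state=None):
    """Encode each row with category means learned on the other folds."""
    n = len(categories)
    indices = list(range(n))
    # The seed controls how rows are shuffled into folds.
    random.Random(random_state).shuffle(indices)
    folds = [indices[i::cv] for i in range(cv)]
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in folds:
        held_out = set(fold)
        sums, counts = {}, {}
        for i in range(n):
            if i not in held_out:
                sums[categories[i]] = sums.get(categories[i], 0.0) + targets[i]
                counts[categories[i]] = counts.get(categories[i], 0) + 1
        for i in fold:
            cat = categories[i]
            # Fall back to the global mean for categories unseen in the other folds.
            encoded[i] = sums[cat] / counts[cat] if cat in counts else global_mean
    return encoded

categories = ['a', 'a', 'a', 'b', 'b', 'b']
targets = [1, 1, 0, 0, 0, 1]
encoded = cross_fit_target_encode(categories, targets, cv=3, random_state=0)
```

Passing the same random_state twice yields the same fold assignment and therefore the same encoded values.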
Missing Values Parameters
detect_missing bool, default True By default, missing values are automatically detected and handled by the library. To disable this functionality set 'detect_missing': False.

Note - coresets can only be built when there are no missing values.
drop_rows_below float between 0 and 1, default 0 (nothing is dropped) If the ratio of instances with missing values on any feature is lower than this ratio, those instances would be ignored during the coreset build.

Example: 'drop_rows_below': 0.05
drop_cols_above float between 0 and 1, default 1 (nothing is dropped). If the ratio of instances with missing values for a certain feature is higher than this ratio, this feature would be ignored during the coreset build.

Example: 'drop_cols_above': 0.3
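
The two thresholds can be illustrated with a small plain-Python sketch. It is an assumption-laden illustration rather than the library's logic: missing values are represented as None, the helper names are hypothetical, and the order of application (columns first, then rows) is assumed because this page does not specify it.

```python
def missing_ratio(rows, column):
    """Share of rows in which the column is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)

def columns_to_drop(rows, drop_cols_above=1.0):
    """Columns whose missing ratio is higher than drop_cols_above."""
    return [col for col in rows[0] if missing_ratio(rows, col) > drop_cols_above]

def drop_incomplete_rows(rows, drop_rows_below=0.0):
    """Drop rows with missing values if their share is lower than drop_rows_below."""
    incomplete = sum(any(v is None for v in row.values()) for row in rows)
    if incomplete / len(rows) < drop_rows_below:
        return [row for row in rows if all(v is not None for v in row.values())]
    return rows

# 25 rows: 'bonus' is missing in 40% of them, 'age' in 4%.
rows = [
    {'age': None if i == 24 else 30 + i, 'bonus': None if i < 10 else 100}
    for i in range(25)
]
dropped_cols = columns_to_drop(rows, drop_cols_above=0.3)
# Assumption: the row rule is evaluated on the columns that remain.
kept = [{k: v for k, v in row.items() if k not in dropped_cols} for row in rows]
cleaned = drop_incomplete_rows(kept, drop_rows_below=0.05)
```

Here 'bonus' exceeds the 0.3 column threshold and is dropped, after which the single row with a missing 'age' (4% of the data) falls below the 0.05 row threshold and is dropped as well.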
fill_value_num float By default, missing values for numeric features are replaced with the calculated mean. To change the default behavior, either define a specific replacement number for all numeric features using fill_value_num, or use the fill_value attribute in the features list to define a replacement on a feature-by-feature basis.

Example: 'fill_value_num':-1
fill_value_cat Any By default, missing values for categorical features would be treated just as another category/value when the feature is one-hot encoded by the library.

To change the default behavior, either use fill_value_cat to define a replacement for all categorical features (a specific value, or take_most_common, which fills the missing values with the most commonly used value of the feature), or use the fill_value attribute in the features list to define a replacement on a feature-by-feature basis.

Example: 'fill_value_cat': 'take_most_common'
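
The fill rules for numeric and categorical features can be sketched as below. This is a minimal illustration under stated assumptions: missing values are represented as None, and fill_numeric and fill_categorical are hypothetical helpers, not library functions.

```python
from collections import Counter
from statistics import mean

def fill_numeric(values, fill_value_num=None):
    """Fill missing numbers with fill_value_num, or with the mean by default."""
    present = [v for v in values if v is not None]
    replacement = mean(present) if fill_value_num is None else fill_value_num
    return [replacement if v is None else v for v in values]

def fill_categorical(values, fill_value_cat):
    """Fill missing categories with a fixed value or the most common one."""
    if fill_value_cat == 'take_most_common':
        present = [v for v in values if v is not None]
        replacement = Counter(present).most_common(1)[0][0]
    else:
        replacement = fill_value_cat
    return [replacement if v is None else v for v in values]

ages = fill_numeric([20, None, 40])                      # mean of 20 and 40
bonuses = fill_numeric([100, None], fill_value_num=-1)   # explicit replacement
statuses = fill_categorical(['single', None, 'single', 'married'], 'take_most_common')
```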
Data Transformation Parameters
data_transform_before Transform A preprocessing function applied to the entire dataset. The function's signature is func(dataset) -> dataset.

See the example at the end of the page.
feature_transform_default Transform A default feature transformation function applied on a feature-by-feature basis. Executed after data_transform_before. The function can be overridden at the feature level by defining the transform attribute in the features list. The function's signature is func(dataset, transform_context: dict) -> data. The function returns the data of a single feature.

Example: 'transform_context': {'feature_name': 'salary'}
data_transform_after Transform A preprocessing function similar to data_transform_before, executed after the feature-by-feature transformation.

See the example at the end of the page.
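
The order in which the three kinds of transformation run can be sketched with plain Python lists of dictionaries. This is an illustration of the ordering only, not the library's implementation: run_transforms and the sample functions are hypothetical, and the way a feature's result is written back into the dataset is an assumption.

```python
def run_transforms(dataset, features, before=None, default=None, after=None):
    """Apply transforms in the documented order: before, per feature, after."""
    if before is not None:
        dataset = before(dataset)
    for feature in features:
        transform = feature.get('transform')
        # A feature-level transform overrides the default one.
        func = transform['func'] if transform else default
        if func is None:
            continue
        column = func(dataset, {'feature_name': feature['name']})
        # Assumption: the returned values become the feature's column.
        dataset = [{**row, feature['name']: v} for row, v in zip(dataset, column)]
    if after is not None:
        dataset = after(dataset)
    return dataset

def drop_head_office(dataset):
    return [row for row in dataset if row['department'] != 'Head Office']

def to_int(dataset, transform_context):
    name = transform_context['feature_name']
    return [int(row[name]) for row in dataset]

def yearly_bonus(dataset, transform_context):
    return [row['h1_bonus'] + row['h2_bonus'] for row in dataset]

def scale_bonus(dataset):
    top = max(row['yearly_bonus'] for row in dataset)
    return [{**row, 'yearly_bonus': row['yearly_bonus'] / top} for row in dataset]

dataset = [
    {'department': 'Sales', 'h1_bonus': 100, 'h2_bonus': 50, 'age': '31'},
    {'department': 'Head Office', 'h1_bonus': 900, 'h2_bonus': 900, 'age': '58'},
    {'department': 'R&D', 'h1_bonus': 200, 'h2_bonus': 100, 'age': '45'},
]
features = [
    {'name': 'age'},
    {'name': 'yearly_bonus', 'transform': {'func': yearly_bonus}},
]
result = run_transforms(dataset, features, before=drop_head_office,
                        default=to_int, after=scale_bonus)
```

The Head Office row is removed first, so it influences neither the per-feature results nor the scaling applied at the end.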

Code example:

import numpy as np
import pandas as pd

def data_before_processing(dataset):
    df = pd.DataFrame(dataset)
    # remove dataset rows by condition
    df = df[df['department'] != 'Head Office']
    return df

def education_level_transform(dataset):
    # replace categorical values with numeric
    df = pd.DataFrame(dataset)
    conditions = [
        df['education_level'] == 'elementary_school',
        df['education_level'] == 'high_school',
        df['education_level'] == 'diploma',
        df['education_level'] == 'associates',
        df['education_level'] == 'bachelors',
        df['education_level'] == 'masters',
        df['education_level'] == 'doctorate',
        ]
    choices = [1, 2, 3, 4, 5, 6, 7]
    df['education_level'] = np.select(conditions, choices)
    return df['education_level']

def yearly_bonus_transform(dataset):
    df = pd.DataFrame(dataset)
    # create a new feature by summing the two half-year bonuses
    return df['h1_bonus'] + df['h2_bonus']

def transform_scaling(dataset):
    from sklearn.preprocessing import MinMaxScaler
    df = pd.DataFrame(dataset)
    scaler = MinMaxScaler()
    columns_to_scale = ['age', 'work_experience_in_months']
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df

data_params = {
    'features': [
        {
            'name': 'family_status',
            'categorical': True, 'fill_value': 'single'
        },
        {'name': 'department', 'categorical': True},
        {'name': 'gender'},
        {'name': 'job_title'},
        {'name': 'age', 'fill_value': 18},
        {'name': 'work_experience_in_months'},
        {
            'name': 'education_level',
            'transform': {'func': education_level_transform}
        },
        {
            'name': 'yearly_bonus',
            'transform': {'func': yearly_bonus_transform}
        },
    ],
    'properties': [{'name': 'full_name'}, {'name': 'Hire Date'}],
    'target': {'name': 'salary'},
    'categorical_features': ['gender'],
    'fill_value_cat': 'take_most_common',
    'fill_value_num': 0,
    'data_transform_before': {'func': data_before_processing},
    'data_transform_after': {'func': transform_scaling},
    'seq_column': {
        'name': 'Hire Date',
        'granularity': 'Y',
        'datetime_format': '%d/%m/%Y',
        'chunk_by': True,
    },
}