DataParams
dataclass
```python
DataParams(features=None, properties=None, target=None, sample_weight=None, index=None, feature_transform_default=None, transform_context=None, data_transform_before=None, data_transform_after=None, columns_to_features=False, n_instances=None, n_classes=None, is_classification=False, cat_encoding_method=None, ohe_max_categories=100, ohe_min_frequency=0.01, te_cv=5, te_random_state=None, favor_ohe_num_cats_thresh=50, favor_ohe_vol_pct_thresh=0.8, detect_categorical=True, detect_missing=True, categorical_features=None, drop_rows_below=0, drop_cols_above=1, seq_column=None, fill_value_cat=None, fill_value_num=None, datetime_to_properties=True, save_orig=False)
```
Bases: BaseDC
A class containing all the information required to preprocess the data. When DataParams is not defined, all fields/columns in the data are treated as features.
The example below shows how the class is used by the CoresetTreeService class. See a more extensive example at the end of the page.
```python
data_params = {
    'target': {'name': 'Cover_Type'},
    'index': {'name': 'index_column'}
}

service_obj = CoresetTreeServiceLG(
    optimized_for='training',
    data_params=data_params
)
```
| Parameter name | Type | Description |
|---|---|---|
| **General Parameters** | | |
| `features` | List | The list of the fields/columns used as features to build the Coreset and train the model. If not defined, the `columns_to_features` parameter should be defined. Each feature is defined as a dictionary in which only the `name` attribute is mandatory; see the example at the end of the page for additional attributes, such as `categorical`, `fill_value` and `transform`. |
| `target` | dict | The target column. Example: `'target': {'name': 'Cover_Type'}` |
| `sample_weight` | dict | The sample weight column. Example: `'sample_weight': {'name': 'Weights'}` |
| `index` | dict | The index column. Example: `'index': {'name': 'index_column'}` |
| `properties` | List | A list of fields/columns which won't be used to build the Coreset or train the model, but it is possible to `filter_out_samples` on them or to pass them in the `select_from_function` of `get_cleaning_samples`. Example: `'properties': [{'name': 'date'}]` |
| `columns_to_features` | Union[bool, dict], default False | Either a bool or a dict with two possible fields, `include` or `exclude`. When set to `True`, all fields/columns in the dataset are treated as features. When `include` is used, only the fields/columns defined, or those fitting the defined masks, are treated as features. When `exclude` is used, only the fields/columns defined, or those fitting the defined masks, are not treated as features (see the sketch after this table). Example: `{'exclude': ['internal_.*', 'bad']}` |
| `datetime_to_properties` | bool, default True | By default, all datetime fields/columns are turned into properties. Properties won't be used to build the Coreset or train the model, but it is possible to `filter_out_samples` on them or to pass them in the `select_from_function` of `get_cleaning_samples`. To disable this functionality set `'datetime_to_properties': False`. |
| `save_orig` | bool, default False | When data transformations are defined (such as `data_transform_before` or a feature-level `transform`), the default behavior is to save the data only after the transformations. To also save the data in its original format, as it was handed to the `build` function and before any user-defined data transformations, set `'save_orig': True`. To retrieve the Coreset in its original format, use `preprocessing_stage='original'` when calling the `get_coreset` function. |
| `seq_column` | dict, default None | Defining a sequence column (such as a date) allows specifying the `seq_from` and `seq_to` parameters to `get_coreset`, `fit`, `grid_search` and the validation functions, so these functions would be executed on a subset of the data (such as certain date ranges). The `seq_column` is a dictionary; see the example at the end of the page for its parameters (`name`, `granularity`, `datetime_format`, `chunk_by`) and the usage sketch that follows it. |
| **Categorical Features Parameters** | | |
| `detect_categorical` | bool, default True | By default, all non-numeric and non-boolean fields/columns are automatically regarded as categorical features and one-hot- and/or target-encoded by the library. To disable this functionality set `'detect_categorical': False`. Note - Coresets can only be built with numeric features. |
| `cat_encoding_method` | str, default None | Use this parameter to override the default categorical encoding strategy (valid non-default values are `'OHE'`, `'TE'`, `'MIXED'`). If this parameter is left on default, the strategy for encoding categorical features is determined as follows: a mixed categorical encoding strategy, combining both Target Encoding (TE) and One Hot Encoding (OHE), will be used in binary classification tasks; One Hot Encoding (OHE) will be used in all other types of learning tasks (multiclass classification, regression and unsupervised learning). Overriding is effective only for binary classification tasks (e.g., changing the default `'MIXED'` to `'OHE'` or to `'TE'`). For more details on the mixed categorical encoding strategy, see the `favor_ohe_num_cats_thresh` documentation below. |
| `categorical_features` | List | Forcing specific features, which include only numeric values, to be categorical can be done in two ways: on a feature-by-feature basis (setting the `categorical` attribute to `True` in the `features` list), or using the `categorical_features` list, passing the feature names or the feature indices in the dataset, starting from 0. See the example at the end of the page. |
| `ohe_min_frequency` | float between 0 and 1, default 0.01 | Similarly to Scikit-learn's OneHotEncoder `min_frequency` parameter, specifies the minimum frequency below which a category will be considered infrequent. Example: `'ohe_min_frequency': 0` |
| `ohe_max_categories` | int, default 100 | Similarly to Scikit-learn's OneHotEncoder `max_categories` parameter, specifies an upper limit to the number of output features for each input feature when considering infrequent categories. Example: `'ohe_max_categories': 500` |
| `te_cv` | int, default 5 | If Target Encoding is employed, this parameter determines the number of folds in the 'cross fitting' strategy used in TargetEncoder's `fit_transform` (see the sketch after this table). In practice, a lower number may be applied, based on the distribution of classes in the data. |
| `te_random_state` | int, default None | If Target Encoding is employed, this parameter affects the ordering of the indices, which controls the randomness of each fold in its 'cross fitting' strategy. Pass an int for reproducible output across multiple function calls. |
| `favor_ohe_num_cats_thresh` | int, default 50 | Works in conjunction with `favor_ohe_vol_pct_thresh`. In a mixed categorical encoding strategy, both One Hot Encoding (OHE) and Target Encoding (TE) are employed at the same time, and the categorical attributes are divided into two distinct groups, one per encoding type. For the division, if the number of categories of a categorical feature is lower than `favor_ohe_num_cats_thresh`, or if it is higher but the feature's top `favor_ohe_num_cats_thresh` categories (or fewer) cover at least `favor_ohe_vol_pct_thresh` percent of the data instances, the OHE strategy will be favored over TE (see the sketch after this table). Using the default values as an example, `favor_ohe_num_cats_thresh=50` and `favor_ohe_vol_pct_thresh=0.8` mean that if a categorical feature's top 50 (or fewer) categories capture 80% (or more) of the volume, the feature will be encoded using OHE; otherwise, it will be encoded using TE. |
| `favor_ohe_vol_pct_thresh` | float between 0 and 1, default 0.8 | Works in conjunction with `favor_ohe_num_cats_thresh`; see its description. |
| **Missing Values Parameters** | | |
| `detect_missing` | bool, default True | By default, missing values are automatically detected and handled by the library. To disable this functionality set `'detect_missing': False`. Note - Coresets can only be built when there are no missing values. |
| `drop_rows_below` | float between 0 and 1, default 0 (nothing is dropped) | If the ratio of instances with missing values on any feature is lower than this ratio, those instances would be ignored during the Coreset build. Example: `'drop_rows_below': 0.05` |
| `drop_cols_above` | float between 0 and 1, default 1 (nothing is dropped) | If the ratio of instances with missing values for a certain feature is higher than this ratio, this feature would be ignored during the Coreset build. Example: `'drop_cols_above': 0.3` |
| `fill_value_num` | float | By default, missing values for numeric features are replaced with the calculated mean. It is possible to change this default behavior by defining a specific replacement number for all numeric features using `fill_value_num`, or by using the `fill_value` attribute in the `features` list to define a replacement on a feature-by-feature basis. Example: `'fill_value_num': -1` |
| `fill_value_cat` | Any | By default, missing values for categorical features are treated just as another category/value when the feature is one-hot encoded by the library. It is possible to change this default behavior by defining a specific replacement value, or `take_most_common` (which fills the missing values with the most commonly used value of the feature), for all categorical features using `fill_value_cat`, or by using the `fill_value` attribute in the `features` list to define a replacement on a feature-by-feature basis. Example: `'fill_value_cat': 'take_most_common'` |
| **Data Transformation Parameters** | | |
| `data_transform_before` | Transform | A preprocessing function applied to the entire dataset. The function's signature is `func(dataset) -> dataset`. See the example at the end of the page. |
| `feature_transform_default` | Transform | A default feature transformation function applied on a feature-by-feature basis, executed after `data_transform_before`. The function can be overridden at the feature level by defining the `transform` attribute in the `features` list. The function's signature is `func(dataset, transform_context: dict) -> data`, and it returns the data of a single feature. Example: `'transform_context': {'feature_name': 'salary'}` |
| `data_transform_after` | Transform | A preprocessing function similar to `data_transform_before`, executed after the feature-by-feature transformation. See the example at the end of the page. |
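To make the `include`/`exclude` mask semantics of `columns_to_features` concrete, here is a minimal sketch that reproduces the documented selection logic with plain regular expressions. The helper function and column names are hypothetical and for illustration only; the library performs this selection internally:

```python
import re

def select_features(columns, columns_to_features):
    # Hypothetical re-implementation of the documented mask semantics.
    if columns_to_features is True:
        return list(columns)  # every column is treated as a feature
    if 'include' in columns_to_features:
        masks = columns_to_features['include']
        # keep only columns that are named or fit one of the masks
        return [c for c in columns if any(re.fullmatch(m, c) for m in masks)]
    if 'exclude' in columns_to_features:
        masks = columns_to_features['exclude']
        # keep only columns that are neither named nor fit any mask
        return [c for c in columns if not any(re.fullmatch(m, c) for m in masks)]
    return list(columns)

columns = ['age', 'salary', 'internal_id', 'internal_flag', 'bad']
print(select_features(columns, {'exclude': ['internal_.*', 'bad']}))
# ['age', 'salary']
```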
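The division rule controlled by `favor_ohe_num_cats_thresh` and `favor_ohe_vol_pct_thresh` can be expressed in a few lines of pandas. This is a sketch of the documented rule only (including assumed boundary behavior), not the library's internal implementation:

```python
import pandas as pd

def favors_ohe(col: pd.Series, num_cats_thresh: int = 50,
               vol_pct_thresh: float = 0.8) -> bool:
    # Share of instances covered by the most frequent `num_cats_thresh` categories.
    top_share = col.value_counts(normalize=True).head(num_cats_thresh).sum()
    # OHE is favored when the feature has few categories, or when its top
    # categories cover enough of the volume; otherwise TE is used.
    return col.nunique() <= num_cats_thresh or top_share >= vol_pct_thresh

# 61 categories, but 'NY' plus the next 49 towns cover ~92% of the rows -> OHE
city = pd.Series(['NY'] * 85 + [f'town_{i}' for i in range(60)])
print(favors_ohe(city))  # True
```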
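`te_cv` and `te_random_state` correspond to the `cv` and `random_state` parameters of Scikit-learn's TargetEncoder, whose `fit_transform` the documentation references. A minimal standalone illustration of that 'cross fitting' encoder, independent of the library:

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder  # requires scikit-learn >= 1.3

X = np.array([['a'], ['b'], ['a'], ['b'], ['a'], ['b']])
y = np.array([1, 0, 1, 0, 0, 1])

# cv and random_state play the roles of te_cv and te_random_state.
enc = TargetEncoder(cv=3, random_state=42)
print(enc.fit_transform(X, y))  # per-fold target means, one column per feature
```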
Code example:
```python
import numpy as np
import pandas as pd


def data_before_processing(dataset):
    df = pd.DataFrame(dataset)
    # remove dataset rows by condition
    df = df[df['department'] != 'Head Office']
    return df


def education_level_transform(dataset):
    # replace categorical values with numeric
    df = pd.DataFrame(dataset)
    conditions = [
        df['education_level'] == 'elementary_school',
        df['education_level'] == 'high_school',
        df['education_level'] == 'diploma',
        df['education_level'] == 'associates',
        df['education_level'] == 'bachelors',
        df['education_level'] == 'masters',
        df['education_level'] == 'doctorate',
    ]
    choices = [1, 2, 3, 4, 5, 6, 7]
    df['education_level'] = np.select(conditions, choices)
    return df['education_level']


def yearly_bonus_transform(dataset):
    df = pd.DataFrame(dataset)
    # create a new feature by combining two existing columns
    return df['h1_bonus'] + df['h2_bonus']


def transform_scaling(dataset):
    from sklearn.preprocessing import MinMaxScaler
    df = pd.DataFrame(dataset)
    scaler = MinMaxScaler()
    columns_to_scale = ['age', 'work_experience_in_months']
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df


data_params = {
    'features': [
        {
            'name': 'family_status',
            'categorical': True, 'fill_value': 'single'
        },
        {'name': 'department', 'categorical': True},
        {'name': 'gender'},
        {'name': 'job_title'},
        {'name': 'age', 'fill_value': 18},
        {'name': 'work_experience_in_months'},
        {
            'name': 'education_level',
            'transform': {'func': education_level_transform}
        },
        {
            'name': 'yearly_bonus',
            'transform': {'func': yearly_bonus_transform}
        },
    ],
    'properties': [{'name': 'full_name'}, {'name': 'Hire Date'}],
    'target': {'name': 'salary'},
    'categorical_features': ['gender'],
    'fill_value_cat': 'take_most_common',
    'fill_value_num': 0,
    'data_transform_before': {'func': data_before_processing},
    'data_transform_after': {'func': transform_scaling},
    'seq_column': {
        'name': 'Hire Date',
        'granularity': 'Y',
        'datetime_format': '%d/%m/%Y',
        'chunk_by': True,
    },
}
```
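For completeness, a hypothetical end-to-end usage of the `data_params` above. The `df` DataFrame, the `build_from_df` call and the seq values are assumptions for illustration; `seq_from`/`seq_to` are the documented way to restrict `get_coreset` to a range of the `seq_column` (here, hire years):

```python
service_obj = CoresetTreeServiceLG(
    optimized_for='training',
    data_params=data_params,
)
# Assumed build entry point taking a pandas DataFrame `df`; check the
# library's build functions for the exact name and signature.
service_obj.build_from_df(df)

# Restrict the call to a date range of the 'Hire Date' sequence column;
# the value format follows the 'Y' granularity defined above.
coreset = service_obj.get_coreset(seq_from='2015', seq_to='2020')
```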