DataParams
dataclass
Bases: BaseDC
A class holding all the information required to preprocess the data. When it is not defined, all fields/columns in the data are treated as features.
The example below shows how the class is used by the CoresetTreeService class. See a more extensive example at the end of the page.
data_params = {
    'target': {'name': 'Cover_Type'},
    'index': {'name': 'index_column'}
}
service_obj = CoresetTreeServiceLG(
    optimized_for='training',
    data_params=data_params
)
Parameter name | Type | Description |
---|---|---|
General Parameters | ||
features | List | The list of the fields/columns used as features to build
the Coreset and train the model. If not defined, the columns_to_features parameter should be defined.
Each feature is defined as a dictionary of attributes, of which only name is mandatory. The attributes
referenced on this page are name, categorical, fill_value and transform.
|
target | dict | The target column.
Example: 'target': {'name': 'Cover_Type'} |
index | dict | The index column. Example: 'index': {'name': 'index_column'}
|
properties | List | A list of fields/columns that are not used to build the Coreset or train
the model, but that can be used in filter_out_samples or passed to the
select_from_function of get_cleaning_samples.
Example: 'properties': [{'name': 'date'},] |
columns_to_features | Union[bool, dict], default False | Either a bool or a
dict with two possible fields, include or exclude.
When set to True, all fields/columns in the dataset are treated as features.
When include is used, only the fields/columns listed, or those matching the defined masks, are
treated as features. When exclude is used, all fields/columns are treated as features except those
listed or matching the defined masks. Example: {'exclude': ['internal_.*', 'bad']} |
datetime_to_properties | bool, default True | By default, all datetime fields/columns
are turned into properties. Properties are not used to build the Coreset or train
the model, but they can be used in filter_out_samples or passed to
the select_from_function of get_cleaning_samples. To disable this functionality,
set 'datetime_to_properties': False. |
save_orig | bool, default False | When data transformations are defined (such as data_transform_before or a feature-level transform), the default behavior is to save the data only after the transformations. To also save the data in its original format, as it was handed to the build function and before any user-defined data transformations, set `'save_orig': True`. To retrieve the Coreset in its original format, use `preprocessing_stage='original'` when calling the `get_coreset` function. |
seq_column | dict, default None |
Defining a sequence column (such as a date) makes it possible to pass the seq_from and seq_to
parameters to get_coreset, fit,
grid_search and the validation functions, so that these functions are executed on a subset of the
data (such as certain date ranges). The seq_column is a dictionary; the example at the end of the page
sets its name, granularity, datetime_format and chunk_by parameters. |
Categorical Features Parameters | ||
detect_categorical | bool, default True | By default, all non-numeric and non-boolean fields/columns are automatically regarded
as categorical features and one-hot encoded by the library. To disable this
functionality, set 'detect_categorical': False.
Note: coresets can only be built with numeric features. |
categorical_features | List | Features that contain only numeric values can be forced
to be categorical in two ways:
on a feature-by-feature basis (setting the categorical attribute to True in the features
list) or through the categorical_features list, passing the feature names or the feature indexes in
the dataset, starting from 0. See the example at the end of the page. |
ohe_min_frequency | float between 0 and 1, default 0.01 | Similar to Scikit-learn's OneHotEncoder min_frequency parameter, specifies the minimum frequency
below which a category is considered infrequent. Example: 'ohe_min_frequency': 0 |
ohe_max_categories | int, default 100 | Similar to Scikit-learn's OneHotEncoder max_categories parameter, specifies an upper limit on the number
of output features for each input feature when considering infrequent categories.
Example: 'ohe_max_categories': 500 |
Missing Values Parameters | ||
detect_missing | bool, default True | By default, missing values are automatically detected and handled by the library.
To disable this functionality, set 'detect_missing': False.
Note: coresets can only be built when there are no missing values. |
drop_rows_below | float between 0 and 1, default 0 (nothing is dropped) | If the ratio of instances with a missing value in any feature is lower than this ratio,
those instances are ignored during the Coreset build. Example: 'drop_rows_below': 0.05
|
drop_cols_above | float between 0 and 1, default 1 (nothing is dropped) | If the ratio of instances with a missing value for a certain feature is higher than this ratio,
that feature is ignored during the Coreset build. Example: 'drop_cols_above': 0.3
|
fill_value_num | float | By default, missing values of numeric features are
replaced with the calculated mean.
This default can be changed either by defining a single replacement
number for all numeric features with fill_value_num, or by using the fill_value
attribute in the features list to define a replacement on a
feature-by-feature basis. Example: 'fill_value_num': -1 |
fill_value_cat | Any | By default, missing values of categorical features are treated as just another category/value when the
feature is one-hot encoded by the library. This default can be changed for all categorical features with
fill_value_cat, either by defining a specific replacement value or by specifying take_most_common
(which fills the missing values with the most commonly used value of the feature).
Alternatively, use the fill_value attribute in the features list
to define a replacement on a feature-by-feature basis.
Example: 'fill_value_cat': 'take_most_common' |
Data Transformation Parameters | ||
data_transform_before | Transform | A preprocessing function applied to the entire dataset. The function's signature is
func(dataset) -> dataset.
See the example at the end of the page. |
feature_transform_default | Transform | A default feature transformation function applied on a feature-by-feature basis.
Executed after data_transform_before. The function can be overridden at the feature level
by defining the transform attribute in the features list.
The function's signature is func(dataset, transform_context:dict) -> data.
The function returns the data of a single feature.
Example: 'transform_context': {'feature_name': 'salary'} |
data_transform_after | Transform | A preprocessing function similar to data_transform_before, executed after the feature-by-feature transformation. See the example at the end of the page. |
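The missing-value rules in the table above (drop_rows_below, drop_cols_above, the mean fill for numeric features and take_most_common for categorical ones) can be illustrated with a small plain-Python sketch. This is not the library's implementation: the helper names (missing_ratio, keep_features, keep_rows, fill_missing), the use of None to mark a missing value and the order in which the rules are applied are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def missing_ratio(rows, feature):
    """Share of rows in which `feature` is missing (None). Hypothetical helper."""
    return sum(row.get(feature) is None for row in rows) / len(rows)


def keep_features(rows, features, drop_cols_above=1.0):
    """Ignore features whose missing ratio is higher than drop_cols_above."""
    return [f for f in features if missing_ratio(rows, f) <= drop_cols_above]


def keep_rows(rows, features, drop_rows_below=0.0):
    """Ignore incomplete rows when their share is lower than drop_rows_below."""
    complete = [r for r in rows if all(r.get(f) is not None for f in features)]
    incomplete_ratio = 1 - len(complete) / len(rows)
    return complete if incomplete_ratio < drop_rows_below else rows


def fill_missing(rows, feature, fill_value):
    """Replace the missing values of one feature with fill_value."""
    return [{**r, feature: fill_value if r.get(feature) is None else r[feature]}
            for r in rows]


rows = [
    {'age': 30, 'department': 'ops'},
    {'age': None, 'department': 'ops'},
    {'age': 50, 'department': None},
    {'age': 40, 'department': 'hr'},
]
features = ['age', 'department']

# Each feature is missing in 25% of the rows, so a 0.3 threshold keeps both.
kept_features = keep_features(rows, features, drop_cols_above=0.3)

# Half of the rows are incomplete, which is not below 0.05, so nothing is dropped.
kept_rows = keep_rows(rows, features, drop_rows_below=0.05)

# Default numeric behavior described in the table: fill with the calculated mean.
ages = [r['age'] for r in rows if r['age'] is not None]
filled = fill_missing(rows, 'age', mean(ages))

# 'take_most_common' behavior: fill with the most frequent category.
departments = [r['department'] for r in rows if r['department'] is not None]
filled = fill_missing(filled, 'department', Counter(departments).most_common(1)[0][0])
```

In this toy dataset both thresholds leave the data untouched, so only the fill rules change anything. Lowering drop_cols_above to 0.2 or raising drop_rows_below to 0.6 would trigger the drops.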
Code example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def data_before_processing(dataset):
    df = pd.DataFrame(dataset)
    # remove dataset rows by condition
    df = df[df['department'] != 'Head Office']
    return df


def education_level_transform(dataset):
    # replace categorical values with numeric
    df = pd.DataFrame(dataset)
    # each condition is a boolean mask, so the comparison must use ==
    conditions = [
        df['education_level'] == 'elementary_school',
        df['education_level'] == 'high_school',
        df['education_level'] == 'diploma',
        df['education_level'] == 'associates',
        df['education_level'] == 'bachelors',
        df['education_level'] == 'masters',
        df['education_level'] == 'doctorate',
    ]
    choices = [1, 2, 3, 4, 5, 6, 7]
    # rows matching none of the conditions get np.select's default of 0
    df['education_level'] = np.select(conditions, choices)
    return df['education_level']


def yearly_bonus_transform(dataset):
    df = pd.DataFrame(dataset)
    # for creating new feature
    return df['h1_bonus'] + df['h2_bonus']


def transform_scaling(dataset):
    df = pd.DataFrame(dataset)
    scaler = MinMaxScaler()
    columns_to_scale = ['age', 'work_experience_in_months']
    # scale the selected numeric columns to the [0, 1] range
    df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])
    return df


data_params = {
    'features': [
        {
            'name': 'family_status',
            'categorical': True, 'fill_value': 'single'
        },
        {'name': 'department', 'categorical': True},
        {'name': 'gender'},
        {'name': 'job_title'},
        {'name': 'age', 'fill_value': 18},
        {'name': 'work_experience_in_months'},
        {
            'name': 'education_level',
            'transform': {'func': education_level_transform}
        },
        {
            'name': 'yearly_bonus',
            'transform': {'func': yearly_bonus_transform}
        },
    ],
    'properties': [{'name': 'full_name'}, {'name': 'Hire Date'}],
    'target': {'name': 'salary'},
    'categorical_features': ['gender'],
    'fill_value_cat': 'take_most_common',
    'fill_value_num': 0,
    'data_transform_before': {'func': data_before_processing},
    'data_transform_after': {'func': transform_scaling},
    'seq_column': {
        'name': 'Hire Date',
        'granularity': 'Y',
        'datetime_format': '%d/%m/%Y',
        'chunk_by': True,
    },
}
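
The example above lists its features explicitly. As a complement, the following sketch illustrates how the include and exclude masks of columns_to_features could select columns. It is only an illustration: select_features is a hypothetical helper, not part of the library, and it assumes that each mask is a regular expression that must match the whole column name, which the parameter description does not state.

```python
import re


def select_features(columns, columns_to_features):
    """Return the columns treated as features (hypothetical helper).

    Assumes each mask is a regular expression matched against the full name.
    """
    if columns_to_features is True:
        # All fields/columns are treated as features.
        return list(columns)
    if not columns_to_features:
        # False: nothing is selected here; features come from the features list.
        return []

    def matches(column, masks):
        return any(re.fullmatch(mask, column) for mask in masks)

    if 'include' in columns_to_features:
        return [c for c in columns if matches(c, columns_to_features['include'])]
    return [c for c in columns
            if not matches(c, columns_to_features.get('exclude', []))]


columns = ['age', 'internal_id', 'bad', 'badge', 'salary']

excluded = select_features(columns, {'exclude': ['internal_.*', 'bad']})
included = select_features(columns, {'include': ['internal_.*']})
```

Under the full-match assumption, the 'bad' mask removes only the column named bad and leaves badge in place. A prefix or substring match would behave differently, so the actual matching rule should be checked against the library before relying on it.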