gators.binning.TreeBinning

class gators.binning.TreeBinning(tree: Union[sklearn.tree._classes.DecisionTreeClassifier, sklearn.tree._classes.DecisionTreeRegressor], inplace=False)

Bin the columns using decision tree based splits.

The binning can be done inplace or by adding the binned columns to the existing data.

Parameters

tree (DecisionTreeClassifier or DecisionTreeRegressor): Decision tree model used to create the bin intervals.
inplace (bool, default False): If False, return the dataframe with new binned columns named “column_name__bin” added. Otherwise, return the dataframe with the existing columns replaced by their binned values.

See also

gators.binning.CustomBinning: Bin using user input splits.
gators.binning.Binning: Bin using equal splits.
gators.binning.QuantileBinning: Bin using the variable quantiles.

Examples

>>> from gators.binning import TreeBinning
>>> from sklearn.tree import DecisionTreeClassifier

The binning can be done inplace by modifying the existing columns

>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=True)

or by adding new binned columns

>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=False)

The fit, transform, and fit_transform methods accept:

dask dataframes:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]}), npartitions=1)
>>> y = dd.from_pandas(pd.Series([0, 1, 0, 1], name="TARGET"), npartitions=1)

koalas dataframes:

>>> import databricks.koalas as ks
>>> X = ks.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> y = ks.Series([0, 1, 0, 1], name="TARGET")

and pandas dataframes:

>>> import pandas as pd
>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> y = pd.Series([0, 1, 0, 1], name="TARGET")

The result is a transformed dataframe belonging to the same dataframe library.

with inplace=True

>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=True)
>>> obj.fit_transform(X, y)
               A              B              C
0   [-2.06, 1.4)  (-inf, -0.25)  (-inf, -1.05)
1  (-inf, -2.06)   [-0.25, inf)    [0.07, inf)
2   [-2.06, 1.4)  (-inf, -0.25)    [0.07, inf)
3     [1.4, inf)   [-0.25, inf)  [-1.05, 0.07)

with inplace=False

>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=False)
>>> obj.fit_transform(X, y)
      A     B     C         A__bin         B__bin         C__bin
0  1.07 -1.19 -1.15   [-2.06, 1.4)  (-inf, -0.25)  (-inf, -1.05)
1 -2.59 -0.22  1.92  (-inf, -2.06)   [-0.25, inf)    [0.07, inf)
2 -1.54 -0.28  1.09   [-2.06, 1.4)  (-inf, -0.25)    [0.07, inf)
3  1.72  1.28 -0.95     [1.4, inf)   [-0.25, inf)  [-1.05, 0.07)
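The interval edges in these outputs (for example -2.06 and 1.4 for column A) are consistent with fitting one tree per column and reading the split thresholds of its internal nodes. The following is a minimal sketch under that assumption, using only scikit-learn and pandas. The helper name `tree_split_points` is hypothetical and is not part of gators.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


def tree_split_points(tree, X, y):
    """Hypothetical helper: fit one copy of `tree` per column and collect its thresholds."""
    splits = {}
    for col in X.columns:
        # clone() gives an unfitted copy, so the columns do not share state.
        fitted = clone(tree).fit(X[[col]], y)
        # Internal nodes have feature >= 0; leaves are marked with a negative value.
        is_split = fitted.tree_.feature >= 0
        splits[col] = np.sort(fitted.tree_.threshold[is_split])
    return splits


X = pd.DataFrame({
    "A": [1.07, -2.59, -1.54, 1.72],
    "B": [-1.19, -0.22, -0.28, 1.28],
    "C": [-1.15, 1.92, 1.09, -0.95],
})
y = pd.Series([0, 1, 0, 1], name="TARGET")

splits = tree_split_points(DecisionTreeClassifier(max_depth=2, random_state=0), X, y)
print(splits["A"])  # two thresholds, close to -2.065 and 1.395
```

The number of thresholds per column is bounded by the tree settings, so `max_depth` indirectly controls how many bins each column can have.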

Independently of the dataframe library used to fit the transformer, the transform_numpy method only accepts NumPy arrays and returns a transformed NumPy array. Note that transform_numpy should only be used when the number of rows is small, e.g. in a real-time environment.

>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> obj.transform_numpy(X.to_numpy())
array([[1.07, -1.19, -1.15, '[-2.06, 1.4)', '(-inf, -0.25)',
        '(-inf, -1.05)'],
       [-2.59, -0.22, 1.92, '(-inf, -2.06)', '[-0.25, inf)',
        '[0.07, inf)'],
       [-1.54, -0.28, 1.09, '[-2.06, 1.4)', '(-inf, -0.25)',
        '[0.07, inf)'],
       [1.72, 1.28, -0.95, '[1.4, inf)', '[-0.25, inf)', '[-1.05, 0.07)']],
      dtype=object)
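Assigning values to left-closed, right-open intervals in plain NumPy can be sketched with `np.digitize`. This is an illustration only, not the gators implementation. The split values and labels are copied from column A of the output above, and the variable names are hypothetical.

```python
import numpy as np

# Split points and labels for column A, taken from the example output (assumed inputs).
splits = np.array([-2.06, 1.4])
labels = np.array(["(-inf, -2.06)", "[-2.06, 1.4)", "[1.4, inf)"])

a = np.array([1.07, -2.59, -1.54, 1.72])

# With right=False (the default), digitize returns i such that splits[i-1] <= x < splits[i].
bin_index = np.digitize(a, splits)
binned = labels[bin_index]
print(binned.tolist())
# ['[-2.06, 1.4)', '(-inf, -2.06)', '[-2.06, 1.4)', '[1.4, inf)']
```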

fit(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y) → gators.binning.tree_binning.TreeBinning

Fit the transformer on the dataframe X.

Parameters

X (DataFrame): Input dataframe.
y (Series): Target values.

Returns

TreeBinning: Instance of itself.

static check_array(X: numpy.ndarray)

Validate array.

Parameters

X (np.ndarray): Array.

check_array_is_numerics(X: numpy.ndarray)

Check if array is only numerics.

Parameters

X (np.ndarray): Array.

static check_binary_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not binary.

Parameters

y (Series): Target values.

static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Validate dataframe.

Parameters

X (DataFrame): Dataframe.

static check_dataframe_contains_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe contains numerics.

Parameters

X (DataFrame): Dataframe.

static check_dataframe_is_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe is only numerics.

Parameters

X (DataFrame): Dataframe.

check_dataframe_with_objects(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe contains object columns.

Parameters

X (DataFrame): Dataframe.

check_datatype(dtype, accepted_dtypes)

Check if the datatype is one of the accepted datatypes.

Parameters

dtype: Datatype to check.
accepted_dtypes: Accepted datatypes.

static check_multiclass_target(y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not discrete.

Parameters

y (Series): Target values.

check_nans(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], columns: List[str])

Raise an error if X contains NaN values.

Parameters

X (DataFrame): Dataframe.
columns (List[str]): List of columns.
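A NaN check of this kind can be sketched for the pandas case as follows. The function name `check_nans_sketch`, the error message, and the choice of `ValueError` are assumptions made for illustration. Dask and koalas dataframes are not handled here.

```python
from typing import List

import numpy as np
import pandas as pd


def check_nans_sketch(X: pd.DataFrame, columns: List[str]) -> None:
    """Hypothetical stand-in: raise if any of the listed columns holds a NaN."""
    if X[columns].isnull().to_numpy().any():
        # The exception type is an assumption, not taken from gators.
        raise ValueError("The columns contain NaN values.")


X = pd.DataFrame({"A": [1.07, np.nan], "B": [-1.19, -0.22]})
check_nans_sketch(X, ["B"])  # passes silently: column B has no NaN

try:
    check_nans_sketch(X, ["A", "B"])
except ValueError as err:
    print(err)
```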

static check_regression_target(y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not continuous.

Parameters

y (Series): Target values.

static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Validate target.

Parameters

X (DataFrame): Dataframe.
y (Series): Target values.

fit_transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]

Fit and transform the dataframe X.

Parameters

X (DataFrame): Input dataframe.
y (Series, default None): Input target.

Returns

X (DataFrame): Transformed dataframe.

static get_column_names(inplace: bool, columns: List[str], suffix: str)

Return the names of the modified columns.

Parameters

inplace (bool): If True, return the column names unchanged. If False, return each column name followed by “__” and the suffix.
columns (List[str]): List of columns.
suffix (str): Suffix used if inplace is False.

Returns

List[str]: List of column names.
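The naming rule described above can be sketched in a few lines. `get_column_names_sketch` is a hypothetical re-implementation written from this description, not the gators source.

```python
from typing import List


def get_column_names_sketch(inplace: bool, columns: List[str], suffix: str) -> List[str]:
    """Hypothetical re-implementation of the documented naming rule."""
    if inplace:
        # In-place binning overwrites the columns, so the names are unchanged.
        return list(columns)
    # Otherwise each new column is named <column>__<suffix>.
    return [f"{col}__{suffix}" for col in columns]


print(get_column_names_sketch(False, ["A", "B", "C"], "bin"))
# ['A__bin', 'B__bin', 'C__bin']
```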

static get_labels(pretty_bins_dict: Dict[str, numpy.array])

Get the labels of the bins.

Parameters

pretty_bins_dict (Dict[str, np.array]): Prettified bins used to generate the labels.

Returns

Dict[str, np.array]: Labels.
np.array: Labels.
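One way to turn bin edges into interval labels is sketched below. It assumes that each array holds sorted edges including -inf and inf, which this page does not confirm. The name `get_labels_sketch` is hypothetical, and only the dictionary return value is illustrated.

```python
from typing import Dict

import numpy as np


def get_labels_sketch(bins_dict: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Hypothetical helper: turn sorted bin edges into interval labels."""
    labels = {}
    for col, edges in bins_dict.items():
        col_labels = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            # An infinite lower edge cannot be included, so that interval opens with "(".
            left = "(" if np.isinf(lo) else "["
            col_labels.append(f"{left}{float(lo)}, {float(hi)})")
        labels[col] = np.array(col_labels)
    return labels


# Assumed input layout: edges for column A, padded with -inf and inf.
bins = {"A": np.array([-np.inf, -2.06, 1.4, np.inf])}
print(get_labels_sketch(bins)["A"].tolist())
# ['(-inf, -2.06)', '[-2.06, 1.4)', '[1.4, inf)']
```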

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict): Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict): Estimator parameters.

Returns

self (estimator instance): Estimator instance.
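The <component>__<parameter> form can be illustrated with a plain scikit-learn Pipeline. The example uses scikit-learn only, so it shows the general estimator convention rather than anything specific to TreeBinning. The step name "tree" is an arbitrary choice.

```python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# A one-step pipeline whose step is named "tree" (the step name is arbitrary).
pipe = Pipeline([("tree", DecisionTreeClassifier(max_depth=2, random_state=0))])

# <component>__<parameter>: update max_depth on the nested "tree" step.
pipe.set_params(tree__max_depth=3)

print(pipe.get_params()["tree__max_depth"])  # 3
print(pipe.named_steps["tree"].max_depth)    # 3
```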

transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]

Transform the dataframe X.

Parameters

X (DataFrame): Input dataframe.

Returns

X (DataFrame): Transformed dataframe.

transform_numpy(X: numpy.ndarray) → numpy.ndarray

Transform the array X.

Parameters

X (np.ndarray): Array.

Returns

X (np.ndarray): Transformed array.
