gators.binning.TreeBinning

class gators.binning.TreeBinning(tree: Union[sklearn.tree._classes.DecisionTreeClassifier, sklearn.tree._classes.DecisionTreeRegressor], inplace=False)[source]

Bin the columns using decision-tree-based splits.

The binning can be done inplace or by adding the binned columns to the existing data.

Parameters
tree : DecisionTreeClassifier or DecisionTreeRegressor

Decision tree model used to create the bin intervals.

inplace : bool, default False

If False, return the dataframe with the new binned columns appended, named “column_name__bin”. Otherwise, return the dataframe with the existing columns replaced by their binned versions.
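
As an illustration of where tree-based bin edges can come from, here is a minimal sketch that uses scikit-learn directly rather than TreeBinning. It fits one shallow tree on a single column and reads the split thresholds from the fitted tree. The helper name tree_split_points is hypothetical, and fitting one tree per column is an assumption about the approach, not a description of the gators internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tree_split_points(values, target, max_depth=2):
    """Return the sorted split thresholds of a tree fitted on one column.

    Hypothetical helper for illustration only.
    """
    # scikit-learn expects a 2D feature matrix, here with a single column.
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(x, target)
    # Leaf nodes carry a negative feature index; internal nodes hold the splits.
    is_split = tree.tree_.feature >= 0
    return np.unique(tree.tree_.threshold[is_split])


# Column A and the target used in the examples of this page.
splits = tree_split_points([1.07, -2.59, -1.54, 1.72], [0, 1, 0, 1])
```

For this column the two thresholds should be close to -2.065 and 1.395, which is consistent with the intervals (-inf, -2.06), [-2.06, 1.4) and [1.4, inf) displayed in the examples on this page.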

See also

gators.binning.CustomBinning

Bin using user input splits.

gators.binning.Binning

Bin using equal splits.

gators.binning.QuantileBinning

Bin using the variable quantiles.

Examples

>>> from gators.binning import TreeBinning
>>> from sklearn.tree import DecisionTreeClassifier

The binning can be done inplace by modifying the existing columns

>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=True)

or by adding new binned columns

>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=False)

The fit, transform, and fit_transform methods accept:

  • dask dataframes:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]}), npartitions=1)
>>> y = dd.from_pandas(pd.Series([0, 1, 0, 1], name="TARGET"), npartitions=1)
  • koalas dataframes:

>>> import databricks.koalas as ks
>>> X = ks.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> y = ks.Series([0, 1, 0, 1], name="TARGET")
  • and pandas dataframes:

>>> import pandas as pd
>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> y = pd.Series([0, 1, 0, 1], name="TARGET")

The result is a transformed dataframe belonging to the same dataframe library.

  • with inplace=True

>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=True)
>>> obj.fit_transform(X, y)
               A              B              C
0   [-2.06, 1.4)  (-inf, -0.25)  (-inf, -1.05)
1  (-inf, -2.06)   [-0.25, inf)    [0.07, inf)
2   [-2.06, 1.4)  (-inf, -0.25)    [0.07, inf)
3     [1.4, inf)   [-0.25, inf)  [-1.05, 0.07)
  • with inplace=False

>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=False)
>>> obj.fit_transform(X, y)
      A     B     C         A__bin         B__bin         C__bin
0  1.07 -1.19 -1.15   [-2.06, 1.4)  (-inf, -0.25)  (-inf, -1.05)
1 -2.59 -0.22  1.92  (-inf, -2.06)   [-0.25, inf)    [0.07, inf)
2 -1.54 -0.28  1.09   [-2.06, 1.4)  (-inf, -0.25)    [0.07, inf)
3  1.72  1.28 -0.95     [1.4, inf)   [-0.25, inf)  [-1.05, 0.07)

Independently of the dataframe library used to fit the transformer, the transform_numpy method only accepts NumPy arrays and returns a transformed NumPy array. Note that this method should only be used when the number of rows is small, e.g. in a real-time environment.

>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> obj.transform_numpy(X.to_numpy())
array([[1.07, -1.19, -1.15, '[-2.06, 1.4)', '(-inf, -0.25)',
        '(-inf, -1.05)'],
       [-2.59, -0.22, 1.92, '(-inf, -2.06)', '[-0.25, inf)',
        '[0.07, inf)'],
       [-1.54, -0.28, 1.09, '[-2.06, 1.4)', '(-inf, -0.25)',
        '[0.07, inf)'],
       [1.72, 1.28, -0.95, '[1.4, inf)', '[-0.25, inf)', '[-1.05, 0.07)']],
      dtype=object)
fit(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y) → gators.binning.tree_binning.TreeBinning[source]

Fit the transformer on the dataframe X.

Parameters
X : DataFrame

Input dataframe.

y : Series, default None

Target values.

Returns
“TreeBinning”

Instance of itself.

static check_array(X: numpy.ndarray)

Validate array.

Parameters
X : np.ndarray

Array.

check_array_is_numerics(X: numpy.ndarray)

Check if array is only numerics.

Parameters
X : np.ndarray

Array.

static check_binary_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not binary.

Parameters
y : Series

Target values.

static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Validate dataframe.

Parameters
X : DataFrame

Dataframe.

static check_dataframe_contains_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe contains numeric columns.

Parameters
X : DataFrame

Dataframe.

static check_dataframe_is_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe is only numerics.

Parameters
X : DataFrame

Dataframe.

check_dataframe_with_objects(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe contains object columns.

Parameters
X : DataFrame

Dataframe.

check_datatype(dtype, accepted_dtypes)

Check if the data type is one of the accepted data types.

Parameters
dtype

Data type to check.

accepted_dtypes

Accepted data types.

static check_multiclass_target(y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not discrete.

Parameters
y : Series

Target values.

check_nans(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], columns: List[str])

Raise an error if X contains NaN values.

Parameters
X : DataFrame

Dataframe.

columns : List[str]

List of columns.

static check_regression_target(y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not continuous.

Parameters
y : Series

Target values.

static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Validate target.

Parameters
X : DataFrame

Dataframe.

y : Series

Target values.

fit_transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]

Fit and transform the dataframe X.

Parameters
X : DataFrame

Input dataframe.

y : Series, default None

Input target.

Returns
X : DataFrame

Transformed dataframe.

static get_column_names(inplace: bool, columns: List[str], suffix: str)

Return the names of the modified columns.

Parameters
inplace : bool

If True return columns. If False return columns__suffix.

columns : List[str]

List of columns.

suffix : str

Suffix used if inplace is False.

Returns
List[str]

List of column names.
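
The naming rule described above can be sketched in a few lines of plain Python. The function name column_names is hypothetical, and the body is an assumption based only on the parameter descriptions on this page, not the library's source.

```python
from typing import List


def column_names(inplace: bool, columns: List[str], suffix: str) -> List[str]:
    """Hypothetical sketch of the naming rule for binned columns."""
    if inplace:
        # The binned values overwrite the original columns, so names are unchanged.
        return list(columns)
    # New columns are named "<column>__<suffix>".
    return [f"{column}__{suffix}" for column in columns]


new_names = column_names(inplace=False, columns=["A", "B", "C"], suffix="bin")
same_names = column_names(inplace=True, columns=["A", "B", "C"], suffix="bin")
```

With suffix "bin" and inplace=False this gives A__bin, B__bin and C__bin, matching the column names shown in the class examples.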

static get_labels(pretty_bins_dict: Dict[str, numpy.array])

Get the labels of the bins.

Parameters
pretty_bins_dict : Dict[str, np.array]

Prettified bins used to generate the labels.

Returns
Dict[str, np.array]

Labels.

np.array

Labels.
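
To make the label format concrete, the sketch below builds interval labels from a list of bin edges in the style shown in the class examples: intervals are closed on the left, except the first one, which starts at -inf. The names make_labels and make_labels_dict are hypothetical, and the exact input layout expected by get_labels is an assumption.

```python
import math
from typing import Dict, List


def make_labels(edges: List[float]) -> List[str]:
    """Hypothetical sketch: turn sorted bin edges into interval labels."""
    labels = []
    for left, right in zip(edges[:-1], edges[1:]):
        # An interval starting at -inf cannot include its left edge.
        opening = "(" if math.isinf(left) else "["
        labels.append(f"{opening}{left}, {right})")
    return labels


def make_labels_dict(bins: Dict[str, List[float]]) -> Dict[str, List[str]]:
    """Apply make_labels to every column of a {column: edges} mapping."""
    return {column: make_labels(edges) for column, edges in bins.items()}


labels = make_labels_dict({"A": [-math.inf, -2.06, 1.4, math.inf]})
```

For column A this produces the three labels that appear in the example output: (-inf, -2.06), [-2.06, 1.4) and [1.4, inf).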

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.
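
The <component>__<parameter> form can be illustrated with a plain scikit-learn Pipeline. This is a sketch of the general scikit-learn convention; the step name "tree" is chosen for illustration, and whether TreeBinning exposes its tree under the same tree__ prefix is an assumption this page does not confirm.

```python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# A one-step pipeline whose only component is named "tree".
pipe = Pipeline(steps=[("tree", DecisionTreeClassifier(max_depth=2, random_state=0))])

# Update a parameter of the nested component through the pipeline.
pipe.set_params(tree__max_depth=3)

# With deep=True, nested parameters are reported under the same naming scheme.
nested_depth = pipe.get_params(deep=True)["tree__max_depth"]
```

After the call, both the nested estimator and the get_params mapping report a max_depth of 3.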

transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]

Transform the dataframe X.

Parameters
X : DataFrame

Input dataframe.

Returns
X : DataFrame

Transformed dataframe.

transform_numpy(X: numpy.ndarray) → numpy.ndarray

Transform the array X.

Parameters
X : np.ndarray

Array.

Returns
X : np.ndarray

Transformed array.
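
As a rough picture of what binning a single value involves, the sketch below looks up the interval of each value with a binary search over the split points. It uses only the standard library; the name assign_bin is hypothetical, and this is not a description of how transform_numpy is implemented.

```python
from bisect import bisect_right
from typing import List


def assign_bin(value: float, splits: List[float], labels: List[str]) -> str:
    """Hypothetical sketch: return the label of the interval containing value.

    splits must be sorted, and labels must hold one more entry than splits.
    bisect_right places a value equal to a split point in the interval on
    its right, which matches the left-closed intervals [a, b).
    """
    return labels[bisect_right(splits, value)]


# Rounded split points and labels of column A from the class examples.
splits_a = [-2.06, 1.4]
labels_a = ["(-inf, -2.06)", "[-2.06, 1.4)", "[1.4, inf)"]
binned_a = [assign_bin(v, splits_a, labels_a) for v in [1.07, -2.59, -1.54, 1.72]]
```

The resulting labels follow the order of the binned column A in the example output above.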