gators.binning.TreeBinning
-
class gators.binning.TreeBinning(tree: Union[sklearn.tree._classes.DecisionTreeClassifier, sklearn.tree._classes.DecisionTreeRegressor], inplace=False)
Bin the columns using decision-tree-based splits.
The binning can be done in place or by adding the binned columns to the existing data.
- Parameters
- tree : ‘DecisionTree’
Decision tree model used to create the bin intervals.
- inplace : bool, default False
If False, return the dataframe with new binned columns named “column_name__bin”. Otherwise, return the dataframe with the existing columns replaced by their binned values.
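As a rough illustration of what "decision tree based splits" means, the sketch below fits a scikit-learn tree on a single column and reads the learned thresholds back from the fitted tree. The helper name `tree_split_points` is hypothetical, and fitting one tree per column is an assumption made for illustration; this is not taken from the gators source.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tree_split_points(values, target, max_depth=2, random_state=0):
    """Hypothetical helper: fit a tree on one column and return its sorted split thresholds."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    # scikit-learn expects a 2-D feature matrix, so the column is reshaped to (n_rows, 1).
    tree.fit(np.asarray(values, dtype=float).reshape(-1, 1), target)
    # Internal nodes store a feature index >= 0; leaves are marked with a negative index.
    is_split = tree.tree_.feature >= 0
    return np.unique(tree.tree_.threshold[is_split])


# Column 'B' and the target from the examples on this page.
splits = tree_split_points([-1.19, -0.22, -0.28, 1.28], [0, 1, 0, 1])
```

The sorted thresholds can then serve as bin edges, with an open-ended interval on each side.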
See also
gators.binning.CustomBinning
Bin using user input splits.
gators.binning.Binning
Bin using equal splits.
gators.binning.QuantileBinning
Bin using the variable quantiles.
Examples
>>> from gators.binning import TreeBinning
>>> from sklearn.tree import DecisionTreeClassifier
The binning can be done inplace by modifying the existing columns
>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=True)
or by adding new binned columns
>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=False)
The fit, transform, and fit_transform methods accept:
dask dataframes:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]}), npartitions=1)
>>> y = dd.from_pandas(pd.Series([0, 1, 0, 1], name="TARGET"), npartitions=1)
koalas dataframes:
>>> import databricks.koalas as ks
>>> X = ks.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> y = ks.Series([0, 1, 0, 1], name="TARGET")
and pandas dataframes:
>>> import pandas as pd
>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> y = pd.Series([0, 1, 0, 1], name="TARGET")
The result is a transformed dataframe belonging to the same dataframe library.
with inplace=True
>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=True)
>>> obj.fit_transform(X, y)
               A              B              C
0   [-2.06, 1.4)  (-inf, -0.25)  (-inf, -1.05)
1  (-inf, -2.06)   [-0.25, inf)    [0.07, inf)
2   [-2.06, 1.4)  (-inf, -0.25)    [0.07, inf)
3     [1.4, inf)   [-0.25, inf)  [-1.05, 0.07)
with inplace=False
>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> obj = TreeBinning(tree=DecisionTreeClassifier(max_depth=2, random_state=0), inplace=False)
>>> obj.fit_transform(X, y)
      A     B     C         A__bin         B__bin         C__bin
0  1.07 -1.19 -1.15   [-2.06, 1.4)  (-inf, -0.25)  (-inf, -1.05)
1 -2.59 -0.22  1.92  (-inf, -2.06)   [-0.25, inf)    [0.07, inf)
2 -1.54 -0.28  1.09   [-2.06, 1.4)  (-inf, -0.25)    [0.07, inf)
3  1.72  1.28 -0.95     [1.4, inf)   [-0.25, inf)  [-1.05, 0.07)
Independently of the dataframe library used to fit the transformer, the transform_numpy method only accepts NumPy arrays and returns a transformed NumPy array. Note that this method should only be used when the number of rows is small, e.g. in a real-time environment.
>>> X = pd.DataFrame({
... 'A': [1.07, -2.59, -1.54, 1.72],
... 'B': [-1.19, -0.22, -0.28, 1.28],
... 'C': [-1.15, 1.92, 1.09, -0.95]})
>>> obj.transform_numpy(X.to_numpy())
array([[1.07, -1.19, -1.15, '[-2.06, 1.4)', '(-inf, -0.25)', '(-inf, -1.05)'],
       [-2.59, -0.22, 1.92, '(-inf, -2.06)', '[-0.25, inf)', '[0.07, inf)'],
       [-1.54, -0.28, 1.09, '[-2.06, 1.4)', '(-inf, -0.25)', '[0.07, inf)'],
       [1.72, 1.28, -0.95, '[1.4, inf)', '[-0.25, inf)', '[-1.05, 0.07)']], dtype=object)
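The array-level binning shown above can be pictured with plain NumPy. The following is a minimal sketch, not the gators implementation: it assumes the split points and labels for column A are already known (they are copied from the example output) and uses `np.digitize` to map each value to its half-open interval.

```python
import numpy as np

# Split points and labels for column 'A', copied from the example output above.
splits = np.array([-2.06, 1.4])
labels = np.array(["(-inf, -2.06)", "[-2.06, 1.4)", "[1.4, inf)"], dtype=object)

column = np.array([1.07, -2.59, -1.54, 1.72])

# With the default right=False, np.digitize returns i such that
# splits[i-1] <= x < splits[i], which matches intervals closed on the left.
bin_index = np.digitize(column, splits)
binned = labels[bin_index]
```

Because the lookup is a single vectorised call on an in-memory array, this style of transform suits small batches such as a single incoming record.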
-
fit(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y) → gators.binning.tree_binning.TreeBinning
Fit the transformer on the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series
Target values.
- Returns
- “TreeBinning”
Instance of itself.
-
static check_array(X: numpy.ndarray)
Validate array.
- Parameters
- X : np.ndarray
Array.
-
check_array_is_numerics(X: numpy.ndarray)
Check if array is only numerics.
- Parameters
- X : np.ndarray
Array.
-
static check_binary_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not binary.
- Parameters
- y : Series
Target values.
-
static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Validate dataframe.
- Parameters
- X : DataFrame
Dataframe.
-
static check_dataframe_contains_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe contains numerical columns.
- Parameters
- X : DataFrame
Dataframe.
-
static check_dataframe_is_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe is only numerics.
- Parameters
- X : DataFrame
Dataframe.
-
check_dataframe_with_objects(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe contains object columns.
- Parameters
- X : DataFrame
Dataframe.
-
check_datatype(dtype, accepted_dtypes)
Check if the datatype is one of the accepted datatypes.
- Parameters
- dtype
Datatype to check.
- accepted_dtypes
Accepted datatypes.
-
static check_multiclass_target(y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not discrete.
- Parameters
- y : Series
Target values.
-
check_nans(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], columns: List[str])
Raise an error if X contains NaN values.
- Parameters
- X : DataFrame
Dataframe.
- columns : List[str]
List of columns.
-
static check_regression_target(y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not continuous.
- Parameters
- y : Series
Target values.
-
static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Validate target.
- Parameters
- X : DataFrame
Dataframe.
- y : Series
Target values.
-
fit_transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]
Fit and transform the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series, default None
Input target.
- Returns
- X : DataFrame
Transformed dataframe.
-
static get_column_names(inplace: bool, columns: List[str], suffix: str)
Return the names of the modified columns.
- Parameters
- inplace : bool
If True, return columns. If False, return columns__suffix.
- columns : List[str]
List of columns.
- suffix : str
Suffix used if inplace is False.
- Returns
- List[str]
List of column names.
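The naming rule described here can be sketched in a few lines. The function below is a hypothetical stand-in written from the parameter descriptions alone, not a copy of the library code; it assumes the suffix is joined to each column name with a double underscore, as in "column_name__bin".

```python
from typing import List


def column_names(inplace: bool, columns: List[str], suffix: str) -> List[str]:
    """Hypothetical stand-in mirroring the documented behaviour of get_column_names."""
    if inplace:
        # In-place transforms overwrite the existing columns, so the names are unchanged.
        return list(columns)
    # Otherwise each new column gets the original name plus "__<suffix>".
    return [f"{name}__{suffix}" for name in columns]


new_names = column_names(inplace=False, columns=["A", "B", "C"], suffix="bin")
same_names = column_names(inplace=True, columns=["A", "B", "C"], suffix="bin")
```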
-
static get_labels(pretty_bins_dict: Dict[str, numpy.array])
Get the labels of the bins.
- Parameters
- pretty_bins_dict : Dict[str, np.array]
Prettified bins used to generate the labels.
- Returns
- Dict[str, np.array]
Labels.
- np.array
Labels.
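To make the label format concrete, here is a minimal sketch that turns a sorted list of split points into interval labels of the kind shown in the examples on this page. The helper `interval_labels` is hypothetical, and its input (a plain list of split points rather than a dictionary of prettified bins) is an assumption chosen to keep the example short.

```python
import math


def interval_labels(splits):
    """Hypothetical helper: build half-open interval labels from sorted split points."""
    edges = [-math.inf, *splits, math.inf]
    labels = []
    for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        # The first interval starts at -inf, which is never attained, so it opens with "(".
        opening = "(" if i == 0 else "["
        labels.append(f"{opening}{left}, {right})")
    return labels


labels = interval_labels([-2.06, 1.4])
```

For the split points of column A this yields the three labels that appear in the example output: an open-ended lower bin, one bounded bin, and an open-ended upper bin.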
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params : dict
Estimator parameters.
- Returns
- self : estimator instance
Estimator instance.
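The `<component>__<parameter>` convention is the one used by scikit-learn estimators. The sketch below shows it on a plain scikit-learn `Pipeline` rather than on TreeBinning itself; the step names "scale" and "tree" are arbitrary choices for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
])

# "tree__max_depth" addresses the max_depth parameter of the step named "tree".
pipe.set_params(tree__max_depth=3)

# get_params(deep=True) exposes nested parameters under the same naming scheme.
depth = pipe.get_params()["tree__max_depth"]
```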
-
transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]
Transform the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- Returns
- X : DataFrame
Transformed dataframe.
-
transform_numpy(X: numpy.ndarray) → numpy.ndarray
Transform the array X.
- Parameters
- X : np.ndarray
Array.
- Returns
- X : np.ndarray
Transformed array.