gators.binning.Binning
-
class gators.binning.Binning(n_bins: int, inplace=False)
Bin the columns using equal-width splits.
The binning can be done in place or by adding the binned columns to the existing data.
- Parameters
- n_bins : int
Number of bins to use.
- inplace : bool, default False
If False, return the dataframe with new binned columns named “column_name__bin”. Otherwise, return the dataframe with the existing columns replaced by their binned values.
See also
gators.binning.CustomBinning
Bin using the splits given by the user.
gators.binning.QuantileBinning
Bin using splits based on quantiles.
gators.binning.TreeBinning
Bin using splits based on decision trees.
Examples
Imports and initialization:
>>> from gators.binning import Binning
The binning can be done inplace by modifying the existing columns
>>> obj = Binning(n_bins=3, inplace=True)
or by adding new binned columns
>>> obj = Binning(n_bins=3, inplace=False)
The fit, transform, and fit_transform methods accept:
dask dataframes:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({'A': [-1, 0, 1], 'B': [3, 1, 0]}), npartitions=1)
koalas dataframes:
>>> import databricks.koalas as ks
>>> X = ks.DataFrame({'A': [-1, 0, 1], 'B': [3, 1, 0]})
and pandas dataframes:
>>> import pandas as pd
>>> X = pd.DataFrame({'A': [-1, 0, 1], 'B': [3, 1, 0]})
The result is a transformed dataframe belonging to the same dataframe library.
With inplace=True:
>>> obj = Binning(n_bins=3, inplace=True)
>>> obj.fit_transform(X)
               A            B
0  (-inf, -0.33)   [2.0, inf)
1  [-0.33, 0.33)   [1.0, 2.0)
2    [0.33, inf)  (-inf, 1.0)
With inplace=False:
>>> X = pd.DataFrame({'A': [-1, 0, 1], 'B': [3, 1, 0]})
>>> obj = Binning(n_bins=3, inplace=False)
>>> obj.fit_transform(X)
   A  B         A__bin       B__bin
0 -1  3  (-inf, -0.33)   [2.0, inf)
1  0  1  [-0.33, 0.33)   [1.0, 2.0)
2  1  0    [0.33, inf)  (-inf, 1.0)
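The split points in the outputs above are consistent with dividing each column's range [min, max] into n_bins intervals of equal width. The following is a minimal sketch of that arithmetic in plain Python, not the library's implementation; `equal_width_splits` and `assign_bin` are hypothetical helper names introduced for illustration.

```python
from bisect import bisect_right


def equal_width_splits(values, n_bins):
    """Return the n_bins - 1 interior split points of an equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def assign_bin(value, splits):
    """Return the index of the left-closed bin that contains value."""
    # bisect_right sends a value equal to a split into the bin on its right,
    # which matches the "[a, b)" intervals shown in the example output.
    return bisect_right(splits, value)


# Same data as the example: A = [-1, 0, 1], B = [3, 1, 0], three bins.
splits_a = equal_width_splits([-1, 0, 1], n_bins=3)
splits_b = equal_width_splits([3, 1, 0], n_bins=3)
bins_a = [assign_bin(v, splits_a) for v in [-1, 0, 1]]
bins_b = [assign_bin(v, splits_b) for v in [3, 1, 0]]
```

Rounded to two decimals, `splits_a` gives the -0.33 and 0.33 edges of column A, and `splits_b` gives the 1.0 and 2.0 edges of column B.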
Independently of the dataframe library used to fit the transformer, the transform_numpy method only accepts NumPy arrays and returns a transformed NumPy array. Note that transform_numpy should only be used when the number of rows is small, e.g. in a real-time environment.
>>> X = pd.DataFrame({'A': [-1, 0, 1], 'B': [3, 1, 0]})
>>> obj.transform_numpy(X.to_numpy())
array([[-1, 3, '(-inf, -0.33)', '[2.0, inf)'],
       [0, 1, '[-0.33, 0.33)', '[1.0, 2.0)'],
       [1, 0, '[0.33, inf)', '(-inf, 1.0)']], dtype=object)
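The row-wise logic behind the array output above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses plain lists of rows instead of NumPy arrays to stay dependency-free, and `SPLITS`, `LABELS`, and `bin_rows` are hypothetical names. The split points and labels are copied from the example output.

```python
from bisect import bisect_right

# One list of interior split points per column (A, then B).
SPLITS = [[-1 / 3, 1 / 3], [1.0, 2.0]]
# One list of interval labels per column, ordered from lowest to highest bin.
LABELS = [
    ["(-inf, -0.33)", "[-0.33, 0.33)", "[0.33, inf)"],
    ["(-inf, 1.0)", "[1.0, 2.0)", "[2.0, inf)"],
]


def bin_rows(rows, splits=SPLITS, labels=LABELS):
    """Append one bin label per column to every row (inplace=False behaviour)."""
    out = []
    for row in rows:
        # Look up the bin index of each value, then map it to its label.
        binned = [labels[j][bisect_right(splits[j], v)] for j, v in enumerate(row)]
        out.append(list(row) + binned)
    return out


result = bin_rows([[-1, 3], [0, 1], [1, 0]])
```

Each output row keeps the original values and gains one label per binned column, mirroring the four-column array shown above.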
-
compute_bins(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Tuple[List[List[float]], numpy.ndarray]
Compute the bins list and the bins array. The bins list is used for dataframes and the bins array is used for arrays.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series, default None
Target values.
- Returns
- bins : List[List[float]]
Bin splits definition, with one list of splits per binned column.
- bins_np : np.ndarray
Bin splits definition for NumPy.
-
static check_array(X: numpy.ndarray)
Validate array.
- Parameters
- X : np.ndarray
Array.
-
check_array_is_numerics(X: numpy.ndarray)
Check that the array contains only numeric values.
- Parameters
- X : np.ndarray
Array.
-
static check_binary_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not binary.
- Parameters
- y : Series
Target values.
-
static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Validate dataframe.
- Parameters
- X : DataFrame
Dataframe.
-
static check_dataframe_contains_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check that the dataframe contains numeric columns.
- Parameters
- X : DataFrame
Dataframe.
-
static check_dataframe_is_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check that the dataframe contains only numeric columns.
- Parameters
- X : DataFrame
Dataframe.
-
check_dataframe_with_objects(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe contains object columns.
- Parameters
- X : DataFrame
Dataframe.
-
check_datatype(dtype, accepted_dtypes)
Check that the datatype is one of the accepted datatypes.
- Parameters
- dtype
Datatype to check.
- accepted_dtypes
Accepted datatypes.
-
static check_multiclass_target(y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not discrete.
- Parameters
- y : Series
Target values.
-
check_nans(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], columns: List[str])
Raise an error if X contains NaN values.
- Parameters
- X : DataFrame
Dataframe.
- columns : List[str]
List of columns.
-
static check_regression_target(y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not continuous.
- Parameters
- y : Series
Target values.
-
static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Validate target.
- Parameters
- X : DataFrame
Dataframe.
- y : Series
Target values.
-
fit(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → gators.transformers.transformer.Transformer
Fit the transformer on the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series, default None
Target values.
- Returns
- self : ‘Transformer’
Instance of itself.
-
fit_transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]
Fit and transform the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series, default None
Target values.
- Returns
- X : DataFrame
Transformed dataframe.
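Because fit returns the transformer itself, calling fit_transform is expected to give the same result as chaining fit and transform. The sketch below illustrates that contract with a hypothetical stand-in class, `EqualWidthSketch`, which operates on a plain dict of columns rather than a dataframe and emits bin indices instead of interval labels. It is not the gators implementation.

```python
from bisect import bisect_right


class EqualWidthSketch:
    """Hypothetical stand-in for a fit/transform binning transformer."""

    def __init__(self, n_bins):
        self.n_bins = n_bins
        self.splits_ = {}

    def fit(self, X, y=None):
        # Learn the interior split points of every column; y is ignored.
        for name, values in X.items():
            lo, hi = min(values), max(values)
            width = (hi - lo) / self.n_bins
            self.splits_[name] = [lo + i * width for i in range(1, self.n_bins)]
        return self  # returning self allows fit(X).transform(X)

    def transform(self, X):
        # Keep the original columns and add one "<name>__bin" column of bin indices.
        out = {name: list(values) for name, values in X.items()}
        for name, values in X.items():
            out[f"{name}__bin"] = [bisect_right(self.splits_[name], v) for v in values]
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = {"A": [-1, 0, 1], "B": [3, 1, 0]}
chained = EqualWidthSketch(n_bins=3).fit(X).transform(X)
combined = EqualWidthSketch(n_bins=3).fit_transform(X)
```

Both `chained` and `combined` hold the original columns plus `A__bin` and `B__bin`.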
-
static get_column_names(inplace: bool, columns: List[str], suffix: str)
Return the names of the modified columns.
- Parameters
- inplace : bool
If True return columns. If False return columns__suffix.
- columns : List[str]
List of columns.
- suffix : str
Suffix used if inplace is False.
- Returns
- List[str]
List of column names.
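The naming rule described above can be sketched in a few lines. `column_names` is a hypothetical stand-in written from the parameter descriptions, assuming the double-underscore separator seen in “column_name__bin”.

```python
from typing import List


def column_names(inplace: bool, columns: List[str], suffix: str) -> List[str]:
    """Return the names of the columns that hold the transformed values."""
    if inplace:
        # The existing columns are overwritten, so their names are unchanged.
        return list(columns)
    # New columns are added next to the originals as "<column>__<suffix>".
    return [f"{column}__{suffix}" for column in columns]


new_names = column_names(False, ["A", "B"], "bin")
same_names = column_names(True, ["A", "B"], "bin")
```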
-
static get_labels(pretty_bins_dict: Dict[str, numpy.array])
Get the labels of the bins.
- Parameters
- pretty_bins_dict : Dict[str, np.array]
Prettified bins used to generate the labels.
- Returns
- Dict[str, np.array]
Labels.
- np.array
Labels.
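As an illustration of how interval labels such as “(-inf, -0.33)” and “[-0.33, 0.33)” could be derived from rounded bin edges, here is a minimal sketch. It assumes each column's edge list starts with -inf and ends with inf, which is not confirmed by this page, and `labels_from_edges` is a hypothetical name.

```python
import math
from typing import Dict, List


def labels_from_edges(pretty_bins: Dict[str, List[float]]) -> Dict[str, List[str]]:
    """Build one 'left, right' interval label per pair of consecutive edges."""
    labels = {}
    for column, edges in pretty_bins.items():
        column_labels = []
        for left, right in zip(edges[:-1], edges[1:]):
            # An infinite left edge cannot be included, so that bin is open.
            opening = "(" if math.isinf(left) else "["
            column_labels.append(f"{opening}{left}, {right})")
        labels[column] = column_labels
    return labels


labels = labels_from_edges({
    "A": [-math.inf, -0.33, 0.33, math.inf],
    "B": [-math.inf, 1.0, 2.0, math.inf],
})
```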
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params : dict
Estimator parameters.
- Returns
- self : estimator instance
Estimator instance.
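The <component>__<parameter> convention can be illustrated by splitting a key at its first double underscore. The helper below, `split_param_key`, is a hypothetical name, and the step name `binning` is an assumed example. The sketch shows only how such keys are addressed, not how the estimator applies them.

```python
def split_param_key(key: str):
    """Split 'component__parameter' into (component, parameter)."""
    component, delimiter, parameter = key.partition("__")
    if delimiter:
        # Nested key: the remainder is handed on to the named component.
        return component, parameter
    # Plain key: the parameter belongs to the estimator itself.
    return None, key


nested = split_param_key("binning__n_bins")
plain = split_param_key("n_bins")
deep = split_param_key("pipe__binning__n_bins")
```

Only the first separator is consumed, so a deeper key such as `pipe__binning__n_bins` leaves `binning__n_bins` for the `pipe` component to resolve in turn.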
-
transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]
Transform the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- Returns
- X : DataFrame
Transformed dataframe.
-
transform_numpy(X: numpy.ndarray) → numpy.ndarray
Transform the array X.
- Parameters
- X : np.ndarray
Array.
- Returns
- X : np.ndarray
Transformed array.