gators.feature_generation.ClusterStatistics

class gators.feature_generation.ClusterStatistics(clusters_dict: Dict[str, List[str]], column_names: List[str] = None)

Create new columns from row-wise statistics (mean and standard deviation) computed over each cluster of columns.

The data should be composed of numerical columns only. Use gators.encoders to convert categorical columns to numerical ones before using ClusterStatistics.

Parameters
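clusters_dict : Dict[str, List[str]]

Dictionary of clusters of features. Each key is a cluster name and each value is the list of columns belonging to that cluster.

column_names : List[str], default None

List of new column names. If None, the names are derived from the cluster names, as in the example below.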

Examples

Imports and initialization:

>>> from gators.feature_generation import ClusterStatistics
>>> clusters_dict = {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}
>>> obj = ClusterStatistics(clusters_dict=clusters_dict)

The fit, transform, and fit_transform methods accept:

  • dask dataframes:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame(
... {'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]}), npartitions=1)

  • koalas dataframes:

>>> import databricks.koalas as ks
>>> X = ks.DataFrame(
... {'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})

  • and pandas dataframes:

>>> import pandas as pd
>>> X = pd.DataFrame(
... {'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})

The result is a transformed dataframe belonging to the same dataframe library.

>>> obj.fit_transform(X)
     A    B    C  cluster_1__mean  cluster_1__std  cluster_2__mean  cluster_2__std
0  9.0  3.0  6.0              6.0        4.242641              7.5        2.121320
1  9.0  4.0  7.0              6.5        3.535534              8.0        1.414214
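2  7.0  5.0  8.0              6.0        1.414214              7.5        0.707107

Once the transformer is fitted, the transform_numpy method accepts a NumPy array and returns a transformed NumPy array:

>>> X = pd.DataFrame({'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})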
>>> _ = obj.fit(X)
>>> obj.transform_numpy(X.to_numpy())
array([[9.        , 3.        , 6.        , 6.        , 4.24264069,
        7.5       , 2.12132034],
       [9.        , 4.        , 7.        , 6.5       , 3.53553391,
        8.        , 1.41421356],
       [7.        , 5.        , 8.        , 6.        , 1.41421356,
        7.5       , 0.70710678]])
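
The values in the output above can be cross-checked with plain pandas. The following is a minimal sketch, not the gators implementation: the helper name `cluster_statistics_sketch` is hypothetical, and it assumes that the statistics are the row-wise mean and the sample standard deviation (`ddof=1`), which is what the numbers shown above are consistent with.

```python
import pandas as pd

# Same toy data and clusters as in the example above.
X = pd.DataFrame({'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})
clusters_dict = {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}


def cluster_statistics_sketch(X, clusters_dict):
    """Hypothetical helper: append row-wise mean and std for each cluster."""
    out = X.copy()
    for name, columns in clusters_dict.items():
        out[f'{name}__mean'] = X[columns].mean(axis=1)
        # pandas uses ddof=1 by default (sample standard deviation).
        out[f'{name}__std'] = X[columns].std(axis=1)
    return out


result = cluster_statistics_sketch(X, clusters_dict)
print(result)
```

Because the sketch only relies on `mean(axis=1)` and `std(axis=1)`, it makes explicit that each new column depends only on the values in its own row.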
fit(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → gators.feature_generation.cluster_statistics.ClusterStatistics

Fit the transformer on the dataframe X.

Parameters
X : DataFrame

Input dataframe.

y : Series, default None

Target values.

Returns
ClusterStatistics

Instance of itself.

transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]

Transform the dataframe X.

Parameters
X : DataFrame

Input dataframe.

Returns
X : DataFrame

Dataframe with the cluster statistics features appended.

transform_numpy(X: numpy.ndarray) → numpy.ndarray

Transform the array X.

Parameters
X : np.ndarray

Input array.

Returns
X : np.ndarray

Transformed array.
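
As an illustration of the array-level transformation, here is a minimal NumPy sketch. It is not the library's code: `transform_numpy_sketch` and the layout of `idx_columns` (one row of column positions per cluster) are assumptions made for this example, and the sample standard deviation (`ddof=1`) is chosen because it reproduces the values in the class example.

```python
import numpy as np

X = np.array([[9., 3., 6.], [9., 4., 7.], [7., 5., 8.]])
# Assumed layout: one row per cluster, holding column positions.
# Row 0 is cluster_1 (A, B); row 1 is cluster_2 (A, C).
idx_columns = np.array([[0, 1], [0, 2]])


def transform_numpy_sketch(X, idx_columns):
    """Hypothetical helper: append per-cluster row mean and std to X."""
    new_columns = []
    for idx in idx_columns:
        block = X[:, idx]  # columns of the current cluster
        new_columns.append(block.mean(axis=1))
        new_columns.append(block.std(axis=1, ddof=1))
    return np.concatenate([X, np.column_stack(new_columns)], axis=1)


X_new = transform_numpy_sketch(X, idx_columns)
print(X_new)
```

The original columns are kept first and the statistics are appended cluster by cluster, which matches the column order of the dataframe output.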

static get_idx_columns(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], clusters_dict: Dict[str, List[str]]) → numpy.ndarray

Get the column indices of the clusters.

Parameters
X : DataFrame

Input data.

clusters_dict : Dict[str, List[str]]

Clusters.

Returns
np.ndarray

Column indices of the clusters.
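
One plausible way to build such an index array is sketched below. The function `get_idx_columns_sketch` is a hypothetical stand-in, and the 2D layout (one row per cluster) assumes that all clusters contain the same number of columns; the actual array layout used by gators is not specified here.

```python
import numpy as np
import pandas as pd


def get_idx_columns_sketch(X, clusters_dict):
    """Hypothetical stand-in: map each cluster's column names to positions."""
    columns = list(X.columns)
    return np.array(
        [[columns.index(c) for c in cluster]
         for cluster in clusters_dict.values()]
    )


X = pd.DataFrame({'A': [9.], 'B': [3.], 'C': [6.]})
clusters_dict = {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}
idx = get_idx_columns_sketch(X, clusters_dict)
print(idx)
```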

static get_column_names(clusters_dict: Dict[str, List[str]]) → List[str]

Get statistics cluster column names.

Parameters
clusters_dict : Dict[str, List[str]]

Clusters.

Returns
List[str]

List of columns.
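
The naming pattern visible in the class example (`cluster_1__mean`, `cluster_1__std`, ...) can be sketched as follows. `get_column_names_sketch` is a hypothetical helper written for illustration; it assumes two statistics per cluster, in the order shown in that example.

```python
def get_column_names_sketch(clusters_dict):
    """Hypothetical helper: build '<cluster>__<statistic>' column names."""
    names = []
    for cluster_name in clusters_dict:
        names.extend([f'{cluster_name}__mean', f'{cluster_name}__std'])
    return names


column_names = get_column_names_sketch(
    {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}
)
print(column_names)
```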
static check_array(X: numpy.ndarray)

Validate array.

Parameters
X : np.ndarray

Array.

check_array_is_numerics(X: numpy.ndarray)

Check that the array contains only numerical values.

Parameters
X : np.ndarray

Array.

static check_binary_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not binary.

Parameters
X : DataFrame

Dataframe.

y : Series

Target values.
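
A check of this kind could look like the sketch below. It is an assumption, not the gators implementation: "binary" is taken to mean exactly two distinct target values, the helper takes only `y` for brevity, and both the helper name and the `ValueError` type are chosen for this example.

```python
import pandas as pd


def check_binary_target_sketch(y):
    """Hypothetical helper: raise if y does not have exactly two classes."""
    if y.nunique() != 2:
        raise ValueError('The target is not binary.')


check_binary_target_sketch(pd.Series([0, 1, 1, 0]))  # passes silently

try:
    check_binary_target_sketch(pd.Series([0, 1, 2]))
    raised = False
except ValueError:
    raised = True
```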

static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Validate dataframe.

Parameters
X : DataFrame

Dataframe.

static check_dataframe_contains_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check that the dataframe contains numerical columns.

Parameters
X : DataFrame

Dataframe.

static check_dataframe_is_numerics(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check that the dataframe contains only numerical columns.

Parameters
X : DataFrame

Dataframe.

check_dataframe_with_objects(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Check if dataframe contains object columns.

Parameters
X : DataFrame

Dataframe.

check_datatype(dtype, accepted_dtypes)

Check that the datatype is one of the accepted datatypes.

Parameters
dtype

Datatype to check.

accepted_dtypes

Accepted datatypes.

static check_multiclass_target(y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not discrete.

Parameters
y : Series

Target values.

check_nans(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], columns: List[str])

Raise an error if X contains NaN values.

Parameters
X : DataFrame

Dataframe.

columns : List[str]

List of columns.
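
For illustration, a NaN check restricted to a list of columns might be written as below. The helper name and the `ValueError` type are assumptions made for this sketch; the exception raised by gators may differ.

```python
import numpy as np
import pandas as pd


def check_nans_sketch(X, columns):
    """Hypothetical helper: raise if any listed column contains NaN."""
    if X[columns].isnull().any().any():
        raise ValueError('The listed columns contain NaN values.')


X = pd.DataFrame({'A': [1., np.nan], 'B': [3., 4.]})
check_nans_sketch(X, ['B'])  # column B has no NaN, so nothing is raised

try:
    check_nans_sketch(X, ['A'])
    raised = False
except ValueError:
    raised = True
```

Only the listed columns are inspected, so NaN values elsewhere in the dataframe do not trigger the error.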

static check_regression_target(y: Union[pd.Series, ks.Series, dd.Series])

Raise an error if the target is not continuous.

Parameters
y : Series

Target values.

static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Validate target.

Parameters
X : DataFrame

Dataframe.

y : Series

Target values.

fit_transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]

Fit and transform the dataframe X.

Parameters
X : DataFrame

Input dataframe.

y : Series, default None

Input target.

Returns
X : DataFrame

Transformed dataframe.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.
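
The `<component>__<parameter>` convention can be illustrated with a small, self-contained sketch. `SketchEstimator` is a hypothetical toy class, not the scikit-learn or gators implementation; it only shows how a double-underscore key can be split and forwarded to a nested object.

```python
class SketchEstimator:
    """Hypothetical minimal estimator used to illustrate nested parameters."""

    def __init__(self, **params):
        self.__dict__.update(params)

    def get_params(self):
        return dict(self.__dict__)

    def set_params(self, **params):
        for key, value in params.items():
            name, delim, sub_key = key.partition('__')
            if delim:
                # Forward '<component>__<parameter>' to the nested object.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


inner = SketchEstimator(alpha=1.0)
outer = SketchEstimator(step=inner, verbose=False)
outer.set_params(step__alpha=0.5, verbose=True)
print(inner.get_params(), outer.verbose)
```

Returning `self` from `set_params` allows calls to be chained, which is the same behaviour the Returns section above describes.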