gators.feature_generation.ClusterStatistics
-
class
gators.feature_generation.
ClusterStatistics
(clusters_dict: Dict[str, List[str]], column_names: List[str] = None)
Create new columns from statistics computed at the row level: the mean and standard deviation of each cluster of columns.
The data should be composed of numerical columns only. Use gators.encoders to replace the categorical columns with numerical ones before using ClusterStatistics.
- Parameters
- clusters_dict : Dict[str, List[str]]
Dictionary mapping each cluster name to the list of columns in that cluster.
- column_names : List[str], default None
List of new column names.
Examples
Imports and initialization:
>>> from gators.feature_generation import ClusterStatistics
>>> clusters_dict = {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}
>>> obj = ClusterStatistics(clusters_dict=clusters_dict)
The fit, transform, and fit_transform methods accept:
dask dataframes:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame(
... {'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]}), npartitions=1)
koalas dataframes:
>>> import databricks.koalas as ks
>>> X = ks.DataFrame(
... {'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})
and pandas dataframes:
>>> import pandas as pd
>>> X = pd.DataFrame(
... {'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})
The result is a transformed dataframe belonging to the same dataframe library.
>>> obj.fit_transform(X)
     A    B    C  cluster_1__mean  cluster_1__std  cluster_2__mean  cluster_2__std
0  9.0  3.0  6.0              6.0        4.242641              7.5        2.121320
1  9.0  4.0  7.0              6.5        3.535534              8.0        1.414214
2  7.0  5.0  8.0              6.0        1.414214              7.5        0.707107
>>> X = pd.DataFrame({'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})
>>> _ = obj.fit(X)
>>> obj.transform_numpy(X.to_numpy())
array([[9.        , 3.        , 6.        , 6.        , 4.24264069,
        7.5       , 2.12132034],
       [9.        , 4.        , 7.        , 6.5       , 3.53553391,
        8.        , 1.41421356],
       [7.        , 5.        , 8.        , 6.        , 1.41421356,
        7.5       , 0.70710678]])
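The values in the example output are consistent with a row-wise mean and a row-wise sample standard deviation (ddof=1) taken over each cluster's columns. The following is a minimal plain-pandas sketch of that computation, not the gators implementation; the `__mean` and `__std` suffixes are copied from the output shown above.

```python
import pandas as pd

X = pd.DataFrame({'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})
clusters_dict = {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}

# Sketch only: add one mean column and one std column per cluster.
X_new = X.copy()
for name, cols in clusters_dict.items():
    # axis=1 computes the statistic across the cluster's columns, row by row.
    X_new[f'{name}__mean'] = X[cols].mean(axis=1)
    # pandas uses ddof=1 (sample standard deviation) by default.
    X_new[f'{name}__std'] = X[cols].std(axis=1)
```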
-
fit
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → gators.feature_generation.cluster_statistics.ClusterStatistics
Fit the transformer on the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series, default None
Target values.
- Returns
- ClusterStatistics
Instance of itself.
-
transform
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]
Transform the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- Returns
- X : DataFrame
Dataframe with the cluster statistics features.
-
transform_numpy
(X: numpy.ndarray) → numpy.ndarray
Transform the array X.
- Parameters
- X : np.ndarray
Input array.
- Returns
- X : np.ndarray
Transformed array.
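As an illustration of what the array transform produces, here is a hedged NumPy sketch that reproduces the `transform_numpy` output from the Examples section. The `idx_columns` array is a hypothetical stand-in for the column positions of each cluster; the library's internal layout may differ.

```python
import numpy as np

X = np.array([[9., 3., 6.], [9., 4., 7.], [7., 5., 8.]])
# Hypothetical column positions: one row per cluster, (A, B) and (A, C).
idx_columns = np.array([[0, 1], [0, 2]])

stats = []
for idx in idx_columns:
    cluster = X[:, idx]                        # columns of one cluster
    stats.append(cluster.mean(axis=1))         # row-wise mean
    stats.append(cluster.std(axis=1, ddof=1))  # row-wise sample std

# Append the statistics as new columns to the right of the input array.
X_new = np.column_stack([X, *stats])
```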
-
static
get_idx_columns
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], clusters_dict: Dict[str, List[str]]) → numpy.ndarray
Get the column indices of the clusters.
- Parameters
- X : DataFrame
Input data.
- clusters_dict : Dict[str, List[str]]
Clusters.
- Returns
- np.ndarray
Column indices of the clusters.
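One way to map cluster column names to positions is pandas' `Index.get_indexer`. The helper below, `get_idx_columns_sketch`, is a hypothetical illustration and not the library's code; it assumes every cluster has the same number of columns so the result is a rectangular array.

```python
import numpy as np
import pandas as pd

def get_idx_columns_sketch(X, clusters_dict):
    """Hypothetical helper: positions of each cluster's columns in X."""
    # get_indexer returns the integer position of each label (-1 if missing).
    return np.array(
        [X.columns.get_indexer(cols) for cols in clusters_dict.values()]
    )

X = pd.DataFrame({'A': [9., 9., 7.], 'B': [3., 4., 5.], 'C': [6., 7., 8.]})
clusters_dict = {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}
idx_columns = get_idx_columns_sketch(X, clusters_dict)
```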
-
static
get_column_names
(clusters_dict: Dict[str, List[str]]) → List[str]
Get the cluster statistics column names.
- Parameters
- clusters_dict : Dict[str, List[str]]
Clusters.
- Returns
- List[str]
List of column names.
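Judging from the example output (`cluster_1__mean`, `cluster_1__std`, ...), the default names appear to follow a `<cluster>__<statistic>` pattern. The function below is a hypothetical sketch of that naming scheme, assuming mean and std are the only statistics and appear in that order.

```python
def get_column_names_sketch(clusters_dict):
    """Hypothetical helper: '<cluster>__mean' and '<cluster>__std' per cluster."""
    return [
        f'{name}__{stat}'
        for name in clusters_dict
        for stat in ('mean', 'std')
    ]

names = get_column_names_sketch(
    {'cluster_1': ['A', 'B'], 'cluster_2': ['A', 'C']}
)
```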
-
static
check_array
(X: numpy.ndarray)
Validate array.
- Parameters
- X : np.ndarray
Array.
-
check_array_is_numerics
(X: numpy.ndarray)
Check if array is only numerics.
- Parameters
- X : np.ndarray
Array.
-
static
check_binary_target
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not binary.
- Parameters
- X : DataFrame
Dataframe.
- y : Series
Target values.
-
static
check_dataframe
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Validate dataframe.
- Parameters
- X : DataFrame
Dataframe.
-
static
check_dataframe_contains_numerics
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe contains numerical columns.
- Parameters
- X : DataFrame
Dataframe.
-
static
check_dataframe_is_numerics
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe is only numerics.
- Parameters
- X : DataFrame
Dataframe.
-
check_dataframe_with_objects
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Check if dataframe contains object columns.
- Parameters
- X : DataFrame
Dataframe.
-
check_datatype
(dtype, accepted_dtypes)
Check if the datatype is one of the accepted datatypes.
- Parameters
- dtype
Datatype to check.
- accepted_dtypes
Accepted datatypes.
-
static
check_multiclass_target
(y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not discrete.
- Parameters
- y : Series
Target values.
-
check_nans
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], columns: List[str])
Raise an error if X contains NaN values.
- Parameters
- X : DataFrame
Dataframe.
- columns : List[str]
List of columns.
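A NaN check restricted to a list of columns can be sketched in plain pandas as follows. `check_nans_sketch` is a hypothetical helper; the exception type and message are assumptions, not the library's actual behavior.

```python
import numpy as np
import pandas as pd

def check_nans_sketch(X, columns):
    """Hypothetical helper: raise if any of the given columns holds a NaN."""
    # isnull() flags missing cells; any().any() reduces over rows, then columns.
    if X[columns].isnull().any().any():
        raise ValueError('The selected columns contain NaN values.')

X = pd.DataFrame({'A': [1., 2.], 'B': [3., np.nan]})
check_nans_sketch(X, ['A'])  # passes: column A has no NaN

try:
    check_nans_sketch(X, ['A', 'B'])
    raised = False
except ValueError:
    raised = True  # column B contains a NaN
```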
-
static
check_regression_target
(y: Union[pd.Series, ks.Series, dd.Series])
Raise an error if the target is not continuous.
- Parameters
- y : Series
Target values.
-
static
check_target
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Validate target.
- Parameters
- X : DataFrame
Dataframe.
- y : Series
Target values.
-
fit_transform
(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series] = None) → Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]
Fit and transform the dataframe X.
- Parameters
- X : DataFrame
Input dataframe.
- y : Series, default None
Input target.
- Returns
- X : DataFrame
Transformed dataframe.
-
get_params
(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
-
set_params
(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params : dict
Estimator parameters.
- Returns
- self : estimator instance
Estimator instance.
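The `<component>__<parameter>` form can be illustrated with a scikit-learn Pipeline. The step names `scaler` and `clf` and the chosen estimators are arbitrary examples for this sketch and are unrelated to gators.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])

# '<step name>__<parameter>' reaches into the nested estimator.
pipe.set_params(clf__C=0.5, scaler__with_mean=False)

# get_params() exposes the same nested keys.
params = pipe.get_params()
```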