gators.sampling.UnsupervisedSampling

class gators.sampling.UnsupervisedSampling(n_samples: int)

Randomly sample rows of the data and the corresponding target values, keeping n_samples rows.

Parameters
n_samples : int

Number of samples to keep.

Examples

>>> from gators.sampling import UnsupervisedSampling
>>> obj = UnsupervisedSampling(n_samples=3)

The fit, transform, and fit_transform methods accept:

  • dask dataframes:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({
... 'A': [0, 3, 6, 9, 12, 15],
... 'B': [1, 4, 7, 10, 13, 16],
... 'C': [2, 5, 8, 11, 14, 17]}), npartitions=1)
>>> y = dd.from_pandas(pd.Series([0, 0, 1, 1, 2, 3], name='TARGET'), npartitions=1)

  • koalas dataframes:

>>> import databricks.koalas as ks
>>> X = ks.DataFrame({
... 'A': [0, 3, 6, 9, 12, 15],
... 'B': [1, 4, 7, 10, 13, 16],
... 'C': [2, 5, 8, 11, 14, 17]})
>>> y = ks.Series([0, 0, 1, 1, 2, 3], name='TARGET')

  • and pandas dataframes:

>>> import pandas as pd
>>> X = pd.DataFrame({
... 'A': [0, 3, 6, 9, 12, 15],
... 'B': [1, 4, 7, 10, 13, 16],
... 'C': [2, 5, 8, 11, 14, 17]})
>>> y = pd.Series([0, 0, 1, 1, 2, 3], name='TARGET')

The result is a transformed dataframe and series belonging to the same dataframe library.

>>> X, y = obj.transform(X, y)
>>> X
    A   B   C
5  15  16  17
2   6   7   8
1   3   4   5
>>> y
5    3
2    1
1    0
Name: TARGET, dtype: int64

transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series]) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.Series, ks.Series, dd.Series]]

Fit and transform the dataframe X and the series y.

Parameters
X : DataFrame

Input dataframe.

y : Series

Input target.

Returns
X : DataFrame

Sampled dataframe.

y : Series

Sampled series.
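To make the sampling behavior concrete, here is a minimal pandas-only sketch of drawing rows from X and keeping the matching entries of y. It is not the gators implementation: the helper name `sample_rows`, the `random_state` argument, and the choice to return the inputs unchanged when `n_samples` is at least the number of rows are all assumptions made for illustration.

```python
import pandas as pd


def sample_rows(X, y, n_samples, random_state=0):
    """Hypothetical helper: keep n_samples random rows of X and the matching y values."""
    # Assumption: nothing to drop when the request covers every row.
    if n_samples >= len(X):
        return X, y
    # Draw rows without replacement; the seed only makes the sketch repeatable.
    X_sampled = X.sample(n=n_samples, random_state=random_state)
    # Select the target values through the sampled index so X and y stay aligned.
    y_sampled = y.loc[X_sampled.index]
    return X_sampled, y_sampled


X = pd.DataFrame({
    'A': [0, 3, 6, 9, 12, 15],
    'B': [1, 4, 7, 10, 13, 16],
    'C': [2, 5, 8, 11, 14, 17]})
y = pd.Series([0, 0, 1, 1, 2, 3], name='TARGET')

X_sampled, y_sampled = sample_rows(X, y, n_samples=3)
```

Selecting y through the sampled index, rather than sampling it separately, keeps each row paired with its own target value, which is the property the example output above illustrates (rows 5, 2, 1 appear in both X and y).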

static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Validate dataframe.

Parameters
X : DataFrame

Input dataframe.

static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Validate target.

Parameters
X : DataFrame

Dataframe.

y : Series

Target values.
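The page does not state which conditions check_dataframe and check_target enforce. The sketch below shows the kind of checks such validators might perform, restricted to pandas for brevity. The function names `check_dataframe_sketch` and `check_target_sketch`, the specific checks, and the error messages are all hypothetical and should not be read as the library's behavior.

```python
import pandas as pd


def check_dataframe_sketch(X):
    """Hypothetical validator: assumes only the container type is checked."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("`X` should be a pandas DataFrame.")


def check_target_sketch(X, y):
    """Hypothetical validator: assumes type and length consistency are checked."""
    if not isinstance(y, pd.Series):
        raise TypeError("`y` should be a pandas Series.")
    if len(X) != len(y):
        raise ValueError("`X` and `y` should have the same number of rows.")


X = pd.DataFrame({'A': [0, 3, 6], 'B': [1, 4, 7]})
y = pd.Series([0, 1, 0], name='TARGET')

# Both calls pass silently for consistent inputs.
check_dataframe_sketch(X)
check_target_sketch(X, y)
```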