gators.model_building.TrainTestSplit

class gators.model_building.TrainTestSplit(test_ratio: float, strategy: str, random_state: int = 0)

Split a dataframe and its target series into train and test sets.

Parameters
test_ratio : float

Proportion of the dataset to include in the test split.

strategy : str

Train/Test split strategy. The possible values are:

  • ordered

  • random

  • stratified

random_state : int, default 0

Random state (seed) used to make the split reproducible.

Notes

Note that the random and stratified strategies will give different results for pandas and koalas.
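The role of the seed can be illustrated without any dataframe library. The following is a minimal sketch in plain Python, not the gators implementation: the helper name `random_split_positions`, the use of row positions instead of dataframes, and the floor rounding of the test size are all assumptions made for illustration.

```python
import random


def random_split_positions(n_rows, test_ratio, random_state=0):
    """Shuffle row positions with a seeded generator, then cut once.

    Hypothetical helper for illustration only; not part of gators.
    """
    positions = list(range(n_rows))
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    random.Random(random_state).shuffle(positions)
    n_test = int(n_rows * test_ratio)  # assumed rounding rule: floor
    test_positions = positions[:n_test]
    train_positions = positions[n_test:]
    return train_positions, test_positions


train_pos, test_pos = random_split_positions(8, 0.5, random_state=0)
same_train, same_test = random_split_positions(8, 0.5, random_state=0)
print(train_pos == same_train and test_pos == same_test)  # True
```

The same seed always yields the same split within one generator. A different generator, as in another dataframe library, would generally select different rows, which is consistent with the note above.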

Examples

Imports and initialization:

>>> from gators.model_building import TrainTestSplit
>>> obj = TrainTestSplit(test_ratio=0.5, strategy='ordered')

The transform method accepts:

  • dask dataframes:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({
... 'A': [0, 3, 6, 9, 12, 15, 18, 21],
... 'B': [1, 4, 7, 10, 13, 16, 19, 22],
... 'C': [2, 5, 8, 11, 14, 17, 20, 23]}), npartitions=1)
>>> y = dd.from_pandas(pd.Series([0, 1, 2, 0, 1, 2, 0, 1], name='TARGET'), npartitions=1)
  • koalas dataframes:

>>> import databricks.koalas as ks
>>> X = ks.DataFrame({
... 'A': [0, 3, 6, 9, 12, 15, 18, 21],
... 'B': [1, 4, 7, 10, 13, 16, 19, 22],
... 'C': [2, 5, 8, 11, 14, 17, 20, 23]})
>>> y = ks.Series([0, 1, 2, 0, 1, 2, 0, 1], name='TARGET')
  • and pandas dataframes:

>>> import pandas as pd
>>> X = pd.DataFrame({
... 'A': [0, 3, 6, 9, 12, 15, 18, 21],
... 'B': [1, 4, 7, 10, 13, 16, 19, 22],
... 'C': [2, 5, 8, 11, 14, 17, 20, 23]})
>>> y = pd.Series([0, 1, 2, 0, 1, 2, 0, 1], name='TARGET')

The result is a train and a test dataframe together with a train and a test series, all belonging to the same dataframe library as the inputs.

  • ordered strategy:

>>> obj = TrainTestSplit(test_ratio=0.5, strategy='ordered')
>>> X_train, X_test, y_train, y_test = obj.transform(X, y)
>>> X_train
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> X_test
    A   B   C
4  12  13  14
5  15  16  17
6  18  19  20
7  21  22  23
>>> y_train
0    0
1    1
2    2
3    0
Name: TARGET, dtype: int64
>>> y_test
4    1
5    2
6    0
7    1
Name: TARGET, dtype: int64
  • random strategy:

>>> obj = TrainTestSplit(test_ratio=0.5, strategy='random')
>>> X_train, X_test, y_train, y_test = obj.transform(X, y)
>>> X_train
    A   B   C
6  18  19  20
2   6   7   8
1   3   4   5
7  21  22  23
>>> X_test
    A   B   C
0   0   1   2
3   9  10  11
4  12  13  14
5  15  16  17
>>> y_train
6    0
2    2
1    1
7    1
Name: TARGET, dtype: int64
>>> y_test
0    0
3    0
4    1
5    2
Name: TARGET, dtype: int64
  • stratified strategy:

>>> obj = TrainTestSplit(test_ratio=0.5, strategy='stratified')
>>> X_train, X_test, y_train, y_test = obj.transform(X, y)
>>> X_train
    A   B   C
7  21  22  23
6  18  19  20
3   9  10  11
4  12  13  14
5  15  16  17
>>> X_test
   A  B  C
2  6  7  8
1  3  4  5
0  0  1  2
>>> y_train
7    1
6    0
3    0
4    1
5    2
Name: TARGET, dtype: int64
>>> y_test
2    2
1    1
0    0
Name: TARGET, dtype: int64
transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series]) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.Series, ks.Series, dd.Series], Union[pd.Series, ks.Series, dd.Series]]

Split the dataframe and the target series into train and test sets.

Parameters
X : DataFrame

Dataframe.

y : Series

Target values.

Returns
X_train : DataFrame

Train set X data.

X_test : DataFrame

Test set X data.

y_train : Series

Train set y data.

y_test : Series

Test set y data.

ordered_split(Xy: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]]

Perform ordered split.

Parameters
Xy : DataFrame

Dataframe containing the features and the target.

Returns
Xy_train : DataFrame

Train set.

Xy_test : DataFrame

Test set.
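As a rough illustration of what an ordered split means, the sketch below cuts the row positions once, keeping the original order. It is plain Python and not the gators implementation: the helper name `ordered_split_positions` and the floor rounding of the test size are assumptions. With 8 rows and a test ratio of 0.5 it reproduces the row indices shown in the ordered example above.

```python
def ordered_split_positions(n_rows, test_ratio):
    """Return (train_positions, test_positions) without shuffling.

    Hypothetical helper for illustration only; not part of gators.
    """
    n_test = int(n_rows * test_ratio)  # assumed rounding rule: floor
    n_train = n_rows - n_test
    # The first rows go to the train set, the remaining rows to the test set.
    train_positions = list(range(n_train))
    test_positions = list(range(n_train, n_rows))
    return train_positions, test_positions


train_pos, test_pos = ordered_split_positions(8, 0.5)
print(train_pos)  # [0, 1, 2, 3]
print(test_pos)   # [4, 5, 6, 7]
```

Because no rows are reordered, this kind of split is deterministic and does not depend on a random state.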

stratified_split(Xy: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y_name: str) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]]

Perform stratified split.

Parameters
Xy : DataFrame

Dataframe containing the features and the target.

y_name : str

Name of the target column.

Returns
Xy_train : DataFrame

Train set.

Xy_test : DataFrame

Test set.
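The idea behind a stratified split can be sketched as splitting each target class separately and concatenating the pieces. The code below is plain Python and not the gators implementation: the helper name `stratified_split_positions`, the absence of shuffling within a class, and the per-class floor rounding are assumptions. Under that rounding assumption, the target used in the examples above gives 5 train rows and 3 test rows for a test ratio of 0.5, the same sizes as in the stratified example.

```python
from collections import defaultdict


def stratified_split_positions(labels, test_ratio):
    """Split row positions class by class.

    Hypothetical helper for illustration only; not part of gators.
    """
    positions_by_class = defaultdict(list)
    for position, label in enumerate(labels):
        positions_by_class[label].append(position)

    train_positions, test_positions = [], []
    for positions in positions_by_class.values():
        # Assumed rounding rule: floor, applied to each class separately.
        n_test = int(len(positions) * test_ratio)
        test_positions.extend(positions[:n_test])
        train_positions.extend(positions[n_test:])
    return train_positions, test_positions


labels = [0, 1, 2, 0, 1, 2, 0, 1]  # same target as in the examples
train_pos, test_pos = stratified_split_positions(labels, 0.5)
print(len(train_pos), len(test_pos))  # 5 3
```

Rounding per class rather than on the whole dataset is why the overall split need not match `test_ratio` exactly on small inputs.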

static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])

Validate dataframe.

Parameters
X : DataFrame

Input dataframe.

static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])

Validate target.

Parameters
X : DataFrame

Dataframe.

y : Series

Target values.