gators.model_building.TrainTestSplit
class gators.model_building.TrainTestSplit(test_ratio: float, strategy: str, random_state: int = 0)
Splits a dataframe and its target series into train and test sets.
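- Parameters
- test_ratio : float
Proportion of the dataset to include in the test split.
- strategy : str
Train/test split strategy. The possible values are:
ordered: keep the row order; the leading rows form the train set and the trailing rows the test set.
random: assign rows to the two sets at random.
stratified: split within each target class.
- random_state : int
Random seed. Defaults to 0.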
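The three strategies can be pictured on plain row positions. The sketch below is not the gators implementation: `split_positions` is a hypothetical helper, and the rounding rule (`int(n * test_ratio)`) is an assumption made for illustration.

```python
import random
from collections import defaultdict


def split_positions(labels, test_ratio, strategy, random_state=0):
    """Return (train_positions, test_positions) for a list of class labels.

    Hypothetical helper for illustration only; not part of gators.
    """
    n = len(labels)
    positions = list(range(n))
    rng = random.Random(random_state)

    if strategy == "ordered":
        # Keep the row order: the trailing rows form the test set.
        n_test = int(n * test_ratio)
        return positions[: n - n_test], positions[n - n_test:]

    if strategy == "random":
        # Shuffle once, then cut off the test rows.
        rng.shuffle(positions)
        n_test = int(n * test_ratio)
        return positions[n_test:], positions[:n_test]

    if strategy == "stratified":
        # Split each class separately so every class appears in both sets.
        by_class = defaultdict(list)
        for pos, label in enumerate(labels):
            by_class[label].append(pos)
        train, test = [], []
        for members in by_class.values():
            rng.shuffle(members)
            n_test = int(len(members) * test_ratio)
            test.extend(members[:n_test])
            train.extend(members[n_test:])
        return train, test

    raise ValueError(f"Unknown strategy: {strategy!r}")


labels = [0, 1, 2, 0, 1, 2, 0, 1]
train, test = split_positions(labels, test_ratio=0.5, strategy="ordered")
print(train, test)  # [0, 1, 2, 3] [4, 5, 6, 7]
```

With these labels and this rounding rule, the stratified branch puts one row of each class in the test set, so the two sets do not have to be the same size even when `test_ratio=0.5`.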
Notes
Note that the random and stratified strategies will give different results for pandas and koalas.
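Because the exact rows selected can differ between backends, it may be more robust to compare splits through properties that hold regardless of the backend rather than through the selected rows. A minimal sketch, assuming the index labels are unique; `is_consistent_split` and its `tol` parameter are hypothetical and not part of gators.

```python
def is_consistent_split(all_index, train_index, test_index, test_ratio, tol=0.15):
    """Check backend-independent properties of a train/test split.

    Hypothetical helper for illustration only; not part of gators.
    """
    train, test = set(train_index), set(test_index)
    # No row may appear in both sets.
    disjoint = train.isdisjoint(test)
    # Every row must appear in one of the two sets.
    complete = (train | test) == set(all_index)
    # The test share should be close to the requested ratio.
    share_ok = abs(len(test) / len(all_index) - test_ratio) <= tol
    return disjoint and complete and share_ok


# Index labels of an 8-row dataset split into 4 train and 4 test rows.
print(is_consistent_split(range(8), [6, 2, 1, 7], [0, 3, 4, 5], test_ratio=0.5))  # True
```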
Examples
Imports and initialization:
>>> from gators.model_building import TrainTestSplit
>>> obj = TrainTestSplit(test_ratio=0.5, strategy='ordered')
The transform method accepts:
dask dataframes:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> X = dd.from_pandas(pd.DataFrame({
...     'A': [0, 3, 6, 9, 12, 15, 18, 21],
...     'B': [1, 4, 7, 10, 13, 16, 19, 22],
...     'C': [2, 5, 8, 11, 14, 17, 20, 23]}), npartitions=1)
>>> y = dd.from_pandas(pd.Series([0, 1, 2, 0, 1, 2, 0, 1], name='TARGET'), npartitions=1)
koalas dataframes:
>>> import databricks.koalas as ks
>>> X = ks.DataFrame({
...     'A': [0, 3, 6, 9, 12, 15, 18, 21],
...     'B': [1, 4, 7, 10, 13, 16, 19, 22],
...     'C': [2, 5, 8, 11, 14, 17, 20, 23]})
>>> y = ks.Series([0, 1, 2, 0, 1, 2, 0, 1], name='TARGET')
and pandas dataframes:
>>> import pandas as pd
>>> X = pd.DataFrame({
...     'A': [0, 3, 6, 9, 12, 15, 18, 21],
...     'B': [1, 4, 7, 10, 13, 16, 19, 22],
...     'C': [2, 5, 8, 11, 14, 17, 20, 23]})
>>> y = pd.Series([0, 1, 2, 0, 1, 2, 0, 1], name='TARGET')
The result is a tuple of train and test dataframes and series belonging to the same dataframe library as the inputs.
ordered strategy:
>>> obj = TrainTestSplit(test_ratio=0.5, strategy='ordered')
>>> X_train, X_test, y_train, y_test = obj.transform(X, y)
>>> X_train
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> X_test
    A   B   C
4  12  13  14
5  15  16  17
6  18  19  20
7  21  22  23
>>> y_train
0    0
1    1
2    2
3    0
Name: TARGET, dtype: int64
>>> y_test
4    1
5    2
6    0
7    1
Name: TARGET, dtype: int64
random strategy:
>>> obj = TrainTestSplit(test_ratio=0.5, strategy='random')
>>> X_train, X_test, y_train, y_test = obj.transform(X, y)
>>> X_train
    A   B   C
6  18  19  20
2   6   7   8
1   3   4   5
7  21  22  23
>>> X_test
    A   B   C
0   0   1   2
3   9  10  11
4  12  13  14
5  15  16  17
>>> y_train
6    0
2    2
1    1
7    1
Name: TARGET, dtype: int64
>>> y_test
0    0
3    0
4    1
5    2
Name: TARGET, dtype: int64
stratified strategy:
>>> obj = TrainTestSplit(test_ratio=0.5, strategy='stratified')
>>> X_train, X_test, y_train, y_test = obj.transform(X, y)
>>> X_train
    A   B   C
7  21  22  23
6  18  19  20
3   9  10  11
4  12  13  14
5  15  16  17
>>> X_test
   A  B  C
2  6  7  8
1  3  4  5
0  0  1  2
>>> y_train
7    1
6    0
3    0
4    1
5    2
Name: TARGET, dtype: int64
>>> y_test
2    2
1    1
0    0
Name: TARGET, dtype: int64
transform(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series]) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.Series, ks.Series, dd.Series], Union[pd.Series, ks.Series, dd.Series]]
Split the dataframe and series into train and test sets.
- Parameters
- X : DataFrame
Dataframe.
- y : Series
Target values.
- test_ratio : float
Ratio of data points used for the test set.
- Returns
- X_train : DataFrame
Train set X data.
- X_test : DataFrame
Test set X data.
- y_train : Series
Train set y data.
- y_test : Series
Test set y data.
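The split helpers documented on this page take a single `Xy` dataframe, which suggests that features and target are joined before splitting and separated again afterwards. The pandas-only sketch below illustrates that flow for the ordered strategy. It is an assumption about the overall shape, not the gators source; `ordered_transform` and the `int(n * test_ratio)` rounding are hypothetical.

```python
import pandas as pd


def ordered_transform(X, y, test_ratio):
    """Hypothetical pandas-only stand-in for an ordered train/test split."""
    # Assumption: join features and target so both are split consistently.
    Xy = X.join(y)
    n_test = int(len(Xy) * test_ratio)
    n_train = len(Xy) - n_test
    # Leading rows go to the train set, trailing rows to the test set.
    Xy_train, Xy_test = Xy.iloc[:n_train], Xy.iloc[n_train:]
    # Separate features and target again.
    X_train, y_train = Xy_train.drop(columns=y.name), Xy_train[y.name]
    X_test, y_test = Xy_test.drop(columns=y.name), Xy_test[y.name]
    return X_train, X_test, y_train, y_test


X = pd.DataFrame({"A": [0, 3, 6, 9], "B": [1, 4, 7, 10]})
y = pd.Series([0, 1, 0, 1], name="TARGET")
X_train, X_test, y_train, y_test = ordered_transform(X, y, test_ratio=0.5)
print(list(X_train.index), list(X_test.index))  # [0, 1] [2, 3]
```

The join relies on `y` having a name, which then serves as the target column to drop and select.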
ordered_split(Xy: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]]
Perform ordered split.
- Parameters
- Xy : DataFrame
Dataframe.
- Returns
- Xy_train : DataFrame
Train set.
- Xy_test : DataFrame
Test set.
stratified_split(Xy: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y_name: str) → Tuple[Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], Union[pd.DataFrame, ks.DataFrame, dd.DataFrame]]
Perform stratified split.
- Parameters
- Xy : DataFrame
Dataframe.
- y_name : str
Target name.
- Returns
- Xy_train : DataFrame
Train set.
- Xy_test : DataFrame
Test set.
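A stratified split can be sketched in pandas by sampling the test rows class by class. This is one common way to do it and is not taken from the gators source; `stratified_split_sketch` is a hypothetical name, and the sketch assumes `Xy` has a unique index.

```python
import pandas as pd


def stratified_split_sketch(Xy, y_name, test_ratio, random_state=0):
    """Hypothetical pandas-only illustration of a stratified split."""
    # Sample the test rows one class at a time so that each class keeps
    # roughly the same share in both sets.
    test_parts = [
        group.sample(frac=test_ratio, random_state=random_state)
        for _, group in Xy.groupby(y_name)
    ]
    Xy_test = pd.concat(test_parts)
    # Everything not sampled for the test set forms the train set
    # (this relies on the index being unique).
    Xy_train = Xy.drop(index=Xy_test.index)
    return Xy_train, Xy_test


Xy = pd.DataFrame({
    "A": list(range(8)),
    "TARGET": [0, 0, 0, 0, 1, 1, 1, 1],
})
Xy_train, Xy_test = stratified_split_sketch(Xy, "TARGET", test_ratio=0.5)
print(Xy_test["TARGET"].value_counts().to_dict())
```

Each class here has four rows, so half of each class lands in the test set; with odd class sizes the per-class counts depend on rounding.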
static check_dataframe(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame])
Validate dataframe.
- Parameters
- X : DataFrame
Input dataframe.
static check_target(X: Union[pd.DataFrame, ks.DataFrame, dd.DataFrame], y: Union[pd.Series, ks.Series, dd.Series])
Validate target.
- Parameters
- X : DataFrame
Dataframe.
- y : Series
Target values.
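This page does not spell out which conditions the two validators enforce. The pandas-only sketch below shows the kind of checks such helpers commonly perform before a split. The function name and every rule in it are assumptions, not the gators behavior.

```python
import pandas as pd


def check_inputs_sketch(X, y):
    """Hypothetical pandas-only input checks; the real rules may differ."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("`X` should be a DataFrame.")
    if not isinstance(y, pd.Series):
        raise TypeError("`y` should be a Series.")
    # A named target can be joined to X and selected again afterwards.
    if y.name is None:
        raise ValueError("`y` should have a name.")
    if len(X) != len(y):
        raise ValueError("`X` and `y` should have the same length.")


X = pd.DataFrame({"A": [0, 3, 6, 9]})
y = pd.Series([0, 1, 0, 1], name="TARGET")
check_inputs_sketch(X, y)  # valid inputs pass silently
```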