Best Practices

Pandas, Dask or Koalas?

The choice between Pandas, Dask, and Koalas is dictated by your dataset size. For datasets that fit in memory, Pandas is recommended; otherwise, use Dask or Koalas.
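As a rough illustration of this rule of thumb, the sketch below picks a backend name from an estimated dataset size. The function name `choose_backend`, the `safety_factor` parameter, and its default value are assumptions made for this example; none of them are part of gators.

```python
def choose_backend(dataset_bytes: int, available_memory_bytes: int,
                   safety_factor: float = 3.0) -> str:
    """Return "pandas" when the data comfortably fits in memory,
    and "dask or koalas" otherwise.

    safety_factor is a hypothetical margin: intermediate copies made
    during pre-processing can need several times the raw dataset size.
    """
    if dataset_bytes < 0 or available_memory_bytes <= 0:
        raise ValueError("sizes must be positive")
    if dataset_bytes * safety_factor <= available_memory_bytes:
        return "pandas"
    return "dask or koalas"


# Example: a 2 GB dataset on a machine with 16 GB of free memory.
backend = choose_backend(2 * 1024**3, 16 * 1024**3)
```

The margin is a judgment call; a pipeline that creates many derived columns may need a larger factor than the one assumed here.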

Does the transformation order matter?

Yes. The datetime feature generation steps should be done before any encoding step.

Note

After an encoding transformation, the data is composed only of numerical columns. Any datetime columns should therefore be removed before the encoding step.
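The ordering can be sketched with plain Pandas. This is only an illustration of the sequence (generate datetime features, drop the raw datetime column, then encode); it does not use the gators transformers, and the column names and the ordinal-code encoding are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2021-01-04", "2021-06-19", "2021-12-25"]),
    "plan": ["basic", "pro", "basic"],
})

# 1. Datetime feature generation comes first.
df["signup__month"] = df["signup"].dt.month
df["signup__day_of_week"] = df["signup"].dt.dayofweek  # Monday is 0

# 2. Drop the raw datetime column before encoding.
df = df.drop(columns=["signup"])

# 3. Encode the remaining categorical column
#    (ordinal category codes stand in for a real encoder).
df["plan"] = df["plan"].astype("category").cat.codes

# After encoding, every column should be numerical.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
```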

What are the models currently supported by gators?

Gators mainly focuses on data pre-processing, both offline and in real time, and on model deployment with the treelite package, which compiles tree-based models in C. Only this type of model is currently supported. For deep learning models, the tvm package may be worth considering.

When should the method transform_numpy() be used?

The method transform_numpy() has been designed to speed up the pre-processing in a production-like environment, where the response time of the data pre-processing is critical. It is recommended to run transform_numpy() offline first to validate the production pipeline.
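One way to run such an offline validation is to check that the DataFrame path and the NumPy path return the same values on the same data. The sketch below uses a hypothetical stand-in class, `ScaleByTwo`, which is not a gators transformer; it only assumes that a transformer exposes both transform() for dataframes and transform_numpy() for NumPy arrays.

```python
import numpy as np
import pandas as pd


class ScaleByTwo:
    """Hypothetical stand-in transformer; not a gators class."""

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Offline path: operates on a DataFrame.
        return X * 2.0

    def transform_numpy(self, X: np.ndarray) -> np.ndarray:
        # Production path: operates on a NumPy array.
        return X * 2.0


X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, 1.5, 2.5]})
transformer = ScaleByTwo()

offline = transformer.transform(X).to_numpy()
online = transformer.transform_numpy(X.to_numpy())

# The two paths should agree before the pipeline is deployed.
outputs_match = np.allclose(offline, online)
```

The same comparison could be applied to a full pipeline, as long as the column order of the DataFrame output matches the column order of the array output.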

Why is the method fit_numpy() not defined?

The offline model building steps are only done with Pandas, Dask, or Koalas dataframes, for two reasons. First, the excellent Sklearn package already handles NumPy arrays. Second, NumPy is not suitable for large-scale data.