About Gators
Gators was created to help data scientists:
Perform in-memory and out-of-core data pre-processing for model building.
Run fast real-time pre-processing and model scoring.
History of development
Gators development began at Simility in 2018, and the library was open-sourced in 2021.
Timeline
2018: Development of Gators started.
2020: The Dask, Koalas, and Cython packages were added to tackle out-of-core datasets and fast real-time pre-processing.
2021: Gators was open-sourced.
Library Highlights
Data pre-processing uses the same interface for both in-memory and out-of-core datasets.
With Cython, real-time data pre-processing runs on NumPy arrays as compiled C code, giving response times comparable to those of compiled languages.
Our Vision
A world where data scientists can develop their models and push them into production using only Python, even when there are a large number of queries per second (QPS).
Python packages leveraged in Gators
“If I have seen further it is by standing on the shoulders of giants.”
Sir Isaac Newton
Gators uses a variety of libraries internally at each step of the model building process.
The libraries are listed below.
Data pre-processing
pandas, the well-known package for data analysis, is used for data pre-processing during the model building phase. It should be used as long as the data fits in memory.
Koalas is one of the two libraries chosen to handle pre-processing when the data does not fit in memory.
Dask is the other library that can handle pre-processing when the data does not fit in memory.
NumPy is used in the production environment when the pre-processing needs to be as fast as possible.
In the production environment, the pre-processing will be done by pre-compiled Cython code on NumPy arrays.
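The split described above, DataFrames during model building and NumPy arrays in production, can be sketched as follows. This is a minimal illustration and not the Gators API: the `MeanImputer` class and its method names are hypothetical, and plain NumPy stands in for the pre-compiled Cython code.

```python
import numpy as np
import pandas as pd


class MeanImputer:
    """Hypothetical transformer used for illustration; not part of Gators."""

    def fit(self, X: pd.DataFrame) -> "MeanImputer":
        # Model building phase: learn one fill value per column.
        self.columns_ = list(X.columns)
        self.means_ = X.mean().to_numpy(dtype=float)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Offline path: operate on a DataFrame that fits in memory.
        return X.fillna(dict(zip(self.columns_, self.means_)))

    def transform_numpy(self, X: np.ndarray) -> np.ndarray:
        # Online path: same logic on a raw array, without DataFrame overhead.
        return np.where(np.isnan(X), self.means_, X)


train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
imputer = MeanImputer().fit(train)

offline = imputer.transform(train)                           # batch of rows
online = imputer.transform_numpy(np.array([[np.nan, 5.0]]))  # single request
```

Both paths share the statistics learned in `fit`, so a row scored in production is processed the same way as a row seen during model building.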
Model building
scikit-learn, the best-known package for model building, is used for cross-validation and model evaluation.
XGBoost is a decision tree-based package used for model building. Its algorithm applies level-wise tree growth.
LightGBM is a decision tree-based package used for model building. Its algorithm applies leaf-wise tree growth.
Treelite is used to compile the trained models to C before they are deployed in production, and treelite-runtime is used for real-time model scoring.
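The difference between level-wise and leaf-wise tree growth mentioned above can be shown with a toy example. This is only a sketch of the two growth orders, not the XGBoost or LightGBM implementation: the split gains in `GAINS` are made-up values fixed in advance, whereas real libraries compute them from the data at each step.

```python
import heapq

# Hypothetical split gains for a binary tree, indexed heap-style:
# the children of node i are 2*i + 1 and 2*i + 2.
GAINS = {0: 10.0, 1: 2.0, 2: 8.0, 3: 1.0, 4: 0.5, 5: 6.0, 6: 3.0}


def level_wise(gains, n_splits):
    """Split nodes depth by depth, left to right."""
    order, frontier = [], [0]
    while frontier and len(order) < n_splits:
        next_frontier = []
        for node in frontier:
            if len(order) == n_splits:
                break
            if node in gains:
                order.append(node)
                next_frontier += [2 * node + 1, 2 * node + 2]
        frontier = next_frontier
    return order


def leaf_wise(gains, n_splits):
    """Always split the current leaf with the largest gain."""
    order, heap = [], [(-gains[0], 0)]  # max-heap via negated gains
    while heap and len(order) < n_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return order


level_order = level_wise(GAINS, 4)  # [0, 1, 2, 3]
leaf_order = leaf_wise(GAINS, 4)    # [0, 2, 5, 6]
```

With the same budget of four splits, the level-wise order fills the tree one depth at a time, while the leaf-wise order follows the highest-gain branch and produces a deeper, less balanced tree.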