Gimel Data API
Contents
- What is gimel?
- Gimel Overview
- APIs & Version Compatibility
- Getting Started
- Gimel Catalog Providers
- Questions
- Contribution Guidelines
- Adding a new connector
What is gimel?
- Gimel is a Big Data Abstraction framework built on Apache Spark and other open-source connectors in the industry.
- Gimel provides a unified Data API to read and write data to various stores.
- Alongside the Data API, Gimel provides a unified SQL access pattern for all stores alike.
- The APIs are available in both Scala and Python (PySpark).
Scala
/* Simple Data API example: read from Kafka, transform & write to Elastic */
// Initiate the API
val dataset = com.paypal.gimel.DataSet(spark)
// Read data | Kafka semantics abstracted from the user.
// Refer to "Gimel Catalog Providers", which abstracts dataset details
val df: DataFrame = dataset.read("kafka_dataset")
// Apply transformations (business logic | abstracted from Gimel)
val transformed_df: DataFrame = df // ...transformations...
// Write data | Elastic semantics abstracted from the user
dataset.write("elastic_dataset", transformed_df)
/* GSQL Reference */
// Create Gimel SQL reference
val gsql: (String) => DataFrame = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch(_: String, spark)
// your SQL
val sql = """
insert into elastic_dataset
select * from kafka_dataset
"""
gsql(sql)
Python | pyspark
# import DataFrame and SparkSession
from pyspark.sql import DataFrame, SparkSession
# fetch reference to the class in JVM
ScalaDataSet = sc._jvm.com.paypal.gimel.DataSet
# fetch reference to java SparkSession
jspark = spark._jsparkSession
# initiate dataset
dataset = ScalaDataSet.apply(jspark)
# Read data | Kafka semantics abstracted from the user
df = dataset.read("kafka_dataset")
# Apply transformations (business logic | abstracted from Gimel)
transformed_df = df  # ...transformations...
# Write data | Elastic semantics abstracted from the user
dataset.write("elastic_dataset", transformed_df)
# fetch reference to GimelQueryProcessor Class in JVM
gsql = sc._jvm.com.paypal.gimel.scaas.GimelQueryProcessor
# your SQL
sql = """
insert into elastic_dataset
select * from kafka_dataset
"""
# Set some properties
gsql.executeBatch("set es.nodes.wan.only=true", jspark)
# Execute GSQL; this can be any SQL of the form "insert into ... select ... join ... where ..."
gsql.executeBatch(sql, jspark)
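The Scala example above binds `executeBatch` into a one-argument `gsql` function; the same convenience can be recovered in Python with a small closure. A minimal sketch (the helper name `make_gsql` is ours, not part of the Gimel API):

```python
# Minimal sketch: bind the JVM GimelQueryProcessor reference and the Java
# SparkSession into a one-argument gsql() helper, mirroring the Scala lambda.
# The name make_gsql is illustrative, not part of the Gimel API.
def make_gsql(query_processor, jspark):
    """Return gsql(sql) bound to the given JVM query processor and session."""
    def gsql(sql):
        # Delegates to GimelQueryProcessor.executeBatch(sql, sparkSession)
        return query_processor.executeBatch(sql, jspark)
    return gsql

# Usage, with the JVM references fetched as shown earlier:
# gsql = make_gsql(sc._jvm.com.paypal.gimel.scaas.GimelQueryProcessor,
#                  spark._jsparkSession)
# gsql("set es.nodes.wan.only=true")
# gsql(sql)
```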
Gimel Overview
- 2020 - Gimel @ Scale By The Bay, Online
- 2020 - Gimel @ Data Orchestration Summit by Alluxio, Online
- 2018 - Gimel @ QCon.ai, SF
Stack & Version Compatibility
| Compute/Storage/Language | Version | Grade | Documentation | Notes |
| --- | --- | --- | --- | --- |
| Scala | 2.12.10 | PRODUCTION | | Data API is built on Scala 2.12.10; the library should be compatible as long as the Spark major versions of the library and the environment match |
| Python | 3.x | PRODUCTION | PySpark Support | Data API works fully well with PySpark as long as the Spark version in the environment matches the Gimel library's |
| Spark | 2.4.7 | PRODUCTION | | This is the recommended version |
| Hadoop | 2.10.0 | PRODUCTION | | This is the recommended version |
| S3 | 1.10.6 | PRODUCTION | S3 Doc | |
| Big Query | 0.17.3 | PRODUCTION | Big Query Doc | |
| Teradata | 14 | PRODUCTION | Teradata Doc | Uses the JDBC connector internally |
| Hive | 2.3.7 | PRODUCTION | Hive Doc | |
| Kafka | 2.1.1 | PRODUCTION | Kafka 2.2 Doc | V2.1.1 is PayPal's supported version of Kafka |
| SFTP | 0.82 | PRODUCTION | SFTP Doc | Read/write files from/to an SFTP server |
| ElasticSearch | 6.2.1 | PRODUCTION | ElasticSearch Doc | |
| Restful/Web-API | NA | PRODUCTION WITH LIMITATIONS | Restful/Web-API Doc | Allows accessing data from any source that supports a REST API |
| Aerospike | 3.1.5 | EXPERIMENTAL | Aerospike Doc | Experimental API for Aerospike reads/writes |
| Cassandra | 2.0 | EXPERIMENTAL | Cassandra Doc | Experimental API for Cassandra reads/writes; leverages the DataStax connector |
| Gimel Serde | 1.0 | PRODUCTION | Gimel Serde Doc | Pluggable Gimel serializers and deserializers |