Gimel Data API

Contents

- What is gimel?
- Gimel Overview
- APIs & Version Compatibility
- Getting Started
- Gimel Catalog Providers
- Questions
- Contribution Guidelines
- Adding a new connector


What is gimel?

Scala

/* Simple Data API example: read from Kafka, transform & write to Elastic */

// Initialize the Data API
val dataset = com.paypal.gimel.DataSet(spark)

// Read Data | Kafka semantics abstracted for the user.
// Refer to "Gimel Catalog Providers", which abstracts dataset details
val df: DataFrame = dataset.read("kafka_dataset")

// Apply transformations (business logic, opaque to Gimel)
val transformed_df: DataFrame = df(...transformations...)

// Write Data | Elastic semantics abstracted for the user
dataset.write("elastic_dataset", transformed_df)

/* GSQL Reference */

// Create Gimel SQL reference
val gsql: (String) => DataFrame = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch(_: String, spark)

// your SQL
val sql = """
insert into elastic_dataset
select * from kafka_dataset
"""

gsql(sql)
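
// Storage-level properties can be set through the same GSQL handle before
// running a query (mirroring the PySpark example below; es.nodes.wan.only is
// an elasticsearch-hadoop setting, shown here for illustration)
gsql("set es.nodes.wan.only=true")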

Python | pyspark


# import DataFrame and SparkSession
from pyspark.sql import DataFrame, SparkSession, SQLContext

# fetch a reference to the DataSet class in the JVM
ScalaDataSet = sc._jvm.com.paypal.gimel.DataSet

# fetch a reference to the underlying Java SparkSession
jspark = spark._jsparkSession

# initialize the dataset
dataset = ScalaDataSet.apply(jspark)

# Read Data | Kafka semantics abstracted for the user
df = dataset.read("kafka_dataset")

# Apply transformations (business logic, opaque to Gimel)
transformed_df = df(...transformations...)

# Write Data | Elastic semantics abstracted for the user
dataset.write("elastic_dataset", transformed_df)

# fetch a reference to the GimelQueryProcessor class in the JVM
gsql = sc._jvm.com.paypal.gimel.sql.GimelQueryProcessor

# your SQL
sql = """
insert into elastic_dataset
select * from kafka_dataset
"""

# Set some props
gsql.executeBatch("set es.nodes.wan.only=true", jspark)

# execute GSQL; this can be any SQL of the form "insert into ... select ... join ... where ..."
gsql.executeBatch(sql, jspark)
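
# executeBatch returns a JVM DataFrame (see the Scala signature above), so a
# select's result can be wrapped for Python-side use; a sketch using standard
# pyspark wrapping, assuming the returned object is a Java DataFrame
jdf = gsql.executeBatch("select * from kafka_dataset", jspark)
result_df = DataFrame(jdf, SQLContext(sc))
result_df.show()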

Gimel Overview

2020 - Gimel @ Scale By The Bay, Online

2020 - Gimel @ Data Orchestration Summit By Alluxio, Online

2018 - Gimel @ QCon.ai, SF


Stack & Version Compatibility

| Compute/Storage/Language | Version | Grade | Documentation | Notes |
| ------------------------ | ------- | ----- | ------------- | ----- |
| Scala | 2.12.10 | PRODUCTION | | Data API is built on Scala 2.12.10; regardless, the library should be compatible as long as the Spark major version of the library and the environment match |
| Python | 3.x | PRODUCTION | PySpark Support | Data API works fully well with PySpark as long as the Spark version in the environment and the Gimel library match |
| Spark | 2.4.7 | PRODUCTION | | This is the recommended version |
| Hadoop | 2.10.0 | PRODUCTION | | This is the recommended version |
| S3 | 1.10.6 | PRODUCTION | S3 Doc | |
| Big Query | 0.17.3 | PRODUCTION | Big Query Doc | |
| Teradata | 14 | PRODUCTION | Teradata Doc | Uses the JDBC connector internally |
| Hive | 2.3.7 | PRODUCTION | Hive Doc | |
| Kafka | 2.1.1 | PRODUCTION | Kafka 2.2 Doc | V2.1.1 is PayPal's supported version of Kafka |
| SFTP | 0.82 | PRODUCTION | SFTP Doc | Read/write files from/to an SFTP server |
| ElasticSearch | 6.2.1 | PRODUCTION | ElasticSearch Doc | |
| Restful/Web-API | NA | PRODUCTION WITH LIMITATIONS | Restful/Web-API Doc | Allows accessing data from any source supporting a REST API |
| Aerospike | 3.1.5 | EXPERIMENTAL | Aerospike Doc | Experimental API for Aerospike reads/writes |
| Cassandra | 2.0 | EXPERIMENTAL | Cassandra Doc | Experimental API for Cassandra reads/writes; leverages the DataStax connector |
| Gimel Serde | 1.0 | PRODUCTION | Gimel Serde Doc | Pluggable Gimel serializers and deserializers |
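
To keep Spark major versions aligned, depend on the Gimel build that matches your cluster. A minimal build.sbt sketch follows; the coordinates are illustrative, not authoritative, so check the Getting Started page for the actual artifact names and versions:

// build.sbt; hypothetical coordinates, shown only to illustrate pinning the
// Gimel build to the cluster's Spark major version
libraryDependencies += "com.paypal.gimel" % "gimel-tools" % "2.4.7-SNAPSHOT"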

Questions