PySpark / Python Support

Launch PySpark

When launching pyspark, pass the Gimel JAR via the --jars option.

pyspark --jars hdfs://gimel_jar
# or
pyspark --jars file://gimel_jar
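
If more than one JAR is needed, --jars takes a single comma-separated value. A minimal sketch of assembling the launch command from Python follows; the helper name `pyspark_command` and the second JAR location are hypothetical placeholders, not part of Gimel.

```python
def pyspark_command(jars):
    """Build a `pyspark --jars ...` argument list from JAR URIs.

    --jars expects one comma-separated value, so the URIs are joined with ",".
    """
    if not jars:
        raise ValueError("at least one JAR URI is required")
    return ["pyspark", "--jars", ",".join(jars)]


# Hypothetical JAR locations, used only for illustration.
cmd = pyspark_command(["hdfs://gimel_jar", "file://extra_jar"])
print(" ".join(cmd))
```

The returned list can be handed to a process launcher as-is, which avoids shell quoting issues in the JAR paths.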

Using Gimel Data API in PySpark

# import DataFrame and SparkSession
from pyspark.sql import DataFrame, SparkSession

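# fetch reference to the class in JVM
# (class path assumed to be com.paypal.gimel.DataSet; adjust if your Gimel build differs)
ScalaDataSet = spark._jvm.com.paypal.gimel.DataSet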

# fetch reference to java SparkSession
jspark = spark._jsparkSession

# initiate dataset
dataset = ScalaDataSet.apply(jspark)

# invoking the read API gives reference to the scala DataFrame result-set
scala_df = dataset.read("pcatalog.your_dataset", "")

# passing options - an example
scala_df_with_options = dataset.read("pcatalog.your_dataset", "gimel.kafka.throttle.batch.fetchRowsOnFirstRun=1")

# convert to pyspark dataframe
python_df = DataFrame(scala_df,jspark)

# from here on, use the regular PySpark DataFrame API to work with the data
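
The read-then-convert steps above can be folded into one helper. This is a sketch under stated assumptions: the name `read_as_python_df` is hypothetical, the dataset object is assumed to expose the read API as `read(name, options)`, and the wrapper is injected so the example runs without a Spark cluster.

```python
def read_as_python_df(dataset, name, options, session, wrap):
    """Read a Gimel dataset and wrap the JVM result for use from Python.

    dataset : object exposing read(name, options), e.g. ScalaDataSet.apply(jspark)
    options : "key=value" string, or "" for no options
    session : session handle handed to the wrapper (jspark in the steps above)
    wrap    : callable(jvm_df, session); in PySpark this would be pyspark.sql.DataFrame
    """
    jvm_df = dataset.read(name, options)
    return wrap(jvm_df, session)


# Stand-in so the sketch runs without Spark (hypothetical, not a Gimel class).
class FakeDataSet:
    def read(self, name, options):
        return {"name": name, "options": options}


result = read_as_python_df(
    FakeDataSet(), "pcatalog.your_dataset", "", session="jspark",
    wrap=lambda jvm_df, session: (jvm_df, session),
)
print(result)
```

In a live session the call would look like read_as_python_df(dataset, "pcatalog.your_dataset", "", jspark, DataFrame).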

Using Gimel SQL in PySpark

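# fetch reference to GimelQueryProcessor Class in JVM
# (class path assumed to be com.paypal.gimel.sql.GimelQueryProcessor; adjust if your Gimel build differs)
gsql = spark._jvm.com.paypal.gimel.sql.GimelQueryProcessor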

# fetch reference to java SparkSession
jspark = spark._jsparkSession

# your SQL
sql = "select * from pcatalog.your_dataset limit 5"

# execute GSQL; the statement can be any SQL of the form "insert into ... select .. join ... where .."
gsql.executeBatch(sql, jspark)

# or execute GSQL and keep a reference to the resulting scala DataFrame
scala_df = gsql.executeBatch(sql, jspark)

# convert to pyspark dataframe
python_df = DataFrame(scala_df, jspark)

# from here on, use the regular PySpark DataFrame API to work with the data
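
Every call to executeBatch submits the statement again, which matters for statements with side effects such as "insert into". A minimal sketch of a helper that runs the statement once and keeps the result follows; `run_gsql` and the stand-in class are hypothetical names, and the wrapper is injected so the example runs without Spark.

```python
def run_gsql(gsql, sql, session, wrap):
    """Execute one Gimel SQL statement once and wrap the JVM result.

    gsql    : object exposing executeBatch(sql, session)
    session : session handle passed through (jspark in the steps above)
    wrap    : callable(jvm_df, session); in PySpark this would be pyspark.sql.DataFrame
    """
    jvm_df = gsql.executeBatch(sql, session)
    return wrap(jvm_df, session)


# Stand-in that records each execution (hypothetical, not a Gimel class).
class FakeGsql:
    def __init__(self):
        self.executed = []

    def executeBatch(self, sql, session):
        self.executed.append(sql)
        return "rows for: " + sql


fake = FakeGsql()
df = run_gsql(fake, "select * from pcatalog.your_dataset limit 5", "jspark",
              wrap=lambda jvm_df, session: jvm_df)
print(len(fake.executed))
```

In a live session the call would look like run_gsql(gsql, sql, jspark, DataFrame).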

Sample Read and Write illustration

# DataSet
dataset = ScalaDataSet.apply(jspark)

# Read from HDFS
hdfs_data = DataFrame(dataset.read("pcatalog.flights_hdfs", ""), jspark)

# Illustration Count
hdfs_data.count()
# 1398164

# Read from Kafka
kafka_data = DataFrame(dataset.read("pcatalog.flights_kafka", ""), jspark)

# Illustration Count
kafka_data.count()
# 0

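# Write to Kafka
# (call shape assumed: write(dataset_name, scala_dataframe, options); _jdf is the underlying scala DataFrame)
dataset.write("pcatalog.flights_kafka", hdfs_data._jdf, "")
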
# Read from Kafka post-write
kafka_data = DataFrame(dataset.read("pcatalog.flights_kafka", ""), jspark)

# Illustration Count
kafka_data.count()
# 1398164

# Sample Data

#|               68.0|     20304|    39.0|             null|      0.0|     OO|         null|            62.0| ORD|   Chicago, IL|   157.0|             1|     0.0|    1.0|2017-10-01T00:00:...|  2936|               null|     null|   FWA|  Fort Wayne, IN|          null|  N464SW|            OO|         null|   10|2017|
#|               63.0|     20304|    35.0|             null|      0.0|     OO|         null|            60.0| ORD|   Chicago, IL|   137.0|             1|     0.0|    1.0|2017-10-01T00:00:...|  2940|               null|     null|   GRR|Grand Rapids, MI|          null|  N727SK|            OO|         null|   10|2017|
#|               65.0|     20304|    42.0|             null|      0.0|     OO|         null|            72.0| ALO|  Waterloo, IA|   234.0|             1|     0.0|    1.0|2017-10-01T00:00:...|  2942|               null|     null|   ORD|     Chicago, IL|          null|  N423SW|            OO|         null|   10|2017|
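
The counts in the illustration above can be reconciled with a small check. This is a sketch only: `write_is_complete` is a hypothetical helper, and it assumes the post-write Kafka read returns every record that was written, which in practice depends on reader offsets and throttle options.

```python
def write_is_complete(source_count, target_before, target_after):
    """Return True when the target grew by exactly the number of source rows."""
    return target_after - target_before == source_count


# Counts taken from the illustration: HDFS rows, Kafka before write, Kafka after write.
print(write_is_complete(1398164, 0, 1398164))
```

In a live session the three arguments would come from hdfs_data.count() and the two kafka_data.count() calls.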