PySpark / Python Support
- Gimel Data API is fully compatible with pyspark, although the library itself is built in Scala.
- Pyspark provides an extremely powerful feature to tap into the JVM, and thus get a reference to all Java/Scala classes/objects in the JVM.
- This mean, by adding the JAR (Gimel Library) at runtime - you may leverage all the features of Gimel.
- Please see below illustration of how you may develop on PySpark, but still leverage Gimel Data API or Gimel SQL in their entirety.
Launch PySpark
While you launch pyspark - add the Gimel Jar.
export SPARK_MAJOR_VERSION=2
pyspark --jars hdfs://gimel_jar
# or
pyspark --jars file://gimel_jar
Using Gimel Data API in PySpark
# import DataFrame and SparkSession
from pyspark.sql import DataFrame, SparkSession
# fetch reference to the class in JVM
ScalaDataSet = sc._jvm.com.paypal.gimel.DataSet
# fetch reference to java SparkSession
jspark = spark._jsparkSession
# initiate dataset
dataset = ScalaDataSet.apply(jspark)
# invoking the read API gives reference to the scala DataFrame result-set
scala_df = dataset.read("pcatalog.your_dataset","")
# passing options - an example
scala_df_with_options = dataset.read("pcatalog.your_dataset","gimel.kafka.throttle.batch.fetchRowsOnFirstRun=1")
# convert to pyspark dataframe
python_df = DataFrame(scala_df,jspark)
# from now - you may use regular pyspark lingua to play with the data
python_df.show(10)
Using Gimel SQL in PySpark
# fetch reference to GimelQueryProcessor Class in JVM
gsql = sc._jvm.com.paypal.gimel.sql.GimelQueryProcessor
# fetch reference to java SparkSession
jspark = spark._jsparkSession
# your SQL
sql = "select * from pcatalog.your_dataset limit 5"
# execute GSQL, this can be any sql of type "insert into ... select .. join ... where .."
gsql.executeBatch(sql, jspark)
# execute GSQL, and get reference to resulting dataset of the SQL
scala_df = gsql.executeBatch(sql, jspark)
# convert to pyspark dataframe
python_df = DataFrame(scala_df, jspark)
# from now - you may use regular pyspark lingua to play with the data
python_df.show(10)
Sample Read and Write illustration
# DataSet
dataset = ScalaDataSet.apply(jspark)
# Read from HDFS
hdfs_data = DataFrame(dataset.read("pcatalog.flights_hdfs",""),jspark)
# Illustration Count
hdfs_data.count()
# 1398164
# Read from Kafka
kafka_data = DataFrame(dataset.read("pcatalog.flights_kafka",""),jspark)
# Illustration Count
kafka_data.count()
# 0
# Write to Kafka
dataset.write("pcatalog.flights_kafka",hdfs_data._jdf,"")
# Read from Kafka post-write
kafka_data = DataFrame(dataset.read("pcatalog.flights_kafka",""),jspark)
# Illustration Count
kafka_data.count()
# 1398164
# Sample Data
kafka_data.show(3)
#+-------------------+----------+--------+-----------------+---------+-------+-------------+----------------+----+--------------+--------+--------------+--------+-------+--------------------+------+-------------------+---------+------+----------------+--------------+--------+--------------+-------------+-----+----+
#|ACTUAL_ELAPSED_TIME|AIRLINE_ID|AIR_TIME|CANCELLATION_CODE|CANCELLED|CARRIER|CARRIER_DELAY|CRS_ELAPSED_TIME|DEST|DEST_CITY_NAME|DISTANCE|DISTANCE_GROUP|DIVERTED|FLIGHTS| FL_DATE|FL_NUM|LATE_AIRCRAFT_DELAY|NAS_DELAY|ORIGIN|ORIGIN_CITY_NAME|SECURITY_DELAY|TAIL_NUM|UNIQUE_CARRIER|WEATHER_DELAY|month|year|
#+-------------------+----------+--------+-----------------+---------+-------+-------------+----------------+----+--------------+--------+--------------+--------+-------+--------------------+------+-------------------+---------+------+----------------+--------------+--------+--------------+-------------+-----+----+
#| 68.0| 20304| 39.0| null| 0.0| OO| null| 62.0| ORD| Chicago, IL| 157.0| 1| 0.0| 1.0|2017-10-01T00:00:...| 2936| null| null| FWA| Fort Wayne, IN| null| N464SW| OO| null| 10|2017|
#| 63.0| 20304| 35.0| null| 0.0| OO| null| 60.0| ORD| Chicago, IL| 137.0| 1| 0.0| 1.0|2017-10-01T00:00:...| 2940| null| null| GRR|Grand Rapids, MI| null| N727SK| OO| null| 10|2017|
#| 65.0| 20304| 42.0| null| 0.0| OO| null| 72.0| ALO| Waterloo, IA| 234.0| 1| 0.0| 1.0|2017-10-01T00:00:...| 2942| null| null| ORD| Chicago, IL| null| N423SW| OO| null| 10|2017|
#+-------------------+----------+--------+-----------------+---------+-------+-------------+----------------+----+--------------+--------+--------------+--------+-------+--------------------+------+-------------------+---------+------+----------------+--------------+--------+--------------+-------------+-----+----+