
Catalog Providers

Note

Gimel resolves dataset properties through a catalog provider. This page covers three providers: USER, HIVE, and UDC, selected with the property gimel.catalog.provider.

Catalog Provider as USER

Note

With the USER catalog provider, the dataset properties are supplied by the user at runtime instead of being looked up from a catalog.

GSQL

%%sql 
set gimel.catalog.provider=USER

Note: The property name should be {datasetName}.dataSetProperties. For example, if your dataset name is pcatalog.mydataset, then the property name would be pcatalog.mydataset.dataSetProperties.

Example for HBase Dataset

%%sql
set pcatalog.hbase_test.dataSetProperties={
    "datasetType": "HBASE",
    "fields": [
        {
            "fieldName": "id",
            "fieldType": "string",
            "isFieldNullable": false
        },
        {
            "fieldName": "name",
            "fieldType": "string",
            "isFieldNullable": false
        },
        {
            "fieldName": "address",
            "fieldType": "string",
            "isFieldNullable": false
        },
        {
            "fieldName": "age",
            "fieldType": "string",
            "isFieldNullable": false
        },
        {
            "fieldName": "company",
            "fieldType": "string",
            "isFieldNullable": false
        },
        {
            "fieldName": "designation",
            "fieldType": "string",
            "isFieldNullable": false
        },
        {
            "fieldName": "salary",
            "fieldType": "string",
            "isFieldNullable": true
        }
    ],
    "partitionFields": [],
    "props": {
        "gimel.hbase.rowkey":"id",
        "gimel.hbase.table.name":"adp_bdpe:test_annai",
        "gimel.hbase.namespace.name":"namespace",
        "gimel.hbase.columns.mapping":"personal:name,personal:address,personal:age,professional:company,professional:designation,professional:salary"
    }
}
%%pcatalog
select * from pcatalog.hbase_test

DATA API Usage

import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import com.paypal.gimel.logger.Logger
import com.paypal.gimel.{DataSet, DataSetType}
import com.paypal.gimel.common.catalog._
import spray.json.DefaultJsonProtocol._
import spray.json._

val logger = Logger(this.getClass.getName)

// Initialize HiveContext and the Gimel DataSet API (sc and spark are provided by the Spark shell)
val hiveContext = new HiveContext(sc)
val sqlContext = hiveContext.asInstanceOf[SQLContext]
val dataSet = com.paypal.gimel.DataSet(spark)

// Resolve dataset properties from runtime options instead of a catalog
hiveContext.sql("set gimel.catalog.provider=USER")

val datasetPropsJson="""
{
    "datasetType": "ELASTIC_SEARCH",
    "fields": [],
    "partitionFields": [],
    "props": {
        "gimel.es.index.partition.isEnabled": "true",
        "gimel.es.index.partition.suffix": "20180120",
        "datasetName": "data_set_name",
        "nameSpace": "pcatalog",
        "es.port": "8080",
        "es.resource": "index/type",
        "gimel.es.index.partition.delimiter": "-",
        "es.nodes": "http://es_host"
    }
}
"""

val data = dataSet.read("pcatalog.data_set_name",
                         Map("pcatalog.data_set_name.dataSetProperties" -> datasetPropsJson))

Catalog Provider as HIVE

Note

With the HIVE catalog provider, Gimel looks up the dataset properties from the definition of a Hive table that points to the physical storage.

%%sql 
set gimel.catalog.provider=HIVE

Create Hive Table pointing to physical storage

You can find Hive table templates for each storage in its docs; see Stack & Version Compatibility.
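
As a hedged illustration only (the authoritative templates and property names live in each storage's docs), a Hive table pointing at the Elasticsearch index from the earlier example might be created as sketched below; the gimel.storage.type key and the exact TBLPROPERTIES are assumptions to verify against your Gimel version.

// Illustrative sketch only: a Hive table whose TBLPROPERTIES carry the storage properties,
// so gimel.catalog.provider=HIVE can resolve them. Property names are assumptions; check the storage docs.
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS pcatalog.elastic_test (payload string)
  LOCATION 'hdfs:///tmp/pcatalog/elastic_test'
  TBLPROPERTIES (
    'gimel.storage.type'='ELASTIC_SEARCH',
    'es.nodes'='http://es_host',
    'es.port'='8080',
    'es.resource'='index/type'
  )
""")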

Catalog Provider as UDC (Unified Data Catalog)

Note

With the UDC catalog provider, Gimel fetches the dataset properties from the Unified Data Catalog REST service, configured with the parameters below.

%%sql 
set gimel.catalog.provider=UDC

UDC parameters

Property | Mandatory? | Description | Example | Default
rest.service.method or spark.rest.service.method | Y | UDC REST service protocol | http/https | https
rest.service.host or spark.rest.service.host | Y | UDC REST service host name | my-udc.example.com | None
rest.service.port or spark.rest.service.port | Y | UDC REST service port | 80 | 443
gimel.udc.auth.enabled or spark.gimel.udc.auth.enabled | N | UDC auth enabled flag | true/false | true
gimel.udc.auth.provider.class or spark.gimel.udc.auth.provider.class | N | Required if gimel.udc.auth.enabled=true | com.example.MyAuthProvider | None
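
The spark.-prefixed variants can be supplied as Spark configuration instead of per call. A minimal sketch, assuming these confs are picked up from the active Spark session (values are illustrative):

// Minimal sketch: supplying the spark.-prefixed UDC properties via the SparkSession builder.
// Host, port, and class names below are illustrative and must match your environment.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gimel-udc-example")
  .config("spark.rest.service.method", "https")
  .config("spark.rest.service.host", "my-udc.example.com")
  .config("spark.rest.service.port", "443")
  .config("spark.gimel.udc.auth.provider.class", "com.example.MyAuthProvider")
  .enableHiveSupport()
  .getOrCreate()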

UDC Auth

To plug in your own auth provider, add the gimel-common dependency, implement AuthProvider, and point gimel.udc.auth.provider.class at your class:

  <dependency>
    <groupId>com.paypal.gimel</groupId>
    <artifactId>gimel-common</artifactId>
    <version>${gimel.version}</version>
  </dependency>

package com.example

import com.paypal.gimel.common.security._

class MyAuthProvider extends AuthProvider {

  override def getCredentials(props: Map[String, Any]): String = {
    /*
     * Implement your own logic here and return the credential string;
     * the literal below is only a placeholder so the stub compiles.
     */
    "placeholder-credential"
  }
}

Here, props is the map of Gimel properties passed at runtime.
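
For illustration only, a provider might read a token location from those props; the property name udc.auth.token.file below is hypothetical and not a Gimel property.

// Hypothetical sketch of consuming props inside getCredentials.
// "udc.auth.token.file" is an illustrative property name, not part of Gimel.
import com.paypal.gimel.common.security._

class FileTokenAuthProvider extends AuthProvider {

  override def getCredentials(props: Map[String, Any]): String = {
    // Pick the token path from the runtime properties, with a fallback for local testing
    val tokenPath = props.getOrElse("udc.auth.token.file", "/tmp/udc.token").toString
    val source = scala.io.Source.fromFile(tokenPath)
    try source.mkString.trim finally source.close()
  }
}

End-to-end usage with the UDC provider then looks like: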

import org.apache.spark.sql.{Column, Row, SparkSession,DataFrame}
import org.apache.spark.sql.functions._

// Create Gimel SQL reference
val gsql: (String) => DataFrame = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch(_: String, spark)
gsql("set gimel.logging.level=CONSOLE")

// Options for the UDC REST service and the custom auth provider
val options = Map(
  "rest.service.method" -> "https",
  "rest.service.host" -> "my-udc.example.com",
  "rest.service.port" -> "443",
  "gimel.deserializer.class" -> "com.paypal.gimel.deserializers.generic.JsonDynamicDeserializer",
  "gimel.kafka.throttle.batch.fetchRowsOnFirstRun" -> "1",
  "gimel.udc.auth.provider.class" -> "com.example.MyAuthProvider")

// Initialize the Gimel DataSet API and read a UDC-cataloged dataset
val dataSet = com.paypal.gimel.DataSet(spark)
val df = dataSet.read("udc.Kafka.Gimel_Dev.default.flights", options)
df.show