Connecting Glue Catalog to Domino for use with On Demand Spark

tayler.sale (Member, Moderator, Domino)

This article is meant for customers who use AWS Glue Catalog as their metastore for distributed computing and would like to continue using it with Domino. AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics, and its Data Catalog is AWS's version of the Apache Hive Metastore. Unfortunately, it is not compatible with Spark out of the box, so you have to install a special client, the AWS Glue Data Catalog Client for Apache Hive Metastore, to make them compatible. To get this client working in Domino, you need to build two new compute environments: one for your workspace (set in Project Settings) and one for the Spark cluster executors (set in the workspace cluster configs).

The diagram above describes how Domino's custom Glue setup compares to Apache Spark's official releases. At the time of writing, the most recent version of Apache Spark is v2.4.6. This release typically ships with JARs for Apache Hadoop v2.7.x, as well as a Spark-maintained fork of Apache Hive based on v1.2.1. In addition, hadoop-aws relies on aws-java-sdk, typically v1.7.x.

The Glue Catalog client provided by AWS, however, introduces two constraints:

  1. It relies on applying a patch to Apache Hive to support custom metastore clients.
  2. It requires a much more recent version of aws-java-sdk (v1.11.267+).

The first constraint requires compiling Apache Hive with the patch applied, and therefore a custom Spark distribution for use on Domino. The second constraint requires Apache Hadoop v2.9.x+, which depends on AWS Java SDK v1.11.x+.

Below is a Dockerfile (adapted from here) that builds this custom distribution of Apache Spark. You will need to copy the resulting artifact out of the image when done (/opt/spark/spark.tgz); a sketch of one way to do that follows the Dockerfile. The resulting artifact is currently hosted in a public S3 bucket.

FROM python:3.6-slim-buster

# ADD REPO FOR JDK
RUN echo "deb http://ftp.us.debian.org/debian sid main" >> /etc/apt/sources.list \
&&  apt-get update \
&&  mkdir -p /usr/share/man/man1

# INSTALL PACKAGES
RUN apt-get install -y git wget openjdk-8-jdk

# INSTALL MAVEN
ENV MAVEN_VERSION=3.6.3
RUN cd /opt \
&&  wget https://downloads.apache.org/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz \
&&  tar zxvf /opt/apache-maven-${MAVEN_VERSION}-bin.tar.gz \
&&  rm apache-maven-${MAVEN_VERSION}-bin.tar.gz
ENV PATH=/opt/apache-maven-$MAVEN_VERSION/bin:$PATH


ENV SPARK_VERSION=2.4.6 
ENV HADOOP_VERSION=2.9.2
ENV HIVE_VERSION=1.2.1
ENV AWS_SDK_VERSION=1.11.267

# BUILD HIVE FOR HIVE v1
RUN git clone https://github.com/apache/hive.git /opt/hive
WORKDIR /opt/hive
RUN git checkout tags/release-$HIVE_VERSION -b rel-$HIVE_VERSION
# Apply patch
RUN wget https://issues.apache.org/jira/secure/attachment/12958417/HIVE-12679.branch-1.2.patch
RUN patch -p0 <HIVE-12679.branch-1.2.patch
# Install fails to fetch this JAR.
RUN wget https://repository.jboss.org/maven2/javax/jms/jms/1.1/jms-1.1.jar
RUN mvn install:install-file -DgroupId=javax.jms -DartifactId=jms -Dversion=1.1 -Dpackaging=jar -Dfile=jms-1.1.jar
# Build Hive
RUN mvn clean install -DskipTests -Phadoop-2
# Related to this issue https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/pull/14
RUN mkdir -p ~/.m2/repository/org/spark-project
RUN cp -r ~/.m2/repository/org/apache/hive ~/.m2/repository/org/spark-project

# BUILD AWS GLUE DATA CATALOG CLIENT
RUN git clone https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore.git /opt/glue
WORKDIR /opt/glue
RUN sed -i '/<packaging>pom<\/packaging>/a <dependencies><dependency><groupId>org.apache.hadoop<\/groupId><artifactId>hadoop-common<\/artifactId><version>${hadoop.version}<\/version><scope>provided<\/scope><\/dependency><\/dependencies>' shims/pom.xml
RUN mvn clean package -DskipTests -pl -aws-glue-datacatalog-hive2-client

# BUILD SPARK
RUN git clone https://github.com/apache/spark.git /opt/spark
WORKDIR /opt/spark
RUN git checkout tags/v$SPARK_VERSION -b v$SPARK_VERSION
RUN ./dev/make-distribution.sh --name my-custom-spark --pip -Phadoop-${HADOOP_VERSION%.*} -Phive -Dhadoop.version=$HADOOP_VERSION -Dhive.version=$HIVE_VERSION

# ADD MISSING & BUILT JARS TO SPARK CLASSPATHS + CONFIG
WORKDIR /opt/spark/dist
# Copy missing deps
RUN mvn dependency:get -Dartifact=asm:asm:3.2
RUN mvn dependency:get -Dartifact=commons-codec:commons-codec:1.9
# Copy Glue Client JARs
RUN find /opt/glue -name "*.jar" -exec cp {} jars \;
# Copy AWS JARs
RUN echo :quit | ./bin/spark-shell --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:$HADOOP_VERSION,com.amazonaws:aws-java-sdk:$AWS_SDK_VERSION 
RUN cp /root/.ivy2/jars/*.jar jars

# CREATE ARTIFACT
RUN mv /opt/spark/dist /opt/spark/spark
WORKDIR /opt/spark
RUN tar -cvzf spark.tgz spark
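
If you build the image yourself, one way to copy the artifact out is with docker cp from a stopped container. A minimal sketch, assuming the image is tagged spark-glue-build (an arbitrary name for this example):

# Build the image from the Dockerfile above
docker build -t spark-glue-build .
# Create a stopped container, copy the artifact out of it, then clean up
docker create --name spark-glue-tmp spark-glue-build
docker cp spark-glue-tmp:/opt/spark/spark.tgz ./spark.tgz
docker rm spark-glue-tmp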

You don't need to build this image yourself unless you want to host the artifact somewhere custom; otherwise you can pull the prebuilt artifact from S3 using the code below. To create the environments, use these instructions to choose the appropriate base image, then add the code to the Dockerfile and/or Pre-Run Script. Note: these environments use the Jupyter Spylon kernel rather than PySpark. If you want to use PySpark, you will need to install it in the Dockerfile for both environments.
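For example, a minimal sketch of that addition (the version pin is an assumption and should match the Spark version of the custom distribution):

# Install PySpark matching the custom Spark 2.4.6 build
RUN pip install pyspark==2.4.6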

Workspace Environment

Base Image: Domino Analytics Distribution (or base image of your choosing)

Dockerfile:

ENV SPARK_HOME=/opt/domino/spark
ENV PATH "$PATH:$SPARK_HOME/bin"
RUN mkdir /opt/domino
RUN wget -q https://domino-spark.s3.us-east-2.amazonaws.com/spark.tar && \
    tar -xf spark.tar && \
    rm spark.tar && \
    mv spark /opt/domino/spark && \
    chmod -R 777 /opt/domino/spark/conf && \
    rm /opt/domino/spark/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar
RUN pip install spylon-kernel
RUN python -m spylon_kernel install

Pre-Run Script: Be sure to set $AWS_REGION to match the region you operate in.

export PATH="$PATH:$SPARK_HOME/bin"
cat >> $SPARK_HOME/conf/spark-defaults.conf << EOF
spark.hadoop.aws.region $AWS_REGION
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.experimental.fadvise random
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version        2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema                false
spark.sql.parquet.filterPushdown             true
spark.sql.hive.metastorePartitionPruning     true
spark.sql.catalogImplementation              hive
spark.sql.hive.convertMetastoreParquet       false
spark.sql.hive.caseSensitiveInferenceMode    NEVER_INFER
EOF
cat >> /opt/domino/spark/conf/hive-site.xml <<EOF
<configuration>
    <property>
        <name>hive.metastore.client.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
    </property>
</configuration>
EOF
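
Once a workspace is running with this environment, a quick way to confirm the Glue metastore is wired up is to list databases from a workspace terminal. A minimal smoke test, assuming AWS credentials and $AWS_REGION are available in the workspace:

# Pipe a one-liner into spark-shell; a correctly configured client lists your Glue databases
echo 'spark.sql("SHOW DATABASES").show(false)' | spark-shell --master local[*]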

Executor Environment

Base Image: Bitnami Spark Image

Dockerfile:

USER root
RUN apt-get update && apt-get install -y wget && rm -r /var/lib/apt/lists /var/cache/apt/archives
WORKDIR /opt/bitnami

# Replace Spark
RUN rm -rf spark
RUN wget -q https://domino-spark.s3.us-east-2.amazonaws.com/spark.tar && \
    tar -xf spark.tar && \
    rm spark.tar && \
    chmod -R 777 spark/conf && \
    rm /opt/bitnami/spark/jars/com.amazonaws_aws-java-sdk-bundle-1.11.199.jar

# Rerun Bitnami post-install script
WORKDIR /
RUN /opt/bitnami/scripts/spark/postunpack.sh
WORKDIR /opt/bitnami/spark
ENV PATH="$PATH:$SPARK_HOME/bin"
USER 1001
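
Before pointing your cluster configuration at this image, you can sanity-check it locally. A hypothetical check (substitute the tag you built the image under for my-executor-image):

# Confirms spark-submit resolves from the replaced distribution and prints its version banner
docker run --rm my-executor-image spark-submit --version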

If you run into issues getting this set up, let us know here or by emailing [email protected]

Comments

  • tayler.sale (Member, Moderator, Domino)

    If you run into an issue accessing data tables in Glue and encounter this error message:

     java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
     at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
    

    Then replace this line in the Pre-Run Script of your Workspace environment

    spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
    

    with this line

    spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
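
    To verify the change, you can read an affected table again from a fresh session; the database and table names below are hypothetical:

     echo 'spark.sql("SELECT * FROM my_glue_db.my_table LIMIT 5").show()' | spark-shell --master local[*]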
    