Create a PySpark environment with any version of Hadoop
When you install PySpark from pip or Anaconda, or download it from https://pypi.org/project/pyspark/, you will find that it only works with certain versions of Hadoop (Hadoop 2.7.3 for PySpark 2.4.3, for example). If you are using a different version of Hadoop and try to access something like S3 from Spark, you will receive errors such as:
Py4JJavaError: An error occurred while calling o69.parquet:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.fs.s3a.S3AFileSystem
    at java.lang.Class.forName0(Native Method)
This happens even though you seem to have the correct JARs in place. The reason is that PySpark is built against a specific version of Hadoop by default. To work around this, you need to download the "no Hadoop" version of Spark, build the PySpark installation bundle from it, install that bundle, then install the Hadoop core libraries you need and point PySpark at them.
Below is a Dockerfile that does just this using Spark 2.4.3 and Hadoop 2.8.5:
#
# Download Spark 2.4.3 WITHOUT Hadoop.
# Unpack and move to /usr/local and make symlink from /usr/local/spark to specific Spark version
# Build and install Pyspark from this download
#
RUN wget http://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-without-hadoop.tgz && \
    tar xzvf spark-2.4.3-bin-without-hadoop.tgz && \
    mv spark-2.4.3-bin-without-hadoop /usr/local/spark-2.4.3 && \
    ln -s /usr/local/spark-2.4.3 /usr/local/spark && \
    cd /usr/local/spark/python && \
    python setup.py sdist && \
    pip install dist/pyspark-2.4.3.tar.gz

#
# Download core Hadoop 2.8.5
# Unpack and move to /usr/local and make symlink from /usr/local/hadoop to specific Hadoop version
#
RUN wget https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/hadoop-2.8.5.tar.gz && \
    tar xzvf hadoop-2.8.5.tar.gz && \
    mv hadoop-2.8.5 /usr/local && \
    ln -s /usr/local/hadoop-2.8.5 /usr/local/hadoop

#
# Download correct AWS Java SDK for S3 for Hadoop version being used
# and put in Hadoop install location
#
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar && \
    mv aws-java-sdk-s3-1.10.6.jar /usr/local/hadoop/share/hadoop/tools/lib/

#
# Set SPARK_DIST_CLASSPATH in $SPARK_HOME/conf/spark-env.sh to point to Hadoop install
# (the continuation lines start at column 0 so the quoted pieces join into one string)
#
RUN echo 'export SPARK_DIST_CLASSPATH="/usr/local/hadoop/etc/hadoop:'\
'/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:'\
'/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:'\
'/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:'\
'/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:'\
'/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:'\
'/usr/local/hadoop/share/hadoop/tools/lib/*"' >> /usr/local/spark/conf/spark-env.sh
The key parts of this Dockerfile are where the PySpark package is built and installed:
cd /usr/local/spark/python && \
    python setup.py sdist && \
    pip install dist/pyspark-2.4.3.tar.gz
and where the AWS S3 SDK JAR that matches your Hadoop version is downloaded:
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar && \
    mv aws-java-sdk-s3-1.10.6.jar /usr/local/hadoop/share/hadoop/tools/lib/
To find the correct AWS S3 SDK JAR version, go to the Maven Repository page for the hadoop-aws JAR that matches the Hadoop version you are installing (in the example above this is hadoop-aws-2.8.5.jar, so the page is https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.8.5) and look at the AWS S3 SDK JAR version in the Compile Dependencies section. In this case it is aws-java-sdk-s3-1.10.6.jar.
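If you switch Hadoop versions often, the two URLs involved in this lookup can be generated rather than typed by hand. The Python sketch below only formats strings. It assumes the URL patterns used in this post hold for other versions as well, and the function names are made up for illustration. You still have to read the SDK version off the Compile Dependencies section yourself.

```python
# Hypothetical helpers: the URL patterns are copied from the examples in
# this post and are assumed to hold for other versions as well.
MVN_PAGE = "https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/{hadoop}"
SDK_JAR = (
    "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/"
    "{sdk}/aws-java-sdk-s3-{sdk}.jar"
)


def hadoop_aws_page(hadoop_version: str) -> str:
    """Page whose Compile Dependencies section lists the matching SDK version."""
    return MVN_PAGE.format(hadoop=hadoop_version)


def aws_sdk_s3_jar_url(sdk_version: str) -> str:
    """Download URL of the aws-java-sdk-s3 JAR for the given SDK version."""
    return SDK_JAR.format(sdk=sdk_version)


if __name__ == "__main__":
    # Versions used in the Dockerfile above.
    print(hadoop_aws_page("2.8.5"))
    print(aws_sdk_s3_jar_url("1.10.6"))
```

The second URL is the one to pass to wget in the Dockerfile.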
SPARK_DIST_CLASSPATH is set to your HADOOP_CLASSPATH, as output by the hadoop classpath command.
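Instead of hard-coding the classpath, you could generate the spark-env.sh line from the command output. The following is a minimal Python sketch, not part of the original Dockerfile. It assumes the hadoop binary lives at /usr/local/hadoop/bin/hadoop and that spark-env.sh is at /usr/local/spark/conf/spark-env.sh, matching the paths above. The function names are hypothetical.

```python
import subprocess


def spark_env_line(classpath: str) -> str:
    """Return the export line to append to spark-env.sh for a classpath."""
    return 'export SPARK_DIST_CLASSPATH="{}"'.format(classpath.strip())


def hadoop_classpath(hadoop_bin: str = "/usr/local/hadoop/bin/hadoop") -> str:
    """Run `hadoop classpath` and return its output (path is an assumption)."""
    result = subprocess.run(
        [hadoop_bin, "classpath"], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def append_to_spark_env(line: str,
                        path: str = "/usr/local/spark/conf/spark-env.sh") -> None:
    """Append one line to spark-env.sh (path is an assumption)."""
    with open(path, "a") as f:
        f.write(line + "\n")


# Example, to be run inside the image once Hadoop is installed:
#   append_to_spark_env(spark_env_line(hadoop_classpath()))
```

Running this during the image build keeps the classpath in sync with whatever Hadoop version is installed, at the cost of requiring Java to be available at that point so the hadoop command can run.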