I have been trying to get a Python Streamlit app to publish as an App, but just get an empty 'Please wait...' screen rather than the app content.
I have a `config.toml` file and a `credentials.toml` preconfigured to host on `0.0.0.0` and port `8888`, which are moved to `~/.streamlit/` as part of a custom Post Setup script. Those were necessary to get the app sort of working and to stop the logs throwing errors, but there still appear to be issues.
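For comparison, a minimal `~/.streamlit/config.toml` along those lines might look like the following sketch. The port and address values are from my setup above; the other keys are assumptions based on Streamlit's standard server options and may or may not be relevant:

```toml
[server]
headless = true        # don't try to open a browser on the server
port = 8888
address = "0.0.0.0"
enableCORS = false     # assumption: may be needed behind a reverse proxy
```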
Anyone else been able to get Streamlit (or any other Tornado app) to work and host in Domino?
When installing Pyspark from pip or Anaconda, or downloading it from https://pypi.org/project/pyspark/, you will find that it only works with certain versions of Hadoop (Hadoop 2.7.2 for Pyspark 2.4.3, for example). If you are using a different version of Hadoop and try to access something like S3 from Spark, you will receive errors such as:
Py4JJavaError: An error occurred while calling o69.parquet: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.fs.s3a.S3AFileSystem at java.lang.Class.forName0(Native Method)
This is despite the fact that you seem to have the correct JARs in place. The reason is that Pyspark is built against a specific version of Hadoop by default. To work around this you will need to install the "no hadoop" version of Spark, build the Pyspark installation bundle from that, install it, then install the Hadoop core libraries you need and point Pyspark at those libraries.
Below is a Dockerfile snippet that does just this using Spark 2.4.3 and Hadoop 2.8.5:
```dockerfile
#
# Download Spark 2.4.3 WITHOUT Hadoop.
# Unpack, move to /usr/local, and make a symlink from /usr/local/spark to the specific Spark version.
# Build and install Pyspark from this download.
#
RUN wget http://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-without-hadoop.tgz && \
    tar xzvf spark-2.4.3-bin-without-hadoop.tgz && \
    mv spark-2.4.3-bin-without-hadoop /usr/local/spark-2.4.3 && \
    ln -s /usr/local/spark-2.4.3 /usr/local/spark && \
    cd /usr/local/spark/python && \
    python setup.py sdist && \
    pip install dist/pyspark-2.4.3.tar.gz
#
# Download core Hadoop 2.8.5.
# Unpack, move to /usr/local, and make a symlink from /usr/local/hadoop to the specific Hadoop version.
#
RUN wget https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/hadoop-2.8.5.tar.gz && \
    tar xzvf hadoop-2.8.5.tar.gz && \
    mv hadoop-2.8.5 /usr/local && \
    ln -s /usr/local/hadoop-2.8.5 /usr/local/hadoop
#
# Download the AWS Java SDK for S3 that matches the Hadoop version being used
# and put it in the Hadoop install location.
#
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar && \
    mv aws-java-sdk-s3-1.10.6.jar /usr/local/hadoop/share/hadoop/tools/lib/
#
# Set SPARK_DIST_CLASSPATH in $SPARK_HOME/conf/spark-env.sh to point to the Hadoop install.
#
RUN echo 'export SPARK_DIST_CLASSPATH="/usr/local/hadoop/etc/hadoop:'\
'/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:'\
'/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:'\
'/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:'\
'/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:'\
'/usr/local/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:'\
'/usr/local/hadoop/share/hadoop/tools/lib/*"' >> /usr/local/spark/conf/spark-env.sh
```
The key parts of this Dockerfile are where the Pyspark package is built and installed:
```dockerfile
cd /usr/local/spark/python && \
python setup.py sdist && \
pip install dist/pyspark-2.4.3.tar.gz
```
and ensuring you have the correct AWS S3 SDK JAR to match your Hadoop version:
```dockerfile
RUN wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.10.6/aws-java-sdk-s3-1.10.6.jar && \
    mv aws-java-sdk-s3-1.10.6.jar /usr/local/hadoop/share/hadoop/tools/lib/
```
To find the correct AWS S3 SDK JAR version, go to the Maven Repository page for the hadoop-aws JAR matching the Hadoop version you're installing (in the example above this is hadoop-aws-2.8.5.jar, so the page is https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.8.5) and look at the AWS S3 SDK JAR version listed in the Compile Dependencies section. In this case it is aws-java-sdk-s3-1.10.6.jar.
`SPARK_DIST_CLASSPATH` is set to your `HADOOP_CLASSPATH` as output by the `hadoop classpath` command.
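Since the directory list is just what `hadoop classpath` prints, a possible alternative (untested here) is to capture that command's output at image build time rather than hard-coding the paths, assuming the `/usr/local/hadoop` symlink from the Dockerfile above exists:

```dockerfile
# Alternative: derive SPARK_DIST_CLASSPATH from `hadoop classpath` at build time
# instead of hard-coding the directory list. Double quotes make $(...) expand
# when this RUN step executes inside the image.
RUN echo "export SPARK_DIST_CLASSPATH=\"$(/usr/local/hadoop/bin/hadoop classpath)\"" \
    >> /usr/local/spark/conf/spark-env.sh
```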
Please see this guide for adding git repositories to Domino.
If you want to add a repo from a git server with ssh configured to another port (in this example, it's 12345) you will need to use the below format:
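The exact format didn't survive in this post, but standard git-over-ssh URL syntax with an explicit port uses the `ssh://` scheme; the hostname and repository path below are placeholders:

```text
ssh://git@git.example.com:12345/my-team/my-repo.git
```

Note that the SCP-like shorthand (`git@git.example.com:my-team/my-repo.git`) does not accept a port number, which is why the explicit `ssh://` form is needed here.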
If you have any issues setting this up, please let us know!
Note: Don't forget to add your git credentials for the domain in question.
At times while executing runs you might observe CPU utilization in excess of 100%.
Domino measures CPU usage as a percentage of a single core's capacity, so a machine with multiple cores can regularly report more than 100%.
For instance, an m4.xlarge EC2 instance in AWS has 4 CPU cores and would show a maximum utilization of 400%.
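As a quick sanity check of this convention, the ceiling is simply the core count times 100 (the function name below is just for illustration):

```python
def max_cpu_percent(cores: int) -> int:
    """Maximum CPU utilization Domino can report for a machine with
    `cores` cores, measured as a percentage of a single core."""
    return cores * 100

# An m4.xlarge has 4 cores, so utilization can reach 400%.
print(max_cpu_percent(4))  # 400
```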