Conversion from Spark to Pandas dataframe using pyArrow significantly slower after update
I have noticed that converting Spark tables to Pandas dataframes with pyArrow and the .toPandas command in PySpark takes significantly longer since the latest update to Domino. Before the update, using pyArrow cut the conversion time 10- to 20-fold; now I only see a roughly 2-fold reduction.
I am using the original base image in the docker container that I had been using in the past so I don't think it is a package version issue.
However, I have noticed that once the file has been converted to a Pandas dataframe, rerunning .toPandas on the same file is very fast (possibly due to caching?).
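To separate the Arrow effect from the caching effect, the comparison could be timed along the lines of the sketch below. This is only an illustration, not the exact code from my project: the helper names `time_to_pandas` and `compare` are hypothetical, and the config key is an assumption about the Spark version (`spark.sql.execution.arrow.enabled` on Spark 2.3/2.4, renamed to `spark.sql.execution.arrow.pyspark.enabled` in Spark 3.x). The DataFrame is cached and materialized first so that both runs start from the same warm state.

```python
import time

# Assumption: Spark 2.3/2.4 key. On Spark 3.x pass
# "spark.sql.execution.arrow.pyspark.enabled" instead.
ARROW_KEY = "spark.sql.execution.arrow.enabled"


def time_to_pandas(spark, df, use_arrow, arrow_key=ARROW_KEY):
    """Toggle Arrow, run df.toPandas(), return (seconds, row count)."""
    spark.conf.set(arrow_key, "true" if use_arrow else "false")
    start = time.perf_counter()
    pdf = df.toPandas()
    return time.perf_counter() - start, len(pdf)


def compare(spark, df, arrow_key=ARROW_KEY):
    """Time toPandas() without and with Arrow on a pre-materialized DataFrame."""
    # cache() is lazy; count() forces the data to be computed and stored,
    # so neither timed run pays for the initial read.
    df = df.cache()
    df.count()
    results = {}
    for use_arrow in (False, True):
        seconds, _rows = time_to_pandas(spark, df, use_arrow, arrow_key)
        results[use_arrow] = seconds
    return results


# Usage, assuming an existing SparkSession `spark` and DataFrame `sdf`:
# timings = compare(spark, sdf)
# print("speed-up with Arrow:", timings[False] / timings[True])
```

If the speed-up measured this way is still only about 2-fold, the slowdown is probably in the conversion itself rather than in reading the data.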
Appreciate any help you can provide.