Conversion from Spark to Pandas dataframe using pyArrow significantly slower after update

cgalea11 Member Posts: 14


Since the latest Domino update, converting Spark tables to Pandas dataframes with .toPandas in PySpark and PyArrow enabled takes significantly longer. Before the update, enabling PyArrow cut the conversion time by a factor of 10 to 20; now it only cuts it by a factor of about 2.

I am using the same base image for the Docker container that I used before the update, so I don't think it is a package version issue.

However, I have noticed that once a table has been converted to a Pandas dataframe, rerunning .toPandas on the same table is very fast (possibly due to caching?).
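
For anyone trying to reproduce this, here is a minimal sketch of how the first and repeated .toPandas calls could be timed with Arrow switched off and on. It assumes a Spark 3.x session, where the setting is named spark.sql.execution.arrow.pyspark.enabled (Spark 2.3/2.4 uses spark.sql.execution.arrow.enabled instead). The helper names time_calls and compare_arrow, and the table name in the usage comment, are hypothetical and only for illustration.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def compare_arrow(spark, sdf, repeats=3):
    """Time sdf.toPandas() with Arrow disabled, then enabled.

    Returns a dict mapping the Arrow flag (False/True) to a list of
    per-call timings, so the first call can be compared with repeats.
    """
    # Assumption: Spark 3.x key. On Spark 2.3/2.4 the key is
    # "spark.sql.execution.arrow.enabled".
    key = "spark.sql.execution.arrow.pyspark.enabled"
    results = {}
    for enabled in (False, True):
        spark.conf.set(key, str(enabled).lower())
        results[enabled] = time_calls(sdf.toPandas, repeats)
    return results


# Usage, given an existing SparkSession `spark` (table name is hypothetical):
#   sdf = spark.table("my_table")
#   for flag, secs in compare_arrow(spark, sdf).items():
#       print("arrow enabled:", flag, [round(s, 2) for s in secs])
```

If the first call in each list is much slower than the later ones, that would be consistent with the caching guess above, and comparing the two lists gives the Arrow speed-up for both the first and the repeated calls.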

Appreciate any help you can provide.


