Conversion from Spark to Pandas dataframe using pyArrow significantly slower after update

cgalea11 Member Posts: 14

Hi,

I have noticed that converting Spark tables to Pandas dataframes using pyArrow and the .toPandas command in PySpark takes significantly longer since the latest update to Domino. Previously, using pyArrow cut the conversion time 10- to 20-fold, but I am now seeing only a ~2-fold decrease.

I am using the same original base image in the Docker container that I had been using in the past, so I don't think it is a package version issue.

However, I have noticed that once the file has been converted to a Pandas dataframe, rerunning .toPandas on the same file is very fast (possibly due to caching?).
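For reference, here is a minimal sketch of how the comparison could be timed, so that the first (cold) run is kept separate from the reruns that may be benefiting from caching. The helper names `time_call` and `compare_arrow` are hypothetical and only for illustration. The config key is an assumption about the Spark version: `spark.sql.execution.arrow.pyspark.enabled` applies to Spark 3.0+, while Spark 2.3/2.4 uses `spark.sql.execution.arrow.enabled`.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each wall-clock duration in seconds.

    Durations are returned per run, so the first (cold) run can be compared
    with later runs that might be served from a cache.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def compare_arrow(spark, sdf, repeats=3,
                  arrow_key="spark.sql.execution.arrow.pyspark.enabled"):
    """Time sdf.toPandas() with Arrow disabled, then enabled.

    `spark` is an existing SparkSession and `sdf` a Spark DataFrame.
    `arrow_key` is an assumption: on Spark 2.3/2.4 pass
    "spark.sql.execution.arrow.enabled" instead.
    """
    results = {}
    for enabled in ("false", "true"):
        spark.conf.set(arrow_key, enabled)
        results[enabled] = time_call(sdf.toPandas, repeats=repeats)
    return results
```

Calling `compare_arrow(spark, sdf)` on the table in question returns a dict keyed by `"false"` and `"true"`, each holding a list of per-run timings. Comparing the first entry of each list gives the cold-run speedup, and the later entries show how much of the difference disappears on reruns.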

Appreciate any help you can provide.

Charles
