FAQ: How can I read files from a dataset in my project and write some files back to it from my on-demand Spark cluster in Domino?
See below for how to do so!
Josh Mineroff | Field Engineer
By default, datasets are mounted read-only; if you want to write data back to a dataset, you'll need to mount it for both reading and writing. Mounting a dataset this way is an advanced operation and is covered in more detail in our support docs at this link. The article below builds on that documentation and expands on reading from and writing to a dataset from a Spark cluster.
There are two ways to mount a dataset in Domino: as an input (read-only) or as an output (writable). It is important to note that the same dataset can be mounted for input and output simultaneously, at different mount points.
To mount a dataset, we need to create a domino.yaml file that describes the paths at which the dataset will be mounted for reading (inputs) and writing (outputs). The schema of the domino.yaml file is shown here.
The example below uses a project in Domino named spark-read-data-from-local containing a dataset called sales-recs.
Now, we first want to mount the sales-recs dataset in our Spark cluster so that we can read from and write to it. In the Files section of the project, create a new file called domino.yaml and add the following contents:
datasetConfigurations:
- name: "WriteToDS"
  inputs:
  - path: "/josh-demo/input/"
  outputs:
  - path: "/josh-demo/output/"
The inputs section specifies the path at which the latest snapshot (by default) of the dataset will be mounted. If you wish to mount multiple snapshots of the same dataset for reading data, you can follow the steps mentioned here. The outputs section specifies the path at which an empty directory will be created that you can write to. Note that Domino automatically prepends /domino/datasets/ to each input and output path.
So, the files in your latest snapshot will be available for reading at /domino/datasets/josh-demo/input/.
Next, let’s launch a new workspace and attach a Spark cluster to it. Before you hit the Launch button, expand the Datasets accordion, navigate to the Advanced tab, and select “WriteToDS” (the dataset configuration name from the yaml file) from the drop-down menu as follows:
This will mount the dataset and make the inputs and outputs available to the workspace/cluster.
Once the workspace launches, you can expand the Workspace Settings tab and look at the input and output paths that have been mounted to read/write data.
As specified in the domino.yaml file, the input directory is mounted at /domino/datasets/josh-demo/input. To read the .csv file, we can open a spark-shell and load it using:
val df = spark.read.csv("/domino/datasets/josh-demo/input/1000_sales_records.csv")
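If the file includes a header row, the reader can also be told to use it and to infer column types. A minimal sketch using the same file path as above (the option names are standard Spark DataFrameReader options):

```scala
// Read the CSV using its header row for column names,
// letting Spark infer each column's type.
// `spark` is the SparkSession provided by spark-shell.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/domino/datasets/josh-demo/input/1000_sales_records.csv")

df.printSchema() // inspect the inferred columns
```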
The output directory is mounted at /domino/datasets/josh-demo/output. One of the ways in which we can write to that path is as follows:
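A minimal sketch, assuming the DataFrame `df` read earlier and an output subdirectory name (sales_out) chosen here for illustration:

```scala
// Write the DataFrame to the mounted output directory as CSV.
// Note that Spark writes a directory of part files, not a single .csv file.
df.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/domino/datasets/josh-demo/output/sales_out")
```

Once the workspace is committed, the part files under sales_out become part of the new snapshot.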
The last step is to commit the data back to the dataset as a new snapshot. To do so, stop and commit the workspace, and your files will be available as a new snapshot in the sales-recs dataset:
In addition to reading and writing to a dataset, I also want to highlight Dataset scratch spaces. Scratch spaces are exactly what they sound like: temporary spaces where you can read and write intermediate or experimental files. I encourage you to read more about this feature here: https://docs.dominodatalab.com/en/4.2/reference/data/datasets/Datasets_Scratch_Spaces.html#datasets-scratch-spaces.