FAQ: How do I read and write to a dataset from my Spark-on-demand cluster in Domino?

josh.mineroff Member, Moderator, Domino Posts: 9 mod
edited September 14 in Q&A

FAQ: I want to read files from a dataset in my project and write some files back to it from my Spark-on-demand cluster in Domino. How can I do that?

Josh Mineroff | Field Engineer


Answers

  • akshay.ambekar Administrator, Moderator, Domino Posts: 21 admin
    edited September 9

    By default, datasets are mounted read-only; if you want to write data back to a dataset, you'll need to mount it for both reading and writing. Mounting a dataset this way is an advanced operation and is covered in more detail in our support docs at this link. The answer below builds on that documentation and walks through reading from and writing to a dataset from a Spark cluster.

    Refresher on mounting a dataset

    There are two ways to mount a dataset in Domino:

    1. Mounting a Dataset as an input Dataset makes the contents of a specific snapshot (the most recent, by default) available in a directory at the specified mount point in your Workspace or Run. A Dataset mounted only for input cannot be modified.
    2. Mounting a Dataset as an output Dataset creates an empty directory at the specified mount point; when your Run or Workspace is stopped, a new snapshot is written to the Dataset with the contents of that directory. Note that the new snapshot contains exactly the files that are in the mounted output directory when the Run or Workspace stops; snapshots do not append by default.

    It is important to note that the same Dataset can be mounted for input and output simultaneously at different mount points.

    To mount a dataset, create a domino.yaml file that describes the paths from which the dataset will be read (inputs) and to which it will be written (outputs). The schema of the domino.yaml file is shown here.

    The example below uses a Domino project named spark-read-data-from-local with a dataset called sales-recs.


    Now, we first want to mount the sales-recs dataset in our Spark cluster so that we can read from and write to it. In the Files section of our project, create a new file called domino.yaml and add the following contents:

    datasetConfigurations:
      - name: WriteToDS
        inputs: 
          - path: "/josh-demo/input/"
            dataset: "sales-recs"
        outputs: 
          - path: "/josh-demo/output/"
            dataset: "sales-recs"
    

    The inputs section specifies the path at which the latest snapshot (by default) of the dataset will be mounted. If you wish to mount multiple snapshots of the same dataset for reading data, you can follow the steps mentioned here. The outputs section specifies the path at which an empty directory will be created that you can write to. Note that Domino automatically prepends the /domino/datasets/ path to each input and output path.

    So, the files in your latest snapshot will be available for reading at /domino/datasets/josh-demo/input/.
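    As an aside, if you only need to read from a dataset, you can omit the outputs section entirely. A minimal read-only sketch using the same schema (the configuration name ReadOnlyDS here is illustrative) would look like:

    datasetConfigurations:
      - name: ReadOnlyDS
        inputs:
          - path: "/josh-demo/input/"
            dataset: "sales-recs"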

    Next, let’s launch a new workspace and attach a Spark cluster to it. Before you hit the Launch button, expand the Datasets accordion, navigate to the Advanced tab, and select “WriteToDS” (the dataset configuration name from the yaml file) from the drop-down menu.

    This will mount the dataset and make the inputs and outputs available to the workspace/cluster. 

    Once the workspace launches, you can expand the Workspace Settings tab and look at the input and output paths that have been mounted to read/write data.
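    As a quick sanity check (optional; the path assumes the domino.yaml shown above), you can list the mounted input directory from a spark-shell to confirm the dataset is attached:

    // List the mounted input directory to confirm the dataset is attached.
    // The path assumes the domino.yaml configuration shown above.
    import java.nio.file.{Files, Paths}
    Files.list(Paths.get("/domino/datasets/josh-demo/input/")).forEach(p => println(p))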

    Reading from the dataset:

    As specified in the domino.yaml file, the input directory is mounted at /domino/datasets/josh-demo/input. To read the CSV file, start a spark-shell and run:

    // Read the CSV from the mounted input path into a DataFrame
    val df = spark.read.csv("/domino/datasets/josh-demo/input/1000_sales_records.csv")
    df.head(10)
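    If your CSV has a header row, a common variant (a sketch; adjust the options to your file) is to let Spark use the header for column names and infer column types:

    // Variant: treat the first line as a header and infer column types
    val dfWithHeader = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/domino/datasets/josh-demo/input/1000_sales_records.csv")
    dfWithHeader.printSchema()
    dfWithHeader.show(10)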
    

    Writing to a new snapshot of the dataset:

    The output directory is mounted at /domino/datasets/josh-demo/output. One way to write to that path is:

    // Write the DataFrame as CSV; Spark creates a directory at this path
    // containing one part file per partition
    df.write.format("csv").save("/domino/datasets/josh-demo/output/analyzed_sales-recs.csv")
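    Note that Spark writes a directory of part files rather than a single file. If you would rather end up with one CSV (reasonable for small results), one option is to coalesce to a single partition before writing; the output name below is illustrative:

    // Coalesce to one partition so Spark writes a single part file,
    // add a header row, and overwrite the path if it already exists
    df.coalesce(1)
      .write
      .option("header", "true")
      .mode("overwrite")
      .csv("/domino/datasets/josh-demo/output/analyzed_sales_records")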

    Finally, commit the data back to the dataset as a new snapshot. To do so, stop and commit the workspace; your files will then be available as a new snapshot in the sales-recs dataset.

    In addition to reading from and writing to a dataset, I also want to highlight Dataset scratch spaces. Scratch spaces are exactly what they sound like: temporary spaces that you can read and write intermediate or experimental files to. I encourage you to read more about this feature here: https://docs.dominodatalab.com/en/4.2/reference/data/datasets/Datasets_Scratch_Spaces.html#datasets-scratch-spaces.
