Distributed Graph Analytics

Walkthrough With Data

Prerequisites

Let's Get Started

In this example, we will be running Leaf Compression with DGA.

First, let's get some sample data from here. Already have data you want to use? That's great! Make sure it follows this format and it will work with DGA.

All set? Now we need to deploy the DGA build out to our cluster.


    $ scp -r dga-graphx/build/dist/ hostname:/path/on/disks

Now, let's scp our data out to the cluster. Navigate to the directory you downloaded the file to and run the command below.


    $ scp example.csv hostname:/path/on/disks

Next, we need to ssh into our cluster.


    $ ssh hostname

Now let's see if our files made it to the cluster. Run the command below from the directory you copied them to, and you should see a dist folder and example.csv.


    $ ls -al
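
If you would rather not eyeball the listing, the same check can be scripted. The following is a minimal sketch, assuming the files were copied as shown above; `check_deployed` is a hypothetical helper name and is not part of DGA.

```shell
# check_deployed DIR: hypothetical helper, not part of DGA.
# Reports whether the dist folder and example.csv are present in DIR
# (defaults to the current directory). Returns 0 when both exist, 1 otherwise.
check_deployed() {
    dir="${1:-.}"
    missing=0
    if [ ! -d "$dir/dist" ]; then
        echo "missing directory: $dir/dist"
        missing=1
    fi
    if [ ! -f "$dir/example.csv" ]; then
        echo "missing file: $dir/example.csv"
        missing=1
    fi
    return "$missing"
}

# Example usage, run from the directory you copied the files to:
#   check_deployed . && echo "ready to load into HDFS"
```

The function only reads the file system, so it is safe to rerun after each scp.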

If everything checks out, we can now copy our data set to a directory in HDFS. For this example we will create a directory in tmp for the input.


    $ hdfs dfs -mkdir -p /tmp/dga/pr/input/

No need to create the output directory. That will be done for us when our job is complete.

Now let's copy our data onto hdfs.


    $ hdfs dfs -copyFromLocal example.csv /tmp/dga/pr/input/

Finally, we can run our analytic! The command below uses the built-in DGARunner to run Leaf Compression. Replace the host names, paths, and input file name with the values for your own cluster and data.


    $ cd /opt/dga/
    $ ./dga-graphx pr \
        -i hdfs://scc.silverdale.dev:8020/tmp/dga/pr/input/edges.csv \
        -o hdfs://scc.silverdale.dev:8020/tmp/dga/pr/output/ \
        -s /opt/spark -n NameOfJob -m spark://spark.master.url:7077 \
        --S spark.executor.memory=30g --ca parallelism=378 \
        --S spark.worker.timeout=400 --S spark.cores.max=126

The command above runs the dga-graphx-0.0.1.jar and executes the DGARunner class. It passes in the following command line arguments.

pr - Tells the DGARunner which analytic to run. This is required.
-i hdfs://scc.silverdale.dev:8020/tmp/dga/pr/input/edges.csv - Tells the DGARunner where the input data is located. This is required.
-o hdfs://scc.silverdale.dev:8020/tmp/dga/pr/output/ - Tells the DGARunner where to write the output. This is required.
-s /opt/spark - Tells the DGARunner where $SPARK_HOME is located.
-n NameOfJob - Tells the DGARunner what to name the job.
-m spark://spark.master.url:7077 - Tells the DGARunner the Spark master URL.
--S spark.executor.memory=30g - Sets a system property that tells Spark how much memory each executor may use.
--ca parallelism=378 - Tells the DGARunner how many tasks to run in parallel.
--S spark.worker.timeout=400 - Sets a system property that tells Spark how many seconds to wait before considering a worker node dead.
--S spark.cores.max=126 - Sets a system property that tells Spark the maximum number of cores to use.
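
Because the command carries so many arguments, it can help to keep each one in a named variable and print the assembled command before running it. The following is a rough sketch, assuming only the flags described above; the variable names and the `build_dga_command` helper are made up for illustration, and every value is a placeholder taken from this walkthrough.

```shell
# Placeholder values copied from the walkthrough; substitute your own.
ANALYTIC="pr"
INPUT="hdfs://scc.silverdale.dev:8020/tmp/dga/pr/input/edges.csv"
OUTPUT="hdfs://scc.silverdale.dev:8020/tmp/dga/pr/output/"
SPARK_DIR="/opt/spark"
JOB_NAME="NameOfJob"
MASTER_URL="spark://spark.master.url:7077"
EXECUTOR_MEMORY="30g"
PARALLELISM="378"
WORKER_TIMEOUT="400"
CORES_MAX="126"

# build_dga_command: hypothetical helper, not part of DGA.
# Prints the dga-graphx command line on one line without executing it.
build_dga_command() {
    printf '%s' "./dga-graphx $ANALYTIC -i $INPUT -o $OUTPUT"
    printf '%s' " -s $SPARK_DIR -n $JOB_NAME -m $MASTER_URL"
    printf ' --S spark.executor.memory=%s' "$EXECUTOR_MEMORY"
    printf ' --ca parallelism=%s' "$PARALLELISM"
    printf ' --S spark.worker.timeout=%s' "$WORKER_TIMEOUT"
    printf ' --S spark.cores.max=%s\n' "$CORES_MAX"
}

# Dry run: show the command so it can be reviewed first.
build_dga_command
```

Printing the command rather than running it makes it easy to spot a wrong path or host name; once it looks right, copy the printed line into the shell from /opt/dga/.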

Is it done yet? If so, let's see the results!


    $ mkdir results/
    $ cd results
    $ hdfs dfs -copyToLocal /tmp/dga/pr/output/* .

What are all these parts? Don't worry, let's make them one! Note: You might need to open up a subdirectory to see the parts. Use the cd command to navigate.


    $ cat part-* >> bigfile.txt
    $ vi bigfile.txt
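
If the parts ended up in a subdirectory, or you expect to rerun the merge, a small function can save some typing. This is a minimal sketch, assuming the part files are named part-* as above; `merge_parts` is a hypothetical helper and not part of DGA.

```shell
# merge_parts SRC_DIR OUT_FILE: hypothetical helper, not part of DGA.
# Concatenates every part-* file found under SRC_DIR (subdirectories included)
# into OUT_FILE, in sorted path order.
merge_parts() {
    src_dir="$1"
    out_file="$2"
    # Start from an empty file so that rerunning does not duplicate rows.
    : > "$out_file"
    find "$src_dir" -type f -name 'part-*' | sort | while IFS= read -r part; do
        cat "$part" >> "$out_file"
    done
}

# Example usage, run from the results directory:
#   merge_parts . bigfile.txt
```

Unlike the `>>` redirect, this truncates bigfile.txt first, so running it twice gives the same file. HDFS also offers `hdfs dfs -getmerge <src> <localdst>`, which merges the parts while copying them to the local disk.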

And there you have it! You ran your first analytic with DGA!