Distributed Graph Analytics

Walkthrough With Data

Prerequisites

Let's Get Started

In this example, we will be running Weakly Connected Components with DGA.
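
To make the goal concrete, here is a minimal sketch of what Weakly Connected Components computes: vertices linked by edges in either direction end up with the same component label. This is an illustration only, not DGA's implementation. The helper name `wcc_labels` is hypothetical, and the comma-separated `source,target` layout is an assumption.

```shell
# Illustrative only, not DGA's implementation: label every vertex with the
# smallest vertex id (in string order) found in its weakly connected component.
# Assumes a comma-separated edge list with source and target in columns 1 and 2.
wcc_labels() {
  awk -F',' '
    NF >= 2 {
      n++
      src[n] = $1; dst[n] = $2
      label[$1] = $1; label[$2] = $2   # every vertex starts in its own component
    }
    END {
      changed = 1
      while (changed) {                # repeat until no label moves
        changed = 0
        for (i = 1; i <= n; i++) {
          a = label[src[i]] ""         # appending "" forces string comparison
          b = label[dst[i]] ""
          if (a < b)      { label[dst[i]] = a; changed = 1 }
          else if (b < a) { label[src[i]] = b; changed = 1 }
        }
      }
      for (v in label) print v "," label[v]
    }
  ' "$1" | sort
}

# Small demo on a throwaway edge list.
edges=$(mktemp)
printf 'a,b\nb,c\nd,e\n' > "$edges"
wcc_labels "$edges"
rm -f "$edges"
```

In the demo, vertices a, b and c share the label a, while d and e share the label d. Edge direction is ignored, which is what "weakly" connected means.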

First, let's get some sample data. Already have data you want to use? That's great! Make sure it follows this format and it will work with DGA.


    $ wget http://sotera.github.io/distributed-graph-analytics/data/example.csv
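
Before loading the file, a quick sanity check can catch obviously malformed lines. The sketch below assumes a comma-separated edge list of the form `source,target[,weight]`; that layout and the helper name `check_edge_list` are assumptions for illustration, not something DGA provides.

```shell
# Hypothetical helper, not part of DGA: flag lines that do not look like
# "source,target[,weight]". The comma delimiter and column order are assumptions.
check_edge_list() {
  awk -F',' '
    NF < 2 || $1 == "" || $2 == "" {
      printf "line %d looks malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Only run the check if the sample file has been downloaded.
if [ -f example.csv ]; then
  check_edge_list example.csv && echo "example.csv looks OK"
fi
```

The function exits non-zero when it finds a suspicious line, so it can also be used as a guard in a script.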

If everything checks out, we can now copy our data set to a directory in HDFS. For this example we will create a directory under /tmp for the input. You don't need to use this directory every time.


    $ hadoop fs -mkdir -p /tmp/dga/wcc/input/

There is no need to create the output directory. It will be created for us by the job.

Now let's copy our data onto HDFS.


    $ hadoop fs -copyFromLocal example.csv /tmp/dga/wcc/input/

Finally, we can run our analytic! The command below uses the built-in DGARunner to run Weakly Connected Components.


    $ cd /opt/dga/
    $ ./bin/dga-giraph wcc /tmp/dga/wcc/input/ /tmp/dga/wcc/output/ -w 1 -ca io.edge.reverse.duplicator=true

The command above runs dga-giraph-0.0.1.jar and executes the DGARunner class. It passes in five command-line arguments:

wcc - Tells the DGARunner which analytic it needs to run. This is required.
/tmp/dga/wcc/input/ - Tells the DGARunner where the input data is located. This is required.
/tmp/dga/wcc/output/ - Tells the DGARunner where to output the data. This is required.
-w 1 - This gets passed to Giraph. It tells Giraph how many workers to use. This is optional.
-ca io.edge.reverse.duplicator=true - This tells our input format to add a reversed copy of every edge, so the directed input is treated as an undirected graph when finding weakly connected components. This is optional.
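
The effect of the edge duplication option can be pictured with a small sketch. It is an illustration of the idea only, not how DGA's input format is implemented. The helper name `duplicate_reverse_edges` is hypothetical, the comma-separated `source,target` layout is an assumption, and any weight column is dropped for brevity.

```shell
# Illustration only: emit each edge followed by its reverse, which is roughly
# what an edge-reversing input format would present to the analytic.
duplicate_reverse_edges() {
  awk -F',' 'BEGIN { OFS = "," } NF >= 2 { print $1, $2; print $2, $1 }' "$1"
}

# Small demo on a throwaway edge list.
edges=$(mktemp)
printf 'a,b\nb,c\n' > "$edges"
duplicate_reverse_edges "$edges"
rm -f "$edges"
```

With both directions present, a vertex can reach its neighbor regardless of which way the original edge pointed.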

Is it done yet? If so, let's see the results!


    $ cd
    $ mkdir results/
    $ cd results
    $ hadoop fs -copyToLocal /tmp/dga/wcc/output/* .

What are all these parts? Don't worry, let's make them one! Note: You might need to open up a subdirectory to see the parts. Use the cd command to navigate.


    $ cat part-* >> bigfile.txt
    $ vi bigfile.txt
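
Scrolling through a large file is not always practical, so a short summary can help. The sketch below counts how many vertices fall into each component. It assumes every output line has the form `vertex,component`; that layout, the default comma delimiter, and the helper name `component_sizes` are assumptions, so check a few lines of your own output first and pass a different delimiter as the second argument if needed.

```shell
# Hypothetical helper: count vertices per component id.
# $1: merged output file; $2: field delimiter (assumed, defaults to a comma).
component_sizes() {
  awk -F"${2:-,}" '
    NF >= 2 { size[$2]++ }
    END { for (c in size) print c, size[c] }
  ' "$1" | sort -k2,2nr -k1,1
}

# Only summarize if the merged file from the previous step exists.
if [ -f bigfile.txt ]; then
  component_sizes bigfile.txt
fi
```

Each output line shows a component id followed by its vertex count, largest component first.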

And there you have it! You ran your first analytic with DGA!