In this example, we will be running PageRank with DGA.
All set? Now we need to deploy this out to our cluster.
$ scp -r dga-giraph/build/dist/ hostname:/path/on/disks
Now, let's scp our data out to the cluster. Navigate to the directory you downloaded the file to and run the command below.
$ scp example.csv hostname:/path/on/disks
Next, we need to ssh into our cluster.
$ ssh hostname
Now let's see if our files made it to the cluster. Run the command below and you should see a dist folder and example.csv.
$ ls -al
If everything checks out, we can now copy our data set to a directory in HDFS. For this example, we will create a directory under /tmp for the input.
$ hadoop fs -mkdir -p /tmp/dga/pr/input/
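The -p flag here behaves like the local mkdir -p: parent directories are created as needed, and re-running the command on a path that already exists is not an error. A quick local demonstration of the same semantics (using a throwaway directory name):

```shell
# mkdir -p creates the whole chain of parents in one go.
mkdir -p demo/tmp/dga/pr/input
# Re-running is harmless (idempotent) -- no "directory exists" error.
mkdir -p demo/tmp/dga/pr/input
ls demo/tmp/dga/pr     # prints: input
rm -r demo             # clean up the throwaway directory
```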
No need to create the output directory. That will be done for us when our job is complete.
Now let's copy our data into HDFS.
$ hadoop fs -copyFromLocal example.csv /tmp/dga/pr/input/
Finally, we can run our analytic! The command below uses the built-in DGARunner to run PageRank.
$ cd /opt/dga/
$ ./bin/dga-giraph pr /tmp/dga/pr/input/ /tmp/dga/pr/output/ -w 1 -ca io.edge.reverse.duplicator=true
The command above runs the dga-giraph-0.0.1.jar and executes the DGARunner class, passing in 5 command-line arguments: pr (the analytic to run), the input path, the output path, -w 1 (the number of Giraph workers), and -ca io.edge.reverse.duplicator=true (a custom argument).
Is it done yet? If so, let's see the results!
$ mkdir results/
$ cd results
$ hadoop fs -copyToLocal /tmp/dga/pr/output/* .
What are all these part files? Don't worry, let's combine them into one. Note: you might need to open a subdirectory to see the part files; use the cd command to navigate.
$ cat part-* >> bigfile.txt
$ vi bigfile.txt
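The merged file lists one vertex per line along with its PageRank score. Assuming the output is tab-separated with the vertex id first and the score second (check the delimiter in your own file), you can pull out the highest-ranked vertices with sort. A sketch on hypothetical sample data:

```shell
# Hypothetical sample of merged PageRank output: vertex id, tab, score.
# Your real scores will differ.
printf '1\t0.15\n2\t2.70\n3\t0.85\n' > sample.txt
# Sort numerically on the second field, highest score first,
# and show the top entries.
sort -k2,2 -rn sample.txt | head -n 5
rm sample.txt   # clean up the sample file
```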
And there you have it! You ran your first analytic with DGA!