In this example, we will be running Leaf Compression with DGA.
All set? Now we need to deploy this out to our cluster.
$ scp -r dga-graphx/build/dist/ hostname:/path/on/disks
Now, let's scp our data out to the cluster. Navigate to the directory you downloaded the file to and run the command below.
$ scp example.csv hostname:/path/on/disks
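If you're curious what you just shipped, take a quick look at the local copy. DGA analytics generally expect a delimited edge list (one source/target pair per line), though the exact contents of example.csv depend on the file you downloaded, so treat this as a quick sanity check rather than a spec.
$ head -n 5 example.csv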
Next, we need to ssh into our cluster.
$ ssh hostname
Now let's see if our files made it to the cluster. Run the command below and you should see a dist folder and example.csv.
$ ls -al
If everything checks out, we can now copy our data set to a directory in HDFS. For this example we will create a directory under /tmp for the input.
$ hdfs dfs -mkdir -p /tmp/dga/pr/input/
No need to create the output directory; that will be created for us when the job writes its results.
Now let's copy our data onto HDFS.
$ hdfs dfs -copyFromLocal example.csv /tmp/dga/pr/input/
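If you want to confirm the file landed where we expect, a quick listing of the input directory will show it.
$ hdfs dfs -ls /tmp/dga/pr/input/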
Finally, we can run our analytic! The command below uses the built-in DGARunner to run Leaf Compression.
$ cd /opt/dga/
$ ./dga-graphx pr -i hdfs://scc.silverdale.dev:8020/tmp/dga/pr/input/example.csv \
    -o hdfs://scc.silverdale.dev:8020/tmp/dga/pr/output/ \
    -s /opt/spark -n NameOfJob -m spark://spark.master.url:7077 \
    --S spark.executor.memory=30g --ca parallelism=378 \
    --S spark.worker.timeout=400 --S spark.cores.max=126
The command above runs the dga-graphx-0.0.1.jar and executes the DGARunner class. It passes in five core command line arguments (the input path, output path, Spark home, job name, and Spark master URL), along with additional Spark settings supplied through the --S and --ca options.
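Once the job is submitted, one way to tell whether it has finished (assuming the output is written the usual Hadoop way) is to check whether the output directory has appeared on HDFS; a _SUCCESS marker file typically shows up once the write completes.
$ hdfs dfs -ls /tmp/dga/pr/output/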
Is it done yet? If so, let's see the results!
$ mkdir results/
$ cd results
$ hdfs dfs -copyToLocal /tmp/dga/pr/output/* .
What are all these parts? Don't worry, let's make them one! Note: You might need to open up a subdirectory to see the parts. Use the cd command to navigate.
$ cat part-* >> bigfile.txt
$ vi bigfile.txt
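As an aside, HDFS can also stitch the part files together for you in one step with getmerge, which skips the copyToLocal and cat steps entirely (the paths below simply reuse the ones from this example).
$ hdfs dfs -getmerge /tmp/dga/pr/output/ bigfile.txt
$ vi bigfile.txt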
And there you have it! You ran your first analytic with DGA!