PageRank is an algorithm that determines the importance of a particular node in a graph.


  1. Initialize the same value to each vertex.
  2. Calculate your rank based on the initial value divided by the number of edges you have.
  3. Send your rank to all your edges.
  4. Accumulate all incoming messages and calculate your new rank.
  5. Repeat step 2.
  6. Stop whenever the max change is less than an epsilon value of 0.001.

The picture above shows the PageRank of that specific graph. As you can see, the more times a node references another, the higher the PageRank. The nodes that are referenced the least, have the least PageRank.

What kind of data can be used?

Any kind of data can be used for PageRank. The PageRank algorithm results will be different if you make it undirected, but if that's what you want to do, then it will work.

Output Format

The output format will output each vertex and it's page rank.

PageRank Configuration

Property Required Description
damping.factor N (Default: 0.85f) The PageRank theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85.

How To Run

Note: First see How To Build

How To Run DGA-Giraph
        $ ./bin/dga-giraph pr /path/to/input /path/to/output
How To Run HBSE with DGA-GraphX
        $ ./dga-graphx pr -i hdfs://url.for.namenode:port/path/to/input -o hdfs://url.for.namenode:port/path/to/output