High Betweenness Set Extraction

Betweenness Centrality is a useful property when searching for critical nodes in a network. The exact calculation of betweenness is however computationally expensive, which leaves us with the need for approximate methods.

The picture above shows the approximate betweenness of the given graph. You can see some nodes are a lot more critical than others and their betweenness value reflects that.

Algorithm

  1. Shortest paths are calculated from a set of pivots (source nodes) to all other nodes in the graph.
  2. The dependency of each vertex for each pivot is calculated and added to a running total approximated betweenness.
  3. The highest betweenness set is extracted.
  4. Set stability conditions are checked. If met algorithm exits, otherwise a new pivot batch is chosen and step 1 starts again.

Benchmarks and Trade Offs

This method has 4 primary parameters that can greatly effect the run time and accuracy of the results.

To characterize the behavior of these parameters we've run a series of tests of 5 data sets.

All tests below were run on an 8-node cluster using 8 giraph workers using 8 threads per worker.

What kind of data can be used?

You can use any kind of data with HBSE. It can be directional or undirected. Keep in mind, that an undirected graph will make it even more computational heavy, thus your job will take much longer.

For background information on the approximation method see: "W. Chong, Efficient Extraction of High-Betweenness Vertices"

For background information on the method of accumulation of pair dependencies and shortest path data see: "U. Brandes, A Faster Algorithm for Betweenness Centrality"

Implementation

Our implementation is distributed using the BSP method on Apache giraph and built with Gradle. Giraph requires a Hadoop cluster (map reduce and hdfs)

I/O

The graph is read from an edge list file in HDFS, each line in the file represents a single edge and is coded as a "key,target,weight" (edge weight is optional and defaults to 1)

Output is a directory in HDFS with both the extracted high betweenness set and the approximate betweenness value for all nodes.

High Betweenness Set Extraction Configuration

Property Required Description
betweenness.output.dir N Sets the betweenness set output directory.
betweenness.shortest.path.phases N Sets the number of shortest path phases that the algorithm should run through. The default is 1.
betweenness.set.stability N Sets the stability cutoff point. Defaults to 0.
betweenness.set.stability.counter N Counter for the stability cutoff point.
betweenness.set.maxSize N Sets the maximum number of nodes in a betweenness set.
pivot.batch.size N The percentage of pivots to select out of the nodes. Must be a decimal between 0.0 and 1.0.
pivot.batch.size.initial N The percentage of pivots to select initially. Must be a decimal between 0.0 and 1.0.
pivot.batch.random.seed N Seed the random number generator for pivot selection.
vertex.count N Sets the total number of vertices to perform the algorithm on.

How To Run

Note: First see How To Build

How To Run DGA-Giraph
    
        $ ./bin/dga-giraph hbse /path/to/input /path/to/output
    
How To Run HBSE with DGA-GraphX
    
        $ ./dga-graphx hbse -i hdfs://url.for.namenode:port/path/to/input -o hdfs://url.for.namenode:port/path/to/output