Distributed Graph Analytics

Louvain Modularity

This project is an implementation of the Louvain Community Detection algorithm described in "Fast unfolding of communities in large networks" which:

Assigns communities to nodes in a graph based on graph structure and statistics.
Compresses the community-tagged graph into a smaller one.

This process can then be repeated to build several community-aggregated versions of the same original graph.

Identifying communities in large networks based on graph structure is difficult to eyeball and computationally hard, but is useful in understanding structure and strength of community metrics which may adhere to real-world relationships or constraints. Furthermore, big-data size graph compression allows for analysis of networks at various aggregation levels which is useful in guided network analysis and drill-down operations.

Community Assignment phases of Louvain Modularity when applied to the Enron Email Data Set.

In this image each node represents an email address and color represents community.

Algorithm

Each vertex receives community values from its community hub and sends its own community to its neighbors
Each vertex determines if it should move to a neighboring community or not and sends its information to its community hub
Each community hub re-calculates community totals and sends the updates to each community member
Compress each community such that they are represented by one node.

Algorithm Explanation

Our implmentation, based off of Giraph and Hadoop, is distributed and can execute against data sets with hundreds of millions of nodes and edges in a relatively short amount of time. This implementation hopes to scale the original algorithm and process to the largest of data sets by utilizing cloud-based technologies.

How is the code split up?

There are two main parts to the job.

A giraph job, that detects communities in a graph structure.
A map reduce job, that compresses a graph based on its community structure.

The map reduce and giraph job then run in a cycle, detecting communities and compressing the graph until no significant progress is being made, and then exits.

What kind of data can be used?

Any kind of data can be used with Louvain. It doesn't necessarily need to be undirected, but it is preferred.

Louvain Modularity Configuration

Property	Required	Description
minimum.progress	N	(Default: 0) The minimum delta X required to be considered progress, where X is the number of nodes that have changed their community on a particular pass. Delta X is then the difference in number of nodes that changed communities on the current pass compared to the previous pass. Using the default of 0 means that any delta is considered progress.
progress.tries	N	(Default: 1) Number of times the minimum.progress setting is not met before exiting form the current level and compressing the graph. Default of 1 means the first time minimum.progress is not met the algorithm exits.

How To Run

Note: First see How To Build

How To Run DGA-Giraph

    
        $ ./bin/dga-giraph louvain /path/to/input /path/to/output

How To Run HBSE with DGA-GraphX

    
        $ ./dga-graphx louvain -i hdfs://url.for.namenode:port/path/to/input -o hdfs://url.for.namenode:port/path/to/output