This project is an implementation of the Louvain Community Detection algorithm described in "Fast unfolding of communities in large networks" which:
This process can then be repeated to build several community-aggregated versions of the same original graph.
Identifying communities in large networks based on graph structure is difficult to eyeball and computationally hard, but is useful in understanding structure and strength of community metrics which may adhere to real-world relationships or constraints. Furthermore, big-data size graph compression allows for analysis of networks at various aggregation levels which is useful in guided network analysis and drill-down operations.
Community Assignment phases of Louvain Modularity when applied to the Enron Email Data Set.
In this image each node represents an email address and color represents community.
Our implmentation, based off of Giraph and Hadoop, is distributed and can execute against data sets with hundreds of millions of nodes and edges in a relatively short amount of time. This implementation hopes to scale the original algorithm and process to the largest of data sets by utilizing cloud-based technologies.
There are two main parts to the job.
The map reduce and giraph job then run in a cycle, detecting communities and compressing the graph until no significant progress is being made, and then exits.
Any kind of data can be used with Louvain. It doesn't necessarily need to be undirected, but it is preferred.
Property | Required | Description |
---|---|---|
minimum.progress | N | (Default: 0) The minimum delta X required to be considered progress, where X is the number of nodes that have changed their community on a particular pass. Delta X is then the difference in number of nodes that changed communities on the current pass compared to the previous pass. Using the default of 0 means that any delta is considered progress. |
progress.tries | N | (Default: 1) Number of times the minimum.progress setting is not met before exiting form the current level and compressing the graph. Default of 1 means the first time minimum.progress is not met the algorithm exits. |
Note: First see How To Build
$ ./bin/dga-giraph louvain /path/to/input /path/to/output
$ ./dga-graphx louvain -i hdfs://url.for.namenode:port/path/to/input -o hdfs://url.for.namenode:port/path/to/output