HPC Graph Analysis

Massive Graphs: Big Compute meets Big Data (SIAM AN12 Minisymposium)

Hyatt Regency
Minneapolis, MN

9-13 July 2012

Scope and Goals:

Large graph analytics have become an increasingly important in a wide variety of application areas such as internet search, bioinformatics, social media, and cybersecurity. Massive graphs push the state of the art in both big compute and big data. This mini-symposium will explore the algorithms, tools and technologies being developed to address these leading edge problems. The mini-symposium will include a mix of invited talks from established practitioners and early career professionals.

Location:

AN12 logoThis workshop is co-located with SIAM AN12, held 9-13 July 2012, at the Hyatt Regency in Minneapolis, MN. Registration information for AN12 can be found at here.



Program:

Tuesday, July 10

10:30 AM - 12:30 PM
Room: TBD

10:30-10:55 Streaming Graph Analytics for Massive Graphs
David A. Bader, David Ediger, and Jason Riedy, Georgia Institute of Technology, USA
Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. The volume and richness of data combined with its rate of change renders monitoring properties at scale by static recomputation infeasible. We approach these problems with massive, fine-grained parallelism across different shared memory architectures both to compute solutions and to explore the sensitivity of these solutions to natural bias and omissions within the data.
 
 
11:00-11:25 Fast Counting of Patterns in Graphs
Ali Pinar, C. Seshadhri, and Tamara G. Kolda, Sandia National Laboratories, USA
The occurrence of small subgraphs is a key ingredient in understanding real networks, since local interactions can reveal a lot of information about the full graph. For instance, the number of triangles indicates how well the graph is clustered. However, counting such small graphs can be computationally intensive, preventing adoption of such techniques on massive graphs. We are developing sampling-based algorithms for counting small sugbgraphs graphs. In this talk, we will present our latest results.
 
 
11:30-11:55 Statistical Models and Methods for Anomaly Detection in Large Graphs
Nicholas Arcolano, Massachusetts Institute of Technology, USA
Traditional statistical models and methods provide a powerful framework for analyzing observed data. To extend these approaches to large graph-valued data sets, however, we must address significant theoretical and computational challenges. In this talk, we discuss new techniques for statistical inference on large graphs, with application to the detection of anomalous behavior in a citation database.
 
 
12:00-12:25 Perfect Power Law Graphs: Generation, Sampling, Construction and Fitting
Jeremy Kepner, Massachusetts Institute of Technology, USA
Power law graphs are ubiquitous and arise in the Internet, the Web, citation graphs, and online social networks. Deviations from the power law model do not have a large impact on the qualitative characterization of graphs. However, the need for precise background models becomes essential in order to apply rigorous statistical signal processing techniques to detect graph anomalies. This work explores the power law degree distribution that is used to model power law graphs. A simple heuristic for generating perfect power law graphs is presented that reproduces many observed phenomena. Using this model the sampling effects of graph construction and edge ordering are examined. Graph construction appears to be a lossy, non-linear process that generates many phenomena that are observed in real data (e.g., data bin scatter, low-degree tails, and high-degree tails). Likewise, the order that edges are generated can have a strong effect on the evolution of the power law slope and the overall density of the graph (i.e., densification). Applying the perfect power law model to real data (e.g., entities extracted from the Reuters Corpus) provides a self-consistent set of degree bins for measuring deviations from the background. Using this scheme it is possible to identify specific edges that are typical, surplus, or deficit using standard signal processing techniques.
 
Tuesday, July 10

4:00 PM - 6:00 PM
Room: TBD

4:00-4:25 Networks, Communities and the Ground-Truth
Jaewon Yang and Jure Leskovec, Stanford University, USA
The Web, society, information, cells and brain can all be represented and studied as complex networks of interactions. Nodes in such networks tend to organize into clusters and communities, which represent the fundamental structures for understanding the organization of complex systems. Even though detection of network communities is of significant importance for computer science, sociology and biology, our understanding of the community structure of large networks remains limited. We study a set of more than 200 large networks with the goal to understand and identify communities in networks. We challenge the conventional view of network community structure and show that it is not exhibited by the large real-world networks. We then present a new conceptual model of network community structure, which reliably captures the overall structure of networks and accurately identifies the overlapping nature of network communities.
 
 
4:30-4:55 Influence Propagation on Large Graphs
Aditya Prakash, Carnegie Mellon University, USA
Given a network of who-contacts-whom, will a contagious virus/product/meme 'take-over' (cause an epidemic) or die-out quickly? What will change if nodes have partial, temporary or permanent immunity? What if the underlying network changes over time (e.g., people have different connections during the day at work, and during the night at home)? Propagation style processes can model well many real-world scenarios like epidemiology and public health, information diffusion etc. We present results on understanding the tipping-point behavior of epidemics (enabling among other things faster simulations), predicting who-wins among competing viruses, and developing effective algorithms for immunization and marketing for several large-scale real-world settings. Finally, we collected and analyzed Twitter data over multiple months, to understand the role of external effects in how different hashtags (topics) spread. We showed fundamental differences in the dynamics of hashtags e.g. we found that hashtags of a political nature are primarily driven by outside influences, even though many people may be tweeting about them.
 
 
5:00-5:25 High-Performance Metagenomic Data Clustering and Assembly
Kamesh Madduri, Pennsylvania State University, USA
We present a novel de Bruijn graph-based parallel assembly approach for assembling metagenomic reads. The assembler employs a maximum-likelihood framework that handles the challenges posed by increased polymorphism and over-representation of conserved regions. It also provides tolerance for errors due to the incomplete and fragmentary nature of the reads. We also propose a two-way, multi-species, multi-dimensional Poisson mixture model-based method for representing reads from a metagenome. We use this method to cluster metagenomic reads by their species of origin, and to characterize the abundance of each species.
 
 
5:30-5:55 Scaling Graph Computations at Facebook
John Ugander, Cornell University, USA
With more than 800 million active users and 68 billion edges, scalability is at the forefront of concerns when dealing with the Facebook social graph. This talk will discuss two recent advances. First, through the development of a novel graph sharding algorithm for load-balancing graph computations, we were able to reduce the query time of a realtime system by 50%. Second, when computing the average graph distance between users on Facebook, empirical advances in the compression of the Facebook graph was a key step in speeding up the computation.
 
 
Wednesday, July 11

10:30 AM - 12:30 PM
Room: TBD

10:30-10:55 Graph Analytics for Subject-Matter Experts: Balancing Standards, Simplicity, and Complexity
Steve Reinhardt, Cray, Inc., USA
Subject-matter experts needing to analyze large graphs ideally want the abilities to incorporate data from disparate sources, perform simple analyses to find data needing deeper analysis, and execute complex analyses on that data, all via clear, reliable, and widely-available interfaces. Knowledge Discovery Toolkit (KDT) enables complex analyses on very large data (beyond what will reside in the memory of a single cluster node) through procedural interfaces, though it does not (yet) ingest numerous data formats. Emerging semantic-web technologies are starting to deliver the promise of straightforward fusing of disparate data sources through standard interfaces and ontologies, yet the current SPARQL 1.1 language does not support the calculation of complex graph metrics, such as betweenness centrality. This talk will cover our work to merge the ease-of-use of the semantic-web technologies with the complex analytics of KDT.
 
 
11:00-11:25 Extended Sparse Matrices as Tools for Graph Computation
Adam Lugowski, University of California, Santa Barbara, USA
Sparse matrices frame a powerful graph computation architecture. Extensions such as complex elemental types, runtime filtering and arbitrary semiring operations greatly expand the utility of sparse matrices for graph analytics. Implementations must also match available hardware to be useful. Individual shared-memory machines, distributed-memory clusters and cloud HPC systems all raise unique challenges, but significant commonality remains among them. This talk will cover how we address these challenges in our Knowledge Discovery Toolbox.
 
 
11:30-11:55 Version 2.0 of the Parallel Boost Graph Library: Message-Driven Solutions to Data-Driven Problems
Nick Edmonds, Indiana University, USA
Data-driven applications, such as graph analytics, are unique in that the computational structure of the applications is entirely dependent on the input data. This fact makes static analysis difficult and necessitates dynamic, lightweight, execution-time solutions. Version 2.0 of the Parallel Boost Graph Library (PBGL) demonstrates one approach to building a modular, scalable, and perhaps most importantly, performance portable set of graph kernels. The PBGL 2.0 uses the AM++ active messaging library (an implementation of the Active Pebbles programming and execution model) to provide portable, generic, thread-safe messaging support on a variety of platforms. On top of AM++, PBGL 2.0 provides a variety of graph types, auxiliary data structures, and algorithmic kernels suitable for both shared- and distributed-memory parallelism (e.g. threads and processes) with the potential for straightforward extension to various types of accelerators.
 
 
12:00-12:25 Massive Graphs: The Way Forward
Aydin Buluç, Lawrence Berkeley National Laboratory, USA
Large graph analytics have become an increasingly important in a wide variety of application areas such as internet search, bioinformatics, social media, and cybersecurity. Massive graphs push the state of the art in both big compute and big data. This presentation will collect and present the outstanding questions in this field that have been raised over the course of the mini-symposium.
 
 

Workshop Organizers: