Hewlett Packard Enterprise Data Science Institute

2020-10-19

PandaSQL: Parallel Randomized Triangle Enumeration with SQL Queries

Triangles are an important pattern in large-scale graph analysis because of their practical use in many real-life applications. However, as networks grow, maintaining a balanced computational load becomes challenging, especially for problems such as triangle computation, because of skewed vertices. At the same time, database management systems (DBMSs) hold a huge amount of data that can be modeled and analyzed as graphs. With these motivations in mind, we developed PandaSQL, a novel approach that uses SQL queries to enumerate all the triangles in a given graph, based on the Randomized Triangle Enumeration Algorithm. Our approach is more elegant, abstract, and concise than implementations in traditional languages such as C++ or Python. Moreover, our partitioning queries ensure perfect load balancing. Thus, the triangle enumeration is independent, local, and parallel.
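
As a rough illustration of the idea, and not a reproduction of the PandaSQL queries themselves, the sketch below enumerates triangles with a SQL self-join and splits the work by randomly assigned vertex colors. The table names `E` and `color`, the function `enumerate_triangles`, and the parameters `num_colors` and `seed` are assumptions made for this example; SQLite stands in for a parallel DBMS, and the partitions are simply visited in a loop rather than run on separate machines.

```python
import itertools
import random
import sqlite3

# Triangle query restricted to one partition, identified by a color triple.
# Edges are stored once with u < v, so each triangle (u < v < w) appears once.
TRIANGLE_QUERY = """
SELECT e1.u, e1.v, e2.v
FROM E e1
JOIN E e2 ON e2.u = e1.v
JOIN E e3 ON e3.u = e1.u AND e3.v = e2.v
JOIN color cu ON cu.v = e1.u
JOIN color cv ON cv.v = e1.v
JOIN color cw ON cw.v = e2.v
WHERE cu.c = ? AND cv.c = ? AND cw.c = ?
"""


def enumerate_triangles(edges, num_colors=2, seed=0):
    """Return all triangles of an undirected graph as sorted (u, v, w) tuples."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")

    # Normalize: drop self-loops and duplicates, store each edge with u < v.
    norm = sorted({(min(a, b), max(a, b)) for a, b in edges if a != b})
    conn.execute("CREATE TABLE E (u INTEGER, v INTEGER)")
    conn.executemany("INSERT INTO E VALUES (?, ?)", norm)

    # Assign every vertex a random color (hypothetical partitioning scheme).
    vertices = sorted({x for e in norm for x in e})
    conn.execute("CREATE TABLE color (v INTEGER PRIMARY KEY, c INTEGER)")
    conn.executemany(
        "INSERT INTO color VALUES (?, ?)",
        [(v, rng.randrange(num_colors)) for v in vertices],
    )

    # Each color triple is an independent unit of work; a parallel system
    # could hand each one to a different worker.
    triangles = []
    for triple in itertools.product(range(num_colors), repeat=3):
        triangles.extend(conn.execute(TRIANGLE_QUERY, triple).fetchall())

    conn.close()
    return sorted(triangles)
```

Because every triangle has exactly one ordered color triple, each triangle is reported by exactly one partition, so the union over all partitions is the full triangle set regardless of how the colors fall. This toy version does not attempt to demonstrate the load-balancing property claimed above.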

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management