On Sampling from Massive Graph Streams

We propose Graph Priority Sampling (GPS), a new paradigm for order-based reservoir sampling from massive streams of graph edges. GPS provides a general way to weight edge sampling according to auxiliary and/or size variables so as to accomplish various estimation goals of graph properties. In the context of subgraph counting, we show how edge sampling weights can be chosen so as to minimize the estimation variance of counts of specified sets of subgraphs...