Sampling for Big Data Outline ◊ Motivating application: sampling in large ISP networks ◊ Basics of sampling: concepts and estimation ◊ Stream sampling: uniform and weighted case – Variations: Concise sampling, sample and hold, sketch guided BREAK ◊ Advanced stream sampling: sampling as cost optimization – VarOpt, priority, structure aware, and stable sampling ◊ Hashing and coordination – Bottom-k, consistent sampling and sketch-based sampling ◊ Graph sampling – Node, edge and subgraph sampling ◊ Conclusion and future directions

MOTIFs: Cliques, k-plexes, k-cores and other communities are subgraphs defined in terms of the count of their edges (an internal count). Subgraph is a Motif if its isomorphic occurrences in the graph (an external count) is higher than “expected”. Wikipedia: motifs are defined as recurrent and statistically significant sub-graphs or patterns. Network motifs are sub-graphs that repeat themselves in a specific network or even among various networks. Each of these sub-graphs, defined by a particular edge pattern between vertices, may mean that a particular function is achieved efficiently. Indeed, motifs are of importance because they may reflect functional properties Motif detection is computationally challenging. Most algorithms for Motif discovery are used to find induced Motifs (induced sub-graphs). Graph G′ is a sub-graph of G (G′⊆G) if V′⊆V and E′⊆E∩(V′×V′). If G′⊆G and G′ contains all of the edges ‹u,v›∈E with u,v ∈V′, then G′ is an induced sub-graph of G. G′ and G are isomorphic (G′↔G), if there exists a bijection (one-to-one) f:V′→V with ‹u,v› ∈E′ ⇔‹f(u), f(v)› ∈E u,v∈V′. When G″⊂G and an isomorphism between sub-graph G″ and a graph G′, this mapping represents an appearance of G′ in G. The number of appearances of graph G′ in G is called the frequency FG of G′ in G. G is recurrent or frequent in G, when its frequency FG(G′) is above a predefined threshold or cut-off value. We used terms pattern and frequent sub-graph in this review interchangeably. motif discovery methods can be classified as exact counting, sampling, pattern growth methods and so on. Motif discovery has 2 steps: calc # of occurrences then, evaluating significance. Mfinder implements full enumeration and sampling. Until 2004, the only exact counting method for network motif detection was brute-force proposed by Milo et al.[3] It was successful at discovering small motifs, but for finding even size 5 or 6 motifs it was not computationally feasible. Hence, a new approach to this problem was needed. Kashtan et al. [9] sampling NM algorithm, based on edge sampling throughout the network, estimates concentrations of induced sub-graphs and can be utilized in directed or undirected networks. The sampling procedure starts from an arbitrary edge that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. Then, it continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. In sampling, taking unbiased samples is important. Kashtan et al. proposed a weighting scheme that assigns different weights to the different sub-graphs.[9], which exploits the info of the sampling probability for each sub-graph, i.e. the probable sub-graphs will obtain comparatively less weights in comparison to the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists mfinder to determine sub-graph concentrations impartially. If expanded to include sharp contrast to exhaustive search, the computational time of the algorithm surprisingly is asymptotically independent of the network size. An analysis of the computational time has shown that it takes O(nn) for each sample of a subgraph of size n from the network. On the other hand, there is no analysis in [9] on the classification time of sampled sub-graphs that requires solving the graph isomorphism problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation. But it is unavoidable to say that theconclusion, algorithm may sample the same sub-graph multiple timessearch, – spending gathering any information.[10] In by sampling, mfinder is faster than exhaustive but ittime onlywithout determines sub-graphs concentrations approximately. This algorithm can find motifs up to size 6 because of its main implementation, and as result it gives the most significant motif. Also, it is necessary to mention that this tool has no option of visual presentation. mfinder: Es=set of picked edges. Vs= set of all nodes that are touched by the edges in E. Init Vs and Es be empty sets. 1. Pick a random edge e1 = (vi, vj). Update Es = {e1}, Vs = {vi, vj} 2. Make a list L of all neighbor edges of Es. Omit from L all edges between members of Vs. 3. Pick a random edge e = {vk,vl} from L. Update Es = Es ⋃ {e}, Vs = Vs ⋃ {vk, vl}. 4. Repeat steps 2-3 until completing an n-node subgraph (until |Vs| = n). 5. Calculate the probability to sample the picked n-node subgraph. (M1 – M4) are different occurrences of sub-graph (b) in graph (a). For frequency concept F1, M1, M2, M3, M4 represent all matches, F1 = 4. For F2, one of the two set M1, M4 or M2, M3 are possible matches, F2 = 2. For F3, merely one of the matches (M1 to M4) is allowed, therefore F3 = 1. Frequency decreases as the usage of network elements are restricted.