Edge Count Clique Thms: Graph C is a clique iff |EC| = |PUC| = COMB(|VC|,2) = |VC|!/((|VC|-2)!2!). (VC,EC) is a k-clique iff every induced (k-1)-vertex subgraph (VD,ED) is a (k-1)-clique.
Apriori Clique Mining Alg: Uses an ARM-Apriori-like downward closure property. CSk ≡ kCliqueSet, CCSk+1 ≡ Candidate(k+1)CliqueSet. By SGE, CCSk+1 = all unions of CSk pairs with k-1 common vertices. Let C∈CCSk+1 be a union of two k-cliques with k-1 common vertices, and let v,w be the kth (different) vertices of the two k-cliques: C∈CSk+1 iff PE(v,w)=1.
Breadth-1st Clique Alg: CLQK ≡ all Kcliques. Find CLQ3 with CS0. A Kclique and a 3clique sharing an edge form a (K+1)clique iff all K-2 edges from the non-shared Kclique vertices to the non-shared 3clique vertex exist. Next find CLQ4, then CLQ5, …
Depth-1st Clique Alg: Find a largest MaxClique containing v. If (x,y)∈E, let NewPtSet(v,w,x,y) ≡ CLQ3pTree(v,w) & CLQ3pTree(x,y). If Count(NewPtSet(v,w,x,y)) is:
0, the 4 vertices form a max4Clique (i.e., v,w,x,y).
1, the 5 vertices form a max5Clique (i.e., v,w,x,y,NewPt).
2, the 6 vertices form a max6Clique if NewPair∈E, else they form 2 max5Cliques.
3, the 7 vertices form a max7Clique if each NewPair∈E; elseif 1 or 2 NewPairs∈E, each 6VertexSet (vwxy + 2 EdgeEndpts) forms a max6Clique; elseif 0 NewPairs∈E, each 5VertexSet (vwxy + 1 NewVertex) forms a maximal 5Clique. …
Theorem: For every hClique ⊆ NewPtSet, those h vertices together with v,w,x,y form a maximal (h+4)Clique, where NPS(v,w,x,y) = CLQ3(v,w) & CLQ3(x,y).
GRAPH (linear edges, 2 vertices). kPARTITE GRAPH (V = disjoint union of Vi, i=1..k; (x,y)∈E ⇒ x,y not in the same Vi). kHYPERGRAPH (edges = k vertices); 2graph = 2hypergraph. kPARTITE HYPERGRAPH (V = disjoint union of Vi, i=1..k; (x1..xk)∈E ⇒ no two xj in the same Vi).
Bipartite Clique Mining finds MaxCliques at the cost of pairwise &s. Each LETpTree is a MCLQ unless some pairwise & has the same count. A&B, over all B with Ct(A&B)=Ct(A), is a MCLQ.
There is potential for a k-plex [k-core] mining alg here. Instead of Ct(A&B)=Ct(A), consider a relaxed count, e.g., Ct(A&B)=Ct(A)-1. Each such pTree, C, would be missing just 1 vertex (1 edge). Taking any MCLQ as above, ANDing in such a C pTree would produce a 1-plex. ANDing in k such C's would produce a k-plex.
In fact, suppose we have produced a k-plex in such a manner; then ANDing in any C with Ct(C)=Ct(A)-h would produce a (k+h)-plex. &i=1..n Ai is a [i=1..n Ct(Ai)]-Core.
Tripartite Clique Mining Algorithm? In a Tripartite Graph, edges must start and end in different vertex parts. E.g., PART1=tweeters; PART2=hashtags; PART3=tweets. Tweeters-to-hashtags is many-to-many? Tweeters-to-tweets is many-to-many (incl. retweets)? Hashtags-to-tweets is many-to-many?
Multipartite Graphs: Bipartite, Tripartite (have 2, 3 PARTs resp.), … The rule is that no edge can start and end in the same PART.
HyperClique Mining: A 3hyperGraph has 3 vertex PARTs and each edge is a planar triangle (vertex triple), one vertex from each PART. A stock recommender is a 3PARThyperGraph (Investors, Stocks, Days): a triangular "edge" connects Investor #k, Stock X, and Day n if k recommended X on day n. A 3PARThyperClique is a community such that all the investors in the clique recommend all the stocks in the clique on each of the days in the clique (a strong signal?). Tweet example: PART1=tweeters; PART2=hashtags; PART3=tweets.
Conjecture: KmultiCliques and KhyperCliques are in 1-1 correspondence (K vertex set)? So, only one of the mining processes is needed? Represent these common objects with cliqueTrees (cTrees).
Cliques, k-plexes and k-cores are subgraphs or communities defined in terms of the count of their edges. Another type of subgraph of interest today is a Motif, which is defined by the count of its occurrences in the graph. E.g., a clique is a motif iff it occurs more times (isomorphically) in the graph than the "expected" number.
Criticism: An assumption behind the preservation of a topological sub-structure is that it is of a particular functional importance. This assumption has recently been questioned. Some authors have argued that motifs might show a variety depending on the network context, and therefore the structure of the motif does not necessarily determine function.[62]
Network structure certainly does not always indicate function; this is an idea that has been around for some time, for an example see the Sin operon.[63] Most analyses of motif function are carried out looking at the motif operating in isolation. Recent research[64] provides good evidence that network context, i.e. the connections of the motif to the rest of the network, is too important to draw inferences on function from local structure only — the cited paper also reviews the criticisms and alternative explanations for the observed data. An analysis of the impact of a single motif module on the global dynamics of a network is studied in.[65] Yet another recent work suggests that certain topological features of biological networks naturally give rise to the common appearance of canonical motifs, thereby questioning whether frequencies of occurrences are reasonable evidence that the structures of motifs are selected for their functional contribution to operation of networks.[66] Would a motif in the Stock-Investor or Stock-Investor-Day graphs have useful meaning? Can we mine for Motifs?
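The edge-count clique theorem and the Apriori-style candidate join from the clique-mining notes above can be sketched as follows. This is a minimal illustration on plain Python sets, not the pTree implementation the notes have in mind; the function names `is_clique`, `next_cliques` and `all_cliques` are hypothetical, and edges are assumed to be undirected and stored as 2-element frozensets.

```python
from itertools import combinations
from math import comb

def is_clique(vertices, edges):
    """Edge-count test: V is a clique iff it induces COMB(|V|, 2) edges."""
    vs = set(vertices)
    induced = {e for e in edges if e <= vs}
    return len(induced) == comb(len(vs), 2)

def next_cliques(k_cliques, edges):
    """Join pairs of k-cliques sharing k-1 vertices; the union is a
    (k+1)-clique iff the two non-shared vertices v, w are adjacent."""
    found = set()
    for a, b in combinations(k_cliques, 2):
        if len(a & b) == len(a) - 1:
            v, w = tuple(a ^ b)          # the two non-shared vertices
            if frozenset((v, w)) in edges:
                found.add(a | b)
    return found

def all_cliques(edges):
    """Breadth-first: 2-cliques (edges), then 3-cliques, and so on."""
    levels, level, k = {}, set(edges), 2
    while level:
        levels[k] = level
        level = next_cliques(level, edges)
        k += 1
    return levels
```

Each level is built only from the cliques of the previous level, which is the downward closure property the notes rely on: a (k+1)-clique can exist only if its k-vertex subsets are already k-cliques.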

MOTIFs: Cliques, k-plexes, k-cores and other communities are subgraphs defined in terms of the count of their edges (an internal count). A subgraph is a Motif if the number of its isomorphic occurrences in the graph (an external count) is higher than "expected".
Wikipedia: motifs are defined as recurrent and statistically significant sub-graphs or patterns. Network motifs are sub-graphs that repeat themselves in a specific network or even among various networks. Each of these sub-graphs, defined by a particular edge pattern between vertices, may mean that a particular function is achieved efficiently. Indeed, motifs are of importance because they may reflect functional properties. Motif detection is computationally challenging. Most algorithms for Motif discovery are used to find induced Motifs (induced sub-graphs).
Definitions: Graph G′=(V′,E′) is a sub-graph of G=(V,E) (G′⊆G) if V′⊆V and E′⊆E∩(V′×V′). If G′⊆G and G′ contains all of the edges ‹u,v›∈E with u,v∈V′, then G′ is an induced sub-graph of G. G′ and G are isomorphic (G′↔G) if there exists a bijection (one-to-one) f:V′→V with ‹u,v›∈E′ ⇔ ‹f(u),f(v)›∈E for all u,v∈V′. When G″⊂G and there exists an isomorphism between the sub-graph G″ and a graph G′, this mapping represents an appearance of G′ in G. The number of appearances of graph G′ in G is called the frequency FG of G′ in G. G′ is recurrent or frequent in G when its frequency FG(G′) is above a predefined threshold or cut-off value. The terms pattern and frequent sub-graph are used interchangeably in this review.
Motif discovery methods can be classified as exact counting, sampling, pattern growth methods and so on. Motif discovery has 2 steps: calculating the number of occurrences, then evaluating significance. Mfinder implements full enumeration and sampling. Until 2004, the only exact counting method for network motif detection was the brute-force one proposed by Milo et al.[3] It was successful at discovering small motifs, but finding even size 5 or 6 motifs was not computationally feasible.
Hence, a new approach to this problem was needed. The Kashtan et al. [9] sampling NM algorithm, based on edge sampling throughout the network, estimates concentrations of induced sub-graphs and can be utilized in directed or undirected networks. The sampling procedure starts from an arbitrary edge that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. It continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes.
In sampling, taking unbiased samples is important. Kashtan et al. proposed a weighting scheme that assigns different weights to the different sub-graphs.[9] It exploits the sampling probability of each sub-graph, i.e. the probable sub-graphs obtain comparatively smaller weights than the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique helps mfinder determine sub-graph concentrations impartially.
In sharp contrast to exhaustive search, the computational time of the algorithm is, surprisingly, asymptotically independent of the network size. An analysis of the computational time has shown that it takes O(n^n) for each sample of a subgraph of size n from the network. On the other hand, there is no analysis in [9] of the classification time of sampled sub-graphs, which requires solving the graph isomorphism problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation.
But it is unavoidable to say that the algorithm may sample the same sub-graph multiple times, spending time without gathering any information.[10] In conclusion, by sampling, mfinder is faster than exhaustive search, but it only determines sub-graph concentrations approximately. Because of its main implementation, this algorithm can find motifs only up to size 6, and as a result it gives the most significant motif. Also, it is necessary to mention that this tool has no option for visual presentation.
mfinder: Es = set of picked edges. Vs = set of all nodes that are touched by the edges in Es. Initialize Vs and Es to be empty sets.
1. Pick a random edge e1 = (vi, vj). Update Es = {e1}, Vs = {vi, vj}.
2. Make a list L of all neighbor edges of Es. Omit from L all edges between members of Vs.
3. Pick a random edge e = {vk, vl} from L. Update Es = Es ⋃ {e}, Vs = Vs ⋃ {vk, vl}.
4. Repeat steps 2-3 until completing an n-node subgraph (until |Vs| = n).
5. Calculate the probability to sample the picked n-node subgraph.
Frequency concepts (figure): M1 to M4 are different occurrences of sub-graph (b) in graph (a). For frequency concept F1, M1, M2, M3, M4 represent all matches, so F1 = 4. For F2, one of the two sets {M1, M4} or {M2, M3} is a possible set of matches, so F2 = 2. For F3, only one of the matches (M1 to M4) is allowed, therefore F3 = 1. Frequency decreases as the usage of network elements is restricted.
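Steps 1-4 of the mfinder sampling procedure above can be sketched as below. This is a rough sketch under stated assumptions, not the mfinder code: the network is assumed undirected with no self-loops, the function name `sample_subgraph` is hypothetical, and step 5 (the sampling probability used for the weighting scheme) is deliberately left out.

```python
import random

def sample_subgraph(edges, n, rng):
    """One edge-sampling run: grow Es until its edges touch n nodes (Vs)."""
    edges = [tuple(e) for e in edges]
    e1 = rng.choice(edges)                      # step 1
    Es, Vs = [e1], set(e1)
    while len(Vs) < n:                          # step 4
        # Step 2: neighbor edges touch Vs at exactly one endpoint,
        # so edges between members of Vs are omitted.
        L = [e for e in edges if (e[0] in Vs) != (e[1] in Vs)]
        if not L:
            return None                         # component has fewer than n nodes
        e = rng.choice(L)                       # step 3
        Es.append(e)
        Vs.update(e)
    # Expand to the induced sub-graph on the sampled nodes.
    induced = {e for e in edges if e[0] in Vs and e[1] in Vs}
    return frozenset(Vs), induced
```

Because every picked edge has exactly one endpoint outside Vs, each iteration adds exactly one node, so the loop stops with exactly n nodes. Different runs can return the same node set, which is the repeated-sampling cost the text mentions.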

Schreiber and Schwöbbermeyer [12] proposed the flexible pattern finder (FPF) in a system called Mavisto.[23] It exploits downward closure, which is applicable for frequency concepts F2 and F3. The downward closure property asserts that the frequency of sub-graphs decreases monotonically as the size of the sub-graphs increases; it does not necessarily hold for frequency concept F1.
FPF is based on a pattern tree (see figure) consisting of nodes that represent different graphs (or patterns), where the parent is a sub-graph of its children nodes; i.e., the graph corresponding to each pattern tree node is obtained by adding a new edge to the graph of its parent node. At first, FPF enumerates and maintains info on all matches of the sub-graph at the root of the pattern tree. Then it builds the child nodes of the previous node by adding 1 edge supported by a matching edge in the target graph, and tries to expand all of the previous info about matches to the new sub-graph (child node). In the next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and if downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure continues until there is no remaining path to traverse.
FPF does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time on promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern tree notion permits FPF to be implemented and executed in parallel, since each path of the pattern tree can be traversed independently. However, FPF is most useful for frequency concepts F2 and F3, because downward closure is not applicable to F1. The pattern tree is still practical for F1 if the algorithm runs in parallel. FPF has no limitation on motif size, which makes it more amenable to improvements.
ESU (FANMOD): The sampling bias of Kashtan et al. [9] provided great impetus for designing better algs for NM discovery. Even after the weighting scheme, that method imposed an undesired overhead on the running time as well as a more complicated implementation. FANMOD supports visual options and is time efficient, but it does not allow searching for motifs of size 9. Wernicke's [10] RAND-ESU improves on mfinder; it is based on the exact enumeration algorithm ESU and has been implemented as an app called FANMOD.[10] RAND-ESU is a discovery alg applicable to both directed and undirected networks. It effectively exploits unbiased node sampling and prevents overcounting sub-graphs. RAND-ESU uses DIRECT for determining sub-graph significance instead of an ensemble of random networks as a Null-model. DIRECT estimates the number of sub-graphs without explicitly generating random networks.[10] Empirically, DIRECT is more efficient than a random network ensemble for sub-graphs with a very low concentration, but the classical Null-model is faster than DIRECT for highly concentrated sub-graphs.[3][10]
ESU alg: This exact algorithm can be modified efficiently into RAND-ESU, which estimates sub-graph concentrations. The algorithms ESU and RAND-ESU are fairly simple, and hence easy to implement. ESU first finds the set of all induced sub-graphs of size k; let Sk be this set. ESU can be implemented as a recursive function; the running of this function can be displayed as a tree-like structure of depth k, called the ESU-Tree (see figure). Each ESU-Tree node indicates the status of the recursive function and entails two consecutive sets, SUB and EXT. SUB refers to nodes in the target network that are adjacent and establish a partial sub-graph of size |SUB|≤k. If |SUB|=k, the alg has found a complete induced sub-graph, so Sk = SUB ∪ Sk. If |SUB|<k, the alg expands SUB using EXT. EXT contains all nodes that are adjacent to at least 1 SUB node and whose numerical labels are larger than the label of the first SUB node. The EXT set is utilized by the alg to expand a SUB set until it reaches the desired size, which is placed at the lowest level of the ESU-Tree (its leaves).
ESU pseudocode:
  EnumerateSubgraphs(G, k)
    for each vertex v ∈ V do
      VExtension ← {u ∈ N({v}) | u > v}
      call ExtendSubgraph({v}, VExtension, v)
    endfor
  ExtendSubgraph(VSubgraph, VExtension, v)
    if |VSubgraph| = k then output G[VSubgraph] and return
    while VExtension ≠ ∅ do
      Remove an arbitrary vertex w from VExtension
      VExtension′ ← VExtension ∪ {u ∈ Nexcl(w, VSubgraph) | u > v}
      call ExtendSubgraph(VSubgraph ∪ {w}, VExtension′, v)
    return
Figure: ESU-Tree for finding all sub-graphs of size 3 in the target graph (a). Leaves: the set S3 of all size-3 induced sub-graphs of the target graph (a).
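The ExtendSubgraph recursion above can be written as a short runnable sketch. The assumptions here are illustrative: the graph is undirected, given as a dict `adj` mapping each integer-labeled vertex to its neighbor set, with no self-loops, and the function name `esu` is hypothetical. Classification of the leaves into isomorphism classes (done with nauty in FANMOD) is not included.

```python
def esu(adj, k):
    """Enumerate the vertex sets of all connected induced size-k sub-graphs."""
    found = []

    def extend(sub, ext, v):
        if len(sub) == k:
            found.append(frozenset(sub))       # a leaf of the ESU-Tree
            return
        ext = set(ext)
        while ext:
            w = ext.pop()                      # remove an arbitrary vertex w
            # Exclusive neighborhood of w: neighbors of w that are neither
            # in SUB nor adjacent to any SUB node.
            sub_nbhd = set(sub).union(*(adj[s] for s in sub))
            excl = {u for u in adj[w] if u > v and u not in sub_nbhd}
            extend(sub | {w}, ext | excl, v)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return found
```

The `u > v` label restriction and the exclusive neighborhood together ensure that each connected vertex set is produced exactly once, which is how ESU prevents overcounting.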

NeMoFinder adapts SPIN [27] to extract frequent trees and expands them into non-isomorphic graphs.[8] NeMoFinder utilizes frequent size-n trees to partition the input network into a collection of size-n graphs, afterward finding frequent size-n sub-graphs by expansion of frequent trees edge-by-edge until getting a complete size-n graph Kn. The algorithm finds NMs in undirected networks and is not limited to extracting only induced sub-graphs. Furthermore, NeMoFinder is an exact enumeration algorithm and is not based on a sampling method. As Chen et al. claim, NeMoFinder is applicable for detecting relatively large NMs, for instance, finding NMs up to size-12 from the whole S. cerevisiae (yeast) PPI network as the authors claimed.[28] NeMoFinder consists of three main steps. First, finding frequent size-n trees, then utilizing repeated size-n trees to divide the entire network into a collection of size-n graphs, finally, performing sub-graph join operations to find frequent size-n sub-graphs.[26] In the first step, the algorithm detects all non-isomorphic size-n trees and mappings from a tree to the network. In the second step, the ranges of these mappings are employed to partition the network into size-n graphs. Up to this step, there is no distinction between NeMoFinder and an exact enumeration method. However, a large portion of non-isomorphic size-n graphs still remain. NeMoFinder exploits a heuristic to enumerate non-tree size-n graphs by the obtained information from preceding steps. The main advantage is in third step, which generates candidate sub-graphs from previously enumerated sub-graphs. This generation of new size-n sub-graphs is done by joining each previous sub-graph with derivative sub-graphs from itself called cousin sub-graphs. These new sub-graphs contain one additional edge in comparison to the previous sub-graphs. 
However, there exist some problems in generating new sub-graphs: there is no clear method to derive cousins from a graph; joining a sub-graph with its cousins leads to redundancy, generating a particular sub-graph more than once; and cousin determination is done by a canonical representation of the adjacency matrix, which is not closed under the join operation. NeMoFinder is an efficient network motif finding algorithm for motifs up to size 12, but only for protein-protein interaction networks, which are presented as undirected graphs. It is not able to work on directed networks, which are so important in the field of complex and biological networks. The pseudocode of NeMoFinder is shown here:
NeMoFinder
  Input: G - PPI network; N - Number of randomized networks; K - Maximal network motif size; F - Frequency threshold; S - Uniqueness threshold
  Output: U - Repeated and unique network motif set
  D ← ∅;
  for motif-size k from 3 to K do
    T ← FindRepeatedTrees(k);
    GDk ← GraphPartition(G, T)
    D ← D ∪ T;
    D′ ← T;
    i ← k;
    while D′ ≠ ∅ and i ≤ k × (k - 1) / 2 do
      D′ ← FindRepeatedGraphs(k, i, D′);
      D ← D ∪ D′;
      i ← i + 1;
    end while
  end for
  for counter i from 1 to N do
    Grand ← RandomizedNetworkGeneration();
    for each g ∈ D do
      GetRandFrequency(g, Grand);
    end for
  end for
  U ← ∅;
  for each g ∈ D do
    s ← GetUniqunessValue(g);
    if s ≥ S then
      U ← U ∪ {g};
    end if
  end for
  return U
Grochow and Kellis [29] proposed an exact alg for enumerating sub-graph appearances that is based on a motif-centric approach: the frequency of a given sub-graph, called the query graph, is exhaustively determined by searching for all possible mappings from the query graph into the larger network. It is claimed [29] that a motif-centric method has some beneficial features in comparison to network-centric methods. First of all, it avoids the increased complexity of sub-graph enumeration. Also, by using mapping instead of enumerating, it enables an improvement in the isomorphism test.
To improve the performance of the alg, since it is an inefficient exact enumeration alg, the authors introduced a fast method called symmetry-breaking conditions. During straightforward sub-graph isomorphism tests, a sub-graph may be mapped to the same sub-graph of the query graph multiple times. In the Grochow-Kellis (GK) alg, symmetry-breaking is used to avoid such multiple mappings.
Figure: The GK alg and the symmetry-breaking condition, which eliminates redundant isomorphism tests. (a) graph G, (b) illustration of all automorphisms of the G shown in (a). From the set AutG we can obtain a set of symmetry-breaking conditions of G, given by SymG in (c). Only the first mapping in AutG satisfies the SymG conditions; as a result, by applying SymG in the Isomorphism Extension module, the alg enumerates each match-able sub-graph in the network to G only once. Note that SymG is not a unique set for an arbitrary graph G.
The GK alg discovers the whole set of mappings of a given query graph to the network in two major steps. It starts with the computation of the symmetry-breaking conditions of the query graph. Next, by means of a branch-and-bound method, the alg tries to find every possible mapping from the query graph to the network that meets the associated symmetry-breaking conditions. Computing symmetry-breaking conditions requires finding all automorphisms of a given query graph. Even though no efficient (polynomial-time) algorithm is known for the graph automorphism problem, this problem can be tackled efficiently in practice by McKay's tools.[24][25] As is claimed, using symmetry-breaking conditions in NM detection saves a great deal of running time. Moreover, it can be inferred from the results in [29][30] that using the symmetry-breaking conditions results in high efficiency, particularly for directed networks in comparison to undirected networks.
The symmetry-breaking conditions used in the GK algorithm are similar to the restriction that the ESU algorithm applies to the labels in the EXT and SUB sets. In conclusion, the GK algorithm computes the exact number of appearances of a given query graph in a large complex network, and exploiting symmetry-breaking conditions improves the algorithm's performance. Also, the GK alg is 1 of the known algorithms having no limitation on motif size in its implementation, and potentially it can find motifs of any size.
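The way a set of symmetry-breaking conditions can be derived from the automorphisms of a query graph, as in the figure described above, can be illustrated with a brute-force sketch. It assumes a small undirected query graph on vertices 0..n-1 and enumerates automorphisms by trying every permutation, which stands in for McKay's tools; the function names are hypothetical and this is not the Grochow-Kellis implementation.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All vertex permutations of an undirected graph on 0..n-1 that map
    every edge onto an edge (brute force; fine only for tiny graphs)."""
    E = {frozenset(e) for e in edges}
    return [p for p in permutations(range(n))
            if all(frozenset((p[u], p[v])) in E for u, v in edges)]

def symmetry_breaking_conditions(n, edges):
    """Return pairs (a, b) meaning: the image of a must be labeled lower
    than the image of b."""
    auts = automorphisms(n, edges)
    conds = []
    while len(auts) > 1:
        # Smallest vertex that some remaining automorphism moves.
        a = min(u for u in range(n) if any(p[u] != u for p in auts))
        orbit = {p[a] for p in auts}
        conds += [(a, b) for b in sorted(orbit) if b != a]
        auts = [p for p in auts if p[a] == a]   # keep those fixing a
    return conds
```

For a triangle this yields the conditions 0<1, 0<2, 1<2, and only one of its six automorphisms satisfies all of them, so each match in the network would be reported once instead of six times. As the text notes, the resulting set is not unique; a different choice of vertex at each round gives a different valid set.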

Noga Alon et al. [31] introduced an approach for finding non-induced sub-graphs on undirected networks such as PPI ones. It counts non-induced trees and bounded treewidth sub-graphs of size <10. This alg counts the number of non-induced occurrences of a tree T with k = O(log n) vertices in a network G with n vertices as follows:
1. Color coding. Color each vertex of the input network G independently and uniformly at random with one of the k colors.
2. Counting. Apply a dynamic programming routine to count the number of non-induced occurrences of T in which each vertex has a unique color. For more details on this step, see [31].
3. Repeat the above two steps O(e^k) times and add up the number of occurrences of T to get an estimate of the number of its occurrences in G.
MODA: Omidi et al. [32] is applicable for induced and non-induced NM discovery in undirected networks. Motif-centric algs (MODA, GK) have the ability to work as query-finding algs, i.e. to find a single motif query or a small number of motif queries (not all possible sub-graphs of a given size) with larger sizes. As the number of possible non-isomorphic sub-graphs increases exponentially with sub-graph size, the network-centric algorithms, those looking for all possible sub-graphs, face a problem for large motifs (even larger than 10). The ability of motif-centric algs to search for only a small number of queries is therefore sometimes a significant property.
Using a hierarchical structure called an expansion tree, the MODA algorithm is able to extract NMs of a given size systematically and, similar to FPF, avoids enumerating unpromising sub-graphs; MODA takes into consideration potential queries (or candidate sub-graphs) that would result in frequent sub-graphs. Although MODA resembles FPF in using a tree-like structure, the expansion tree is applicable merely for computing frequency concept F1. Here is the main idea: by a simple criterion one can generalize a mapping of a k-size graph into the network to its same-size supergraphs.
For example, suppose there is a mapping fG of graph G with k nodes into the network and we have a same-size graph G′ with one more edge ‹u, v›; fG will map G′ into the network if there is an edge ‹fG(u), fG(v)› in the network. As a result, we can exploit the mapping set of a graph to determine the frequencies of its same-order supergraphs simply in O(1) time per mapping, without carrying out sub-graph isomorphism testing. The algorithm starts with minimally connected query graphs of size k and finds their mappings in the network via sub-graph isomorphism. After that, keeping the graph size fixed, it expands previously considered query graphs edge-by-edge and computes the frequency of these expanded graphs as mentioned above. The expansion process continues until reaching the complete graph Kk (fully connected, with k(k-1)/2 edges). In other words, the alg starts by computing sub-tree frequencies in the network and then expands sub-trees edge by edge.
One way to implement this idea is the expansion tree Tk for each k. The query graph in a node is a sub-graph of the query graph in each of the node's children, with one edge of difference. The longest path in Tk consists of (k^2-3k+4)/2 edges and is the path from the root to the leaf node holding the complete graph. Generating expansion trees can be done by a simple routine which is explained in [32]. MODA traverses Tk, and when it extracts query trees from the first level of Tk it computes their mapping sets and saves these mappings for the next step. For non-tree queries from Tk, the algorithm extracts the mappings associated with the parent node in Tk and determines which of these mappings can support the current query graphs. The process continues until the algorithm gets the complete query graph. The query tree mappings are extracted using the Grochow-Kellis algorithm. For computing the frequency of non-tree query graphs, the algorithm employs a simple routine that takes O(1) steps.
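The mapping-reuse idea behind the expansion tree can be illustrated as follows. This is a minimal sketch, not MODA itself: `find_mappings` is a hypothetical brute-force stand-in for the Grochow-Kellis mapping step, mappings are plain dicts from query vertices to network vertices, and network edges are assumed to be undirected frozensets.

```python
from itertools import permutations

def find_mappings(q_vertices, q_edges, net_vertices, net_edges):
    """Brute-force non-induced mappings of a query graph into the network."""
    out = []
    for image in permutations(net_vertices, len(q_vertices)):
        f = dict(zip(q_vertices, image))
        if all(frozenset((f[u], f[v])) in net_edges for u, v in q_edges):
            out.append(f)
    return out

def extend_mappings(mappings, new_edge, net_edges):
    """Mappings of G that also support G' = G + new_edge.
    One hash lookup per mapping, with no new isomorphism test."""
    u, v = new_edge
    return [f for f in mappings if frozenset((f[u], f[v])) in net_edges]
```

Starting from the mappings of the path a-b-c, adding the edge (a, c) filters them down to the mappings of the triangle. The counts here include every automorphic copy of a match, so they are multiples of the distinct-occurrence frequency.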
In addition, MODA exploits a sampling method in which each node in the network is sampled with probability linearly proportional to its degree; this probability distribution has the same form as that of the well-known Barabási-Albert preferential attachment model in the field of complex networks.[33] This approach generates approximations; however, the results are almost stable across different executions since sub-graphs aggregate around highly connected nodes.[34]
The pseudocode of MODA:
  Input: G: Input graph, k: sub-graph size, Δ: threshold value
  Output: Frequent Subgraph List: List of all frequent k-size sub-graphs
  Note: FG: set of mappings from G in the input graph G
  fetch Tk
  do
    G′ = Get-Next-BFS(Tk) // G′ is a query graph
    if |E(G′)| = (k – 1)
      call Mapping-Module(G′, G)
    else
      call Enumerating-Module(G′, G, Tk)
    end if
    save FG
    if |FG| > Δ then
      add G′ into Frequent Subgraph List
    end if
  Until |E(G′)| = k(k – 1)/2
  return Frequent Subgraph List
Figure: Illustration of the expansion tree T4 for 4-node query graphs. At the first level, there are non-isomorphic k-size trees, and at each level an edge is added to the parent graph to form a child graph. In the second level, there is a graph with two alternative edges, shown by a dashed red edge. In fact, this node represents two expanded graphs that are isomorphic.[32]
Kavosh [35] improves main memory usage. Its enumeration is similar to GK and MODA, which first find all k-size sub-graphs that a particular node participates in, then remove the node, and repeat this process for the remaining nodes.[35] For counting the sub-graphs of size k that include a particular node, trees with maximum depth k, rooted at this node and based on the neighborhood relationship, are implicitly built. Children of each node include both incoming and outgoing adjacent nodes. To descend the tree, a child is chosen at each level with the restriction that a particular child can be included only if it has not been included at any upper level. After having descended to the lowest level possible, the tree is again ascended and the process is repeated with the stipulation that nodes visited in earlier paths of a descendant are now considered unvisited nodes. A final restriction in building trees is that all children in a particular tree must have numerical labels larger than the label of the root of the tree. The restrictions are similar to those of GK and ESU.
The protocol for extracting sub-graphs makes use of the compositions of an integer. For the extraction of sub-graphs of size k, all possible compositions of the integer k-1 must be considered. The compositions of k-1 consist of all possible ways of expressing k-1 as a sum of positive integers. Summations in which the order of the summands differs are considered distinct. A composition can be expressed as k2,k3,…,km where k2 + k3 + … + km = k-1. To count sub-graphs based on the composition, ki nodes are selected from the i-th level of the tree to be nodes of the sub-graphs (i = 2,3,…,m). The k-1 selected nodes along with the node at the root define a sub-graph within the network. After discovering a sub-graph involved as a match in the target network, in order to be able to evaluate the size of each class according to the target network, Kavosh employs the nauty algorithm [24][25] in the same way as FANMOD.
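The compositions of k-1 that Kavosh iterates over can be generated with a few lines. This is a small sketch of the integer-composition step only; the generator name `compositions` is hypothetical and nothing here selects nodes from the tree levels.

```python
def compositions(n):
    """Yield every ordered tuple of positive integers that sums to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest
```

For sub-graphs of size k = 4, the compositions of k-1 = 3 are (1,1,1), (1,2), (2,1) and (3). Each tuple (k2, k3, …, km) says how many nodes to take from levels 2, 3, …, m of the tree rooted at the current node. An integer n has 2^(n-1) compositions, so the number of patterns to try grows quickly with k.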

In 2010, Ribeiro, Silva proposed a data structure for storing a collection of sub-graphs, a g-trie.[37] This data structure, which is conceptually akin to a prefix tree, stores sub-graphs according to their structures and finds occurrences of each of these sub-graphs in a larger graph. One of the noticeable aspects of this data structure is that coming to the network motif discovery, the sub-graphs in the main network are needed to be evaluated. So, there is no need to find the ones in random network which are not in the main network. This can be one of the time-consuming parts in the algorithms in which all sub-graphs in random networks are derived. A g-trie is a multiway tree that can store a collection of graphs. Each tree node contains information about a single graph vertex and its corresponding edges to ancestor nodes. A path from the root to a leaf corresponds to one single graph. Descendants of a g-trie node share a common sub-graph. Constructing a g-trie is well described in.[37] After constructing a g-trie, the counting part takes place. The main idea in counting process is to backtrack by all possible sub-graphs, but at the same time do the isomorphism tests. This backtracking technique is essentially the same technique employed by other motif-centric approaches like MODA and GK algorithms. Taking advantage of common substructures in the sense that at a given time there is a partial isomorphic match for several different candidate sub-graphs. Among the mentioned algorithms, G-Tries is the fastest. But, the excessive use of memory is the drawback of this algorithm, which might limit the size of discoverable motifs by a personal computer with average memory. 
Runtimes of FANMOD, GK and G-Trie for subgraphs from 3-9 nodes on five different networks.[37]
                   Census Original Network       Average Census on Random Networks
Network   Size     FANMOD       GK   G-Trie      FANMOD       GK   G-Trie
Dolphins  5          0.07     0.03     0.01        0.13     0.04     0.01
          6          0.48     0.28     0.04        1.14     0.35     0.07
          7          3.02     3.44     0.23        8.34     3.55     0.46
          8         19.44    73.16     1.69       67.94    37.31     4.03
          9        100.86  2984.22     6.98      493.98   366.79    24.84
Circuit   6          0.49     0.41     0.03        0.55     0.24     0.03
          7          3.28     3.73     0.22        3.53     1.34     0.17
          8         17.78    48.00     1.52       21.42     7.91     1.06
Social    3          0.31     0.11     0.02        0.35     0.11     0.02
          4          7.78     1.37     0.56       13.27     1.86     0.57
          5        208.30    31.85    14.88      531.65    62.66    22.11
Yeast     3          0.47     0.33     0.02        0.57     0.35     0.02
          4         10.07     2.04     0.36       12.90     2.25     0.41
          5        268.51    34.10    12.73      400.13    47.16    14.98
Power     3          0.51     1.46     0.00        0.91     1.37     0.01
          4          1.38     4.34     0.02        3.01     4.40     0.03
          5          4.68    16.95     0.10       12.38    17.54     0.14
          6         20.36    95.58     0.55       67.65    92.74     0.88
          7        101.04   765.91     3.36      408.15   630.65     5.17
Classification of Algs: Motif discovery algs fall into two groups: exact counting, and those using statistical sampling and estimation. The 2nd group does not count all subgraph occurrences in the main network, so algs belonging to this group are faster, but they might yield biased and unrealistic results. Exact counting algs can be classified into network-centric and subgraph-centric methods. The algorithms of the first class search the given network for all subgraphs of a given size, while the algorithms falling into the second class first generate the different possible non-isomorphic graphs of the given size, and then explore the network for each generated subgraph separately.
Classification of Motif Discovery Algorithms
Count Method   Basis          Name             Directed/Un   Induced/Non
Exact          Net-Centric    mfinder          Both          Induced
                              ESU (FANMOD)     Both          Induced
                              Kavosh           Both          Induced
                              G-Tries          Both          Induced
               SubG-Centric   FPF (Mavisto)    Both          Induced
                              NeMoFinder       Undirected    Induced
                              Grochow-Kellis   Both          Both
                              MODA             Both          Both
Est/Sampling   Color-Coding   N. Alon          Undirected    Non-Induced
               Other          mfinder          Both          Induced
                              ESU (FANMOD)     Both          Induced

Well-Established Motifs and Their Functions Much experimental work has been devoted to understanding network motifs in gene regulatory networks. These networks control which genes are expressed in the cell in response to biological signals. The network is defined such that genes are nodes, and directed edges represent the control of one gene by a transcription factor (regulatory protein that binds DNA) encoded by another gene. Thus, network motifs are patterns of genes regulating each other's transcription rate. When analyzing transcription networks, it is seen that the same network motifs appear again and again in diverse organisms from bacteria to human. The transcription network of E. coli and yeast, for example, is made of three main motif families, that make up almost the entire network. The leading hypothesis is that the network motif were independently selected by evolutionary processes in a converging manner,[38][39] since the creation or elimination of regulatory interactions is fast on evolutionary time scale, relative to the rate at which genes change,[38][39][40] experiments on dynamics generated by network motifs in living cells indicate that they have characteristic dynamical functions. This suggests that the network motif serve as building blocks in gene regulatory networks that are beneficial to the organism. The functions associated with common network motifs in transcription networks were explored by several both theoretically and experimentally. Below are some of the most common network motifs and their associated function. Negative auto-regulation (NAR) One of simplest and abundant motifs in E. coli is negative auto-regulation in which a transcription factor (TF) represses its own transcription. It was shown to perform two important functions. First: response acceleration. NAR was shown to speed-up the response to signals both theoretically [41] and experimentally. 
This was first shown in a synthetic transcription network[42] and later in the natural context of the SOS DNA repair system of E. coli.[43] Second: increased stability of the auto-regulated gene product concentration against stochastic noise, thus reducing variations in protein levels between different cells.[44][45][46]

Positive auto-regulation (PAR)
Positive auto-regulation (PAR) occurs when a transcription factor enhances its own rate of production. In contrast to the NAR motif, this motif slows the response time compared to simple regulation.[47] In the case of a strong PAR, the motif may lead to a bimodal distribution of protein levels in cell populations.[48]

Feed-forward loops (FFL)
[Figure: Schematic representation of a feed-forward motif]
This motif is commonly found in many gene systems and organisms. The motif consists of three genes and three regulatory interactions. The target gene C is regulated by 2 TFs, A and B, and in addition TF B is also regulated by TF A. Since each of the regulatory interactions may be either positive or negative, there are eight possible types of FFL motifs.[49] Two of those eight types, the coherent type 1 FFL (C1-FFL) (where all interactions are positive) and the incoherent type 1 FFL (I1-FFL) (A activates C and also activates B, which represses C), are found much more frequently in the transcription networks of E. coli and yeast than the other six types.[49][50] In addition to the structure of the circuitry, the way in which the signals from A and B are integrated by the C promoter should also be considered. In most cases the FFL is either an AND gate (A and B are both required for C activation) or an OR gate (either A or B is sufficient for C activation), but other input functions are also possible.

Coherent type 1 FFL (C1-FFL)
The C1-FFL with an AND gate was shown to function as a ‘sign-sensitive delay’ element and a persistence detector both theoretically [49] and experimentally[51] with the arabinose system of E. coli.
This means that this motif can provide pulse filtration, in which short pulses of signal will not generate a response but persistent signals will generate a response after a short delay. When a persistent pulse ends, the output shuts off quickly. The opposite behavior emerges in the case of a sum gate, with fast response and delayed shut-off, as was demonstrated in the flagella system of E. coli.[52]

Incoherent type 1 FFL (I1-FFL)
The I1-FFL is a pulse generator and response accelerator. The two signal pathways of the I1-FFL act in opposite directions: one pathway activates the target gene Z and the other represses it. When the repression is complete, this leads to pulse-like dynamics. It was also demonstrated experimentally that the I1-FFL can serve as a response accelerator in a way that is similar to the NAR motif. The difference is that the I1-FFL can speed up the response of any gene, not necessarily a transcription factor gene.[53] An additional function was assigned to the I1-FFL network motif: it was shown both theoretically and experimentally that the I1-FFL can generate a non-monotonic input function in both synthetic [54] and native systems.[55] Finally, expression units that incorporate incoherent feedforward control of the gene product provide adaptation to the amount of DNA template and can be superior to simple combinations of constitutive promoters.[56] Feedforward regulation displayed better adaptation than negative feedback, and circuits based on RNA interference were the most robust to variation in DNA template amounts.[56]

Multi-output FFLs
In some cases the same regulators X and Y regulate several Z genes of the same system. By adjusting the strength of the interactions, this motif was shown to determine the temporal order of gene activation. This was demonstrated in the flagella system of E. coli.[57]

Single-input modules (SIM)
This motif occurs when a single regulator regulates a set of genes with no additional regulation.
This is useful when the genes are cooperatively carrying out a specific function and always need to be activated in a synchronized manner. By adjusting the strength of the interactions, it can create a temporal expression program for the genes it regulates.[58] In the literature, multiple-input modules (MIM) arose as a generalization of SIM. However, the precise definitions of SIM and MIM have been a source of inconsistency. There are attempts to provide orthogonal definitions for canonical motifs in biological networks and algorithms to enumerate them, especially SIM, MIM and Bi-Fan (2x2 MIM).[59]

Dense overlapping regulons (DOR)
This motif occurs where several regulators combinatorially control a set of genes with diverse regulatory combinations. It was found in E. coli in various systems such as carbon utilization, anaerobic growth, stress response and others.[13][18] To better understand the function of this motif, one has to obtain more information about the way multiple inputs are integrated by the genes. Kaplan et al.[60] have mapped the input functions of the sugar utilization genes in E. coli, showing diverse shapes.

Activity motifs
An interesting generalization of network motifs, activity motifs are over-occurring patterns that can be found when nodes and edges in the network are annotated with quantitative features. For instance, when edges in a metabolic pathway are annotated with the magnitude or timing of the corresponding gene expression, some patterns are over-occurring given the underlying network structure.[61]
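The persistence-detector behavior described for the C1-FFL with an AND gate can be reproduced with a very small simulation. The sketch below is a toy under stated assumptions, not a model from the cited work: the logic-approximation rate equations, the Euler integration, the function names and every parameter value (alpha, beta, threshold, dt) are chosen here for illustration. A is the input signal, B accumulates while A is on, and C is produced only while A is on AND B exceeds its activation threshold.

```python
def simulate_c1_ffl(pulse_duration, t_end=10.0, dt=0.01,
                    alpha=1.0, beta=1.0, threshold=0.5):
    """Euler simulation of a C1-FFL with an AND gate (toy model).

    A is ON for t < pulse_duration and OFF afterwards.
    Returns (times, c_trace), where c_trace is the level of target C.
    """
    steps = int(round(t_end / dt))
    b = 0.0  # level of intermediate TF B
    c = 0.0  # level of target gene product C
    times, c_trace = [], []
    for i in range(steps + 1):
        t = i * dt
        times.append(t)
        c_trace.append(c)
        a_on = t < pulse_duration
        gate_open = a_on and b > threshold  # AND gate at the C promoter
        db = (beta if a_on else 0.0) - alpha * b
        dc = (beta if gate_open else 0.0) - alpha * c
        b += dt * db
        c += dt * dc
    return times, c_trace


def response_onset(times, c_trace):
    """Return the first time at which C is above zero, or None."""
    for t, c in zip(times, c_trace):
        if c > 0.0:
            return t
    return None


# A short pulse is filtered out; a persistent one responds after a delay.
_, c_short = simulate_c1_ffl(pulse_duration=0.3)
t_long, c_long = simulate_c1_ffl(pulse_duration=5.0)
print("short pulse, peak C:", max(c_short))
print("long pulse, onset of C:", response_onset(t_long, c_long))
print("long pulse, peak C:", round(max(c_long), 3))
```

With these assumed parameters B needs about ln 2 time units to cross the threshold, so a pulse shorter than that never opens the gate, while a long pulse produces C after that delay. Production of C stops as soon as A turns off, which is the fast shut-off.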

3PART HyperGraphs

[Figure: base clTrees drawn as 3-level pTrees over Investors 1..18, Stocks A..N and 5 Days]

Base clTrees: {1,2,3,..}=Investors recommending Stocks={A,B,C,..} on the 5 Days. 3-Level pTrees with strides 1, 14 (Stocks), 5 (Days), 18 (Investors).

… stock, we mine BMCLQs by &ing day-A pTrees, then combining all other day-X pTrees with the same count (this is oaa on Stocks, Days, Investors).

We might want communities of type:
1. BC12, which tells us Stocks B,C have been recommended every Day by Investors 1,2. The operation here is oaa (or Stocks, and Days, and Investors). This is a clique.
2. BCEFH(DayCt≥2)12, which tells us Stocks B,C,E,F,H have been recommended on at least 2 Days by Investors 1,2. The op here is oCt≥2, where the Ct≥2 operator applies to the 5 Stock=A,Day=? pTrees and produces an Investor mask pTree showing those Investors who have recommended BCEFH at least 2 Days out of the 5. This is not a clique. Md?
Of course we can implement this operator as 5 SPTS additions (where each SPTS is 1 bit wide), followed by an EINring-type operator on that column of sums that masks to 1 those Investors whose sum ≥ 2. But might there be a single operation to produce that 5-way sum? Or even one operator that produces the Investor mask pTree directly from the 5 input pTrees? It would be an EINring-type operator, but instead of treating the input SPTSs as bitslice pTrees, it would treat them as individual mask pTrees.

[Figure: 3-level pTrees with strides 18 (Investors), 5 (Days), 1, 14 (Stocks), with counts ctS, ctD, ctI]

The Base Cliques are the 1-1-many cliques like those above (1 Stock, 1 Day, many Investors; it doesn't matter whether Stocks or Days is on top). Other Base Cliques are (1 Investor, 1 Day, many Stocks) and (1 Stock, 1 Investor, many Days).

[Figure: base cTrees as 3-level pTrees with strides 18 (Investors), 14 (Stocks), 1, 5 (Days), with counts CtI, CtD, CtS]

Is there a fast MaxClique mining algorithm for tripartite graphs?
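The two routes to the "at least 2 Days out of 5" Investor mask discussed above can be compared on plain bit vectors. This is a minimal sketch under loud assumptions: each pTree is modeled as an uncompressed Python integer bitmask (bit i stands for Investor i+1), the function names are hypothetical, and the masks in the example are made up. The first route adds the 1-bit-wide columns into a bit-sliced column of sums and then masks the sums that are at least 2; the second keeps only two running masks ("seen at least once", "seen at least twice") and never forms the sum. The plain AND that yields a clique mask is included for contrast.

```python
def add_bit_vector(slices, v):
    """Add the 1-bit-wide column v into a bit-sliced column of sums.

    slices[k] holds bit k of every Investor's running sum (LSB first).
    """
    carry = v
    out = []
    for s in slices:
        out.append(s ^ carry)  # sum bit
        carry = s & carry      # carry into the next slice
    if carry:
        out.append(carry)
    return out


def at_least_two_via_sums(day_masks):
    """Route 1: bit-sliced addition, then mask the sums that are >= 2."""
    slices = []
    for v in day_masks:
        slices = add_bit_vector(slices, v)
    mask = 0
    for s in slices[1:]:  # a sum is >= 2 iff some bit above the LSB is set
        mask |= s
    return mask


def at_least_two_direct(day_masks):
    """Route 2: produce the mask directly, without building the sums."""
    once = 0
    twice = 0
    for v in day_masks:
        twice |= once & v  # already seen once and seen again now
        once |= v
    return twice


def clique_mask(masks, n_investors):
    """AND of all masks: Investors present in every one of them."""
    result = (1 << n_investors) - 1
    for v in masks:
        result &= v
    return result


# Hypothetical data: 4 Investors, one Stock, 5 Days.
days = [0b0011, 0b0001, 0b0110, 0b0000, 0b1001]
print(bin(at_least_two_via_sums(days)), bin(at_least_two_direct(days)))
print(bin(clique_mask([0b0011, 0b0111, 0b1011], 4)))
```

In this toy setting the direct route costs two bitwise operations per input mask, which suggests that a single-pass mask operator is at least plausible; whether it carries over to compressed multi-level pTrees is not addressed by the sketch.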