Rooms: Barbie Tootle; Hayes Cape; Cartoon Room I; Cartoon Room II; Suzanne M. Scharer; Rosa M. Ailabouni

Monday Aug 5th, 2013, 10:10-11:50
A1L-A Analog Circuits I. Chr: Ming Gu, Shantanu Chakrabartty. Track: Analog and Mixed Signal Integrated Circuits
A1L-B Low Power Digital Circuit Design Techniques. Chr: Joanne Degroat. Track: Digital Integrated Circuits, SoC and NoC
A1L-C Student Contest I. Chr: Mohammed Ismail. Track: INVITED ONLY
A1L-D Design and Analysis for Power Systems and Power Electronics. Chr: Hoi Lee, Ayman Fayed. Track: Power Systems and Power Electronics
A1L-E Design and Analysis of Linear and Non-Linear Systems. Chr: Samuel Palermo. Track: Linear and Non-linear Circuits and Systems
A1L-F Emerging Technologies. Chr: Khaled Salama. Track: Emerging Technologies

Monday Aug 5th, 2013, 13:10-14:50
A2L-A Analog Circuits II. Chr: Ming Gu, Shantanu Chakrabartty. Track: Analog and Mixed Signal Integrated Circuits
A2L-B Low Power VLSI Design Methodology. Chr: Genevieve Sapijaszko. Track: Digital Integrated Circuits, SoC and NoC
A2L-C Student Contest II. Chr: Sleiman Bou-Sleiman. Track: INVITED ONLY
A2L-D Power Management and Energy Harvesting. Chr: Ayman Fayed, Hoi Lee. Track: Power Management and Energy Harvesting
A2L-E Oscillators and Chaotic Systems. Chr: Samuel Palermo, Warsame Ali. Track: Linear and Non-linear Circuits and Systems
A2L-F Bioengineering Systems. Chr: Khaled Salama. Track: Bioengineering Systems and Bio Chips

Monday Aug 5th, 2013, 16:00-17:40
A4L-A Analog Design Techniques I. Chr: Dong Ha. Track: Analog and Mixed Signal Integrated Circuits
A4L-B Imaging and Wireless Sensors. Chr: Igor Filanovsky. Track: Analog and Mixed Signal Integrated Circuits
A4L-C Special Session: Characterization of Nano Materials and Circuits. Chr: Nayla El-Kork. Track: SPECIAL SESSION
A4L-D Special Session: Power Management and Energy Harvesting. Chr: Paul Furth. Track: SPECIAL SESSION
A4L-E Communication and Signal Processing Circuits. Chr: Samuel Palermo. Track: Linear and Non-linear Circuits and Systems
A4L-F Sensing and Measurement of Biological Signals. Chr: Hoda Abdel-Aty-Zohdy. Track: Bioengineering Systems and Bio Chips

Tuesday Aug 6th, 2013, 10:10-11:50
B2L-A Analog Design Techniques II. Chr: Valencia Koomson. Track: Analog and Mixed Signal Integrated Circuits
B2L-B VLSI Design Reliability. Chr: Shantanu Chakrabartty, Gursharan Reehal. Track: Digital Integrated Circuits, SoC and NoC
B2L-C Delta-Sigma Modulators. Chr: Vishal Saxena. Track: Analog and Mixed Signal Integrated Circuits
B2L-D Special Session: University and Industry Training in the Art of Electronics. Chr: Steven Bibyk. Track: SPECIAL SESSION
B2L-E Radio Frequency Integrated Circuits. Chr: Nathan Neihart, Mona Hella. Track: RFICs, Microwave, and Optical Systems
B2L-F Bio-inspired Green Technologies. Chr: Hoda Abdel-Aty-Zohdy. Track: Bio-inspired Green Technologies

Tuesday Aug 6th, 2013, 13:10-14:50
B3L-A Analog Design Techniques III. Chr: Valencia Koomson. Track: Analog and Mixed Signal Integrated Circuits
B3L-B VLSI Design, Routing, and Testing. Chr: Nader Rafla. Track: Programmable Logic, VLSI, CAD and Layout
B3L-C Special Session: High-Precision and High-Speed Data Converters I. Chr: Samuel Palermo. Track: SPECIAL SESSION
B3L-D Special Session: Advancing the Frontiers of Solar Energy. Chr: Michael Soderstrand. Track: SPECIAL SESSION
B3L-E RF/Optical Devices and Circuits. Chr: Mona Hella, Nathan Neihart. Track: RFICs, Microwave, and Optical Systems
B3L-F Carbon Nanotube-based Sensors and Beyond. Chr: Nayla El-Kork. Track: Nanoelectronics and Nanotechnology

Tuesday Aug 6th, 2013, 16:00-17:40
B5L-A Nyquist-Rate Data Converters. Chr: Vishal Saxena. Track: Analog and Mixed Signal Integrated Circuits
B5L-B Digital Circuits. Chr: Nader Rafla. Track: Programmable Logic, VLSI, CAD and Layout
B5L-C Special Session: High-Precision and High-Speed Data Converters II. Chr: Samuel Palermo. Track: SPECIAL SESSION
B5L-D Special Session: RF-FPGA Circuits and Systems for Enhancing Access to Radio Spectrum (CAS-EARS). Chr: Arjuna Madanayake, Vijay Devabhaktuni. Track: SPECIAL SESSION
B5L-E Analog and RF Circuit Techniques. Chr: Igor Filanovsky. Track: Analog and Mixed Signal Integrated Circuits
B5L-F Memristors, DG-MOSFETS and Graphene FETs. Chr: Reyad El-Khazali. Track: Nanoelectronics and Nanotechnology

Wednesday Aug 7th, 2013, 10:10-11:50
C2L-A Phase Locked Loops. Chr: Chung-Chih Hung. Track: Analog and Mixed Signal Integrated Circuits
C2L-B Computer Arithmetic and Cryptography. Chr: George Purdy. Track: Programmable Logic, VLSI, CAD and Layout
C2L-C Special Session: Reversible Computing. Chr: Himanshu Thapliyal. Track: SPECIAL SESSION
C2L-D Special Session: Self-healing and Self-Adaptive Circuits and Systems. Chr: Abhilash Goyal, Abhijit Chatterjee. Track: SPECIAL SESSION
C2L-E Digital Signal Processing-Media and Control. Chr: Wasfy Mikhael, Steven Bibyk. Track: Digital Signal Processing
C2L-F Advances in Communications and Wireless Systems. Chr: Sami Muhaidat. Track: Communication and Wireless Systems

Wednesday Aug 7th, 2013, 13:10-14:50
C3L-A SAR Analog-to-Digital Converters. Chr: Vishal Saxena. Track: Analog and Mixed Signal Integrated Circuits
C3L-B Real Time Systems. Chr: Brian Dupaix, Abhilash Goyal. Track: System Architectures
C3L-C Image Processing and Interpretation. Chr: Annajirao Garimella. Track: Image Processing and Multimedia Systems
C3L-D Special Session: Verification and Trusted Mixed Signal Electronics Development. Chr: Greg Creech, Steven Bibyk. Track: SPECIAL SESSION
C3L-E Digital Signal Processing I. Chr: Ying Liu. Track: Digital Signal Processing
C3L-F Wireless Systems I. Chr: Sami Muhaidat. Track: Communication and Wireless Systems

Wednesday Aug 7th, 2013, 16:00-17:40
C5L-A Wireless Systems II. Chr: Sami Muhaidat. Track: Communication and Wireless Systems
C5L-B System Architectures. Chr: Swarup Bhunia, Abhilash Goyal. Track: System Architectures
C5L-C Image Embedding Compression and Analysis. Chr: Annajirao Garimella. Track: Image Processing and Multimedia Systems
C5L-D Low Power Datapath Design. Chr: Wasfy Mikhael. Track: Digital Integrated Circuits, SoC and NoC
C5L-E Digital Signal Processing II. Chr: Moataz AbdelWahab. Track: Digital Signal Processing
C5L-F Advances in Control Systems, Mechatronics, and Robotics. Chr: Charna Parkey, Genevieve Sapijaszko. Track: Control Systems, Mechatronics, and Robotics

MOTIFs: Cliques, k-plexes, k-cores and other communities are subgraphs defined in terms of the count of their edges (an internal count). A subgraph is a motif if the number of its isomorphic occurrences in the graph (an external count) is higher than "expected". Wikipedia: motifs are defined as recurrent and statistically significant sub-graphs or patterns. Network motifs are sub-graphs that repeat themselves in a specific network or even among various networks. Each of these sub-graphs, defined by a particular pattern of edges between vertices, may indicate that a particular function is achieved efficiently. Indeed, motifs are of importance because they may reflect functional properties. Motif detection is computationally challenging. Most algorithms for motif discovery are used to find induced motifs (induced sub-graphs). Graph G′ is a sub-graph of G (G′ ⊆ G) if V′ ⊆ V and E′ ⊆ E ∩ (V′ × V′). If G′ ⊆ G and G′ contains all of the edges ⟨u,v⟩ ∈ E with u,v ∈ V′, then G′ is an induced sub-graph of G. G′ and G are isomorphic (G′ ↔ G) if there exists a bijection (one-to-one mapping) f: V′ → V with ⟨u,v⟩ ∈ E′ ⇔ ⟨f(u), f(v)⟩ ∈ E for all u,v ∈ V′. When G″ ⊂ G and there exists an isomorphism between sub-graph G″ and a graph G′, this mapping represents an appearance of G′ in G. The number of appearances of graph G′ in G is called the frequency F_G(G′) of G′ in G. G′ is called recurrent or frequent in G when its frequency F_G(G′) is above a predefined threshold or cut-off value. We use the terms pattern and frequent sub-graph interchangeably in this review. Motif discovery methods can be classified as exact counting, sampling, pattern-growth methods, and so on. Motif discovery has two steps: first calculate the number of occurrences of a sub-graph, then evaluate its significance. mfinder implements both full enumeration and sampling. Until 2004, the only exact counting method for network motif detection was the brute-force approach proposed by Milo et al.[3] It was successful at discovering small motifs, but finding even size-5 or size-6 motifs was not computationally feasible.
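As a concrete illustration of why brute-force exact counting stops scaling around size 5 or 6, here is a minimal Python sketch of a size-k induced sub-graph census (an illustrative sketch, not the implementation of Milo et al.): it canonicalizes each connected k-vertex set by trying all k! orderings, so the cost grows with both C(n,k) candidate sets and the k! labelings.

```python
from itertools import combinations, permutations

def connected(adj, nodes):
    """DFS check that `nodes` induces a connected sub-graph."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] & nodes)
    return seen == nodes

def canonical_form(adj, nodes):
    """Lexicographically smallest adjacency bit string over all k!
    vertex orderings -- a brute-force isomorphism-class label."""
    return min(
        tuple(int(p[j] in adj[p[i]])
              for i in range(len(p)) for j in range(len(p)))
        for p in permutations(nodes))

def census(adj, k):
    """Count every connected size-k induced sub-graph of an undirected
    graph (adj maps each vertex to its neighbor set), grouped by
    isomorphism class."""
    counts = {}
    for nodes in combinations(adj, k):
        if connected(adj, nodes):
            key = canonical_form(adj, nodes)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, on a 4-cycle every 3-vertex subset induces a 2-edge path, so `census` reports a single isomorphism class with count 4.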
Hence, a new approach to this problem was needed. The sampling NM algorithm of Kashtan et al. [9], based on edge sampling throughout the network, estimates the concentrations of induced sub-graphs and can be utilized in directed or undirected networks. The sampling procedure starts from an arbitrary edge, which yields a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. It continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. In sampling, taking unbiased samples is important. Kashtan et al. [9] proposed a weighting scheme that assigns different weights to the different sub-graphs; it exploits the information about the sampling probability of each sub-graph, i.e. probable sub-graphs obtain comparatively lower weights than improbable sub-graphs. Hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique allows mfinder to determine sub-graph concentrations impartially. In sharp contrast to exhaustive search, the computational time of the algorithm is, surprisingly, asymptotically independent of the network size. An analysis of the computational time has shown that it takes O(n^n) for each sample of a sub-graph of size n from the network. On the other hand, there is no analysis in [9] of the classification time of sampled sub-graphs, which requires solving the graph isomorphism problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation.
But it is unavoidable to note that the algorithm may sample the same sub-graph multiple times, spending time without gathering any information.[10] In conclusion, by sampling, mfinder is faster than exhaustive search, but it only determines sub-graph concentrations approximately. Because of its implementation, this algorithm can find motifs only up to size 6, and it reports only the most significant motif as its result. It also offers no option for visual presentation.

mfinder sampling procedure. Let Es be the set of picked edges and Vs the set of all nodes that are touched by the edges in Es; initialize Vs and Es as empty sets.
1. Pick a random edge e1 = (vi, vj). Update Es = {e1}, Vs = {vi, vj}.
2. Make a list L of all neighbor edges of Es. Omit from L all edges between members of Vs.
3. Pick a random edge e = (vk, vl) from L. Update Es = Es ∪ {e}, Vs = Vs ∪ {vk, vl}.
4. Repeat steps 2-3 until an n-node subgraph is completed (until |Vs| = n).
5. Calculate the probability of sampling the picked n-node subgraph.

Frequency concepts (M1-M4 denote different occurrences of sub-graph (b) in graph (a) in the accompanying figure): for frequency concept F1, the matches M1, M2, M3, M4 all count, so F1 = 4. For F2, only one of the two sets of matches {M1, M4} or {M2, M3} is possible, so F2 = 2. For F3, merely one of the matches M1 to M4 is allowed, so F3 = 1. The frequency decreases as reuse of network elements is increasingly restricted.
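The edge-sampling steps 1-4 above can be sketched in a few lines of Python (step 5, the sampling-probability correction, is omitted; `sample_subgraph` is a hypothetical helper name for illustration, not mfinder's actual API):

```python
import random

def sample_subgraph(edges, n, rng=random):
    """One mfinder-style edge sample. `edges` is a list of undirected
    edges (u, v); returns the sampled vertex set Vs with |Vs| = n."""
    incident = {}                      # vertex -> list of incident edges
    for e in edges:
        for v in e:
            incident.setdefault(v, []).append(e)
    e1 = rng.choice(edges)             # step 1: random seed edge
    Es, Vs = {e1}, set(e1)
    while len(Vs) < n:                 # step 4: repeat until |Vs| = n
        # step 2: neighbor edges of Es, omitting edges inside Vs
        L = [e for v in Vs for e in incident[v]
             if e not in Es and not set(e) <= Vs]
        if not L:                      # dead end: restart the sample
            return sample_subgraph(edges, n, rng)
        e = rng.choice(L)              # step 3: extend by a random edge
        Es.add(e)
        Vs.update(e)
    return Vs
```

A full estimator would repeat this many times, classify each sample by isomorphism, and reweight counts by each sample's step-5 probability.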

Schreiber and Schwöbbermeyer [12] proposed the flexible pattern finder (FPF) algorithm in a system named Mavisto.[23] It exploits the downward closure property, which is applicable for frequency concepts F2 and F3. The downward closure property asserts that the frequency of sub-graphs decreases monotonically as the size of sub-graphs increases; however, it does not necessarily hold for frequency concept F1. FPF is based on a pattern tree (see figure) consisting of nodes that represent different graphs (or patterns), where the parent of each node is a sub-graph of its children; i.e., the graph corresponding to each pattern-tree node is obtained by adding a new edge to the graph of its parent node. At first, FPF enumerates and maintains the information of all matches of the sub-graph at the root of the pattern tree. Then, it builds the child nodes of the previous node by adding one edge supported by a matching edge in the target graph, and tries to extend all of the previous match information to the new sub-graph (child node). In the next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and if downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure is continued until there is no remaining path to traverse. It does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time on promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern-tree notion permits FPF to be implemented and executed in parallel, since each path of the pattern tree can be traversed independently. However, FPF is most useful for frequency concepts F2 and F3, because downward closure is not applicable to F1. Still, the pattern tree remains practical for F1 if the algorithm runs in parallel. FPF has no limitation on motif size, which makes it more amenable to improvements.
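The downward-closure pruning that FPF performs on the pattern tree can be sketched abstractly; here `children` and `frequency` are hypothetical callables standing in for FPF's edge-extension and match-counting machinery:

```python
def fpf_traverse(children, frequency, threshold, root):
    """Downward-closure pruning over a pattern tree (F2/F3 style):
    if a pattern's frequency is below the threshold, none of its
    descendants can be frequent, so the whole branch is skipped.
    children(p) yields p's child patterns (one extra edge each);
    frequency(p) returns p's F2/F3 frequency."""
    frequent, stack = [], [root]
    while stack:
        p = stack.pop()
        if frequency(p) < threshold:
            continue                   # prune: skip this entire branch
        frequent.append(p)
        stack.extend(children(p))
    return frequent
```

Because each branch is traversed independently, this loop parallelizes naturally, which is the property the text notes for FPF.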
ESU (FANMOD): The sampling bias of Kashtan et al. [9] provided great impetus for designing better algorithms for NM discovery. Even with the weighting scheme, their method imposed an undesired overhead on the running time, as well as a more complicated implementation. FANMOD supports visual options and is time efficient, but it does not allow searching for motifs of size 9 or larger. Wernicke's [10] RAND-ESU, which improves on mfinder, is based on the exact enumeration algorithm ESU and has been implemented as a tool called FANMOD.[10] RAND-ESU is a discovery algorithm applicable to both directed and undirected networks. It effectively exploits an unbiased node sampling and prevents overcounting of sub-graphs. RAND-ESU uses DIRECT for determining sub-graph significance instead of an ensemble of random networks as a null-model. DIRECT estimates the number of sub-graphs without explicitly generating random networks.[10] Empirically, DIRECT is more efficient than a random-network ensemble for sub-graphs with a very low concentration, but the classical null-model is faster than DIRECT for highly concentrated sub-graphs.[3][10]

ESU algorithm: ESU first finds the set Sk of all induced sub-graphs of size k; the exact algorithm can then be modified efficiently into RAND-ESU, which estimates sub-graph concentrations instead. The algorithms ESU and RAND-ESU are fairly simple, and hence easy to implement. ESU can be implemented as a recursive function; the running of this function can be displayed as a tree-like structure of depth k, called the ESU-tree (see figure). Each ESU-tree node indicates the status of the recursive function and entails two consecutive sets, SUB and EXT. SUB refers to nodes in the target network that are adjacent and establish a partial sub-graph of size |SUB| ≤ k. If |SUB| = k, the algorithm has found a complete induced sub-graph, and Sk = SUB ∪ Sk. If |SUB| < k, the algorithm expands SUB using the EXT set: EXT holds nodes that are adjacent to some node of SUB and whose numerical labels are larger than the labels of the SUB nodes. EXT is utilized by the algorithm to expand a SUB set until it reaches the desired size, placed at the lowest level of the ESU-tree (its leaves).

EnumerateSubgraphs(G, k)   [ESU]
Input: a graph G = (V, E) and an integer 1 ≤ k ≤ |V|
Output: all size-k induced sub-graphs of G
for each vertex v ∈ V do
    VExtension ← {u ∈ N({v}) : u > v}
    call ExtendSubgraph({v}, VExtension, v)
endfor

ExtendSubgraph(VSubgraph, VExtension, v)
if |VSubgraph| = k then output G[VSubgraph] and return
while VExtension ≠ ∅ do
    remove an arbitrarily chosen vertex w from VExtension
    VExtension′ ← VExtension ∪ {u ∈ Nexcl(w, VSubgraph) : u > v}
    call ExtendSubgraph(VSubgraph ∪ {w}, VExtension′, v)
return

Figure: the leaves of the ESU-tree form the set S3 of all size-3 induced sub-graphs of the target graph (a).
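The ESU recursion translates almost line-for-line into Python. This is an illustrative sketch assuming integer vertex labels and an adjacency-set representation, not the FANMOD implementation:

```python
def esu(adj, k):
    """ESU exact enumeration of all size-k connected induced
    sub-graphs of an undirected graph; `adj` maps integer-labeled
    vertices to neighbor sets. Each sub-graph is emitted exactly once."""
    out = []

    def extend(sub, ext, v):
        if len(sub) == k:
            out.append(frozenset(sub))   # a complete size-k sub-graph
            return
        while ext:
            w = ext.pop()                # remove an arbitrary vertex w
            # exclusive neighbors of w: labeled above v, not in SUB,
            # and not adjacent to any SUB vertex
            nexcl = {u for u in adj[w] if u > v and u not in sub
                     and all(u not in adj[s] for s in sub)}
            extend(sub | {w}, ext | nexcl, v)

    for v in adj:                        # one ESU-tree branch per vertex
        extend({v}, {u for u in adj[v] if u > v}, v)
    return out
```

The `u > v` restriction (only extend with labels above the root vertex) is exactly what prevents the same sub-graph from being counted from two different roots.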

NeMoFinder adapts SPIN [27] to extract frequent trees and then expands them into non-isomorphic graphs.[8] NeMoFinder utilizes frequent size-n trees to partition the input network into a collection of size-n graphs, afterward finding frequent size-n sub-graphs by expanding the frequent trees edge-by-edge until a complete size-n graph Kn is obtained. The algorithm finds NMs in undirected networks and is not limited to extracting only induced sub-graphs. Furthermore, NeMoFinder is an exact enumeration algorithm and is not based on a sampling method. As Chen et al. claim, NeMoFinder is applicable for detecting relatively large NMs, for instance, finding NMs up to size 12 in the whole S. cerevisiae (yeast) PPI network.[28] NeMoFinder consists of three main steps: first, finding frequent size-n trees; then, utilizing repeated size-n trees to divide the entire network into a collection of size-n graphs; finally, performing sub-graph join operations to find frequent size-n sub-graphs.[26] In the first step, the algorithm detects all non-isomorphic size-n trees and mappings from a tree to the network. In the second step, the ranges of these mappings are employed to partition the network into size-n graphs. Up to this step, there is no distinction between NeMoFinder and an exact enumeration method; however, a large portion of non-isomorphic size-n graphs still remains. NeMoFinder exploits a heuristic to enumerate non-tree size-n graphs using the information obtained from the preceding steps. The main advantage lies in the third step, which generates candidate sub-graphs from previously enumerated sub-graphs. This generation of new size-n sub-graphs is done by joining each previous sub-graph with derivative sub-graphs from itself, called cousin sub-graphs. These new sub-graphs contain one additional edge in comparison to the previous sub-graphs.
However, there exist some problems in generating new sub-graphs: there is no clear method to derive cousins from a graph; joining a sub-graph with its cousins leads to redundancy, generating a particular sub-graph more than once; and cousin determination is done by a canonical representation of the adjacency matrix, which is not closed under the join operation. NeMoFinder is an efficient network motif finding algorithm for motifs up to size 12, but only for protein-protein interaction networks, which are presented as undirected graphs. It is not able to work on directed networks, which are so important in the field of complex and biological networks. The pseudocode of NeMoFinder is shown here:

NeMoFinder
Input: G - PPI network;
       N - Number of randomized networks;
       K - Maximal network motif size;
       F - Frequency threshold;
       S - Uniqueness threshold;
Output: U - Repeated and unique network motif set;
D ← ∅;
for motif-size k from 3 to K do
    T ← FindRepeatedTrees(k);
    GDk ← GraphPartition(G, T);
    D ← D ∪ T;
    D′ ← T;
    i ← k;
    while D′ ≠ ∅ and i ≤ k × (k - 1) / 2 do
        D′ ← FindRepeatedGraphs(k, i, D′);
        D ← D ∪ D′;
        i ← i + 1;
    end while
end for
for counter i from 1 to N do
    Grand ← RandomizedNetworkGeneration();
    for each g ∈ D do
        GetRandFrequency(g, Grand);
    end for
end for
U ← ∅;
for each g ∈ D do
    s ← GetUniquenessValue(g);
    if s ≥ S then
        U ← U ∪ {g};
    end if
end for
return U;

Grochow and Kellis [29] proposed an exact algorithm for enumerating sub-graph appearances, based on a motif-centric approach: the frequency of a given sub-graph, called the query graph, is exhaustively determined by searching for all possible mappings from the query graph into the larger network. It is claimed [29] that a motif-centric method has some beneficial features in comparison to network-centric methods. First of all, it avoids the increased complexity of sub-graph enumeration. Also, by using mapping instead of enumeration, it enables an improvement in the isomorphism test.
To improve the performance of the algorithm, since it is otherwise an inefficient exact enumeration algorithm, the authors introduced a fast method called symmetry-breaking conditions. During straightforward sub-graph isomorphism tests, a sub-graph may be mapped to the same sub-graph of the query graph multiple times; in the Grochow-Kellis (GK) algorithm, symmetry-breaking is used to avoid such multiple mappings.

Figure: the GK algorithm and the symmetry-breaking condition that eliminates redundant isomorphism tests. (a) graph G; (b) illustration of all automorphisms of G shown in (a). From the set AutG we can obtain a set of symmetry-breaking conditions of G, given by SymG in (c). Only the first mapping in AutG satisfies the SymG conditions; so, by applying SymG in the Isomorphism Extension module, the algorithm enumerates each match-able sub-graph to G only once. Note that SymG is not a unique set for an arbitrary graph G.

The GK algorithm discovers the whole set of mappings of a given query graph to the network in two major steps. It starts with the computation of the symmetry-breaking conditions of the query graph. Next, by means of a branch-and-bound method, the algorithm tries to find every possible mapping from the query graph to the network that meets the associated symmetry-breaking conditions. Computing symmetry-breaking conditions requires finding all automorphisms of a given query graph. Even though there is no efficient (polynomial-time) algorithm for the graph automorphism problem, this problem can be tackled efficiently in practice by McKay's tools.[24][25] As claimed, using symmetry-breaking conditions in NM detection saves a great deal of running time. Moreover, it can be inferred from the results in [29][30] that using the symmetry-breaking conditions yields high efficiency, particularly for directed networks in comparison to undirected networks. The symmetry-breaking conditions used in the GK algorithm are similar to the restriction which the ESU algorithm applies to the labels in the EXT and SUB sets. In conclusion, the GK algorithm computes the exact number of appearances of a given query graph in a large complex network, and exploiting symmetry-breaking conditions improves its performance. Also, the GK algorithm is one of the known algorithms having no limitation on motif size in its implementation; potentially, it can find motifs of any size.
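The automorphisms from which symmetry-breaking conditions are derived can be found by brute force for small query graphs. A sketch for illustration only (practical tools use McKay's nauty instead):

```python
from itertools import permutations

def automorphisms(adj):
    """Brute-force automorphism group of a small undirected graph
    given as {vertex: neighbor_set}: every vertex bijection f such
    that (u, v) is an edge iff (f(u), f(v)) is an edge."""
    verts = sorted(adj)
    autos = []
    for perm in permutations(verts):
        f = dict(zip(verts, perm))
        if all((f[u] in adj[f[v]]) == (u in adj[v])
               for u in verts for v in verts):
            autos.append(f)
    return autos
```

Symmetry-breaking conditions of the form label(a) < label(b) are then chosen so that, of all these automorphic relabelings, only the identity-like one survives, which is what makes each match-able sub-graph be enumerated once.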

Network Classification
• Collective Inference using Relaxation Labeling
• Definition of collective inference:
• Similar to, but different from, Gibbs sampling in that it:
  • keeps track of class probability estimates for XU
  • instead of updating the graph one node at a time, updates the class probabilities of all vertices at iteration t+1 based on the estimates from step t.
Final Presentation
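A minimal sketch of one synchronous update of this kind, where every vertex's class distribution at step t+1 is the neighborhood average of the step-t estimates (an illustrative rule under that assumption, not a specific published relaxation-labeling scheme):

```python
import numpy as np

def relaxation_labeling_step(probs, adjacency):
    """One synchronous relaxation-labeling update: all vertices are
    updated at once from the previous iteration's estimates (unlike
    Gibbs sampling's one-node-at-a-time updates).
    probs: (n_vertices, n_classes) class probability estimates;
    adjacency: (n_vertices, n_vertices) 0/1 matrix."""
    deg = adjacency.sum(axis=1, keepdims=True)
    new = (adjacency / np.clip(deg, 1, None)) @ probs  # neighbor average
    return new / np.clip(new.sum(axis=1, keepdims=True), 1e-12, None)
```

Iterating this map until the estimates stop changing is the collective-inference loop; only the update rule for a single step is shown here.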

In 2010, Ribeiro and Silva proposed a novel data structure for storing a collection of sub-graphs, called a g-trie.[37] This data structure, which is conceptually akin to a prefix tree, stores sub-graphs according to their structures and finds occurrences of each of these sub-graphs in a larger graph. One of the noticeable aspects of this data structure is that, when it comes to network motif discovery, only the sub-graphs that occur in the main network need to be evaluated; there is no need to count, in the random networks, sub-graphs that do not appear in the main network. This is one of the time-consuming parts of algorithms that derive all sub-graphs of the random networks. A g-trie is a multiway tree that can store a collection of graphs. Each tree node contains information about a single graph vertex and its corresponding edges to ancestor nodes. A path from the root to a leaf corresponds to one single graph. Descendants of a g-trie node share a common sub-graph. Constructing a g-trie is well described in [37]. After constructing a g-trie, the counting part takes place. The main idea of the counting process is to backtrack through all possible sub-graphs while performing the isomorphism tests at the same time. This backtracking technique is essentially the same technique employed by other motif-centric approaches such as the MODA and GK algorithms. It takes advantage of common substructures in the sense that, at a given time, there is a partial isomorphic match for several different candidate sub-graphs. Among the mentioned algorithms, G-Tries is the fastest; but the excessive use of memory is the drawback of this algorithm, which might limit the size of motifs discoverable on a personal computer with average memory.
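The prefix-sharing idea can be sketched with a toy g-trie-like structure. Note this sketch omits the canonical vertex ordering a real g-trie uses so that isomorphic graphs map to the same path; it only shows how graphs with a common structural prefix share tree nodes:

```python
class GTrieNode:
    """Toy g-trie: each inserted graph is spelled out row by row of
    its adjacency matrix, keyed on each vertex's connections to the
    vertices before it, so graphs with a common prefix share nodes."""

    def __init__(self):
        self.children = {}      # connection-pattern tuple -> GTrieNode
        self.is_graph = False   # True if a stored graph ends here

    def insert(self, adj_matrix):
        node = self
        for i, row in enumerate(adj_matrix):
            key = tuple(row[:i])   # edges to earlier vertices only
            node = node.children.setdefault(key, GTrieNode())
        node.is_graph = True
```

Inserting a triangle and a 3-vertex star, for example, shares the first two tree levels and diverges only at the third vertex's connection pattern.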
Runtimes (in seconds) of FANMOD, GK and G-Trie for sub-graphs of 3-9 nodes on five different networks.[37]

Network   Size | Original network           | Average census on random networks
               | FANMOD    GK       G-Trie  | FANMOD    GK       G-Trie
Dolphins  5    | 0.07      0.03     0.01    | 0.13      0.04     0.01
Dolphins  6    | 0.48      0.28     0.04    | 1.14      0.35     0.07
Dolphins  7    | 3.02      3.44     0.23    | 8.34      3.55     0.46
Dolphins  8    | 19.44     73.16    1.69    | 67.94     37.31    4.03
Dolphins  9    | 100.86    2984.22  6.98    | 493.98    366.79   24.84
Circuit   6    | 0.49      0.41     0.03    | 0.55      0.24     0.03
Circuit   7    | 3.28      3.73     0.22    | 3.53      1.34     0.17
Circuit   8    | 17.78     48.00    1.52    | 21.42     7.91     1.06
Social    3    | 0.31      0.11     0.02    | 0.35      0.11     0.02
Social    4    | 7.78      1.37     0.56    | 13.27     1.86     0.57
Social    5    | 208.30    31.85    14.88   | 531.65    62.66    22.11
Yeast     3    | 0.47      0.33     0.02    | 0.57      0.35     0.02
Yeast     4    | 10.07     2.04     0.36    | 12.90     2.25     0.41
Yeast     5    | 268.51    34.10    12.73   | 400.13    47.16    14.98
Power     3    | 0.51      1.46     0.00    | 0.91      1.37     0.01
Power     4    | 1.38      4.34     0.02    | 3.01      4.40     0.03
Power     5    | 4.68      16.95    0.10    | 12.38     17.54    0.14
Power     6    | 20.36     95.58    0.55    | 67.65     92.74    0.88
Power     7    | 101.04    765.91   3.36    | 408.15    630.65   5.17

Classification of algorithms: motif discovery algorithms fall into exact counting methods and those using statistical sampling and estimation. Because the second group does not count all occurrences of a subgraph in the main network, the algorithms belonging to this group are faster, but they might yield biased and unrealistic results. Exact counting algorithms can in turn be classified into network-centric and subgraph-centric methods: the algorithms of the first class search the given network for all subgraphs of a given size, while the algorithms of the second class first generate the different possible non-isomorphic graphs of the given size, and then explore the network for each generated subgraph separately.

Classification of Motif Discovery Algorithms

Name            Directed/Undirected  Induced/Non-induced  Count Method   Basis
mfinder         Both                 Induced              Exact          Network-centric
ESU (FANMOD)    Both                 Induced              Exact          Network-centric
Kavosh          Both                 Induced              Exact          Network-centric
G-Tries         Both                 Induced              Exact          Network-centric
FPF (Mavisto)   Both                 Induced              Exact          Network-centric
NeMoFinder      Undirected           Induced              Exact          Network-centric
Grochow-Kellis  Both                 Both                 Exact          Subgraph-centric
MODA            Both                 Both                 Exact          Subgraph-centric
N. Alon         Undirected           Non-induced          Est./Sampling  Color-coding
mfinder         Both                 Induced              Est./Sampling  Other
ESU (FANMOD)    Both                 Induced              Est./Sampling  Other

Comparison with other methods. Recently, Tjong and Zhou (2007) developed a neural network method for predicting DNA-binding sites. In their method, for each surface residue, the PSSM and solvent accessibilities of the residue and its 14 neighbors were used as input to a neural network in the form of vectors. In their publication, Tjong and Zhou showed that their method achieved better performance than other previously published methods. In the current study, the 13 test proteins were obtained from the study of Tjong and Zhou. Thus, we can compare the method proposed in the current study with Tjong and Zhou's neural network method using the 13 proteins.

Figure 1. Tradeoff between coverage and accuracy.

In their publication, Tjong and Zhou also used coverage and accuracy to evaluate the predictions. However, they defined accuracy using a loosened criterion of "true positive": if a predicted interface residue is within the four nearest neighbors of an actual interface residue, then it is counted as a true positive. Here, in the comparison of the two methods, the strict definition of true positive is used, i.e., a predicted interface residue is counted as a true positive only when it is a true interface residue. The original data were obtained from Table 1 of Tjong and Zhou (2007), and the accuracy for the neural network method was recalculated using this strict definition (Table 3). The coverage of the neural network was taken directly from Tjong and Zhou (2007). For each protein, Tjong and Zhou's method reported one coverage and one accuracy. In contrast, the method proposed in this study allows the user to trade off between coverage and accuracy based on actual need. For the purpose of comparison, for each test protein, top-ranking patches are included in the set of predicted interface residues one by one, in decreasing order of rank, until the coverage is the same as or higher than the coverage that the neural network method achieved on that protein.
Then the coverage and accuracy of the two methods are compared. On a test protein, method A is better than method B if accuracy(A) > accuracy(B) and coverage(A) ≥ coverage(B). Table 3 shows that the graph kernel method proposed in this study achieves better results than the neural network method on 7 proteins (in bold font in Table 3). On 4 proteins (shown in gray shading in Table 3), the neural network method is better than the graph kernel method. On the remaining 2 proteins (in italic font in Table 3), no conclusions can be drawn, because the two conditions, accuracy(A) > accuracy(B) and coverage(A) ≥ coverage(B), never become true at the same time: when coverage(graph kernel) > coverage(neural network), we have accuracy(graph kernel) < accuracy(neural network), and vice versa. Note that the coverage of the graph kernel method increases in a discontinuous fashion as we use more patches to predict DNA-binding sites. On these two proteins, we were not able to reach a point where the two methods have identical coverage. Given this situation, we consider that the two methods tie on these 2 proteins. Thus, these comparisons show that the graph kernel method achieves better results than the neural network on 7 of the 13 proteins (shown in bold font in Table 3). Additionally, on 2 other proteins (shown in italic font in Table 3), the graph kernel method ties with the neural network method. When averaged over the 13 proteins, the coverage and accuracy for the graph kernel method are 59% and 64%. It is worth pointing out that, in the current study, the predictions are made using the protein structures that are unbound with DNA. In contrast, the data we obtained from Tjong and Zhou's study were obtained using protein structures bound with DNA. In their study, Tjong and Zhou showed that when unbound structures were used, the average coverage decreased by 6.3% and the average accuracy by 4.7% for the 14 proteins (but the data for each protein were not shown).
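Under the strict true-positive definition used here, coverage and accuracy reduce to recall and precision over residue sets. A small helper for illustration (hypothetical, not from either paper's code):

```python
def coverage_accuracy(predicted, actual):
    """Strict evaluation: a predicted interface residue is a true
    positive only if it is an actual interface residue.
    Coverage is recall (TP / |actual|); accuracy is precision
    (TP / |predicted|). Both arguments are sets of residue IDs."""
    tp = len(predicted & actual)
    coverage = tp / len(actual) if actual else 0.0
    accuracy = tp / len(predicted) if predicted else 0.0
    return coverage, accuracy
```

Adding top-ranking patches one by one grows `predicted`, which is why coverage rises (discontinuously) while accuracy generally falls, giving the tradeoff curve discussed above.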

Edge Count Clique Theorems: Graph C = (VC, EC) is a clique iff |EC| = COMB(|VC|, 2) = |VC|! / ((|VC| - 2)! 2!). (VC, EC) is a k-clique iff each of its induced (k-1)-subgraphs (VD, ED) is a (k-1)-clique.

Apriori Clique Mining Alg: uses an ARM-Apriori-like downward closure property. Let CSk = kCliqueSet and CCSk+1 = Candidate (k+1)CliqueSet. By the subgraph expansion above, CCSk+1 = all unions of CSk pairs with k-1 common vertices. Let C in CCSk+1 be a union of 2 k-cliques with k-1 common vertices, and let v, w be the (distinct) kth vertices of the two k-cliques: C is in CSk+1 iff PE(v,w) = 1, i.e., iff (v,w) is an edge.

Breadth-1st Clique Alg: CLQK = all K-cliques. Find CLQ3 first. A K-clique and a 3-clique sharing an edge form a (K+1)-clique iff all K-2 edges from the non-shared K-clique vertices to the non-shared 3-clique vertex exist. Next find CLQ4, then CLQ5, ...

Depth-1st Clique Alg: find a largest MaxClique containing v. If (x,y) is in E, let NewPtSet(v,w,x,y) = CLQ3pTree(v,w) & CLQ3pTree(x,y) and examine Count(NewPtSet(v,w,x,y)):
0: the 4 vertices form a max4Clique (i.e., v,w,x,y).
1: 5 vertices form a max5Clique (i.e., v,w,x,y,NewPt).
2: 6 vertices form a max6Clique if the NewPair is in E; else they form 2 max5Cliques.
3: 7 vertices form a max7Clique if each NewPair is in E; else if 1 or 2 NewPairs are in E, each 6VertexSet (v,w,x,y + 2 edge endpoints) forms a max6Clique; else if 0 NewPairs are in E, each 5VertexSet (v,w,x,y + 1 new vertex) forms a maximal 5Clique...
Theorem: for any h-clique within NewPtSet, those h vertices together with v,w,x,y form a maximal (h+4)-clique, where NPS(v,w,x,y) = CLQ3(v,w) & CLQ3(x,y).

GRAPH (linear edges, 2 vertices per edge). kPARTITE GRAPH (V = disjoint union of Vi, i=1..k; (x,y) in E implies x,y are not in the same Vi). kHYPERGRAPH (each edge = k vertices); a 2graph = a 2hypergraph. kPARTITE HYPERGRAPH (V = disjoint union of Vi, i=1..k; (x1..xk) in E implies no two xj are in the same Vi).

Bipartite Clique Mining finds MaxCliques at the cost of pairwise ANDs. Each LETpTree is a MCLQ unless a pairwise AND has the same count: for A&B with Ct(A&B) = Ct(A), A&B is a MCLQ. There is potential for a k-plex [k-core] mining alg here: instead of Ct(A&B) = Ct(A), consider, e.g., Ct(A&B) = Ct(A) - 1. Each such pTree, C, would be missing just 1 vertex (1 edge). Taking any MCLQ as above, ANDing in C's pTree would produce a 1-plex; ANDing in k such C's would produce a k-plex.
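The Apriori-style join described above (two k-cliques sharing k-1 vertices form a candidate; the candidate survives as a (k+1)-clique iff their two differing vertices are an edge) can be sketched without pTrees as:

```python
from itertools import combinations

def apriori_cliques(edges, kmax):
    """Apriori-style clique mining sketch. CS2 is the edge set; a
    candidate (k+1)-clique is the union of two k-cliques sharing
    k-1 vertices, kept iff the two non-shared vertices are an edge
    (downward closure guarantees completeness)."""
    E = {frozenset(e) for e in edges}
    levels = {2: set(E)}                     # CS2 = edges as 2-cliques
    for k in range(2, kmax):
        nxt = set()
        for a, b in combinations(levels[k], 2):
            if len(a & b) == k - 1:
                v, w = tuple(a ^ b)          # the two non-shared vertices
                if frozenset((v, w)) in E:   # join test: (v,w) is an edge
                    nxt.add(a | b)
        if not nxt:
            break
        levels[k + 1] = nxt
    return levels
```

On the complete graph K4, for example, the level-3 set contains all four triangles and the level-4 set contains the single 4-clique.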
In fact, suppose we have produced a k-plex in such a manner; then ANDing in any C with Ct(C) = Ct(A)−h would produce a (k+h)-plex. And &i=1..n Ai is a [Σi=1..n Ct(Ai)]-core.
A Tripartite Clique Mining Algorithm? In a tripartite graph, edges must start and end in different vertex parts. E.g., PART1 = tweeters; PART2 = hashtags; PART3 = tweets. Tweeters-to-hashtags is many-to-many? Tweeters-to-tweets is many-to-many (incl. retweets)? Hashtags-to-tweets is many-to-many? Multipartite graphs: bipartite, tripartite (with 2, 3 PARTs, resp.), … The rule is that no edge can start and end in the same PART.
HyperClique Mining: a 3-hyperGraph has 3 vertex PARTs and each edge is a planar triangle (a vertex triple, one from each PART). A stock recommender is a 3-PART hyperGraph (Investors, Stocks, Days): a triangular "edge" connects Investor k, Stock X, and Day n if k recommended X on day n. A 3-PART hyperClique is a community s.t. all the investors in the clique recommend all the stocks in the clique on each of the days in the clique (a strong signal?). Tweet example: PART1 = tweeters; PART2 = hashtags; PART3 = tweets. Conjecture: K-multiCliques and K-hyperCliques are in 1-1 correspondence (K vertex sets)? If so, only one of the mining processes is needed. Represent these common objects with cliqueTrees (cTrees).
Cliques, k-plexes and k-cores are subgraphs or communities defined in terms of the count of their edges. Another type of subgraph of interest today is a motif, which is defined by the count of its occurrences in the graph. E.g., a clique is a motif iff it occurs more times (up to isomorphism) in the graph than the "expected" number.
Criticism: An assumption behind the preservation of a topological sub-structure is that it is of particular functional importance. This assumption has recently been questioned. Some authors have argued that motifs might show a variety depending on the network context, and therefore [62] the structure of the motif does not necessarily determine function.
Network structure certainly does not always indicate function; this is an idea that has been around for some time (for an example, see the Sin operon [63]). Most analyses of motif function are carried out looking at the motif operating in isolation. Recent research [64] provides good evidence that network context, i.e., the connections of the motif to the rest of the network, is too important to draw inferences on function from local structure only; the cited paper also reviews the criticisms and alternative explanations for the observed data. An analysis of the impact of a single motif module on the global dynamics of a network is studied in [65]. Yet another recent work [66] suggests that certain topological features of biological networks naturally give rise to the common appearance of canonical motifs, thereby questioning whether frequencies of occurrences are reasonable evidence that the structures of motifs are selected for their functional contribution to the operation of networks. Would a motif in the Stock-Investor or Stock-Investor-Day graphs have useful meaning? Can we mine for motifs?

Mining for Communities with more relaxed definitions than cliques (taken from Fortunato's survey). There are many cohesiveness definitions other than a clique. Another criterion for subgraph cohesion relies on the adjacency of its vertices: the idea is that a vertex must be adjacent to some minimum number of other vertices in the subgraph. In the literature on social network analysis there are two complementary ways of expressing this. A k-plex is a maximal subgraph in which each vertex is adjacent to all other vertices of the subgraph except at most k of them. A k-core is a maximal subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. In any graph there is a whole hierarchy of cores of different order. A k-core is essentially the same as a p-quasi complete subgraph, i.e., a subgraph such that the degree of each vertex is larger than p(k−1), where p is a real number in [0, 1] and k the order of the subgraph. However cohesive a subgraph may be, it would hardly be a community if there is strong cohesion also between the subgraph and the rest of the graph. Therefore, it is important to compare the internal and external cohesion of a subgraph; in fact, this is what is usually done in the most recent definitions of community. The first recipe, however, is not recent and stems from social network analysis. An LS-set is a subgraph such that the internal degree of each vertex is greater than its external degree. This condition is quite strict and can be relaxed into the so-called weak definition of community, for which it suffices that the internal degree of the subgraph exceeds its external degree. A community is strong if the internal degree of each vertex exceeds the number of edges that the vertex shares with any other community. A community is weak if its total internal degree exceeds the number of edges shared by the community with the other communities.
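The two adjacency-based tests can be written down directly. A minimal sketch (helper names assumed), following the definitions as quoted above: in a k-plex every vertex may be non-adjacent to at most k of the *other* members, and in a k-core every vertex has at least k neighbors inside the subgraph:

```python
def is_k_plex(members, edges, k):
    """k-plex test per the quoted definition: internal degree >= n-1-k."""
    E = {frozenset(e) for e in edges}
    n = len(members)
    return all(
        sum(frozenset((v, w)) in E for w in members if w != v) >= n - 1 - k
        for v in members
    )

def is_k_core(members, edges, k):
    """k-core test: every member has at least k neighbors inside."""
    E = {frozenset(e) for e in edges}
    return all(
        sum(frozenset((v, w)) in E for w in members if w != v) >= k
        for v in members
    )

# Example graph with edges 13, 14, 24, 34:
edges = [(1, 3), (1, 4), (2, 4), (3, 4)]
assert is_k_plex({1, 3, 4}, edges, 1)      # triangle
assert is_k_core({1, 3, 4}, edges, 2)
assert is_k_plex({1, 2, 3, 4}, edges, 2)   # vertex 2 misses edges to 1 and 3
assert not is_k_core({1, 2, 3, 4}, edges, 2)  # vertex 2 has only 1 neighbor
```

(Maximality is not checked here; the functions only test the degree condition on a given vertex set.)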
Another definition focuses on the robustness of clusters to edge removal and uses the concept of edge connectivity. The edge connectivity of a pair of vertices is the minimal number of edges that need to be removed in order to disconnect them (leaving no path between them). A lambda-set is a subgraph such that any pair of vertices of the subgraph has a larger edge connectivity than any pair formed by one vertex of the subgraph and one outside it. However, the vertices of a lambda-set need not be adjacent and may be quite distant from each other. Communities can also be identified by a fitness measure, expressing to what extent a subgraph satisfies a given property related to its cohesion. The larger the fitness, the more definite is the community. This is the same principle behind quality functions, which give an estimate of the goodness of a graph partition. The simplest fitness measure for a cluster is its intra-cluster density δint(C) (see slide 1). One could say a subgraph C with k vertices is a cluster if δint(C) > threshold. Finding such subgraphs is NP-complete, as it coincides with the NP-complete Clique Problem when the threshold = 1. It is better to fix the size of the subgraph because, without this condition, any clique would be one of the best possible communities, including trivial two-cliques (simple edges). Variants of this problem focus on the number of internal edges of the subgraph. Another measure is the relative density δ(C) of a subgraph C, defined as the ratio between the internal and the total degree of C (see slide 1). Finding subgraphs of a given size with δ(C) larger than a threshold is also NP-complete. Fitness measures can also be associated with the connectivity of the subgraph to the other vertices of the graph. A good community is expected to have a small cut size, i.e., a small number of edges joining it to the rest of the graph.
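The two fitness measures just described can be sketched as follows (helper names are assumed): the intra-cluster density δint(C), and the relative density of C as internal degree over total degree:

```python
def delta_int(members, edges):
    """Intra-cluster density: internal edges over n(n-1)/2."""
    internal = sum(e[0] in members and e[1] in members for e in edges)
    n = len(members)
    return internal / (n * (n - 1) / 2)

def relative_density(members, edges):
    """Ratio of internal degree to total degree of C."""
    internal = sum(e[0] in members and e[1] in members for e in edges)
    boundary = sum((e[0] in members) != (e[1] in members) for e in edges)
    total = 2 * internal + boundary          # total degree of C
    return 2 * internal / total

# For C = {1,3,4} in the running example (edges 13, 14, 24, 34):
edges = [(1, 3), (1, 4), (2, 4), (3, 4)]
assert delta_int({1, 3, 4}, edges) == 1.0           # all 3 possible edges present
assert relative_density({1, 3, 4}, edges) == 6 / 7  # 6 internal of 7 total endpoints
```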

[Slide figure: SubGraph Path pTrees for the 4-vertex graph G1 (C = {1,3,4} shown in orange): bit vectors for the paths 13, 134, 14, 142, 143, 24, 241, 243, 31, 314, 34, 341, 342, 41, 413, 42, 43, 431, together with PC = 0 1 0 1.]
To get the C Path pTree, remove all C′ pTrees and AND each G pTree with PC. Kill the 2nd bit (or keep vertex 2 with no incident edges); then all pTrees have the same depth and can operate on each other.
Diameter of C? Diamk is the max of the min path lengths from k to the other C-vertices. For each k, proceed down from Ck a level at a time and record the first occurrence (fo) of kh, h≠k. Diam(C) = maxk∈V(Diamk) = 1: Diam1 = max{fo13, fo14} = max{1,1} = 1; Diam3 = max{fo31, fo34} = 1; Diam4 = max{fo41, fo43} = 1.
Always use pop-count for 1-counts as we AND; then C is a clique iff all C level-1 counts are |VC|−1. In fact one can mine out all cliques by just analyzing the G level-1 counts. Note: if one creates the G Path pTree, lots of tasks become easy! E.g., clique mining, shortest path mining, degree community mining, density community mining. What else?
A k-plex is a maximal subgraph in which each vertex is adjacent to all other vertices of the subgraph except at most k of them. A k-core is a maximal subgraph in which each vertex is adjacent to at least k other vertices of the subgraph. In any graph there is a whole hierarchy of cores of different order. k-plex existence alg (using the GPpT): C is a k-plex iff Σv∈C|Cv| ≥ |VC|²−k². k-plex inheritance thm: every induced subgraph of a k-plex is a k-plex. Mine all max k-plexes: use |Cv|, v∈C.
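The level-at-a-time Path-pTree descent for the diameter amounts to a breadth-first search from each member of C. A BFS stand-in sketch (plain adjacency sets assumed in place of pTrees): Diamk is the eccentricity of vertex k within C, and Diam(C) is the maximum over k:

```python
from collections import deque

def diameter(members, edges):
    """Max over members of the min path length to every other member."""
    adj = {v: set() for v in members}
    for a, b in edges:
        if a in members and b in members:
            adj[a].add(b)
            adj[b].add(a)
    ecc = {}
    for src in members:
        dist = {src: 0}
        q = deque([src])
        while q:                              # one BFS level per pTree level
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
        ecc[src] = max(dist.values())         # first occurrence = min path length
    return max(ecc.values())

# C = {1,3,4} with edges 13, 14, 34 is a triangle, so Diam(C) = 1.
assert diameter({1, 3, 4}, [(1, 3), (1, 4), (2, 4), (3, 4)]) == 1
```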
k-core inheritance thm: if there is a cover of G by induced k-cores, G is a k-core. Mine all max k-cores: use |Cv|, v∈C. k-core existence alg (using the GPpT): C is a k-core iff ∀v∈C, |Cv| ≥ k.
Clique = community s.t. there is an edge between each vertex pair. Community = subgraph with more edges inside than linking to its outside. Edge counts of some graphs of interest: Gene-Gene Interactions ≈ 10^9 edges; Stock-Price ≈ 10^13; Person-Tweet Security ≈ 7B×10K = 10^14; Recommenders ≈ 10^15; Friends Social ≈ 10^18.
An Induced SubGraph (ISG) C is a subgraph that inherits all of G's edges on its own vertices. A k-ISG (k vertices) C is a k-clique iff all of its (k−1)-Sub-ISGs are (k−1)-cliques.
[Slide figure: Rolodex-card representation of the example graph. Vertex table V: (Vkey, VL) = (1,2), (2,3), (3,2), (4,3); edge keys Ekey = 1,3 1,4 2,4 3,4 with edge labels EL. Bit vectors over the 16 Ekey positions: PE = 0011 0001 1001 1110; PU (upper triangle) = 0011 0001 0001 0000; label pTrees PEL,1, PEL,0; vertex masks P1..P4; PEC = PE & PC = 0011 0000 1001 1010; PUC = PU & PC = 0011 0000 0001 0000 (C = Induced SubGraph with VC = {1,3,4}).]
Clique Existence Alg: is an induced SG a clique? Edge Count existence thm (EC): |EC| = |PUC| = COMB(|VC|,2) = |VC|!/((|VC|−2)!·2!). Apply EC to the 3-vertex ISGs (3-clique iff |PU| = 3!/(2!·1!) = 3): VC = {1,3,4}: Ct(PUC) = 3; VD = {1,2,3}: Ct(PUD) = 1; VF = {1,2,4}: Ct(PUF) = 2; VH = {2,3,4}: Ct(PUH) = 2. C is the only 3-clique.
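The Edge Count (EC) existence test can be sketched without pTrees (an assumption; the slide counts bits in PU): an induced subgraph C is a clique iff its internal edge count equals COMB(|VC|, 2):

```python
from math import comb

def is_clique(members, edges):
    """EC test: internal edge count == C(|V_C|, 2)."""
    E = {frozenset(e) for e in edges}
    inside = sum(1 for e in E if e <= set(members))
    return inside == comb(len(members), 2)

# The four 3-vertex induced subgraphs of the running example:
edges = [(1, 3), (1, 4), (2, 4), (3, 4)]
assert is_clique({1, 3, 4}, edges)        # Ct = 3 = C(3,2): a 3-clique
assert not is_clique({1, 2, 3}, edges)    # Ct = 1
assert not is_clique({1, 2, 4}, edges)    # Ct = 2
assert not is_clique({2, 3, 4}, edges)    # Ct = 2
```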
SubGraph existence theorem (SG): (VC,EC) is a k-clique iff every induced (k−1)-subgraph (VD,ED) is a (k−1)-clique. SG or EC better? Extend to quasi-cliques? Extend to mine out all cliques?
A Clique Mining algorithm finds all cliques in a graph. For clique mining we can use an ARM-Apriori-like downward closure property: CSk = k-CliqueSet, CCSk+1 = Candidate (k+1)-CliqueSet. By the SG clique thm, CCSk+1 = all unions of CSk pairs having k−1 common vertices. Let C ∈ CCSk+1 be a union of two k-cliques with k−1 common vertices, and let v and w be the kth vertices (different) of the two k-cliques; then C ∈ CSk+1 iff PE(v,w) = 1. (We just need to check a single bit in PE.) Form CCSk+1: union CSk pairs sharing k−1 vertices, check a single PE bit.
Below, k=2, so we check edge pairs sharing 1 vertex, then check the 1 new edge bit in PE. CS2 = E = {13, 14, 24, 34}. PE(3,4) = PE(4·[3−1]+4 = 12) = 1, so 134 ∈ CS3 (the other pairs sharing a vertex give 134 again, or fail: PE(2,3) = PE(4·[2−1]+3 = 7) = 0; PE(1,2) = PE(4·[1−1]+2 = 2) = 0). The only expensive part is forming CCSk, and that is expensive only for CCS3 (as in Apriori ARM). Next? List out CS3 = {134}; form CCS4 = ∅. Done.
Degrees of C = {1,3,4}: internal degree of v∈C is kv^int = # edges from v to w∈C; external degree kv^ext = # edges from v to w∈C′. kv1^int = |PC&PE&Pv1| = 2, kv1^ext = |P′C&PE&Pv1| = 0; kv3^int = |PC&PE&Pv3| = 2, kv3^ext = |P′C&PE&Pv3| = 0; kv4^int = |PC&PE&Pv4| = 2, kv4^ext = |P′C&PE&Pv4| = 1. Internal degree of C, kC^int = Σv∈C kv^int = 6; external degree kC^ext = Σv∈C kv^ext = 1; total degree kC = kC^int + kC^ext = 7.
Intra-cluster density δint(C) = |edges(C,C)| / (nc(nc−1)/2) = |PE&PC&PLT| / (3·2/2) = 3/3 = 1. Inter-cluster density δext(C) = |edges(C,C′)| / (nc(n−nc)) = |PE&P′C&PLT| / (3·1) = 1/3. δint(C) − δext(C) = 1 − 1/3 = 2/3.
The tradeoff between large δint(C) and small δext(C) is the goal of many community mining algorithms. A simple approach is to maximize the difference. Density Difference algorithm for communities: δint(C) − δext(C) > Threshold? Degree Difference algorithm: kC^int − kC^ext > Threshold? Easy to compute with pTrees, even for Big Graphs. Graphs are ubiquitous for complex data in all of science.
Ignoring subgraphs of 2 vertices, the four 3-vertex subgraphs are C = {1,3,4}, D = {1,2,3}, F = {1,2,4}, H = {2,3,4}.
δint(D) = |PE&PD&PLT| / (3·2/2) = 1/3; δext(D) = |PE&P′D&PLT| / (3·1) = 3/3 = 1; δint(D) − δext(D) = 1/3 − 1 = −2/3.
δint(F) = |PE&PF&PLT| / (3·2/2) = 2/3; δext(F) = |PE&P′F&PLT| / (3·1) = 2/3; δint(F) − δext(F) = 2/3 − 2/3 = 0.
δint(H) = |PE&PH&PLT| / (3·2/2) = 2/3; δext(H) = |PE&P′H&PLT| / (3·1) = 2/3; δint(H) − δext(H) = 2/3 − 2/3 = 0.
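The Density Difference test can be sketched end-to-end (plain-Python counting assumed in place of the pTree ANDs) and checked against the four hand computations for C, D, F and H:

```python
def density_difference(members, vertices, edges):
    """delta_int(C) - delta_ext(C) for induced subgraph on `members`."""
    nc, n = len(members), len(vertices)
    internal = sum(e[0] in members and e[1] in members for e in edges)
    external = sum((e[0] in members) != (e[1] in members) for e in edges)
    d_int = internal / (nc * (nc - 1) / 2)
    d_ext = external / (nc * (n - nc))
    return d_int - d_ext

V = {1, 2, 3, 4}
E = [(1, 3), (1, 4), (2, 4), (3, 4)]
assert abs(density_difference({1, 3, 4}, V, E) - 2 / 3) < 1e-9   # C:  2/3
assert abs(density_difference({1, 2, 3}, V, E) + 2 / 3) < 1e-9   # D: -2/3
assert abs(density_difference({1, 2, 4}, V, E)) < 1e-9           # F:  0
assert abs(density_difference({2, 3, 4}, V, E)) < 1e-9           # H:  0
```

With any threshold in (0, 2/3), only C passes, matching the conclusion that C is the best community among the 3-vertex subgraphs.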

Introduction
In this project, the Co-PI will develop graph models for protein representation. The graph models will maintain crucial structural information and encode the spatial distribution of multiple features on the proteins. Then, the Co-PI will develop new graph kernel methods that can fully exploit the rich information contained in the graphs for binding site prediction.
Justification and Feasibility
Stable interactions between macromolecules require the binding sites to possess favorable conditions in multiple aspects, including geometric complementarity, evolutionary conservation, hydrophobic force, electrostatic force and other physical and chemical forces. Thus, a method must assess features covering these aspects of the proteins in order to identify the binding sites. The graph models proposed in this work incorporate a wide range of features covering these aspects, and the proposed graph kernels, combined with machine-learning methods, are capable of discovering complex patterns in the proposed graph models. Thus, we believe that the proposed methods can achieve success in binding site prediction and analysis. To test the feasibility of the proposed approach, we conducted a preliminary study on using a graph kernel method to predict DNA-binding sites. We used a dataset of 171 DNA-binding proteins collected in our previous study [19]. We divided the protein surface into overlapping patches, such that each patch included a surface residue and its neighboring residues. Each patch was assigned to either the positive or the negative class depending on whether its center residue was in the binding site. Then, each patch was represented as a graph, such that each amino acid residue was represented by a vertex and an edge was added between two vertices if the corresponding residues were within a distance of 3.5 Å.
Each vertex was then labeled with six features of the corresponding amino acid residue: residue identity, sequence conservation score, structural conservation score, solvent accessibility, electrostatic potential and surface curvature. We used a shortest-path graph kernel to calculate the similarity between graphs (for more details about shortest-path graph kernels, see section C.3.ii.d). Briefly, a shortest-path graph kernel compares all-pairs shortest paths between graphs. The comparison of two paths includes the comparison of the path lengths and of the source and destination vertices. A Gaussian function was used to compare vertices based on the vertex labels, and a Brownian kernel was used to compare the path lengths. The graph kernel was embedded into a support vector machine (SVM) to build a predictor for DNA-binding site prediction. When evaluated using leave-one-out cross-validation, the predictor achieved 89% accuracy, 90% specificity and 88% sensitivity. We also evaluated how each of the six features affected the prediction performance: as the number of features increased from one to six, the accuracy gradually increased from 86% to 89%. To further evaluate the method, we tested it using an independent set of 13 proteins used in a previous study [21] for which both the apo-state structure (i.e., unbound to DNA) and the holo-state structure (i.e., bound to DNA) were available. We used the predictor to predict DNA-binding sites on the apo-state structures, and the predictions were compared against the actual binding sites gleaned from the holo-state structures. For each test protein, we ranked the surface patches by prediction score from high to low. Remarkably, the top-ranked patch in all of the 13 proteins belonged to the actual DNA-binding sites. This preliminary study shows that the graph kernel approach is able to discover predictive patterns on DNA-binding sites. The results are very encouraging.
The top-ranked patch accurately indicates the location of the DNA-binding site in all independent test proteins. This level of success would allow the method to make a significant contribution in real applications.
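An illustrative sketch of the shortest-path graph kernel described above (this is not the authors' implementation): every pair of shortest paths, one from each patch graph, is compared via a Brownian-bridge-style kernel on path length and Gaussian kernels on the feature vectors of the source and destination vertices. The constants C_LEN and SIGMA, the toy graphs and the feature values are all assumptions for the example:

```python
import math
from itertools import product

C_LEN, SIGMA = 2.0, 1.0   # assumed kernel constants, not from the text

def shortest_paths(adj):
    """Floyd-Warshall all-pairs shortest path lengths for a small graph."""
    nodes = list(adj)
    d = {(a, b): (0 if a == b else (1 if b in adj[a] else math.inf))
         for a in nodes for b in nodes}
    for k, i, j in product(nodes, nodes, nodes):
        d[i, j] = min(d[i, j], d[i, k] + d[k, j])
    return {(a, b): l for (a, b), l in d.items() if a != b and l < math.inf}

def gaussian(x, y):
    """Gaussian comparison of two vertex-feature vectors."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * SIGMA ** 2))

def sp_kernel(adj1, labels1, adj2, labels2):
    """Sum over all pairs of shortest paths of length x endpoint kernels."""
    k = 0.0
    for (u, v), l1 in shortest_paths(adj1).items():
        for (x, y), l2 in shortest_paths(adj2).items():
            k_len = max(0.0, C_LEN - abs(l1 - l2))   # Brownian-style kernel
            k += k_len * gaussian(labels1[u], labels2[x]) \
                       * gaussian(labels1[v], labels2[y])
    return k

# Toy patch graphs with 2-feature vertex labels (hypothetical values):
g1 = {0: {1}, 1: {0, 2}, 2: {1}}
g2 = {0: {1}, 1: {0}}
l1 = {0: (0.1, 0.5), 1: (0.2, 0.4), 2: (0.9, 0.1)}
l2 = {0: (0.1, 0.5), 1: (0.3, 0.4)}
assert sp_kernel(g1, l1, g2, l2) > 0
```

In the described pipeline, such a kernel matrix over patch graphs would be fed to an SVM as a precomputed kernel.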


Network Classification • Netkit-SRL Components

Component: Local (Non-relational) Classifier
Purpose: Returns a model which uses only the attributes of a node to estimate its class label.
Approaches: 1) Uniform prior; 2) Class prior

Component: Relational Classifier
Purpose: Returns a model which uses not only the local attributes of a node but also attributes of related nodes, including their (estimated) class membership.
Approaches: 1) Weighted-vote relational neighbor; 2) Class-distributional relational neighbor; 3) Network-only multinomial Bayes classifier with Markov Random Field estimation

Component: Collective Inference
Purpose: Applies collective inference in order to (approximately) maximize the joint probability of the labels of all nodes in the graph whose labels were initially unknown.
Approaches: 1) Relaxation labeling; 2) Iterative classification; 3) Gibbs sampling
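A hedged sketch of the weighted-vote relational neighbor (wvRN) idea listed above: a node's class-membership estimate is the mean of its neighbors' current estimates, iterated by relaxation labeling; unknown nodes start from a uniform prior (the local classifier). Function names and the toy graph are illustrative, not Netkit-SRL's API:

```python
def wvrn(adj, known, classes, iters=50):
    """Relaxation labeling with uniform-prior init for unknown nodes."""
    est = {v: dict(known[v]) if v in known
           else {c: 1 / len(classes) for c in classes}
           for v in adj}
    for _ in range(iters):
        new = {}
        for v in adj:
            if v in known:                    # labeled nodes stay fixed
                new[v] = est[v]
                continue
            nbrs = adj[v]
            new[v] = {c: sum(est[w][c] for w in nbrs) / len(nbrs)
                      for c in classes}       # unweighted neighbor vote
        est = new
    return est

# Toy network: node "c" sits between a known "+" node and a known "-" node.
adj = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
known = {"a": {"+": 1.0, "-": 0.0}, "b": {"+": 0.0, "-": 1.0}}
result = wvrn(adj, known, ["+", "-"])
assert abs(result["c"]["+"] - 0.5) < 1e-9   # evenly pulled both ways
```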