CS6604 Digital Libraries Social Communities Knowledge Management: Social Interactome Final Term Project Presentation Presenter Prashant Chandrasekar {peecee}@vt.edu Instructor Dr. Edward A. Fox Virginia Polytechnic Institute and State University Blacksburg, VA, 24061 May 2, 2017

Acknowledgements • Dr. Edward A. Fox • Global events team • Social Interactome team • The Social Interactome of Recovery: Social Media as Therapy Development (NIH Grant 1R01DA039456-01) • Xuan Zhang and Yufeng Ma • Mostafa Mohammed 2 Final Presentation

Background: Social Interactome (SI)
• Social Interactome: an NIH-funded project conducted by a team of researchers
  • Studies the community of people who are recovering from addiction
  • Studies their interactions in an online social network built to provide support and help manage their recovery
• The project is broken down into a set of "test vs. control" experiments with defined variables:
  • Duration of study
  • Number of participants required
  • Avenue of recruitment
  • Null and alternative hypotheses

Background: SI Setup
• The project is broken down into a set of clinical trials. For each clinical trial, the team:
  • Decides on a set of null and alternative hypotheses and the duration of the trial
  • Recruits participants for the trial
  • Organizes the participants into one of two (or more) 128-node social networks
• Participants interact with the website and their assigned friends
• Two 16-week clinical trials have been completed, along with a set of smaller-scale trials run via Amazon MTurk

Background: SI Participant Info
[Diagram: participant data collected from 19,070 questions — ~10 psychology-based measures, 16 surveys]
• Demographics: family info, family's history with addiction, past social network experience, primary addiction, secondary addiction
• Measures: Recovery Participation Scale, Religious Commitment Inventory, Minute Discounting, Addiction Severity Index, Assessment of Recovery Capital, Social Connectedness Scale, Adult Social Network Index, Relapse, Recovery Capital Scale, Big 5 Personalities, DSM-V

Overall Challenges
• How do you organize the data?
• How do you validate/clean the data?
• What do you analyze first? And in what order do you go about it?
• How do you make sense of the data?
  • How do you interpret psychology-related measures?
• Big goal: How do you streamline the entire process from data collection to analyses to presentation such that it is reproducible and extensible?

Goal
• Goal: Investigate/explore ways to model the data and recommend an approach.
• Approaches to understanding the data:
  • Frequency distributions / histograms
  • Time series
  • Checking for correlations
  • Comparing means and standard deviations (t-tests)
  • Statistical modeling

Approaches
• Statistical modeling: what do we model?
  • Substance relapse
  • Engagement / change in engagement
  • Change in psychology-related measures
  • Change in behavior
  • Homophily
  • Friendship or trust
• Factors
  • Classification: What would be the predictor variables? Response variables?
  • PGMs: Directed or undirected? What would be the factors?

Approaches
• Classification
  • Network classification using NetKit-SRL (Statistical Relational Learning)1 [focus of this presentation]
  • Learning using Markov Logic Networks2
1 Macskassy, S. A., & Provost, F. (2007). Classification in Networked Data: A Toolkit and a Univariate Case Study. Journal of Machine Learning Research, 8(May), 935–983.
2 Domingos, P., & Richardson, M. (2007). Markov Logic: A Unifying Framework for Statistical Relational Learning. In L. Getoor & B. Taskar (Eds.), Introduction to Statistical Relational Learning (pp. 339–371). Cambridge, MA: MIT Press.

Network Classification
• Idea: take advantage of relational information, in addition to attribute information, for entity classification. Example: networked data.
• Focuses on within-network classification
  • Networks of web pages, research papers, social networks, etc.
• NetKit-SRL: toolkit developed to employ statistical relational learning and inference

Network Classification
• NetKit-SRL
  • Network learning toolkit for classification and inference
  • Developed by Dr. Macskassy & Dr. Provost
  • Has 3 components:
    • Non-relational (local) model
    • Relational model
    • Collective inference
• Specific outcomes:
  • Maximize P(x | G^K), where x is the set of labels to be estimated and G^K is everything known in the network
  • Estimate the joint distribution over the labels
• Input: graph with edges describing relationships, plus attributes of nodes

Network Classification
• NetKit-SRL components:
  • Local (non-relational) classifier — returns a model that uses only a node's own attributes to estimate its class label. Approaches: 1) uniform prior; 2) class prior.
  • Relational classifier — returns a model that uses not only the local attributes of a node but also the attributes of related nodes, including their (estimated) class membership. Approaches: 1) weighted-vote relational neighbor; 2) class-distributional relational neighbor; 3) network-only multinomial Bayes classifier with Markov Random Field estimation.
  • Collective inference — applies collective inference to (approximately) maximize the joint probability of the labels of all nodes in the graph whose labels were initially unknown. Approaches: 1) relaxation labeling; 2) iterative classification; 3) Gibbs sampling.

Network Classification
• Possible instantiations:
  • Chakrabarti et al. (1998)1 — non-relational: Naïve Bayes classifier; relational: Naïve Bayes + Markov Random Field; collective inference: relaxation labeling
  • Lu & Getoor (2003)2 — non-relational: logistic regression; relational: logistic regression; collective inference: iterative classification
  • Macskassy & Provost (2003)3 — non-relational: class priors; relational: majority vote of neighboring classes; collective inference: relaxation labeling
[1] Chakrabarti, S., Dom, B., & Indyk, P. (1998). Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 307–318).
[2] Lu, Q., & Getoor, L. (2003). Link-Based Classification. International Conference on Machine Learning, ICML-2003 (pp. 496–503).
[3] Macskassy, S. A., & Provost, F. (2003). A Simple Relational Classifier. Proceedings of the Second Workshop on Multi-Relational Data Mining (MRDM-2003) at KDD-2003 (pp. 64–76).

Network Classification
• Weighted-vote relational neighbor classifier (wv-RN)
  • Authors: Macskassy, S. A., & Provost, F. (2003)
  • Estimates class membership by assuming the existence of homophily
  • Takes the weighted mean of the class-membership probabilities of the entities in D_e (where D_e is the set of neighbors of entity/node e)
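The wv-RN estimate described above can be sketched in a few lines. This is an illustrative re-implementation of the weighted-mean update, not NetKit-SRL's actual API; the function and variable names are invented for the sketch.

```python
def wvrn_estimate(node, neighbors, weights, class_probs, classes):
    """Weighted-vote relational neighbor (wv-RN) estimate for one node.

    Assumes homophily: a node's class distribution is estimated as the
    weighted mean of its neighbors' (estimated) class-membership
    probabilities. Illustrative sketch, not NetKit-SRL's API.
    """
    totals = {c: 0.0 for c in classes}
    z = 0.0
    for j in neighbors[node]:          # D_e: the neighbors of `node`
        w = weights[(node, j)]         # edge weight w(node, j)
        z += w
        for c in classes:
            totals[c] += w * class_probs[j][c]
    if z == 0.0:
        # isolated node: fall back to a uniform distribution
        return {c: 1.0 / len(classes) for c in classes}
    return {c: totals[c] / z for c in classes}
```

For example, a node with one neighbor of weight 2 labeled class A and one of weight 1 labeled class B gets P(A) = 2/3, P(B) = 1/3.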

Network Classification
• Collective inference using relaxation labeling
  • Collective inference: jointly (and approximately) maximizing the probability of the labels of all initially unlabeled nodes
  • Similar to, but different from, Gibbs sampling in that it:
    • Keeps track of class probability estimates for X^U (the unknown labels)
    • Instead of updating the graph one node at a time, updates the class probabilities of all vertices at iteration t+1 based on the estimates from iteration t
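The relaxation-labeling loop above can be sketched as follows. It is a simplified illustration (all names are invented), using a plain neighbor-averaging update in place of a configurable relational classifier; the key point is that all unknown nodes are updated from the frozen estimates of the previous iteration.

```python
def relaxation_labeling(known, unknown, neighbors, weights, classes,
                        iterations=50):
    """Relaxation-labeling collective inference (illustrative sketch).

    `known` maps labeled nodes to class distributions; `unknown` nodes
    start uniform. Unlike Gibbs sampling, each iteration recomputes the
    class probabilities of *all* unknown nodes from the estimates of
    step t (frozen in `prev`) to produce step t+1.
    """
    probs = {v: dict(p) for v, p in known.items()}
    for v in unknown:
        probs[v] = {c: 1.0 / len(classes) for c in classes}
    for _ in range(iterations):
        prev = {v: dict(p) for v, p in probs.items()}   # estimates at step t
        for v in unknown:                               # update all nodes at once
            totals = {c: 0.0 for c in classes}
            z = 0.0
            for j in neighbors[v]:
                w = weights.get((v, j), weights.get((j, v), 1.0))
                z += w
                for c in classes:
                    totals[c] += w * prev[j][c]
            if z > 0.0:
                probs[v] = {c: totals[c] / z for c in classes}
    return probs
```

On a path a–b–c where a and c are labeled with different classes, the unlabeled middle node b settles at an even split between the two.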

Network Classification: Experiment
• Experiment
  • Rationale: Participants who are homophilous (who have a shared background in common) have common interests.
  • Hypothesis: Given a set of common interests between pairs of participants, one can predict the homophily measures with good accuracy.
• Input graph
  • Nodes: participants
  • Attributes: addiction, education, income
  • Edges: the edge weight is the number of news stories + success stories + educational modules that both nodes (connected via the edge) have viewed in common.
• Predicted attribute: addiction
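The edge-weight construction described above might be computed as in the sketch below. The function name and the shape of the inputs are hypothetical, not the SI database schema or NetKit-SRL's input format.

```python
def build_edges(friend_pairs, viewed):
    """Edge weight = number of content items (news stories, success
    stories, educational modules) that both endpoints viewed.

    `viewed` maps participant id -> set of viewed item ids. Names and
    data layout are illustrative assumptions.
    """
    edges = {}
    for a, b in friend_pairs:
        w = len(viewed.get(a, set()) & viewed.get(b, set()))
        if w > 0:  # keep only pairs with at least one item in common
            edges[(a, b)] = w
    return edges
```

A pair that viewed two items in common gets weight 2; pairs with nothing in common produce no edge.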

Network Classification: Experiment
• E1R2 data statistics
  • # of nodes: 256; # of edges: 436
[Figure: bar chart of the primary-substance breakdown among the 256 participants. The largest category has 139 participants, followed by categories of 41, 30, 18, 17, and 7 participants and several single-participant categories; substances include opioids, alcohol, stimulants, cocaine, dissociatives, tranquilizers, prescription drugs, depressants, cannabis, nicotine, and other.]

Network Classification: Experiment Conclusion
• Conclusion
  • The highest accuracy across all experiment configurations for predicting primary addiction, as shown on slide 22, is 0.392.
  • The confusion matrices for predicting primary addiction, education, and income give more detail on the accuracy for each class.
  • The accuracy is low.
    • This is probably because our experiment configuration does NOT include a non-relational component.
    • Furthermore, our graph edges and attributes have only 1-3 fields. The graph needs to be denser, with much more information, to be useful for network-based inference.

Network Classification: Next Steps
• Possible extensions of the work:
  • Build the graph with different representations of edges
  • Construct more node attributes for the non-relational (local) classifier step
  • Try experiments with priors learned from various traditional classification models
• Problem/challenge
  • Extensions and further work are open-ended.
  • Part of doctoral work: build a logical flowchart of inquiries/hypotheses.
  • The logical flowchart of inquiries can then be invoked based on the user's line of inquiry.

Learning via Markov Logic Networks
• A Markov Logic Network (MLN) is a set of pairs (F, w) where
  • F is a formula in first-order logic
  • w is a real number
• Together with a set of constants, it defines a Markov network with
  • One node for each grounding of each predicate in the MLN
  • One feature for each grounding of each formula F in the MLN, with the corresponding weight w
*Slide source: http://www.cs.washington.edu/homes/pedrod/psrai.ppt

Learning via Markov Logic Networks
• Example: with two constants, Anna (A) and Bob (B), the ground network contains the nodes Smokes(A), Cancer(A), Smokes(B), and Cancer(B).

Learning via Markov Logic Networks
• The ground network also contains the nodes Friends(A,A), Friends(A,B), Friends(B,A), and Friends(B,B).
• Probability of a world x:
  P(x) = (1/Z) exp( Σ_i w_i n_i(x) )
  where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
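For a domain this small, the world-probability formula can be checked by brute force over all possible worlds. The sketch below uses a single formula, "Smokes(x) ⇒ Cancer(x)", with an arbitrary illustrative weight; a real MLN for this example would also carry the Friends/Smokes formula, omitted here to keep the enumeration short.

```python
import itertools
import math

def world_probabilities(people, w=1.5):
    """Brute-force P(x) = (1/Z) exp(sum_i w_i n_i(x)) for the single
    formula 'Smokes(p) => Cancer(p)'. Weight value is illustrative.

    Each constant p contributes two ground atoms, Smokes(p) and
    Cancer(p); n(x) counts true groundings of the implication in x.
    """
    atoms = [("Smokes", p) for p in people] + [("Cancer", p) for p in people]
    scored = []
    for bits in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        # The implication is false only when Smokes(p) holds but Cancer(p) does not.
        n = sum(1 for p in people
                if not (world[("Smokes", p)] and not world[("Cancer", p)]))
        scored.append((world, math.exp(w * n)))
    z = sum(s for _, s in scored)              # partition function Z
    return [(world, s / z) for world, s in scored]
```

With the constants Anna and Bob this enumerates 2^4 = 16 worlds; worlds that violate the implication for more people receive exponentially lower probability.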

Future work
• Next steps:
  • Extract more attributes for each participant
  • Compile different ways to represent the edge weight
  • Build a local classifier and test the results with NetKit-SRL
  • Use Alchemy to represent the data using Markov Logic Networks