Slide #1.

Naïve Bayes • ECE-5424G / CS-5824 • Jia-Bin Huang • Virginia Tech • Spring 2019


Slide #2.

Administrative • HW 1 out today. Please start early! • Office hours • Chen: Wed 4pm-5pm • Shih-Yang: Fri 3pm-4pm • Location: Whittemore 266


Slide #3.

Linear Regression • Model representation • Cost function • Gradient descent for linear regression: repeat until convergence { θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i) } (update all θ_j simultaneously) • Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial) • Normal equation: θ = (X^T X)^(−1) X^T y
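As a concrete sketch of the two fitting procedures summarized above, here is a minimal NumPy version of batch gradient descent and the normal equation. The toy data, learning rate, iteration count, and the helper name gradient_descent are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Toy design matrix: bias column plus two features (illustrative data only).
X = np.array([[1.0, 2104, 5],
              [1.0, 1416, 3],
              [1.0, 1534, 3],
              [1.0,  852, 2]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Scale the non-bias features so a single learning rate works well.
mu, sigma = X[:, 1:].mean(axis=0), X[:, 1:].std(axis=0)
Xs = X.copy()
Xs[:, 1:] = (X[:, 1:] - mu) / sigma

def gradient_descent(X, y, alpha=0.3, iters=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of (1/2m) * ||X theta - y||^2
        theta -= alpha * grad              # simultaneous update of every theta_j
    return theta

theta_gd = gradient_descent(Xs, y)             # iterative solution (in scaled-feature space)
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y   # normal equation: (X^T X)^-1 X^T y
```

With enough iterations, the gradient descent fit and the normal-equation fit produce essentially the same predictions on this toy problem; the normal equation avoids choosing α but requires computing (X^T X)^(−1), which is the trade-off discussed on the comparison slide below.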


Slide #4.

Size in feet² (x1)   Number of bedrooms (x2)   Number of floors (x3)   Age of home in years (x4)   Price ($) in 1000's (y)
2104                 5                         1                       45                          460
1416                 3                         2                       40                          232
1534                 3                         2                       30                          315
852                  2                         1                       36                          178
…                    …                         …                       …                           …
Slide credit: Andrew Ng


Slide #5.

Least squares solution • Objective: minimize J(θ) = ||y − Xθ||² • Closed-form solution (the normal equation): θ = (X^T X)^(−1) X^T y


Slide #6.

Justification/interpretation 1 • Geometric interpretation • Xθ lies in the column space of X, i.e., span of the columns of X • The least squares fit makes the residual y − Xθ orthogonal to the column space of X


Slide #7.

Justification/interpretation 2 • Probabilistic model • Assume a linear model with Gaussian errors: y = θ^T x + ε, ε ~ N(0, σ²) • Maximizing the likelihood gives the same solution as least squares Image credit: CS [email protected]


Slide #8.

Justification/interpretation 3 • Loss minimization • Least squares loss: ℓ(y, ŷ) = (y − ŷ)² • Empirical Risk Minimization (ERM): minimize the average loss over the training set


Slide #9.

m training examples, n features • Gradient Descent: • Need to choose α • Needs many iterations • Works well even when n is large • Normal Equation: • No need to choose α • Don't need to iterate • Need to compute (X^T X)^(−1) • Slow if n is very large Slide credit: Andrew Ng


Slide #10.

Things to remember • Model representation • Cost function • Gradient descent for linear regression: repeat until convergence { θ_j := θ_j − α (1/m) Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) x_j^(i) } • Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial) • Normal equation: θ = (X^T X)^(−1) X^T y


Slide #11.

Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naïve Bayes


Slide #12.

Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naive Bayes


Slide #13.

Random variables • Outcome space S • Space of possible outcomes • Random variables • Functions that map outcomes to real numbers • Event E • Subset of S


Slide #14.

Visualizing probability • Sample space: total area = 1 • Regions where A is true and where A is false • P(A) = area of the blue circle (the region where A is true)


Slide #15.

Visualizing probability • Regions where A is true and where A is false partition the sample space


Slide #16.

Visualizing probability • A splits into A∧B and A∧¬B, so P(A) = P(A∧B) + P(A∧¬B)


Slide #17.

Visualizing conditional probability • P(A|B) = P(A∧B) / P(B), the fraction of B's area covered by A∧B • Corollary: the chain rule P(A∧B) = P(A|B) P(B)


Slide #18.

Bayes rule • P(A|B) = P(B|A) P(A) / P(B) • Follows from the chain rule: P(A∧B) = P(A|B) P(B) = P(B|A) P(A) [Portrait: Thomas Bayes]


Slide #19.

Other forms of Bayes rule •
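As a sketch (the standard alternative forms, not verbatim from the slide): Bayes rule with the denominator expanded over A and ¬A, and Bayes rule conditioned on additional background evidence X.

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}
\qquad
P(A \mid B \wedge X) = \frac{P(B \mid A \wedge X)\,P(A \wedge X)}{P(B \wedge X)}
```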


Slide #20.

Applying Bayes rule • A = you have the flu, B = you just coughed • Assume values for P(A), P(B|A), and P(B|¬A) • What is P(flu | cough) = P(A|B)? Slide credit: Tom Mitchell
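A small sketch of the computation. The probabilities below are placeholder assumptions chosen only to illustrate the mechanics; substitute whatever values are assumed on the slide.

```python
# Hypothetical values (assumptions for illustration, not from the slide):
p_flu = 0.05          # P(A): prior probability of having the flu
p_cough_flu = 0.80    # P(B|A): probability of coughing given flu
p_cough_noflu = 0.20  # P(B|~A): probability of coughing without flu

# Bayes rule with the denominator expanded over A and ~A.
p_cough = p_cough_flu * p_flu + p_cough_noflu * (1 - p_flu)
p_flu_cough = p_cough_flu * p_flu / p_cough
print(f"P(flu | cough) = {p_flu_cough:.3f}")  # ~0.174 with these made-up numbers
```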


Slide #21.

Why are we learning this? • We want to learn a hypothesis, e.g., P(Y | X), from training data


Slide #22.

Joint distribution • Making a joint distribution of M variables
1. Make a truth table listing all combinations of values
2. For each combination of values, say how probable it is
3. The probabilities must sum to 1

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

Slide credit: Tom Mitchell
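A short sketch of how a joint distribution like this table can be stored and queried for any logical expression. The probabilities are the ones in the table above; the helper prob() is just for illustration.

```python
# Joint distribution over Boolean A, B, C, stored as {(a, b, c): probability}.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # probabilities must sum to 1

def prob(event):
    """P(event) = sum of row probabilities over rows where the event is true."""
    return sum(p for (a, b, c), p in joint.items() if event(a, b, c))

print(prob(lambda a, b, c: a or b))        # P(A or B)   = 0.65
print(prob(lambda a, b, c: a and not c))   # P(A and ~C) = 0.30
```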


Slide #23.

Using the joint distribution • Can ask for any logical expression E involving these variables: P(E) = sum of P(row) over the rows where E is true Slide credit: Tom Mitchell


Slide #24.

The solution to learn P(Y | X)? • Main problem: learning may require more data than we have • Say we are learning a joint distribution over 100 Boolean attributes • # of rows in this table? 2^100 ≈ 10^30 • # of people on earth? ≈ 10^10 Slide credit: Tom Mitchell


Slide #25.

What should we do? 1. Be smart about how we estimate probabilities from sparse data • Maximum likelihood estimates (ML) • Maximum a posteriori estimates (MAP) 2. Be smart about how to represent joint distributions • Bayes network, graphical models (more on this later) Slide credit: Tom Mitchell


Slide #26.

Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori (MAP) • Naive Bayes


Slide #27.

Estimating the probability • Flip a coin repeatedly, observing X ∈ {1 (heads), 0 (tails)}; let θ = P(X = 1) • It turns up heads α_H times • It turns up tails α_T times • Your estimate for θ is? • Case A: 100 flips: 51 heads (α_H = 51), 49 tails (α_T = 49) • Case B: 3 flips: 2 heads (α_H = 2), 1 tail (α_T = 1) Slide credit: Tom Mitchell
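A quick check of the intuitive estimate θ̂ = α_H / (α_H + α_T) for both cases (this is the MLE derived on the slides that follow); the helper mle_theta is just for illustration.

```python
def mle_theta(heads, tails):
    # MLE for P(X = 1) under a Bernoulli model: the observed fraction of heads.
    return heads / (heads + tails)

print(mle_theta(51, 49))  # Case A: 0.51
print(mle_theta(2, 1))    # Case B: ~0.667
```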


Slide #28.

Two principles for estimating parameters • Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data • Maximum a posteriori estimation (MAP): choose θ that is most probable given the prior probability and the data Slide credit: Tom Mitchell


Slide #29.

Two principles for estimating parameters • Maximum Likelihood Estimate (MLE): choose θ̂ = argmax_θ P(D | θ) • Maximum a posteriori estimation (MAP): choose θ̂ = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D) Slide credit: Tom Mitchell


Slide #30.

Maximum likelihood estimate • Each flip yields a Boolean value for X: X ∈ {0, 1}, with P(X = 1) = θ and P(X = 0) = 1 − θ • A data set D of n independent, identically distributed (iid) flips produces α_H ones and α_T zeros Slide credit: Tom Mitchell
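Written out, the standard derivation behind the MLE, using the definitions above:

```latex
P(D \mid \theta) = \theta^{\alpha_H}\,(1-\theta)^{\alpha_T},
\qquad
\ln P(D \mid \theta) = \alpha_H \ln\theta + \alpha_T \ln(1-\theta)
\\[6pt]
\frac{d}{d\theta}\ln P(D \mid \theta)
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\;\;\Longrightarrow\;\;
\hat{\theta}_{\mathrm{MLE}} = \frac{\alpha_H}{\alpha_H + \alpha_T}
```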


Slide #31.

Beta prior distribution • P(θ) = Beta(β_H, β_T) = θ^(β_H − 1) (1 − θ)^(β_T − 1) / B(β_H, β_T) Slide credit: Tom Mitchell


Slide #32.

Maximum a posteriori (MAP) estimate • Data set D of iid flips produces α_H ones and α_T zeros • Assume prior P(θ) = Beta(β_H, β_T) (conjugate prior: closed-form representation of the posterior) Slide credit: Tom Mitchell
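A sketch of why the Beta prior is convenient, in the α/β notation above: the posterior is again a Beta distribution, and its mode gives the MAP estimate in closed form.

```latex
P(\theta \mid D) \;\propto\; P(D \mid \theta)\,P(\theta)
\;\propto\; \theta^{\alpha_H+\beta_H-1}\,(1-\theta)^{\alpha_T+\beta_T-1}
\;\;\sim\;\; \mathrm{Beta}(\alpha_H+\beta_H,\ \alpha_T+\beta_T)
\\[6pt]
\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid D)
= \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}
```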


Slide #33.

Some terminology • Likelihood function: P(D | θ) • Prior: P(θ) • Posterior: P(θ | D) • Conjugate prior: P(θ) is the conjugate prior for a likelihood function P(D | θ) if the prior and the posterior have the same form • Example (coin flip problem) • Prior: Beta; Likelihood: Binomial • Posterior: Beta Slide credit: Tom Mitchell


Slide #34.

How many parameters? • Suppose we want to learn P(Y | X1, …, Xn), where the Xi and Y are Boolean random variables • How many parameters to estimate when X = (Gender, Hours-worked)? • When X = (X1, …, Xn)? Slide credit: Tom Mitchell


Slide #35.

Can we reduce the number of parameters using Bayes rule? • P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn) • How many parameters for P(X1, …, Xn | Y)? • How many parameters for P(Y)? Slide credit: Tom Mitchell


Slide #36.

Today’s plan • Probability basics • Estimating parameters from data • Maximum likelihood (ML) • Maximum a posteriori estimation (MAP) • Naive Bayes


Slide #37.

Naïve Bayes • Assumption: P(X1, …, Xn | Y) = ∏_i P(Xi | Y) • i.e., Xi and Xj are conditionally independent given Y, for all i ≠ j Slide credit: Tom Mitchell


Slide #38.

Conditional independence • Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z: P(X | Y, Z) = P(X | Z) for all values of X, Y, Z • Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning) Slide credit: Tom Mitchell


Slide #39.

Applying conditional independence • Naïve Bayes assumes the Xi are conditionally independent given Y, e.g., P(X1 | X2, Y) = P(X1 | Y) • General form: P(X1, …, Xn | Y) = ∏_i P(Xi | Y) • How many parameters to describe P(X1, …, Xn | Y)? P(Y)? • Without the conditional independence assumption? • With the conditional independence assumption? (counts worked out below) Slide credit: Tom Mitchell
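For Boolean Y and Boolean X1, …, Xn, the counts the slide asks for work out as follows (plus 1 parameter for P(Y), since P(Y=0) = 1 − P(Y=1)):

```latex
\text{without cond.\ independence: } 2\,(2^{n}-1) \text{ parameters for } P(X_1,\dots,X_n \mid Y)
\qquad
\text{with Na\"ive Bayes: } 2n \text{ parameters, one } P(X_i{=}1 \mid Y{=}y) \text{ per pair } (i, y)
```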


Slide #40.

Naïve Bayes classifier • Bayes rule: P(Y = y_k | X1, …, Xn) = P(Y = y_k) P(X1, …, Xn | Y = y_k) / Σ_j P(Y = y_j) P(X1, …, Xn | Y = y_j) • Assume conditional independence among the Xi's: P(Y = y_k | X1, …, Xn) ∝ P(Y = y_k) ∏_i P(Xi | Y = y_k) • Pick the most probable Y: Y_new ← argmax_{y_k} P(Y = y_k) ∏_i P(Xi | Y = y_k) Slide credit: Tom Mitchell


Slide #41.

Naïve Bayes algorithm – discrete Xi • For each value y_k: estimate π_k ≡ P(Y = y_k) • For each value x_ij of each attribute Xi: estimate θ_ijk ≡ P(Xi = x_ij | Y = y_k) • Classify a new example X_new: Y_new ← argmax_{y_k} π_k ∏_i P(Xi_new | Y = y_k)
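A compact sketch of the discrete algorithm above, for Boolean attributes. The training data is made up for illustration, the helpers train_nb and classify_nb are hypothetical names, and plain MLE counting is used (smoothing is addressed two slides later).

```python
import numpy as np

# Made-up Boolean training data: rows are examples, columns are attributes X1..Xn.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
y = np.array([1, 1, 0, 0, 0])

def train_nb(X, y):
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}           # pi_k = P(Y = y_k)
    theta = {k: X[y == k].mean(axis=0) for k in classes}     # theta[k][i] = P(X_i = 1 | Y = k), MLE
    return priors, theta

def classify_nb(x, priors, theta):
    # Pick argmax_k of log P(Y = k) + sum_i log P(X_i = x_i | Y = k).
    def log_post(k):
        p = theta[k]
        likelihoods = np.where(x == 1, p, 1 - p)
        return np.log(priors[k]) + np.sum(np.log(likelihoods))
    return max(priors, key=log_post)

priors, theta = train_nb(X, y)
print(classify_nb(np.array([1, 0, 1]), priors, theta))  # predicts class 1 on this toy data
```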


Slide #42.

Estimating parameters: Y, Xi discrete-valued • Maximum likelihood estimates (MLE): π̂_k = #D{Y = y_k} / |D|, θ̂_ijk = #D{Xi = x_ij ∧ Y = y_k} / #D{Y = y_k} Slide credit: Tom Mitchell


Slide #43.

• F = 1 iff you live in Fox Ridge • S = 1 iff you watched the Super Bowl last night • D = 1 iff you drive to VT • G = 1 iff you went to the gym in the last month


Slide #44.

Naïve Bayes: Subtlety #1 • Often the Xi are not really conditionally independent • Naïve Bayes often works pretty well anyway • Often gives the right classification, even when not the right probability (see [Domingos & Pazzani, 1996]) • What is the effect on the estimated P(Y | X)? • What if we have two identical copies of a feature? Slide credit: Tom Mitchell


Slide #45.

Naïve Bayes: Subtlety #2 • The MLE estimate for P(Xi | Y = y_k) might be zero (for example, Xi = birthdate, Xi = Feb_4_1995) • Why worry about just one parameter out of many? • What can we do to address this? • MAP estimates (adding "imaginary" examples) Slide credit: Tom Mitchell


Slide #46.

Estimating parameters: Y, Xi discrete-valued • Maximum likelihood estimates (MLE) • MAP estimates (Dirichlet priors): add "imaginary" example counts to the MLE counts (see the smoothed estimate sketched below) Slide credit: Tom Mitchell
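A sketch of the smoothed estimate that a MAP approach with a symmetric Dirichlet prior (Beta, when Xi is Boolean) gives: add l "imaginary" examples per possible attribute value, with l = 1 corresponding to Laplace smoothing. The helper name map_estimate is hypothetical.

```python
def map_estimate(count_xi_and_yk, count_yk, num_values, l=1.0):
    """P(X_i = x | Y = y_k) with l imaginary examples per possible value of X_i."""
    return (count_xi_and_yk + l) / (count_yk + l * num_values)

# Even if a value was never seen with class y_k, the estimate stays nonzero:
print(map_estimate(0, 10, num_values=2))   # 1/12 instead of the MLE's 0
```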


Slide #47.

What if we have continuous Xi? • Gaussian Naïve Bayes (GNB): assume P(Xi = x | Y = y_k) = N(x; μ_ik, σ_ik) • Additional assumptions on σ_ik: • independent of Y (σ_i) • independent of Xi (σ_k) • independent of both Xi and Y (σ) Slide credit: Tom Mitchell


Slide #48.

Naïve Bayes algorithm – continuous Xi • For each value y_k: estimate π_k ≡ P(Y = y_k) • For each attribute Xi: estimate the class-conditional mean μ_ik and variance σ_ik • Classify a new example X_new: Y_new ← argmax_{y_k} π_k ∏_i N(Xi_new; μ_ik, σ_ik) Slide credit: Tom Mitchell
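A minimal sketch of the continuous version above: estimate per-class, per-attribute means and variances, then score with Gaussian log-densities. The data and helper names are made up, and the small variance floor is an implementation detail assumed here, not something from the slide.

```python
import numpy as np

# Made-up continuous training data (rows = examples, columns = attributes).
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.5, 3.0], [4.7, 3.2], [6.9, 3.1]])
y = np.array([0, 0, 1, 1, 0, 1])

def train_gnb(X, y, var_floor=1e-6):
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}                  # pi_k = P(Y = y_k)
    mean = {k: X[y == k].mean(axis=0) for k in classes}             # mu_ik
    var = {k: X[y == k].var(axis=0) + var_floor for k in classes}   # sigma_ik^2
    return priors, mean, var

def classify_gnb(x, priors, mean, var):
    def log_post(k):
        # log pi_k + sum_i log N(x_i; mu_ik, sigma_ik^2)
        return np.log(priors[k]) - 0.5 * np.sum(
            np.log(2 * np.pi * var[k]) + (x - mean[k]) ** 2 / var[k])
    return max(priors, key=log_post)

priors, mean, var = train_gnb(X, y)
print(classify_gnb(np.array([5.0, 3.4]), priors, mean, var))  # predicts class 0 here
```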


Slide #49.

Things to remember • Probability basics • Estimating parameters from data • Maximum likelihood (ML): maximize P(D | θ) • Maximum a posteriori estimation (MAP): maximize P(θ | D) • Naive Bayes


Slide #50.

Next class • Logistic regression