Linear Regression
• Model representation: h_θ(x) = θ_0 + θ_1 x_1 + … + θ_n x_n = θᵀx
• Cost function: J(θ) = (1/2m) Σ_i (h_θ(x^(i)) − y^(i))²
• Gradient descent for linear regression
  Repeat until convergence { θ_j := θ_j − α (1/m) Σ_i (h_θ(x^(i)) − y^(i)) x_j^(i) } (update all θ_j simultaneously)
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: θ = (XᵀX)⁻¹ Xᵀ y
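The gradient descent update can be sketched in a few lines of NumPy. This is an illustrative helper (the function name, learning rate, and iteration count are our choices, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression.
    X: (m, n) design matrix whose first column is all ones (intercept term).
    alpha (learning rate) and iters are illustrative defaults."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # Simultaneous update: theta_j -= alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

# Tiny example: recover y = 1 + 2x from noiseless data
X = np.c_[np.ones(5), np.arange(5.0)]
y = 1.0 + 2.0 * np.arange(5.0)
theta = gradient_descent(X, y)
```

With a small enough learning rate, the iterates converge to the least-squares solution (intercept 1, slope 2 here).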

Justification/interpretation 1: geometric interpretation
• Xθ̂ is the projection of y onto the column space of X, i.e., the span of the columns of X
• The residual y − Xθ̂ is orthogonal to the column space of X
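The orthogonality claim is easy to verify numerically. A minimal NumPy check (random data is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # design matrix
y = rng.normal(size=20)

# Least-squares fit: X @ theta_hat is the projection of y onto col(X)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ theta_hat

# The residual is orthogonal to every column of X, i.e., X^T r = 0
max_inner_product = np.abs(X.T @ residual).max()
```

`max_inner_product` is zero up to floating-point error, confirming the residual lies orthogonal to span(X).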

Justification/interpretation 2: probabilistic model
• Assume a linear model with Gaussian errors: y = θᵀx + ε, with ε ~ N(0, σ²)
• Solving for the maximum likelihood estimate of θ recovers the least-squares solution
Image credit: CS [email protected]

m training examples, n features
Gradient descent:
• Need to choose the learning rate α
• Needs many iterations
• Works well even when n is large
Normal equation:
• No need to choose α
• No iteration
• Needs to compute (XᵀX)⁻¹
• Slow if n is very large
Slide credit: Andrew Ng
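The normal equation side of the comparison fits in one line of NumPy. A sketch (the helper name is ours; solving the linear system is preferred over forming the inverse explicitly):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^{-1} X^T y.
    The (n x n) solve costs O(n^3), which is why this is slow for very large n."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Same toy data as before: y = 1 + 2x
X = np.c_[np.ones(5), np.arange(5.0)]
y = 1.0 + 2.0 * np.arange(5.0)
theta = normal_equation(X, y)   # recovers intercept 1, slope 2
```

No learning rate and no iterations, at the price of the O(n³) solve.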

Things to remember
• Model representation
• Cost function
• Gradient descent for linear regression: repeat until convergence { θ_j := θ_j − α (1/m) Σ_i (h_θ(x^(i)) − y^(i)) x_j^(i) }
• Features and polynomial regression: can combine features; can use different functions to generate features (e.g., polynomial)
• Normal equation: θ = (XᵀX)⁻¹ Xᵀ y

Random variables
• Outcome space S: the space of possible outcomes
• Random variables: functions that map outcomes to real numbers
• Event E: a subset of S

Joint distribution
Making a joint distribution of M variables:
1. Make a truth table listing all combinations of values
2. For each combination of values, say how probable it is
3. The probabilities must sum to 1

A B C | Prob
0 0 0 | 0.30
0 0 1 | 0.05
0 1 0 | 0.10
0 1 1 | 0.05
1 0 0 | 0.05
1 0 1 | 0.10
1 1 0 | 0.25
1 1 1 | 0.10
Slide credit: Tom Mitchell
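The three steps can be checked in code. A minimal sketch using the table's numbers (the dictionary representation is our choice):

```python
from itertools import product

# The joint distribution from the slide: one probability per (A, B, C) row
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 1 check: the table covers all 2^3 combinations
assert set(joint) == set(product([0, 1], repeat=3))

# Step 3 check: the probabilities sum to 1
total = sum(joint.values())

# Any event probability is a sum over matching rows, e.g. P(A = 1)
p_a1 = sum(p for (a, b, c), p in joint.items() if a == 1)
```

Once the table exists, every event probability is just a sum over the rows that belong to the event.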

The problem with learning the joint distribution
• Main problem: learning may require more data than we have
• Say we learn a joint distribution over 100 boolean attributes
• Number of rows in this table: 2^100 ≈ 1.3 × 10^30
• Number of people on earth: ≈ 8 × 10^9
Slide credit: Tom Mitchell

What should we do?
1. Be smart about how we estimate probabilities from sparse data
• Maximum likelihood estimates (MLE)
• Maximum a posteriori estimates (MAP)
2. Be smart about how we represent joint distributions
• Bayes networks, graphical models (more on this later)
Slide credit: Tom Mitchell

Estimating the probability of heads
• Flip the coin repeatedly, observing whether it lands heads or tails
• It turns up heads α_H times and tails α_T times
• Your estimate for θ = P(heads) is?
• Case A: 100 flips: 51 heads (α_H = 51), 49 tails (α_T = 49)
• Case B: 3 flips: 2 heads (α_H = 2), 1 tail (α_T = 1)
Slide credit: Tom Mitchell

Two principles for estimating parameters
• Maximum likelihood estimate (MLE): choose the θ that maximizes the probability of the observed data
• Maximum a posteriori estimate (MAP): choose the θ that is most probable given the prior probability and the data
Slide credit: Tom Mitchell

Two principles for estimating parameters
• Maximum likelihood estimate (MLE): choose θ̂ = argmax_θ P(D | θ)
• Maximum a posteriori estimate (MAP): choose θ̂ = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D)
Slide credit: Tom Mitchell

Maximum likelihood estimate
• Each flip yields a Boolean value for X: P(X = 1) = θ, P(X = 0) = 1 − θ
• A data set D of independent, identically distributed (iid) flips produces α_H ones and α_T zeros
• P(D | θ) = θ^{α_H} (1 − θ)^{α_T}; maximizing over θ gives θ̂_MLE = α_H / (α_H + α_T)
Slide credit: Tom Mitchell
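The MLE is just the empirical fraction of heads. A one-line sketch applied to the two coin-flip cases from the earlier slide (the function name is ours):

```python
def mle_theta(n_heads, n_tails):
    """MLE for the Bernoulli parameter theta = P(heads): fraction of heads."""
    return n_heads / (n_heads + n_tails)

# Case A and Case B from the coin-flip slide
case_a = mle_theta(51, 49)   # 100 flips -> 0.51
case_b = mle_theta(2, 1)     # 3 flips   -> 2/3
```

Note that the MLE treats both cases the same way: 2 heads in 3 flips gives the confident-looking estimate 2/3 despite almost no data, which motivates MAP estimation next.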

Maximum a posteriori estimate
• A data set D of iid flips produces α_H ones and α_T zeros
• Assume a Beta prior P(θ) = Beta(β_H, β_T) ∝ θ^{β_H − 1} (1 − θ)^{β_T − 1} (a conjugate prior: closed-form representation of the posterior)
• Posterior: P(θ | D) = Beta(α_H + β_H, α_T + β_T); θ̂_MAP = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2)
Slide credit: Tom Mitchell
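The Beta-prior MAP estimate has a simple closed form, which can be read as adding "imaginary" flips to the data. A sketch (the function name and the Beta(3, 3) prior are illustrative choices):

```python
def map_theta(n_heads, n_tails, beta_h, beta_t):
    """MAP estimate under a Beta(beta_h, beta_t) prior: the prior acts like
    beta_h - 1 imaginary heads and beta_t - 1 imaginary tails."""
    return (n_heads + beta_h - 1) / (n_heads + beta_h + n_tails + beta_t - 2)

# 3 real flips (2 heads, 1 tail) plus a Beta(3, 3) prior pulls the
# MLE of 2/3 back toward 1/2: (2 + 2) / (3 + 4) = 4/7
est = map_theta(2, 1, 3, 3)
```

With lots of data the imaginary flips are swamped and MAP approaches the MLE; with little data the prior dominates.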

Some terminology
• Likelihood function: P(D | θ)
• Prior: P(θ)
• Posterior: P(θ | D)
• Conjugate prior: P(θ) is the conjugate prior for a likelihood function P(D | θ) if the prior and the posterior have the same form
• Example (coin flip problem): prior P(θ) = Beta; likelihood P(D | θ) = Binomial; posterior P(θ | D) = Beta
Slide credit: Tom Mitchell

Conditional independence
• Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:
  P(X | Y, Z) = P(X | Z)
• Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Slide credit: Tom Mitchell

Applying conditional independence
• Naïve Bayes assumes the attributes X_1, …, X_n are conditionally independent given Y
• e.g., P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
• General form: P(X_1, …, X_n | Y) = Π_i P(X_i | Y)
• How many parameters to describe P(X_1, …, X_n | Y) for boolean X_i and Y?
  • Without the conditional independence assumption: 2(2^n − 1)
  • With the conditional independence assumption: 2n
Slide credit: Tom Mitchell
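The parameter savings are dramatic and easy to compute. A sketch of the counting for boolean attributes and a boolean class (the function names are ours):

```python
def n_params_full(n):
    """Boolean X_1..X_n, boolean Y: the full table P(X_1..X_n | Y) needs
    2^n - 1 free parameters per class, for each of the 2 classes."""
    return 2 * (2 ** n - 1)

def n_params_naive(n):
    """Under conditional independence: one parameter P(X_i = 1 | Y = y)
    per attribute per class."""
    return 2 * n

# With 30 boolean attributes: billions of parameters vs. 60
full_30 = n_params_full(30)    # 2,147,483,646
naive_30 = n_params_naive(30)  # 60
```

This is the whole point of the Naïve Bayes assumption: the parameter count drops from exponential in n to linear in n.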

Example attributes:
• F = 1 iff you live in Fox Ridge
• S = 1 iff you watched the Super Bowl last night
• D = 1 iff you drive to VT
• G = 1 iff you went to the gym in the last month

Naïve Bayes: Subtlety #1
• Often the X_i are not really conditionally independent given Y
• Naïve Bayes often works pretty well anyway
• It often gives the right classification, even when not the right probability [Domingos & Pazzani, 1996]
• What is the effect on the estimated P(Y | X)?
• What if we have two identical copies of an attribute, X_i = X_j? (Its evidence gets counted twice.)
Slide credit: Tom Mitchell

Naïve Bayes: Subtlety #2
• The MLE estimate for P(X_i = x | Y = y) might be zero (for example, X_i = birthdate, x = Feb_4_1995)
• Why worry about just one parameter out of many? A single zero factor drives the whole product Π_i P(X_i | Y) to zero.
• What can we do to address this? MAP estimates (adding "imaginary" examples)
Slide credit: Tom Mitchell
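The "imaginary examples" fix can be sketched directly; with m = 1 imaginary example per value it is Laplace smoothing (the function name and the birthdate numbers are illustrative):

```python
def smoothed_estimate(count_xy, count_y, n_values, m=1):
    """MAP-style estimate of P(X_i = x | Y = y) with m 'imaginary' examples
    per attribute value (m = 1 is Laplace smoothing), so no estimate is
    exactly zero."""
    return (count_xy + m) / (count_y + m * n_values)

# A birthdate never seen with this class still gets nonzero probability:
# (0 + 1) / (50 + 365) = 1/415
p_unseen = smoothed_estimate(0, 50, n_values=365)
```

The product over attributes can now never be zeroed out by one unlucky unseen value.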

What if we have continuous X_i?
• Gaussian Naïve Bayes (GNB): assume P(X_i | Y = y_k) = N(μ_ik, σ_ik)
• Additional assumptions on σ_ik:
  • independent of Y (σ_i)
  • independent of X_i (σ_k)
  • independent of both X_i and Y (σ)
Slide credit: Tom Mitchell

Naïve Bayes algorithm – continuous X_i
• For each value y_k: estimate π_k = P(Y = y_k)
• For each attribute X_i: estimate the class-conditional mean μ_ik and variance σ_ik
• Classify a new example x: y ← argmax_k π_k Π_i N(x_i; μ_ik, σ_ik)
Slide credit: Tom Mitchell
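The estimate-then-argmax recipe above fits in a small class. A minimal sketch, assuming per-class, per-attribute variances and working in log space for numerical stability (the class name and toy data are ours):

```python
import numpy as np

class SimpleGNB:
    """Minimal Gaussian Naive Bayes sketch: class priors pi_k plus
    per-class, per-attribute means mu_ik and variances var_ik."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == k).mean() for k in self.classes])
        self.mu = np.array([X[y == k].mean(axis=0) for k in self.classes])
        self.var = np.array([X[y == k].var(axis=0) for k in self.classes])
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # log pi_k + sum_i log N(x_i; mu_ik, var_ik); pick the argmax over k
            log_post = np.log(self.prior) - 0.5 * np.sum(
                np.log(2 * np.pi * self.var) + (x - self.mu) ** 2 / self.var,
                axis=1)
            preds.append(self.classes[np.argmax(log_post)])
        return np.array(preds)

# Two well-separated 1-D classes
X = np.array([[0.0], [0.2], [-0.1], [5.0], [5.2], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])
model = SimpleGNB().fit(X, y)
labels = model.predict(np.array([[0.1], [5.1]]))
```

Taking logs turns the product Π_i N(x_i; μ_ik, σ_ik) into a sum, which avoids underflow when there are many attributes.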

Things to remember
• Probability basics
• Estimating parameters from data
  • Maximum likelihood (MLE): maximize P(D | θ)
  • Maximum a posteriori estimation (MAP): maximize P(θ | D)
• Naïve Bayes