Notes on 🐻Introduction to Machine Learning
I was curious to learn what has been updated since I took a version of CS 3780, so I am reading through the lecture slides for the Spring 2025 semester.
I use these notes to keep track of my thoughts while reading and to record anything that sticks out to me as interesting. The notes are incomplete as I have not yet finished reading the slides.
Introduction
- I’ve heard many definitions of machine learning and artificial intelligence, but the ones used by this course are machine learning: “programs that improve with experience, a subfield of artificial intelligence” and artificial intelligence: “programs that demonstrate ‘intelligence’ in some sense”
- The slides contain a fun flowchart: (training data + training output → program / model) + test data → test output
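Not from the slides, but to make that flowchart concrete for myself, here is a minimal sketch of the train → model → predict flow. The classifier choice (scikit-learn’s DecisionTreeClassifier) and the toy data are arbitrary picks of mine, not anything the course specifies:

```python
# Toy illustration of the flowchart: (training data + training output -> model),
# then model + test data -> test output. (My own sketch; classifier and data are arbitrary.)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# training data (feature vectors) and training output (labels)
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([0, 0, 1, 1])

# "learning" step: produce the program / model
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# test data -> test output
X_test = np.array([[0.9, 0.2], [0.1, 0.8]])
print(model.predict(X_test))  # predicted labels for the two test inputs
```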
ML basics
- Definitions adapted from notes:
- Supervised learning: “make predictions from data”
- Classifier: “a program to predict the correct label of each annotated data instance”
- Binary classification: a label is one of two possibilities
- Multi-class classification: a label is one of K possibilities
- Regression: the label space is the real numbers
- Feature vector: the input vector of a sample
- Feature: a dimension of the feature vector
- Dense feature vector: the number of nonzero coordinates in the feature vector is large relative to the number of features
- Sparse feature vector: the feature vector consists of mostly zeros
- Hypothesis class: the set of possible hypothesis functions, the set of hypothesis functions we can possibly learn, “encodes your assumptions about the data set / distribution”
- No Free Lunch Theorem: “every successful ML algorithm must make assumptions. This also means that there is no single ML algorithm that works for every setting”, “you must make assumptions in order to learn. No algorithm works in all settings”
- Loss function / risk function: “evaluates a hypothesis on our training data and tells us how bad it is”, “tells us how good a model did on an instance”
- Loss: “the higher the loss, the worse a hypothesis is - a loss of zero means it makes perfect predictions”
- Zero-one loss: “counts how many mistakes a hypothesis function makes on the training set”
- Normalized zero-one loss / training error: “the fraction of misclassified training samples” (see the sketch after this list)
- Overfitting: a hypothesis gets low error on the training data, but does horribly on samples not in the training data
- Weak law of large numbers: “the empirical average of data drawn from a distribution converges to its mean”
- Feature extraction: Selecting “part of instances we deem relevant for predicting output”
- Model / program / hypothesis: function from input to “label / output we would like to predict”
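To make the loss definitions above concrete, here is a minimal sketch of the zero-one loss and the normalized zero-one loss (training error). The function names and toy labels are made up by me, not taken from the course:

```python
# Zero-one loss: count the mistakes a hypothesis makes on the training set.
# Normalized zero-one loss (training error): fraction of misclassified samples.
# (My own sketch; names and toy data are made up.)

def zero_one_loss(predictions, labels):
    """Number of training samples the hypothesis gets wrong."""
    return sum(1 for p, y in zip(predictions, labels) if p != y)

def training_error(predictions, labels):
    """Fraction of misclassified training samples."""
    return zero_one_loss(predictions, labels) / len(labels)

# Example: a hypothesis that gets 2 of 5 training labels wrong.
preds  = [1, 0, 1, 1, 0]
labels = [1, 1, 1, 0, 0]
print(zero_one_loss(preds, labels))   # 2
print(training_error(preds, labels))  # 0.4
```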
K-nearest neighbors and the curse of dimensionality
- Definitions:
- k-NN algorithm: “for a test input, assign the most common label amongst its k most similar training inputs” (see the sketch at the end of this section)
- Bayes optimal classifier: “if you knew P(y|x), then you would simply predict the most likely label”
- According to Wikipedia: “[Minkowski distance](https://en.wikipedia.org/wiki/Minkowski_distance) is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. In the limiting case of p reaching infinity, we obtain the Chebyshev distance”
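Putting the first and last bullets together, here is a minimal k-NN sketch that classifies a test input by majority vote over its k nearest training inputs under the Minkowski distance (p=1 for Manhattan, p=2 for Euclidean). This is my own sketch under those assumptions, not the course’s implementation; the function names and toy data are made up:

```python
# Minimal k-NN sketch: majority vote over the k nearest training points
# under the Minkowski distance (p=1 -> Manhattan, p=2 -> Euclidean).
# (My own sketch; function names and toy data are made up.)
from collections import Counter

def minkowski(a, b, p=2):
    # (for p -> infinity you would take a max instead: the Chebyshev distance)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(x_test, X_train, y_train, k=3, p=2):
    # sort training points by distance to the test input
    neighbors = sorted(zip(X_train, y_train),
                       key=lambda pair: minkowski(x_test, pair[0], p))
    # majority vote amongst the k most similar training inputs
    top_k_labels = [label for _, label in neighbors[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Example: two clusters of 2-D points with labels 0 and 1.
X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict((0.5, 0.5), X_train, y_train, k=3))  # 0
print(knn_predict((5.5, 5.5), X_train, y_train, k=3))  # 1
```

Sorting every training point is O(n log n) per query, which is fine for a sketch but not how you would do it at scale.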