Lecture 11 - Introduction to Machine Learning

In this section, we will provide a brief introduction to machine learning using R. Let us first understand what machine learning entails by distinguishing it from other popular terms in the field, such as AI and data science.

Artificial Intelligence is the name of a whole knowledge field, similar to biology or chemistry.
Machine Learning is a part of artificial intelligence. An important part, but not the only one.
Neural Networks are one type of machine learning. A popular one, but there are other good guys in the class.
Deep Learning is a modern method of building, training, and using neural networks. Basically, it’s a new architecture.

The following map shows what machine learning involves.

We will introduce only a few of the classical machine learning techniques. The definitions and map above are taken from the blog post titled “Machine Learning for Everyone.” You are encouraged to read the full article for a more comprehensive understanding.

The classic methods in machine learning originated from pure statistics in the 1950s. These methods were developed to solve formal mathematical tasks and establish theories behind model construction. In practical applications, these methods focus on fitting models to data, capturing patterns in numbers, and ultimately facilitating summaries or predictions.

Constructing models often involves partitioning the data for different purposes: training, validation, and testing. Unless specified otherwise, we will consider the following approach for splitting data in the modeling process.
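As a rough illustration of the general idea of partitioning, here is a minimal sketch using rsample. The built-in mtcars data, the 75/25 proportions, and the object names (train_data, fit_data, val_data, test_data) are assumptions chosen for illustration, not necessarily the scheme used in this course.

```r
library(rsample)

set.seed(123)  # make the random split reproducible

# First split: hold out a test set (prop = 0.75 is an assumed ratio)
split <- initial_split(mtcars, prop = 0.75)
train_data <- training(split)
test_data  <- testing(split)

# Second split: carve a validation set out of the training rows
# (again an assumed 75/25 ratio)
val_split <- initial_split(train_data, prop = 0.75)
fit_data <- training(val_split)  # rows used to fit models
val_data <- testing(val_split)   # rows used to compare or tune models

# Every original row ends up in exactly one of the three sets
c(fit = nrow(fit_data), validation = nrow(val_data), test = nrow(test_data))
```

The test set is put aside first and touched only once, at the very end, so that the final performance estimate is not influenced by any modeling decision.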

In this course, we will introduce tidymodels, a package that is becoming the tidyverse toolkit for machine learning. Max Kuhn, formerly of Pfizer and now with RStudio, leads its development. He is notably also the developer of the caret package in R, which provides a uniform interface for the diverse range of machine learning models available in R.

The following diagram illustrates which step each package covers in a typical data science project.

Even though modeling is a single step in that workflow, the development of models can benefit from having a tidyverse-friendly interface. This is where tidymodels comes into play.

Like the tidyverse, tidymodels is an umbrella for a collection of packages. In this introductory section, we will showcase functions from four tidymodels packages.

rsample - provides functions to create different types of resamples and corresponding classes for their analysis.
recipes - provides dplyr-like pipeable sequences of feature engineering steps to get your data ready for modeling.
parsnip - provides a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.
yardstick - measures the effectiveness of models using performance metrics.
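To see how the four packages fit together, here is a minimal sketch that runs one small regression from start to finish. The built-in mtcars data, the formula mpg ~ wt + hp, the 75/25 split, and all object names are assumptions chosen for illustration. It is one possible arrangement of these functions, not the definitive workflow.

```r
library(rsample)
library(recipes)
library(parsnip)
library(yardstick)

set.seed(123)  # make the random split reproducible

# rsample: split the data into training and test sets (assumed 75/25)
split <- initial_split(mtcars, prop = 0.75)
train_data <- training(split)
test_data  <- testing(split)

# recipes: declare the preprocessing steps, then estimate them
# (means and standard deviations) from the training data only
rec <- recipe(mpg ~ wt + hp, data = train_data) %>%
  step_normalize(all_numeric_predictors())
rec_prepped <- prep(rec, training = train_data)
train_baked <- bake(rec_prepped, new_data = NULL)       # processed training set
test_baked  <- bake(rec_prepped, new_data = test_data)  # same steps on test set

# parsnip: specify the model and its engine, then fit it
lm_spec <- linear_reg() %>% set_engine("lm")
lm_fit  <- fit(lm_spec, mpg ~ wt + hp, data = train_baked)

# yardstick: compare predictions with the observed values
preds   <- predict(lm_fit, new_data = test_baked)
results <- data.frame(truth = test_baked$mpg, estimate = preds$.pred)
metrics_tbl <- rmse(results, truth = truth, estimate = estimate)
metrics_tbl
```

Swapping in a different model would mostly mean changing the parsnip specification (the model function and the engine), while the splitting, preprocessing, and evaluation code stays the same.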

The following diagram illustrates each modeling step, and lines up the tidymodels packages that we will use in this section: