Picking Algorithms: The Basic Models

Picking an algorithm for a machine learning project can be a confusing business. Most data scientists will tell you that there is rarely a single perfect answer to the question “What algorithm or model should I use?” In this series, we break down the important details that go into making that decision.

After understanding the basics of supervised vs. unsupervised learning, and regression vs. classification algorithms, you might be ready to try a model out. In this article, we will explore two basic models, linear regression and decision trees, in Infor Coleman ML to baseline your machine learning models. These models are rarely the final production models for a machine learning application, but they are very useful as baselines and establish a benchmark for your future models to improve on.

Linear Regression: Linear regression is a model that fits a linear relationship between your predictor variables and your target. This is a very simple relationship, but it comes with some assumptions.

  • Linear relationship: Linear regression assumes that the relationship between your features and your target is linear. Much like the relationship between height and wingspan, a change in one corresponds to a proportional change in the other.
  • Collinearity: For a linear regression model, it is important to determine whether multiple predictor variables are highly correlated with each other. For example, if you were using height and wingspan as predictor variables for weight, you would run into the issue that height and wingspan are highly correlated. A taller person has a larger wingspan, so the two variables carry largely the same information, and using both as predictors can make the model's coefficients unstable.
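  • Normal Distributions: Linear regression works best when the residuals (the differences between the predicted and actual values) are roughly normally distributed. Heavily skewed input or output variables often produce skewed residuals, so it is worth checking their distributions as well.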
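To make the assumptions above concrete, here is a minimal sketch of a one-variable linear regression fitted by ordinary least squares in plain Python. It does not use Infor Coleman ML. The function names and the height/wingspan numbers are made up purely for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_wingspan(height, slope, intercept):
    """Apply the fitted line to a new height value."""
    return slope * height + intercept


# Hypothetical illustrative data: height (cm) and wingspan (cm)
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
wingspans = [152.0, 161.0, 172.0, 181.0, 193.0]

slope, intercept = fit_simple_linear_regression(heights, wingspans)

# Residuals are what the normality assumption refers to: actual minus predicted
residuals = [w - predict_wingspan(h, slope, intercept)
             for h, w in zip(heights, wingspans)]
print(f"wingspan = {slope:.2f} * height + {intercept:.2f}")
```

Inspecting the residuals (for example with a histogram) is a quick way to check whether a straight line is a reasonable fit before treating the model as your baseline.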

Coleman AI offers visualization tools, including a correlation heatmap, to detect collinearity between variables so you can easily evaluate whether any variables can be removed from the model.

Correlation Heatmap sample in Coleman
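The numbers behind a correlation heatmap are pairwise correlation coefficients. As a rough sketch of the idea, and not Coleman's implementation, the plain-Python snippet below computes Pearson correlations and flags highly correlated pairs. The 0.9 threshold, the function names, and the sample values are assumptions chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical predictor columns for a weight model
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "wingspan_cm": [152.0, 161.0, 172.0, 181.0, 193.0],
    "age_years": [34.0, 21.0, 45.0, 19.0, 52.0],
}

flagged = highly_correlated_pairs(features)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r}); consider dropping one")
```

With this made-up data, only height and wingspan are flagged, which mirrors the collinearity example above.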

Decision Trees: A decision tree is best thought of as a game of “20 Questions” or “Guess Who?”. With enough if-then questions, a population can be reduced to smaller and smaller subsets until the value, person, or category can be guessed. Each question is a node that splits the data into two branches, and each branch has its own nodes that continue to split, producing a tree.
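The “20 Questions” idea can be sketched as a small data structure: each question node holds a feature and a threshold, and each leaf holds a guess. The tree below is written by hand rather than learned from data, and the class names, features, and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: str  # the final guess once no questions remain


@dataclass
class Question:
    feature: str
    threshold: float
    if_true: Union["Question", Leaf]   # followed when row[feature] <= threshold
    if_false: Union["Question", Leaf]  # followed otherwise


def guess(node, row):
    """Walk from the root, answering one if-then question per node."""
    while isinstance(node, Question):
        node = node.if_true if row[node.feature] <= node.threshold else node.if_false
    return node.prediction


# A hand-written toy tree that guesses a fruit from two measurements
tree = Question(
    "weight_g", 50.0,
    if_true=Leaf("cherry"),
    if_false=Question("length_cm", 12.0, if_true=Leaf("apple"), if_false=Leaf("banana")),
)

print(guess(tree, {"weight_g": 180.0, "length_cm": 8.0}))
```

A learned decision tree has the same shape. The difference is that the algorithm chooses the questions and thresholds from the training data instead of a person writing them down.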

Advantages:

  • Decision trees require relatively little effort from users for data preparation.
  • Decision trees can handle non-linear relationships in the data.
  • Decision trees can be easily visualized. Below is a visual representation of a decision tree for a basic example that uses the Titanic dataset to predict whether a passenger will survive (SIBSP = number of siblings/spouses aboard).

Decision tree depiction where each node splits into multiple branches

You can see that each question partitions the data, eventually leaving you in a bin that includes a prediction for survival. This brings us to the disadvantages of decision trees.

Disadvantages:

  • Decision trees do not work well with unbalanced classes. Ideally, each split cuts your data roughly in half. For example, if you are dealing with a dataset that is 90% red and 10% blue, most bins will be dominated by red examples, and your model will be biased towards predicting red.
  • Overfitting. With enough nodes and branches, you could keep splitting the Titanic population until each passenger falls into their own unique bin and the tree perfectly describes the dataset. This tree would not perform as well if we used it on the population of a different shipwreck. Tuning hyperparameters such as the maximum tree depth can keep the tree general enough to work on new data. Infor Coleman ML provides easy access to these hyperparameters for tuning models.
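The effect of the maximum depth setting can be seen in a small, self-contained sketch. The code below is a simplified greedy tree learner using Gini impurity, written in plain Python. It is not how Coleman builds trees, and the data is made up, not taken from the Titanic dataset. The hypothetical rule is "survived if age is 30 or under", with two training labels deliberately flipped to act as noise.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the bin is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth):
    """Split recursively until bins are pure or max_depth questions are used."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))


def predict(tree, row):
    """Follow the questions down to a leaf; leaves are plain (non-tuple) labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


# Made-up training data: one feature (age); labels for ages 15 and 45 are flipped
train_rows = [(5,), (10,), (15,), (20,), (25,), (35,), (40,), (45,), (50,), (55,)]
train_labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
# Made-up "different shipwreck" that follows the clean rule
test_rows = [(8,), (12,), (18,), (22,), (33,), (42,), (48,), (58,)]
test_labels = [1, 1, 1, 1, 0, 0, 0, 0]

shallow = build_tree(train_rows, train_labels, max_depth=1)
deep = build_tree(train_rows, train_labels, max_depth=10)

for name, tree in [("max_depth=1", shallow), ("max_depth=10", deep)]:
    print(name,
          "train:", accuracy(tree, train_rows, train_labels),
          "test:", accuracy(tree, test_rows, test_labels))
```

On this toy data the deep tree memorizes the two noisy passengers and scores perfectly on the training set, but it does worse than the one-question tree on the unseen data. That trade-off is what tuning the maximum depth is meant to control.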

These basic algorithms are a great place to start when developing your model. You might find that these algorithms work well for your applications, but you might also find that their assumptions don’t meet your use case, or that your business need requires a different performance threshold. These models serve as great baselines as model development moves into testing different algorithms.

We will continue this blog series in the coming weeks to further understand the world of machine learning.

Infor Coleman ML is part of the Infor technology platform. If you would like to learn more about how Coleman can benefit your business and the industry-specific machine learning models Infor can deploy, don’t hesitate to contact us.