Introducing: Machine Learning in R
Machine learning is a branch of computer science that studies the design of algorithms that can learn. Typical machine learning tasks are concept learning, function learning or predictive modeling, clustering and finding predictive patterns. These tasks are learned from available data that were observed through experience or instruction, for example. The hope is that incorporating this experience will eventually improve the learning. The ultimate goal is to improve the learning to the point where it becomes automatic, so that humans no longer need to intervene.
This small tutorial is meant to introduce you to the basics of machine learning in R: more specifically, it will show you how to use R to work with the well-known machine learning algorithm called KNN or k-nearest neighbors.
If you're interested in following a course, consider checking out our Introduction to Machine Learning with R or DataCamp's Unsupervised Learning in R course!
The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances.
More specifically, the closeness of the stored data to the new instance is calculated by means of some kind of similarity measure. This is typically expressed by a distance measure such as the Euclidean distance or the Manhattan distance, or by a similarity such as cosine similarity.
In other words, the similarity to the data that was already in the system is calculated for any new data point that you input into the system.
Then, you use this similarity value to perform predictive modeling. Predictive modeling is either classification, assigning a label or a class to the new instance, or regression, assigning a value to the new instance. Whether you classify or assign a value to the new instance depends, of course, on how you compose your model with KNN.
The k-nearest neighbor algorithm adds to this basic algorithm that after the distance of the new point to all stored data points has been calculated, the distance values are sorted and the k-nearest neighbors are determined. The labels of these neighbors are gathered and a majority vote or weighted vote is used for classification or regression purposes.
In other words, the higher the score for a certain data point that was already stored, the more likely that the new instance will receive the same classification as that of the neighbor. In the case of regression, the value that will be assigned to the new data point is the mean of its k nearest neighbors.
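The procedure described above (compute distances, sort them, take the k closest points, vote) can be sketched in a few lines of base R. This is a toy illustration under stated assumptions, not the implementation used by any package: the function name knn_sketch and the small two-cluster data set are made up for the example, and Euclidean distance with an unweighted majority vote is assumed.

```r
# Toy k-nearest-neighbours classifier (hypothetical helper, for illustration only)
knn_sketch <- function(train, labels, new_point, k = 3) {
  # Euclidean distance from new_point to every stored row
  diffs <- sweep(as.matrix(train), 2, as.numeric(new_point))
  dists <- sqrt(rowSums(diffs^2))
  # Row indices of the k smallest distances
  nearest <- order(dists)[seq_len(k)]
  # Unweighted majority vote among the neighbours' labels
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up training data: one cluster near (1, 1), one near (5, 5)
train <- data.frame(x = c(1, 1.2, 0.8, 5, 5.2, 4.9),
                    y = c(1, 0.9, 1.1, 5, 5.1, 4.8))
labels <- factor(c("a", "a", "a", "b", "b", "b"))

knn_sketch(train, labels, c(1.1, 1.0), k = 3)
```

For regression, the vote would be replaced by the mean of the neighbours' numeric target values.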
Machine learning usually starts from observed data. You can take your own data set or browse through other sources to find one.
This tutorial uses the Iris data set, which is very well-known in the area of machine learning. This dataset is built into R, so you can take a look at this dataset by typing the following into your console:
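A minimal sketch of that console input, assuming a standard R session in which the built-in datasets package is attached (the default):

```r
# Print the built-in data set to the console
iris

# Its dimensions: number of rows, then number of columns
dim(iris)
```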
If you want to download the data set instead of using the one that is built into R, you can go to the UC Irvine Machine Learning Repository and look up the Iris data set.
Tip: don't only check out the data folder of the Iris data set, but also take a look at the data description page!
Then, use the following command to load in the data:
The command reads the .csv or Comma Separated Value file from the website. The header argument has been put to FALSE, which means that the Iris data set from this source does not give you the attribute names of the data.
Instead of the attribute names, you might see strange column names such as V1 or V2 when you inspect the iris data frame with a function such as head(). These are not random: R assigns them automatically as default names when no header is read.
To simplify working with the data set, it is a good idea to set the column names yourself: you can do this through the function names(), which gets or sets the names of an object. Concatenate the names of the attributes as you would like them to appear. In the code chunk above, you'll have listed Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.
Once again, these names don't come out of the blue: take a look at the description of the data set that is linked above. You'll normally see all these names listed.
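The loading and renaming steps can be sketched as follows. To keep the example runnable offline, it reads three sample rows from an inline string rather than from the repository; the string stands in for the header-less file, and the object name iris_raw is an assumption made for this sketch. With the real file you would pass its URL or a local path to read.csv() instead of the text argument.

```r
# Three rows in the same header-less, comma-separated layout as the downloaded file
raw_text <- "5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica"

# header = FALSE: the first line is treated as data, not as column names
iris_raw <- read.csv(text = raw_text, header = FALSE)

# Default names assigned by R when no header is read
names(iris_raw)

# Replace the defaults with the attribute names from the data description
names(iris_raw) <- c("Sepal.Length", "Sepal.Width", "Petal.Length",
                     "Petal.Width", "Species")
head(iris_raw)
```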
Now that you have loaded the Iris data set into RStudio, you should try to get a thorough understanding of what your data is about. Just looking or reading about your data is certainly not enough to get started!
You need to get your hands dirty, explore and visualize your data set and even gather some more domain knowledge if you feel the data is way over your head.
You'll probably already have the domain knowledge that you need, but just as a reminder, all flowers contain a sepal and a petal. The sepal encloses the petals and is typically green and leaf-like, while the petals are typically colored leaves. For the iris flowers, this is just a little bit different, as you can see in the following picture:
First, you can already try to get an idea of your data by making some graphs, such as histograms or boxplots. In this case, however, scatter plots can give you a great idea of what you're dealing with: it can be interesting to see how much one variable is affected by another.
In other words, you want to see if there is any correlation between two variables.
You can make scatterplots with the ggvis package, for example.
Note that you first need to load the ggvis package with library(ggvis).
You see that there is a high correlation between the sepal length and the sepal width of the Setosa iris flowers, while the correlation is somewhat less high for the Virginica and Versicolor flowers: the data points are more spread out over the graph and don't form a cluster like you can see in the case of the Setosa flowers.
The scatter plot that maps the petal length and the petal width tells a similar story:
You see that this graph indicates a positive correlation between the petal length and the petal width for all different species that are included into the Iris data set. Of course, you probably need to test this hypothesis a bit further if you want to be really sure of this:
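One way to test this is with cor(), first on the whole data set and then per species. This is a sketch that assumes the built-in iris data frame; by_species is a name chosen for the example, and a downloaded copy of the data may give slightly different figures than the built-in one.

```r
# Overall correlation between petal length and petal width
cor(iris$Petal.Length, iris$Petal.Width)

# The same correlation computed within each species
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Petal.Length, d$Petal.Width)
})
round(by_species, 2)
```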
You see that when you combine all three species, the correlation is stronger than when you look at the different species separately: the overall correlation is 0.96, while for Versicolor it is 0.79. Setosa and Virginica, on the other hand, have correlations between petal length and width of 0.31 and 0.32 when rounded.
Tip: are you curious about ggvis, graphs or histograms in particular? Check out our histogram tutorial and/or ggvis course.
After a general visualized overview of the data, you can also view the data set by entering iris in the console.
However, as you will see from the result of this command, this really isn't the best way to inspect your data set thoroughly: the data set takes up a lot of space in the console, which will impede you from forming a clear idea about your data. It is therefore a better idea to inspect the data set by executing head(iris) or str(iris).
Note that the last command will help you to clearly distinguish the data type num and the three levels of the Species attribute, which is a factor. This is very convenient, since many R machine learning classifiers require that the target feature is coded as a factor.
Remember that factor variables represent categorical variables in R. They can thus take on a limited number of different values.
A quick look at the Species attribute through table(iris$Species) tells you that the division of the species of flowers is 50-50-50. On the other hand, if you want to check the percentage division of the Species attribute, you can ask for a table of proportions:
Note that the round() function rounds the values of its first argument, prop.table(table(iris$Species))*100, to the specified number of digits, which is one digit after the decimal point. You can easily adjust this by changing the value of the digits argument.
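Put together, the count and the table of proportions described above might look like this; species_pct is a name chosen for the sketch, and the built-in iris data frame is assumed.

```r
# Number of flowers per species
table(iris$Species)

# Percentage per species, rounded to one digit after the decimal point
species_pct <- round(prop.table(table(iris$Species)) * 100, digits = 1)
species_pct
```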
Let's not remain on this high-level overview of the data! R gives you the opportunity to go more in-depth with the summary() function. This will give you the minimum value, first quartile, median, mean, third quartile and maximum value of the Iris data set for numeric data types. For the class variable, the count of each factor level will be returned:
As you can see, the c() function is added to the original command: the columns petal width and sepal width are concatenated and a summary is then asked of just these two columns of the Iris data set.
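A sketch of both calls, assuming the built-in iris data frame with its standard column names; sub_summary is a name chosen for the example.

```r
# Summary of every column: quartiles for numeric columns, counts for the factor
summary(iris)

# Summary restricted to two columns, selected by name with c()
sub_summary <- summary(iris[c("Petal.Width", "Sepal.Width")])
sub_summary
```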
After you have acquired a good understanding of your data, you have to decide on the use cases that would be relevant for your data set. In other words, you think about what your data set might teach you or what you think you can learn from your data. From there on, you can think about what kind of algorithms you would be able to apply to your data set in order to get the results that you think you can obtain.
Tip: keep in mind that the more familiar you are with your data, the easier it will be to assess the use cases for your specific data set. The same also holds for finding the appropriate machine learning algorithm.
For this tutorial, the Iris data set will be used for classification, which is an example of predictive modeling. The last attribute of the data set, Species, will be the target variable or the variable that you want to predict in this example.
Note that you can also take one of the numerical attributes as the target variable if you want to use KNN to do regression.
Many of the algorithms used in machine learning are not incorporated by default into R. You will most probably need to download the packages that you want to use when you want to get started with machine learning.
Tip: got an idea of which learning algorithm you may use, but not of which package you want or need? You can find a pretty complete overview of all the packages that are used in R right here.
To illustrate the KNN algorithm, this tutorial works with the package class:
If you don't have this package yet, you can quickly and easily install it by typing the following line of code:
Remember the nerd tip: if you're not sure if you have this package, you can run the following command to find out!
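One possible way to combine the check and the loading step is sketched below. It uses requireNamespace() for the check, which may differ from the command the tutorial originally showed; has_class is a name chosen for the example, and installation is left as a message so that the sketch does not touch the network.

```r
# TRUE if the class package can be found in the library paths
has_class <- requireNamespace("class", quietly = TRUE)

if (has_class) {
  # Attach the package so that knn() is available
  library(class)
} else {
  message('Package "class" is not installed; run install.packages("class") first.')
}
```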
After exploring your data and preparing your workspace, you can finally return to the task ahead: making a machine learning model. However, before you can do this, it's important to also prepare your data. The following section will outline two ways in which you can do this: by normalizing your data (if necessary) and by splitting your data into training and test sets.
As a part of your data preparation, you might need to normalize your data so that it's consistent. For this introductory tutorial, just remember that normalization makes it easier for the KNN algorithm to learn. There are two types of normalization:
So when do you need to normalize your dataset?
In short: when you suspect that the data is not consistent.
You can easily see this when you go through the results of the summary() function. Look at the minimum and maximum values of all the (numerical) attributes. If you see that one attribute has a wide range of values, you will need to normalize your dataset, because this means that the distance will be dominated by this feature.
For example, if your dataset has just two attributes, X and Y, and X has values that range from 1 to 1000, while Y has values that only go from 1 to 100, then Y's influence on the distance function will usually be overpowered by X's influence.
When you normalize, you actually adjust the range of all features, so that distances between variables with larger ranges will not be over-emphasised.
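The X and Y example can be made concrete with a small calculation. The two points a and b below are made up for illustration; the ranges 1 to 1000 and 1 to 100 come from the text, and min-max scaling is assumed as the normalization.

```r
# Two hypothetical points, each sitting at comparable positions within its range
a <- c(X = 100, Y = 10)
b <- c(X = 600, Y = 60)

# Share of the squared Euclidean distance contributed by each feature
raw_sq <- (a - b)^2
raw_share <- raw_sq / sum(raw_sq)

# Min-max scale each feature using the ranges stated in the text
lo <- c(X = 1, Y = 1)
hi <- c(X = 1000, Y = 100)
scaled_sq <- ((a - lo) / (hi - lo) - (b - lo) / (hi - lo))^2
scaled_share <- scaled_sq / sum(scaled_sq)

# Before scaling X dominates; after scaling both features weigh about the same
round(rbind(raw = raw_share, scaled = scaled_share), 3)
```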
Tip: go back to the result of summary(iris) and try to figure out if normalization is necessary.
The Iris data set doesn't need to be normalized: the Sepal.Length attribute has values that go from 4.3 to 7.9 and Sepal.Width contains values from 2 to 4.4, while Petal.Length's values range from 1 to 6.9 and Petal.Width goes from 0.1 to 2.5. All values of all attributes are contained within the range of 0.1 and 7.9, which you can consider acceptable.
Nevertheless, it's still a good idea to study normalization and its effect, especially if you're new to machine learning. You can perform feature normalization, for example, by first making your own normalize() function.
You can then use this function in another command, where you put the results of the normalization in a data frame through as.data.frame() after the function lapply() returns a list of the same length as the data set that you give in. Each element of that list is the result of applying normalize() to the corresponding column of the data set that served as input:
For the Iris dataset, you would have applied the normalize() function to the four numerical attributes of the Iris data set (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and put the results in a data frame.
Tip: to more thoroughly illustrate the effect of normalization on the data set, compare the following result to the summary of the Iris data set that was given in step two.
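A minimal sketch of these steps, assuming min-max scaling as the feature normalization and the built-in iris data frame; iris_norm is a name chosen for the example.

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply normalize() to each of the four numeric columns and rebuild a data frame
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

# Every column now has a minimum of 0 and a maximum of 1
summary(iris_norm)
```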
In order to assess your model's performance later, you will need to divide the data set into two parts: a training set and a test set.
The first is used to train the system, while the second is used to evaluate the learned or trained system. In practice, the training and test sets are disjoint: the most common splitting choice is to take 2/3 of your original data set as the training set, while the remaining 1/3 composes the test set.
One last look at the data set shows that its rows are ordered by species. If you split the data set as is, taking the first two thirds as the training set, you would get a training set with all instances of Setosa and Versicolor, but none of Virginica. The model would therefore classify all unknown instances as either Setosa or Versicolor, as it would not be aware of the presence of a third species of flowers in the data.
In short, you would get incorrect predictions for the test set.
You thus need to make sure that all three classes of species are present in the training set. What's more, the number of instances of all three species needs to be more or less equal so that you do not favour one or the other class in your predictions.
To make your training and test sets, you first set a seed. A seed is a number that initializes R's random number generator. The major advantage of setting a seed is that you get the same sequence of random numbers whenever you supply the same seed.
Then, you want to make sure that your Iris data set is shuffled and that you have a roughly equal number of each species in your training and test sets.
You use the sample() function to take a sample with a size that is set as the number of rows of the Iris data set, or 150. You sample with replacement: you choose from a vector of 2 elements and assign either 1 or 2 to the 150 rows of the Iris data set. The assignment of the elements is subject to probability weights of 0.67 and 0.33.
Note that the replace argument is set to TRUE: after a 1 or a 2 has been assigned to a row, both values remain available, so every following row can again receive either a 1 or a 2. Because the two values should not be equally likely, you specify probability weights with the prob argument. Note also that the seed in this example is set to 1234.
Remember that you want your training set to be 2/3 of your original data set: that is why you assign 1 with a probability of 0.67 and the 2s with a probability of 0.33 to the 150 sample rows.
You can then use the sample that is stored in the variable ind to define your training and test sets:
Note that, in addition to the 2/3 and 1/3 proportions specified above, you don't take into account all attributes to form the training and test sets. Specifically, you only take Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. This is because you actually want to predict the fifth attribute, Species: it is your target variable. However, you do want to pass it to the KNN algorithm, otherwise there will never be any prediction for it.
You therefore need to store the class labels in factor vectors and divide them over the training and test sets:
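The seeding, sampling and splitting steps described above can be sketched as follows. The seed 1234, the variable ind and the 0.67/0.33 weights come from the text; the object names iris.training, iris.test, iris.trainLabels and iris.testLabels are assumptions made for this sketch, as is the use of the built-in iris data frame.

```r
# Fix the random number generator so the split is reproducible
set.seed(1234)

# Assign each of the 150 rows to group 1 (training) or group 2 (test)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

# Feature columns only (the four numeric attributes)
iris.training <- iris[ind == 1, 1:4]
iris.test     <- iris[ind == 2, 1:4]

# Class labels, stored separately as factor vectors
iris.trainLabels <- iris[ind == 1, 5]
iris.testLabels  <- iris[ind == 2, 5]

# Check that every species is present in both parts
table(iris.trainLabels)
table(iris.testLabels)
```

Because each row is assigned independently, the split is only approximately 2/3 to 1/3, and the species counts are only roughly balanced.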
After all these preparation steps, you have made sure that all your known (training) data is stored. No actual model or learning was performed up until this moment. Now, you want to find the k nearest neighbors in your training set for each unknown instance.
An easy way to do this is by using the knn() function, which uses the Euclidean distance measure to find the k nearest neighbours of your new, unknown instance. Here, the k parameter is one that you set yourself.
As mentioned before, new instances are classified by looking at the majority vote or weighted vote. In case of classification, the class with the highest score wins the battle and the unknown instance receives the label of that winning class. If there is a tie, the class is chosen at random.
Note: the k parameter is often an odd number to avoid ties in the voting scores.
To build your classifier, you need to take the knn() function and simply add some arguments to it, just like in this example:
You store into iris_pred the result of the knn() function, which takes as arguments the training set, the test set, the train labels and the number of neighbours you want to find with this algorithm. The result of this function is a factor vector with the predicted classes for each row of the test data.
Note that you don't want to insert the test labels: these will be used to see if your model is good at predicting the actual classes of your instances!
You see that when you inspect the result, iris_pred, you'll get back the factor vector with the predicted classes for each row of the test data.
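A sketch of the classifier call, assuming the class package is installed. The split is repeated so the example runs on its own; the object names other than ind and iris_pred are assumptions, and k = 3 is an arbitrary choice made for illustration.

```r
library(class)

# Reproducible 2/3 vs 1/3 split, as described earlier
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
iris.training    <- iris[ind == 1, 1:4]
iris.test        <- iris[ind == 2, 1:4]
iris.trainLabels <- iris[ind == 1, 5]

# Classify each test row by a vote among its 3 nearest training rows
iris_pred <- knn(train = iris.training, test = iris.test,
                 cl = iris.trainLabels, k = 3)

# Factor vector with one predicted class per test row
iris_pred
```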
An essential next step in machine learning is the evaluation of your model's performance. In other words, you want to analyze the degree of correctness of the model's predictions.
For a more abstract view, you can just compare the results of iris_pred to the test labels that you had defined earlier:
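One simple way to make that comparison is a cross-tabulation of predicted against actual labels, plus the share of matching predictions. This sketch assumes the class package is installed and rebuilds the split and the predictions so that it runs on its own; the names confusion and accuracy, like k = 3, are choices made for the example.

```r
library(class)

# Rebuild the split and the predictions
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
iris.testLabels <- iris[ind == 2, 5]
iris_pred <- knn(train = iris[ind == 1, 1:4], test = iris[ind == 2, 1:4],
                 cl = iris[ind == 1, 5], k = 3)

# Rows: predicted class, columns: actual class
confusion <- table(Predicted = iris_pred, Actual = iris.testLabels)
confusion

# Fraction of test rows whose predicted class matches the actual class
accuracy <- mean(iris_pred == iris.testLabels)
accuracy
```

Correct predictions sit on the diagonal of the table; any off-diagonal count is a misclassified flower.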