
Monday, March 21, 2022

Regression using Random Forest, SVM, and MLP

 

Regression is the process of estimating the relationships between a dependent (or target) variable and one or more independent (or predictor) variables. It is used in inference analysis and is a handy technique for forecasting future trends in data.

Example: Consider an HR head who wants to fix the salary of a new employee. To finalize the salary, the head considers various parameters such as the level of education, number of years of experience, last position held, expertise level, etc. If the salary is predicted using only one parameter, say 'no. of years of experience', this type of regression is called Simple Linear Regression (one target and one predictor variable). If multiple parameters, say 'level of education', 'no. of years of experience', and 'last position held', are used to fix the salary, it becomes Multiple Linear Regression (single target, multiple predictor variables).
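The difference between one predictor and several predictors can be sketched in a few lines. This is only an illustration: the numbers are made up, and `LinearRegression` is used here as an assumed stand-in for any regression model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, made-up data: years of experience, education level (1-3), salary.
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
education = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
salary = 20000 + 5000 * experience + 3000 * education

# One predictor column, one target.
X_simple = experience.reshape(-1, 1)
simple_model = LinearRegression().fit(X_simple, salary)

# Several predictor columns, still one target.
X_multiple = np.column_stack([experience, education])
multiple_model = LinearRegression().fit(X_multiple, salary)

print("Coefficients with one predictor:", simple_model.coef_)
print("Coefficients with two predictors:", multiple_model.coef_)
```

In both cases the target is a single column; only the number of columns in the predictor matrix changes.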

Irrespective of the model you choose for performing regression with a single predictor, you need to complete the following steps.

  1. Prepare the training data: This step may involve operations such as data cleaning, transformation etc. 
  2. Create the model for prediction: During this step, the model of your choice needs to be initialized and configured.  
  3. Train the model: During this step, the model is trained on the data created in step 1 above.
  4. Deploy the model for prediction: This step accepts the test data and predicts the value of the target variable.
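The four steps above can be sketched as follows. The data values, the column names `experience` and `salary`, and the choice of `LinearRegression` are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Step 1: prepare the training data (hypothetical values, one salary is missing).
raw = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "salary": [30000.0, 35000.0, np.nan, 45000.0, 50000.0, 55000.0],
})
clean = raw.dropna()                         # data cleaning: drop incomplete rows
X_train = clean[["experience"]].to_numpy()   # 2-D array, one column per predictor
y_train = clean["salary"].to_numpy()         # 1-D array of targets

# Step 2: create and configure the model.
model = LinearRegression()

# Step 3: train the model on the prepared data.
model.fit(X_train, y_train)

# Step 4: use the trained model to predict the target for new input.
X_new = np.array([[7.4]])
y_new = model.predict(X_new)
print("Predicted salary:", y_new)
```

Any scikit-learn regressor can be swapped in at step 2 without changing the other steps, because they all share the same `fit` and `predict` interface.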

In this article, let us explore three simple ways of performing single-predictor regression using the following models:

  1. Random Forest
  2. Support Vector Machine (SVM)
  3. Multi Layer Perceptron (MLP)

Let us consider the training data from the file 'Salaries.csv'.


Problem Statement: Using this data, we want to predict the salary of a new person (target variable) using the parameter 'no. of years of experience' (predictor variable).


Let us explore these regression models.
 

1. Random Forest: A random forest is an ensemble that consists of many decision trees. It uses bagging and feature randomness when building each individual tree. For classification, the forest returns the class predicted by the majority of the trees; for regression, as in this article, it returns the average of the predictions of the individual trees.

       The 'sklearn' library in Python can be used to create the random forest as shown below.

 

Python3

# prediction using Random forest

# Importing the libraries

import numpy as np

import pandas as pd

from sklearn.ensemble import RandomForestRegressor

Now let us load the training data set.

Python3

data = pd.read_csv('Salaries.csv')

x = data.iloc[:, 1:2].values  # so x=Yrs. of Experience

y = data.iloc[:, 2].values    # so y= Salary in Rs.

The next step is to initialize the Random Forest model and feed the training dataset to it.

Python3

# Create a Random Forest model. Default number of trees = 100

model = RandomForestRegressor()

#Train the model using the training data

model.fit(x, y)

Once the model is trained, you can use it for prediction. Let us try to predict the salary of a person who has 7.4 years of experience.

Python3

 

#Predict the salary for test dataset

Y_pred = model.predict(np.array([7.4]).reshape(1, 1)) # test the output by changing values

print("Predicted Salary=", Y_pred)

 

Output: Predicted Salary= [82500.]
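For regression, scikit-learn's random forest averages the predictions of its individual trees. The sketch below checks this on made-up data; the names `x_demo`, `y_demo`, and `forest` and all values are assumptions for illustration and are not taken from 'Salaries.csv'.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: salary grows roughly linearly with experience, plus noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(40, 1))
y_demo = 20000 + 8000 * x_demo[:, 0] + rng.normal(0, 2000, size=40)

# random_state fixes the randomness so that repeated runs give the same forest.
forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(x_demo, y_demo)

query = np.array([[7.4]])

# Ask every tree in the ensemble for its own prediction.
tree_predictions = np.array([tree.predict(query)[0] for tree in forest.estimators_])
forest_prediction = forest.predict(query)[0]

print("Mean of the tree predictions:", tree_predictions.mean())
print("Forest prediction:", forest_prediction)
```

Because each tree is built from a random bootstrap sample, a forest created without `random_state` can give slightly different predictions from one run to the next.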

 

2. Support Vector Machine (SVM): A support vector machine (SVM) is a supervised machine learning model that can be used for both classification and regression. Once an SVM model has been trained on labeled data, it can predict the target for new inputs. SVM models use kernel functions to work in a higher-dimensional feature space without computing the coordinates in that space explicitly, which lets them capture non-linear relationships without that extra computation.

 The 'sklearn' library in Python can be used to create the SVM as shown below.

Python3

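# prediction using SVM

from sklearn import svm

import numpy as np

import pandas as pd

 

data = pd.read_csv('Salaries.csv')

x = data.iloc[:, 1:2].values  # so x=Yrs. of Experience

y = data.iloc[:, 2].values    # so y= Salary in Rs.

 

#Create an SVM regressor with a linear kernel (svm.SVC is a classifier; svm.SVR is the regressor)

model = svm.SVR(kernel='linear')

#Train the model using the training data

model.fit(x,y)

 

#Predict the salary for test dataset

y_pred = model.predict(np.array([7.4]).reshape(1, 1))

print("Predicted Salary=", y_pred)

 

Output: a one-element array containing the predicted salary. The value [80000] reported for this step came from the classifier svm.SVC, which can only return a salary that already occurs in the training data; svm.SVR returns a continuous estimate, so the printed number will differ.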
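Support vector regression is sensitive to the scale of both the predictor and the target, and salaries are large numbers compared with the default settings of `SVR`. One possible way to handle this, sketched here on made-up data, is to standardize both sides with `StandardScaler`. The data, the value `C=10.0`, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: salary rises by 5000 for each year of experience.
x_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 25000 + 5000 * x_demo[:, 0]

# The inner pipeline standardizes the predictor before it reaches the SVR.
# TransformedTargetRegressor standardizes the target for training and
# converts predictions back to the original salary scale.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0)),
    transformer=StandardScaler(),
)
scaled_svr.fit(x_demo, y_demo)

prediction = scaled_svr.predict(np.array([[7.4]]))
print("Predicted salary:", prediction)
```

The prediction is returned in the original salary units, so the calling code does not need to undo the scaling itself.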

3. Multi Layer Perceptron (MLP): It is one of the most common neural network models used in machine learning. A multi-layer perceptron consists of interconnected neurons that pass information to each other. The MLP is a feedforward neural network, which means that the data flows from the input layer to the output layer in the forward direction. The connections between the layers are assigned weights, and the weight of a connection specifies its importance. During training, backpropagation computes how the prediction error changes with each weight, and an optimizer adjusts the weights until the error stops decreasing.

The 'sklearn' library in Python can be used to create the MLP regressor as shown below.

Python3

# prediction using NN: MLP

 

from sklearn.neural_network import MLPRegressor

import pandas as pd

import numpy as np

 

data = pd.read_csv('Salaries.csv')

x = data.iloc[:, 1:2].values  # so x=Yrs. of Experience

y = data.iloc[:, 2].values    # so y= Salary in Rs.

 

# create the MLPRegressor model

nn = MLPRegressor(solver='lbfgs', alpha=1e-1, hidden_layer_sizes=(5, 2), random_state=0)

#Train the model using the training sets

nn.fit(x,y)

 

#predict the salary of a person who has experience of 7.4 years.

y_pred = nn.predict(np.array([7.4]).reshape(1, 1))

print("Predicted Salary=", y_pred)

 

Output: Predicted Salary= [88285.71344169]
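The feedforward idea can be made concrete by repeating the forward pass by hand with the weights that a trained `MLPRegressor` stores in `coefs_` and `intercepts_`. The data and the helper `manual_forward` below are hypothetical; the sketch assumes the default ReLU activation in the hidden layers and the identity activation at the output.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical small data set with a linear trend.
x_demo = np.linspace(0, 10, 30).reshape(-1, 1)
y_demo = 2.0 + 0.5 * x_demo[:, 0]

mlp = MLPRegressor(solver="lbfgs", alpha=1e-1, hidden_layer_sizes=(5, 2),
                   random_state=0, max_iter=500)
mlp.fit(x_demo, y_demo)

def manual_forward(model, inputs):
    """Propagate inputs layer by layer using the learned weights and biases."""
    activation = inputs
    last_layer = len(model.coefs_) - 1
    for index, (weights, biases) in enumerate(zip(model.coefs_, model.intercepts_)):
        activation = activation @ weights + biases
        if index < last_layer:
            # Hidden layers use ReLU (the MLPRegressor default); the output layer is linear.
            activation = np.maximum(activation, 0.0)
    return activation.ravel()

query = np.array([[7.4]])
print("Manual forward pass:", manual_forward(mlp, query))
print("mlp.predict:        ", mlp.predict(query))
```

The two printed values agree because `predict` performs the same sequence of matrix multiplications and activations.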

Conclusion: The three models produce different predictions for the same input, as the outputs above show. Prediction accuracy, measured on data that the model has not seen during training, is therefore an important factor in selecting the proper model for a prediction task.
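One common way to compare prediction accuracy is to hold back part of the data and measure the error of each model on it. The sketch below does this with the mean absolute error on made-up data (salary expressed in thousands); the data, the model settings, and the names `candidates` and `errors` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Hypothetical data: salary (in thousands) versus years of experience, with noise.
rng = np.random.default_rng(0)
x_all = rng.uniform(0, 10, size=(80, 1))
y_all = 20 + 8 * x_all[:, 0] + rng.normal(0, 3, size=80)

# Keep 25% of the rows aside; the models never see them during training.
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.25, random_state=0)

candidates = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "SVM (SVR)": SVR(kernel="linear"),
    "MLP": MLPRegressor(solver="lbfgs", hidden_layer_sizes=(5, 2),
                        random_state=0, max_iter=1000),
}

errors = {}
for name, candidate in candidates.items():
    candidate.fit(x_train, y_train)
    errors[name] = mean_absolute_error(y_test, candidate.predict(x_test))

for name, error in errors.items():
    print(f"{name}: mean absolute error = {error:.2f}")
```

A lower mean absolute error on the held-out rows indicates a better fit for this particular data set; the ranking can change with different data or model settings.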

 

 


