In this post, I will show how to run a linear regression in Python. There are many similar articles on the web, but I wanted to write a simple one and share it with you.

Importing the required libraries

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt 

Reading the CSV file

df = pd.read_csv('https://gist.githubusercontent.com/omarish/5687264/raw/7e5c814ce6ef33e25d5259c1fe79463c190800d9/mpg.csv')

Checking the data types of the columns

df.dtypes
mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model_year        int64
origin            int64
name             object
dtype: object
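Every column parses cleanly except horsepower, which comes back as object. That happens when a column mixes numbers with strings — here the dataset marks missing horsepower with '?'. A quick toy illustration (the values are made up, not taken from the file):

```python
import pandas as pd

# A string placeholder like '?' forces the whole column to dtype object.
s = pd.Series(['130', '?', '150'])
print(s.dtype)  # object

# errors='coerce' turns anything unparseable into NaN.
coerced = pd.to_numeric(s, errors='coerce')
print(coerced.tolist())  # [130.0, nan, 150.0]
```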

Data cleaning: finding the rows where horsepower is not numeric

print(df[pd.to_numeric(df['horsepower'], errors='coerce').isnull()])
     mpg  cylinders  displacement horsepower  weight  acceleration  \
32   25.0          4          98.0          ?    2046          19.0   
126  21.0          6         200.0          ?    2875          17.0   
330  40.9          4          85.0          ?    1835          17.3   
336  23.6          4         140.0          ?    2905          14.3   
354  34.5          4         100.0          ?    2320          15.8   
374  23.0          4         151.0          ?    3035          20.5   

     model_year  origin                  name  
32           71       1            ford pinto  
126          74       1         ford maverick  
330          80       2  renault lecar deluxe  
336          80       1    ford mustang cobra  
354          81       2           renault 18i  
374          82       1        amc concord dl  

Converting the horsepower column to numeric, so the '?' placeholders become NaN

df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
cols = df.columns

Coercing every column to numeric, so any remaining unwanted value becomes NaN. Note that this also wipes out the string name column (it shows up as NaN below), which is harmless because we drop it in the next step.

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.applymap(np.isreal)  # optional check: a boolean frame marking which cells are numeric
df.head()
	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
0	18.0	8	307.0	130.0	3504	12.0	70	1	NaN
1	15.0	8	350.0	165.0	3693	11.5	70	1	NaN
2	18.0	8	318.0	150.0	3436	11.0	70	1	NaN
3	16.0	8	304.0	150.0	3433	12.0	70	1	NaN
4	17.0	8	302.0	140.0	3449	10.5	70	1	NaN

Dropping the unnecessary columns, then removing any rows that still contain NaN (the '?' values were already coerced to NaN above, so the replace here is just a safety net)

df = df.drop(['name','origin','model_year'], axis=1)
df = df.replace('?', np.nan)
df = df.dropna()
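After this chain it is worth verifying that no missing values survived. A minimal sketch with a stand-in frame (the real df comes from the CSV above):

```python
import numpy as np
import pandas as pd

# Stand-in for the cleaned frame; one row deliberately has a NaN horsepower.
df = pd.DataFrame({'mpg': [18.0, 25.0, 15.0],
                   'horsepower': [130.0, np.nan, 165.0],
                   'weight': [3504, 2046, 3693]})
df = df.dropna()
print(len(df))                  # 2 -- the NaN row is gone
print(df.isnull().sum().sum())  # 0 -- no missing values remain
```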

Separating the dependent variable (y) from the independent variables (X)

X = df.drop('mpg', axis=1) 
y = df[['mpg']]

Making the training and testing datasets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
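Here test_size=0.2 holds out 20% of the rows for evaluation, and random_state=1 makes the shuffle reproducible. A quick sketch with toy arrays to show the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy samples with 2 features each, standing in for the mpg data.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```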

Linear regression: fitting the best fit line using a single feature (horsepower)

reg = LinearRegression()
reg.fit(X_train[['horsepower']], y_train)

Predicting on the test set

y_predicted = reg.predict(X_test[['horsepower']])
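matplotlib was imported at the top but never used, and this is a natural place for it: with a single feature, the fit can be visualized directly. A hedged sketch using synthetic horsepower/mpg-style data (the numbers are made up for illustration, not drawn from the dataset):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: mpg falls roughly linearly as horsepower rises.
rng = np.random.default_rng(1)
hp = rng.uniform(50, 200, size=60)
mpg = 40 - 0.1 * hp + rng.normal(0, 2, size=60)

reg = LinearRegression().fit(hp.reshape(-1, 1), mpg)
y_line = reg.predict(hp.reshape(-1, 1))

plt.scatter(hp, mpg, label='data')
plt.plot(hp, y_line, color='red', label='best fit line')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.legend()
plt.savefig('best_fit.png')
```

The same scatter-plus-line plot works on the real X_test[['horsepower']] and y_predicted from above.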

Evaluation metrics

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_predicted))
print('R²: %.2f' % r2_score(y_test, y_predicted))
Mean squared error: 28.66
R²: 0.59

Fitting the best fit line using more than one feature

reg.fit(X_train[['horsepower','weight','cylinders']], y_train)
y_predicted = reg.predict(X_test[['horsepower','weight','cylinders']])

print("Mean squared error: %.2f" % mean_squared_error(y_test, y_predicted))
print('R²: %.2f' % r2_score(y_test, y_predicted))
Mean squared error: 19.12
R²: 0.72
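Adding weight and cylinders lowers the MSE and raises R², so the extra features help. As a sanity check on what these metrics mean, both can be computed by hand: MSE is the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# MSE by hand: mean of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# R^2 by hand: 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mean_squared_error(y_true, y_pred))  # both ~0.4167
print(r2, r2_score(y_true, y_pred))             # both 0.84375
```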