Data Preprocessing in Python
An implementation
At the heart of machine learning is data processing. Your machine learning models are only as good as the quality of your data. This blog walks through the main steps of cleaning data, which your data needs to go through before it can be used for making predictions.

The Dataset for this blog can be accessed from here.
Steps involved in data preprocessing:
- Importing the required libraries
- Importing the data set
- Handling the missing data
- Encoding categorical data
- Splitting the data set into a test set and a training set
- Feature scaling
So let us look at these steps one by one.
Step 1: Importing the required Libraries
To follow along, you will need to download this data set: Data.csv
Every time we build a new model, we will need to import NumPy and Pandas. NumPy is a library that contains mathematical functions and is used for scientific computing, while Pandas is used to import and manage data sets.
import pandas as pd
import numpy as np
Here we import the Pandas and NumPy libraries and assign them the shortcuts “pd” and “np” respectively.
Step 2: Importing the Dataset
Data sets are often available in .csv format. A CSV file stores tabular data in plain text; each line of the file is a data record. We use the read_csv method of the pandas library to read a local CSV file as a DataFrame.
dataset = pd.read_csv('Data.csv')
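Before building X and y, it helps to take a quick look at what was loaded. The following check is only a sketch and assumes Data.csv sits in the current working directory; it prints the first rows, the shape of the table, and a count of missing values per column.
print(dataset.head())          # first five rows of the data set
print(dataset.shape)           # (number of rows, number of columns)
print(dataset.isnull().sum())  # missing values per column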
After carefully inspecting our data set, we create a matrix of features (X) and a dependent variable vector (y) with their respective observations. To read the columns, we use iloc of pandas (integer-location based indexing, used to select rows and columns by position), which takes two parameters — [row selection, column selection].
X = dataset.iloc[:, :-1].values  # all rows, every column except the last (the independent variables)
y = dataset.iloc[:, 3].values    # all rows, the fourth column (the dependent variable)
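To confirm the selection did what we expect, you can inspect the shapes and a few values. This is just an illustrative check; the exact numbers depend on your copy of Data.csv, which here is assumed to have four columns with the dependent variable in the last one.
print(X.shape)  # e.g. (10, 3): every row, every column except the last
print(y.shape)  # e.g. (10,): the dependent variable column
print(X[:3])    # first three rows of the feature matrix
print(y[:3])    # first three values of the dependent vector
Note that dataset.iloc[:, -1].values would select the last column regardless of how many columns the file has; the explicit index 3 works only because the dependent variable sits in the fourth column.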