Data Preprocessing in Python
An implementation
At the heart of machine learning is data processing. Your machine learning models are only as good as the quality of your data. This blog walks through the main steps of cleaning data, which your data needs to go through before it can be used for making predictions.

The Dataset for this blog can be accessed from here.
Steps involved in data preprocessing:
- Importing the required libraries
- Importing the data set
- Handling the missing data
- Encoding categorical data
- Splitting the data set into a test set and a training set
- Feature scaling
So let us look at these steps one by one.
Step 1: Importing the required Libraries
To follow along, you will need to download this data set: Data.csv
Every time we build a new model, we will need to import NumPy and Pandas. NumPy is a library that contains mathematical functions and is used for scientific computing, while Pandas is used to import and manage data sets.
import pandas as pd
import numpy as np
Here we import the Pandas and NumPy libraries and assign them the shortcuts “pd” and “np” respectively.
Step 2: Importing the Dataset
Data sets are often available in .csv format. A CSV file stores tabular data in plain text; each line of the file is a data record. We use the read_csv method of the pandas library to read a local CSV file as a DataFrame.
dataset = pd.read_csv('Data.csv')
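Before building X and y, it helps to take a quick look at what was loaded. The following check is only a sketch and assumes Data.csv sits in the current working directory; it prints the first rows, the shape of the table, and a count of missing values per column.
print(dataset.head())          # first five rows of the data set
print(dataset.shape)           # (number of rows, number of columns)
print(dataset.isnull().sum())  # missing values per column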
After carefully inspecting our data set, we create a matrix of features (X) and a dependent variable vector (y) with their respective observations. To read the columns, we use iloc of pandas (integer-location based indexing, used to select rows and columns by position), which takes two parameters — [row selection, column selection].
X = dataset.iloc[:, :-1].values  # all rows, every column except the last (the independent variables)
y = dataset.iloc[:, 3].values    # all rows, the fourth column (the dependent variable)
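To confirm the selection did what we expect, you can inspect the shapes and a few values. This is just an illustrative check; the exact numbers depend on your copy of Data.csv, which here is assumed to have four columns with the dependent variable in the last one.
print(X.shape)  # e.g. (10, 3): every row, every column except the last
print(y.shape)  # e.g. (10,): the dependent variable column
print(X[:3])    # first three rows of the feature matrix
print(y[:3])    # first three values of the dependent vector
Note that dataset.iloc[:, -1].values would select the last column regardless of how many columns the file has; the explicit index 3 works only because the dependent variable sits in the fourth column.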