Deep Dive in Machine Learning with Python

Part — XIV: Initial Data Analysis (IDA) with example

Rajesh Sharma
Analytics Vidhya


Welcome to another blog of Deep Dive in Machine Learning with Python. In the last blog, we touched on the different levels of data analysis.

In today’s blog, we will delve into the core concepts of Initial Data Analysis (IDA) using the Autism Spectrum Disorder (Children) dataset. Thanks to the dataset creators and the UCI ML Repository for providing this dataset.

Import the required Python libraries

Python packages

Load the ARFF (Attribute-Relation File Format) dataset file

Loaded ASD dataset
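
The notebook cells for these two steps survive here only as image captions, so below is a minimal sketch of one common way to import the libraries and load an ARFF file. It assumes `scipy.io.arff.loadarff` as the loader, and it reads a tiny in-memory ARFF string with made-up attribute names and rows instead of the real dataset file.

```python
import io

import pandas as pd
from scipy.io import arff

# Tiny stand-in for the real ARFF file; attribute names and rows are illustrative only.
toy_arff = """@relation asd_toy
@attribute age numeric
@attribute gender {m,f}
@attribute jaundice {yes,no}
@data
6,m,no
?,f,yes
5,m,no
"""

# loadarff accepts a file path or a text file-like object and returns (records, metadata).
data, meta = arff.loadarff(io.StringIO(toy_arff))
asd_df = pd.DataFrame(data)

# Nominal values show up as bytes (b'm', b'no'); a '?' in a numeric column becomes NaN.
print(asd_df)
```

With the real dataset, the `io.StringIO(...)` argument would be replaced by the path to the downloaded `.arff` file.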

Step-1: Change the character encoding

If you look at the image above (i.e. Loaded ASD dataset), you will find a ‘b’ prefix on the text values. This indicates that those values are stored as bytes, so we need to decode them into regular strings.

Character encoding
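
A minimal sketch of the decoding step, assuming the bytes are UTF-8 and that only the object-typed columns need decoding. The toy frame below stands in for the loaded dataset; its values are illustrative.

```python
import pandas as pd

# Toy frame mimicking a freshly loaded ARFF file: text values arrive as bytes.
asd_df = pd.DataFrame({
    "AGE": [6.0, None, 5.0],
    "GENDER": [b"m", b"f", b"m"],
    "ASD_Label": [b"no", b"yes", b"no"],
})

# Decode every object-typed column; numeric columns such as AGE are left untouched.
for col in asd_df.columns:
    if asd_df[col].dtype == object:
        asd_df[col] = asd_df[col].str.decode("utf-8")

print(asd_df["GENDER"].tolist())
```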

Browse the dataset

Here, we check the number of features in the dataset and display its top 5 records.
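
As a rough sketch, browsing usually comes down to `shape` and `head()`. The frame below is a toy stand-in with illustrative values, not the real data.

```python
import pandas as pd

# Toy stand-in for the decoded dataset.
asd_df = pd.DataFrame({
    "AGE": [6.0, 6.0, 5.0, 4.0, 11.0, 7.0],
    "GENDER": ["m", "m", "f", "f", "m", "m"],
    "ASD_Label": ["no", "no", "yes", "no", "yes", "yes"],
})

n_records, n_features = asd_df.shape
print(f"{n_records} records, {n_features} features")

# head() returns the first 5 rows by default.
top5 = asd_df.head()
print(top5)
```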

Step-2: Datatype Handling

Features datatypes
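
One way to inspect the datatypes and list the features that still need converting is sketched below. The toy frame and the `to_convert` name are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in: a float column, a text column and a float score.
asd_df = pd.DataFrame({
    "AGE": [6.0, 5.0, 4.0],
    "GENDER": ["m", "f", "m"],
    "SCREENING_SCORE": [5.0, 8.0, 10.0],
})

# dtypes shows the datatype pandas assigned to each feature.
print(asd_df.dtypes)

# Collect every feature that is not yet stored as an integer.
to_convert = [
    col for col in asd_df.columns
    if not pd.api.types.is_integer_dtype(asd_df[col])
]
print(to_convert)
```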

Step2.1: ‘AGE’ converted to dtype ‘INT’

Before converting the data type of the ‘AGE’ variable from FLOAT to INT, we need to fill its NULL values, because a NULL cannot be cast to an integer. So, we replace the NULLs with 0 for now and will handle these 4 records later on.

Filled the NULL values in ‘AGE’
AGE dtype converted to INT
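
The two captions above can be sketched as follows, using a toy ‘AGE’ column with one missing value. The `missing_age` mask is an addition for illustration, so that the placeholder zeros can be found again later.

```python
import numpy as np
import pandas as pd

# Toy AGE column with one missing value (the real dataset has 4).
asd_df = pd.DataFrame({"AGE": [6.0, np.nan, 5.0, 11.0]})

# Remember which records were missing before overwriting them with 0.
missing_age = asd_df["AGE"].isna()

# Fill the NULLs with the placeholder 0, then cast FLOAT to INT.
asd_df["AGE"] = asd_df["AGE"].fillna(0).astype(int)

print(asd_df["AGE"].tolist())
print("records to revisit:", int(missing_age.sum()))
```

An alternative that avoids the placeholder is pandas' nullable integer dtype, `astype("Int64")`, which keeps missing values as missing.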

Step2.2: Labelling ‘GENDER’ to dtype ‘INT’ (1 represents ‘m’, i.e. male, and 0 represents ‘f’, i.e. female)

COUNT of MALES and FEMALES
GENDER encoded to 0 and 1
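
A minimal sketch of this step, assuming the decoded ‘GENDER’ column holds only ‘m’ and ‘f’. The toy values are illustrative.

```python
import pandas as pd

# Toy GENDER column after decoding.
asd_df = pd.DataFrame({"GENDER": ["m", "f", "m", "m"]})

# Count of males and females before encoding.
print(asd_df["GENDER"].value_counts())

# Encode: 1 for male, 0 for female.
asd_df["GENDER"] = asd_df["GENDER"].map({"m": 1, "f": 0}).astype(int)

print(asd_df["GENDER"].tolist())
```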

Step2.3: Labelling ‘BORN_WITH_JAUNDICE’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)

Before Labelling
After Labelling

Step2.4: Labelling ‘FAMILY_MEMBER_WITH_PDD’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)

Before Labelling
After Labelling

Step2.5: Labelling ‘USED_SCREENING_APP_BEFORE’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)

Before Labelling
After Labelling
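
Steps 2.3 to 2.5 apply the same yes/no labelling, so one way to sketch them is a single loop over the three columns. The `YES_NO` dictionary name and the toy values are assumptions, and the check for unexpected values is an extra safeguard rather than something taken from the original notebook.

```python
import pandas as pd

YES_NO = {"yes": 1, "no": 0}
yes_no_cols = [
    "BORN_WITH_JAUNDICE",
    "FAMILY_MEMBER_WITH_PDD",
    "USED_SCREENING_APP_BEFORE",
]

# Toy stand-in for the three yes/no features.
asd_df = pd.DataFrame({
    "BORN_WITH_JAUNDICE": ["no", "yes", "no"],
    "FAMILY_MEMBER_WITH_PDD": ["no", "no", "yes"],
    "USED_SCREENING_APP_BEFORE": ["no", "no", "no"],
})

for col in yes_no_cols:
    encoded = asd_df[col].map(YES_NO)
    # map() yields NaN for any value missing from the dictionary, so fail loudly.
    if encoded.isna().any():
        raise ValueError(f"unexpected value in {col}")
    asd_df[col] = encoded.astype(int)

print(asd_df)
```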

Step2.6: Converting the data types of ‘Screening Questions’ variables to ‘INT’
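
A hedged sketch of this conversion is given below. The column names (`Q1_SCORE` and so on) are hypothetical placeholders, since the real names of the screening-question variables are not listed here, and the sketch assumes the answers arrive as the strings ‘0’ and ‘1’ after decoding.

```python
import pandas as pd

# Hypothetical names for the screening-question columns.
question_cols = [f"Q{i}_SCORE" for i in range(1, 4)]

# Toy stand-in: answers stored as text after decoding.
asd_df = pd.DataFrame({col: ["1", "0", "1"] for col in question_cols})

# Cast all question columns to INT in one call.
asd_df = asd_df.astype({col: int for col in question_cols})

print(asd_df.dtypes)
```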

Step2.7: Converting ‘SCREENING_SCORE’ to dtype ‘INT’

Before dtype conversion
After dtype conversion
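
A minimal sketch, assuming ‘SCREENING_SCORE’ was loaded as FLOAT. The whole-number check is an added precaution, so that the cast does not silently truncate a fractional score.

```python
import pandas as pd

# Toy SCREENING_SCORE column loaded as float.
asd_df = pd.DataFrame({"SCREENING_SCORE": [5.0, 8.0, 10.0]})

# Make sure every score is a whole number before casting.
if not (asd_df["SCREENING_SCORE"] % 1 == 0).all():
    raise ValueError("SCREENING_SCORE contains fractional values")

asd_df["SCREENING_SCORE"] = asd_df["SCREENING_SCORE"].astype(int)

print(asd_df["SCREENING_SCORE"].tolist())
```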

Step2.8: Labelling ‘ASD_Label’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)

Before Labelling
After Labelling

Step2.9: Standardizing data of ‘WHOS_COMPLETING_TEST’

Before Standardizing
After Standardizing
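
The exact standardizing rules are visible only in the original images, so the sketch below shows just one plausible reading: trim whitespace, unify the letter case, and treat a ‘?’ placeholder as missing. The toy values, the choice of upper case, and the ‘?’ placeholder are all assumptions.

```python
import pandas as pd

# Toy stand-in with inconsistent spellings and a '?' placeholder.
asd_df = pd.DataFrame(
    {"WHOS_COMPLETING_TEST": ["Parent", "self", " Self", "PARENT", "?"]}
)

# Trim whitespace and unify the letter case.
cleaned = asd_df["WHOS_COMPLETING_TEST"].str.strip().str.upper()

# mask() replaces entries where the condition is True with a missing value.
asd_df["WHOS_COMPLETING_TEST"] = cleaned.mask(cleaned == "?")

print(asd_df["WHOS_COMPLETING_TEST"].value_counts(dropna=False))
```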

First-hand cleaned DataFrame


Congratulations, we have come to the end of this blog. To summarize, we covered the first two stages of Initial Data Analysis (IDA).

Follow to get notified of the upcoming posts, where we will fill the MISSING values in ‘ETHNICITY’ and ‘WHOS_COMPLETING_TEST’ and build our first Machine Learning Regression model to predict the missing values in ‘AGE’.

If you want to download the Jupyter Notebook of this blog, you can find it in the GitHub repository below:

https://github.com/Rajesh-ML-Engg/Autism_Spectrum_Disorder

Thank you and happy learning!!

Blog-15: Initial Data Analysis -II
