Deep Dive in Machine Learning with Python

Part — XIV: Initial Data Analysis (IDA) with example

Published in Analytics Vidhya · 4 min read · Feb 15, 2020

Welcome to another blog in the Deep Dive in Machine Learning with Python series. In the last blog, we touched on the different levels of data analysis.

In today’s blog, we will delve into the core concepts of Initial Data Analysis (IDA) by using the Autism Spectrum Disorder (Children) dataset. Thanks to the Dataset creators and UCI ML Repository for providing this dataset.

Import the required Python libraries

Load the ARFF (Attribute-Relation File Format) dataset file
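As a minimal sketch of this step, assuming the file is read with SciPy's `scipy.io.arff.loadarff` (the post itself does not show which loader is used), the snippet below loads a tiny, made-up ARFF sample from memory. The attribute names and values are hypothetical stand-ins for the real ASD file; in practice you would pass the path of the downloaded `.arff` file instead of the `StringIO` object.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical miniature ARFF content standing in for the real ASD file.
sample_arff = """@relation asd_sample
@attribute age numeric
@attribute gender {m,f}
@attribute jaundice {yes,no}
@data
6,m,no
?,f,yes
"""

# loadarff accepts a file path or a file-like object and
# returns a tuple of (records, metadata).
records, meta = arff.loadarff(StringIO(sample_arff))

# Wrap the structured array in a DataFrame for easier handling.
asd_df = pd.DataFrame(records)
print(asd_df)
```

Nominal attributes come back as Python bytes (printed with a `b` prefix), and a `?` in a numeric attribute is read as NaN.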

Step-1: Change the character encoding

If you inspect the loaded ASD dataset, you will find a ‘b’ prefix attached to the string values. This means that the data is stored as bytes, hence we need to decode it into regular strings.
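One way to do this, sketched here on a small made-up DataFrame, is to decode every bytes value in the object-typed columns. The helper name `decode_bytes_columns` and the choice of UTF-8 are assumptions for illustration, not necessarily what the original notebook uses.

```python
import pandas as pd

# Hypothetical sample mimicking what the loader returns: nominal values as bytes.
raw_df = pd.DataFrame({
    "gender": [b"m", b"f"],
    "age": [6.0, 5.0],
    "jaundice": [b"no", b"yes"],
})


def decode_bytes_columns(frame, encoding="utf-8"):
    """Return a copy of `frame` with bytes values decoded to str."""
    out = frame.copy()
    for col in out.columns:
        # Only object columns can hold bytes; numeric columns are left alone.
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


decoded_df = decode_bytes_columns(raw_df)
print(decoded_df)
```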

Browse the dataset

Here, we check the number of features in the dataset and display the top 5 records.
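A quick sketch of this check, using a hypothetical six-row sample in place of the real dataset:

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more rows and columns.
asd_df = pd.DataFrame({
    "AGE": [6.0, 6.0, 5.0, 4.0, 11.0, 7.0],
    "GENDER": ["m", "m", "f", "f", "m", "f"],
    "ASD_Label": ["no", "yes", "no", "yes", "yes", "no"],
})

# shape is a (rows, columns) tuple, so the second entry is the feature count.
n_records, n_features = asd_df.shape
print(f"Records: {n_records}, features: {n_features}")

# head() returns the first 5 rows by default.
top5 = asd_df.head()
print(top5)
```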

Step-2: Datatype Handling

Step2.1: Converting ‘AGE’ to dtype ‘INT’

Before converting the data type of the ‘AGE’ variable from FLOAT to INT, we need to fill its NULL values, because a NULL (NaN) cannot be cast to an integer. Hence, we replace the NULLs with 0 for now and handle these 4 records later on.
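The idea can be sketched as follows on a made-up ‘AGE’ column. The `missing_age` mask is an addition for illustration, so that the placeholder zeros can still be told apart from real values later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing age.
asd_df = pd.DataFrame({"AGE": [6.0, np.nan, 5.0, 11.0]})

# Remember which rows were NULL before overwriting them with the placeholder.
missing_age = asd_df["AGE"].isna()

# Fill NULLs with 0, then cast from float to int.
asd_df["AGE"] = asd_df["AGE"].fillna(0).astype(int)

print(asd_df)
print("Rows to revisit:", missing_age.sum())
```

An alternative, if a placeholder value is undesirable, is pandas' nullable `Int64` dtype, which can hold integers and missing values in the same column.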

Step2.2: Labelling ‘GENDER’ to dtype ‘INT’ (1 represents ‘m’, i.e. male, and 0 represents ‘f’, i.e. female)

Step2.3: Labelling ‘BORN_WITH_JAUNDICE’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)

Step2.4: Labelling ‘FAMILY_MEMBER_WITH_PDD’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)

Step2.5: Labelling ‘USED_SCREENING_APP_BEFORE’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)
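Steps 2.2 to 2.5 all follow the same pattern, so a single sketch covers them. It assumes the values have already been decoded to plain strings and uses a made-up three-row sample. Using `Series.map` with a dictionary is one possible approach, not necessarily the one in the original notebook.

```python
import pandas as pd

# Hypothetical sample with already-decoded string values.
asd_df = pd.DataFrame({
    "GENDER": ["m", "f", "m"],
    "BORN_WITH_JAUNDICE": ["no", "yes", "no"],
    "FAMILY_MEMBER_WITH_PDD": ["no", "no", "yes"],
    "USED_SCREENING_APP_BEFORE": ["no", "no", "yes"],
})

gender_map = {"m": 1, "f": 0}
yes_no_map = {"yes": 1, "no": 0}

# Step 2.2: m -> 1, f -> 0
asd_df["GENDER"] = asd_df["GENDER"].map(gender_map).astype(int)

# Steps 2.3 to 2.5: yes -> 1, no -> 0
yes_no_cols = [
    "BORN_WITH_JAUNDICE",
    "FAMILY_MEMBER_WITH_PDD",
    "USED_SCREENING_APP_BEFORE",
]
for col in yes_no_cols:
    asd_df[col] = asd_df[col].map(yes_no_map).astype(int)

print(asd_df.dtypes)
```

Any value missing from the dictionary becomes NaN after `map`, so the `astype(int)` call fails loudly instead of silently mislabelling an unexpected category.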

Step2.6: Converting the data types of ‘Screening Questions’ variables to ‘INT’
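A minimal sketch of this conversion is shown below. The column names `A1_Score` to `A3_Score` are assumptions for illustration (the post does not list the actual names of the screening question columns), and the answers are assumed to be the strings ‘0’ and ‘1’ after decoding.

```python
import pandas as pd

# Hypothetical screening-question columns holding "0"/"1" strings.
question_cols = [f"A{i}_Score" for i in range(1, 4)]
asd_df = pd.DataFrame({
    "A1_Score": ["1", "0"],
    "A2_Score": ["1", "1"],
    "A3_Score": ["0", "0"],
})

# Cast all question columns to int in one call.
asd_df = asd_df.astype({col: int for col in question_cols})

print(asd_df.dtypes)
```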

Step2.7: Converting ‘SCREENING_SCORE’ to dtype ‘INT’

Step2.8: Labelling ‘ASD_Label’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)
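Steps 2.7 and 2.8 can be sketched together on a made-up sample. It assumes that ‘SCREENING_SCORE’ was loaded as a float with no missing values, and it lower-cases ‘ASD_Label’ before mapping as a precaution in case the raw labels are capitalised differently.

```python
import pandas as pd

# Hypothetical sample: score loaded as float, label as a yes/no string.
asd_df = pd.DataFrame({
    "SCREENING_SCORE": [5.0, 8.0, 3.0],
    "ASD_Label": ["no", "YES", "no"],
})

# Step 2.7: float -> int (safe only when the column has no NULLs).
asd_df["SCREENING_SCORE"] = asd_df["SCREENING_SCORE"].astype(int)

# Step 2.8: yes -> 1, no -> 0, ignoring letter case.
asd_df["ASD_Label"] = (
    asd_df["ASD_Label"].str.lower().map({"yes": 1, "no": 0}).astype(int)
)

print(asd_df)
```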

Step2.9: Standardizing data of ‘WHOS_COMPLETING_TEST’
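The post does not spell out what standardizing means here, so the following is only a sketch under assumptions: the same category may appear with different letter case or stray whitespace, and ‘?’ marks a missing value. The sample values are made up.

```python
import pandas as pd

# Hypothetical raw values with inconsistent case, whitespace and a "?" marker.
asd_df = pd.DataFrame({
    "WHOS_COMPLETING_TEST": [
        "Parent", "self", "Self ", "?", "Health care professional",
    ]
})

# Trim whitespace and use one consistent letter case.
cleaned = asd_df["WHOS_COMPLETING_TEST"].str.strip().str.upper()

# Treat "?" as missing so it can be filled in a later step.
cleaned = cleaned.mask(cleaned == "?")

asd_df["WHOS_COMPLETING_TEST"] = cleaned
print(asd_df["WHOS_COMPLETING_TEST"].value_counts(dropna=False))
```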

First-hand cleaned DataFrame

Congratulations, we have come to the end of this blog. To summarize, we covered the first two stages of Initial Data Analysis (IDA).

Follow to get notified about the upcoming posts, where we will fill the MISSING values in ‘ETHNICITY’ and ‘WHOS_COMPLETING_TEST’ and build our first Machine Learning regression model to predict the missing values in ‘AGE’.

If you want to download the Jupyter Notebook of this blog, you can find it in the GitHub repository below:
https://github.com/Rajesh-ML-Engg/Autism_Spectrum_Disorder

Thank you and happy learning!!

Blog-15: Initial Data Analysis -II