Deep Dive in Machine Learning with Python
Part XIV: Initial Data Analysis (IDA) with an Example
Welcome to another post in the Deep Dive in Machine Learning with Python series. In the last post, we touched on the different levels of data analysis.
In today’s blog, we will delve into the core concepts of Initial Data Analysis (IDA) using the Autism Spectrum Disorder (Children) dataset. Thanks to the dataset creators and the UCI ML Repository for providing this dataset.
Import the required Python libraries
Load the ARFF (Attribute-Relation File Format) dataset file
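As a minimal sketch of this step, the snippet below reads ARFF content with `scipy.io.arff.loadarff` and wraps the result in a pandas DataFrame. The tiny in-memory sample and its attribute names are made up for illustration so that the snippet runs on its own; with the real dataset you would pass the path of the downloaded `.arff` file instead.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical stand-in for the real file: three attributes, two records.
# With the downloaded dataset, pass its file path to arff.loadarff instead.
SAMPLE_ARFF = """@relation asd_sample
@attribute age numeric
@attribute gender {m,f}
@attribute jaundice {yes,no}
@data
6,m,no
?,f,yes
"""

# loadarff returns the records (a NumPy structured array) and the metadata.
data, meta = arff.loadarff(io.StringIO(SAMPLE_ARFF))
df = pd.DataFrame(data)

print(meta.names())  # attribute names declared in the ARFF header
print(df)            # nominal values show up as bytes, e.g. b'm'
```

Note that a `?` in a numeric attribute is read as NaN, while nominal values come back as bytes, which is what the next step deals with.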
Step-1: Change the character encoding
If you look at the loaded ASD dataset above, you will find a ‘b’ prefix attached to the text values. This prefix means the values are stored as bytes rather than strings, so we need to decode them using the right character encoding.
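One way to do this, sketched here on a small made-up DataFrame, is to decode every column that holds bytes with `Series.str.decode`. UTF-8 is an assumption; use whichever encoding the file was written in.

```python
import pandas as pd

# Hypothetical sample mimicking what loadarff produces for nominal attributes.
df = pd.DataFrame(
    {
        "age": [6.0, 5.0],
        "gender": [b"m", b"f"],
        "jaundice": [b"no", b"yes"],
    }
)

for col in df.columns:
    # Decode only the columns whose values are all bytes objects;
    # numeric columns such as 'age' are left untouched.
    if df[col].map(lambda value: isinstance(value, bytes)).all():
        df[col] = df[col].str.decode("utf-8")

print(df)
```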
Browse the dataset
Here, we check the number of features in the dataset and display the top 5 records.
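A quick way to get both pieces of information in pandas is `DataFrame.shape` together with `DataFrame.head()`. The sample data below is invented purely so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical sample with six records and three features.
df = pd.DataFrame(
    {
        "AGE": [6.0, 5.0, 11.0, 4.0, 7.0, 9.0],
        "GENDER": ["m", "f", "m", "f", "m", "m"],
        "ASD_Label": ["no", "yes", "no", "no", "yes", "no"],
    }
)

# shape is a (rows, columns) tuple; each column is one feature.
n_records, n_features = df.shape
print(f"{n_records} records, {n_features} features")

# head() returns the first five records by default.
print(df.head())
```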
Step-2: Datatype Handling
Step2.1: ‘AGE’ converted to dtype ‘INT’
Before converting the ‘AGE’ variable from FLOAT to INT, we need to fill its NULL values, because a plain integer column cannot hold them. Hence, we replace the NULLs with 0 as a placeholder and will handle these 4 records later on.
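The idea can be sketched as follows on a small made-up column. Here 0 acts only as a placeholder, so it is worth counting the missing values first to know how many records need attention later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two of the four ages are missing.
df = pd.DataFrame({"AGE": [6.0, np.nan, 11.0, np.nan]})

# Count the NULLs before overwriting them, so they can be revisited later.
n_missing = int(df["AGE"].isna().sum())
print(f"Missing AGE values: {n_missing}")

# Fill NULLs with the placeholder 0, then convert FLOAT to INT.
# Calling astype(int) while NaN is still present would raise an error.
df["AGE"] = df["AGE"].fillna(0).astype(int)
print(df["AGE"].tolist())
```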
Step2.2: Labelling ‘GENDER’ to dtype ‘INT’ (1 represents ‘m’, i.e. male, and 0 represents ‘f’, i.e. female)
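A minimal sketch of this labelling uses `Series.map` with an explicit dictionary. The sample values are invented, and the code assumes the column contains only ‘m’ and ‘f’.

```python
import pandas as pd

# Hypothetical sample of already-decoded gender values.
df = pd.DataFrame({"GENDER": ["m", "f", "f", "m"]})

# Explicit mapping keeps the encoding readable: 1 = male, 0 = female.
gender_codes = {"m": 1, "f": 0}

# Any value missing from the mapping becomes NaN, so astype(int) would fail
# loudly instead of silently mislabelling an unexpected category.
df["GENDER"] = df["GENDER"].map(gender_codes).astype(int)
print(df["GENDER"].tolist())
```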
Step2.3: Labelling ‘BORN_WITH_JAUNDICE’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)
Step2.4: Labelling ‘FAMILY_MEMBER_WITH_PDD’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)
Step2.5: Labelling ‘USED_SCREENING_APP_BEFORE’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)
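Steps 2.3 to 2.5 apply the same yes/no encoding, so they can be handled in one loop. The sketch below assumes the three columns contain exactly the lowercase strings ‘yes’ and ‘no’; the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample rows for the three yes/no features.
df = pd.DataFrame(
    {
        "BORN_WITH_JAUNDICE": ["no", "yes", "no"],
        "FAMILY_MEMBER_WITH_PDD": ["no", "no", "yes"],
        "USED_SCREENING_APP_BEFORE": ["yes", "no", "no"],
    }
)

YES_NO_CODES = {"yes": 1, "no": 0}
binary_columns = [
    "BORN_WITH_JAUNDICE",
    "FAMILY_MEMBER_WITH_PDD",
    "USED_SCREENING_APP_BEFORE",
]

for col in binary_columns:
    # Unmapped values turn into NaN and make astype(int) raise,
    # which surfaces unexpected categories early.
    df[col] = df[col].map(YES_NO_CODES).astype(int)

print(df)
```

If the values differ in case (for example ‘YES’), normalizing them with `.str.lower()` before mapping would be needed.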
Step2.6: Converting the data types of ‘Screening Questions’ variables to ‘INT’
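As a rough sketch, the screening-question columns can be converted together with a single `astype` call. The column names `A1_SCORE` to `A10_SCORE` are hypothetical placeholders (substitute whatever names your DataFrame uses), and the values are assumed to be the strings ‘0’ and ‘1’ left over from decoding.

```python
import pandas as pd

# Hypothetical names for the ten screening-question columns.
question_columns = [f"A{i}_SCORE" for i in range(1, 11)]

# Made-up answers stored as strings, as they would be after decoding bytes.
df = pd.DataFrame({col: ["1", "0", "1"] for col in question_columns})

# Passing a {column: dtype} mapping converts all listed columns in one call.
df = df.astype({col: int for col in question_columns})

print(df.dtypes)
```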
Step2.7: Converting ‘SCREENING_SCORE’ to dtype ‘INT’
Step2.8: Labelling ‘ASD_Label’ to dtype ‘INT’ (1 corresponds to ‘yes’ and 0 to ‘no’)
Step2.9: Standardizing data of ‘WHOS_COMPLETING_TEST’
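What standardizing means here is not spelled out, so the following is only one plausible sketch: trim whitespace, unify letter case so that variants such as ‘self’ and ‘Self’ collapse into one category, and mark a ‘?’ placeholder as missing. The sample values and the choice of upper case are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with inconsistent casing, stray whitespace and a '?' marker.
df = pd.DataFrame(
    {"WHOS_COMPLETING_TEST": ["Parent", "self", "Self", " Relative", "?"]}
)

# Strip surrounding whitespace and unify the case.
cleaned = df["WHOS_COMPLETING_TEST"].str.strip().str.upper()

# Treat '?' as a missing value so it can be filled in a later step.
df["WHOS_COMPLETING_TEST"] = cleaned.replace("?", np.nan)

print(df["WHOS_COMPLETING_TEST"].value_counts(dropna=False))
```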
DataFrame after the first round of cleaning
Congratulations, we have come to the end of this blog. To summarize, we covered the first two stages of Initial Data Analysis (IDA).
Follow to get notified about the upcoming posts, where we will fill the MISSING values in ‘ETHNICITY’ and ‘WHOS_COMPLETING_TEST’ and build our first Machine Learning Regression model to predict the missing values in ‘AGE’.
If you want to download the Jupyter Notebook of this blog, kindly access the GitHub repository below:
Thank you and happy learning!!
Blog-15: Initial Data Analysis -II