Learning Python for Data Science: Data Inspection and Data Types

Data = Understanding

Zainul Arifin
Nerd For Tech

--

Picture by Alexander Sinn from Unsplash

Before we go any further, a little disclaimer: I am by no means an expert in Python or the Pandas library. In fact, it is quite the opposite. I am an avid user of R and its libraries, but I picked up Python programming just recently because I am interested in learning data science with Python. Sorry for being unfaithful, R 😢

What I hope to achieve by writing this post is, first, to help myself, and second, to help other people who are just starting to program in Python.

Without further stalling, let's get to it!

From my years of programming in R, the first and perhaps most crucial step in data analysis is getting the data itself. So let's use a dataset that people in data science have probably used at least once: the iris dataset. We are going to use the statsmodels package to get the iris data in Python. For data inspection, we are going to use Pandas, so let's install that as well. Use pip to install the packages if you haven't done so.

#install statsmodels and pandas with pip
pip install statsmodels
pip install pandas

To save the iris dataset into a variable in jupyter notebook or Python IDE and visualize the data frame, use the code below:

import statsmodels.api as sm
import pandas as pd
iris = sm.datasets.get_rdataset('iris').data
iris.head() #to check the first 5 rows of the dataframe
The first 5 rows of the iris dataset

Data Inspection

.head() and .tail()

Seeing is believing. To check that we have successfully imported the iris dataset, we can use iris.head() to look at the first 5 entries of the iris data. In the data frame shown above, the first row and the first column are bolded. Pandas highlights these to indicate the row indices and column names. It is also important to note that Python indexing starts from 0 instead of 1, which differs from other programming languages such as R and MATLAB, where indexing starts at 1.

If you are interested in the last 5 entries instead, we can use iris.tail(). We can also pass a number to head() or tail() to show n observations instead of 5. For example, iris.tail(10) shows the last 10 entries of the iris data.
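As a minimal sketch of head() and tail(), here is a small made-up frame (used instead of iris so the snippet is self-contained):

```python
import pandas as pd

# A tiny stand-in data frame with one column of 10 values
df = pd.DataFrame({"x": range(10)})

print(df.head())   # first 5 rows (the default)
print(df.tail(3))  # last 3 rows, by passing n=3
```

Both methods return a new data frame, so you can chain further operations onto the result.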

.size and .shape

Now that we have seen our data, we should be interested in knowing the size of the data frame. The size of a data frame is simply the number of rows multiplied by the number of columns. To get this information, we can use iris.size to see how many entries we have (not counting the row and column labels). Note that size is an attribute, not a method, so there are no parentheses.

There are 750 entries in the iris data frame

Now, knowing only the size of a data frame might not be very informative, since we do not know the number of rows and columns. To obtain this information, we can use iris.shape, which returns the number of rows and columns, respectively.

There are 150 rows and 5 columns in the iris data frame. Note: number shown is always (row, col)
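A quick self-contained sketch of the two attributes (on a small made-up frame rather than iris):

```python
import pandas as pd

# 3 rows x 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)  # (rows, columns) -> (3, 2)
print(df.size)   # rows * columns  -> 6
```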

Pandas Data Types

Having the appropriate data type for each column in the data frame is integral, because each data type supports its own set of operations. Let's look at a simple example below:

The first addition is between strings, and the second one is between integers. On numeric data types, we can perform mathematical operations such as addition, subtraction, and multiplication. On strings, on the other hand, addition joins the two together. As such, it is important to confirm the data types of our data.
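This difference can be reproduced in plain Python (the values here are arbitrary):

```python
# Addition behaves differently depending on the data type
print("1" + "1")  # strings: concatenation -> "11"
print(1 + 1)      # integers: arithmetic   -> 2
```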

Before we get into that, let's familiarize ourselves with the data types available in Pandas (did you know that Pandas and Python can have different data types?).

A brief explanation on Pandas and Python data types (image is created by the author)

This post is going to cover every Pandas data type except for datetime64; more on datetime64 in a later post.

int64 and float64 are used to represent numbers in Pandas. int64 (integer) values are numbers without a fractional part, for example -3, 0, and 3. float64 values are numbers with decimals, such as -3.00, 0.534, and 9.99.

bool (boolean) columns can only contain two values, True or False. A boolean data type is especially useful for a column with a yes-or-no answer, for example infection status: infected (True) or not infected (False).

The object data type in Pandas is similar to the string data type in Python. An object column usually holds characters, both numeric and non-numeric. The category data type is similar to the object data type, but it only allows a finite set of possible values.
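The data types above can be sketched with a few small Series (made-up values, just to show the dtype Pandas infers):

```python
import pandas as pd

s_int = pd.Series([1, 2, 3])                          # int64
s_float = pd.Series([1.5, 2.0])                       # float64
s_bool = pd.Series([True, False])                     # bool
s_obj = pd.Series(["a", "b"])                         # object
s_cat = pd.Series(["a", "b", "a"], dtype="category")  # category

print(s_int.dtype, s_float.dtype, s_bool.dtype, s_obj.dtype, s_cat.dtype)
```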

To check data types of a Pandas data frame, we can use .dtypes. Let’s apply this to the iris data frame.

Checking data types of the columns in the iris data frame

Looking at the data types, it seems that every column already has an appropriate data type. But if we want to be more precise, we can change the Species column to category instead. We can use .astype('category') to change the column from object to category.

iris["Species"] = iris["Species"].astype('category')
#iris["Species"] <- the square brackets select the column
#To change a column to another data type, fill in the parameter
#Examples:
#.astype('int64') would change the column to integers
#.astype('float64') would change the column to floating points

We can check the change with .dtypes.

Now Species data type is category

While it may seem tedious to change the object data type to category, the conversion has a real advantage.

Memory Efficiency

We can use iris.info() to obtain information such as data types, column names, non-null value counts, and the memory used for storage. We can see that the category data type uses less memory than when the column is stored as an object.

The reason is that the values in a category column are saved as keys; for every occurrence of a value, the key is stored instead of the value itself. Below is a simple illustration of how this works.
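A minimal sketch of the saving, using a made-up column with many repeats of a few labels (memory_usage(deep=True) counts the actual string storage for object columns):

```python
import pandas as pd

# A column with many repeats of just three labels (hypothetical data)
species = pd.Series(["setosa", "versicolor", "virginica"] * 50)

as_object = species.memory_usage(deep=True)
as_category = species.astype("category").memory_usage(deep=True)

# The category version stores each label once plus small integer codes,
# so it should take noticeably less memory than the object version
print(as_object, as_category)
```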

Images are regenerated by the author. The original depiction was created by https://github.com/tomytjandra

Conclusion

Congratulations! You have read this far, and so far we have only been exploring the data. But that is okay, because before analysis can be conducted, we must first be familiar with our data and what it can offer. We will discuss data wrangling and subsetting in a later post 😊

That is all for now, I hope you liked the post.
