How to work with categorical variables in Python

What are categorical variables?

Datasets used for machine learning can contain numerical variables, categorical variables, or both. A categorical variable can only take a limited number of possible values. For example, we can have a categorical variable ‘car’ which can only take the values {‘Ford’, ‘Audi’, ‘BMW’, ‘Toyota’}.
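As a small illustration of "a limited number of possible values", the sketch below stores the car example as a pandas categorical Series. Only the four brand names come from the example above. The observations in the Series are made up for this illustration.

```python
import pandas as pd

# The four allowed values from the 'car' example
car_type = pd.CategoricalDtype(categories=['Ford', 'Audi', 'BMW', 'Toyota'])

# Hypothetical observations, invented for this illustration
car = pd.Series(['Ford', 'BMW', 'Audi', 'Ford'], dtype=car_type)

print(car.cat.categories.tolist())  # the fixed set of possible values
print(car.cat.codes.tolist())       # position of each observation in that set
```

A value outside the declared categories would be stored as missing, which is one way to see that the set of values really is closed.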

Some machine learning algorithms support categorical variables directly, in which case no pre-processing of the dataset is necessary. If an algorithm does not support them, we need to pre-process the data into a form it can work with. You may also prefer to pre-process the data yourself, regardless of what the algorithm supports.

In this blog I’ll show you 2 quick and easy ways to encode the categorical variables in your dataset. The Python script file CategoricalVariables.py and dataset CategoricalVariables.csv used in this blog can be found on GitHub.

The dataset

I’ve created a very simple dataset with some HR information from a company. The size and type of the data are enough to show the basic principles of working with categorical values.

First I’ll import the modules. The CSV file is then loaded into a pandas DataFrame.

import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

# Import dataset
df = pd.read_csv('CategoricalData.csv')

The DataFrame contains 11 rows and 5 columns.

df.shape

(11, 5)
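If you don't have the CSV file at hand, the tables printed throughout this post contain enough information to rebuild the DataFrame by hand. The sketch below does that, assuming the file holds exactly the 11 rows displayed in this post. It is a stand-in for pd.read_csv, not the original file.

```python
import numpy as np
import pandas as pd

# Rows copied from the tables shown in this post; np.nan marks the null values
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Female',
               'Male', 'Male', 'Female', 'Male', 'Male'],
    'Age': [18, 22, 25, 20, 45, 32, 65, 44, 31, 27, 37],
    'Salary': [20000, 26000, 27000, 19500, 38500, 34000,
               42500, 36000, 44500, 24000, 40000],
    'Title': [np.nan, 'Ms.', 'Ms.', 'Mr.', 'Mrs.', 'Mrs.',
              'Mr.', np.nan, 'Ms.', 'Mr.', 'Mr.'],
    'Department': ['IT', 'HR', 'HR', 'IT', 'Staff', np.nan,
                   'Staff', np.nan, 'Staff', np.nan, 'HR'],
})

print(df.shape)
```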

Let’s look at the first 5 rows of the dataset.

df.head()

   Gender  Age  Salary Title Department
0    Male   18   20000   NaN         IT
1  Female   22   26000   Ms.         HR
2  Female   25   27000   Ms.         HR
3    Male   20   19500   Mr.         IT
4  Female   45   38500  Mrs.      Staff

It has a few columns with categorical variables. We can recognize these by the type object:

df.dtypes

Gender        object
Age            int64
Salary         int64
Title         object
Department    object
dtype: object

The dataset also contains a few null values. These need to be replaced with an actual value before the categorical data can be encoded. First we should identify the rows with null values.

df[df.isnull().any(axis=1)]

   Gender  Age  Salary Title Department
0    Male   18   20000   NaN         IT
5  Female   32   34000  Mrs.        NaN
7    Male   44   36000   NaN        NaN
9    Male   27   24000   Mr.        NaN
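To see how many nulls each column holds, rather than which rows are affected, the same isnull call can be summed per column. The sketch below uses a small made-up frame (df_demo is a hypothetical name, and its values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Small stand-in frame; values invented for illustration
df_demo = pd.DataFrame({
    'Title': [np.nan, 'Ms.', 'Mr.', np.nan],
    'Department': ['IT', np.nan, 'HR', np.nan],
    'Age': [18, 22, 20, 44],
})

# Count nulls per column, then keep only the columns that have any
null_counts = df_demo.isnull().sum()
print(null_counts[null_counts > 0])
```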

The Title column contains 2 null values. In both cases the person is Male, so I’ll set the Title to Mr. The Department column also contains some null values. We’ll make them all work for IT.

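# Update null values (assign the result back; inplace=True on a
# selected column is not reliable in recent pandas versions)
df['Title'] = df['Title'].fillna('Mr.')
df['Department'] = df['Department'].fillna('IT')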
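Filling with Mr. and IT is a choice made for this small example. When there is no sensible guess, one option is to give the missing entries their own placeholder category. In the sketch below the label 'Unknown' and the frame df_demo are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; values invented for illustration
df_demo = pd.DataFrame({
    'Title': [np.nan, 'Ms.', 'Mr.'],
    'Department': ['IT', np.nan, 'HR'],
})

# fillna accepts a dict of column name -> fill value and returns a new DataFrame
df_filled = df_demo.fillna({'Title': 'Unknown', 'Department': 'Unknown'})
print(df_filled)
```

The placeholder then gets encoded like any other value, so the model can treat "missing" as its own category.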

As a last step before the actual encoding we’ll create a list with only the categorical column names.

# Create List with Categorical columns
categorical_columns = [cat for cat in df.columns 
                      if df[cat].dtype == 'object']
print(categorical_columns)

['Gender', 'Title', 'Department']
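An alternative to the list comprehension is DataFrame.select_dtypes. The sketch below excludes the numeric columns and keeps the rest, on a two-row stand-in frame with the same column names. Note the assumption: this treats every non-numeric column as categorical, which would also pick up date or boolean columns if the dataset had any.

```python
import pandas as pd

# Two-row stand-in with the same columns as the dataset
df_demo = pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'Age': [18, 22],
    'Salary': [20000, 26000],
    'Title': ['Mr.', 'Ms.'],
    'Department': ['IT', 'HR'],
})

# Keep every column that is not numeric
categorical_columns = df_demo.select_dtypes(exclude='number').columns.tolist()
print(categorical_columns)
```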

Method 1: Label encoding

With label encoding each categorical value will be converted to a numerical value. For example the Gender column will be encoded with the following numerical values:

  • Female = 0
  • Male = 1

Let’s see how we can implement this in code. First I’ll make a copy of the DataFrame (to make sure that for each method I have the original DataFrame to start with). Next we’ll create a LabelEncoder. We’ll loop through the columns in the categorical_columns list. For each column we’ll call le.fit_transform to encode the categorical values.

# Method 1 - Label Encoding
df_le = df.copy(deep=True)
le = LabelEncoder()
for column in categorical_columns:
    df_le[column] = le.fit_transform(df_le[column])

If we look at the DataFrame we can see that all categorical values are encoded.

print(df_le.head(n=15))

    Gender  Age  Salary  Title  Department
0        1   18   20000      0           1
1        0   22   26000      2           0
2        0   25   27000      2           0
3        1   20   19500      0           1
4        0   45   38500      1           2
5        0   32   34000      1           1
6        1   65   42500      0           2
7        1   44   36000      0           1
8        0   31   44500      2           2
9        1   27   24000      0           1
10       1   37   40000      0           0
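One thing to be aware of: the loop above reuses a single LabelEncoder, so after the loop it only remembers the classes of the last column it was fitted on. If you want to look up the mapping or convert the numbers back later, a possible variation is to keep one encoder per column. The sketch below shows this on a small made-up frame. The names df_demo and encoders are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in frame; values invented for illustration
df_demo = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Department': ['IT', 'HR', 'HR', 'Staff'],
})

# Fit a separate encoder for every column and keep it for later use
encoders = {}
df_demo_le = df_demo.copy(deep=True)
for column in df_demo.columns:
    encoder = LabelEncoder()
    df_demo_le[column] = encoder.fit_transform(df_demo[column])
    encoders[column] = encoder

# classes_ is sorted, and the position of a label is its numeric code
mapping = {str(label): code
           for code, label in enumerate(encoders['Gender'].classes_)}
print(mapping)

# Convert the numeric codes back to the original strings
decoded = [str(value) for value in
           encoders['Department'].inverse_transform(df_demo_le['Department'])]
print(decoded)
```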

Method 2: One hot encoding

Another method that can be used is one-hot encoding (sometimes also called dummy variables). For each categorical value a column will be created and by specifying 0 or 1 the original value can be encoded in each row.

Pandas will by default one-hot encode all columns of type object or category when using pd.get_dummies. There is no need to specify any column names.

# Method 2 - One hot Encoding
df_ohe = df.copy(deep=True)
df_ohe = pd.get_dummies(df_ohe)

After performing the default one-hot encoding with pandas, the DataFrame contains the information shown below (I removed a few columns to keep the output readable). All categorical variables are now one-hot encoded. For example, the original column Gender contained the 2 string values Male and Female. Each value is now appended to the column name, giving Gender_Female and Gender_Male.

Age  Salary  Gender_Female  Gender_Male  Title_Mr.  Title_Mrs. 
 18   20000              0            1          1           0
 22   26000              1            0          0           0
 25   27000              1            0          0           0
 20   19500              0            1          1           0
 45   38500              1            0          0           1
 32   34000              1            0          0           1
 65   42500              0            1          1           0
 44   36000              0            1          1           0
 31   44500              1            0          0           0
 27   24000              0            1          1           0
 37   40000              0            1          1           0
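Depending on your pandas version the output may look different: from pandas 2.0 onwards pd.get_dummies returns True/False columns by default rather than the 0/1 shown above. If you want integers, you can pass the dtype parameter, as in this small sketch on a made-up three-row frame (df_demo is a hypothetical name).

```python
import pandas as pd

# Small stand-in frame; values invented for illustration
df_demo = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [18, 22, 25],
})

# dtype=int forces 0/1 integer columns regardless of the pandas default
df_demo_ohe = pd.get_dummies(df_demo, dtype=int)
print(df_demo_ohe)
```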

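If you only want to one-hot encode specific columns, you can pass them with the columns parameter. Start from the original DataFrame here, because the df_ohe created above no longer has a Gender column.

df_ohe = pd.get_dummies(df, columns=['Gender'])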

You can also modify the prefix.

# Encode only Gender, starting from the original df, with a custom prefix
df_ohe = pd.get_dummies(df, columns=['Gender'], prefix=['OHE_Gender'])

The DataFrame now contains the Gender information with different column names.

    Age  Salary Title Department  OHE_Gender_Female  OHE_Gender_Male
0    18   20000   Mr.         IT                  0                1
1    22   26000   Ms.         HR                  1                0
2    25   27000   Ms.         HR                  1                0
3    20   19500   Mr.         IT                  0                1
4    45   38500  Mrs.      Staff                  1                0
5    32   34000  Mrs.         IT                  1                0
6    65   42500   Mr.      Staff                  0                1
7    44   36000   Mr.         IT                  0                1
8    31   44500   Ms.      Staff                  1                0
9    27   24000   Mr.         IT                  0                1
10   37   40000   Mr.         HR                  0                1
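When several columns are encoded at once, the prefix parameter also accepts a dict that maps each column name to its prefix. The sketch below shows this on a made-up frame. The prefixes 'G' and 'Dept' and the name df_demo are arbitrary choices for illustration.

```python
import pandas as pd

# Small stand-in frame; values invented for illustration
df_demo = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Department': ['IT', 'HR', 'IT'],
    'Age': [18, 22, 25],
})

# One prefix per encoded column; dtype=int gives 0/1 instead of True/False
df_demo_ohe = pd.get_dummies(
    df_demo,
    columns=['Gender', 'Department'],
    prefix={'Gender': 'G', 'Department': 'Dept'},
    dtype=int,
)
print(df_demo_ohe.columns.tolist())
```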

Summary

In this blog I gave a brief explanation of categorical variables. I showed how you can find them in a dataset and how to update the null values if needed. Next we looked at what are probably the 2 most common ways to encode categorical values in Python.
