Data Wrangling in Pandas
‘Data Wrangling’ refers to the data manipulation often needed to transform raw data into a form that is useful for analysis.
Some key methods:
- Merging and Concatenating
- Reshaping data
- Data transformations
- Categorization
- Detecting and Filtering Outliers
- Creating Dummy Variables
import pandas as pd
import numpy as np
Merging
Merging two datasets is a very common operation when preparing data for analysis. It generally means adding columns from one table to columns from another, wherever the value of some key, or merge field, matches.
1- Many-to-one merge
# df1, df2 two simple DataFrames
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df1
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df2
  key  data2
0   a      0
1   b      1
2   d      2
# merge with the join field left implicit
pd.merge(df1, df2)
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
This is a many-to-one merge. The join field is implicit: pandas joins on the columns the two DataFrames have in common. Note that the frames share the key values a and b, but c appears only in df1 and d only in df2.
The result contains the values from both inputs wherever they share a value of the merge field, which is 'key' in this example. The default behavior is that a key value has to appear in both inputs to be kept.
In set terms, the result is the intersection of the two key sets.
Here is the same merge again, this time making the join field explicit:
pd.merge(df1, df2, on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
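As an aside, merge can also report where each row came from: the indicator argument (standard in modern pandas) adds a _merge column with values 'left_only', 'right_only', or 'both'. A minimal sketch:
# tag each result row with its provenance
pd.merge(df1, df2, on='key', how='outer', indicator=True)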
2- Many-to-many merge
# If a key value occurs more than once in both DataFrames, the merge is many-to-many.
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'],'data2': range(4)})
print(df1)
print(df3)
pd.merge(df1,df3, on='key')
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
key data2
0 a 0
1 b 1
2 b 2
3 d 3
  key  data1  data2
0   b      0      1
1   b      0      2
2   b      1      1
3   b      1      2
4   b      6      1
5   b      6      2
6   a      2      0
7   a      4      0
8   a      5      0
This produces the Cartesian product of the matching rows: each key value appears (occurrences in df1) × (occurrences in df3) times. Here 'b' occurs three times in df1 and twice in df3, so the result has six 'b' rows.
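If a many-to-many explosion would be a bug rather than the intent, recent pandas versions can assert the expected relationship via merge's validate argument; a minimal sketch:
# validate='many_to_one' raises MergeError when the right-hand keys are not unique
pd.merge(df1, df2, on='key', validate='many_to_one')   # OK: df2 keys are unique
# pd.merge(df1, df3, on='key', validate='many_to_one') # would raise: 'b' occurs twice in df3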
There are several types of joins: left, right, inner, and outer.
- left join => keeps every row from df1, regardless of whether its key also shows up in df3; it fills in NaN for the values missing from df3.
pd.merge(df1,df3, on='key', how='left')
  key  data1  data2
0   b      0    1.0
1   b      0    2.0
2   b      1    1.0
3   b      1    2.0
4   a      2    0.0
5   c      3    NaN
6   a      4    0.0
7   a      5    0.0
8   b      6    1.0
9   b      6    2.0
- right join => same idea, but this time it keeps every row from df3, regardless of whether its key also shows up in df1.
pd.merge(df1,df3, on='key', how='right')
  key  data1  data2
0   b    0.0      1
1   b    1.0      1
2   b    6.0      1
3   b    0.0      2
4   b    1.0      2
5   b    6.0      2
6   a    2.0      0
7   a    4.0      0
8   a    5.0      0
9   d    NaN      3
- inner join => keeps only keys present in both inputs (the intersection of the two key sets); this is the default for merge.
pd.merge(df1,df3, on='key', how='inner')
  key  data1  data2
0   b      0      1
1   b      0      2
2   b      1      1
3   b      1      2
4   b      6      1
5   b      6      2
6   a      2      0
7   a      4      0
8   a      5      0
- outer join => if inner joins are like an intersection of two sets, outer joins are unions: every key from either input is kept.
pd.merge(df1,df3, on='key', how='outer')
   key  data1  data2
0    b    0.0    1.0
1    b    0.0    2.0
2    b    1.0    1.0
3    b    1.0    2.0
4    b    6.0    1.0
5    b    6.0    2.0
6    a    2.0    0.0
7    a    4.0    0.0
8    a    5.0    0.0
9    c    3.0    NaN
10   d    NaN    3.0
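A quick way to see the four join types side by side is to compare how many rows each one returns; a minimal sketch:
# inner keeps matched keys only; left/right keep one whole side; outer keeps everything
for how in ['inner', 'left', 'right', 'outer']:
    print(how, pd.merge(df1, df3, on='key', how=how).shape[0])
# inner 9, left 10, right 10, outer 11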
- if the join fields have different names => no problem: just specify them with left_on and right_on.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'key_2': ['a', 'b', 'b', 'd'],'data2': range(4)})
pd.merge(df4,df5, left_on='key_1', right_on='key_2')
  key_1  data1 key_2  data2
0     b      0     b      1
1     b      0     b      2
2     b      1     b      1
3     b      1     b      2
4     b      6     b      1
5     b      6     b      2
6     a      2     a      0
7     a      4     a      0
8     a      5     a      0
# Merge the two DataFrames df4 and df5 below using a data column on one side and the index on the other.
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'data2': [4,6,8,10]}, index=['a','b','c','d'])
print(df4)
print(df5)
key_1 data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
data2
a 4
b 6
c 8
d 10
pd.merge(df4,df5, left_on='key_1', right_index=True)
  key_1  data1  data2
0     b      0      6
1     b      1      6
6     b      6      6
2     a      2      4
4     a      4      4
5     a      5      4
3     c      3      8
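A related detail: when the two DataFrames share a non-key column name, merge disambiguates the duplicates with suffixes ('_x' and '_y' by default), which you can override. A minimal sketch with two small hypothetical frames:
# both frames carry a 'value' column; suffixes= controls the renaming
left = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'value': [3, 4]})
pd.merge(left, right, on='key', suffixes=('_left', '_right'))
# result columns: key, value_left, value_right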
Concatenating
# Concatenating can append rows or columns, depending on the axis you use. The default is axis=0 (rows).
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
print(s3)
pd.concat([s1, s2, s3])
f 5
g 6
dtype: int64
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
Concatenating Series along axis 0 creates one longer Series, appending the three inputs end to end.
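concat can also label each input piece via the keys argument, which builds a hierarchical index recording where each value came from; a minimal sketch reusing s1, s2, s3:
# keys= gives a two-level index: (label, original index)
pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])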
pd.concat([s1, s2, s3], axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
# Outer join is the default:
pd.concat([s1, s2, s3], axis=1, join='outer')
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
# inner join
pd.concat([s1, s2, s3], axis=1, join='inner')
The inner join produces an empty result here: the three Series have no index values in common.
EX: use s4, s5, and s6 below to apply the outer and inner joins and compare the results.
s4 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s5 = pd.Series([1, 2, 3], index=['d', 'e', 'f'])
s6 = pd.Series([7, 8, 9, 10], index=['d', 'e', 'f', 'g'])
print(s6)
d 7
e 8
f 9
g 10
dtype: int64
# outer join
pd.concat([s4, s5, s6], axis=1, join='outer')
     0    1     2
c  4.0  NaN   NaN
d  5.0  1.0   7.0
e  6.0  2.0   8.0
f  NaN  3.0   9.0
g  NaN  NaN  10.0
# inner join
pd.concat([s4, s5, s6], axis=1, join='inner')
With some overlapping index values, the inner join is no longer empty. Note that the result contains only the entries (here d and e) that appear in all three Series.
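The same machinery applies to DataFrames; a minimal sketch with two small hypothetical frames, using ignore_index=True to discard the overlapping row labels and rebuild a fresh 0..3 index:
df_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df_b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})
# stack the rows and renumber the index
pd.concat([df_a, df_b], ignore_index=True)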
Reshaping with Hierarchical Indexing
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Lyon', 'Paris'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
data
number  one  two  three
state
Lyon      0    1      2
Paris     3    4      5
# Stack pivots the columns into rows, producing a Series with a hierarchical index:
result = data.stack()
result
state number
Lyon one 0
two 1
three 2
Paris one 3
two 4
three 5
dtype: int32
# Unstack reverses this process:
result.unstack()
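By default unstack pivots the innermost index level; you can also name the level to move; a minimal sketch:
# move the 'state' level to the columns instead of 'number'
result.unstack('state')   # equivalently: result.unstack(0)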
Data Transformations
# Start with a DataFrame containing some duplicate values
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,'k2': [1, 1, 2, 3, 3, 4, 99]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two  99
# find rows with duplicate values
data.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 False
dtype: bool
# remove duplicate values
data.drop_duplicates()
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
6  two  99
# If 99 is a code for missing data, we can replace any such values with NaN
data['k2'].replace(99,np.nan)
0 1.0
1 1.0
2 2.0
3 3.0
4 3.0
5 4.0
6 NaN
Name: k2, dtype: float64
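replace also accepts a list or a dict, which is handy when several sentinel codes need mapping at once; a minimal sketch (treating a hypothetical -999 as a second missing-data code):
# map multiple sentinel values to NaN in one call
data['k2'].replace([99, -999], np.nan)
# or give per-value replacements with a dict: data['k2'].replace({99: np.nan, -999: 0})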
Categorization (binning)
# Use cut to bin the data into categories defined by ranges
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# inspect the pieces of the Categorical object
type(cats)
pandas.core.arrays.categorical.Categorical
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
# Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the
#square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:
cats = pd.cut(ages, bins, right=False)
print(ages)
print(pd.value_counts(cats))
[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
[25, 35) 4
[18, 25) 4
[35, 60) 3
[60, 100) 1
dtype: int64
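cut can also attach readable labels to the bins, and the related qcut function bins by sample quantiles instead of fixed edges; a minimal sketch with hypothetical group names:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)  # same bins, named categories
pd.qcut(ages, 4)                        # four equal-sized (quartile) bins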
Removing Outliers
# data is a DataFrame with 4 columns of 1,000 random numbers
# We use a fixed seed for the random number generator to get repeatable results
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528
# This identifies any values in column 3 with absolute values > 3
col = data[3]
col[np.abs(col) > 3]
97 3.927528
305 -3.399312
400 -3.745356
Name: 3, dtype: float64
# This identifies all the rows with any column containing absolute values > 3
data[(np.abs(data) > 3).any(axis=1)]
# This caps values to the range [-3, 3]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
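An equivalent, arguably more readable way to cap values is the clip method; a minimal sketch:
# bound every value to the interval [-3, 3]; clip returns a new DataFrame
# rather than modifying in place
data.clip(-3, 3)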
Computing Dummy Variables
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
- Dummy variables are useful in statistical modeling: they provide 0/1 indicator variables for the presence of some condition.
# This generates dummy variables for each value of key
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
# This generates dummy variables for each value of key and appends these to the dataframe
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
Notice that we used join instead of merge. The join method is very similar to merge, but uses indexes to merge, by default. From the documentation:
http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.
The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but it joins on indexes by default rather than trying to join on common columns (the default behavior for merge).
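To make that relationship concrete, the two calls below produce the same result; a minimal sketch reusing df4 and df5 (indexed by key) from the merging section above:
# join on a column against the other frame's index (a left join by default)
df4.join(df5, on='key_1')
# the same operation spelled out with merge
pd.merge(df4, df5, left_on='key_1', right_index=True, how='left')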