Data Wrangling in Pandas

‘Data Wrangling’ refers to the data manipulation often needed to transform raw data into a form that is useful for analysis.

Some key methods are shown below.

import pandas as pd
import numpy as np

Merging

Merging two datasets is a very common operation when preparing data for analysis. It generally means adding columns from one table to columns from another, wherever the value of some key, or merge field, matches.

1- Many-to-one merge

# df1 and df2 are two simple DataFrames
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df1

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})
df2
  key  data2
0   a      0
1   b      1
2   d      2
pd.merge(df1,df2)
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

This is a many-to-one merge. The join field is implicit: pandas joins on whichever columns the two DataFrames have in common. Note that the inputs share some key values (a, b) but do not share c or d.

The result contains rows from both inputs wherever they share a value of the merge field, which is ‘key’ in this example. By default, a key value has to appear in both inputs to be kept.

In set terms, the result keeps the intersection of the two sets of key values.

pd.merge(df1,df2, on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0

Here is the same merge, but making the join field explicit.
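
As a quick aside, merge can also assert the relationship you expect and report where each row came from. A minimal sketch using the standard validate and indicator arguments:

# validate='m:1' raises MergeError if the right side's keys are not unique
pd.merge(df1, df2, on='key', validate='m:1')
# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
pd.merge(df1, df2, on='key', how='outer', indicator=True)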

2- Many-to-many merge

# If a key value occurs more than once in both DataFrames, the merge is many-to-many
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b', 'd'],'data2': range(4)})
print(df1)
print(df3)
pd.merge(df1,df3, on='key')

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
  key  data2
0   a      0
1   b      1
2   b      2
3   d      3
  key  data1  data2
0   b      0      1
1   b      0      2
2   b      1      1
3   b      1      2
4   b      6      1
5   b      6      2
6   a      2      0
7   a      4      0
8   a      5      0

This produces a Cartesian product of the matching rows: 'b' occurs three times in df1 and twice in df3, so the result contains 3 × 2 = 6 'b' rows, plus 3 × 1 = 3 'a' rows.
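
To check this arithmetic, here is a small sketch that only uses df1 and df3 from above:

# Per-key occurrence counts in each input
left_counts = df1['key'].value_counts()
right_counts = df3['key'].value_counts()
# Multiplying aligns on the keys; keys present on only one side become NaN.
# a -> 3*1, b -> 3*2, so the inner merge has 3 + 6 = 9 rows
(left_counts * right_counts).dropna().sum()   # 9.0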

There are several types of joins: left, right, inner, and outer.

pd.merge(df1,df3, on='key', how='left')
  key  data1  data2
0   b      0    1.0
1   b      0    2.0
2   b      1    1.0
3   b      1    2.0
4   a      2    0.0
5   c      3    NaN
6   a      4    0.0
7   a      5    0.0
8   b      6    1.0
9   b      6    2.0
pd.merge(df1,df3, on='key', how='right')
  key  data1  data2
0   b    0.0      1
1   b    1.0      1
2   b    6.0      1
3   b    0.0      2
4   b    1.0      2
5   b    6.0      2
6   a    2.0      0
7   a    4.0      0
8   a    5.0      0
9   d    NaN      3
pd.merge(df1,df3, on='key', how='inner')
  key  data1  data2
0   b      0      1
1   b      0      2
2   b      1      1
3   b      1      2
4   b      6      1
5   b      6      2
6   a      2      0
7   a      4      0
8   a      5      0
pd.merge(df1,df3, on='key', how='outer')
   key  data1  data2
0    b    0.0    1.0
1    b    0.0    2.0
2    b    1.0    1.0
3    b    1.0    2.0
4    b    6.0    1.0
5    b    6.0    2.0
6    a    2.0    0.0
7    a    4.0    0.0
8    a    5.0    0.0
9    c    3.0    NaN
10   d    NaN    3.0
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'key_2': ['a', 'b', 'b', 'd'],'data2': range(4)})
pd.merge(df4,df5, left_on='key_1', right_on='key_2')
  key_1  data1 key_2  data2
0     b      0     b      1
1     b      0     b      2
2     b      1     b      1
3     b      1     b      2
4     b      6     b      1
5     b      6     b      2
6     a      2     a      0
7     a      4     a      0
8     a      5     a      0
# Try to merge the two DataFrames df4 and df5 below, using a data column on one side and the index on the other
df4 = pd.DataFrame({'key_1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5 = pd.DataFrame({'data2': [4,6,8,10]}, index=['a','b','c','d'])
print(df4)
print(df5)
  key_1  data1
0     b      0
1     b      1
2     a      2
3     c      3
4     a      4
5     a      5
6     b      6
   data2
a      4
b      6
c      8
d     10
pd.merge(df4,df5, left_on='key_1', right_index=True)
  key_1  data1  data2
0     b      0      6
1     b      1      6
6     b      6      6
2     a      2      4
4     a      4      4
5     a      5      4
3     c      3      8
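
The same column-to-index merge can be written with DataFrame.join, which matches a column of the caller against the index of the other frame. Note that join defaults to a left join; here every key in df4 exists in df5's index, so the result has the same rows (only the row order differs):

# Equivalent: df4's 'key_1' column is matched against df5's index
df4.join(df5, on='key_1')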

Concatenating

# Concatenating can append rows or columns, depending on which axis you use. The default is axis=0
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
print(s3)
pd.concat([s1, s2, s3])

f    5
g    6
dtype: int64

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

Concatenating Series along axis 0 creates one longer Series, appending the three inputs end to end.
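
As a brief sketch of a related option, the keys argument labels each input, producing a hierarchical index instead of a flat one:

# keys names each piece, creating a MultiIndex on the result
pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])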

pd.concat([s1, s2, s3], axis=1)
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
# Outer join is the default:
pd.concat([s1, s2, s3], axis=1, join='outer')
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0
# inner join 
pd.concat([s1, s2, s3], axis=1, join='inner')
Empty DataFrame
Columns: [0, 1, 2]
Index: []

The inner join produces an empty result here, since the three series have no index values in common.

EX: use s4, s5, s6 below to apply outer then inner joins and compare the results

s4 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
s5 = pd.Series([1, 2, 3], index=['d', 'e', 'f'])
s6 = pd.Series([7, 8, 9, 10], index=['d', 'e', 'f', 'g'])
print(s6)
d     7
e     8
f     9
g    10
dtype: int64
# outer join
pd.concat([s4, s5, s6], axis=1, join='outer')
     0    1     2
c  4.0  NaN   NaN
d  5.0  1.0   7.0
e  6.0  2.0   8.0
f  NaN  3.0   9.0
g  NaN  NaN  10.0
# inner join 
pd.concat([s4, s5, s6], axis=1, join='inner')

   0  1  2
d  5  1  7
e  6  2  8

With some overlapping index values, the inner join has a better chance of producing a non-empty result.

Note that it contains only entries that overlap in all three series.
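
Concat works the same way on DataFrames. A minimal sketch (dfa and dfb are made up for illustration); ignore_index=True discards the original row labels and renumbers the result:

dfa = pd.DataFrame(np.random.randn(3, 2), columns=['x', 'y'])
dfb = pd.DataFrame(np.random.randn(2, 2), columns=['x', 'y'])
# Stack the rows and renumber the index 0..4
pd.concat([dfa, dfb], ignore_index=True)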

Reshaping with Hierarchical Indexing

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                 index=pd.Index(['Lyon', 'Paris'], name='state'),
                 columns=pd.Index(['one', 'two', 'three'], name='number'))
data
number  one  two  three
state
Lyon      0    1      2
Paris     3    4      5
# Stack pivots the columns into rows, producing a Series with a hierarchical index:
result = data.stack()
result
state  number
Lyon   one       0
       two       1
       three     2
Paris  one       3
       two       4
       three     5
dtype: int32
# Unstack reverses this process:
result.unstack()
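
Here unstack pivots the innermost index level ('number') back into columns, recovering the original DataFrame. A quick sketch: you can also unstack a different level, by position or by name:

# Pivot the 'state' level (level 0) into columns instead of 'number'
result.unstack(0)
result.unstack('state')   # same thing, by level name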

Data Transformations

# Start with a dataframe containing some duplicate values
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,'k2': [1, 1, 2, 3, 3, 4, 99]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two  99
# Flag rows that duplicate an earlier row
data.duplicated()
0    False
1     True
2    False
3    False
4     True
5    False
6    False
dtype: bool
# Remove the duplicated rows
data.drop_duplicates()
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
6  two  99
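
drop_duplicates can also consider only a subset of columns and keep the last occurrence instead of the first (subset and keep are standard parameters); a short sketch:

# Deduplicate on k1 only, keeping the last row seen for each value
data.drop_duplicates(subset=['k1'], keep='last')
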
# If 99 is a code for missing data, we can replace such values with NaN
data['k2'].replace(99,np.nan)
0    1.0
1    1.0
2    2.0
3    3.0
4    3.0
5    4.0
6    NaN
Name: k2, dtype: float64
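
Note that replace returns a new Series and leaves data unchanged; to keep the result, assign it back. A minimal sketch:

# Persist the replacement in the DataFrame
data['k2'] = data['k2'].replace(99, np.nan)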

Categorization (binning)

# Create categories by binning the data into ranges with pd.cut
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
type(cats)
pandas.core.arrays.categorical.Categorical
pd.core.arrays.categorical.Categorical?
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
# Consistent with mathematical notation for intervals, a parenthesis means that the side is open while the 
#square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:
cats = pd.cut(ages, bins, right=False)
print(ages)
print(pd.value_counts(cats))
[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
[25, 35)     4
[18, 25)     4
[35, 60)     3
[60, 100)    1
dtype: int64
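
Two related options, sketched briefly: cut accepts readable labels for the bins (the names below are made up for illustration), and pd.qcut bins by sample quantiles instead of fixed edges:

# Name the bins instead of showing interval notation
group_names = ['youth', 'young_adult', 'middle_aged', 'senior']
pd.cut(ages, bins, labels=group_names)
# qcut picks the edges so each bin holds roughly the same number of points
pd.qcut(ages, 4)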

Removing Outliers

# data is a dataframe with 4 columns of 1,000 random numbers
# We use a fixed seed for the random number generator to get repeatable results
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528
# This identifies any values in column 3 with absolute values > 3
col = data[3]
col[np.abs(col) > 3]
97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64
# This identifies all the rows where any column contains an absolute value > 3
data[(np.abs(data) > 3).any(axis=1)]
# This caps the values to the range [-3, 3]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
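
The same capping can be written more concisely with DataFrame.clip, which bounds every value to the given range:

# Equivalent capping without boolean indexing
data = data.clip(lower=-3, upper=3)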

Computing Dummy Variables

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
# This generates dummy variables for each value of key
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
# This generates dummy variables for each value of key and appends these to the dataframe
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
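
As an aside, the same result can be produced in one step by passing the whole DataFrame to get_dummies with the columns argument: the listed columns are expanded (using the column name as prefix) and the other columns pass through untouched:

# One-step version of the join above
pd.get_dummies(df, columns=['key'])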

Notice that we used join instead of merge. The join method is very similar to merge, but merges on indexes by default. From the documentation:

http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge).
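
To make the difference concrete, here is a minimal sketch (left and right are toy frames invented for illustration): join aligns on the indexes and performs a left join by default, while the equivalent merge call must be told to use the indexes explicitly:

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['a', 'c'])
# Index-on-index left join: keeps 'a' and 'b', with y = NaN for 'b'
left.join(right)
# The same result, spelled out with merge
pd.merge(left, right, left_index=True, right_index=True, how='left')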