Nanterre p10 - Dev Data

Logo

semaine s14

semaine s15

semaine courante (s17)

planning des veilles

Projet Pandas

Ressources Utiles pour le projet:

Importer les Libraries de dépendances

# import libraries 
import pandas as pd


chargement du Dataset

# charger le fichier locations.csv dans un pandas dataframe locations et afficher les 5 premières lignes de votre df
#TBD

neighborhood title price bedrooms pid longitude date subregion link latitude sqft
0 (bayview) Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY $950 / 1br - 4076905111 -122.396965 Sep 18 2013 SF /sfc/apa/4076905111.html 37.761216 / 1br -
1 (bayview) Only walking distance to major shopping centers. $950 / 1br - 4076901755 -122.396793 Sep 18 2013 SF /sfc/apa/4076901755.html 37.761080 / 1br -
2 (bayview) furnished - 1 Bedroom(s), 1 Bath(s), Air Condi... $950 / 1br - 4076899340 -122.397100 Sep 18 2013 SF /sfc/apa/4076899340.html 37.762100 / 1br -
3 (financial district) *NEW* Beautiful, Upscale Condo in Historic Jac... $3300 / 1br - 830ft² - 4067393707 -122.399747 Sep 18 2013 SF /sfc/apa/4067393707.html 37.798108 / 1br - 830ft² -
4 (visitacion valley) 楼上全层3房 $2000 / 3br - 1280ft² - 4076901071 NaN Sep 18 2013 SF /sfc/apa/4076901071.html NaN / 3br - 1280ft² -

Nettoyage des données

#  supprimer les parenthèses autour des valeurs de la colonne neighborhood 
#  supprimer $  de la colonne priceet afficer les valeurs en tant que float

#TBD

neighborhood title price bedrooms pid longitude date subregion link latitude sqft
0 bayview Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY 950.0 / 1br - 4076905111 -122.396965 Sep 18 2013 SF /sfc/apa/4076905111.html 37.761216 / 1br -
1 bayview Only walking distance to major shopping centers. 950.0 / 1br - 4076901755 -122.396793 Sep 18 2013 SF /sfc/apa/4076901755.html 37.761080 / 1br -
2 bayview furnished - 1 Bedroom(s), 1 Bath(s), Air Condi... 950.0 / 1br - 4076899340 -122.397100 Sep 18 2013 SF /sfc/apa/4076899340.html 37.762100 / 1br -
3 financial district *NEW* Beautiful, Upscale Condo in Historic Jac... 3300.0 / 1br - 830ft² - 4067393707 -122.399747 Sep 18 2013 SF /sfc/apa/4067393707.html 37.798108 / 1br - 830ft² -
4 visitacion valley 楼上全层3房 2000.0 / 3br - 1280ft² - 4076901071 NaN Sep 18 2013 SF /sfc/apa/4076901071.html NaN / 3br - 1280ft² -
# Eclater le contenu de la colonne data en colonnes: month day year et supprimer la colonne d'orig (date)

#TBD

neighborhood title price bedrooms pid longitude subregion link latitude sqft month day year
0 bayview Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY 950.0 / 1br - 4076905111 -122.396965 SF /sfc/apa/4076905111.html 37.761216 / 1br - Sep 18 2013
1 bayview Only walking distance to major shopping centers. 950.0 / 1br - 4076901755 -122.396793 SF /sfc/apa/4076901755.html 37.761080 / 1br - Sep 18 2013
2 bayview furnished - 1 Bedroom(s), 1 Bath(s), Air Condi... 950.0 / 1br - 4076899340 -122.397100 SF /sfc/apa/4076899340.html 37.762100 / 1br - Sep 18 2013
3 financial district *NEW* Beautiful, Upscale Condo in Historic Jac... 3300.0 / 1br - 830ft² - 4067393707 -122.399747 SF /sfc/apa/4067393707.html 37.798108 / 1br - 830ft² - Sep 18 2013
4 visitacion valley 楼上全层3房 2000.0 / 3br - 1280ft² - 4076901071 NaN SF /sfc/apa/4076901071.html NaN / 3br - 1280ft² - Sep 18 2013
# definir une fonction clean_bedrooms qui permet de changer la colonne bedrooms pour garder uniquement le premier entier 
# exemples: /1br - => 1 
# /3br - 1280ft² - => 3

#TBD

neighborhood title price bedrooms pid longitude subregion link latitude sqft month day year
0 bayview Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY 950.0 1.0 4076905111 -122.396965 SF /sfc/apa/4076905111.html 37.761216 / 1br - Sep 18 2013
1 bayview Only walking distance to major shopping centers. 950.0 1.0 4076901755 -122.396793 SF /sfc/apa/4076901755.html 37.761080 / 1br - Sep 18 2013
2 bayview furnished - 1 Bedroom(s), 1 Bath(s), Air Condi... 950.0 1.0 4076899340 -122.397100 SF /sfc/apa/4076899340.html 37.762100 / 1br - Sep 18 2013
3 financial district *NEW* Beautiful, Upscale Condo in Historic Jac... 3300.0 1.0 4067393707 -122.399747 SF /sfc/apa/4067393707.html 37.798108 / 1br - 830ft² - Sep 18 2013
4 visitacion valley 楼上全层3房 2000.0 3.0 4076901071 NaN SF /sfc/apa/4076901071.html NaN / 3br - 1280ft² - Sep 18 2013
# definir une fonction clean_surface qui permet de changer la colonne sqft et garder uniquement le dernier entier
# exemple: /3br - 1280ft² - => 1280

#TBD
neighborhood title price bedrooms pid longitude subregion link latitude sqft month day year
0 bayview Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY 950.0 1.0 4076905111 -122.396965 SF /sfc/apa/4076905111.html 37.761216 NaN Sep 18 2013
1 bayview Only walking distance to major shopping centers. 950.0 1.0 4076901755 -122.396793 SF /sfc/apa/4076901755.html 37.761080 NaN Sep 18 2013
2 bayview furnished - 1 Bedroom(s), 1 Bath(s), Air Condi... 950.0 1.0 4076899340 -122.397100 SF /sfc/apa/4076899340.html 37.762100 NaN Sep 18 2013
3 financial district *NEW* Beautiful, Upscale Condo in Historic Jac... 3300.0 1.0 4067393707 -122.399747 SF /sfc/apa/4067393707.html 37.798108 830.0 Sep 18 2013
4 visitacion valley 楼上全层3房 2000.0 3.0 4076901071 NaN SF /sfc/apa/4076901071.html NaN 1280.0 Sep 18 2013

Manipulation des données :

Partie1:

  1. Trouvez des valeurs aberrantes en loyer, par exemple en dessous de 200 ou au-dessus de 10 000
  2. Analyser les données sans les valeurs manquantes
  3. définir un nouveau dataset sans les valeurs aberrantes
df.describe()
price bedrooms pid longitude latitude sqft day year
count 4908.000000 4544.000000 5.000000e+03 3143.000000 3143.000000 3178.000000 5000.000000 5000.0
mean 2656.999389 2.066241 4.068059e+09 -122.264948 37.757411 1173.613593 17.523800 2013.0
std 1915.147477 1.011606 1.344453e+07 0.278825 0.364646 751.552623 0.766258 0.0
min 1.000000 1.000000 4.008227e+09 -123.799100 36.813820 1.000000 14.000000 2013.0
25% 1695.000000 1.000000 4.065685e+09 -122.442365 37.469365 747.250000 17.000000 2013.0
50% 2208.500000 2.000000 4.074290e+09 -122.283714 37.760858 1000.000000 18.000000 2013.0
75% 2995.000000 3.000000 4.075949e+09 -122.045047 37.900832 1350.000000 18.000000 2013.0
max 35000.000000 8.000000 4.076908e+09 -120.034132 41.456848 12700.000000 18.000000 2013.0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   neighborhood  4986 non-null   object 
 1   title         5000 non-null   object 
 2   price         4908 non-null   float64
 3   bedrooms      4544 non-null   float64
 4   pid           5000 non-null   int64  
 5   longitude     3143 non-null   float64
 6   subregion     5000 non-null   object 
 7   link          5000 non-null   object 
 8   latitude      3143 non-null   float64
 9   sqft          3178 non-null   float64
 10  month         5000 non-null   object 
 11  day           5000 non-null   int32  
 12  year          5000 non-null   int32  
dtypes: float64(5), int32(2), int64(1), object(5)
memory usage: 371.2+ KB
df['price'].isna().sum()
92
len(df['price'])
5000

#TBC

Partie 2:

en utilisant le dernier Dataframe (avec les données propres) :

  1. Filtrer les enregistrements avec plus de 5 chambres
  2. Filtrer les enregistrements ayant des valeurs de sqft < 500 et > 3000
  3. Créez un ensemble (set) de 5 catégories de prix qui incluent toutes les valeurs, précisez le nombre d’enregistrements dans chaque catégorie
filtered.head()
neighborhood title price bedrooms pid longitude subregion link latitude sqft month day year
0 bayview Take A TOUR ON OUR ONE FURNISHED BEDROOM TODAY 950.0 1.0 4076905111 -122.396965 SF /sfc/apa/4076905111.html 37.761216 NaN Sep 18 2013
1 bayview Only walking distance to major shopping centers. 950.0 1.0 4076901755 -122.396793 SF /sfc/apa/4076901755.html 37.761080 NaN Sep 18 2013
2 bayview furnished - 1 Bedroom(s), 1 Bath(s), Air Condi... 950.0 1.0 4076899340 -122.397100 SF /sfc/apa/4076899340.html 37.762100 NaN Sep 18 2013
3 financial district *NEW* Beautiful, Upscale Condo in Historic Jac... 3300.0 1.0 4067393707 -122.399747 SF /sfc/apa/4067393707.html 37.798108 830.0 Sep 18 2013
4 visitacion valley 楼上全层3房 2000.0 3.0 4076901071 NaN SF /sfc/apa/4076901071.html NaN 1280.0 Sep 18 2013

#TBC
price bedrooms pid longitude latitude sqft day year
count 1936.000000 1936.000000 1.936000e+03 1936.000000 1936.000000 1936.000000 1936.000000 1936.0
mean 2598.870351 2.107955 4.068081e+09 -122.222475 37.734211 1181.790806 17.541322 2013.0
std 1284.523424 1.002436 1.353531e+07 0.284795 0.386713 608.700661 0.746636 0.0
min 575.000000 1.000000 4.012055e+09 -123.799100 36.813820 200.000000 14.000000 2013.0
25% 1800.000000 1.000000 4.065635e+09 -122.432609 37.423714 785.000000 17.000000 2013.0
50% 2295.000000 2.000000 4.074286e+09 -122.243881 37.680504 1011.000000 18.000000 2013.0
75% 3000.000000 3.000000 4.076001e+09 -122.000592 37.947712 1400.000000 18.000000 2013.0
max 9999.000000 5.000000 4.076903e+09 -120.063949 41.456848 6500.000000 18.000000 2013.0
# TBC

Enregistrement des données propres en BDD

stocker votre jeu de données propore dans une Base de données que vous créerez

#TBD