Kaggle House Price Prediction: Comprehensive Data Exploration with Python

January 23, 2018

> Study notes on [COMPREHENSIVE DATA EXPLORATION WITH PYTHON](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python "COMPREHENSIVE DATA EXPLORATION WITH PYTHON")

**'The most difficult thing in life is to know yourself'**

The author works through the following chapters:

- Understanding the problem - look at each variable and think about its meaning and its importance for predicting the house price
- Univariate study - focus only on the dependent variable, `SalePrice`, and try to learn more about it
- Multivariate study - try to understand how the independent variables relate to the dependent variable
- Basic data cleaning - handle missing data, outliers, and the numeric encoding of categorical variables
- Testing assumptions - check whether the data meets the assumptions (such as normality) required by most multivariate techniques

First, let's see which columns the training data contains.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
```

```python
df_train = pd.read_csv('data/kaggle_house_pred_train.csv')
#check the decoration
df_train.columns
```

    Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
           'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
           'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
           'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
           'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
           'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
           'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
           'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
           'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
           'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
           'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
           'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu',
           'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
           'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF',
           'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch',
           'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold',
           'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'],
          dtype='object')

### Understanding the problem

**Try to understand each variable and relevance to this house prices problem.**

To keep the analysis disciplined, we can build a spreadsheet with the following columns:

> - Variable - the variable's name.
> - Type - identification of the variable's type. Two possible values: 'numerical' or 'categorical'.
> - Segment - identification of the variable's segment. We can define three segments: building, space and location. 'Building' refers to variables describing the physical characteristics of the building (e.g. 'OverallQual'). 'Space' refers to variables reporting the space properties of the house (e.g. 'TotalBsmtSF'). Finally, 'location' refers to variables carrying information about where the house is located (e.g. 'Neighborhood').
> - Expectation - our expectation of the variable's influence on 'SalePrice', on a categorical scale of 'High', 'Medium' and 'Low'.
> - Conclusion - our conclusion about the variable's importance after a quick look at the data, on the same scale as 'Expectation'.
> - Comments - your own comments.

'Type' and 'Segment' are mostly there for future reference. The 'Expectation' column matters far more than those two, because it helps us develop a 'sixth sense'. (So the 'sixth sense' exists abroad too!)

To fill in such a spreadsheet, you must understand the meaning of each variable (the Kaggle data page documents all of them) and ask yourself:

- Which variables would you consider when buying a house?
- If you would consider a variable, how important is it?
- Is the information in a variable already captured by another one? In short, if y depends on x, is it enough to train on x alone?

A minimal sketch of such a spreadsheet as a pandas DataFrame is shown right below.
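As promised, here is that sketch. It is illustrative only: the kernel builds its spreadsheet by hand, not in code, and the rows and ratings below are hypothetical guesses, not the kernel's actual table.

```python
import pandas as pd

# Hypothetical evaluation spreadsheet: one row per variable.
# The ratings are illustrative guesses, not the kernel's actual table.
variables = pd.DataFrame(
    [
        ('OverallQual',  'categorical', 'building', 'High', 'High', 'overall material and finish quality'),
        ('GrLivArea',    'numerical',   'space',    'High', 'High', 'above-ground living area'),
        ('Neighborhood', 'categorical', 'location', 'High', 'Low',  'dropped after a quick look at the data'),
    ],
    columns=['Variable', 'Type', 'Segment', 'Expectation', 'Conclusion', 'Comments'],
)
print(variables)
```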
Once those questions are answered, we can fill in the 'Expectation' column.

Next, make scatter plots of these variables against the sale price, check whether they are as important as we expected, and fill in 'Conclusion'.

The author considers the following variables important:

- OverallQual
- YearBuilt
- TotalBsmtSF
- GrLivArea

That is two 'building' variables ('OverallQual' and 'YearBuilt') and two 'space' variables ('TotalBsmtSF' and 'GrLivArea'). This may come as a bit of a surprise, because it goes against the real-estate mantra that all that matters is 'location, location and location'. For categorical variables, this quick data-examination process may have been somewhat harsh. For example, the author expected the 'Neighborhood' variable to be more relevant, but ended up excluding it after analysing the data. Perhaps this is related to using scatter plots instead of box plots, which are better suited to visualizing categorical variables. The way we visualize data often influences our conclusions.

### Univariate study

Analysing 'SalePrice'. 'SalePrice' is the reason for this whole exploration, so let's first get a rough feel for it.

```python
df_train['SalePrice'].describe()
```

    count      1460.000000
    mean     180921.195890
    std       79442.502883
    min       34900.000000
    25%      129975.000000
    50%      163000.000000
    75%      214000.000000
    max      755000.000000
    Name: SalePrice, dtype: float64

Even the minimum price is above zero. Wow, great :smile: so there are no odd values that could wreck our model later on.

```python
# plot a histogram of SalePrice
sns.distplot(df_train['SalePrice'])
```

![output_7_1.png](/images/2018/01/2750444538.png)

- The distribution clearly deviates from normality.
- It shows obvious positive skewness.
- It shows noticeable peakedness (kurtosis).

See [here](https://support.minitab.com/zh-cn/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/data-concepts/how-skewness-and-kurtosis-affect-your-distribution/) for a refresher on skewness and kurtosis.

```python
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
```

    Skewness: 1.882876
    Kurtosis: 6.536282

### Multivariate study

#### Relationships with numerical variables

```python
# scatter plot grlivarea/saleprice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
```

![output_11_0.png](/images/2018/01/1249194380.png)

`SalePrice` and `GrLivArea` appear to have a linear relationship.

```python
# scatter plot totalbsmtsf/saleprice
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
```

![output_13_0.png](/images/2018/01/3117483105.png)

Here the two look more like an exponential relationship.

#### Relationships with categorical variables

```python
#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
```

![output_15_0.png](/images/2018/01/3390834454.png)

`SalePrice` is positively correlated with `OverallQual`.

```python
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
```

![output_17_0.png](/images/2018/01/3740997567.png)

Although it's not a strong tendency, buyers seem more prone to spend money on new stuff than on old relics.

**Summary**

- 'GrLivArea' and 'TotalBsmtSF' appear to be linearly related to 'SalePrice', and both relationships are positive. In the case of 'TotalBsmtSF', the slope of the linear relationship is particularly steep.
- 'OverallQual' and 'YearBuilt' also seem related to 'SalePrice'. The relationship looks stronger for 'OverallQual', where the box plot shows sale prices rising with overall quality.

We have only analysed four variables, but there are many others we should look at. The trick here seems to be choosing the right features (feature selection) rather than defining complex relationships between them (feature engineering).

Next we analyse correlations; a correlation heatmap makes the relationships between the price and the variables easy to see.

```python
#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
```

![output_20_0.png](/images/2018/01/1867240476.png)

At first sight, there are two red colored squares that get my attention. The first one refers to the 'TotalBsmtSF' and '1stFlrSF' variables, and the second one refers to the 'GarageX' variables. Both cases show how significant the correlation is between these variables. Actually, this correlation is so strong that it can indicate a situation of multicollinearity. If we think about these variables, we can conclude that they give almost the same information, so multicollinearity really occurs. Heatmaps are great for detecting this kind of situation, and in problems dominated by feature selection, like ours, they are an essential tool.

Another thing that got my attention was the 'SalePrice' correlations. We can see our well-known 'GrLivArea', 'TotalBsmtSF', and 'OverallQual' saying a big 'Hi!', but we can also see many other variables that should be taken into account. That's what we will do next.
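As an aside (this snippet is not in the original kernel), highly correlated pairs can also be listed programmatically instead of eyeballed; the 0.8 cut-off below is an arbitrary illustrative choice:

```python
# List feature pairs whose absolute correlation exceeds a threshold.
# The 0.8 cut-off is an arbitrary illustrative choice.
corr = df_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])  # expect e.g. GarageCars/GarageArea, TotalBsmtSF/1stFlrSF
```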
```python
#saleprice correlation matrix
k = 10 #number of variables for the heatmap
# indices of the k columns most strongly correlated with SalePrice
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
# np.corrcoef treats each *row* as one variable, whereas in the DataFrame each
# *column* is a variable, so we transpose before computing the correlations
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
```

![output_22_0.png](/images/2018/01/3618860159.png)

```python
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size = 2.5)
plt.show()
```

![output_23_0.png](/images/2018/01/3753545264.png)

### Data cleaning

```python
# missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
```

                  Total   Percent
    PoolQC         1453  0.995205
    MiscFeature    1406  0.963014
    Alley          1369  0.937671
    Fence          1179  0.807534
    FireplaceQu     690  0.472603
    LotFrontage     259  0.177397
    GarageCond       81  0.055479
    GarageType       81  0.055479
    GarageYrBlt      81  0.055479
    GarageFinish     81  0.055479
    GarageQual       81  0.055479
    BsmtExposure     38  0.026027
    BsmtFinType2     38  0.026027
    BsmtFinType1     37  0.025342
    BsmtCond         37  0.025342
    BsmtQual         37  0.025342
    MasVnrArea        8  0.005479
    MasVnrType        8  0.005479
    Electrical        1  0.000685
    Utilities         0  0.000000

```python
#dealing with missing data: drop every column with more than one missing value,
#then drop the single row where 'Electrical' is missing
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max() #just checking that there's no missing data missing...
```

    0
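Dropping every column with more than one missing value is blunt but keeps things simple. As an aside (not part of the original kernel), a gentler alternative is imputation; here is a minimal sketch on a fresh copy of the raw data, filling 'LotFrontage' with the median of its 'Neighborhood' group:

```python
# Hypothetical alternative to dropping columns: impute LotFrontage from
# the median of houses in the same neighborhood (fresh copy of the raw data,
# so the kernel's df_train is left untouched).
raw = pd.read_csv('data/kaggle_house_pred_train.csv')
raw['LotFrontage'] = raw.groupby('Neighborhood')['LotFrontage'] \
                        .transform(lambda s: s.fillna(s.median()))
print(raw['LotFrontage'].isnull().sum())  # 0, unless an entire neighborhood is missing
```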
```python
#standardizing data
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
```

    outer range (low) of the distribution:
    [[-1.83820775]
     [-1.83303414]
     [-1.80044422]
     [-1.78282123]
     [-1.77400974]
     [-1.62295562]
     [-1.6166617 ]
     [-1.58519209]
     [-1.58519209]
     [-1.57269236]]

    outer range (high) of the distribution:
    [[ 3.82758058]
     [ 4.0395221 ]
     [ 4.49473628]
     [ 4.70872962]
     [ 4.728631  ]
     [ 5.06034585]
     [ 5.42191907]
     [ 5.58987866]
     [ 7.10041987]
     [ 7.22629831]]

```python
# bivariate analysis saleprice/grlivarea
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
```

![output_28_0.png](/images/2018/01/452221701.png)

The two points at the bottom right, with very large living areas but low prices, do not follow the crowd, so we find and delete them.

```python
#deleting points
df_train.sort_values(by = 'GrLivArea', ascending = False)[:2]
```

            Id  MSSubClass MSZoning  LotArea Street LotShape LandContour Utilities LotConfig LandSlope  ...  EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
    1298  1299          60       RL    63887   Pave      IR3         Bnk    AllPub    Corner       Gtl  ...              0          0            0       480        0       1    2008       New        Partial     160000
    523    524          60       RL    40094   Pave      IR1         Bnk    AllPub    Inside       Gtl  ...              0          0            0         0        0      10    2007       New        Partial     184750

    2 rows × 63 columns

```python
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)
```

### Checking normality

```python
#bivariate analysis saleprice/totalbsmtsf
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
```

![output_32_0.png](/images/2018/01/1990281349.png)

```python
#histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
# normal probability (Q-Q) plot against the normal distribution
res = stats.probplot(df_train['SalePrice'], plot=plt)
```

![output_33_0.png](/images/2018/01/3804016521.png)

![output_33_1.png](/images/2018/01/3712101037.png)

'SalePrice' is not normal: it is peaked, positively skewed, and does not follow the diagonal line in the probability plot. A log transformation usually fixes this kind of positive skewness.

```python
#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])
```

```python
#transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
```

![output_35_0.png](/images/2018/01/1956123659.png)

![output_35_1.png](/images/2018/01/188372489.png)

```python
#histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)
```

![output_36_0.png](/images/2018/01/3500358761.png)

![output_36_1.png](/images/2018/01/1623301044.png)

```python
#data transformation
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
```

```python
#transformed histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)
```

![output_38_0.png](/images/2018/01/3957133017.png)

![output_38_1.png](/images/2018/01/913001010.png)

```python
#histogram and normal probability plot
sns.distplot(df_train['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['TotalBsmtSF'], plot=plt)
```

![output_39_0.png](/images/2018/01/3591162563.png)

![output_39_1.png](/images/2018/01/2833180838.png)

'TotalBsmtSF' is skewed as well, but many houses have no basement at all (value 0), and log(0) is undefined. A commenter on the kernel noted that log1p, i.e. log(x+1), sidesteps the zero problem; a sketch of that alternative follows right below. The kernel itself instead creates a binary 'HasBsmt' feature and log-transforms only the positive values.
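As a sketch of the commenter's suggestion (this is not what the kernel actually does), `np.log1p` computes log(1 + x) and maps 0 to 0, so it can be applied to the whole column without special-casing:

```python
# Alternative from the kernel's comment section: np.log1p maps 0 -> 0,
# so rows without a basement need no special handling. Stored in a
# separate variable so df_train itself is left untouched here.
tbsmt_log1p = np.log1p(df_train['TotalBsmtSF'])
sns.distplot(tbsmt_log1p[df_train['TotalBsmtSF'] > 0], fit=norm);
```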
```python
#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF']>0,'HasBsmt'] = 1
```

```python
#transform data: take the log only where a basement exists
df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(df_train['TotalBsmtSF'])
```

```python
#histogram and normal probability plot
sns.distplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], plot=plt)
```

![output_42_0.png](/images/2018/01/942450730.png)

![output_42_1.png](/images/2018/01/2363066184.png)

A quick look at the scatter plots after the log transformations: they no longer show the conic shape they had before.

```python
#scatter plot
plt.scatter(df_train['GrLivArea'], df_train['SalePrice']);
```

![output_43_0.png](/images/2018/01/2783941295.png)

```python
#scatter plot
plt.scatter(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], df_train[df_train['TotalBsmtSF']>0]['SalePrice']);
```

![output_44_0.png](/images/2018/01/3126370882.png)

```python
# finally, one-hot encode the categorical variables
df_train = pd.get_dummies(df_train)
```

```python
df_train.head(20)
```

        Id  MSSubClass  LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  ...  SaleType_ConLw  SaleType_New  SaleType_Oth  SaleType_WD  SaleCondition_Abnorml  SaleCondition_AdjLand  SaleCondition_Alloca  SaleCondition_Family  SaleCondition_Normal  SaleCondition_Partial
    0    1   60   8450  7  5  2003  2003   706   0   150  ...  0  0  0  1  0  0  0  0  1  0
    1    2   20   9600  6  8  1976  1976   978   0   284  ...  0  0  0  1  0  0  0  0  1  0
    2    3   60  11250  7  5  2001  2002   486   0   434  ...  0  0  0  1  0  0  0  0  1  0
    3    4   70   9550  7  5  1915  1970   216   0   540  ...  0  0  0  1  1  0  0  0  0  0
    4    5   60  14260  8  5  2000  2000   655   0   490  ...  0  0  0  1  0  0  0  0  1  0
    5    6   50  14115  5  5  1993  1995   732   0    64  ...  0  0  0  1  0  0  0  0  1  0
    6    7   20  10084  8  5  2004  2005  1369   0   317  ...  0  0  0  1  0  0  0  0  1  0
    7    8   60  10382  7  6  1973  1973   859  32   216  ...  0  0  0  1  0  0  0  0  1  0
    8    9   50   6120  7  5  1931  1950     0   0   952  ...  0  0  0  1  1  0  0  0  0  0
    9   10  190   7420  5  6  1939  1950   851   0   140  ...  0  0  0  1  0  0  0  0  1  0
    10  11   20  11200  5  5  1965  1965   906   0   134  ...  0  0  0  1  0  0  0  0  1  0
    11  12   60  11924  9  5  2005  2006   998   0   177  ...  0  1  0  0  0  0  0  0  0  1
    12  13   20  12968  5  6  1962  1962   737   0   175  ...  0  0  0  1  0  0  0  0  1  0
    13  14   20  10652  7  5  2006  2007     0   0  1494  ...  0  1  0  0  0  0  0  0  0  1
    14  15   20  10920  6  5  1960  1960   733   0   520  ...  0  0  0  1  0  0  0  0  1  0
    15  16   45   6120  7  8  1929  2001     0   0   832  ...  0  0  0  1  0  0  0  0  1  0
    16  17   20  11241  6  7  1970  1970   578   0   426  ...  0  0  0  1  0  0  0  0  1  0
    17  18   90  10791  4  5  1967  1967     0   0     0  ...  0  0  0  1  0  0  0  0  1  0
    18  19   20  13695  5  5  2004  2004   646   0   468  ...  0  0  0  1  0  0  0  0  1  0
    19  20   20   7560  5  6  1958  1965   504   0   525  ...  0  0  0  0  1  0  0  0  0  0

    20 rows × 222 columns
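To make the effect of `pd.get_dummies` concrete, here is a toy illustration on a hypothetical three-row frame (made-up values, not real dataset rows): each categorical column is replaced by one 0/1 indicator column per category.

```python
# Toy illustration of one-hot encoding; the values are made up.
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
print(pd.get_dummies(toy))
#    Street_Grvl  Street_Pave
# 0            0            1
# 1            1            0
# 2            0            1
```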