Geeks of Coding

Kaggle House Prices: Advanced Regression Techniques

      Abhishek Tyagi


      Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
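The metric can be sketched directly. A minimal helper (the name `rmse_log` is illustrative, not part of the competition API) shows why taking logs equalizes errors across price ranges — the same $10k miss costs far less on an expensive house:

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the logs of predicted and observed sale prices."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# A $10k error on a cheap house hurts much more than on an expensive one:
print(rmse_log(np.array([100_000.0]), np.array([110_000.0])))      # ~0.0953
print(rmse_log(np.array([1_000_000.0]), np.array([1_010_000.0])))  # ~0.00995
```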

      To download the code, see the GitHub link –

        File descriptions
        train.csv – the training set
        test.csv – the test set
        data_description.txt – full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
        sample_submission.csv – a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
        Data fields
        Here’s a brief version of what you’ll find in the data description file.

        SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
        MSSubClass: The building class
        MSZoning: The general zoning classification
        LotFrontage: Linear feet of street connected to property
        LotArea: Lot size in square feet
        Street: Type of road access
        Alley: Type of alley access
        LotShape: General shape of property
        LandContour: Flatness of the property
        Utilities: Type of utilities available
        LotConfig: Lot configuration
        LandSlope: Slope of property
        Neighborhood: Physical locations within Ames city limits
        Condition1: Proximity to main road or railroad
        Condition2: Proximity to main road or railroad (if a second is present)
        BldgType: Type of dwelling
        HouseStyle: Style of dwelling
        OverallQual: Overall material and finish quality
        OverallCond: Overall condition rating
        YearBuilt: Original construction date
        YearRemodAdd: Remodel date
        RoofStyle: Type of roof
        RoofMatl: Roof material
        Exterior1st: Exterior covering on house
        Exterior2nd: Exterior covering on house (if more than one material)
        MasVnrType: Masonry veneer type
        MasVnrArea: Masonry veneer area in square feet
        ExterQual: Exterior material quality
        ExterCond: Present condition of the material on the exterior
        Foundation: Type of foundation
        BsmtQual: Height of the basement
        BsmtCond: General condition of the basement
        BsmtExposure: Walkout or garden level basement walls
        BsmtFinType1: Quality of basement finished area
        BsmtFinSF1: Type 1 finished square feet
        BsmtFinType2: Quality of second finished area (if present)
        BsmtFinSF2: Type 2 finished square feet
        BsmtUnfSF: Unfinished square feet of basement area
        TotalBsmtSF: Total square feet of basement area
        Heating: Type of heating
        HeatingQC: Heating quality and condition
        CentralAir: Central air conditioning
        Electrical: Electrical system
        1stFlrSF: First Floor square feet
        2ndFlrSF: Second floor square feet
        LowQualFinSF: Low quality finished square feet (all floors)
        GrLivArea: Above grade (ground) living area square feet
        BsmtFullBath: Basement full bathrooms
        BsmtHalfBath: Basement half bathrooms
        FullBath: Full bathrooms above grade
        HalfBath: Half baths above grade
        Bedroom: Number of bedrooms above basement level
        Kitchen: Number of kitchens
        KitchenQual: Kitchen quality
        TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
        Functional: Home functionality rating
        Fireplaces: Number of fireplaces
        FireplaceQu: Fireplace quality
        GarageType: Garage location
        GarageYrBlt: Year garage was built
        GarageFinish: Interior finish of the garage
        GarageCars: Size of garage in car capacity
        GarageArea: Size of garage in square feet
        GarageQual: Garage quality
        GarageCond: Garage condition
        PavedDrive: Paved driveway
        WoodDeckSF: Wood deck area in square feet
        OpenPorchSF: Open porch area in square feet
        EnclosedPorch: Enclosed porch area in square feet
        3SsnPorch: Three season porch area in square feet
        ScreenPorch: Screen porch area in square feet
        PoolArea: Pool area in square feet
        PoolQC: Pool quality
        Fence: Fence quality
        MiscFeature: Miscellaneous feature not covered in other categories
        MiscVal: $Value of miscellaneous feature
        MoSold: Month Sold
        YrSold: Year Sold
        SaleType: Type of sale
        SaleCondition: Condition of sale

      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt'bmh')
      %matplotlib inline
      import seaborn as sns
      from sklearn.model_selection import cross_val_score, cross_val_predict
      from sklearn.model_selection import GridSearchCV
      from sklearn.preprocessing import scale
      from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
      from sklearn.ensemble import GradientBoostingRegressor
      pd.set_option('display.max_columns', 85)
      train = pd.read_csv('train.csv')
      test = pd.read_csv('test.csv')
      print('train shape:', train.shape, '\n', 'test shape:', test.shape)
      missing_numeric = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])
      missing_numeric = missing_numeric[(missing_numeric['train']>0) | (missing_numeric['test']>0)]
      missing_numeric.sort_values(by=['train', 'test'], ascending=False)
      # Drop the features which I'm not interested in 
      feature_drop = ['PoolQC', 'MiscFeature', 'Fence', 'FireplaceQu', 'LotFrontage', 'GarageYrBlt', 'MoSold', 'YrSold', 
                      'LowQualFinSF', 'MiscVal', 'PoolArea']
      datasets = [train, test]
      for df in datasets:
          df.drop(feature_drop, axis=1, inplace=True)
          df.loc[df['Alley'].isnull(), 'Alley'] = 'NoAlley'
      # If a house has no garage, it will have missing value on the 'Garage related' features, so just fill NaNs with 'NoGarage'.
          df.loc[df['GarageCond'].isnull(), 'GarageCond'] = 'NoGarage'
          df.loc[df['GarageQual'].isnull(), 'GarageQual'] = 'NoGarage'
          df.loc[df['GarageType'].isnull(), 'GarageType'] = 'NoGarage'
          df.loc[df['GarageFinish'].isnull(), 'GarageFinish'] = 'NoGarage'
      # If a house has no basement, it will have missing value on the 'basement related' features, so just fill NaNs with 'NoBsmt'.    
          df.loc[df['BsmtExposure'].isnull(), 'BsmtExposure'] = 'NoBsmt'
          df.loc[df['BsmtFinType2'].isnull(), 'BsmtFinType2'] = 'NoBsmt'
          df.loc[df['BsmtCond'].isnull(), 'BsmtCond'] = 'NoBsmt'
          df.loc[df['BsmtQual'].isnull(), 'BsmtQual'] = 'NoBsmt'
          df.loc[df['BsmtFinType1'].isnull(), 'BsmtFinType1'] = 'NoBsmt'
      # Masonry veneer feature: just fill with 'None' if there is no Masonry veneer.    
          df.loc[df['MasVnrType'].isnull(), 'MasVnrType'] = 'None'
          df.loc[df['MasVnrArea'].isnull(), 'MasVnrArea'] = 0
      train['Electrical'].fillna(train['Electrical'].mode()[0], inplace=True)
      test_numeric_missing = ['BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'GarageArea', 'GarageCars', 'TotalBsmtSF']
      test_categorical_missing = ['MSZoning', 'Functional', 'Utilities', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType']
      for i in test_numeric_missing:
          test[i].fillna(0, inplace=True)
      for j in test_categorical_missing:
          test[j].fillna(test[j].mode()[0], inplace=True)
      # Check the missing values again for datasets
      missing_numeric = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])
      missing_numeric = missing_numeric[(missing_numeric['train']>0) | (missing_numeric['test']>0)]
      missing_numeric.sort_values(by=['train', 'test'], ascending=False)
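The long run of per-column `df.loc[...] = ...` assignments above could equally be written as a single dict-driven `fillna` pass. A minimal sketch with a toy frame (the column subset is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Alley':      [np.nan, 'Grvl'],
    'GarageCond': ['TA', np.nan],
    'MasVnrArea': [np.nan, 120.0],
})

# One mapping from column name to its sentinel fill value.
fill_values = {'Alley': 'NoAlley', 'GarageCond': 'NoGarage', 'MasVnrArea': 0}
df = df.fillna(fill_values)
print(df.isnull().sum().sum())  # 0
```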


      print(train['SalePrice'].describe(), '\n')
      print('Before Transformation Skew: ', train['SalePrice'].skew())
      target = np.log1p(train['SalePrice'])
      print('Log Transformation Skew: ', target.skew())
      plt.rcParams['figure.figsize'] = (12, 5)
      target_log_tran = pd.DataFrame({'before transformation':train['SalePrice'], 'log transformation': target})
      skewness = pd.DataFrame({'Skewness':train.select_dtypes(exclude=[object]).skew()})
      print(skewness[skewness['Skewness']>0.8].sort_values(by='Skewness'), '\n')  
      skews = ['2ndFlrSF', 'BsmtUnfSF', 'GrLivArea', '1stFlrSF', 'MSSubClass', 'TotalBsmtSF', 'WoodDeckSF', 'BsmtFinSF1', 'OpenPorchSF', 
               'MasVnrArea', 'EnclosedPorch', 'BsmtHalfBath', 'ScreenPorch', 'BsmtFinSF2', 'KitchenAbvGr', '3SsnPorch', 'LotArea']
      for df in datasets:
          for s in skews:
              df[s] = np.log1p(df[s])
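The effect of `log1p` on a right-skewed feature can be sanity-checked on synthetic data — a lognormal sample stands in for features like `LotArea`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0, sigma=1, size=10_000))  # heavily right-skewed

print('skew before:', x.skew())
print('skew after: ', np.log1p(x).skew())  # much closer to 0
```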
      corr = train.select_dtypes(exclude=[object]).corr()
      print(corr['SalePrice'].sort_values(ascending=False)[:22], '\n')
      numeric_data = train[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF', 
                                   'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'Fireplaces', 'BsmtFinSF1',
                                  'WoodDeckSF', '2ndFlrSF', 'OpenPorchSF', 'HalfBath', 'LotArea', 'BsmtFullBath', 'BsmtUnfSF']]
      corr = numeric_data.corr()
      plt.figure(figsize=(12, 12))
      mask = np.zeros_like(corr)
      mask[np.triu_indices_from(mask)] = True
      sns.heatmap(corr, vmax=1, square=True, annot=True, mask=mask, cbar=False, linewidths=0.1)
      numeric_data_select = train[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageArea', 'FullBath', 'TotalBsmtSF', 
                                          'YearBuilt', 'MasVnrArea', 'Fireplaces', 'BsmtFinSF1', 'WoodDeckSF', 'OpenPorchSF',
                                          'HalfBath', 'LotArea']]
      corr_select = numeric_data_select.corr()
      plt.figure(figsize=(8, 8))
      mask = np.zeros_like(corr_select)
      mask[np.triu_indices_from(mask)] = True
      sns.heatmap(corr_select, vmax=1, square=True, annot=True, mask=mask, cbar=False, linewidths=0.1)

      sns.pairplot(numeric_data_select, height=2)

      plt.rcParams['figure.figsize'] = (12, 4)
      sns.boxplot(train['OverallQual'], target)
      sns.boxplot(train['FullBath'], target)
      plt.scatter(train['GrLivArea'], target)
      plt.scatter(train['GarageArea'], target)
      plt.scatter(train['TotalBsmtSF'], target)
      plt.scatter(train['MasVnrArea'], target)
      sns.boxplot(train['Fireplaces'], target)
      plt.scatter(train['YearBuilt'], target)
      plt.scatter(train['BsmtFinSF1'], target)
      plt.scatter(train['WoodDeckSF'], target)
      #Removing outliers
      index_remove = train[train['GrLivArea'] > 8.5].index.tolist()+train[train['GarageArea'] > 1200].index.tolist()+train[train['TotalBsmtSF'] > 8.2].index.tolist()+train[train['BsmtFinSF1'] > 8].index.tolist()
      index_remove = list(set(index_remove))  # remove duplicate indices
      train = train.drop(train.index[index_remove], axis=0)
      train = train[train['SalePrice'] <= 550000]
      target = np.log1p(train['SalePrice'])  # recompute target so it aligns with the filtered rows
      #Categorical data manipulation
      categorical_data = train.select_dtypes(include=[object])
      plt.rcParams['figure.figsize'] = (12, 7)
      sns.boxplot(train['ExterQual'], target)
      sns.boxplot(train['BsmtQual'], target)
      sns.boxplot(train['BsmtExposure'], target)
      sns.boxplot(train['GarageFinish'], target)
      sns.boxplot(train['SaleCondition'], target)
      sns.boxplot(train['CentralAir'], target)
      sns.boxplot(train['KitchenQual'], target)
      train_ExterQual_dummy = pd.get_dummies(train['ExterQual'], prefix='ExterQual')
      test_ExterQual_dummy = pd.get_dummies(test['ExterQual'], prefix='ExterQual')
      train_BsmtQual_dummy = pd.get_dummies(train['BsmtQual'], prefix='BsmtQual')
      test_BsmtQual_dummy = pd.get_dummies(test['BsmtQual'], prefix='BsmtQual')
      train_BsmtExposure_dummy = pd.get_dummies(train['BsmtExposure'], prefix='BsmtExposure')
      test_BsmtExposure_dummy = pd.get_dummies(test['BsmtExposure'], prefix='BsmtExposure')
      train_GarageFinish_dummy = pd.get_dummies(train['GarageFinish'], prefix='GarageFinish')
      test_GarageFinish_dummy = pd.get_dummies(test['GarageFinish'], prefix='GarageFinish')
      train_SaleCondition_dummy = pd.get_dummies(train['SaleCondition'], prefix='SaleCondition')
      test_SaleCondition_dummy = pd.get_dummies(test['SaleCondition'], prefix='SaleCondition')
      train_CentralAir_dummy = pd.get_dummies(train['CentralAir'], prefix='CentralAir')
      test_CentralAir_dummy = pd.get_dummies(test['CentralAir'], prefix='CentralAir')
      train_KitchenQual_dummy = pd.get_dummies(train['KitchenQual'], prefix='KitchenQual')
      test_KitchenQual_dummy = pd.get_dummies(test['KitchenQual'], prefix='KitchenQual')
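The fourteen nearly identical `get_dummies` calls above can be collapsed: passed a DataFrame, `pd.get_dummies` encodes every column in one call and prefixes each dummy with its source column name by default. A toy sketch (two columns stand in for the seven used here):

```python
import pandas as pd

cat_cols = ['ExterQual', 'BsmtQual']
frame = pd.DataFrame({'ExterQual': ['Gd', 'TA'], 'BsmtQual': ['Ex', 'Gd']})

dummies = pd.get_dummies(frame[cat_cols])
print(sorted(dummies.columns))
# ['BsmtQual_Ex', 'BsmtQual_Gd', 'ExterQual_Gd', 'ExterQual_TA']
```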
      # Define a model evaluation function that reports R2 score and mean squared error (using 10-fold cross-validation).
      def model_eval(model):
          model_fit =, y)
          R2 = cross_val_score(model_fit, X, y, cv=10, scoring='r2').mean()
          MSE = -cross_val_score(model_fit, X, y, cv=10, scoring='neg_mean_squared_error').mean()
          print('R2 Score:', R2, '|', 'MSE:', MSE)
      data = train.select_dtypes(exclude=[object])
      y = np.log1p(data['SalePrice'])
      X = data.drop(['Id', 'SalePrice'], axis=1)
      X = pd.concat([X, train_ExterQual_dummy, train_BsmtQual_dummy, train_GarageFinish_dummy, train_BsmtExposure_dummy,
                    train_SaleCondition_dummy, train_CentralAir_dummy, train_KitchenQual_dummy], axis=1)
      lr = LinearRegression()
      ri = Ridge(alpha=0.1)  # 'normalize' was removed in recent scikit-learn; it defaulted to False anyway
      ricv = RidgeCV(cv=5)
      gdb = GradientBoostingRegressor(n_estimators=200)
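`RidgeCV` picks its regularization strength by cross-validation; by default it searches only `(0.1, 1.0, 10.0)`, so passing an explicit grid usually helps. A hedged sketch on synthetic data (the grid and `make_regression` setup are illustrative, not from the post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

alphas = np.logspace(-3, 3, 13)  # 0.001 ... 1000 on a log grid
ricv_demo = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)
print('best alpha:', ricv_demo.alpha_)
```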
      for model in [lr, ri, ricv, gdb]:
          model_eval(model)
      test_id = test['Id']
      test = test.select_dtypes(exclude=[object]).drop('Id', axis=1)
      test = pd.concat([test, test_ExterQual_dummy, test_BsmtQual_dummy, test_GarageFinish_dummy, test_BsmtExposure_dummy,
                    test_SaleCondition_dummy, test_CentralAir_dummy, test_KitchenQual_dummy], axis=1)
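One pitfall here: `get_dummies` run separately on train and test can yield different column sets when a category level appears in only one split, and the column orders and counts must match before calling `predict`. Reindexing the test matrix to the training columns guards against this — a toy sketch with hypothetical dummy columns:

```python
import pandas as pd

# Hypothetical design matrices: test is missing 'Qual_Ex' and has an extra level.
X_train = pd.DataFrame({'Qual_Gd': [1, 0], 'Qual_Ex': [0, 1]})
X_test = pd.DataFrame({'Qual_Gd': [1], 'Qual_Fa': [0]})

# Force the test columns to match training; unseen dummies become all-zero.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(list(X_test.columns))  # ['Qual_Gd', 'Qual_Ex']
```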

      pred = ri.predict(test)

      pred = np.expm1(pred)
      prediction = pd.DataFrame({'Id':test_id, 'SalePrice':pred})
      prediction.to_csv('Prediction1.csv', index=False)
      plt.scatter(cross_val_predict(lr, X, y), y)
      plt.xlabel('Predicted Values')
      plt.ylabel('True Values')

      The file is saved alongside your notebook as Prediction1.csv, ready for submission.
