Geeks of Coding

Kaggle House Prices: Advanced Regression Techniques

Abhishek Tyagi


        Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
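The metric can be computed directly. A minimal sketch with NumPy (the helper name `rmsle` and the sample prices are illustrative, not part of the competition code):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root-mean-squared error between log(1 + y) values."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A 10% overshoot on a cheap house and on an expensive house
# contribute almost identically once logs are taken.
cheap = rmsle(np.array([100_000.0]), np.array([110_000.0]))
expensive = rmsle(np.array([1_000_000.0]), np.array([1_100_000.0]))
print(cheap, expensive)
```

Both calls print roughly the same value, which is exactly the equal-weighting behaviour described above.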

        To download the code, see the GitHub link –

          File descriptions
          train.csv – the training set
          test.csv – the test set
          data_description.txt – full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
          sample_submission.csv – a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
          Data fields
          Here’s a brief version of what you’ll find in the data description file.

          SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
          MSSubClass: The building class
          MSZoning: The general zoning classification
          LotFrontage: Linear feet of street connected to property
          LotArea: Lot size in square feet
          Street: Type of road access
          Alley: Type of alley access
          LotShape: General shape of property
          LandContour: Flatness of the property
          Utilities: Type of utilities available
          LotConfig: Lot configuration
          LandSlope: Slope of property
          Neighborhood: Physical locations within Ames city limits
          Condition1: Proximity to main road or railroad
          Condition2: Proximity to main road or railroad (if a second is present)
          BldgType: Type of dwelling
          HouseStyle: Style of dwelling
          OverallQual: Overall material and finish quality
          OverallCond: Overall condition rating
          YearBuilt: Original construction date
          YearRemodAdd: Remodel date
          RoofStyle: Type of roof
          RoofMatl: Roof material
          Exterior1st: Exterior covering on house
          Exterior2nd: Exterior covering on house (if more than one material)
          MasVnrType: Masonry veneer type
          MasVnrArea: Masonry veneer area in square feet
          ExterQual: Exterior material quality
          ExterCond: Present condition of the material on the exterior
          Foundation: Type of foundation
          BsmtQual: Height of the basement
          BsmtCond: General condition of the basement
          BsmtExposure: Walkout or garden level basement walls
          BsmtFinType1: Quality of basement finished area
          BsmtFinSF1: Type 1 finished square feet
          BsmtFinType2: Quality of second finished area (if present)
          BsmtFinSF2: Type 2 finished square feet
          BsmtUnfSF: Unfinished square feet of basement area
          TotalBsmtSF: Total square feet of basement area
          Heating: Type of heating
          HeatingQC: Heating quality and condition
          CentralAir: Central air conditioning
          Electrical: Electrical system
          1stFlrSF: First Floor square feet
          2ndFlrSF: Second floor square feet
          LowQualFinSF: Low quality finished square feet (all floors)
          GrLivArea: Above grade (ground) living area square feet
          BsmtFullBath: Basement full bathrooms
          BsmtHalfBath: Basement half bathrooms
          FullBath: Full bathrooms above grade
          HalfBath: Half baths above grade
          Bedroom: Number of bedrooms above basement level
          Kitchen: Number of kitchens
          KitchenQual: Kitchen quality
          TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
          Functional: Home functionality rating
          Fireplaces: Number of fireplaces
          FireplaceQu: Fireplace quality
          GarageType: Garage location
          GarageYrBlt: Year garage was built
          GarageFinish: Interior finish of the garage
          GarageCars: Size of garage in car capacity
          GarageArea: Size of garage in square feet
          GarageQual: Garage quality
          GarageCond: Garage condition
          PavedDrive: Paved driveway
          WoodDeckSF: Wood deck area in square feet
          OpenPorchSF: Open porch area in square feet
          EnclosedPorch: Enclosed porch area in square feet
          3SsnPorch: Three season porch area in square feet
          ScreenPorch: Screen porch area in square feet
          PoolArea: Pool area in square feet
          PoolQC: Pool quality
          Fence: Fence quality
          MiscFeature: Miscellaneous feature not covered in other categories
          MiscVal: $Value of miscellaneous feature
          MoSold: Month Sold
          YrSold: Year Sold
          SaleType: Type of sale
          SaleCondition: Condition of sale
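The benchmark behind sample_submission.csv is described above as a linear regression on year and month of sale, lot square footage, and number of bedrooms. Kaggle does not publish that exact model, so the sketch below is only an approximation, fitted here on a few illustrative stand-in rows; in the real train.csv the bedroom column is named `BedroomAbvGr` (abbreviated to "Bedroom" in the list above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for a few rows of train.csv (values are illustrative only).
train = pd.DataFrame({
    'YrSold':       [2006, 2007, 2008, 2009, 2010, 2006, 2008, 2009],
    'MoSold':       [2, 5, 6, 7, 9, 11, 3, 8],
    'LotArea':      [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382],
    'BedroomAbvGr': [3, 3, 3, 3, 4, 1, 3, 3],
    'SalePrice':    [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

features = ['YrSold', 'MoSold', 'LotArea', 'BedroomAbvGr']
baseline = LinearRegression().fit(train[features], train['SalePrice'])
pred = baseline.predict(train[features])
```

Beating this four-feature baseline is the minimum bar for any submission.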

        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        %matplotlib inline
        import seaborn as sns
        from sklearn.model_selection import cross_val_score, cross_val_predict
        from sklearn.model_selection import GridSearchCV
        from sklearn.preprocessing import scale
        from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
        from sklearn.ensemble import GradientBoostingRegressor
        pd.set_option('display.max_columns', 85)
        train = pd.read_csv('train.csv')
        test = pd.read_csv('test.csv')
        print('train shape:', train.shape, '\n', 'test shape:', test.shape)
        missing_numeric = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])
        missing_numeric = missing_numeric[(missing_numeric['train']>0) | (missing_numeric['test']>0)]
        missing_numeric.sort_values(by=['train', 'test'], ascending=False)
        # Drop the features which I'm not interested in 
        feature_drop = ['PoolQC', 'MiscFeature', 'Fence', 'FireplaceQu', 'LotFrontage', 'GarageYrBlt', 'MoSold', 'YrSold', 
                        'LowQualFinSF', 'MiscVal', 'PoolArea']
        datasets = [train, test]
        for df in datasets:
            df.drop(feature_drop, axis=1, inplace=True)
            df.loc[df['Alley'].isnull(), 'Alley'] = 'NoAlley'
            # If a house has no garage, it has missing values on the garage-related features, so fill NaNs with 'NoGarage'.
            df.loc[df['GarageCond'].isnull(), 'GarageCond'] = 'NoGarage'
            df.loc[df['GarageQual'].isnull(), 'GarageQual'] = 'NoGarage'
            df.loc[df['GarageType'].isnull(), 'GarageType'] = 'NoGarage'
            df.loc[df['GarageFinish'].isnull(), 'GarageFinish'] = 'NoGarage'
            # If a house has no basement, it has missing values on the basement-related features, so fill NaNs with 'NoBsmt'.
            df.loc[df['BsmtExposure'].isnull(), 'BsmtExposure'] = 'NoBsmt'
            df.loc[df['BsmtFinType2'].isnull(), 'BsmtFinType2'] = 'NoBsmt'
            df.loc[df['BsmtCond'].isnull(), 'BsmtCond'] = 'NoBsmt'
            df.loc[df['BsmtQual'].isnull(), 'BsmtQual'] = 'NoBsmt'
            df.loc[df['BsmtFinType1'].isnull(), 'BsmtFinType1'] = 'NoBsmt'
            # Masonry veneer features: fill with 'None' / 0 if there is no masonry veneer.
            df.loc[df['MasVnrType'].isnull(), 'MasVnrType'] = 'None'
            df.loc[df['MasVnrArea'].isnull(), 'MasVnrArea'] = 0
        train['Electrical'] = train['Electrical'].fillna(train['Electrical'].mode()[0])
        test_numeric_missing = ['BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'GarageArea', 'GarageCars', 'TotalBsmtSF']
        test_categorical_missing = ['MSZoning', 'Functional', 'Utilities', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType']
        for i in test_numeric_missing:
            test[i] = test[i].fillna(0)
        for j in test_categorical_missing:
            test[j] = test[j].fillna(test[j].mode()[0])
        # Check the missing values again for datasets
        missing_numeric = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])
        missing_numeric = missing_numeric[(missing_numeric['train']>0) | (missing_numeric['test']>0)]
        missing_numeric.sort_values(by=['train', 'test'], ascending=False)


        print(train['SalePrice'].describe(), '\n')
        print('Before Transformation Skew: ', train['SalePrice'].skew())
        target = np.log1p(train['SalePrice'])
        print('Log Transformation Skew: ', target.skew())
        plt.rcParams['figure.figsize'] = (12, 5)
        target_log_tran = pd.DataFrame({'before transformation': train['SalePrice'], 'log transformation': target})
        skewness = pd.DataFrame({'Skewness':train.select_dtypes(exclude=[object]).skew()})
        print(skewness[skewness['Skewness']>0.8].sort_values(by='Skewness'), '\n')  
        skews = ['2ndFlrSF', 'BsmtUnfSF', 'GrLivArea', '1stFlrSF', 'MSSubClass', 'TotalBsmtSF', 'WoodDeckSF', 'BsmtFinSF1', 'OpenPorchSF', 
                 'MasVnrArea', 'EnclosedPorch', 'BsmtHalfBath', 'ScreenPorch', 'BsmtFinSF2', 'KitchenAbvGr', '3SsnPorch', 'LotArea']
        for df in datasets:
            for s in skews:
                df[s] = np.log1p(df[s])
        corr = train.select_dtypes(exclude=[object]).corr()
        print(corr['SalePrice'].sort_values(ascending=False)[:22], '\n')
        numeric_data = train[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF', 
                                     'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'Fireplaces', 'BsmtFinSF1',
                                    'WoodDeckSF', '2ndFlrSF', 'OpenPorchSF', 'HalfBath', 'LotArea', 'BsmtFullBath', 'BsmtUnfSF']]
        corr = numeric_data.corr()
        plt.figure(figsize=(12, 12))
        mask = np.zeros_like(corr)
        mask[np.triu_indices_from(mask)] = True
        sns.heatmap(corr, vmax=1, square=True, annot=True, mask=mask, cbar=False, linewidths=0.1)
        numeric_data_select = train[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageArea', 'FullBath', 'TotalBsmtSF', 
                                            'YearBuilt', 'MasVnrArea', 'Fireplaces', 'BsmtFinSF1', 'WoodDeckSF', 'OpenPorchSF',
                                            'HalfBath', 'LotArea']]
        corr_select = numeric_data_select.corr()
        plt.figure(figsize=(8, 8))
        mask = np.zeros_like(corr_select)
        mask[np.triu_indices_from(mask)] = True
        sns.heatmap(corr_select, vmax=1, square=True, annot=True, mask=mask, cbar=False, linewidths=0.1)

        sns.pairplot(numeric_data_select, height=2)

        plt.rcParams['figure.figsize'] = (12, 4)
        sns.boxplot(x=train['OverallQual'], y=target)
        sns.boxplot(x=train['FullBath'], y=target)
        plt.scatter(train['GrLivArea'], target)
        plt.scatter(train['GarageArea'], target)
        plt.scatter(train['TotalBsmtSF'], target)
        plt.scatter(train['MasVnrArea'], target)
        sns.boxplot(x=train['Fireplaces'], y=target)
        plt.scatter(train['YearBuilt'], target)
        plt.scatter(train['BsmtFinSF1'], target)
        plt.scatter(train['WoodDeckSF'], target)
        # Removing outliers (thresholds are on the log-transformed features)
        index_remove = (train[train['GrLivArea'] > 8.5].index.tolist()
                        + train[train['GarageArea'] > 1200].index.tolist()
                        + train[train['TotalBsmtSF'] > 8.2].index.tolist()
                        + train[train['BsmtFinSF1'] > 8].index.tolist())
        index_remove = list(set(index_remove))  # drop duplicate indices
        train = train.drop(train.index[index_remove], axis=0)
        train = train[train['SalePrice'] <= 550000]
        target = np.log1p(train['SalePrice'])  # recompute the target after dropping rows
        # Categorical data manipulation
        categorical_data = train.select_dtypes(include=[object])
        plt.rcParams['figure.figsize'] = (12, 7)
        sns.boxplot(x=train['ExterQual'], y=target)
        sns.boxplot(x=train['BsmtQual'], y=target)
        sns.boxplot(x=train['BsmtExposure'], y=target)
        sns.boxplot(x=train['GarageFinish'], y=target)
        sns.boxplot(x=train['SaleCondition'], y=target)
        sns.boxplot(x=train['CentralAir'], y=target)
        sns.boxplot(x=train['KitchenQual'], y=target)
        train_ExterQual_dummy = pd.get_dummies(train['ExterQual'], prefix='ExterQual')
        test_ExterQual_dummy = pd.get_dummies(test['ExterQual'], prefix='ExterQual')
        train_BsmtQual_dummy = pd.get_dummies(train['BsmtQual'], prefix='BsmtQual')
        test_BsmtQual_dummy = pd.get_dummies(test['BsmtQual'], prefix='BsmtQual')
        train_BsmtExposure_dummy = pd.get_dummies(train['BsmtExposure'], prefix='BsmtExposure')
        test_BsmtExposure_dummy = pd.get_dummies(test['BsmtExposure'], prefix='BsmtExposure')
        train_GarageFinish_dummy = pd.get_dummies(train['GarageFinish'], prefix='GarageFinish')
        test_GarageFinish_dummy = pd.get_dummies(test['GarageFinish'], prefix='GarageFinish')
        train_SaleCondition_dummy = pd.get_dummies(train['SaleCondition'], prefix='SaleCondition')
        test_SaleCondition_dummy = pd.get_dummies(test['SaleCondition'], prefix='SaleCondition')
        train_CentralAir_dummy = pd.get_dummies(train['CentralAir'], prefix='CentralAir')
        test_CentralAir_dummy = pd.get_dummies(test['CentralAir'], prefix='CentralAir')
        train_KitchenQual_dummy = pd.get_dummies(train['KitchenQual'], prefix='KitchenQual')
        test_KitchenQual_dummy = pd.get_dummies(test['KitchenQual'], prefix='KitchenQual')
        # Define a model evaluation function reporting the R2 score and mean squared error (10-fold cross-validation).
        def model_eval(model):
  , y)
            R2 = cross_val_score(model, X, y, cv=10, scoring='r2').mean()
            MSE = -cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error').mean()
            print('R2 Score:', R2, '|', 'MSE:', MSE)
        data = train.select_dtypes(exclude=[object])
        y = np.log1p(data['SalePrice'])
        X = data.drop(['Id', 'SalePrice'], axis=1)
        X = pd.concat([X, train_ExterQual_dummy, train_BsmtQual_dummy, train_GarageFinish_dummy, train_BsmtExposure_dummy,
                      train_SaleCondition_dummy, train_CentralAir_dummy, train_KitchenQual_dummy], axis=1)
        lr = LinearRegression()
        ri = Ridge(alpha=0.1)
        ricv = RidgeCV(cv=5)
        gdb = GradientBoostingRegressor(n_estimators=200)
        for model in [lr, ri, ricv, gdb]:
            model_eval(model)
        test_id = test['Id']
        test = test.select_dtypes(exclude=[object]).drop('Id', axis=1)
        test = pd.concat([test, test_ExterQual_dummy, test_BsmtQual_dummy, test_GarageFinish_dummy, test_BsmtExposure_dummy,
                      test_SaleCondition_dummy, test_CentralAir_dummy, test_KitchenQual_dummy], axis=1)

        pred = ri.predict(test)

        pred = np.expm1(pred)
        prediction = pd.DataFrame({'Id':test_id, 'SalePrice':pred})
        prediction.to_csv('Prediction1.csv', index=False)
        plt.scatter(cross_val_predict(lr, X, y), y)
        plt.xlabel('Predicted Values')
        plt.ylabel('True Values')

        The submission file is saved alongside your notebook as Prediction1.csv.
