[kaggle] Learn Tutorial_Intro to Machine Learning (Notes)_3 (Practice Code)
KAGGLE / Intro to Machine Learning
Ykl 2022. 11. 5. 18:26

In [1]:
ls
A quick look at data
In [26]:
import pandas as pd
iowa_file_path = 'train.csv'
home_data = pd.read_csv(iowa_file_path)
home_data.describe()
Out[26]:
In [27]:
avg_lot_size = round(home_data['LotArea'].mean())
print(f'avg_lot_size : {avg_lot_size}')
newest_home_age = 2022 - round(home_data['YearBuilt'].max())
print(f'newest_home_age: {newest_home_age}')
Step 1: Specify Prediction Target
Step 2: Create X
In [28]:
home_data.columns
Out[28]:
👆 prediction target : 'SalePrice'
In [63]:
#home_data = home_data.dropna(axis=0)
#no need to run this code on this data
In [31]:
y = home_data.SalePrice
home_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[home_features]
print(X.describe())
print(X.head())
Step 3: Specify and Fit Model
In [37]:
from sklearn.tree import DecisionTreeRegressor
home_model = DecisionTreeRegressor(random_state=1)
home_model.fit(X, y)
#learned pattern of the data
#check whether the prediction is correct with 5 rows of the data
print(f'Making predictions for the following 5 houses: {X.head()}')
print(f'The real answers are\n {y.head()}')
print(f'The predictions are\n {home_model.predict(X.head())}')
#seems correct
Step 4: Make Predictions
In [38]:
prediction = home_model.predict(X)
print(y[:10])
print(prediction[:10])
Model Validation
This covers validating the model by splitting the data into a training set and a validation set.
Step 1: Split Your Data
Step 2: Specify and Fit the Model
Step 3: Make Predictions with Validation data
In [40]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
home_model.fit(train_X, train_y)
val_predictions = home_model.predict(val_X)
print(f'val_prediction : \n{val_predictions[:10]}')
print(f'val_y : \n{val_y[:10]}')
Step 4: Calculate the Mean Absolute Error in Validation Data
In [42]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)
print(f'val_mae: {val_mae}')
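As a sanity check on what `mean_absolute_error` computes: MAE is just the average of the absolute differences between predictions and actual values. A minimal sketch with made-up prices (not taken from train.csv):

```python
import numpy as np

# MAE by hand: mean of |prediction - actual|.
# Illustrative prices only, not from the Iowa data.
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 290000])

mae = np.mean(np.abs(predicted - actual))
print(mae)  # 10000.0
```

Each prediction here is off by 10,000, so the MAE is 10,000 — the same number sklearn returns on these arrays.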
Underfitting and Overfitting
Define a helper function to check for underfitting and overfitting: train several tree models, each with a different maximum number of leaf nodes, then compare each model's MAE and pick the optimally sized tree.
In [43]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    val_preds = model.predict(val_X)
    mae = mean_absolute_error(val_y, val_preds)
    return mae
In [51]:
max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size : get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in max_leaf_nodes}
print(scores)
best_tree_size = min(scores, key=scores.get)
print(f'best_tree_size : {best_tree_size}')
Step 2: Fit Model Using All Data
In [52]:
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)
Out[52]:
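Once the final model is fit on all the data, it can score new rows with `predict`. A minimal sketch with stand-in data (the real notebook fits on `X` and `y` from train.csv with `max_leaf_nodes=best_tree_size`; the rows and the leaf count below are illustrative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Stand-in training rows; the real notebook uses X, y built from train.csv.
X = pd.DataFrame({'LotArea': [8000, 9600, 11250, 9550],
                  'YearBuilt': [2003, 1976, 2001, 1915]})
y = pd.Series([208500, 181500, 223500, 140000])

final_model = DecisionTreeRegressor(max_leaf_nodes=4, random_state=1)
final_model.fit(X, y)

# Predict the price of a previously unseen house.
new_home = pd.DataFrame({'LotArea': [10000], 'YearBuilt': [1995]})
print(final_model.predict(new_home))
```

A regression tree always predicts the mean of some leaf's training targets, so the output lies within the range of the prices it was trained on.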
Random Forests
In [55]:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_preds = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(rf_val_preds, val_y)
print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))
print(scores[best_tree_size])
In [57]:
from google.colab import drive
drive.mount('/content/drive')
In [62]:
!jupyter nbconvert --to html "/content/drive/MyDrive/Kaggle/project0/kaggle_learn_ml.ipynb"