Version: 1.0
Last Updated: 2025-12-12
Language: English
Course: Kaggle Learn – Intro to Machine Learning
Dataset: Iowa Housing Prices (Ames Housing dataset)
- Course Overview
- Prerequisites
- Environment Setup
- Lesson 1: How Models Work
- Lesson 2: Basic Data Exploration
- Lesson 3: Your First Machine Learning Model
- Lesson 4: Model Validation
- Lesson 5: Underfitting and Overfitting
- Lesson 6: Random Forests
- Lesson 7: Machine Learning Competitions
- Cheat Sheet
- Troubleshooting
- Next Steps
- Resources
Intro to Machine Learning is a beginner course that covers the core concepts of machine learning and walks you through building predictive models with scikit-learn.

- ✅ Machine learning fundamentals (supervised learning, regression)
- ✅ Data exploration with pandas
- ✅ Building Decision Tree & Random Forest models
- ✅ Model evaluation with Mean Absolute Error (MAE)
- ✅ Train/validation splits to avoid overfitting
- ✅ Hyperparameter tuning to improve accuracy
- ✅ Submitting predictions to a Kaggle competition
- Python 3.x
- pandas – data manipulation
- scikit-learn – machine learning models
- numpy – numerical operations
- Kaggle Notebooks – cloud environment (optional)

Iowa Housing Prices – predict house prices from 79 features (lot size, year built, number of rooms, etc.).
Required:

- Python basics (variables, functions, loops, conditionals)
- Basic math (mean, median, elementary algebra)
- Curiosity & persistence

Not required:

- ❌ Advanced math/statistics background
- ❌ Prior coding experience (though it helps)
- ❌ Any machine learning knowledge (this is a beginner course)
Option 1: Kaggle Notebooks (easiest)

- Free, cloud-based, dataset already available
- Link: https://www.kaggle.com/learn/intro-to-machine-learning

Option 2: Local Jupyter Notebook

```bash
# Install the dependencies (matplotlib and seaborn are optional, for plotting)
pip install pandas scikit-learn numpy matplotlib seaborn jupyter

# Download the dataset from Kaggle, then start Jupyter
jupyter notebook
```

Standard imports and data loading used throughout the course:

```python
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Path to the dataset file
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Load the data with pandas
home_data = pd.read_csv(iowa_file_path)

# Preview the data
print(home_data.head())
print(home_data.describe())
```

Machine learning = teaching a computer to "learn" from data so it can make predictions or decisions.
- Input: data with a label/target (e.g. house prices)
- Output: a model that can predict the label for new data
- Example: predicting a house price from its features (area, location, year built)
A model that makes decisions by repeatedly "splitting" the data on feature values.

Example Decision Tree (simplified):

```
                 [All Houses]
                /            \
    [YearBuilt > 2000?]  [YearBuilt ≤ 2000?]
      /         \           /         \
 [Large Area] [Small]  [Good Cond] [Poor Cond]
   → $300k    → $200k    → $180k    → $120k
```
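Once a tree is fitted (Lesson 3), you can print its learned splits with scikit-learn's `export_text`. A minimal sketch on made-up data (the feature names and prices here are invented for illustration):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: price driven mostly by year built and area
X = [[1995, 50], [2005, 120], [1980, 60], [2010, 200], [1970, 40], [2015, 150]]
y = [150_000, 300_000, 140_000, 420_000, 100_000, 380_000]

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned splits as text, one line per node
rules = export_text(tree, feature_names=["YearBuilt", "Area"])
print(rules)
```

Each `|---` line is a split condition; the `value:` lines are the predicted prices at the leaves.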
| Term | Definition |
|---|---|
| Features (X) | Input variables (e.g. lot area, year built) |
| Target (y) | The variable you want to predict (e.g. house price) |
| Training Data | Data used to train the model |
| Prediction | The model's output for new data |
| Model | The algorithm that "learns" patterns from the data |
```python
import pandas as pd

# Load the data
home_data = pd.read_csv('train.csv')

# First 5 rows
print(home_data.head())

# Descriptive statistics
print(home_data.describe())

# All column names
print(home_data.columns)

# Dataset info (dtypes, missing values)
print(home_data.info())
```

| Method | Purpose |
|---|---|
| `.head(n)` | First n rows (default 5) |
| `.tail(n)` | Last n rows |
| `.describe()` | Descriptive statistics (mean, std, min, max, etc.) |
| `.info()` | Dtypes & missing-value info |
| `.shape` | Dataset dimensions (rows, columns) |
| `.columns` | List of column names |
| `.isnull().sum()` | Missing-value count per column |

Sample `describe()` output:

```
       SalePrice    LotArea  YearBuilt
count    1460.00    1460.00    1460.00
mean   180921.20   10516.83    1971.27
std     79442.50    9981.26      30.20
min     34900.00    1300.00    1872.00
25%    129975.00    7553.50    1954.00
50%    163000.00    9478.50    1973.00
75%    214000.00   11601.50    2000.00
max    755000.00  215245.00    2010.00
```
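The exploration methods above can be tried on a tiny hand-made DataFrame (a self-contained sketch; the column names and values just mimic the Iowa data):

```python
import pandas as pd
import numpy as np

# Tiny DataFrame with deliberate gaps (np.nan = missing value)
df = pd.DataFrame({
    "LotArea":   [8450, np.nan, 11250],
    "YearBuilt": [2003, 1976, np.nan],
    "SalePrice": [208500, 181500, 223500],
})

missing = df.isnull().sum()   # NaN count per column
print(missing)                # LotArea: 1, YearBuilt: 1, SalePrice: 0
print(df.shape)               # (3, 3)
```

The same calls work unchanged on the full 1,460-row Iowa dataset.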
- Define the target (y) – the column you want to predict
- Choose features (X) – the input columns for the model
- Define the model – pick an algorithm (e.g. Decision Tree)
- Fit the model – train it on the data
- Predict – generate predictions
```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# 1. Load the data
iowa_file_path = 'train.csv'
home_data = pd.read_csv(iowa_file_path)

# 2. Define the target (y)
y = home_data.SalePrice

# 3. Choose features (X)
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# 4. Define the model
iowa_model = DecisionTreeRegressor(random_state=1)

# 5. Fit the model
iowa_model.fit(X, y)

# 6. Predict (on the training data, for demonstration only)
predictions = iowa_model.predict(X)
print("Predictions:", predictions[:5])
print("Actual values:", y.head().values)
```

- `random_state=1` – makes results reproducible (the same on every run).
- `fit(X, y)` – trains the model on features (X) and target (y).
- `predict(X)` – generates predictions.

Warning: predictions on the training data will look extremely accurate (even perfect) – this is misleading!
For a proper evaluation, use validation data (Lesson 4).
In-sample evaluation = evaluating the model on the same data it was trained on – misleading (the model may be overfitting).

Split the data into:

- Training set (75%) – used to train the model
- Validation set (25%) – used to evaluate it
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Split the data (default: 75% train / 25% validation)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define & fit the model (on training data only)
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)

# Predict on the validation data
val_predictions = iowa_model.predict(val_X)

# Calculate MAE (Mean Absolute Error)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE:", val_mae)
```

Formula:

MAE = (1/n) × Σ |actual − predicted|

Interpretation:

"On average, the model's predictions are off by about $X from the actual price."

Example:

- MAE = 25,000 → the average error is $25,000
- The lower the MAE, the better the model
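The formula can be checked by hand against scikit-learn's `mean_absolute_error` (the prices below are made up):

```python
from sklearn.metrics import mean_absolute_error

actual    = [200_000, 150_000, 320_000]
predicted = [210_000, 145_000, 290_000]

# By hand: mean of |actual - predicted| = (10000 + 5000 + 30000) / 3
manual_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae)   # 15000.0
print(sklearn_mae)  # 15000.0
```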
| Metric | Training Data | Validation Data |
|---|---|---|
| MAE | ~500 (very low) | ~29,000 (realistic) |
| Interpretation | Model "memorizes" the training data | Model performance on new data |
| Conclusion | Misleading (overfitting) | ✅ Realistic |
Underfitting = the model is too simple → it misses important patterns → high MAE
Overfitting = the model is too complex → it "memorizes" the training data → low MAE on training, high MAE on validation

```
MAE
 |
 | Underfitting           Overfitting
 |   \                       /
 |    \____             ____/
 |         \___________/
 |       Sweet Spot (Optimal)
 |___________________________________ Model Complexity
   (shallow tree)         (deep tree)
```

`max_leaf_nodes` = the maximum number of leaves (final groups) in a Decision Tree.

- Low (e.g. 5) → shallow tree → underfitting
- High (e.g. 5000) → deep tree → overfitting
- Optimal (e.g. 100) → sweet spot → lowest MAE
```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)

# Try a range of values
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}

# Find the best one
best_tree_size = min(scores, key=scores.get)
print(f"Optimal max_leaf_nodes: {best_tree_size}")
print(f"Best MAE: {scores[best_tree_size]:,.0f}")

# Fit the final model with the optimal parameter
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)  # fit on ALL the data
```

Typical results:

```
max_leaf_nodes=5   → MAE: 35,044  (underfitting)
max_leaf_nodes=25  → MAE: 29,016
max_leaf_nodes=50  → MAE: 27,405
max_leaf_nodes=100 → MAE: 27,282  ← OPTIMAL
max_leaf_nodes=250 → MAE: 27,893
max_leaf_nodes=500 → MAE: 29,454  (overfitting)
```
Even after tuning, a single Decision Tree has limitations:

- Sensitive to small changes in the data
- The underfitting/overfitting trade-off is hard to balance

Random Forest = an ensemble of many Decision Trees (default: 100 trees).

How it works:

- Build 100 trees, each with:
  - A random subset of the data (bootstrap sampling)
  - A random subset of the features
- Each tree makes a prediction
- Final prediction = the average of all trees' predictions
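The bootstrap-and-average idea can be sketched by hand (a toy illustration only, on synthetic data; the real `RandomForestRegressor` additionally samples features at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 1, size=200)

n_trees = 25
preds = []
for _ in range(n_trees):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor()
    tree.fit(X[idx], y[idx])
    preds.append(tree.predict(X))

# Final prediction = average over all trees
ensemble_pred = np.mean(preds, axis=0)
print(ensemble_pred.shape)  # (200,)
```

Averaging many high-variance trees is what makes the ensemble more stable than any single tree.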
- ✅ More accurate than a single Decision Tree
- ✅ Resists overfitting (averaging reduces variance)
- ✅ Robust with default parameters (no extensive tuning needed)
- ✅ "Just works" – good performance out of the box
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Define the Random Forest model
forest_model = RandomForestRegressor(random_state=1)

# Fit the model
forest_model.fit(train_X, train_y)

# Predict
forest_preds = forest_model.predict(val_X)

# Evaluate
forest_mae = mean_absolute_error(val_y, forest_preds)
print("Random Forest MAE:", forest_mae)
```

| Model | Validation MAE | Notes |
|---|---|---|
| Decision Tree (tuned) | ~27,282 | After tuning max_leaf_nodes |
| Random Forest (default) | ~21,857 | ~20% better, with no tuning! |
```python
# Custom parameters
forest_model_tuned = RandomForestRegressor(
    n_estimators=200,       # 200 trees (default: 100)
    max_depth=15,           # max depth per tree
    min_samples_split=5,
    random_state=1
)
forest_model_tuned.fit(train_X, train_y)
```

Key parameters:

- `n_estimators` – number of trees (more = more accurate, but slower)
- `max_depth` – maximum depth of each tree (controls overfitting)
- `min_samples_split` – minimum samples required to split a node
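To see the effect of `n_estimators` concretely, here is a small sketch on synthetic data (the MAE values you get on the Iowa dataset will of course differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 4))
y = X[:, 0] * 5 + X[:, 1] * X[:, 2] + rng.normal(0, 2, size=300)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Measure validation MAE for several forest sizes
maes = {}
for n in [10, 50, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    model.fit(train_X, train_y)
    maes[n] = mean_absolute_error(val_y, model.predict(val_X))
print(maes)
```

Gains usually flatten out as `n_estimators` grows, while training time keeps increasing roughly linearly.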
```python
# Fit the model on ALL the training data (not just train_X)
rf_model_on_full_data = RandomForestRegressor(random_state=1)
rf_model_on_full_data.fit(X, y)  # ← ALL the data
```

Note: the validation data has done its job (tuning). The final model should learn from all the data to maximize performance.
```python
# Load the test data
test_data = pd.read_csv('../input/test.csv')

# Select the same features
test_X = test_data[features]

# Predict
test_preds = rf_model_on_full_data.predict(test_X)

# Format the submission
output = pd.DataFrame({
    'Id': test_data.Id,
    'SalePrice': test_preds
})

# Save to CSV
output.to_csv('submission.csv', index=False)
print("Submission file created!")
```

To submit:

- Join the competition: Housing Prices Competition
- Save & run the notebook ("Save Version" → "Save and Run All")
- Open it in the Viewer
- On the "Data" tab, click "submission.csv" → "Submit"
Baseline (7 features):

```python
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF',
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
```

Improved (25 features):

```python
features = [
    'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
    'YearRemodAdd', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'
]
```

Top features for accuracy:

- `OverallQual` ⭐⭐⭐ – overall quality (1–10)
- `GrLivArea` ⭐⭐⭐ – living area (sq ft)
- `YearBuilt` ⭐⭐ – year built
- `TotRmsAbvGrd` ⭐ – total rooms
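Rather than guessing which features matter, a fitted forest can report it through its `feature_importances_` attribute. A self-contained sketch on synthetic data (the column names only mimic the Iowa ones, and the price formula is invented):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea":   rng.uniform(500, 4000, n),
    "MoSold":      rng.integers(1, 13, n),   # pure noise feature
})
# Price driven by quality and area only; MoSold has no effect
y = X["OverallQual"] * 20_000 + X["GrLivArea"] * 50 + rng.normal(0, 5_000, n)

model = RandomForestRegressor(random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The importances sum to 1; the noise feature should land near the bottom, which is a quick sanity check when pruning a feature list.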
```python
rf_model_on_full_data = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    random_state=1
)
```

Or try XGBoost:

```python
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=1)
xgb_model.fit(X, y)
test_preds = xgb_model.predict(test_X)
```

| Model | Validation MAE | Leaderboard Score |
|---|---|---|
| Baseline (7 features) | ~22,000 | ~26,000 |
| Improved (25 features) | ~17,000 | ~20,000 |
| Tuned + 25 features | ~15,000 | ~18,000 |
```python
import pandas as pd

# Load data
df = pd.read_csv('file.csv')

# Inspect
df.head()            # first 5 rows
df.describe()        # descriptive statistics
df.info()            # dtypes & missing-value info
df.columns           # column names
df.shape             # (rows, columns)

# Select
df['column']         # select 1 column (Series)
df[['col1', 'col2']] # select multiple columns (DataFrame)
```

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# 1. Split the data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# 2. Define the model
model = RandomForestRegressor(random_state=1)

# 3. Fit the model
model.fit(train_X, train_y)

# 4. Predict
predictions = model.predict(val_X)

# 5. Evaluate
mae = mean_absolute_error(val_y, predictions)
print("MAE:", mae)
```

| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Speed | ⚡ Fast (single tree) | 🐢 Slower (100 trees) |
| Accuracy | Moderate | High |
| Overfitting | Prone | ✅ Robust |
| Interpretability | Easy to explain | More of a black box |
| Tuning | Needs extensive tuning | ✅ Defaults work well |
Cause: the column name is misspelled or doesn't exist in the dataset.

Fix:

```python
# Check the actual column names
print(home_data.columns)

# Spelling and capitalization must match exactly
y = home_data.SalePrice   # ✅ correct
y = home_data.saleprice   # ❌ wrong (case-sensitive)
```

Cause: the cell that defines X and y hasn't been run yet.

Fix: run the setup cells in order from the top.
Cause: the data contains missing values (NaN).

Fix:

```python
# Option 1: drop rows with missing values
home_data = home_data.dropna(axis=0)

# Option 2: use only features without missing values
# (see the list of 25 safe features in Lesson 7)
```

Cause: the model is underfitting or the features aren't informative enough.
Fix:

- Add more features (use the 25-feature list from Lesson 7)
- Tune the hyperparameters (`max_depth`, `n_estimators`)
- Try a Random Forest (usually better than a single Decision Tree)
Cause: overfitting – the model is too specific to the training data.

Fix:

- Reduce model complexity (`max_depth`, `n_estimators`)
- Use cross-validation (an advanced topic covered in Intermediate ML)
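Cross-validation, mentioned above, evaluates the model on several different train/validation splits instead of one, which gives a less noisy MAE estimate. A minimal sketch with `cross_val_score` on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 4 + X[:, 1] + rng.normal(0, 1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)
# 5-fold CV; scikit-learn maximizes scores, so MAE comes back negated
scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(scores.mean())
```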
- ✅ Complete all 7 exercises
- ✅ Click "Get Certificate" on the course page
- ✅ Upload it to LinkedIn (Licenses & Certifications)
Intermediate Machine Learning:

- Handling missing values
- Categorical variables (one-hot encoding)
- Pipelines
- XGBoost (a more powerful model)
- Cross-validation

Pandas:

- Deep dive into data manipulation
- Merging, grouping, pivoting
- Time series
- Data cleaning

Feature Engineering:

- Creating new features from existing data
- Feature selection
- Dimensionality reduction (PCA)

Data Visualization:

- Matplotlib, Seaborn
- Exploratory Data Analysis (EDA)
- Storytelling with data
Project ideas:

- Predictive maintenance (IoT sensor data → predict failures)
- Stock price prediction
- Customer churn prediction
- House price prediction (improve your Kaggle submission)

Project template:

- Load & explore the data (EDA)
- Feature engineering
- Train multiple models (Decision Tree, Random Forest, XGBoost)
- Compare MAE
- Hyperparameter tuning
- Final model + visualization
- Publish on GitHub with a complete README
Beginner-friendly competitions:

- Titanic (classification)
- House Prices (regression) – you've already started this one!
- Digit Recognizer (computer vision)

Benefits:

- Practice with real datasets
- Learn from other people's kernels/notebooks
- Build a reputation (medals on your profile)
- Scikit-Learn: https://scikit-learn.org/stable/
- Pandas: https://pandas.pydata.org/docs/
- Kaggle Learn: https://www.kaggle.com/learn
- "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" – Aurélien Géron
  → the best practical ML book (Python)
- "Python for Data Analysis" – Wes McKinney (creator of pandas)
  → a deep dive into pandas
- "The Elements of Statistical Learning" – Hastie, Tibshirani, Friedman
  → theory (advanced, graduate level)
- Andrew Ng – Machine Learning (Coursera) – classic, foundational
- Fast.ai – Practical Deep Learning – top-down approach
- DataCamp / Coursera – Python for Data Science
- Kaggle Forums & Discussions
- r/MachineLearning (Reddit)
- Stack Overflow – for troubleshooting
| Metric | Value |
|---|---|
| Lessons | 7 |
| Exercises | 7 |
| Estimated Time | 3-5 hours |
| Difficulty | Beginner |
| Certificate | ✅ Yes (free) |
| Model | Validation MAE | Notes |
|---|---|---|
| Baseline Decision Tree | ~29,000 | No tuning |
| Tuned Decision Tree | ~27,282 | max_leaf_nodes=100 |
| Random Forest (default) | ~21,857 | Best default |
| Random Forest (25 features) | ~17,000 | Improved features |
Use this checklist to track your progress:

- Lesson 1: Understand how Decision Trees work
- Lesson 2: Load & explore data with pandas
- Exercise 2: Compute the average lot size & newest home age
- Lesson 3: Build your first Decision Tree model
- Exercise 3: Fit the model & make predictions
- Lesson 4: Understand train/validation splits & MAE
- Exercise 4: Calculate the validation MAE
- Lesson 5: Understand underfitting vs overfitting
- Exercise 5: Find the optimal `max_leaf_nodes`
- Lesson 6: Build a Random Forest model
- Exercise 6: Compare Random Forest vs Decision Tree
- Lesson 7: Submit to the Kaggle competition
- Exercise 7: Create submission.csv & submit
- Bonus: Improve the model (more features, tuning)
- Bonus: Download the certificate & add it to LinkedIn
Course Created by: Kaggle (Dan Becker, Alexis Cook)
Dataset: Ames Housing Dataset (Dean De Cock)
Libraries: scikit-learn, pandas, numpy (open-source community)
Documentation Author: [Your Name / GitHub Profile]
Last Updated: 2025-12-12
This documentation is provided for educational purposes.
Course content Β© Kaggle.
Code examples: MIT License (free to use & modify).
Found a typo or want to improve this documentation?
- Open an issue on GitHub
- Submit a pull request
- Contact: [Your Email / GitHub]
Happy learning! 🎉 Happy Machine Learning! 🚀
| Term | Definition |
|---|---|
| Algorithm | A procedure/formula for solving a problem (e.g. Decision Tree) |
| Classification | Predicting a category (e.g. spam/not spam) |
| Cross-Validation | A validation technique using multiple train/val splits |
| Decision Tree | A model that makes decisions via feature-based splits |
| Ensemble | A combination of multiple models (e.g. Random Forest) |
| Features (X) | The model's input variables |
| Hyperparameter | A parameter set before training (e.g. max_depth) |
| MAE | Mean Absolute Error – a regression evaluation metric |
| Model | A mathematical representation learned from data |
| Overfitting | A model too specific to the training data → poor on new data |
| Prediction | The model's output for new input |
| Random Forest | An ensemble of many Decision Trees |
| Regression | Predicting a continuous value (e.g. house prices) |
| Supervised Learning | Learning from data with labels/targets |
| Target (y) | The variable you want to predict |
| Training Data | Data used to train the model |
| Underfitting | A model too simple → misses the patterns |
| Validation Data | Held-out data for evaluation (not used during training) |
```python
DecisionTreeRegressor(
    max_depth=None,        # unlimited depth (default)
    max_leaf_nodes=None,   # unlimited leaves (default)
    min_samples_split=2,   # min samples to split (default)
    min_samples_leaf=1,    # min samples per leaf (default)
    random_state=None      # set for reproducibility
)
```

Recommended for tuning:

- `max_leaf_nodes`: [50, 100, 250, 500]
- `max_depth`: [5, 10, 15, 20]
```python
RandomForestRegressor(
    n_estimators=100,      # number of trees (default: 100)
    max_depth=None,        # unlimited depth (default)
    min_samples_split=2,
    min_samples_leaf=1,
    max_features=1.0,      # use all features (the regression default)
    random_state=None
)
```

Recommended for tuning:

- `n_estimators`: [100, 200, 300]
- `max_depth`: [10, 15, 20, None]
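The recommended grids above can be searched automatically with scikit-learn's `GridSearchCV`, which tries every parameter combination under cross-validation. A small sketch on synthetic data (the grid is shrunk here to keep the run fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 3))
y = X[:, 0] * 4 + X[:, 1] ** 2 + rng.normal(0, 1, size=150)

param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` is then a forest refitted on all the data with the winning parameters.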
END OF DOCUMENTATION
Version History:
- v1.0 (2025-12-12) β Initial complete documentation
Maintainer: [Your Name]
Contact: [Your GitHub / Email]
🎉 **Congratulations on completing Intro to Machine Learning!** 🎉