Intro to Machine Learning - A Complete Guide & Quick Reference

Version: 1.0
Last Updated: 2025-12-12
Language: English
Course: Kaggle Learn - Intro to Machine Learning
Dataset: Iowa Housing Prices (Ames Housing Dataset)


📚 Table of Contents

  1. Course Overview
  2. Prerequisites
  3. Setup Environment
  4. Lesson 1: How Models Work
  5. Lesson 2: Basic Data Exploration
  6. Lesson 3: Your First Machine Learning Model
  7. Lesson 4: Model Validation
  8. Lesson 5: Underfitting and Overfitting
  9. Lesson 6: Random Forests
  10. Lesson 7: Machine Learning Competitions
  11. Cheat Sheet
  12. Troubleshooting
  13. Next Steps
  14. Resources

🎯 Course Overview

Intro to Machine Learning is a beginner course on core machine learning concepts and on building predictive models with scikit-learn.

What You Will Learn:

✅ Machine learning fundamentals (supervised learning, regression)
✅ Data exploration with pandas
✅ Building Decision Tree & Random Forest models
✅ Model evaluation with Mean Absolute Error (MAE)
✅ Train/validation splits to avoid overfitting
✅ Hyperparameter tuning to improve accuracy
✅ Submitting predictions to a Kaggle competition

Tools & Libraries:

  • Python 3.x
  • pandas - data manipulation
  • scikit-learn - machine learning models
  • numpy - numerical operations
  • Kaggle Notebooks - cloud environment (optional)

Dataset:

Iowa Housing Prices - predict house sale prices from 79 features (lot area, year built, number of rooms, etc.).


📋 Prerequisites

Knowledge:

  • Python basics (variables, functions, loops, conditionals)
  • Basic math (mean, median, basic algebra)
  • Curiosity & persistence 🚀

Not Required:

❌ An advanced math/statistics background
❌ Prior coding experience (helpful, but not required)
❌ Machine learning knowledge (this is a beginner course)

Recommended Setup:

Option 1: Kaggle Notebooks (easiest)

Option 2: Local Jupyter Notebook

# Install dependencies
pip install pandas scikit-learn numpy jupyter

# Download the dataset from Kaggle
# Launch Jupyter
jupyter notebook

🛠 Setup Environment

Install Libraries (Local):

pip install pandas scikit-learn numpy matplotlib seaborn jupyter

Import Libraries (Standard):

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

Load the Dataset:

# Path to the dataset file
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Load the data with pandas
home_data = pd.read_csv(iowa_file_path)

# Preview the data
print(home_data.head())
print(home_data.describe())

📖 Lesson 1: How Models Work

Core Concepts

Machine learning = getting a computer to "learn" from data in order to make predictions or decisions.

Supervised Learning:

  • Input: data with a label/target (e.g., house prices)
  • Output: a model that can predict the label for new data
  • Example: predicting house prices from features (area, location, year built)

Decision Tree:

A model that makes decisions by repeatedly splitting the data on feature values.

Example Decision Tree (simplified):

                    [All Houses]
                    /          \
        [YearBuilt > 2000]     [YearBuilt ≤ 2000]
         /         \             /           \
   [Large Area] [Small]    [Good Cond]  [Poor Cond]
   → $300k      → $200k    → $180k      → $120k

Key Terms:

| Term | Definition |
|------|------------|
| Features (X) | Input variables (e.g., lot area, year built) |
| Target (y) | The variable to predict (e.g., house price) |
| Training Data | Data used to train the model |
| Prediction | The model's output for new data |
| Model | The algorithm that "learns" patterns from the data |

📊 Lesson 2: Basic Data Exploration

Load & Inspect Data

import pandas as pd

# Load the data
home_data = pd.read_csv('train.csv')

# First 5 rows
print(home_data.head())

# Descriptive statistics
print(home_data.describe())

# All column names
print(home_data.columns)

# Dataset info (data types, missing values)
print(home_data.info())

Key Pandas Methods:

| Method | Purpose |
|--------|---------|
| .head(n) | First n rows (default 5) |
| .tail(n) | Last n rows |
| .describe() | Descriptive statistics (mean, std, min, max, etc.) |
| .info() | Data types & missing-value counts |
| .shape | Dataset dimensions (rows, columns) |
| .columns | List of column names |
| .isnull().sum() | Number of missing values per column |
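A few of the listed methods are not shown in the code block above; a quick sketch of them on the same home_data frame:

# Dataset dimensions as a (rows, columns) tuple
print(home_data.shape)

# Number of missing values in each column
print(home_data.isnull().sum())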

Example .describe() output:

       SalePrice    LotArea   YearBuilt
count  1460.00      1460.00   1460.00
mean   180921.20    10516.83  1971.27
std    79442.50     9981.26   30.20
min    34900.00     1300.00   1872.00
25%    129975.00    7553.50   1954.00
50%    163000.00    9478.50   1973.00
75%    214000.00    11601.50  2000.00
max    755000.00    215245.00 2010.00

🤖 Lesson 3: Your First Machine Learning Model

Workflow:

  1. Define the Target (y) - the column to predict
  2. Choose Features (X) - the input columns for the model
  3. Define the Model - pick an algorithm (e.g., Decision Tree)
  4. Fit the Model - train it on the data
  5. Predict - generate predictions

Full Code:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# 1. Load the data
iowa_file_path = 'train.csv'
home_data = pd.read_csv(iowa_file_path)

# 2. Define the Target (y)
y = home_data.SalePrice

# 3. Choose Features (X)
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# 4. Define the Model
iowa_model = DecisionTreeRegressor(random_state=1)

# 5. Fit the Model
iowa_model.fit(X, y)

# 6. Predict (on the training data, for demonstration only)
predictions = iowa_model.predict(X)
print("Predictions:", predictions[:5])
print("Actual values:", y.head().values)

Explanation:

  • random_state=1 → makes the results reproducible (identical on every run)
  • .fit(X, y) → trains the model on the features (X) and target (y)
  • .predict(X) → generates predictions

⚠️ Important Note:

Predictions on the training data will look very accurate (even perfect) → this is misleading!
For a proper evaluation, use validation data (Lesson 4).


✅ Lesson 4: Model Validation

The Problem: In-Sample Scores

In-sample evaluation = evaluating the model on the same data it was trained on → misleading (the model can overfit).

The Solution: Train-Test Split

Split the data into:

  • Training set (75%) → used to train the model
  • Validation set (25%) → used for evaluation

Code:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Split the data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define & fit the model (training data only)
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)

# Predict on the validation data
val_predictions = iowa_model.predict(val_X)

# Calculate the MAE (Mean Absolute Error)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE:", val_mae)

Mean Absolute Error (MAE):

Formula:

MAE = (1/n) × Σ |actual - predicted|

Interpretation:

"On average, the model's predictions are off by about $X from the actual price."

Example:

  • MAE = 25,000 → an average error of $25,000
  • The lower the MAE, the better the model (a worked sketch follows below)
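A minimal sketch of the formula in code, using made-up actual/predicted values, checked against scikit-learn's implementation:

import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up illustration values, not from the Iowa dataset
actual = np.array([200_000, 150_000, 300_000])
predicted = np.array([210_000, 145_000, 280_000])

# By hand: the mean of the absolute differences
manual_mae = np.mean(np.abs(actual - predicted))  # (10,000 + 5,000 + 20,000) / 3

# scikit-learn gives the same result
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)  # both ≈ 11,666.67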

In-Sample vs Out-of-Sample Comparison:

| Metric | Training Data | Validation Data |
|--------|---------------|-----------------|
| MAE | ~500 (very low) | ~29,000 (realistic) |
| Interpretation | The model "memorizes" the training data | Model performance on new data |
| Conclusion | Misleading (overfitting) | ✅ Realistic |

🎯 Lesson 5: Underfitting and Overfitting

Concepts:

Underfitting = the model is too simple → it misses important patterns → high MAE
Overfitting = the model is too complex → it "memorizes" the training data → low MAE on training, high MAE on validation

Chart:

MAE
 |
 |   Underfitting
 |       \
 |        \_____ Sweet Spot (Optimal)
 |              \
 |               \_____ Overfitting
 |_________________________ Model Complexity
    (shallow tree)          (deep tree)

Hyperparameter: max_leaf_nodes

max_leaf_nodes = the maximum number of leaves (terminal groups) in the Decision Tree.

  • Low (e.g., 5) → shallow tree → underfitting
  • High (e.g., 5000) → deep tree → overfitting
  • Optimal (e.g., 100) → sweet spot → lowest MAE

Experiment: Find Optimal max_leaf_nodes

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Try several values
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) 
          for leaf_size in candidate_max_leaf_nodes}

# Find the optimal value
best_tree_size = min(scores, key=scores.get)
print(f"Optimal max_leaf_nodes: {best_tree_size}")
print(f"Best MAE: {scores[best_tree_size]:,.0f}")

# Fit the final model with the optimal parameter
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)  # Fit on ALL the data

Example Results:

max_leaf_nodes=5    → MAE: 35,044 (underfitting)
max_leaf_nodes=25   → MAE: 29,016
max_leaf_nodes=50   → MAE: 27,405
max_leaf_nodes=100  → MAE: 27,282 ✅ OPTIMAL
max_leaf_nodes=250  → MAE: 27,893
max_leaf_nodes=500  → MAE: 29,454 (overfitting)

🌲 Lesson 6: Random Forests

The Problem with a Single Decision Tree:

Even after tuning, a Decision Tree has limitations:

  • Sensitive to small changes in the data
  • A difficult trade-off between underfitting & overfitting

The Solution: Random Forest

Random Forest = an ensemble of many Decision Trees (default: 100 trees).

How It Works (see the sketch after this list):

  1. Build 100 trees, each with:
    • A random subset of the data (bootstrap sampling)
    • A random subset of the features
  2. Each tree makes a prediction
  3. Final prediction = the average of all the trees' predictions
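A minimal sketch of that idea, assuming the train_X/train_y/val_X split from Lesson 4. It bootstraps rows only; the real RandomForestRegressor also subsamples features at each split:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
trees = []
for _ in range(10):  # a small forest of 10 trees, for illustration
    # Bootstrap sample: draw row indices with replacement
    idx = rng.randint(0, len(train_X), len(train_X))
    tree = DecisionTreeRegressor(random_state=1)
    tree.fit(train_X.iloc[idx], train_y.iloc[idx])
    trees.append(tree)

# Final prediction = the average of the individual trees' predictions
manual_forest_preds = np.mean([t.predict(val_X) for t in trees], axis=0)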

Advantages:

✅ More accurate than a single Decision Tree
✅ Resists overfitting (averaging reduces variance)
✅ Robust with default parameters (no extensive tuning needed)
✅ "Just works" - good performance out of the box

Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Define Random Forest model
forest_model = RandomForestRegressor(random_state=1)

# Fit model
forest_model.fit(train_X, train_y)

# Predict
forest_preds = forest_model.predict(val_X)

# Evaluate
forest_mae = mean_absolute_error(val_y, forest_preds)
print("Random Forest MAE:", forest_mae)

Comparison: Decision Tree vs Random Forest

| Model | Validation MAE | Notes |
|-------|----------------|-------|
| Decision Tree (tuned) | ~27,282 | After tuning max_leaf_nodes |
| Random Forest (default) | ~21,857 | ~20% better, with no tuning! |

Tuning the Random Forest (Optional):

# Custom parameters
forest_model_tuned = RandomForestRegressor(
    n_estimators=200,      # 200 trees (default: 100)
    max_depth=15,          # max depth per tree
    min_samples_split=5,
    random_state=1
)
forest_model_tuned.fit(train_X, train_y)

Key Parameters (see the sketch below):

  • n_estimators - number of trees (more = more accurate, but slower)
  • max_depth - maximum depth of each tree (controls overfitting)
  • min_samples_split - minimum number of samples required to split a node
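The validation loop from Lesson 5 works for these parameters too; a sketch for n_estimators (the candidate values are illustrative, not prescriptive):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Compare validation MAE across a few forest sizes
for n in [100, 200, 300]:
    model = RandomForestRegressor(n_estimators=n, random_state=1)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"n_estimators={n}: MAE = {mae:,.0f}")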

πŸ† Lesson 7: Machine Learning Competitions

Submitting to a Kaggle Competition

Step 1: Train the Model on ALL the Data

# Fit the model on all the training data (not just train_X)
rf_model_on_full_data = RandomForestRegressor(random_state=1)
rf_model_on_full_data.fit(X, y)  # ← ALL the data

Note: The validation data has served its purpose (tuning). The final model should learn from all the data to maximize performance.

Step 2: Load the Test Data & Predict

# Load the test data
test_data = pd.read_csv('../input/test.csv')

# Select the same features
test_X = test_data[features]

# Predict
test_preds = rf_model_on_full_data.predict(test_X)

Step 3: Save the Predictions

# Build the submission DataFrame
output = pd.DataFrame({
    'Id': test_data.Id,
    'SalePrice': test_preds
})

# Save to CSV
output.to_csv('submission.csv', index=False)
print("Submission file created!")

Step 4: Submit on Kaggle

  1. Join the competition: Housing Prices Competition
  2. Save & run the notebook ("Save Version" → "Save and Run All")
  3. Open it in the Viewer
  4. On the "Data" tab, click "submission.csv" → "Submit"

Improving the Model (Climbing the Leaderboard):

1. Add Features

Baseline (7 features):

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

Improved (25 features):

features = [
    'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 
    'YearRemodAdd', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
    'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
    'Fireplaces', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
    'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'
]

Top Features for Accuracy (see the sketch below to rank your own):

  • OverallQual ⭐⭐⭐ - overall quality rating (1-10)
  • GrLivArea ⭐⭐⭐ - above-ground living area (sq ft)
  • YearBuilt ⭐⭐ - year built
  • TotRmsAbvGrd ⭐ - total rooms above ground
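To rank the features for your own model, a fitted Random Forest exposes a feature_importances_ attribute; a sketch assuming rf_model_on_full_data and features from Step 1:

import pandas as pd

# Pair each feature with its importance score and sort descending
importances = pd.Series(rf_model_on_full_data.feature_importances_,
                        index=features).sort_values(ascending=False)
print(importances.head(10))  # the most influential features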

2. Tuning Hyperparameters

rf_model_on_full_data = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    random_state=1
)

3. Try XGBoost (Advanced)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=1)
xgb_model.fit(X, y)
test_preds = xgb_model.predict(test_X)

Expected Results:

| Model | Validation MAE | Leaderboard Score |
|-------|----------------|-------------------|
| Baseline (7 features) | ~22,000 | ~26,000 |
| Improved (25 features) | ~17,000 | ~20,000 |
| Tuned + 25 features | ~15,000 | ~18,000 |

📝 Cheat Sheet

Pandas Essentials

import pandas as pd

# Load data
df = pd.read_csv('file.csv')

# Inspect
df.head()                  # First 5 rows
df.describe()              # Descriptive statistics
df.info()                  # Data types & missing values
df.columns                 # Column names
df.shape                   # (rows, columns)

# Select
df['column']               # Select one column (Series)
df[['col1', 'col2']]       # Select multiple columns (DataFrame)

Scikit-Learn Workflow

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# 1. Split data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# 2. Define model
model = RandomForestRegressor(random_state=1)

# 3. Fit model
model.fit(train_X, train_y)

# 4. Predict
predictions = model.predict(val_X)

# 5. Evaluate
mae = mean_absolute_error(val_y, predictions)
print("MAE:", mae)

Decision Tree vs Random Forest

| Aspect | Decision Tree | Random Forest |
|--------|---------------|---------------|
| Speed | ⚡ Fast (single tree) | 🐢 Slower (100 trees) |
| Accuracy | 📊 Moderate | 🎯 High |
| Overfitting | ⚠️ Overfits easily | ✅ Robust |
| Interpretability | 📖 Easy to explain | 🔒 Black box |
| Tuning | 🔧 Needs extensive tuning | ✅ Good defaults |

🚨 Troubleshooting

Error: KeyError: 'column_name'

Cause: The column name is misspelled or does not exist in the dataset.
Fix:

# Check the exact column names
print(home_data.columns)

# Make sure spelling and capitalization match
y = home_data.SalePrice  # ✅ Correct
y = home_data.saleprice  # ❌ Wrong (case-sensitive)

Error: NameError: name 'X' is not defined

Cause: The cell that defines X and y has not been run.
Fix: Run the setup cells in order from the top.

Error: ValueError: Input contains NaN

Cause: The data contains missing values (NaN).
Fix:

# Option 1: Drop rows with missing values
home_data = home_data.dropna(axis=0)

# Option 2: Use only features without missing values
# (see the safe 25-feature list in Lesson 7)
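Option 2's feature list can also be built programmatically; a sketch that keeps only numeric, NaN-free columns (SalePrice is excluded because it is the target, not a feature):

# Select numeric columns with no missing values (excluding the target)
numeric = home_data.select_dtypes(include='number')
safe_features = [c for c in numeric.columns
                 if numeric[c].isnull().sum() == 0 and c != 'SalePrice']
X = home_data[safe_features]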

Very High MAE (> 50,000)

Cause: The model is underfitting, or the features are not informative enough.
Fix:

  • Add more features (use the 25-feature list from Lesson 7)
  • Tune the hyperparameters (max_depth, n_estimators)
  • Try a Random Forest (usually better than a single Decision Tree)

Validation MAE ≪ Leaderboard Score

Cause: Overfitting - the model is too tailored to the training data.
Fix:

  • Reduce model complexity (max_depth, n_estimators)
  • Use cross-validation (an advanced topic covered in Intermediate ML; see the preview sketch below)
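As a preview, scikit-learn's cross_val_score makes basic cross-validation a few lines; a sketch (note that sklearn reports MAE as a negative score by convention, hence the minus sign):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

model = RandomForestRegressor(random_state=1)
# 5-fold cross-validation on the full X, y
scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print("Mean CV MAE:", scores.mean())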

🚀 Next Steps

1. Download Your Certificate

✅ Complete all 7 exercises
✅ Click "Get Certificate" on the course page
✅ Add it to LinkedIn (Licenses & Certifications)

2. Take the Next Course

Intermediate Machine Learning (Recommended)

  • Handling missing values
  • Categorical variables (one-hot encoding)
  • Pipelines
  • XGBoost (a more powerful model)
  • Cross-validation

Pandas

  • Deep dive into data manipulation
  • Merging, grouping, pivoting
  • Time series
  • Data cleaning

Feature Engineering

  • Create new features from existing data
  • Feature selection
  • Dimensionality reduction (PCA)

Data Visualization

  • Matplotlib, Seaborn
  • Exploratory Data Analysis (EDA)
  • Storytelling with data

3. Build a Portfolio Project

Project Ideas:

  • Predictive maintenance (IoT sensor data → predict failures)
  • Stock price prediction
  • Customer churn prediction
  • House price prediction (improve your Kaggle submission)

Project Template:

  1. Load & explore the data (EDA)
  2. Feature engineering
  3. Train multiple models (Decision Tree, Random Forest, XGBoost)
  4. Compare MAE
  5. Hyperparameter tuning
  6. Final model + visualization
  7. Publish on GitHub with a complete README

4. Join Kaggle Competitions

Beginner-Friendly Competitions:

  • Titanic (classification)
  • House Prices (regression) ← you have already started this one!
  • Digit Recognizer (computer vision)

Benefits:

  • Practice with real datasets
  • Learn from other people's kernels/notebooks
  • Build a reputation (medals on your profile)

📚 Resources

Official Documentation

Books (Recommended)

  1. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" β€” AurΓ©lien GΓ©ron
    β†’ Best practical ML book (Python)

  2. "Python for Data Analysis" β€” Wes McKinney (creator of pandas)
    β†’ Deep dive pandas

  3. "The Elements of Statistical Learning" β€” Hastie, Tibshirani, Friedman
    β†’ Theory (advanced, for S2/S3)

Online Courses

  • Andrew Ng - Machine Learning (Coursera) - classic, foundational
  • Fast.ai - Practical Deep Learning - top-down approach
  • DataCamp / Coursera - Python for Data Science

Communities

  • Kaggle Forums & Discussions
  • r/MachineLearning (Reddit)
  • Stack Overflow - for troubleshooting

📊 Summary Metrics

Course Completion Stats:

| Metric | Value |
|--------|-------|
| Lessons | 7 |
| Exercises | 7 |
| Estimated Time | 3-5 hours |
| Difficulty | Beginner |
| Certificate | ✅ Yes (free) |

Model Performance (Iowa Housing):

| Model | Validation MAE | Notes |
|-------|----------------|-------|
| Baseline Decision Tree | ~29,000 | No tuning |
| Tuned Decision Tree | ~27,282 | max_leaf_nodes=100 |
| Random Forest (default) | ~21,857 | Best with defaults |
| Random Forest (25 features) | ~17,000 | Improved feature set |

✅ Completion Checklist

Use this checklist to track your progress:

  • Lesson 1: Understand how Decision Trees work
  • Lesson 2: Load & explore the data with pandas
  • Exercise 2: Compute the average lot size & the newest home's age
  • Lesson 3: Build your first Decision Tree model
  • Exercise 3: Fit the model & make predictions
  • Lesson 4: Understand train/validation splits & MAE
  • Exercise 4: Calculate the validation MAE
  • Lesson 5: Understand underfitting vs overfitting
  • Exercise 5: Find the optimal max_leaf_nodes
  • Lesson 6: Build a Random Forest model
  • Exercise 6: Compare Random Forest vs Decision Tree
  • Lesson 7: Submit to the Kaggle competition
  • Exercise 7: Create submission.csv & submit it
  • Bonus: Improve the model (add features, tune)
  • Bonus: Download the certificate & add it to LinkedIn

🎓 Credits & Acknowledgments

Course Created by: Kaggle (Dan Becker, Alexis Cook)
Dataset: Ames Housing Dataset (Dean De Cock)
Libraries: scikit-learn, pandas, numpy (open-source community)
Documentation Author: [Your Name / GitHub Profile]
Last Updated: 2025-12-12


📄 License

This documentation is provided for educational purposes.
Course content © Kaggle.
Code examples: MIT License (free to use & modify).


πŸ™ Feedback & Contributions

Found a typo or want to improve this documentation?

  • Open an issue on GitHub
  • Submit a pull request
  • Contact: [Your Email / GitHub]

Happy learning! 🚀 Happy Machine Learning! 🎉


Appendix A: Glossary

| Term | Definition |
|------|------------|
| Algorithm | A procedure/formula for solving a problem (e.g., Decision Tree) |
| Classification | Predicting a category (e.g., spam/not spam) |
| Cross-Validation | A validation technique using multiple train/val splits |
| Decision Tree | A model that makes decisions via feature-based splits |
| Ensemble | A combination of multiple models (e.g., Random Forest) |
| Features (X) | Input variables for the model |
| Hyperparameter | A parameter set before training (e.g., max_depth) |
| MAE | Mean Absolute Error - a regression evaluation metric |
| Model | A mathematical representation learned from the data |
| Overfitting | A model too tailored to the training data → poor on new data |
| Prediction | The model's output for new input |
| Random Forest | An ensemble of many Decision Trees |
| Regression | Predicting a continuous value (e.g., house price) |
| Supervised Learning | Learning from labeled data |
| Target (y) | The variable to predict |
| Training Data | Data used to train the model |
| Underfitting | A model too simple to capture the important patterns |
| Validation Data | Held-out data for evaluation (not used during training) |

Appendix B: Common Parameter Values

DecisionTreeRegressor

DecisionTreeRegressor(
    max_depth=None,           # Unlimited depth (default)
    max_leaf_nodes=None,      # Unlimited leaves (default)
    min_samples_split=2,      # Min samples to split a node (default)
    min_samples_leaf=1,       # Min samples per leaf (default)
    random_state=None         # Set for reproducibility
)

Recommended for tuning:

  • max_leaf_nodes: [50, 100, 250, 500]
  • max_depth: [5, 10, 15, 20]

RandomForestRegressor

RandomForestRegressor(
    n_estimators=100,         # Number of trees (default: 100)
    max_depth=None,           # Unlimited depth (default)
    min_samples_split=2,
    min_samples_leaf=1,
    max_features=1.0,         # All features considered per split (regression default; 'auto' is deprecated)
    random_state=None
)

Recommended for tuning (see the search sketch below):

  • n_estimators: [100, 200, 300]
  • max_depth: [10, 15, 20, None]
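To search these grids automatically rather than by hand, scikit-learn's GridSearchCV can be used; a sketch (runtime grows with the size of the grid):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
}
search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # best settings & their MAE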

END OF DOCUMENTATION

Version History:

  • v1.0 (2025-12-12) - Initial complete documentation

Maintainer: [Your Name]
Contact: [Your GitHub / Email]

🎉 **Congratulations on completing Intro to Machine Learning!** 🎉
