{
"cells": [
{
"cell_type": "markdown",
"id": "14a65a23",
"metadata": {},
"source": [
"# *California Housing Price Prediction*"
]
},
{
"cell_type": "markdown",
"id": "ad353530",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "c5f2493e",
"metadata": {},
"source": [
"### *1- Business framing*"
]
},
{
"cell_type": "markdown",
"id": "4f5e64bc",
"metadata": {},
"source": [
"##### 1 - What type of Machine‑Learning task is this (supervised, unsupervised, reinforcement)?\n",
"***=> Supervised***\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "3b45d9e0",
"metadata": {},
"source": [
"##### 2 - Is it a regression or classification problem?\n",
"***=> Regression***"
]
},
{
"cell_type": "markdown",
"id": "b17d4c9d",
"metadata": {},
"source": [
"##### 3 - Which error metric would you propose first and why? \n",
"***=>***\n"
]
},
{
"cell_type": "markdown",
"id": "0a2571d6",
"metadata": {},
"source": [
"##### 4 - When might MAE be preferable?\n",
"***=>***\n"
]
},
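{
"cell_type": "markdown",
"id": "e7a1c0f2",
"metadata": {},
"source": [
"Questions 3 and 4 above compare error metrics for regression. As a minimal sketch, assuming made-up house values (the numbers and the helper names `rmse` and `mae` are illustrative, not part of the dataset or of any library), the snippet below computes both metrics by hand to show how a single large error affects each:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def rmse(y_true, y_pred):\n",
"    # Root mean squared error: residuals are squared, so large misses dominate.\n",
"    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))\n",
"\n",
"def mae(y_true, y_pred):\n",
"    # Mean absolute error: each residual counts in proportion to its size.\n",
"    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)\n",
"\n",
"# Hypothetical house values in dollars, invented for this illustration.\n",
"y_true = [200_000, 150_000, 300_000, 250_000]\n",
"y_close = [210_000, 140_000, 290_000, 260_000]    # four errors of 10,000\n",
"y_outlier = [200_000, 150_000, 300_000, 210_000]  # one error of 40,000\n",
"\n",
"print('close:  ', rmse(y_true, y_close), mae(y_true, y_close))\n",
"print('outlier:', rmse(y_true, y_outlier), mae(y_true, y_outlier))\n",
"```\n",
"\n",
"Both prediction sets have the same MAE, but the set with one large miss has twice the RMSE. This is why MAE tends to be preferred when outliers are frequent."
]
},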
{
"cell_type": "markdown",
"id": "d3f7ff0d",
"metadata": {},
"source": [
"## 1 | Getting the Data"
]
},
{
"cell_type": "markdown",
"id": "2747e729",
"metadata": {},
"source": [
"#### 1.1 Set‑up"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61eaa606",
"metadata": {},
"outputs": [],
"source": [
"%python3 -m venv .venv\n",
"%source .venv/bin/activate"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "250a3db0",
"metadata": {},
"outputs": [],
"source": [
"import os, tarfile, urllib.request\n",
"DOWNLOAD_ROOT = \"https://raw.githubusercontent.com/ageron/handson-ml2/master/\"\n",
"HOUSING_PATH = os.path.join(\"datasets\", \"housing\")\n",
"HOUSING_URL = DOWNLOAD_ROOT + \"datasets/housing/housing.tgz\"\n",
"\n",
"def fetch_housing_data(housing_url: str = HOUSING_URL, housing_path: str = HOUSING_PATH):\n",
" \"\"\"Download & uncompress the California housing dataset.\"\"\"\n",
" # ensure local directory exists\n",
" os.makedirs(housing_path, exist_ok=True)\n",
"\n",
" # download the archive only if it isn’t already present\n",
" tgz_path = os.path.join(housing_path, \"housing.tgz\")\n",
" if not os.path.isfile(tgz_path):\n",
" print(\" Downloading …\")\n",
" urllib.request.urlretrieve(housing_url, tgz_path)\n",
"\n",
" # extract csv\n",
" with tarfile.open(tgz_path) as housing_tgz:\n",
" housing_tgz.extractall(path=housing_path)\n",
" print(\" Dataset ready at\", housing_path)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4600622c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Dataset ready at datasets/housing\n"
]
}
],
"source": [
"fetch_housing_data()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "823fb9ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pandas in ./.venv/lib/python3.13/site-packages (2.3.3)\n",
"Requirement already satisfied: numpy>=1.26.0 in ./.venv/lib/python3.13/site-packages (from pandas) (2.3.5)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in ./.venv/lib/python3.13/site-packages (from pandas) (2.9.0.post0)\n",
"Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.13/site-packages (from pandas) (2025.2)\n",
"Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.13/site-packages (from pandas) (2025.3)\n",
"Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m25.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.3\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install pandas \n",
"import pandas as pd\n",
"\n",
"def load_housing_data(housing_path: str = HOUSING_PATH) -> pd.DataFrame:\n",
" csv_path = os.path.join(housing_path, \"housing.csv\")\n",
" return pd.read_csv(csv_path)\n",
"\n",
"housing = load_housing_data()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9986286d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" longitude | \n",
" latitude | \n",
" housing_median_age | \n",
" total_rooms | \n",
" total_bedrooms | \n",
" population | \n",
" households | \n",
" median_income | \n",
" median_house_value | \n",
" ocean_proximity | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" -122.23 | \n",
" 37.88 | \n",
" 41.0 | \n",
" 880.0 | \n",
" 129.0 | \n",
" 322.0 | \n",
" 126.0 | \n",
" 8.3252 | \n",
" 452600.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" | 1 | \n",
" -122.22 | \n",
" 37.86 | \n",
" 21.0 | \n",
" 7099.0 | \n",
" 1106.0 | \n",
" 2401.0 | \n",
" 1138.0 | \n",
" 8.3014 | \n",
" 358500.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" | 2 | \n",
" -122.24 | \n",
" 37.85 | \n",
" 52.0 | \n",
" 1467.0 | \n",
" 190.0 | \n",
" 496.0 | \n",
" 177.0 | \n",
" 7.2574 | \n",
" 352100.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" | 3 | \n",
" -122.25 | \n",
" 37.85 | \n",
" 52.0 | \n",
" 1274.0 | \n",
" 235.0 | \n",
" 558.0 | \n",
" 219.0 | \n",
" 5.6431 | \n",
" 341300.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
" | 4 | \n",
" -122.25 | \n",
" 37.85 | \n",
" 52.0 | \n",
" 1627.0 | \n",
" 280.0 | \n",
" 565.0 | \n",
" 259.0 | \n",
" 3.8462 | \n",
" 342200.0 | \n",
" NEAR BAY | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" longitude latitude housing_median_age total_rooms total_bedrooms \\\n",
"0 -122.23 37.88 41.0 880.0 129.0 \n",
"1 -122.22 37.86 21.0 7099.0 1106.0 \n",
"2 -122.24 37.85 52.0 1467.0 190.0 \n",
"3 -122.25 37.85 52.0 1274.0 235.0 \n",
"4 -122.25 37.85 52.0 1627.0 280.0 \n",
"\n",
" population households median_income median_house_value ocean_proximity \n",
"0 322.0 126.0 8.3252 452600.0 NEAR BAY \n",
"1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n",
"2 496.0 177.0 7.2574 352100.0 NEAR BAY \n",
"3 558.0 219.0 5.6431 341300.0 NEAR BAY \n",
"4 565.0 259.0 3.8462 342200.0 NEAR BAY "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housing.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ab8283cc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape: (20640, 10) \n",
"\n",
"\n",
"RangeIndex: 20640 entries, 0 to 20639\n",
"Data columns (total 10 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 longitude 20640 non-null float64\n",
" 1 latitude 20640 non-null float64\n",
" 2 housing_median_age 20640 non-null float64\n",
" 3 total_rooms 20640 non-null float64\n",
" 4 total_bedrooms 20433 non-null float64\n",
" 5 population 20640 non-null float64\n",
" 6 households 20640 non-null float64\n",
" 7 median_income 20640 non-null float64\n",
" 8 median_house_value 20640 non-null float64\n",
" 9 ocean_proximity 20640 non-null object \n",
"dtypes: float64(9), object(1)\n",
"memory usage: 1.6+ MB\n",
"info: None \n",
"\n"
]
}
],
"source": [
"print(\"shape:\", housing.shape, \"\\n\")\n",
"print(\"info:\", housing.info(), \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "673826c2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" longitude | \n",
" latitude | \n",
" housing_median_age | \n",
" total_rooms | \n",
" total_bedrooms | \n",
" population | \n",
" households | \n",
" median_income | \n",
" median_house_value | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 20640.000000 | \n",
" 20640.000000 | \n",
" 20640.000000 | \n",
" 20640.000000 | \n",
" 20433.000000 | \n",
" 20640.000000 | \n",
" 20640.000000 | \n",
" 20640.000000 | \n",
" 20640.000000 | \n",
"
\n",
" \n",
" | mean | \n",
" -119.569704 | \n",
" 35.631861 | \n",
" 28.639486 | \n",
" 2635.763081 | \n",
" 537.870553 | \n",
" 1425.476744 | \n",
" 499.539680 | \n",
" 3.870671 | \n",
" 206855.816909 | \n",
"
\n",
" \n",
" | std | \n",
" 2.003532 | \n",
" 2.135952 | \n",
" 12.585558 | \n",
" 2181.615252 | \n",
" 421.385070 | \n",
" 1132.462122 | \n",
" 382.329753 | \n",
" 1.899822 | \n",
" 115395.615874 | \n",
"
\n",
" \n",
" | min | \n",
" -124.350000 | \n",
" 32.540000 | \n",
" 1.000000 | \n",
" 2.000000 | \n",
" 1.000000 | \n",
" 3.000000 | \n",
" 1.000000 | \n",
" 0.499900 | \n",
" 14999.000000 | \n",
"
\n",
" \n",
" | 25% | \n",
" -121.800000 | \n",
" 33.930000 | \n",
" 18.000000 | \n",
" 1447.750000 | \n",
" 296.000000 | \n",
" 787.000000 | \n",
" 280.000000 | \n",
" 2.563400 | \n",
" 119600.000000 | \n",
"
\n",
" \n",
" | 50% | \n",
" -118.490000 | \n",
" 34.260000 | \n",
" 29.000000 | \n",
" 2127.000000 | \n",
" 435.000000 | \n",
" 1166.000000 | \n",
" 409.000000 | \n",
" 3.534800 | \n",
" 179700.000000 | \n",
"
\n",
" \n",
" | 75% | \n",
" -118.010000 | \n",
" 37.710000 | \n",
" 37.000000 | \n",
" 3148.000000 | \n",
" 647.000000 | \n",
" 1725.000000 | \n",
" 605.000000 | \n",
" 4.743250 | \n",
" 264725.000000 | \n",
"
\n",
" \n",
" | max | \n",
" -114.310000 | \n",
" 41.950000 | \n",
" 52.000000 | \n",
" 39320.000000 | \n",
" 6445.000000 | \n",
" 35682.000000 | \n",
" 6082.000000 | \n",
" 15.000100 | \n",
" 500001.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" longitude latitude housing_median_age total_rooms \\\n",
"count 20640.000000 20640.000000 20640.000000 20640.000000 \n",
"mean -119.569704 35.631861 28.639486 2635.763081 \n",
"std 2.003532 2.135952 12.585558 2181.615252 \n",
"min -124.350000 32.540000 1.000000 2.000000 \n",
"25% -121.800000 33.930000 18.000000 1447.750000 \n",
"50% -118.490000 34.260000 29.000000 2127.000000 \n",
"75% -118.010000 37.710000 37.000000 3148.000000 \n",
"max -114.310000 41.950000 52.000000 39320.000000 \n",
"\n",
" total_bedrooms population households median_income \\\n",
"count 20433.000000 20640.000000 20640.000000 20640.000000 \n",
"mean 537.870553 1425.476744 499.539680 3.870671 \n",
"std 421.385070 1132.462122 382.329753 1.899822 \n",
"min 1.000000 3.000000 1.000000 0.499900 \n",
"25% 296.000000 787.000000 280.000000 2.563400 \n",
"50% 435.000000 1166.000000 409.000000 3.534800 \n",
"75% 647.000000 1725.000000 605.000000 4.743250 \n",
"max 6445.000000 35682.000000 6082.000000 15.000100 \n",
"\n",
" median_house_value \n",
"count 20640.000000 \n",
"mean 206855.816909 \n",
"std 115395.615874 \n",
"min 14999.000000 \n",
"25% 119600.000000 \n",
"50% 179700.000000 \n",
"75% 264725.000000 \n",
"max 500001.000000 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housing.describe()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a1ab7eb2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ocean_proximity\n",
"<1H OCEAN 9136\n",
"INLAND 6551\n",
"NEAR OCEAN 2658\n",
"NEAR BAY 2290\n",
"ISLAND 5\n",
"Name: count, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housing[\"ocean_proximity\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "159cfa35",
"metadata": {},
"source": [
"- Plus grand écart-type (std) : C'est bien median_house_value avec un écart-type d'environ 115 395.\n",
"\n",
"- Valeurs manquantes : Oui, la variable total_bedrooms a des valeurs manquantes. Elle ne compte que 20 433 valeurs non-nulles sur un total de 20 640\n",
"\n",
"- Il y a 20 640 lignes pour 10 colonnes\n",
"\n",
"- D'après le résultat de ton value_counts(), voici la répartition pour ocean_proximity :\n",
"ocean_proximity\n",
"<1H OCEAN 9136\n",
"INLAND 6551\n",
"NEAR OCEAN 2658\n",
"NEAR BAY 2290\n",
"ISLAND 5"
]
},
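{
"cell_type": "markdown",
"id": "b3d94e17",
"metadata": {},
"source": [
"The observations above can also be derived programmatically rather than read off the printed tables. The following is a minimal sketch on a toy frame: the values are the first rows shown by `housing.head()`, with one `total_bedrooms` entry blanked out on purpose to mimic the missing data, and the name `sample` is illustrative. The same calls should apply to the full `housing` DataFrame.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy frame built from the head() rows; one total_bedrooms value is set to\n",
"# None deliberately so that the missing-value check has something to find.\n",
"sample = pd.DataFrame({\n",
"    'total_rooms': [880.0, 7099.0, 1467.0, 1274.0, 1627.0],\n",
"    'total_bedrooms': [129.0, 1106.0, None, 235.0, 280.0],\n",
"    'median_house_value': [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],\n",
"    'ocean_proximity': ['NEAR BAY'] * 5,\n",
"})\n",
"\n",
"# Column with the largest standard deviation (numeric columns only).\n",
"largest_std = sample.std(numeric_only=True).idxmax()\n",
"\n",
"# Missing values per column, keeping only the columns that have some.\n",
"missing = sample.isna().sum()\n",
"missing = missing[missing > 0]\n",
"\n",
"# Number of rows and columns, and the count of each category.\n",
"n_rows, n_cols = sample.shape\n",
"category_counts = sample['ocean_proximity'].value_counts()\n",
"\n",
"print(largest_std)\n",
"print(missing)\n",
"print(n_rows, n_cols)\n",
"print(category_counts)\n",
"```\n",
"\n",
"Note that comparing standard deviations across columns only ranks them in absolute terms, since the columns are measured in different units."
]
},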
{
"cell_type": "code",
"execution_count": null,
"id": "71440da2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}