atodorov284 committed
Commit d2d624a · 1 Parent(s): 6015ff3

added extra materials and updated README
README.md CHANGED
@@ -4,7 +4,15 @@
   <img src="https://img.shields.io/badge/CCDS-Project%20template-328F97?logo=cookiecutter" />
 </a>
 
-A short description of the project.
+Air pollution is a significant environmental concern, especially in urban areas, where high levels of nitrogen dioxide and ozone harm human health, ecosystems, and overall quality of life. Given these risks, monitoring and forecasting air pollution levels is an important task: it allows timely action to be taken against harmful effects.
+
+In the Netherlands, cities like Utrecht face air-quality challenges driven by urbanization, transportation, and industrial activity. A system that delivers accurate, robust real-time air-quality monitoring and reliable forecasts of future pollution levels would let authorities and residents take preventive measures and plan their activities around the expected air quality. This project focuses on time-series forecasting of air pollution levels, specifically NO$_2$ and O$_3$ concentrations, for the next three days. The task is framed as a regression problem: predicting continuous values from historical environmental data. The project also provides infrastructure for real-time prediction based on recent measurements.
+
+## How To Run This Code
+
+Currently, this repository is at the data engineering stage. To run the data pipeline, run `main.py` under `air-quality-forecast`, which contains the source code of this project. The processed and split datasets can be found under `data/processed`, namely `x_train`, `x_val`, `x_test`, `y_train`, `y_val`, and `y_test`.
+
+The notebooks in this project were used as scratch space for analysis and data merging and do not reflect our final methodology (the source code is under `air-quality-forecast`). Some extra scripts that generate the plots in the report can be found under `extra_scripts`.
 
 ## Project Organization
 
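For reference, a minimal sketch of what the pipeline entry point does, mirroring `main.py` (this assumes `air-quality-forecast/` is on the import path, e.g. by running from that directory, and that `pandas` and `scikit-learn` are installed):

```python
# Equivalent to: python air-quality-forecast/main.py
from data_pipeline import PreprocessingPipeline

# Loads the raw CSVs, merges and lags the features, splits the data into
# train/test/validation sets, normalizes the inputs, and writes
# x_train.csv ... y_val.csv to data/processed/.
pipeline = PreprocessingPipeline()
pipeline.run_pipeline()
```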
air-quality-forecast/data_pipeline.py CHANGED
@@ -1,11 +1,12 @@
 import pandas as pd
 import os
 from utils import FeatureSelector, InputValidator
-from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
+from sklearn.preprocessing import MinMaxScaler
 from sklearn.model_selection import train_test_split
 from typing import Tuple
 import numpy as np
 
+
 class DataLoader:
     def __init__(self, raw_data_path: str, processed_data_path: str) -> None:
         """
@@ -30,8 +31,14 @@ class DataLoader:
         """
         InputValidator.validate_file_exists(self.raw_data_path, "raw_data_path")
 
-        self.raw_griftpark_data = pd.read_csv(os.path.join(self.raw_data_path, 'v1_raw_griftpark,-utrecht-air-quality.csv'))
-        self.raw_utrecht_data = pd.read_csv(os.path.join(self.raw_data_path, 'v1_utrecht 2014-01-29 to 2024-09-11.csv'))
+        self.raw_griftpark_data = pd.read_csv(
+            os.path.join(
+                self.raw_data_path, "v1_raw_griftpark,-utrecht-air-quality.csv"
+            )
+        )
+        self.raw_utrecht_data = pd.read_csv(
+            os.path.join(self.raw_data_path, "v1_utrecht 2014-01-29 to 2024-09-11.csv")
+        )
         return self.raw_griftpark_data, self.raw_utrecht_data
 
     def save_to_csv(self, name: str, data: pd.DataFrame) -> None:
@@ -42,16 +49,18 @@ class DataLoader:
         :param data: The Pandas DataFrame to save as a CSV.
         """
         InputValidator.validate_type(name, str, "name")
-
+
         # If the data is a numpy array, convert it to a Pandas DataFrame
         if isinstance(data, np.ndarray):
             data = pd.DataFrame(data)
-
+
         data.to_csv(os.path.join(self.processed_data_path, name))
 
 
 class FeatureProcessor:
-    def __init__(self, griftpark_data: pd.DataFrame, utrecht_data: pd.DataFrame) -> None:
+    def __init__(
+        self, griftpark_data: pd.DataFrame, utrecht_data: pd.DataFrame
+    ) -> None:
         """
         Initializes the FeatureProcessor with Griftpark and Utrecht data.
 
@@ -60,7 +69,7 @@ class FeatureProcessor:
         """
         InputValidator.validate_type(griftpark_data, pd.DataFrame, "griftpark_data")
         InputValidator.validate_type(utrecht_data, pd.DataFrame, "utrecht_data")
-
+
         self.griftpark_data = griftpark_data
         self.utrecht_data = utrecht_data
         self.merged_data = None
@@ -71,8 +80,12 @@ class FeatureProcessor:
 
         :return: The merged Pandas DataFrame.
         """
-        self.utrecht_data['datetime'] = pd.to_datetime(self.utrecht_data['datetime'], format='%Y-%m-%d').dt.strftime('%d/%m/%Y')
-        self.merged_data = pd.merge(self.griftpark_data, self.utrecht_data, left_on='date', right_on='datetime')
+        self.utrecht_data["datetime"] = pd.to_datetime(
+            self.utrecht_data["datetime"], format="%Y-%m-%d"
+        ).dt.strftime("%d/%m/%Y")
+        self.merged_data = pd.merge(
+            self.griftpark_data, self.utrecht_data, left_on="date", right_on="datetime"
+        )
         return self.merged_data
 
     def sort_data_by_date(self) -> pd.DataFrame:
@@ -83,8 +96,10 @@ class FeatureProcessor:
         """
         if self.merged_data is None:
             raise ValueError("Merged data not available. Please merge data first.")
-        self.merged_data['datetime'] = pd.to_datetime(self.merged_data['datetime'], format='%d/%m/%Y')
-        self.merged_data.sort_values(by='datetime', ascending=False, inplace=True)
+        self.merged_data["datetime"] = pd.to_datetime(
+            self.merged_data["datetime"], format="%d/%m/%Y"
+        )
+        self.merged_data.sort_values(by="datetime", ascending=False, inplace=True)
         return self.merged_data
 
     def select_features(self) -> pd.DataFrame:
@@ -95,20 +110,20 @@ class FeatureProcessor:
         """
         if self.merged_data is None:
             raise ValueError("Merged data not available. Please merge data first.")
-
+
         InputValidator.validate_type(self.merged_data, pd.DataFrame, "merged_data")
-
+
         # Feature selection logic
         cols_to_drop = FeatureSelector.uninformative_columns()
-        self.merged_data.drop(cols_to_drop, axis=1, inplace=True, errors='ignore')
+        self.merged_data.drop(cols_to_drop, axis=1, inplace=True, errors="ignore")
         self.merged_data = FeatureSelector.rename_initial_columns(self.merged_data)
         self.merged_data = FeatureSelector.change_to_numeric(self.merged_data)
-
+
         selected_columns = FeatureSelector.select_cols_by_correlation(self.merged_data)
-        domain_knowledge_columns = ['precip','windspeed', 'winddir']
-        selected_columns = ['date'] + selected_columns + domain_knowledge_columns
+        domain_knowledge_columns = ["precip", "windspeed", "winddir"]
+        selected_columns = ["date"] + selected_columns + domain_knowledge_columns
         self.merged_data = self.merged_data[selected_columns]
-
+
         return self.merged_data
 
     def apply_time_shift(self, t_max: int = 3) -> pd.DataFrame:
@@ -120,20 +135,26 @@ class FeatureProcessor:
         """
         if self.merged_data is None:
             raise ValueError("Data not available. Please process data first.")
-
+
         InputValidator.validate_type(t_max, int, "t_max")
-
+
         # Time shifting logic
         all_cols = self.merged_data.columns
         for t in range(1, t_max + 1):
             for col in all_cols:
-                self.merged_data[[f'{col} - day {t}']] = self.merged_data[[col]].shift(-t)
-
+                self.merged_data[[f"{col} - day {t}"]] = self.merged_data[[col]].shift(
+                    -t
+                )
+
         for t in range(1, 3):
-            for col in ['o3', 'no2']:
-                self.merged_data[[f'{col} + day {t}']] = self.merged_data[[col]].shift(t)
-
-        self.merged_data[self.merged_data.columns] = self.merged_data[self.merged_data.columns].apply(pd.to_numeric)
+            for col in ["o3", "no2"]:
+                self.merged_data[[f"{col} + day {t}"]] = self.merged_data[[col]].shift(
+                    t
+                )
+
+        self.merged_data[self.merged_data.columns] = self.merged_data[
+            self.merged_data.columns
+        ].apply(pd.to_numeric)
         return self.merged_data
 
     def preprocess_data(self) -> pd.DataFrame:
@@ -143,13 +164,36 @@ class FeatureProcessor:
         :return: The preprocessed Pandas DataFrame.
         """
         self.select_features()
-        self.merged_data.set_index('date', inplace=True)
-        self.merged_data.dropna(subset=['o3', 'no2'])
+        self.merged_data.set_index("date", inplace=True)
+        self.merged_data.dropna(subset=["o3", "no2"])
         self.apply_time_shift()
 
         # Drop unnecessary columns
-        self.merged_data.drop(['pm25', 'pm10', 'temp', 'humidity', 'visibility', 'solarradiation', 'precip', 'windspeed', 'winddir'], axis=1, inplace=True)
-        self.merged_data.drop(index=['29/01/2014', '30/01/2014', '31/01/2014', '10/09/2024', '11/09/2024'], inplace=True)
+        self.merged_data.drop(
+            [
+                "pm25",
+                "pm10",
+                "temp",
+                "humidity",
+                "visibility",
+                "solarradiation",
+                "precip",
+                "windspeed",
+                "winddir",
+            ],
+            axis=1,
+            inplace=True,
+        )
+        self.merged_data.drop(
+            index=[
+                "29/01/2014",
+                "30/01/2014",
+                "31/01/2014",
+                "10/09/2024",
+                "11/09/2024",
+            ],
+            inplace=True,
+        )
 
         self.preprocessed_data = self.merged_data
         return self.preprocessed_data
@@ -172,14 +216,28 @@ class PreprocessingPipeline:
         Initializes the PreprocessingPipeline with paths to raw and processed data directories.
         """
         project_root = os.path.dirname(os.path.dirname(__file__))
-        raw_data_path = os.path.join(project_root, 'data', 'raw')
-        processed_data_path = os.path.join(project_root, 'data', 'processed')
+        raw_data_path = os.path.join(project_root, "data", "raw")
+        processed_data_path = os.path.join(project_root, "data", "processed")
 
         self.data_loader = DataLoader(raw_data_path, processed_data_path)
         self.feature_processor = None
         self.normalizer = MinMaxScaler()
-
-    def train_test_validation_split(self, x: pd.DataFrame, y:pd.DataFrame,test_size_: float = 0.15, validation_size: float = 0.15, random_state_ = 4242) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+
+    def train_test_validation_split(
+        self,
+        x: pd.DataFrame,
+        y: pd.DataFrame,
+        test_size_: float = 0.15,
+        validation_size: float = 0.15,
+        random_state_=4242,
+    ) -> Tuple[
+        pd.DataFrame,
+        pd.DataFrame,
+        pd.DataFrame,
+        pd.DataFrame,
+        pd.DataFrame,
+        pd.DataFrame,
+    ]:
         """
         Split the data into training and testing sets.
 
@@ -190,12 +248,21 @@ class PreprocessingPipeline:
         InputValidator.validate_type(y, pd.DataFrame, "data")
         InputValidator.validate_type(test_size_, float, "test_size")
         InputValidator.validate_type(validation_size, float, "validation_size")
-        test_val_proportion = validation_size / (test_size_ + validation_size) # Proportion of test to validation_size
-        x_train, x_test_val, y_train, y_test_val = train_test_split(x, y, test_size=(test_size_+validation_size), random_state=random_state_)
-        x_test, x_val, y_test, y_val = train_test_split(x_test_val, y_test_val, test_size=test_val_proportion, random_state=random_state_)
+        test_val_proportion = validation_size / (
+            test_size_ + validation_size
+        )  # Proportion of test to validation_size
+        x_train, x_test_val, y_train, y_test_val = train_test_split(
+            x, y, test_size=(test_size_ + validation_size), random_state=random_state_
+        )
+        x_test, x_val, y_test, y_val = train_test_split(
+            x_test_val,
+            y_test_val,
+            test_size=test_val_proportion,
+            random_state=random_state_,
+        )
 
         return x_train, x_test, x_val, y_train, y_test, y_val
-
+
     def run_pipeline(self) -> pd.DataFrame:
         """
         Run the entire preprocessing pipeline: load data, process features, normalize, and save to CSV.
@@ -203,7 +270,7 @@ class PreprocessingPipeline:
         :param normalizer_type: The type of normalizer to use.
         :return: The final normalized Pandas DataFrame.
         """
-
+
         # Step 1: Load raw data
         griftpark_data, utrecht_data = self.data_loader()
 
@@ -212,33 +279,44 @@ class PreprocessingPipeline:
         preprocessed_data = self.feature_processor()
 
         # Step 3: Save processed data
-        self.data_loader.save_to_csv('v3_lagged_no_missing_predicted_data.csv', preprocessed_data)
+        self.data_loader.save_to_csv(
+            "v3_lagged_no_missing_predicted_data.csv", preprocessed_data
+        )
 
         # Step 4: Split data into train, test, and validation sets
-        columns_to_predict = ['no2', 'o3', 'no2 + day 1', 'o3 + day 1', 'no2 + day 2', 'o3 + day 2']
+        columns_to_predict = [
+            "no2",
+            "o3",
+            "no2 + day 1",
+            "o3 + day 1",
+            "no2 + day 2",
+            "o3 + day 2",
+        ]
         x = preprocessed_data.drop(columns_to_predict, axis=1)
         y = preprocessed_data[columns_to_predict]
-        x_train, x_test, x_val, y_train, y_test, y_val = self.train_test_validation_split(x, y)
+        x_train, x_test, x_val, y_train, y_test, y_val = (
+            self.train_test_validation_split(x, y)
+        )
 
         # Step 5: Normalize data for 3 sets (x_train, x_test, x_val)
-        x_train[x_train.columns] = self.normalizer.fit_transform(x_train[x_train.columns])
+        x_train[x_train.columns] = self.normalizer.fit_transform(
+            x_train[x_train.columns]
+        )
         x_test[x_test.columns] = self.normalizer.transform(x_test[x_test.columns])
         x_val[x_val.columns] = self.normalizer.transform(x_val[x_val.columns])
 
         # Convert the normalized NumPy array back to a DataFrame
         # normalized_x_train = pd.DataFrame(x_train, columns=preprocessed_data.columns, index=preprocessed_data.index)
 
-
-        #Step 6: Save normalized data
-        self.data_loader.save_to_csv('x_train.csv', x_train)
-        self.data_loader.save_to_csv('x_test.csv', x_test)
-        self.data_loader.save_to_csv('x_val.csv', x_val)
-        self.data_loader.save_to_csv('y_train.csv', y_train)
-        self.data_loader.save_to_csv('y_test.csv', y_test)
-        self.data_loader.save_to_csv('y_val.csv', y_val)
+        # Step 6: Save normalized data
+        self.data_loader.save_to_csv("x_train.csv", x_train)
+        self.data_loader.save_to_csv("x_test.csv", x_test)
+        self.data_loader.save_to_csv("x_val.csv", x_val)
+        self.data_loader.save_to_csv("y_train.csv", y_train)
+        self.data_loader.save_to_csv("y_test.csv", y_test)
+        self.data_loader.save_to_csv("y_val.csv", y_val)
 
         # Convert the normalized NumPy array back to a DataFrame
         # normalized_df = pd.DataFrame(normalized_data, columns=preprocessed_data.columns, index=preprocessed_data.index)
 
         return preprocessed_data
-
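To illustrate the column-naming convention in `apply_time_shift`, here is a toy sketch with hypothetical values. It assumes the frame is sorted newest-first, as `sort_data_by_date` arranges it, so `shift(-t)` pulls a reading from `t` days earlier (a lagged feature) and `shift(t)` pulls one from `t` days later (a forecast target):

```python
import pandas as pd

# Toy frame in descending date order (newest first)
df = pd.DataFrame(
    {"no2": [30.0, 28.0, 26.0, 24.0]},
    index=["04/01/2024", "03/01/2024", "02/01/2024", "01/01/2024"],
)
df["no2 - day 1"] = df["no2"].shift(-1)  # value from 1 day earlier
df["no2 + day 1"] = df["no2"].shift(1)   # value from 1 day later
print(df)
#              no2  no2 - day 1  no2 + day 1
# 04/01/2024  30.0         28.0          NaN
# 03/01/2024  28.0         26.0         30.0
# 02/01/2024  26.0         24.0         28.0
# 01/01/2024  24.0          NaN         26.0
```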
 
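As a quick sanity check on `train_test_validation_split`: with the defaults `test_size_=0.15` and `validation_size=0.15`, the two-stage split works out to a 70/15/15 partition, and the `MinMaxScaler` in `run_pipeline` is then fit on `x_train` only and merely applied to `x_test` and `x_val`. A small sketch of the arithmetic:

```python
test_size_, validation_size = 0.15, 0.15

held_out = test_size_ + validation_size           # 0.30 split off first
test_val_proportion = validation_size / held_out  # 0.50 of the held-out part

print(1 - held_out)                               # 0.70 -> training share
print(held_out * (1 - test_val_proportion))       # 0.15 -> test share
print(held_out * test_val_proportion)             # 0.15 -> validation share
```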
air-quality-forecast/main.py CHANGED
@@ -1,6 +1,5 @@
 from data_pipeline import PreprocessingPipeline
 
-
 if __name__ == "__main__":
     pipeline = PreprocessingPipeline()
-    pipeline.run_pipeline()
+    pipeline.run_pipeline()
air-quality-forecast/utils.py CHANGED
@@ -5,20 +5,23 @@ import os
 File with utilities
 """
 
+
 class InputValidator:
     @staticmethod
     def validate_type(value, expected_type, variable_name: str) -> None:
         """
         Validate the type of the given variable.
-
+
         :param value: The value to validate.
         :param expected_type: The expected type of the value.
         :param variable_name: The name of the variable for error messages.
         :raises TypeError: If the value is not of the expected type.
         """
         if not isinstance(value, expected_type):
-            raise TypeError(f"{variable_name} must be of type {expected_type.__name__}.")
-
+            raise TypeError(
+                f"{variable_name} must be of type {expected_type.__name__}."
+            )
+
     @staticmethod
     def validate_file_exists(path: str, variable_name: str) -> None:
         """
@@ -31,33 +34,63 @@ class InputValidator:
         if not os.path.exists(path):
             raise FileNotFoundError(f"{variable_name} path {path} does not exist.")
 
+
 class FeatureSelector:
     def uninformative_columns() -> list:
-        """ Those columns provide no information that the model can use"""
-        return ["Unnamed: 0", 'name', 'datetime', 'sunrise', 'sunset', 'preciptype', 'conditions', 'description', 'icon', 'stations']
-
+        """Those columns provide no information that the model can use"""
+        return [
+            "Unnamed: 0",
+            "name",
+            "datetime",
+            "sunrise",
+            "sunset",
+            "preciptype",
+            "conditions",
+            "description",
+            "icon",
+            "stations",
+        ]
+
     def rename_initial_columns(data):
-        """ Rename the columns of the datasets to remove whitespaces."""
-        data = data.rename(columns={" pm25": "pm25", " pm10": "pm10", " o3": "o3", " no2": "no2", " so2": "so2"})
-        return data
-
+        """Rename the columns of the datasets to remove whitespaces."""
+        data = data.rename(
+            columns={
+                " pm25": "pm25",
+                " pm10": "pm10",
+                " o3": "o3",
+                " no2": "no2",
+                " so2": "so2",
+            }
+        )
+        return data
+
     def change_to_numeric(data):
-        """ Change each entry to a numerical value."""
-        data.loc[:, data.columns != 'date'] = data.loc[:, data.columns != 'date'].apply(pd.to_numeric, errors='coerce')
+        """Change each entry to a numerical value."""
+        data.loc[:, data.columns != "date"] = data.loc[:, data.columns != "date"].apply(
+            pd.to_numeric, errors="coerce"
+        )
         return data
-
+
     def select_cols_by_correlation(data) -> list:
-        """ Select columns based on correlation criteria."""
-        #Step 1: Calculate correlations between features and O3/NO2
-        corr_no2 = abs(data.loc[:, data.columns != 'date'].corr()['no2'])
-        corr_o3 = abs(data.loc[:, data.columns != 'date'].corr()['o3'])
+        """Select columns based on correlation criteria."""
+        # Step 1: Calculate correlations between features and O3/NO2
+        corr_no2 = abs(data.loc[:, data.columns != "date"].corr()["no2"])
+        corr_o3 = abs(data.loc[:, data.columns != "date"].corr()["o3"])
 
-        #Step 2: Remove the columns not correlated with any of the labels
+        # Step 2: Remove the columns not correlated with any of the labels
         columns_above_threshold = (corr_no2 > 0.3) | (corr_o3 > 0.3)
         selected_columns = columns_above_threshold[columns_above_threshold].index
 
-        #Step 3: Remove the columns with high correlations with each other (chosen by manual inspection of the correlation matrix)
-        to_remove = ['feelslikemax', 'feelslikemin', 'feelslike', 'tempmin', 'tempmax', 'dew', 'solarenergy', 'uvindex']
+        # Step 3: Remove the columns with high correlations with each other (chosen by manual inspection of the correlation matrix)
+        to_remove = [
+            "feelslikemax",
+            "feelslikemin",
+            "feelslike",
+            "tempmin",
+            "tempmax",
+            "dew",
+            "solarenergy",
+            "uvindex",
+        ]
         selected_columns = [item for item in selected_columns if item not in to_remove]
         return selected_columns
-
 
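To make the 0.3 threshold in `select_cols_by_correlation` concrete, a small sketch with hypothetical values (the matrix for the real data is the `data/interim/correlation_matrix.csv` added below): a column survives if its absolute Pearson correlation with `no2` or `o3` exceeds 0.3.

```python
import pandas as pd

# Hypothetical toy data: temp tracks the pollutants, moonphase does not
df = pd.DataFrame(
    {
        "no2": [40.0, 35.0, 30.0, 25.0, 20.0],
        "o3": [20.0, 30.0, 40.0, 50.0, 60.0],
        "temp": [5.0, 8.0, 11.0, 14.0, 17.0],
        "moonphase": [0.5, 0.2, 0.9, 0.1, 0.6],
    }
)
corr_no2 = abs(df.corr()["no2"])
corr_o3 = abs(df.corr()["o3"])
keep = (corr_no2 > 0.3) | (corr_o3 > 0.3)
print(keep[keep].index.tolist())  # ['no2', 'o3', 'temp'] -- moonphase is dropped
```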
data/interim/correlation_matrix.csv ADDED
@@ -0,0 +1,30 @@
+,pm25,pm10,o3,no2,so2,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,dew,humidity,precip,precipprob,precipcover,snow,snowdepth,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,severerisk,moonphase
+pm25,1,0.602,-0.239,0.397,0.044,-0.293,-0.434,-0.38,-0.293,-0.419,-0.374,-0.361,0.177,-0.224,-0.24,-0.241,0.103,0.099,-0.333,-0.243,-0.225,0.274,-0.106,-0.599,-0.223,-0.224,-0.209,0.053,0.07
+pm10,0.602,1,-0.146,0.504,0.085,-0.114,-0.245,-0.191,-0.116,-0.227,-0.184,-0.212,0.018,-0.182,-0.252,-0.229,0.052,0.036,-0.251,-0.199,-0.157,0.246,-0.175,-0.356,-0.097,-0.097,-0.074,0.113,0.02
+o3,-0.239,-0.146,1,-0.505,0.001,0.599,0.412,0.555,0.577,0.42,0.533,0.356,-0.587,-0.048,-0.166,-0.133,-0.051,-0.053,-0.006,-0.018,-0.062,0.036,-0.275,0.418,0.636,0.636,0.608,0.073,0.029
+no2,0.397,0.504,-0.505,1,0.012,-0.384,-0.453,-0.444,-0.377,-0.43,-0.424,-0.379,0.285,-0.113,-0.083,-0.076,0.038,0.039,-0.243,-0.209,-0.038,0.146,-0.088,-0.35,-0.372,-0.372,-0.358,-0.026,0.018
+so2,0.044,0.085,0.001,0.012,1,0.044,0.023,0.037,0.052,0.036,0.046,0.045,-0.003,-0.021,-0.018,-0.022,-0.016,-0.033,-0.019,-0.045,0.082,0.03,-0.008,-0.103,0.038,0.037,0.058,,0.078
+tempmax,-0.293,-0.114,0.599,-0.384,0.044,1,0.828,0.967,0.993,0.861,0.964,0.849,-0.494,-0.063,-0.203,-0.24,-0.165,-0.173,-0.139,-0.172,0,0.045,-0.242,0.353,0.726,0.726,0.711,0.182,0.026
+tempmin,-0.434,-0.245,0.412,-0.453,0.023,0.828,1,0.934,0.829,0.984,0.923,0.935,-0.233,0.117,0.077,-0.002,-0.141,-0.179,0.063,-0.002,0.147,-0.158,0.107,0.284,0.477,0.477,0.459,0.146,0.017
+temp,-0.38,-0.191,0.555,-0.444,0.037,0.967,0.934,1,0.963,0.949,0.993,0.925,-0.412,0.009,-0.09,-0.146,-0.162,-0.186,-0.052,-0.102,0.071,-0.04,-0.097,0.35,0.656,0.657,0.636,0.166,0.021
+feelslikemax,-0.293,-0.116,0.577,-0.377,0.052,0.993,0.829,0.963,1,0.867,0.97,0.857,-0.467,-0.055,-0.197,-0.235,-0.175,-0.19,-0.152,-0.19,0.007,0.049,-0.221,0.337,0.715,0.715,0.703,0.181,0.026
+feelslikemin,-0.419,-0.227,0.42,-0.43,0.036,0.861,0.984,0.949,0.867,1,0.952,0.946,-0.242,0.078,0.026,-0.057,-0.162,-0.192,-0.028,-0.098,0.144,-0.099,0.066,0.281,0.528,0.528,0.512,0.144,0.014
+feelslike,-0.374,-0.184,0.533,-0.424,0.046,0.964,0.923,0.993,0.97,0.952,1,0.927,-0.389,0,-0.104,-0.159,-0.176,-0.198,-0.097,-0.153,0.076,-0.016,-0.096,0.334,0.66,0.66,0.642,0.159,0.019
+dew,-0.361,-0.212,0.356,-0.379,0.045,0.849,0.935,0.925,0.857,0.946,0.927,1,-0.04,0.13,0.072,0.001,-0.15,-0.185,-0.063,-0.128,0.154,-0.12,0.103,0.127,0.43,0.43,0.416,0.156,0.024
+humidity,0.177,0.018,-0.587,0.285,-0.003,-0.494,-0.233,-0.412,-0.467,-0.242,-0.389,-0.04,1,0.285,0.391,0.387,0.076,0.05,-0.065,-0.081,0.155,-0.171,0.472,-0.658,-0.698,-0.698,-0.685,-0.063,0.007
+precip,-0.224,-0.182,-0.048,-0.113,-0.021,-0.063,0.117,0.009,-0.055,0.078,0,0.13,0.285,1,0.48,0.682,0.039,0.001,0.374,0.32,0.12,-0.403,0.278,-0.119,-0.219,-0.219,-0.223,-0.043,-0.021
+precipprob,-0.24,-0.252,-0.166,-0.083,-0.018,-0.203,0.077,-0.09,-0.197,0.026,-0.104,0.072,0.391,0.48,1,0.642,0.069,0.012,0.433,0.37,0.279,-0.476,0.419,-0.113,-0.356,-0.356,-0.355,-0.005,-0.02
+precipcover,-0.241,-0.229,-0.133,-0.076,-0.022,-0.24,-0.002,-0.146,-0.235,-0.057,-0.159,0.001,0.387,0.682,0.642,1,0.101,0.034,0.41,0.335,0.187,-0.466,0.378,-0.175,-0.375,-0.375,-0.387,-0.047,-0.032
+snow,0.103,0.052,-0.051,0.038,-0.016,-0.165,-0.141,-0.162,-0.175,-0.162,-0.176,-0.15,0.076,0.039,0.069,0.101,1,0.346,0.027,0.041,-0.057,-0.061,0.056,-0.096,-0.097,-0.097,-0.102,-0.011,0.034
+snowdepth,0.099,0.036,-0.053,0.039,-0.033,-0.173,-0.179,-0.186,-0.19,-0.192,-0.198,-0.185,0.05,0.001,0.012,0.034,0.346,1,-0.012,0.016,-0.062,-0.007,0.002,-0.042,-0.076,-0.076,-0.075,-0.015,0.016
+windgust,-0.333,-0.251,-0.006,-0.243,-0.019,-0.139,0.063,-0.052,-0.152,-0.028,-0.097,-0.063,-0.065,0.374,0.433,0.41,0.027,-0.012,1,0.907,0.191,-0.418,0.201,0.173,-0.174,-0.174,-0.176,-0.022,-0.021
+windspeed,-0.243,-0.199,-0.018,-0.209,-0.045,-0.172,-0.002,-0.102,-0.19,-0.098,-0.153,-0.128,-0.081,0.32,0.37,0.335,0.041,0.016,0.907,1,0.101,-0.374,0.136,0.148,-0.187,-0.187,-0.184,-0.045,-0.015
+winddir,-0.225,-0.157,-0.062,-0.038,0.082,0,0.147,0.071,0.007,0.144,0.076,0.154,0.155,0.12,0.279,0.187,-0.057,-0.062,0.191,0.101,1,-0.096,0.224,-0.003,-0.029,-0.028,-0.037,0.003,-0.047
+sealevelpressure,0.274,0.246,0.036,0.146,0.03,0.045,-0.158,-0.04,0.049,-0.099,-0.016,-0.12,-0.171,-0.403,-0.476,-0.466,-0.061,-0.007,-0.418,-0.374,-0.096,1,-0.325,-0.045,0.216,0.216,0.213,-0.032,-0.005
+cloudcover,-0.106,-0.175,-0.275,-0.088,-0.008,-0.242,0.107,-0.097,-0.221,0.066,-0.096,0.103,0.472,0.278,0.419,0.378,0.056,0.002,0.201,0.136,0.224,-0.325,1,-0.208,-0.452,-0.452,-0.438,-0.063,-0.026
+visibility,-0.599,-0.356,0.418,-0.35,-0.103,0.353,0.284,0.35,0.337,0.281,0.334,0.127,-0.658,-0.119,-0.113,-0.175,-0.096,-0.042,0.173,0.148,-0.003,-0.045,-0.208,1,0.45,0.45,0.447,0.033,-0.023
+solarradiation,-0.223,-0.097,0.636,-0.372,0.038,0.726,0.477,0.656,0.715,0.528,0.66,0.43,-0.698,-0.219,-0.356,-0.375,-0.097,-0.076,-0.174,-0.187,-0.029,0.216,-0.452,0.45,1,1,0.965,0.118,0.007
+solarenergy,-0.224,-0.097,0.636,-0.372,0.037,0.726,0.477,0.657,0.715,0.528,0.66,0.43,-0.698,-0.219,-0.356,-0.375,-0.097,-0.076,-0.174,-0.187,-0.028,0.216,-0.452,0.45,1,1,0.965,0.119,0.007
+uvindex,-0.209,-0.074,0.608,-0.358,0.058,0.711,0.459,0.636,0.703,0.512,0.642,0.416,-0.685,-0.223,-0.355,-0.387,-0.102,-0.075,-0.176,-0.184,-0.037,0.213,-0.438,0.447,0.965,0.965,1,0.115,0.004
+severerisk,0.053,0.113,0.073,-0.026,,0.182,0.146,0.166,0.181,0.144,0.159,0.156,-0.063,-0.043,-0.005,-0.047,-0.011,-0.015,-0.022,-0.045,0.003,-0.032,-0.063,0.033,0.118,0.119,0.115,1,0.023
+moonphase,0.07,0.02,0.029,0.018,0.078,0.026,0.017,0.021,0.026,0.014,0.019,0.024,0.007,-0.021,-0.02,-0.032,0.034,0.016,-0.021,-0.015,-0.047,-0.005,-0.026,-0.023,0.007,0.007,0.004,0.023,1
extra_scripts/corr_map.R ADDED
@@ -0,0 +1,34 @@
+library(ggplot2)
+library(reshape2)
+library(dplyr)
+library(extrafont)
+
+loadfonts(device = "pdf") # Load fonts
+
+correlation_matrix <- read.csv(file.choose(), row.names = 1)
+
+correlation_matrix[abs(correlation_matrix) < 0.3] <- NA
+
+correlation_matrix[lower.tri(correlation_matrix)] <- NA
+diag(correlation_matrix) <- NA
+
+melted_cor <- melt(as.matrix(correlation_matrix), na.rm = TRUE)
+
+melted_cor$fill_value <- ifelse(is.na(melted_cor$value), NA, melted_cor$value)
+
+ggplot(melted_cor, aes(x = Var1, y = Var2)) +
+  geom_tile(aes(fill = fill_value), color = "black") +
+  labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation") +
+  scale_fill_gradient2(mid = "#FBFEF9", low = "#0C6291", high = "#A63446",
+                       limits = c(-1, 1), na.value = "lightgray") +
+  theme_classic() +
+  scale_x_discrete(expand = c(0, 0)) +
+  scale_y_discrete(expand = c(0, 0)) +
+  theme(text = element_text(family = "Arial"), # Use a common font
+        plot.title = element_text(size = 14, family = "Arial"), # Title font
+        axis.text.x = element_text(angle = 45, hjust = 1),
+        legend.position = "right", # Position of the legend
+        legend.title = element_text(size = 12), # Title size
+        legend.text = element_text(size = 10), # Text size
+        legend.key.size = unit(1.5, "cm"), # Size of the legend keys
+        legend.key.width = unit(1, "cm")) # Width of the legend keys
extra_scripts/histograms.tex ADDED
@@ -0,0 +1,167 @@
+\documentclass[tikz]{standalone}
+\usepackage{pgfplots}
+\usepackage{pgfplotstable}
+\pgfplotsset{compat=1.17}
+
+\begin{document}
+
+% Load the dataset
+\pgfplotstableread[col sep=comma]{data/processed/v2_merged_selected_features_with_missing.csv}\datatable
+
+\begin{tikzpicture}
+
+% Main title at the top
+\node[align=center] at (7, 3.75) {\textbf{Distributions of Environmental Variables}};
+
+% First row of plots
+\begin{axis}[
+    at={(0,0)},
+    width=5.5cm,
+    xlabel=PM$_{2.5}$ ($\mu g /m^3$),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ylabel=Frequency,
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=1] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(5cm,0)},
+    width=5.5cm,
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    xlabel=PM$_{10}$ ($\mu g /m^3$),
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=2] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(10cm,0)},
+    width=5.5cm,
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    xlabel=O$_{3}$ ($\mu g /m^3$),
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=3] {\datatable};
+\end{axis}
+
+% Second row of plots
+\begin{axis}[
+    at={(0,-5cm)},
+    width=5.5cm,
+    xlabel=NO$_{2}$ ($\mu g /m^3$),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ylabel=Frequency,
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=4] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(5cm,-5cm)},
+    width=5.5cm,
+    xlabel=Temperature (°C),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=5] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(10cm,-5cm)},
+    width=5.5cm,
+    xlabel=Humidity (\%),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=6] {\datatable};
+\end{axis}
+
+% Third row of plots
+\begin{axis}[
+    at={(0,-10cm)},
+    width=5.5cm,
+    xlabel=Visibility ($km$),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ylabel=Frequency,
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=7] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(5cm,-10cm)},
+    width=5.5cm,
+    xlabel=Solar Radiation ($W/m^2$),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=8] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(10cm,-10cm)},
+    width=5.5cm,
+    xlabel=Precipitation ($mm$),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=9] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(2cm,-15cm)},
+    width=5.5cm,
+    xlabel=Windspeed ($km/h$),
+    ylabel=Frequency,
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=10] {\datatable};
+\end{axis}
+
+\begin{axis}[
+    at={(8cm,-15cm)},
+    width=5.5cm,
+    xlabel=Wind Direction (degrees),
+    tick label style={font=\fontsize{8}{8}\selectfont},
+    ybar=0pt, bar width=1,
+]
+\addplot+[fill=cyan,
+    fill opacity=0.5,
+    hist={bins=20}
+] table [y index=11] {\datatable};
+\end{axis}
+
+\end{tikzpicture}
+\end{document}