diff --git "a/doc/JAO_Data_Treatment_Plan.md" "b/doc/JAO_Data_Treatment_Plan.md"
new file mode 100644
--- /dev/null
+++ "b/doc/JAO_Data_Treatment_Plan.md"
@@ -0,0 +1,2802 @@
+# JAO Data Treatment & Feature Extraction Plan
+## Complete Guide for FBMC Flow Forecasting MVP
+
+**Last Updated:** October 29, 2025
+**Project:** FBMC Zero-Shot Forecasting with Chronos 2
+**Timeline:** 5-Day MVP
+**Version:** 5.0 - OUTAGE DURATION FEATURES (Added temporal outage metrics for enhanced forecasting)
+
+---
+
+## Table of Contents
+
+1. [Data Acquisition Strategy](#1-data-acquisition-strategy)
+2. [JAO Data Series Catalog](#2-jao-data-series-catalog)
+3. [Data Cleaning Procedures](#3-data-cleaning-procedures)
+4. [CNEC Treatment](#4-cnec-treatment)
+5. [PTDF Treatment](#5-ptdf-treatment)
+6. [RAM Treatment](#6-ram-treatment)
+7. [Shadow Prices Treatment](#7-shadow-prices-treatment)
+8. [Feature Engineering Pipeline](#8-feature-engineering-pipeline)
+9. [Quality Assurance](#9-quality-assurance)
+
+---
+
+## 1. Data Acquisition Strategy
+
+### 1.1 Timeline and Scope
+
+**Historical Data Period:** October 2023 - September 2025 (24 months)
+- Purpose: Feature baselines and historical context
+- Total: ~17,520 hourly records per series
+
+**Collection Priority:**
+1. **Day 1 Morning (2 hours):** Max BEX files (TARGET VARIABLE - highest priority!)
+2. **Day 1 Morning (2 hours):** CNEC files (critical for constraint identification)
+3. **Day 1 Morning (2 hours):** PTDF matrices (network sensitivity)
+4. **Day 1 Afternoon (1.5 hours):** LTN values (future covariates)
+5. **Day 1 Afternoon (1.5 hours):** Min/Max Net Positions (domain boundaries)
+6. **Day 1 Afternoon (1.5 hours):** ATC Non-Core borders (loop flow drivers)
+7. **Day 1 Afternoon (1.5 hours):** RAM values (capacity margins)
+8. **Day 1 Afternoon (1.5 hours):** Shadow prices (economic signals)
+
+### 1.2 jao-py Python Library Usage
+
+**Installation & Setup:**
+```bash
+# Install jao-py using uv package manager (quote the spec so the shell
+# does not treat ">=" as output redirection)
+.venv/Scripts/uv.exe pip install "jao-py>=0.6.2"
+
+# Verify installation
+.venv/Scripts/python.exe -c "from jao import JaoPublicationToolPandasClient; print('jao-py installed successfully')"
+```
+
+**Python Data Collection:**
+```python
+import pandas as pd
+from jao import JaoPublicationToolPandasClient
+import time
+
+# Initialize client (no API key required for public data)
+client = JaoPublicationToolPandasClient()
+
+# Define date range (24 months: Oct 2023 - Sept 2025)
+# IMPORTANT: jao-py requires pandas Timestamp with timezone (UTC)
+start_date = pd.Timestamp('2023-10-01', tz='UTC')
+end_date = pd.Timestamp('2025-09-30', tz='UTC')
+
+# Collect MaxBEX data (TARGET VARIABLE) - day by day with rate limiting
+maxbex_data = []
+current_date = start_date
+while current_date <= end_date:
+    df = client.query_maxbex(current_date)
+    maxbex_data.append(df)
+    current_date += pd.Timedelta(days=1)
+    time.sleep(5)  # Rate limiting: 5 seconds between requests
+
+# Combine and save
+maxbex_df = pd.concat(maxbex_data)
+maxbex_df.to_parquet('data/raw/jao/maxbex_2023_2025.parquet')
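+
+# Optional hardening (an addition to this plan, not a jao-py feature): long
+# multi-month pulls occasionally hit transient HTTP/network errors. A thin
+# retry wrapper keeps the day-by-day loops alive; max_retries and backoff_s
+# are illustrative values.
+def query_with_retry(query_fn, day, max_retries=3, backoff_s=30):
+    """Call a jao-py query function, retrying with a fixed backoff."""
+    for attempt in range(max_retries):
+        try:
+            return query_fn(day)
+        except Exception:
+            if attempt == max_retries - 1:
+                raise  # give up after the final attempt
+            time.sleep(backoff_s)
+
+# e.g.: df = query_with_retry(client.query_active_constraints, current_date)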
+
+# Collect Active Constraints (CNECs + PTDFs in ONE call!)
+cnec_data = []
+current_date = start_date
+while current_date <= end_date:
+    df = client.query_active_constraints(current_date)
+    cnec_data.append(df)
+    current_date += pd.Timedelta(days=1)
+    time.sleep(5)  # Rate limiting: 5 seconds between requests
+
+# Combine and save
+cnec_df = pd.concat(cnec_data)
+cnec_df.to_parquet('data/raw/jao/cnecs_ptdfs_2023_2025.parquet')
+```
+
+**Key Methods Available:**
+- `query_maxbex(day)` - Maximum Bilateral Exchange capacity (TARGET)
+- `query_active_constraints(day)` - CNECs with shadow prices, RAM, and PTDFs
+- `query_lta(d_from, d_to)` - Long Term Allocations (perfect future covariate)
+- `query_minmax_np(day)` - Min/Max Net Positions (domain boundaries)
+- `query_net_position(day)` - Actual net positions
+- `query_monitoring(day)` - Monitoring data (additional RAM/shadow prices)
+
+**Expected Output Files:**
+```
+data/raw/jao/
+├── maxbex_2023_2025.parquet (~200 MB)
+├── cnecs_2023_2025.parquet (~500 MB)
+├── ptdfs_2023_2025.parquet (~800 MB)
+├── rams_2023_2025.parquet (~400 MB)
+├── shadow_prices_2023_2025.parquet (~300 MB)
+├── presolved_2023_2025.parquet (~200 MB)
+├── d2cf_2023_2025.parquet (~600 MB - OPTIONAL)
+└── metadata/
+    ├── cnec_definitions.parquet
+    ├── zone_ptdf_mapping.parquet
+    └── download_log.json
+```
+
+---
+
+## 2. JAO Data Series Catalog
+
+### 2.1 Max BEX Data Series (TARGET VARIABLE)
+
+**File:** `Core_DA_Results_[date].xml` or `Core_Max_Exchanges_[date].xml`
+
+**CRITICAL:** This is the **TARGET VARIABLE** we are forecasting - without Max BEX, the model cannot be trained or evaluated!
+
+| Field Name | Data Type | Description | Use in Features | Missing Value Strategy |
+|------------|-----------|-------------|-----------------|----------------------|
+| `border_id` | String | Border identifier (e.g., "DE-FR", "DE-NL") | Border tracking | **REQUIRED** |
+| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** |
+| `max_bex` | Float (MW) | Maximum Bilateral Exchange Capacity | **TARGET VARIABLE** | **CRITICAL - must have** |
+| `direction` | String | Direction of capacity (e.g., "DE→FR") | Directional features | Parse from border |
+
+**Description:**
+Max BEX represents the **actual available cross-border capacity** after FBMC optimization. It is the OUTPUT of the FBMC calculation process and represents how much capacity is actually available for bilateral exchanges after accounting for:
+- All CNEC constraints (RAM limits)
+- PTDF sensitivities
+- LTN allocations (capacity already committed)
+- Remedial actions
+- All network constraints
+
+**IMPORTANT: Commercial vs Physical Capacity**
+- MaxBEX = **commercial hub-to-hub trading capacity**, NOT physical interconnector ratings
+- FBMC Core has 12 bidding zones: AT, BE, CZ, DE-LU, FR, HR, HU, NL, PL, RO, SI, SK
+- **All 132 zone-pair combinations exist** (12 × 11 ordered, direction-specific pairs)
+- This includes "virtual borders" - zone pairs without direct physical interconnectors
+- Example: FR→HU capacity exists despite no physical FR-HU interconnector
+  - Power flows through AC grid network via DE, AT, CZ
+  - PTDFs quantify how this exchange affects every CNEC in the network
+  - MaxBEX = optimization result considering ALL network constraints
+
+**Borders to Collect (ALL 132 zone pairs):**
+- JAO provides MaxBEX for all 12 × 11 = 132 directed zone-pair combinations
+- Includes both physical borders (e.g., DE→FR, AT→CZ) and virtual borders (e.g., FR→HU, BE→PL)
+- Each direction (A→B vs B→A) has independent capacity values
+- Full list: All permutations of (AT, BE, CZ, DE, FR, HR, HU, NL, PL, RO, SI, SK)
+- Example physical borders: DE→FR, DE→NL, AT→CZ, BE→NL, FR→BE
+- Example virtual borders: FR→HU, AT→HR, BE→RO, NL→SK, CZ→HR
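+
+The 132 figure is simply the count of ordered zone pairs. A minimal sketch (zone codes as listed above; the `A>B` label format is illustrative, not JAO's):
+
+```python
+from itertools import permutations
+
+CORE_ZONES = ['AT', 'BE', 'CZ', 'DE_LU', 'FR', 'HR',
+              'HU', 'NL', 'PL', 'RO', 'SI', 'SK']
+
+# A→B and B→A are separate MaxBEX series, so enumerate ordered pairs
+borders = [f'{a}>{b}' for a, b in permutations(CORE_ZONES, 2)]
+assert len(borders) == 132  # 12 × 11 directed pairs
+```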
+
+**Collection Priority:** DAY 1 MORNING (FIRST PRIORITY)
+This is the ground truth for training and validation - collect before any other data series!
+
+**Storage Schema:**
+```python
+max_bex_schema = {
+    'timestamp': pl.Datetime,
+    'border_id': pl.Utf8,
+    'max_bex': pl.Float32,  # TARGET VARIABLE
+    'direction': pl.Utf8
+}
+```
+
+**Expected File Size:** ~1.2-1.5 GB uncompressed for 24 months × 132 zone pairs × 17,520 hours (actual: ~200 MB compressed as Parquet)
+
+---
+
+### 2.2 LTN Data Series (Long Term Nominations)
+
+**File:** `Core_LTN_[date].xml` or available through `Core_ltn.R` JAOPuTo function
+
+**CRITICAL:** LTN values are **PERFECT FUTURE COVARIATES** because they result from yearly/monthly capacity auctions and are known with certainty for the forecast horizon!
+
+| Field Name | Data Type | Description | Use in Features | Missing Value Strategy |
+|------------|-----------|-------------|-----------------|----------------------|
+| `border_id` | String | Border identifier (e.g., "DE-FR", "SI-HR") | Border tracking | **REQUIRED** |
+| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** |
+| `ltn_allocated` | Float (MW) | Capacity already allocated in long-term auctions | **INPUT FEATURE & FUTURE COVARIATE** | 0.0 (no LT allocation) |
+| `direction` | String | Direction of nomination | Directional features | Parse from border |
+| `auction_type` | String | "yearly" or "monthly" | Allocation source tracking | Required field |
+
+**Description:**
+LTN represents capacity already committed through long-term (yearly and monthly) auctions conducted by JAO.
This capacity is **guaranteed and known in advance**, making it an exceptionally valuable feature: + +**Historical Context (What Happened):** +- LTN values show how much capacity was already allocated +- Higher LTN → lower available day-ahead capacity (Max BEX) +- Direct relationship: `Max BEX ≈ Theoretical Max - LTN - Other Constraints` + +**Future Covariates (What We Know Will Happen):** +- **Yearly auctions:** Results are known for the **entire year ahead** +- **Monthly auctions:** Results are known for the **month ahead** +- When forecasting D+1 to D+14, LTN values are **100% certain** (already allocated) +- This is **GOLD STANDARD** for future covariates - no forecasting needed! + +**Borders with LTN:** +Most Core borders have LTN=0 (Financial Transmission Rights), but important exceptions: +- **SI-HR** (Slovenia-Croatia): Physical Transmission Rights (PTRs) still in use +- Some external Core borders may have PTRs +- Even when LTN=0, including the series confirms this to the model + +**Impact on Max BEX:** +Example: If DE-FR has 500 MW LTN allocated: +- Theoretical capacity: 3000 MW +- LTN reduction: -500 MW +- After network constraints: -400 MW +- **Result: Max BEX ≈ 2100 MW** + +**Collection Priority:** DAY 1 AFTERNOON (HIGH PRIORITY - future covariates) + +**Storage Schema:** +```python +ltn_schema = { + 'timestamp': pl.Datetime, + 'border_id': pl.Utf8, + 'ltn_allocated': pl.Float32, # Amount already committed + 'direction': pl.Utf8, + 'auction_type': pl.Utf8 # yearly/monthly +} +``` + +**Expected File Size:** ~100-160 MB for 24 months × 20 borders × 17,520 hours + +**JAOPuTo Download Command:** +```bash +# Download LTN data for 24 months +java -jar JAOPuTo.jar \ + --start-date 2023-10-01 \ + --end-date 2025-09-30 \ + --data-type LTN \ + --region CORE \ + --output-format parquet \ + --output-dir ./data/raw/jao/ltn/ +``` + +**Future Covariate Usage:** +```python +# LTN is known in advance from auction results +# For forecast period D+1 to D+14, we have PERFECT information +def prepare_ltn_future_covariates(ltn_df, forecast_start_date, prediction_horizon_hours=336): + """ + Extract LTN values for future forecast horizon + These are KNOWN with certainty (auction results) + """ + forecast_end_date = forecast_start_date + pd.Timedelta(hours=prediction_horizon_hours) + + future_ltn = ltn_df.filter( + (pl.col('timestamp') >= forecast_start_date) & + (pl.col('timestamp') < forecast_end_date) + ) + + return future_ltn # No forecasting needed - these are actual commitments! +``` + +--- + +### 2.3 Min/Max Net Position Data Series + +**File:** `Core_MaxNetPositions_[date].xml` or available through `Core_maxnetpositions.R` JAOPuTo function + +**CRITICAL:** Min/Max Net Positions define the **feasible domain** for each bidding zone - the boundaries within which net positions can move without violating network constraints. 
+ +| Field Name | Data Type | Description | Use in Features | Missing Value Strategy | +|------------|-----------|-------------|-----------------|----------------------| +| `zone` | String | Bidding zone (e.g., "DE_LU", "FR", "BE") | Zone tracking | **REQUIRED** | +| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | +| `net_pos_min` | Float (MW) | Minimum feasible net position | **DOMAIN BOUNDARY** | **CRITICAL - must have** | +| `net_pos_max` | Float (MW) | Maximum feasible net position | **DOMAIN BOUNDARY** | **CRITICAL - must have** | +| `actual_net_pos` | Float (MW) | Actual net position after market clearing | Reference value | 0.0 if missing | + +**Description:** +Min/Max Net Positions represent the **feasible space** in which each bidding zone can operate without violating any CNEC constraints. These values are extracted from the union of: +- Final flow-based domain +- Final bilateral exchange restrictions (LTA inclusion) +- All CNEC constraints simultaneously satisfied + +**Why This Matters for Forecasting:** +1. **Tight ranges indicate constrained systems:** + - If `net_pos_max - net_pos_min` is small → limited flexibility → likely lower Max BEX + - If range is large → system has degrees of freedom → potentially higher Max BEX + +2. **Zone-specific stress indicators:** + - A zone with narrow net position range is "boxed in" + - This propagates to adjacent borders (reduces their capacity) + +3. **Derived features:** + ```python + # Range = degrees of freedom + net_pos_range = net_pos_max - net_pos_min + + # Margin = how close to limits (utilization) + net_pos_margin = (actual_net_pos - net_pos_min) / (net_pos_max - net_pos_min) + + # Stress index = inverse of range (tighter = more stressed) + zone_stress = 1.0 / (net_pos_range + 1.0) # +1 to avoid division by zero + ``` + +**Example Interpretation:** +For Germany (DE_LU) at a specific hour: +- `net_pos_min` = -8000 MW (maximum import) +- `net_pos_max` = +12000 MW (maximum export) +- `actual_net_pos` = +6000 MW (actual export after market clearing) +- **Range** = 20,000 MW (large flexibility) +- **Margin** = (6000 - (-8000)) / 20000 = 0.70 (70% utilization toward export limit) + +If another hour shows: +- `net_pos_min` = -2000 MW +- `net_pos_max` = +3000 MW +- **Range** = 5,000 MW (VERY constrained - system is tight!) 
+- This hour will likely have lower Max BEX on German borders + +**Zones to Collect (12 Core zones):** +- DE_LU (Germany-Luxembourg) +- FR (France) +- BE (Belgium) +- NL (Netherlands) +- AT (Austria) +- CZ (Czech Republic) +- PL (Poland) +- SK (Slovakia) +- HU (Hungary) +- SI (Slovenia) +- HR (Croatia) +- RO (Romania) + +**Collection Priority:** DAY 1 AFTERNOON (HIGH PRIORITY) + +**Storage Schema:** +```python +netpos_schema = { + 'timestamp': pl.Datetime, + 'zone': pl.Utf8, + 'net_pos_min': pl.Float32, # Minimum feasible (import limit) + 'net_pos_max': pl.Float32, # Maximum feasible (export limit) + 'actual_net_pos': pl.Float32 # Actual market clearing value +} +``` + +**Expected File Size:** ~160-200 MB for 24 months × 12 zones × 17,520 hours + +**JAOPuTo Download Command:** +```bash +# Download Min/Max Net Positions for 24 months +java -jar JAOPuTo.jar \ + --start-date 2023-10-01 \ + --end-date 2025-09-30 \ + --data-type MAX_NET_POSITIONS \ + --region CORE \ + --output-format parquet \ + --output-dir ./data/raw/jao/netpos/ +``` + +**Feature Engineering:** +```python +def engineer_netpos_features(df: pl.DataFrame) -> pl.DataFrame: + """ + Create net position features from Min/Max data + """ + df = df.with_columns([ + # 1. Range (degrees of freedom) + (pl.col('net_pos_max') - pl.col('net_pos_min')) + .alias('net_pos_range'), + + # 2. Utilization margin (0 = at min, 1 = at max) + ((pl.col('actual_net_pos') - pl.col('net_pos_min')) / + (pl.col('net_pos_max') - pl.col('net_pos_min'))) + .alias('net_pos_margin'), + + # 3. Stress index (inverse of range, normalized) + (1.0 / (pl.col('net_pos_max') - pl.col('net_pos_min') + 100.0)) + .alias('zone_stress_index'), + + # 4. Distance to limits (minimum of distances to both boundaries) + pl.min_horizontal([ + pl.col('actual_net_pos') - pl.col('net_pos_min'), + pl.col('net_pos_max') - pl.col('actual_net_pos') + ]).alias('distance_to_nearest_limit'), + + # 5. Asymmetry (is the feasible space symmetric around zero?) 
+        ((pl.col('net_pos_max') + pl.col('net_pos_min')) / 2.0)
+        .alias('domain_asymmetry')
+    ])
+
+    return df
+```
+
+**System-Level Aggregations:**
+```python
+def aggregate_netpos_system_wide(df: pl.DataFrame) -> pl.DataFrame:
+    """
+    Create system-wide net position stress indicators
+    """
+    system_features = df.groupby('timestamp').agg([
+        # Average range across all zones
+        pl.col('net_pos_range').mean().alias('avg_zone_flexibility'),
+
+        # Minimum range (most constrained zone)
+        pl.col('net_pos_range').min().alias('min_zone_flexibility'),
+
+        # Count of highly constrained zones (range < 5000 MW)
+        (pl.col('net_pos_range') < 5000).sum().alias('n_constrained_zones'),
+
+        # System-wide stress (average of zone stress indices)
+        pl.col('zone_stress_index').mean().alias('system_stress_avg'),
+
+        # Maximum zone stress (tightest zone)
+        pl.col('zone_stress_index').max().alias('system_stress_max')
+    ])
+
+    return system_features
+```
+
+---
+
+### 2.4 CNEC Data Series
+
+**File:** `Core_DA_CC_CNEC_[date].xml` (daily publication at 10:30 CET)
+
+| Field Name | Data Type | Description | Use in Features | Missing Value Strategy |
+|------------|-----------|-------------|-----------------|----------------------|
+| `cnec_id` | String | Unique identifier (e.g., "DE_CZ_TIE_001") | CNEC tracking | N/A - Required field |
+| `cnec_name` | String | Human-readable name (e.g., "Line Röhrsdorf-Hradec N-1 ABC") | Documentation | Forward-fill |
+| `contingency` | String | N-1 element causing constraint | CNEC classification | "basecase" if missing |
+| `tso` | String | Responsible TSO (e.g., "50Hertz", "ČEPS") | Geographic features | Parse from `cnec_id` |
+| `monitored_element` | String | Physical line/transformer being monitored | CNEC grouping | Required field |
+| `fmax` | Float (MW) | Maximum flow limit under contingency | Normalization baseline | **CRITICAL - must have** |
+| `ram_before` | Float (MW) | Initial remaining available margin | Historical patterns | 0.0 (conservative) |
+| `ram_after` | Float (MW) | Final RAM after remedial actions | **Primary feature** | Forward-fill from `ram_before` |
+| `flow_fb` | Float (MW) | Final flow after market coupling | Flow patterns | 0.0 if unconstrained |
+| `presolved` | Boolean | Was CNEC binding/active? | **Key target feature** | False (not binding) |
+| `shadow_price` | Float (€/MW) | Lagrange multiplier for constraint | Economic signal | 0.0 if not binding |
+| `direction` | String | "import" or "export" from perspective | Directionality | Parse from flow sign |
+| `voltage_level` | Integer (kV) | Voltage level (e.g., 220, 380) | CNEC classification | 380 (most common) |
+| `timestamp` | Datetime | Delivery hour (CET/CEST) | Time index | **REQUIRED** |
+| `business_day` | Date | Market day (D for delivery D+1) | File organization | Derive from timestamp |
+
+**Storage Schema:**
+```python
+cnec_schema = {
+    'timestamp': pl.Datetime,
+    'business_day': pl.Date,
+    'cnec_id': pl.Utf8,
+    'cnec_name': pl.Utf8,
+    'tso': pl.Utf8,
+    'fmax': pl.Float32,
+    'ram_after': pl.Float32,  # Primary field
+    'ram_before': pl.Float32,
+    'flow_fb': pl.Float32,
+    'presolved': pl.Boolean,
+    'shadow_price': pl.Float32,
+    'contingency': pl.Utf8,
+    'voltage_level': pl.Int16
+}
+```
+
+**Collection Decision:**
+- ✅ **COLLECT:** All fields above
+- ❌ **SKIP:** Internal TSO-specific IDs, validation flags, intermediate calculation steps
+
+### 2.5 PTDF Data Series
+
+**File:** `Core_DA_CC_PTDF_[date].xml` (published D-2 and updated D-1)
+
+| Field Name | Data Type | Description | Use in Features | Missing Value Strategy |
+|------------|-----------|-------------|-----------------|----------------------|
+| `cnec_id` | String | Links to CNEC | Join key | **REQUIRED** |
+| `zone` | String | Bidding zone (e.g., "DE_LU", "FR") | Sensitivity mapping | **REQUIRED** |
+| `ptdf_value` | Float | Sensitivity coefficient (-1.0 to +1.0) | **Core feature** | 0.0 (no impact) |
+| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** |
+| `version` | String | "D-2" or "D-1" | Data freshness | Use latest available |
+| `is_hvdc` | Boolean | Is this a HVDC link sensitivity? | Separate treatment | False |
+
+**PTDF Matrix Structure:**
+```
+          DE_LU   FR     NL     BE     AT     CZ     PL    ... (12 zones)
+CNEC_001   0.42  -0.18   0.15   0.08   0.05   0.12  -0.02  ...
+CNEC_002  -0.35   0.67  -0.12   0.25   0.03  -0.08   0.15  ...
+CNEC_003   0.28  -0.22   0.45  -0.15   0.18   0.08  -0.05  ...
+...
+CNEC_200    ...    ...    ...    ...    ...    ...    ...   ...
+``` + +**Collection Decision:** +- ✅ **COLLECT:** `ptdf_value` for all zones × all CNECs × all hours +- ✅ **COLLECT:** Only "D-1" version (most recent available for historical data) +- ❌ **SKIP:** D-2 version, intermediate updates, HVDC-specific matrices + +**Dimensionality:** +- Raw: ~200 CNECs × 12 zones × 17,520 hours = ~42 million values +- **Hybrid Storage:** Top 50 CNECs with individual PTDFs, Tier-2 with border aggregates + +### 2.6 RAM Data Series + +**File:** `Core_DA_CC_CNEC_[date].xml` (within CNEC records) + +| Field Name | Data Type | Description | Use in Features | Missing Value Strategy | +|------------|-----------|-------------|-----------------|----------------------| +| `cnec_id` | String | CNEC identifier | Join key | **REQUIRED** | +| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | +| `minram` | Float (MW) | Minimum RAM (70% rule) | Compliance checking | 0.7 × fmax | +| `initial_ram` | Float (MW) | RAM before coordination | Historical baseline | Forward-fill | +| `final_ram` | Float (MW) | RAM after validation/remedial actions | **Primary feature** | **CRITICAL** | +| `ram_net_position` | Float (MW) | Net position at which RAM calculated | Market condition | 0.0 | +| `validation_adjustment` | Float (MW) | TSO adjustments during validation | Adjustment patterns | 0.0 | +| `remedial_actions_applied` | Boolean | Were remedial actions used? | Constraint stress | False | + +**Collection Decision:** +- ✅ **COLLECT:** `final_ram`, `minram`, `initial_ram` +- ✅ **COLLECT:** `remedial_actions_applied` (binary indicator) +- ❌ **SKIP:** Detailed remedial action descriptions (too granular) + +### 2.7 Shadow Price Data Series + +**File:** `Core_DA_Results_[date].xml` (post-market publication) + +| Field Name | Data Type | Description | Use in Features | Missing Value Strategy | +|------------|-----------|-------------|-----------------|----------------------| +| `cnec_id` | String | CNEC identifier | Join key | **REQUIRED** | +| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** | +| `shadow_price` | Float (€/MW) | Lagrange multiplier | Economic signal | 0.0 | +| `shadow_price_before_minram` | Float (€/MW) | Before 70% rule applied | Pre-constraint value | 0.0 | +| `binding_duration` | Integer (minutes) | How long CNEC was binding | Persistence | 0 | +| `peak_shadow_price_4h` | Float (€/MW) | Max in 4-hour window | Volatility signal | shadow_price | + +**Collection Decision:** +- ✅ **COLLECT:** `shadow_price` (primary) +- ✅ **COLLECT:** `binding_duration` (for persistence features) +- ❌ **SKIP:** `shadow_price_before_minram` (less relevant for forecasting) + +--- + +### 2.8 D2CF (Day-2 Congestion Forecast) - OPTIONAL + +**Decision: SKIP for MVP** + +**Rationale:** +1. **Temporal misalignment:** D2CF provides TSO forecasts for D+2, but we're forecasting D+1 to D+14 +2. **Forecast contamination:** These are forecasts, not ground truth - introduces noise +3. **Better alternatives:** ENTSO-E actual generation/load provides cleaner signals +4. 
+**Scope reduction:** Keeps MVP focused on high-signal features
+
+**If collected later:**
+- `load_forecast_d2cf`: TSO load forecast
+- `generation_forecast_d2cf`: TSO generation forecast
+- `net_position_forecast_d2cf`: Expected net position
+
+---
+
+### 2.9 ATC for Non-Core Borders Data Series
+
+**File:** `Core_ATC_External_Borders_[date].xml` or ENTSO-E Transparency Platform
+
+| Field Name | Data Type | Description | Use in Features | Missing Value Strategy |
+|------------|-----------|-------------|-----------------|----------------------|
+| `border_id` | String | Non-Core border (e.g., "FR-UK", "FR-ES") | Border tracking | **REQUIRED** |
+| `timestamp` | Datetime | Delivery hour | Time index | **REQUIRED** |
+| `atc_forward` | Float (MW) | Available Transfer Capacity (forward direction) | Loop flow driver | 0.0 |
+| `atc_backward` | Float (MW) | Available Transfer Capacity (backward direction) | Loop flow driver | 0.0 |
+
+**Description:**
+Available Transfer Capacity on borders connecting Core to non-Core regions. These affect loop flows through Core and impact Core CNEC constraints.
+
+**Why This Matters:**
+Large flows on external borders create loop flows through Core:
+- **FR-UK flows** affect FR-BE, FR-DE capacities via loop flows
+- **Swiss transit** (FR-CH-AT-DE) impacts Core CNECs significantly
+- **Nordic flows** through DE-DK affect German internal CNECs
+- **Italian flows** through AT-IT affect Austrian CNECs
+
+**Key Non-Core Borders to Collect (~14 borders):**
+1. **FR-UK** (IFA, ElecLink, IFA2 interconnectors) - Major loop flow driver
+2. **FR-ES** (Pyrenees corridor)
+3. **FR-CH** (Alpine corridor to Switzerland)
+4. **DE-CH** (Basel-Laufenburg)
+5. **AT-CH** (Alpine corridor)
+6. **DE-DK** (to Nordic region)
+7. **PL-SE** (Baltic to Nordic)
+8. **PL-LT** (Baltic connection)
+9. **AT-SI** (SEE gateway)
+10. **AT-IT** (Brenner Pass)
+11. **HU-RO** (SEE connection)
+12. **HU-RS** (Balkans)
+13. **HR-RS** (Balkans)
+14. **SI-IT** (SEE to Italy)
+
+**Collection Method:**
+1. **JAO Publication Tool:** Core_ATC_External_Borders
+2. **ENTSO-E Transparency Platform:** Fallback if JAO incomplete
+
+**Collection Priority:** DAY 1 AFTERNOON (after core data)
+
+**Storage Schema:**
+```python
+atc_external_schema = {
+    'timestamp': pl.Datetime,
+    'border_id': pl.Utf8,
+    'atc_forward': pl.Float32,
+    'atc_backward': pl.Float32
+}
+```
+
+**Expected File Size:** ~120-160 MB for 24 months × 14 borders
+
+**ENTSO-E API Collection (if needed):**
+```python
+from entsoe import EntsoePandasClient
+
+# Unlike the public JAO endpoints, ENTSO-E requires a (free) API token
+client = EntsoePandasClient(api_key='YOUR_ENTSOE_API_KEY')
+
+non_core_borders = [
+    ('FR', 'UK'), ('FR', 'ES'), ('FR', 'CH'),
+    ('DE', 'CH'), ('DE', 'DK'),
+    ('AT', 'CH'), ('AT', 'IT'),
+    ('PL', 'SE'), ('PL', 'LT')
+]
+
+atc_frames = []
+for zone1, zone2 in non_core_borders:
+    atc_data = client.query_offered_capacity(
+        country_code_from=zone1,
+        country_code_to=zone2,
+        start=start_date,
+        end=end_date
+    )
+    atc_frames.append(atc_data)  # collect results for later concatenation
+```
+
+---
+
+## 3. Data Cleaning Procedures
+
+### 3.1 Missing Value Handling
+
+**Priority Order:**
+1. **Forward-fill** (for continuous series with gradual changes)
+2. **Zero-fill** (for event-based fields where absence means "not active")
+3. **Interpolation** (for weather-related smooth series)
+4. **Drop** (only if >30% missing in critical fields)
+
+**Field-Specific Strategies:**
+
+| Field | Strategy | Justification |
+|-------|----------|---------------|
+| `ram_after` | Forward-fill (max 2 hours), then linear interpolation | Capacity changes gradually |
+| `presolved` | Zero-fill → False | Missing = not binding |
+| `shadow_price` | Zero-fill | Missing = no congestion cost |
+| `ptdf_value` | Zero-fill | Missing = no sensitivity |
+| `fmax` | **NO FILL - MUST HAVE** | Critical for normalization |
+| `flow_fb` | Zero-fill | Missing = no flow |
+| `cnec_name` | Forward-fill | Descriptive only |
+
+**Implementation:**
+```python
+def clean_jao_data(df: pl.DataFrame) -> pl.DataFrame:
+    """Clean JAO data with field-specific strategies"""
+
+    # 1. Handle critical fields (fail if missing)
+    critical_fields = ['timestamp', 'cnec_id', 'fmax']
+    for field in critical_fields:
+        if df[field].null_count() > 0:
+            raise ValueError(f"Critical field {field} has missing values")
+
+    # 2. Forward-fill continuous series (capped at 2 hours, per the table above)
+    df = df.with_columns([
+        pl.col('ram_after').fill_null(strategy='forward', limit=2),
+        pl.col('ram_before').fill_null(strategy='forward', limit=2),
+        pl.col('cnec_name').fill_null(strategy='forward')
+    ])
+
+    # 3. Zero-fill event indicators
+    df = df.with_columns([
+        pl.col('presolved').fill_null(False),
+        pl.col('shadow_price').fill_null(0.0),
+        pl.col('flow_fb').fill_null(0.0),
+        pl.col('remedial_actions_applied').fill_null(False)
+    ])
+
+    # 4. Linear interpolation for remaining RAM gaps
+    df = df.with_columns([
+        pl.col('ram_after').interpolate(method='linear')
+    ])
+
+    # 5. Fill contingency with "basecase" if missing
+    df = df.with_columns([
+        pl.col('contingency').fill_null('basecase')
+    ])
+
+    return df
+```
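+
+A minimal usage sketch, assuming the raw CNEC parquet from the Section 1 layout and the `clean_jao_data` function above:
+
+```python
+import polars as pl
+
+raw = pl.read_parquet('data/raw/jao/cnecs_ptdfs_2023_2025.parquet')
+cleaned = clean_jao_data(raw)
+
+# Event-type fields should have no remaining nulls after zero-filling
+assert cleaned['shadow_price'].null_count() == 0
+assert cleaned['presolved'].null_count() == 0
+```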
+
+### 3.2 Outlier Detection & Treatment
+
+**RAM Outliers:**
+```python
+def detect_ram_outliers(df: pl.DataFrame) -> pl.DataFrame:
+    """Flag and clip RAM outliers"""
+
+    # 1. Physical constraints, applied in two steps so the second clip sees
+    #    the first one's result (two expressions with the same output name
+    #    cannot share a single with_columns call)
+    # RAM cannot exceed Fmax
+    df = df.with_columns([
+        pl.when(pl.col('ram_after') > pl.col('fmax'))
+        .then(pl.col('fmax'))
+        .otherwise(pl.col('ram_after'))
+        .alias('ram_after')
+    ])
+
+    # RAM cannot be negative (clip to 0)
+    df = df.with_columns([
+        pl.when(pl.col('ram_after') < 0)
+        .then(0.0)
+        .otherwise(pl.col('ram_after'))
+        .alias('ram_after')
+    ])
+
+    # 2. Statistical outliers (sudden unrealistic jumps)
+    # Flag if RAM drops >80% in 1 hour (likely data error)
+    df = df.with_columns([
+        (pl.col('ram_after').diff() / pl.col('ram_after').shift(1) < -0.8)
+        .fill_null(False)
+        .alias('outlier_sudden_drop')
+    ])
+
+    # 3. Replace flagged outliers with interpolation
+    df = df.with_columns([
+        pl.when(pl.col('outlier_sudden_drop'))
+        .then(None)
+        .otherwise(pl.col('ram_after'))
+        .alias('ram_after')
+    ]).with_columns([
+        pl.col('ram_after').interpolate()
+    ])
+
+    return df.drop('outlier_sudden_drop')
+```
+
+**PTDF Outliers:**
+```python
+def clean_ptdf_outliers(ptdf_matrix: np.ndarray) -> np.ndarray:
+    """Clean PTDF values to physical bounds"""
+
+    # PTDFs theoretically in [-1, 1] but can be slightly outside due to rounding
+    # Clip to [-1.5, 1.5] to catch errors while allowing small overruns
+    ptdf_matrix = np.clip(ptdf_matrix, -1.5, 1.5)
+
+    # Replace any NaN/Inf with 0 (no sensitivity)
+    ptdf_matrix = np.nan_to_num(ptdf_matrix, nan=0.0, posinf=0.0, neginf=0.0)
+
+    return ptdf_matrix
+```
+
+**Shadow Price Outliers:**
+```python
+def clean_shadow_prices(df: pl.DataFrame) -> pl.DataFrame:
+    """Handle extreme shadow prices"""
+
+    # Shadow prices above €500/MW are rare but possible during extreme scarcity
+    # Cap at €1000/MW to prevent contamination (99.9th percentile historically)
+    df = df.with_columns([
+        pl.when(pl.col('shadow_price') > 1000)
+        .then(1000)
+        .otherwise(pl.col('shadow_price'))
+        .alias('shadow_price')
+    ])
+
+    return df
+```
+
+### 3.3 Timestamp Alignment
+
+**Challenge:** JAO publishes data for "delivery day D+1" on "business day D"
+
+**Solution:**
+```python
+def align_jao_timestamps(df: pl.DataFrame) -> pl.DataFrame:
+    """Convert business day to delivery hour timestamps"""
+
+    # JAO format: business_day='2025-10-28', delivery_hour=1-24
+    # Convert to: timestamp='2025-10-29 00:00' to '2025-10-29 23:00' (UTC)
+
+    df = df.with_columns([
+        # Delivery timestamp = business_day + 1 day + (delivery_hour - 1)
+        (pl.col('business_day') + pl.duration(days=1) +
+         pl.duration(hours=pl.col('delivery_hour') - 1))
+        .alias('timestamp')
+    ])
+
+    # Convert from CET/CEST to UTC for consistency
+    # (attach the local zone first - convert_time_zone needs tz-aware datetimes)
+    df = df.with_columns([
+        pl.col('timestamp')
+        .dt.replace_time_zone('Europe/Brussels')
+        .dt.convert_time_zone('UTC')
+    ])
+
+    return df
+```
+
+### 3.4 Duplicate Handling
+
+**Sources of Duplicates:**
+1. D-2 and D-1 PTDF updates (keep D-1 only)
+2. Corrected publications (keep latest by publication time)
+3. Intraday updates (out of scope for day-ahead forecasting)
+
+**Deduplication Strategy:**
+```python
+def deduplicate_jao_data(df: pl.DataFrame) -> pl.DataFrame:
+    """Remove duplicate records, keeping most recent version"""
+
+    # For PTDF data: keep D-1 version
+    if 'version' in df.columns:
+        df = df.filter(pl.col('version') == 'D-1')
+
+    # For all data: keep latest publication time per (timestamp, cnec_id)
+    df = (df
+          .sort('publication_time', descending=True)
+          .unique(subset=['timestamp', 'cnec_id'], keep='first'))
+
+    return df
+```
+
+---
+
+## 4. CNEC Treatment
+
+### 4.1 CNEC Selection Strategy
+
+**Goal:** Reduce from **~2,000 CNECs published daily by TSOs** to **Top 200 most impactful CNECs**
+
+**Two-Tier Approach:**
+- **Tier 1 (Top 50 CNECs):** Full feature detail (5 features each = 250 features)
+- **Tier 2 (Next 150 CNECs):** Selective features (2 features each = 300 features)
+- **Total:** 200 CNECs tracked
+
+**Selection Criteria:**
+1. **Binding Frequency** (40% weight): How often was CNEC active?
+2. **Economic Impact** (30% weight): Average shadow price when binding
+3. **RAM Utilization** (20% weight): How close to limit (low RAM)?
+4. **Geographic Coverage** (10% weight): Ensure all borders represented
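+
+As a worked example of the composite score (invented values for one CNEC): binding frequency 0.12, average shadow price €45/MW when binding, low-RAM frequency 0.30, present every hour of the year:
+
+```python
+impact_score = (0.40 * 0.12              # binding frequency
+                + 0.30 * (45 / 100)      # economic impact, normalized to [0, 1]
+                + 0.20 * 0.30            # RAM utilization
+                + 0.10 * (8760 / 8760))  # consistency bonus
+# = 0.048 + 0.135 + 0.06 + 0.10 = 0.343
+```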
+
+**Implementation:**
+```python
+def select_top_cnecs(df: pl.DataFrame, n_cnecs: int = 200) -> dict[str, list[str]]:
+    """
+    Select top 200 most impactful CNECs from ~2,000 published daily
+    Returns: {'tier1': list of 50 CNECs, 'tier2': list of 150 CNECs}
+    """
+
+    cnec_impact = df.groupby('cnec_id').agg([
+        # Binding frequency (times active / total hours)
+        (pl.col('presolved').sum() / pl.col('presolved').count())
+        .alias('binding_freq'),
+
+        # Average shadow price when binding (0 for never-binding CNECs)
+        pl.col('shadow_price').filter(pl.col('presolved')).mean().fill_null(0.0)
+        .alias('avg_shadow_price'),
+
+        # RAM utilization (% of hours below 30% of Fmax)
+        ((pl.col('ram_after') < 0.3 * pl.col('fmax')).sum() /
+         pl.col('ram_after').count())
+        .alias('low_ram_freq'),
+
+        # Hours appeared (consistency metric - records are hourly)
+        pl.col('cnec_id').count().alias('hours_appeared'),
+
+        # Geographic zone (parse from CNEC ID)
+        pl.col('tso').first().alias('tso')
+    ])
+
+    # Calculate composite impact score
+    cnec_impact = cnec_impact.with_columns([
+        (0.40 * pl.col('binding_freq') +
+         0.30 * (pl.col('avg_shadow_price') / 100) +  # Normalize to [0,1]
+         0.20 * pl.col('low_ram_freq') +
+         0.10 * (pl.col('hours_appeared') / 8760))  # Consistency bonus (share of a year)
+        .alias('impact_score')
+    ])
+
+    # Select top N, ensuring geographic diversity
+    top_cnecs = []
+
+    # First, get top 10 per major bidding zone border (ensures coverage)
+    for border in ['DE_FR', 'DE_NL', 'DE_CZ', 'FR_BE', 'AT_CZ', 'DE_AT',
+                   'FR_ES', 'AT_IT', 'DE_PL', 'BE_NL']:
+        border_cnecs = (cnec_impact
+                        .filter(pl.col('cnec_id').str.contains(border))
+                        .sort('impact_score', descending=True)
+                        .head(10))
+        top_cnecs.extend(border_cnecs['cnec_id'].to_list())
+
+    # Remove duplicates (some CNECs may match multiple borders);
+    # dict.fromkeys preserves order, unlike a plain set()
+    top_cnecs = list(dict.fromkeys(top_cnecs))
+
+    # Then add remaining highest-impact CNECs to reach n_cnecs
+    remaining = (cnec_impact
+                 .filter(~pl.col('cnec_id').is_in(top_cnecs))
+                 .sort('impact_score', descending=True)
+                 .head(n_cnecs - len(top_cnecs)))
+    top_cnecs.extend(remaining['cnec_id'].to_list())
+
+    # Re-rank the combined list by impact score so tier membership is
+    # determined by impact, not by the order borders were scanned
+    score_map = dict(zip(cnec_impact['cnec_id'].to_list(),
+                         cnec_impact['impact_score'].to_list()))
+    top_cnecs.sort(key=lambda c: score_map[c], reverse=True)
+
+    # Split into two tiers
+    tier1_cnecs = top_cnecs[:50]     # Top 50 get full feature detail
+    tier2_cnecs = top_cnecs[50:200]  # Next 150 get selective features
+
+    return {
+        'tier1': tier1_cnecs,
+        'tier2': tier2_cnecs,
+        'all': top_cnecs[:200]
+    }
+```
+
+**Why 200 CNECs?**
+- From ~2,000 published daily, most are rarely binding or have minimal economic impact
+- Top 200 captures >95% of all binding events and congestion costs
+- Manageable feature space while preserving critical network information
+- Two-tier approach balances detail vs. dimensionality
+
+### 4.2 CNEC Masking Strategy
+
+**Problem:** Not all CNECs are published every day (only those considered relevant by TSOs)
+
+**Solution:** Create a "Master CNEC Set" with masking
+
+```python
+def create_cnec_master_set(df: pl.DataFrame, top_cnecs: list[str]) -> pl.DataFrame:
+    """Create complete CNEC × timestamp matrix with masking"""
+
+    # 1. Create complete timestamp × CNEC grid
+    all_timestamps = df.select('timestamp').unique().sort('timestamp')
+    cnec_template = pl.DataFrame({'cnec_id': top_cnecs})
+
+    # Cartesian product: all timestamps × all CNECs
+    complete_grid = all_timestamps.join(cnec_template, how='cross')
+
+    # 2. Left join with actual data
+    complete_data = complete_grid.join(
+        df.select(['timestamp', 'cnec_id', 'ram_after', 'presolved',
+                   'shadow_price', 'fmax']),
+        on=['timestamp', 'cnec_id'],
+        how='left'
+    )
+
+    # 3. Add masking indicator
+    complete_data = complete_data.with_columns([
+        # Mask = 1 if CNEC was published (data available)
+        # Mask = 0 if CNEC not published (implicitly unconstrained)
+        (~pl.col('ram_after').is_null()).cast(pl.Int8).alias('cnec_mask')
+    ])
+
+    # 4. Fill missing values according to strategy
+    #    (fill the static fmax rating per CNEC first, since the fills below use it)
+    complete_data = complete_data.with_columns([
+        pl.col('fmax')
+        .fill_null(strategy='forward')
+        .fill_null(strategy='backward')
+        .over('cnec_id')
+    ])
+    complete_data = complete_data.with_columns([
+        # If not published, assume unconstrained:
+        # - RAM = Fmax (maximum margin)
+        # - presolved = False (not binding)
+        # - shadow_price = 0 (no congestion)
+        pl.col('ram_after').fill_null(pl.col('fmax')),
+        pl.col('presolved').fill_null(False),
+        pl.col('shadow_price').fill_null(0.0)
+    ])
+
+    return complete_data
+```
+
+**Resulting Schema:**
+```
+timestamp         cnec_id         ram_after  presolved  shadow_price  cnec_mask  fmax
+2024-10-01 00:00  DE_CZ_TIE_001   450.2      True       12.5          1          800
+2024-10-01 00:00  DE_FR_LINE_005  800.0      False      0.0           0          800   # Not published
+2024-10-01 01:00  DE_CZ_TIE_001   432.1      True       15.2          1          800
+...
+```
+
+### 4.3 CNEC Feature Derivation
+
+**From master CNEC set, derive:**
+
+```python
+def engineer_cnec_features(df: pl.DataFrame) -> pl.DataFrame:
+    """Create CNEC-based features for forecasting"""
+
+    df = df.with_columns([
+        # 1. Margin ratio (normalized RAM)
+        (pl.col('ram_after') / pl.col('fmax')).alias('margin_ratio'),
+
+        # 2. Binding frequency (7-day rolling)
+        pl.col('presolved').cast(pl.Int8).rolling_mean(window_size=168)
+        .over('cnec_id').alias('binding_freq_7d'),
+
+        # 3. Binding frequency (30-day rolling)
+        pl.col('presolved').cast(pl.Int8).rolling_mean(window_size=720)
+        .over('cnec_id').alias('binding_freq_30d'),
+
+        # 4. MinRAM compliance
+        (pl.col('ram_after') < 0.7 * pl.col('fmax'))
+        .cast(pl.Int8).alias('below_minram'),
+
+        # 5. RAM volatility (7-day rolling std)
+        pl.col('ram_after').rolling_std(window_size=168)
+        .over('cnec_id').alias('ram_volatility_7d'),
+
+        # 6. Shadow price volatility
+        pl.col('shadow_price').rolling_std(window_size=168)
+        .over('cnec_id').alias('shadow_price_volatility_7d'),
+
+        # 7. CNEC criticality score (recomputed from the inputs - margin_ratio
+        #    is created in this same with_columns call, so it is not yet visible)
+        (1 - pl.col('ram_after') / pl.col('fmax')).alias('criticality'),
+
+        # 8. Sudden RAM drops (binary flag, computed within each CNEC's series)
+        ((pl.col('ram_after') - pl.col('ram_after').shift(1)) < -0.2 * pl.col('fmax'))
+        .cast(pl.Int8).over('cnec_id').alias('sudden_ram_drop')
+    ])
+
+    return df
+```
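+
+The Section 4 pieces chain together; a minimal end-to-end sketch (inputs as defined above, see the functions in this section):
+
+```python
+tiers = select_top_cnecs(cnec_df, n_cnecs=200)
+master = create_cnec_master_set(cnec_df, tiers['all'])  # complete grid + cnec_mask
+cnec_features = engineer_cnec_features(master.sort(['cnec_id', 'timestamp']))
+```
+
+Sorting by `(cnec_id, timestamp)` first matters: the rolling windows in `engineer_cnec_features` are evaluated over each CNEC's chronological series.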
+
+---
+
+## 5. PTDF Treatment - Hybrid Approach
+
+### 5.1 Strategy: Preserve Causality for Critical CNECs
+
+**Challenge:** PTDF matrix contains critical network physics but is high-dimensional
+- 200 CNECs × 12 zones × 17,520 hours = ~42 million values
+- PTDFs encode how zone injections affect each CNEC (network sensitivity)
+- **Critical requirement:** Preserve CNEC-specific causality chain:
+  - `outage_active_cnec_X` → network sensitivity (PTDF) → `presolved_cnec_X`
+
+**Solution: Two-Tier Hybrid Approach**
+
+#### **Tier 1 (Top 50 CNECs): Individual PTDF Features**
+**Preserve all 12 zone sensitivities for each of the 50 most critical CNECs**
+- Result: 50 CNECs × 12 zones = **600 individual PTDF features**
+- Rationale: These are most impactful CNECs - model needs full network physics
+- Enables learning: "When `outage_active_cnec_42` = 1 AND high DE_LU injection (via `ptdf_cnec_42_DE_LU` sensitivity), THEN `presolved_cnec_42` likely = 1"
+
+#### **Tier 2 (Next 150 CNECs): Border-Level PTDF Aggregates**
+**Aggregate PTDF sensitivities by border-zone pairs**
+- Result: 10 borders × 12 zones = **120 aggregate PTDF features**
+- Rationale: Captures regional network patterns without full individual resolution
+- Preserves interpretability: "How sensitive are DE-CZ tier-2 CNECs to German generation?"
+- Still allows causal learning: `outage_active_cnec_tier2` → `avg_ptdf_border` → `presolved_cnec_tier2`
+
+**Total PTDF Features: 720 (600 individual + 120 aggregate)**
+
+**Why This Approach?**
+1. ✅ **Preserves CNEC-specific causality for top 50** - most critical constraints
+2. ✅ **Avoids PCA mixing** - no loss of CNEC identification
+3. ✅ **Interpretable** - can explain which zones affect which borders
+4. ✅ **Dimensionality reduction** - 1,800 tier-2 values → 120 features (15:1)
+5. ✅ **Maintains geographic structure** - DE-CZ separate from FR-BE
+
+### 5.2 Implementation: Top 50 Individual PTDFs
+
+```python
+def extract_top50_ptdfs(ptdf_matrix: np.ndarray,
+                        top_50_cnec_ids: list[str],
+                        all_cnec_ids: list[str],
+                        zones: list[str]) -> pl.DataFrame:
+    """
+    Extract individual PTDF values for top 50 CNECs
+
+    Args:
+        ptdf_matrix: Shape (n_timestamps, n_cnecs, n_zones)
+        top_50_cnec_ids: List of 50 most critical CNEC identifiers
+        all_cnec_ids: Full list of CNEC IDs matching matrix dimension
+        zones: List of 12 Core zone identifiers ['DE_LU', 'FR', 'BE', 'NL', ...]
+
+    Returns:
+        DataFrame with 600 PTDF features (50 CNECs × 12 zones)
+    """
+
+    # Find indices of top 50 CNECs in the full matrix
+    top_50_indices = [all_cnec_ids.index(cnec) for cnec in top_50_cnec_ids]
+
+    # Extract PTDF values for top 50 CNECs
+    # Shape: (n_timestamps, 50, 12)
+    top_50_ptdfs = ptdf_matrix[:, top_50_indices, :]
+
+    # Reshape to flat features: (n_timestamps, 600)
+    n_timestamps = top_50_ptdfs.shape[0]
+    features_dict = {}
+
+    for cnec_idx, cnec_id in enumerate(top_50_cnec_ids):
+        for zone_idx, zone in enumerate(zones):
+            feature_name = f'ptdf_cnec_{cnec_id}_{zone}'
+            features_dict[feature_name] = top_50_ptdfs[:, cnec_idx, zone_idx]
+
+    print(f"✓ Extracted {len(features_dict)} individual PTDF features for top 50 CNECs")
+    return pl.DataFrame(features_dict)
+```
+
+### 5.3 Implementation: Tier-2 Border Aggregates
+
+```python
+def aggregate_tier2_ptdfs_by_border(ptdf_matrix: np.ndarray,
+                                    tier2_cnec_ids: list[str],
+                                    all_cnec_ids: list[str],
+                                    zones: list[str]) -> pl.DataFrame:
+    """
+    Aggregate PTDF sensitivities for tier-2 CNECs by border-zone pairs
+
+    Args:
+        ptdf_matrix: Shape (n_timestamps, n_cnecs, n_zones)
+        tier2_cnec_ids: List of 150 tier-2 CNEC identifiers
+        all_cnec_ids: Full list of CNEC IDs
+        zones: List of 12 Core zone identifiers
+
+    Returns:
+        DataFrame with 120 features (10 borders × 12 zones)
+    """
+
+    # Define major borders
+    major_borders = ['DE_FR', 'DE_NL', 'DE_CZ', 'FR_BE', 'AT_CZ',
+                     'DE_AT', 'BE_NL', 'AT_IT', 'DE_PL', 'CZ_PL']
+
+    # Find indices of tier-2 CNECs
+    tier2_indices = [all_cnec_ids.index(cnec) for cnec in tier2_cnec_ids]
+
+    # Extract tier-2 PTDF subset
+    tier2_ptdfs = ptdf_matrix[:, tier2_indices, :]  # (n_timestamps, 150, 12)
+
+    features_dict = {}
+
+    for border in major_borders:
+        # Find tier-2 CNECs belonging to this border
+        # (CNEC IDs keep the underscore, e.g. "DE_FR_LINE_005", so match
+        #  the border string as-is rather than stripping the underscore)
+        border_cnec_indices = [
+            i for i, cnec_id in enumerate(tier2_cnec_ids)
+            if border.lower() in cnec_id.lower()
+        ]
+
+        if len(border_cnec_indices) == 0:
+            # No tier-2 CNECs for this border - use zeros
+            for zone in zones:
+                features_dict[f'avg_ptdf_{border}_{zone}_tier2'] = np.zeros(tier2_ptdfs.shape[0])
+            continue
+
+        # Average PTDF across all tier-2 CNECs on this border
+        for zone_idx, zone in enumerate(zones):
+            border_zone_ptdfs = tier2_ptdfs[:, border_cnec_indices, zone_idx]
+            avg_ptdf = np.mean(border_zone_ptdfs, axis=1)
+            features_dict[f'avg_ptdf_{border}_{zone}_tier2'] = avg_ptdf
+
+    print(f"✓ Aggregated {len(features_dict)} border-zone PTDF features for tier-2")
+    return pl.DataFrame(features_dict)
+```
+
+### 5.4 Complete PTDF Pipeline
+
+```python
+def prepare_ptdf_features(ptdf_matrix: np.ndarray,
+                          top_50_cnecs: list[str],
+                          tier2_cnecs: list[str],
+                          all_cnec_ids: list[str],
+                          zones: list[str]) -> pl.DataFrame:
+    """
+    Complete PTDF feature engineering pipeline
+
+    Returns:
+        DataFrame with 720 PTDF features total:
+        - 600 individual features (top 50 CNECs × 12 zones)
+        - 120 aggregate features (10 borders × 12 zones for tier-2)
+    """
+
+    # 1. Extract individual PTDFs for top 50
+    top50_features = extract_top50_ptdfs(
+        ptdf_matrix, top_50_cnecs, all_cnec_ids, zones
+    )
+
+    # 2. Aggregate tier-2 PTDFs by border
+    tier2_features = aggregate_tier2_ptdfs_by_border(
+        ptdf_matrix, tier2_cnecs, all_cnec_ids, zones
+    )
+
+    # 3. Combine
+    all_ptdf_features = pl.concat([top50_features, tier2_features], how='horizontal')
+
+    print(f"\n✓ PTDF Features Generated:")
+    print(f"  Top 50 individual: {top50_features.shape[1]} features")
+    print(f"  Tier-2 aggregated: {tier2_features.shape[1]} features")
+    print(f"  Total PTDF features: {all_ptdf_features.shape[1]}")
+
+    return all_ptdf_features
+```
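+
+A minimal invocation sketch, assuming a 24-month hourly PTDF tensor and the CNEC tiers from Section 4.1 (the tensor path and `all_cnec_ids` are illustrative placeholders):
+
+```python
+import numpy as np
+
+zones = ['DE_LU', 'FR', 'BE', 'NL', 'AT', 'CZ',
+         'PL', 'SK', 'HU', 'SI', 'HR', 'RO']
+
+ptdf_matrix = np.load('data/processed/ptdf_tensor.npy')  # hypothetical path, shape (17520, 200, 12)
+tiers = select_top_cnecs(cnec_df, n_cnecs=200)           # from Section 4.1
+
+ptdf_features = prepare_ptdf_features(
+    ptdf_matrix,
+    top_50_cnecs=tiers['tier1'],
+    tier2_cnecs=tiers['tier2'],
+    all_cnec_ids=all_cnec_ids,  # 200 IDs matching the tensor's CNEC axis
+    zones=zones
+)
+assert ptdf_features.shape[1] == 720  # 600 individual + 120 aggregated
+```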
+
+### 5.5 PTDF Feature Validation
+
+```python
+def validate_ptdf_features(ptdf_features: pl.DataFrame) -> None:
+    """Validate PTDF feature quality"""
+
+    # 1. Check for valid range (PTDFs should be in [-1, 1])
+    outliers = []
+    for col in ptdf_features.columns:
+        if col.startswith('ptdf_') or col.startswith('avg_ptdf_'):
+            values = ptdf_features[col]
+            min_val, max_val = values.min(), values.max()
+
+            if min_val < -1.5 or max_val > 1.5:
+                outliers.append(f"{col}: [{min_val:.3f}, {max_val:.3f}]")
+
+    if outliers:
+        print(f"⚠ WARNING: {len(outliers)} PTDF features outside expected [-1, 1] range")
+        for out in outliers[:5]:  # Show first 5
+            print(f"  {out}")
+
+    # 2. Check for constant features (no variance)
+    constant_features = []
+    for col in ptdf_features.columns:
+        if col.startswith('ptdf_') or col.startswith('avg_ptdf_'):
+            if ptdf_features[col].std() < 1e-6:
+                constant_features.append(col)
+
+    if constant_features:
+        print(f"⚠ WARNING: {len(constant_features)} PTDF features have near-zero variance")
+
+    # 3. Summary statistics
+    print(f"\n✓ PTDF Feature Validation Complete:")
+    print(f"  Features in valid range: {len(ptdf_features.columns) - len(outliers)}")
+    print(f"  Features with variance: {len(ptdf_features.columns) - len(constant_features)}")
+```
+
+---
+
+## 6. RAM Treatment
+
+### 6.1 RAM Normalization
+
+**Challenge:** Absolute RAM values vary widely across CNECs (50 MW to 2000 MW)
+
+**Solution:** Normalize to margin ratio
+
+```python
+def normalize_ram_values(df: pl.DataFrame) -> pl.DataFrame:
+    """Normalize RAM to comparable scale"""
+
+    df = df.with_columns([
+        # 1. Margin ratio: RAM / Fmax
+        (pl.col('ram_after') / pl.col('fmax')).alias('margin_ratio'),
+
+        # 2. MinRAM compliance ratio: RAM / (0.7 × Fmax)
+        (pl.col('ram_after') / (0.7 * pl.col('fmax'))).alias('minram_compliance'),
+
+        # 3. Percentile within CNEC's historical distribution
+        #    (rank divided by group size, so the result lies in (0, 1])
+        (pl.col('ram_after').rank(method='average') / pl.col('ram_after').count())
+        .over('cnec_id')
+        .alias('ram_percentile')
+    ])
+
+    # Cap margin_ratio at 1.0 (cannot exceed Fmax)
+    df = df.with_columns([
+        pl.when(pl.col('margin_ratio') > 1.0)
+        .then(1.0)
+        .otherwise(pl.col('margin_ratio'))
+        .alias('margin_ratio')
+    ])
+
+    return df
+```
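+
+As a numeric sanity check (invented values): with `fmax = 1000` MW and `ram_after = 350` MW,
+
+```python
+margin_ratio = 350 / 1000               # 0.35 - only 35% of thermal capacity left
+minram_compliance = 350 / (0.7 * 1000)  # 0.50 - half of the 70%-rule minimum
+```
+
+so a `minram_compliance` below 1.0 immediately flags a CNEC that misses the minRAM target for that hour.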
+
+### 6.2 RAM Time Series Features
+
+```python
+def engineer_ram_features(df: pl.DataFrame) -> pl.DataFrame:
+    """Create rolling window features from RAM values
+
+    NOTE: expects one CNEC's chronological series; on the stacked multi-CNEC
+    frame, evaluate each expression per CNEC with .over('cnec_id').
+    """
+
+    df = df.with_columns([
+        # 1. Moving averages
+        pl.col('ram_after').rolling_mean(window_size=24).alias('ram_ma_24h'),
+        pl.col('ram_after').rolling_mean(window_size=168).alias('ram_ma_7d'),
+        pl.col('ram_after').rolling_mean(window_size=720).alias('ram_ma_30d'),
+
+        # 2. Volatility measures
+        pl.col('ram_after').rolling_std(window_size=168).alias('ram_std_7d'),
+        pl.col('ram_after').rolling_std(window_size=720).alias('ram_std_30d'),
+
+        # 3. Percentiles (vs 90-day history)
+        pl.col('ram_after')
+        .rolling_quantile(quantile=0.1, window_size=2160)
+        .alias('ram_p10_90d'),
+        pl.col('ram_after')
+        .rolling_quantile(quantile=0.5, window_size=2160)
+        .alias('ram_median_90d'),
+
+        # 4. Rate of change
+        (pl.col('ram_after').diff(1) / pl.col('ram_after').shift(1))
+        .alias('ram_pct_change_1h'),
+        (pl.col('ram_after').diff(24) / pl.col('ram_after').shift(24))
+        .alias('ram_pct_change_24h'),
+
+        # 5. Binary indicators
+        (pl.col('ram_after') < 0.3 * pl.col('fmax'))
+        .cast(pl.Int8).alias('low_ram_flag'),
+
+        ((pl.col('ram_after') - pl.col('ram_after').shift(1)) < -0.2 * pl.col('fmax'))
+        .cast(pl.Int8).alias('sudden_drop_flag'),
+
+        # 6. MinRAM violation counter (7-day window)
+        (pl.col('ram_after') < 0.7 * pl.col('fmax'))
+        .cast(pl.Int8)
+        .rolling_sum(window_size=168)
+        .alias('minram_violations_7d')
+    ])
+
+    return df
+```
+
+### 6.3 Cross-Border RAM Aggregation
+
+**Create system-level RAM features:**
+
+```python
+def aggregate_ram_system_wide(df: pl.DataFrame) -> pl.DataFrame:
+    """Aggregate RAM across all CNECs to system-level features"""
+
+    system_features = df.groupby('timestamp').agg([
+        # 1. Average margin across all CNECs
+        pl.col('margin_ratio').mean().alias('system_avg_margin'),
+
+        # 2. Minimum margin (most constrained CNEC)
+        pl.col('margin_ratio').min().alias('system_min_margin'),
+
+        # 3. Count of binding CNECs
+        pl.col('presolved').sum().alias('n_binding_cnecs'),
+
+        # 4. Count of CNECs below minRAM
+        (pl.col('ram_after') < 0.7 * pl.col('fmax')).sum()
+        .alias('n_cnecs_below_minram'),
+
+        # 5. Standard deviation of margins (uniformity)
+        pl.col('margin_ratio').std().alias('margin_std'),
+
+        # 6. Total economic cost (sum of shadow prices)
+        pl.col('shadow_price').sum().alias('total_congestion_cost'),
+
+        # 7. Max shadow price
+        pl.col('shadow_price').max().alias('max_shadow_price')
+    ])
+
+    return system_features
+```
+
+---
+
+## 7. Shadow Prices Treatment
+
+### 7.1 Shadow Price Features
+
+```python
+def engineer_shadow_price_features(df: pl.DataFrame) -> pl.DataFrame:
+    """Create features from shadow prices (economic signals)"""
+
+    df = df.with_columns([
+        # 1. Rolling statistics
+        pl.col('shadow_price').rolling_mean(window_size=24)
+        .alias('shadow_price_ma_24h'),
+
+        pl.col('shadow_price').rolling_max(window_size=24)
+        .alias('shadow_price_max_24h'),
+
+        pl.col('shadow_price').rolling_std(window_size=168)
+        .alias('shadow_price_volatility_7d'),
+
+        # 2. Binary indicators
+        (pl.col('shadow_price') > 0).cast(pl.Int8)
+        .alias('has_congestion_cost'),
+
+        (pl.col('shadow_price') > 50).cast(pl.Int8)
+        .alias('high_congestion_flag'),  # €50/MW threshold
+
+        # 3. Economic stress indicator (count of expensive hours in 7d)
+        (pl.col('shadow_price') > 50).cast(pl.Int8)
+        .rolling_sum(window_size=168)
+        .alias('expensive_hours_7d'),
+
+        # 4. Shadow price persistence: the regime id increments whenever the
+        #    binding/non-binding state flips (fill_null starts regime 1 at row 0)
+        ((pl.col('shadow_price') > 0) != (pl.col('shadow_price').shift(1) > 0))
+        .fill_null(True)
+        .cast(pl.Int32)
+        .cum_sum()
+        .alias('shadow_price_regime'),
+
+    ])
+
+    # 5. Duration in current regime (1, 2, 3, ... hours within each regime;
+    #    a dense rank over the regime id would just return 1 for every row)
+    df = df.with_columns([
+        pl.col('shadow_price_regime')
+        .cum_count()
+        .over('shadow_price_regime')
+        .alias('hours_in_regime')
+    ])
+
+    return df
+```
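+
+A quick worked example of the regime bookkeeping (invented values; the other engineered columns are omitted):
+
+```python
+import polars as pl
+
+prices = pl.DataFrame({'shadow_price': [0.0, 0.0, 12.0, 30.0, 0.0, 0.0, 5.0]})
+flag = (pl.col('shadow_price') > 0)
+demo = prices.with_columns([
+    (flag != flag.shift(1)).fill_null(True).cast(pl.Int32).cum_sum()
+    .alias('shadow_price_regime')
+]).with_columns([
+    pl.col('shadow_price_regime').cum_count().over('shadow_price_regime')
+    .alias('hours_in_regime')
+])
+# shadow_price_regime: 1 1 2 2 3 3 4    hours_in_regime: 1 2 1 2 1 2 1
+```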
+
+---
+
+## 8. Feature Engineering Pipeline
+
+### 8.1 Complete Feature Matrix Construction
+
+**Final feature set: ~1,060 features (Option A Architecture)**
+- **Historical Context:** 730 features (what happened in past 21 days)
+- **Future Covariates:** 280 features (what we know will happen in next 14 days)
+- **System-Level:** 50 features (real-time aggregates)
+
+**Architecture Rationale:**
+- **Top 50 CNECs:** Full detail (5 metrics each = 250 features)
+  - Most impactful CNECs get complete representation
+- **Tier-2 150 CNECs:** Selective detail (300 features)
+  - Binary indicators: `presolved` (binding) + `outage_active` (150 + 150 = 300 features)
+  - Aggregated metrics: RAM, shadow prices by border (30 features)
+  - Preserves high-signal discrete events while reducing continuous redundancy
+- **Supporting Features:** 380 features
+  - PTDF patterns, border capacities, weather, temporal, interactions
+
+```python
+class JAOFeatureEngineer:
+    """
+    Complete JAO feature engineering pipeline - Option A Architecture
+
+    Architecture:
+    - Top 50 CNECs: Full detail (5 metrics × 50 = 250 features)
+    - Tier-2 150 CNECs: Selective (presolved + outage_active = 300 features)
+    - Tier-2 aggregated: Border-level RAM/shadow prices (30 features)
+    - Supporting: PTDF, borders, system, temporal, weather (150 features)
+
+    Total: ~730 historical features + ~280 future covariates
+
+    Inputs: Raw JAO data (CNECs, PTDFs, RAMs, shadow prices) + ENTSO-E outages
+    Outputs: Engineered feature matrix (512h × 730 features) +
+             Future covariates (336h × 280 features)
+    """
+
+    def __init__(self, top_n_cnecs: int = 50, tier2_n_cnecs: int = 150,
+                 ptdf_components: int = 10):
+        self.top_n_cnecs = top_n_cnecs
+        self.tier2_n_cnecs = tier2_n_cnecs
+        self.total_cnecs = top_n_cnecs + tier2_n_cnecs  # 200 total
+        self.ptdf_components = ptdf_components
+        self.pca_model = None
+        self.scaler = None
+        self.top_cnecs = None
+        self.tier2_cnecs = None
+
+    def fit(self, historical_data: dict):
+        """
+        Fit on the 24-month historical dataset to establish baselines
+
+        Args:
+            historical_data: {
+                'cnecs': pl.DataFrame,
+                'ptdfs': np.ndarray,
+                'rams': pl.DataFrame,
+                'shadow_prices': pl.DataFrame,
+                'outages': pl.DataFrame (from ENTSO-E via EIC matching)
+            }
+        """
+
+        print("Fitting JAO feature engineer on historical data...")
+
+        # 1. Select top 200 CNECs (50 + 150)
+        print(f"  - Selecting top {self.total_cnecs} CNECs...")
+        selection = select_top_cnecs(
+            historical_data['cnecs'],
+            n_cnecs=self.total_cnecs
+        )
+        # select_top_cnecs returns a dict (Section 4.1), not a flat list
+        self.top_cnecs = selection['tier1']
+        self.tier2_cnecs = selection['tier2']
+
+        print(f"    Top 50: {self.top_cnecs[:3]}... (full detail)")
+        print(f"    Tier-2 150: {self.tier2_cnecs[:3]}... (selective detail)")
+
+        # 2. Engineer PTDF features (PCA compression for the pattern features;
+        #    reduce_ptdf_dimensions is the PCA helper behind Category 3 in §8.2)
+        print("  - Engineering PTDF features...")
+        ptdf_compressed, self.pca_model, self.scaler = reduce_ptdf_dimensions(
+            historical_data['ptdfs'],
+            n_components=self.ptdf_components
+        )
+
+        # 3. Calculate historical baselines (for percentiles, etc.)
+        print("  - Computing historical baselines...")
+        self.ram_baseline = historical_data['rams'].groupby('cnec_id').agg([
+            pl.col('ram_after').mean().alias('ram_mean'),
+            pl.col('ram_after').std().alias('ram_std'),
+            pl.col('ram_after').quantile(0.1).alias('ram_p10'),
+            pl.col('ram_after').quantile(0.9).alias('ram_p90')
+        ])
+
+        print("✓ Feature engineer fitted")
+
+    def transform(self,
+                  data: dict,
+                  start_time: str,
+                  end_time: str) -> tuple[np.ndarray, np.ndarray]:
+        """
+        Transform JAO data to feature matrices
+
+        Args:
+            data: Same structure as fit()
+            start_time: Start of context window (e.g., 21 days before prediction)
+            end_time: End of context window (prediction time)
+
+        Returns:
+            historical_features: (n_hours, 730) array of historical context
+            future_features: (n_hours_future, 280) array of future covariates
+        """
+
+        n_hours = len(pd.date_range(start_time, end_time, freq='H'))
+        historical_features = np.zeros((n_hours, 730))
+
+        print("Engineering historical features...")
+
+        # Filter to time window
+        cnec_data = (data['cnecs']
+                     .filter(pl.col('timestamp').is_between(start_time, end_time)))
+
+        outage_data = (data['outages']
+                       .filter(pl.col('timestamp').is_between(start_time, end_time)))
+
+        # === TOP 50 CNECs - FULL DETAIL (250 features) ===
+        print("  - Top 50 CNECs (full detail)...")
+        top50_data = cnec_data.filter(pl.col('cnec_id').is_in(self.top_cnecs))
+        top50_outages = outage_data.filter(pl.col('cnec_id').is_in(self.top_cnecs))
+
+        col_idx = 0
+        for cnec_id in self.top_cnecs:
+            cnec_series = top50_data.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
+            cnec_outage = top50_outages.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
+
+            # 5 metrics per CNEC
+            historical_features[:, col_idx] = cnec_series['ram_after'].to_numpy()
+            historical_features[:, col_idx + 1] = (cnec_series['ram_after'] / cnec_series['fmax']).to_numpy()
+            historical_features[:, col_idx + 2] = cnec_series['presolved'].cast(pl.Int8).to_numpy()
+            historical_features[:, col_idx + 3] = cnec_series['shadow_price'].to_numpy()
+            historical_features[:, col_idx + 4] = cnec_outage['outage_active'].to_numpy()
+
+            col_idx += 5
+
+        # === TIER-2 150 CNECs - SELECTIVE DETAIL (300 features) ===
+        print("  - Tier-2 150 CNECs (presolved + outage)...")
+        tier2_data = cnec_data.filter(pl.col('cnec_id').is_in(self.tier2_cnecs))
+        tier2_outages = outage_data.filter(pl.col('cnec_id').is_in(self.tier2_cnecs))
+
+        # Presolved flags (150 features)
+        for cnec_id in self.tier2_cnecs:
+            cnec_series = tier2_data.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
+            historical_features[:, col_idx] = cnec_series['presolved'].cast(pl.Int8).to_numpy()
+            col_idx += 1
+
+        # Outage flags (150 features)
+        for cnec_id in self.tier2_cnecs:
+            cnec_outage = tier2_outages.filter(pl.col('cnec_id') == cnec_id).sort('timestamp')
+            historical_features[:, col_idx] = cnec_outage['outage_active'].to_numpy()
+            col_idx += 1
+
+        # === TIER-2 AGGREGATED METRICS (30 features) ===
+        print("  - Tier-2 aggregated metrics by border...")
+        borders = ['DE_FR', 'DE_NL', 'DE_CZ', 'DE_BE', 'FR_BE', 'AT_CZ',
+                   'DE_AT', 'CZ_SK', 'AT_HU', 'PL_CZ']
+
+        for border in borders:
+            # Get tier-2 CNECs for this border (IDs keep the underscore,
+            # e.g. "DE_FR_LINE_005", so match the border string as-is)
+            border_cnecs = [c for c in self.tier2_cnecs if border in c]
+
+            if not border_cnecs:
+                historical_features[:, col_idx:col_idx + 3] = 0
+                col_idx += 3
+                continue
+
+            border_data = tier2_data.filter(pl.col('cnec_id').is_in(border_cnecs))
+
+            # Aggregate per timestamp
+            border_agg = border_data.groupby('timestamp').agg([
pl.col('ram_after').mean().alias('avg_ram'), + (pl.col('ram_after') / pl.col('fmax')).mean().alias('avg_margin'), + pl.col('shadow_price').sum().alias('total_shadow_price') + ]).sort('timestamp') + + historical_features[:, col_idx] = border_agg['avg_ram'].to_numpy() + historical_features[:, col_idx + 1] = border_agg['avg_margin'].to_numpy() + historical_features[:, col_idx + 2] = border_agg['total_shadow_price'].to_numpy() + col_idx += 3 + + # === PTDF PATTERNS (10 features) === + print(" - PTDF compression...") + ptdf_subset = data['ptdfs'][start_time:end_time, :, :] + ptdf_2d = ptdf_subset.reshape(len(ptdf_subset), -1) + ptdf_scaled = self.scaler.transform(ptdf_2d) + ptdf_features = self.pca_model.transform(ptdf_scaled) # (n_hours, 10) + historical_features[:, col_idx:col_idx + 10] = ptdf_features + col_idx += 10 + + # === BORDER CAPACITY HISTORICAL (20 features) === + print(" - Border capacities...") + capacity_historical = data['entsoe']['crossborder_flows'][start_time:end_time] + historical_features[:, col_idx:col_idx + 20] = capacity_historical + col_idx += 20 + + # === SYSTEM-LEVEL AGGREGATES (20 features) === + print(" - System-level aggregates...") + system_features = aggregate_ram_system_wide(cnec_data).to_numpy() + historical_features[:, col_idx:col_idx + 20] = system_features + col_idx += 20 + + # === TEMPORAL FEATURES (10 features) === + print(" - Temporal features...") + timestamps = pd.date_range(start_time, end_time, freq='H') + temporal = create_temporal_features(timestamps) + historical_features[:, col_idx:col_idx + 10] = temporal + col_idx += 10 + + # === WEATHER FEATURES (50 features) === + print(" - Weather features...") + weather = data['weather'][start_time:end_time] + historical_features[:, col_idx:col_idx + 50] = weather + col_idx += 50 + + # === INTERACTION FEATURES (40 features) === + print(" - Interaction features...") + interactions = create_interaction_features( + cnec_data, weather, timestamps + ) + historical_features[:, col_idx:col_idx + 40] = interactions + col_idx += 40 + + print(f"✓ Historical features: {historical_features.shape}") + assert col_idx == 730, f"Expected 730 features, got {col_idx}" + + # === FUTURE COVARIATES === + future_features = self._create_future_covariates(data, end_time) + + return historical_features, future_features + + def _create_future_covariates(self, data: dict, prediction_start: str) -> np.ndarray: + """Create future covariates for 14-day horizon""" + + prediction_end = pd.Timestamp(prediction_start) + pd.Timedelta(days=14) + n_hours_future = 336 # 14 days × 24 hours + future_features = np.zeros((n_hours_future, 280)) + + print("Engineering future covariates...") + + col_idx = 0 + + # Top 50 CNEC planned outages (50 features) + for cnec_id in self.top_cnecs: + planned_outages = data['outages'].filter( + (pl.col('cnec_id') == cnec_id) & + (pl.col('outage_type') == 'planned') & + (pl.col('timestamp') >= prediction_start) & + (pl.col('timestamp') < prediction_end) + ).sort('timestamp') + + future_features[:, col_idx] = planned_outages['outage_active'].to_numpy() + col_idx += 1 + + # Tier-2 150 CNEC planned outages (150 features) + for cnec_id in self.tier2_cnecs: + planned_outages = data['outages'].filter( + (pl.col('cnec_id') == cnec_id) & + (pl.col('outage_type') == 'planned') & + (pl.col('timestamp') >= prediction_start) & + (pl.col('timestamp') < prediction_end) + ).sort('timestamp') + + future_features[:, col_idx] = planned_outages['outage_active'].to_numpy() + col_idx += 1 + + # Weather forecasts (50 features) + 
weather_forecast = data['weather_forecast'][prediction_start:prediction_end] + future_features[:, col_idx:col_idx + 50] = weather_forecast + col_idx += 50 + + # Temporal (10 features) + future_timestamps = pd.date_range(prediction_start, prediction_end, freq='H', inclusive='left') + temporal = create_temporal_features(future_timestamps) + future_features[:, col_idx:col_idx + 10] = temporal + col_idx += 10 + + # Border capacity adjustments (20 features) + planned_ntc = data['entsoe']['planned_ntc'][prediction_start:prediction_end] + future_features[:, col_idx:col_idx + 20] = planned_ntc + col_idx += 20 + + print(f"✓ Future covariates: {future_features.shape}") + assert col_idx == 280, f"Expected 280 future features, got {col_idx}" + + return future_features +``` + +### 8.2 Master Feature List (Option A Architecture) + +**Total: ~1,060 features (730 historical + 280 future + 50 real-time aggregates)** + +#### **Historical Context Features (730 total)** + +**Category 1: Top 50 CNECs - Full Detail (250 features)** +For each of the 50 most impactful CNECs: +- `ram_after_cnec_[ID]`: RAM value (MW) +- `margin_ratio_cnec_[ID]`: RAM / Fmax (0-1 normalized) +- `presolved_cnec_[ID]`: Binding status (1=binding, 0=not binding) +- `shadow_price_cnec_[ID]`: Congestion cost (€/MW) +- `outage_active_cnec_[ID]`: Outage status (1=under outage, 0=operational) + +**Selection Criteria:** +``` +Impact Score = 0.25×(appearance rate) + 0.30×(binding frequency) + + 0.20×(economic impact) + 0.15×(RAM tightness) + + 0.10×(geographic importance) +``` + +**Category 2: Tier-2 150 CNECs - Selective Detail (300 features)** + +*Individual Binary Indicators (300 features):* +- `presolved_cnec_[ID]`: Binding status for each tier-2 CNEC (150 features) + - Preserves constraint activation patterns + - Allows model to learn which CNECs bind under which conditions + +- `outage_active_cnec_[ID]`: Outage status for each tier-2 CNEC (150 features) + - Preserves EIC matching benefit + - Enables learning of outage → binding causality + - Available as future covariate (planned outages known ahead) + +*Aggregated Continuous Metrics (30 features):* +Grouped by border (10 borders × 3 metrics): +- `avg_ram_[BORDER]_tier2`: Average RAM for tier-2 CNECs on this border +- `avg_margin_ratio_[BORDER]_tier2`: Average margin ratio +- `total_shadow_price_[BORDER]_tier2`: Sum of shadow prices + +Borders: DE_FR, DE_NL, DE_CZ, DE_BE, FR_BE, AT_CZ, DE_AT, CZ_SK, AT_HU, PL_CZ + +**Category 3: PTDF Patterns (10 features)** +- `ptdf_pc1` through `ptdf_pc10`: Principal components +- Compressed from (200 CNECs × 12 zones = 2,400 raw values) +- Captures ~92% of variance in network sensitivities + +**Category 4: Border Capacity Historical (20 features)** +- `capacity_hist_de_fr`, `capacity_hist_de_nl`, etc. 
+- One feature per FBMC border (~20 borders) +- Actual historical cross-border flow capacity from ENTSO-E + +**Category 5: System-Level Aggregates (20 features)** +- `system_min_margin`: Tightest CNEC margin across all 200 +- `n_binding_cnecs_total`: Count of all binding CNECs +- `n_binding_cnecs_top50`: Count of top-50 CNECs binding +- `n_binding_cnecs_tier2`: Count of tier-2 CNECs binding +- `margin_std`: Standard deviation of margins (uniformity indicator) +- `total_congestion_cost`: Sum of all shadow prices (€) +- `max_shadow_price`: Highest shadow price (€/MW) +- `avg_shadow_price_binding`: Average shadow price when CNECs bind +- `total_outage_mw_fbmc`: Total transmission capacity under outage +- `n_cnecs_with_outage`: Count of CNECs affected by outages +- `forced_outage_count`: Count of unplanned outages +- `outage_stress_index`: Composite metric (count × duration × forced_ratio) +- `critical_line_outage_count`: High-voltage (380kV) outages +- `outage_geographic_spread`: Number of unique borders affected +- Additional 6 features: criticality scores, violation counts, etc. + +**Category 6: Temporal Features (10 features)** +- `hour_of_day`: 0-23 +- `day_of_week`: 0-6 (Monday=0) +- `month`: 1-12 +- `day_of_year`: 1-365 +- `is_weekend`: Binary (1=Sat/Sun, 0=weekday) +- `is_peak_hour`: Binary (1=8am-8pm, 0=off-peak) +- `is_holiday_de`, `is_holiday_fr`, `is_holiday_be`, `is_holiday_nl`: Country-specific holidays + +**Category 7: Weather Features (50 features)** +Key grid points (10-12 strategic locations) × 5 metrics: +- Temperature (2m above ground, °C) +- Wind speed at 10m and 100m (m/s) +- Wind direction at 100m (degrees) +- Solar radiation / GHI (W/m²) +- Cloud cover (%) + +Grid points cover: German hubs, French nuclear regions, North Sea wind, Alpine corridor, Baltic connections, etc. + +**Category 8: Interaction Features (40 features)** +Cross-feature combinations capturing domain knowledge: +- `high_wind_low_margin`: (wind > 20 m/s) × (margin < 0.3) +- `weekend_low_demand_pattern`: is_weekend × (demand < p30) +- `outage_binding_correlation`: Rolling correlation of outage presence with binding events +- `nuclear_low_wind_pattern`: (FR nuclear < 40 GW) × (wind < 10 m/s) +- `solar_peak_congestion`: (solar > 40 GW) × (hour = 12-14) +- Various other combinations (weather × capacity, temporal × binding, etc.) 
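The pipeline in Section 8.1 calls two helpers, `create_temporal_features` and `create_interaction_features`, that are not spelled out above. A minimal sketch of both follows; the holiday flags are stubs, the interaction helper shows only two of the 40 combinations, and its simplified signature is an assumption rather than the pipeline's actual interface:

```python
import numpy as np
import pandas as pd

def create_temporal_features(timestamps: pd.DatetimeIndex) -> np.ndarray:
    """Category 6 sketch: (n_hours, 10) calendar features (holiday flags stubbed)."""
    base = np.column_stack([
        timestamps.hour,                                                # hour_of_day
        timestamps.dayofweek,                                           # day_of_week, Monday=0
        timestamps.month,                                               # month
        timestamps.dayofyear,                                           # day_of_year
        (timestamps.dayofweek >= 5).astype(int),                        # is_weekend
        ((timestamps.hour >= 8) & (timestamps.hour < 20)).astype(int),  # is_peak_hour
    ])
    # is_holiday_de/_fr/_be/_nl: zero placeholders here; fill from a country
    # holiday calendar (e.g. the `holidays` package) in the real pipeline
    return np.hstack([base, np.zeros((len(timestamps), 4))])

def create_interaction_features(wind_speed: np.ndarray,
                                margin_ratio: np.ndarray,
                                is_weekend: np.ndarray,
                                demand: np.ndarray) -> np.ndarray:
    """Category 8 excerpt: two of the 40 combinations (signature simplified)."""
    high_wind_low_margin = ((wind_speed > 20) & (margin_ratio < 0.3)).astype(float)
    weekend_low_demand = (is_weekend.astype(bool)
                          & (demand < np.quantile(demand, 0.3))).astype(float)
    return np.column_stack([high_wind_low_margin, weekend_low_demand])
```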
+ +#### **Future Covariates (280 features)** + +**Category 9: Top 50 CNEC Planned Outages (50 features)** +- `planned_outage_cnec_[ID]_d[horizon]`: Binary indicator for D+1 to D+14 +- Known with certainty (scheduled maintenance published by TSOs) +- From ENTSO-E A78 document type, filtered to status='scheduled' + +**Category 10: Tier-2 150 CNEC Planned Outages (150 features)** +- `planned_outage_cnec_[ID]_d[horizon]`: Binary for each tier-2 CNEC +- Preserves full EIC matching benefit for future horizon +- High-value features (outages directly constrain capacity) + +**Category 11: Weather Forecasts (50 features)** +- Same structure as historical weather (10-12 grid points × 5 metrics) +- OpenMeteo provides 14-day forecasts (updated daily) +- Includes: temperature, wind, solar radiation, cloud cover + +**Category 12: Future Temporal (10 features)** +- Same temporal features projected for D+1 to D+14 +- Known with certainty (calendar, holidays, hour/day patterns) + +**Category 13: Border Capacity Adjustments (20 features)** +- `planned_ntc_[BORDER]_d1`: Day-ahead NTC publication per border +- TSOs publish Net Transfer Capacity (NTC) values D-1 for D-day +- Available via ENTSO-E Transparency Platform +- Provides official capacity forecasts (complementary to our model) + +--- + +### Feature Engineering Sequence + +**Step 1: Load Raw Data (Day 1)** +- JAO CNECs, PTDFs, RAMs, shadow prices (24 months) +- ENTSO-E outages via EIC matching +- ENTSO-E actual generation, load, cross-border flows +- OpenMeteo weather (historical + forecasts) + +**Step 2: Preprocess (Day 1-2)** +- Clean missing values (field-specific strategies) +- Detect and clip outliers +- Align timestamps (CET/CEST → UTC) +- Deduplicate records + +**Step 3: CNEC Selection (Day 2)** +- Analyze all ~2,000 CNECs over 24 months +- Rank by impact score +- Select top 200 (50 + 150) + +**Step 4: Feature Engineering (Day 2)** +- Top 50: Extract 5 metrics each +- Tier-2: Extract presolved + outage_active individually +- Tier-2: Aggregate RAM/shadow prices by border +- PTDF: Extract individual features for top 50, aggregate for tier-2 +- Create temporal, weather, interaction features + +**Step 5: Quality Checks (Day 2)** +- Validate feature distributions +- Check for NaN/Inf values +- Verify feature variance (no zero-variance) +- Confirm temporal alignment + +**Step 6: Save Feature Matrix (Day 2)** +``` +data/processed/ +├── features_historical_730.parquet # (17520 hours × 730 features) +├── features_future_280.parquet # (17520 hours × 280 future covariates) +├── feature_names.json # Column name mapping +└── feature_metadata.json # Descriptions, ranges, types +``` + +--- + +## 9. Quality Assurance + +### 9.1 Data Validation Checks + +```python +def validate_jao_data(df: pl.DataFrame) -> dict: + """ + Run comprehensive validation checks on JAO data + + Returns: Dict of validation results + """ + + results = { + 'passed': True, + 'warnings': [], + 'errors': [] + } + + # 1. Check for critical missing values + critical_fields = ['timestamp', 'cnec_id', 'fmax', 'ram_after'] + for field in critical_fields: + missing_pct = df[field].null_count() / len(df) * 100 + if missing_pct > 0: + results['errors'].append( + f"Critical field '{field}' has {missing_pct:.2f}% missing values" + ) + results['passed'] = False + + # 2. 
Check timestamp continuity (should be hourly)
    # The frame has one row per (timestamp, cnec_id), so diff unique timestamps
    time_diffs = df['timestamp'].unique().sort().diff().drop_nulls().dt.total_hours()
    non_hourly = (time_diffs != 1).sum()
    if non_hourly > 0:
        results['warnings'].append(
            f"Found {non_hourly} non-hourly gaps in timestamps"
        )

    # 3. Check RAM physical constraints
    invalid_ram = (df['ram_after'] > df['fmax']).sum()
    if invalid_ram > 0:
        results['errors'].append(
            f"Found {invalid_ram} records where RAM > Fmax (physically impossible)"
        )
        results['passed'] = False

    negative_ram = (df['ram_after'] < 0).sum()
    if negative_ram > 0:
        results['warnings'].append(
            f"Found {negative_ram} records with negative RAM (clipped to 0)"
        )

    # 4. Check PTDF bounds
    if 'ptdf_value' in df.columns:
        out_of_bounds = ((df['ptdf_value'] < -1.5) | (df['ptdf_value'] > 1.5)).sum()
        if out_of_bounds > 0:
            results['warnings'].append(
                f"Found {out_of_bounds} PTDF values outside [-1.5, 1.5] (clipped)"
            )

    # 5. Check shadow price reasonableness
    extreme_shadows = (df['shadow_price'] > 1000).sum()
    if extreme_shadows > 0:
        results['warnings'].append(
            f"Found {extreme_shadows} shadow prices > €1000/MW (check for outliers)"
        )

    # 6. Check for duplicate records
    duplicates = df.select(['timestamp', 'cnec_id']).is_duplicated().sum()
    if duplicates > 0:
        results['errors'].append(
            f"Found {duplicates} duplicate (timestamp, cnec_id) pairs"
        )
        results['passed'] = False

    # 7. Check data completeness by month (February has 696 hours in leap years)
    expected_hours_per_month = {1: 744, 2: 672, 3: 744, 4: 720, 5: 744, 6: 720,
                                7: 744, 8: 744, 9: 720, 10: 744, 11: 720, 12: 744}
    monthly_counts = (df.group_by(pl.col('timestamp').dt.month().alias('month'))
                        .agg(pl.col('timestamp').n_unique().alias('n_hours')))
    for month, count in monthly_counts.iter_rows():
        expected = expected_hours_per_month.get(month, 720)
        if count < expected * 0.95:  # Allow 5% missing
            results['warnings'].append(
                f"Month {month} has only {count}/{expected} expected hours"
            )

    return results
```
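A quick usage sketch: run the validation gate before any feature engineering and fail fast on errors (the input path is illustrative and matches the Day 1 deliverables):

```python
import polars as pl

# Run the validation gate on the cleaned CNEC frame
df = pl.read_parquet('data/processed/jao_cnecs_cleaned.parquet')
report = validate_jao_data(df)

for warning in report['warnings']:
    print(f"WARN: {warning}")
if not report['passed']:
    raise ValueError(f"JAO data validation failed: {report['errors']}")
```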
### 9.2 Feature Validation

```python
import numpy as np

def validate_features(features: np.ndarray,
                      feature_names: list[str]) -> dict:
    """Validate engineered feature matrix"""

    results = {'passed': True, 'warnings': [], 'errors': []}

    # 1. Check for NaN/Inf
    nan_cols = np.isnan(features).any(axis=0)
    if nan_cols.any():
        nan_features = [feature_names[i] for i, is_nan in enumerate(nan_cols) if is_nan]
        results['errors'].append(f"Features with NaN: {nan_features}")
        results['passed'] = False

    inf_cols = np.isinf(features).any(axis=0)
    if inf_cols.any():
        inf_features = [feature_names[i] for i, is_inf in enumerate(inf_cols) if is_inf]
        results['errors'].append(f"Features with Inf: {inf_features}")
        results['passed'] = False

    # 2. Check feature variance (avoid zero-variance features)
    variances = np.var(features, axis=0)
    zero_var = variances < 1e-8
    if zero_var.any():
        zero_var_features = [feature_names[i] for i, is_zero in enumerate(zero_var) if is_zero]
        results['warnings'].append(
            f"Features with near-zero variance: {zero_var_features}"
        )

    # 3. Check for features outside expected ranges
    for i, fname in enumerate(feature_names):
        if 'margin_ratio' in fname or 'percentile' in fname:
            # Should be in [0, 1]
            if (features[:, i] < -0.1).any() or (features[:, i] > 1.1).any():
                results['warnings'].append(
                    f"Feature {fname} has values outside [0, 1]"
                )

    return results
```

### 9.3 Automated Testing

```python
import pytest


class TestJAOPipeline:
    """Comprehensive test suite for the JAO data pipeline"""

    def test_data_download(self):
        """Test jao-py download completes successfully"""
        # Mock test or actual small date range
        pass

    def test_cnec_masking(self):
        """Test CNEC masking creates correct structure"""
        # Create synthetic data with missing CNECs
        # Verify mask=0 for missing, mask=1 for present
        pass

    def test_ptdf_pca(self):
        """Test PTDF dimensionality reduction"""
        # Verify variance explained >90%
        # Verify output shape (n_hours, 10)
        pass

    def test_ram_normalization(self):
        """Test RAM normalization doesn't exceed bounds"""
        # Verify 0 <= margin_ratio <= 1
        pass

    def test_feature_engineering_shape(self):
        """Test complete feature matrix has correct shape"""
        # Verify (512, 730) for the historical context window
        pass

    def test_no_data_leakage(self):
        """Verify no future data leaks into historical features"""
        # Check that features at time T only use data up to time T
        pass


if __name__ == '__main__':
    pytest.main([__file__])
```

---
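As a concrete starting point, here is one of the stubs above filled in: a self-contained property test for the RAM normalization bound on synthetic data. The `normalize_ram` helper is a sketch of the clipping rule assumed in the RAM treatment, not the pipeline's actual function:

```python
import numpy as np

def normalize_ram(ram_after: np.ndarray, fmax: np.ndarray) -> np.ndarray:
    """Sketch of the margin-ratio rule: RAM / Fmax, clipped into [0, 1]."""
    return np.clip(ram_after / fmax, 0.0, 1.0)

def test_ram_normalization_bounds():
    """Margin ratios must stay in [0, 1], even for pathological inputs."""
    rng = np.random.default_rng(42)
    fmax = rng.uniform(100, 2000, size=1_000)
    # Include negative RAM and RAM > Fmax to exercise the clipping
    ram_after = rng.uniform(-200.0, 2500.0, size=1_000)
    margin = normalize_ram(ram_after, fmax)
    assert margin.min() >= 0.0
    assert margin.max() <= 1.0
```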
## Summary: Data Collection Checklist

### Day 1 Data Collection (9.5 hours - includes outage integration)

**Morning (4.5 hours):**
- [ ] Install the jao-py library (10 min)
- [ ] Configure API credentials (ENTSO-E, if needed for JAO) (5 min)
- [ ] Download CNEC data - ALL CNECs (Oct 2023 - Sept 2025) (2 hours)
- [ ] Extract EIC codes from CNEC XML files (30 min)
- [ ] Download PTDF matrices (D-1 version only) (1.5 hours)
- [ ] Initial data validation checks (15 min)

**Afternoon (5 hours):**
- [ ] Download RAM values (1.5 hours)
- [ ] Download shadow prices (1 hour)
- [ ] Download presolved flags (included with CNECs)
- [ ] Collect ENTSO-E outage data via API (A78 document type) (20 min)
- [ ] Execute EIC-based CNEC-to-outage matching (exact + fuzzy) (15 min)
- [ ] Generate outage time series for all downloaded CNECs (15 min)
- [ ] Run cleaning procedures on all datasets (1 hour)
- [ ] Post-hoc analysis: Identify top 200 CNECs (50 + 150) (45 min)
- [ ] Create CNEC master set with masking (30 min)
- [ ] Engineer hybrid PTDF features (individual + aggregated) (20 min)

**End of Day 1 Deliverables:**
```
data/raw/
├── jao/
│   ├── cnecs_all_2024_2025.parquet (~2-3 GB, all ~2000 CNECs)
│   ├── ptdfs_2024_2025.parquet (~800 MB, D-1 version)
│   ├── rams_2024_2025.parquet (~400 MB)
│   └── shadow_prices_2024_2025.parquet (~300 MB)
├── entsoe/
│   └── outages_12m.parquet (~150 MB, A78 transmission outages)

data/processed/
├── cnec_eic_lookup.parquet (~5 MB, all CNECs with EIC codes)
├── cnec_outage_matched.parquet (~5 MB, CNEC→outage EIC mapping)
├── outage_time_series_all_cnecs.parquet (~200 MB, hourly outage indicators)
├── top_200_cnecs.json (List of selected CNECs: 50 + 150)
├── cnec_impact_scores.parquet (Ranking criteria for all CNECs)
├── jao_cnecs_cleaned.parquet (~500 MB, top 200 CNECs only)
├── jao_ptdfs_compressed.parquet (~50 MB, 10 PCA components)
├── jao_rams_normalized.parquet (~80 MB, top 200 CNECs only)
├── jao_shadow_prices_cleaned.parquet (~60 MB, top 200 CNECs only)
├── cnec_master_set_200.parquet (Complete time series, with masking)
└── ptdf_pca_model.pkl (Fitted PCA for future transforms)

reports/
├── data_quality_report.json (Validation results)
├── eic_matching_report.json (Match rates, fuzzy stats)
└── cnec_selection_analysis.html (Top 200 selection justification)
```

**Key Metrics to Report:**
- Total unique CNECs downloaded: ~2,000
- Top 200 selected: 50 (full detail) + 150 (selective detail)
- EIC exact match rate: ~85-95%
- EIC total match rate (with fuzzy): >95%
- Data completeness: >98% for critical fields
- Outliers detected and cleaned: <1% of records

---

## Next Steps

**Day 2:** Combine JAO features with:
- ENTSO-E actual generation/load/cross-border flow data
- OpenMeteo weather forecasts (52 grid points)
- Create complete feature matrix:
  - 730 historical context features
  - 280 future covariate features
  - 50 real-time system aggregates

**Day 3:** Feature validation and zero-shot inference preparation
**Day 4:** Zero-shot inference with Chronos 2 (multivariate forecasting)
**Day 5:** Evaluation, documentation, and handover

---

## 10. ENTSO-E Outage Data via EIC Matching

### 10.1 EIC Code System Overview

**What are EIC Codes?**
- **Energy Identification Code (EIC)**: Official European standard for identifying power system elements and market parties
- **Format**: 16-character alphanumeric (e.g., `10A1001C1001AXXX` for a German line)
- **Structure**: `[Issuing Office (2 digits)] + [Object Type (1 char)] + [Identifier (12 chars)] + [Check Character (1 char)]`
- **Types**:
  - **EIC-T**: Tie-lines and other transmission assets (lines, transformers, PSTs)
  - **EIC-Y**: Areas and bidding zones
- **Authority**: ENTSO-E's Central Issuing Office (CIO) issues codes
- **Standards**: Per ENTSO-E EIC Reference Manual v5.5, JAO Handbook v2.2

**Why EIC Matching Works:**
- Both JAO CNECs and ENTSO-E outages use EIC codes as primary identifiers
- JAO Handbook v2.2 (pp. 15-18): CNECs carry an `<EIC_Code>` field
- ENTSO-E Transparency Platform: Outages include "Outage Equipment EIC"
- 85-95% exact match rate (per ENTSO-E implementation guides)
- Superior to name-based matching (avoids TSO naming variations)
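Before matching, it is worth rejecting strings that cannot be EIC codes at all. A minimal format sanity check is sketched below; the regex encodes only the 16-character length and coarse character classes from the structure above, not the official check-character algorithm:

```python
import re

# Coarse EIC shape check: 2-digit issuing office, 1 object-type letter,
# 12 identifier characters, 1 check character (hyphens allowed in the body)
EIC_PATTERN = re.compile(r'^[0-9]{2}[A-Z][A-Z0-9\-]{12}[A-Z0-9\-]$')

def is_plausible_eic(code: str | None) -> bool:
    """Return True if the string has the shape of a 16-character EIC code."""
    return bool(code) and bool(EIC_PATTERN.fullmatch(code.strip().upper()))

# Example: flag CNEC identifiers that would never match any outage EIC
assert is_plausible_eic('10A1001C1001AXXX')
assert not is_plausible_eic('DE_CZ_TIE_001')
```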
### 10.2 JAO CNEC EIC Extraction

**Data Source:** `Core_DA_CC_CNEC_[date].xml` (JAO daily publication)

**Fields to Extract:**
```xml
<CNEC id="DE_CZ_TIE_001">
  <EIC_Code>10A1001C1001AXXX</EIC_Code>
  <BranchEIC>10T-DE-XXXXXXX</BranchEIC>
  <NamePerConvention>Line Röhrsdorf-Hradec 380kV N-1</NamePerConvention>
  <VoltageLevel>380</VoltageLevel>
  <TSO>50Hertz</TSO>
  <FromSubstation>
    <Name>Röhrsdorf</Name>
    <EIC>10YDE-XXXXXXX</EIC>
  </FromSubstation>
  <ToSubstation>
    <Name>Hradec</Name>
    <EIC>10YCZ-XXXXXXX</EIC>
  </ToSubstation>
</CNEC>
```

**Preprocessing Script:**
```python
import xmltodict
import polars as pl
from pathlib import Path

def extract_cnec_eic_codes(jao_data_dir: Path) -> pl.DataFrame:
    """Extract EIC codes from JAO CNEC XML files"""

    cnec_eic_mapping = []

    # Process all CNEC files in date range
    for xml_file in jao_data_dir.glob('Core_DA_CC_CNEC_*.xml'):
        with open(xml_file, 'r') as f:
            # force_list ensures a single-CNEC file still parses to a list
            data = xmltodict.parse(f.read(), force_list=('CNEC',))

        for cnec in data['CNECs']['CNEC']:
            cnec_eic_mapping.append({
                'cnec_id': cnec['@id'],
                'eic_code': cnec.get('EIC_Code'),
                'branch_eic': cnec.get('BranchEIC'),
                'line_name': cnec.get('NamePerConvention'),
                'voltage_level': int(cnec.get('VoltageLevel') or 0),
                'tso': cnec.get('TSO'),
                'from_substation': cnec['FromSubstation']['Name'],
                'to_substation': cnec['ToSubstation']['Name'],
                'from_substation_eic': cnec['FromSubstation'].get('EIC'),
                'to_substation_eic': cnec['ToSubstation'].get('EIC'),
                'contingency': cnec.get('Contingency', {}).get('Name')
            })

    # Create lookup table (deduplicate CNECs)
    cnec_eic_df = pl.DataFrame(cnec_eic_mapping).unique(subset=['cnec_id'])

    return cnec_eic_df

# Execute
cnec_eic_lookup = extract_cnec_eic_codes(Path('data/raw/jao/'))
cnec_eic_lookup.write_parquet('data/processed/cnec_eic_lookup.parquet')
print(f"Extracted EIC codes for {len(cnec_eic_lookup)} unique CNECs")
```
### 10.3 ENTSO-E Outage Data Collection

**Data Source:** ENTSO-E Transparency Platform API
- **Document Type:** A78 - "Unavailability of Transmission Infrastructure"
- **Coverage:** Planned and forced outages for transmission elements (lines, transformers, HVDC)
- **Frequency:** Hourly updates, historical data back to 2015
- **Domain:** Core CCR region (EIC: `10Y1001A1001A83F`)

**API Collection:**
```python
from entsoe import EntsoePandasClient
import polars as pl
import pandas as pd

def collect_entsoe_outages(api_key: str, start_date: str, end_date: str) -> pl.DataFrame:
    """Collect transmission outage data (A78) from ENTSO-E"""

    client = EntsoePandasClient(api_key=api_key)

    # Transmission unavailability is published per border, so query the
    # neighbouring bidding-zone pairs of the Core region and aggregate
    core_borders = [
        ('DE_LU', 'FR'), ('DE_LU', 'NL'), ('DE_LU', 'BE'), ('DE_LU', 'AT'),
        ('DE_LU', 'CZ'), ('DE_LU', 'PL'), ('FR', 'BE'), ('BE', 'NL'),
        ('AT', 'CZ'), ('AT', 'HU'), ('AT', 'SI'), ('CZ', 'SK'), ('SK', 'HU'),
        ('HU', 'RO'), ('HU', 'HR'), ('PL', 'CZ'), ('PL', 'SK')
    ]

    outage_records = []

    for country_from, country_to in core_borders:
        print(f"Fetching outages for {country_from}-{country_to}...")

        try:
            # entsoe-py exposes the A78 document type via this method
            outages = client.query_unavailability_transmission(
                country_code_from=country_from,
                country_code_to=country_to,
                start=pd.Timestamp(start_date, tz='UTC'),
                end=pd.Timestamp(end_date, tz='UTC')
            )

            # Parse response (exact column names vary by entsoe-py version)
            for idx, outage in outages.iterrows():
                outage_records.append({
                    'outage_eic': outage.get('affected_unit_eic'),
                    'line_name': outage.get('affected_unit_name'),
                    'voltage_kv': outage.get('nominal_power'),  # For lines, often voltage
                    'tso': outage.get('tso'),
                    'border': f"{country_from}-{country_to}",
                    'outage_type': outage.get('type'),  # 'A53' (planned) or 'A54' (forced)
                    'start_time': outage.get('start'),
                    'end_time': outage.get('end'),
                    'available_capacity_mw': outage.get('available_capacity'),
                    'unavailable_capacity_mw': outage.get('unavailable_capacity'),
                    'status': outage.get('status')  # Active, scheduled, cancelled
                })
        except Exception as e:
            print(f"Warning: Could not fetch outages for {country_from}-{country_to}: {e}")
            continue

    outages_df = pl.DataFrame(outage_records)

    # Filter to transmission elements only (voltage >= 220 kV)
    outages_df = outages_df.filter(
        (pl.col('voltage_kv') >= 220) | pl.col('voltage_kv').is_null()
    )

    return outages_df

# Execute
outages = collect_entsoe_outages(
    api_key='YOUR_ENTSOE_KEY',
    start_date='2023-10-01',
    end_date='2025-09-30'
)
outages.write_parquet('data/raw/entsoe_outages_12m.parquet')
print(f"Collected {len(outages)} outage records")
```

### 10.4 EIC-Based Matching (Primary + Fuzzy Fallback)

**Matching Strategy:**
1. **Primary:** Exact EIC code matching (85-95% success rate)
2. **Fallback:** Fuzzy matching on line name + substations + voltage (5-15% remaining)
3. **Result:** >95% total match rate

```python
import polars as pl
from fuzzywuzzy import fuzz, process

def match_cnecs_to_outages(cnec_eic: pl.DataFrame,
                           outages: pl.DataFrame) -> pl.DataFrame:
    """
    Match CNECs to outages using EIC codes with fuzzy fallback

    Returns: DataFrame with CNEC-to-outage mappings
    """

    # STEP 1: Exact EIC Match
    # polars drops the right-hand join key on a left join, so join on a
    # renamed copy and keep 'outage_eic' as a payload column; one row per
    # element is enough for the mapping table
    outage_lookup = (outages
                     .select(['outage_eic', 'line_name', 'tso', 'voltage_kv',
                              'start_time', 'end_time', 'outage_type'])
                     .unique(subset=['outage_eic'])
                     .with_columns(pl.col('outage_eic').alias('eic_join_key')))

    matched = cnec_eic.join(
        outage_lookup,
        left_on='eic_code',
        right_on='eic_join_key',
        how='left'
    )

    # Check exact match rate
    exact_matched = matched.filter(pl.col('outage_eic').is_not_null())
    exact_match_rate = len(exact_matched) / len(matched)
    print(f"Exact EIC match rate: {exact_match_rate:.1%}")
    print(f"  Matched: {len(exact_matched)} CNECs")
    print(f"  Unmatched: {len(matched) - len(exact_matched)} CNECs")

    # STEP 2: Fuzzy Fallback for Unmatched
    unmatched = matched.filter(pl.col('outage_eic').is_null())

    if len(unmatched) > 0:
        print(f"\nApplying fuzzy matching to {len(unmatched)} unmatched CNECs...")

        # Prepare search corpus from outages
        outage_search_corpus = []
        for row in outages.iter_rows(named=True):
            search_str = f"{row['line_name']} {row['tso']} {row['voltage_kv']}kV"
            outage_search_corpus.append({
                'search_str': search_str,
                'outage_eic': row['outage_eic']
            })

        fuzzy_matches = []
        for row in unmatched.iter_rows(named=True):
            # Construct CNEC search string
            cnec_search = (f"{row['line_name']} {row['from_substation']} "
                           f"{row['to_substation']} {row['tso']} {row['voltage_level']}kV")

            # Find best match
            best_match = process.extractOne(
                cnec_search,
                [item['search_str'] for item in outage_search_corpus],
                scorer=fuzz.token_set_ratio,
                score_cutoff=85  # 85% similarity threshold
            )

            if best_match:
                match_idx = [i for i, item in enumerate(outage_search_corpus)
                             if item['search_str'] == best_match[0]][0]
                matched_eic = outage_search_corpus[match_idx]['outage_eic']

                fuzzy_matches.append({
                    'cnec_id': row['cnec_id'],
                    'matched_outage_eic': matched_eic,
                    'match_score': best_match[1],
                    'match_method': 'fuzzy'
                })

        print(f"  Fuzzy matched: {len(fuzzy_matches)} additional CNECs")

        # Update matched dataframe with fuzzy results
        if fuzzy_matches:
            fuzzy_df = pl.DataFrame(fuzzy_matches)
            matched = matched.join(
                fuzzy_df.select(['cnec_id', 'matched_outage_eic']),
                on='cnec_id',
                how='left'
            ).with_columns(
                pl.when(pl.col('outage_eic').is_null())
                  .then(pl.col('matched_outage_eic'))
                  .otherwise(pl.col('outage_eic'))
                  .alias('outage_eic')
            )

    # Final match statistics
    final_matched = matched.filter(pl.col('outage_eic').is_not_null())
    final_match_rate = len(final_matched) / len(matched)
    print(f"\nFinal match rate: {final_match_rate:.1%}")
    print(f"  Total matched: {len(final_matched)} CNECs")
    print(f"  Unmatched: {len(matched) - len(final_matched)} CNECs")

    return matched

# Execute matching
cnec_eic = pl.read_parquet('data/processed/cnec_eic_lookup.parquet')
outages = pl.read_parquet('data/raw/entsoe_outages_12m.parquet')

matched_cnecs = match_cnecs_to_outages(cnec_eic, outages)
matched_cnecs.write_parquet('data/processed/cnec_outage_matched.parquet')
```

**Expected Output:**
```
Exact EIC match rate: 87.5%
  Matched: 175 CNECs
  Unmatched: 25 CNECs

Applying fuzzy matching to 25 unmatched CNECs...
  Fuzzy matched: 18 additional CNECs

Final match rate: 96.5%
  Total matched: 193 CNECs
  Unmatched: 7 CNECs
```
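The Day 1 deliverables include `eic_matching_report.json`. A minimal sketch that derives it from the saved mapping follows; the JSON field names are illustrative:

```python
import json
import polars as pl

# Persist match statistics as the eic_matching_report.json deliverable
matched_cnecs = pl.read_parquet('data/processed/cnec_outage_matched.parquet')
n_total = len(matched_cnecs)
n_matched = int(matched_cnecs['outage_eic'].is_not_null().sum())

report = {
    'n_cnecs': n_total,
    'n_matched': n_matched,
    'match_rate': round(n_matched / n_total, 4),
    'unmatched_cnec_ids': matched_cnecs.filter(
        pl.col('outage_eic').is_null())['cnec_id'].to_list(),
}
with open('reports/eic_matching_report.json', 'w') as f:
    json.dump(report, f, indent=2)
```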
### 10.5 Outage Feature Engineering

**Outage Time Series Generation:**
```python
from datetime import datetime

import polars as pl

def create_outage_time_series(matched_cnecs: pl.DataFrame,
                              outages: pl.DataFrame,
                              timestamps: pl.Series) -> pl.DataFrame:
    """
    Create binary outage indicators for each CNEC over time

    For each timestamp and CNEC:
    - outage_active = 1 if element under outage
    - outage_active = 0 if element operational
    - outage_type = 'planned' or 'forced' if active

    Returns: long-format DataFrame with one row per (timestamp, cnec_id)
    """

    # NOTE: simple O(n_cnecs x n_hours) loop for clarity; vectorize with
    # interval joins for production use
    outage_series = []

    for cnec in matched_cnecs.iter_rows(named=True):
        cnec_id = cnec['cnec_id']
        outage_eic = cnec['outage_eic']

        if outage_eic is None:
            # No matching outage data - assume operational
            # (same keys as the active branch, so the schema stays consistent)
            for ts in timestamps:
                outage_series.append({
                    'timestamp': ts,
                    'cnec_id': cnec_id,
                    'outage_active': 0,
                    'outage_type': None,
                    'days_until_end': None,
                    'unavailable_capacity_mw': None
                })
            continue

        # Get all outages for this EIC
        cnec_outages = outages.filter(pl.col('outage_eic') == outage_eic)

        for ts in timestamps:
            # Check if any outage is active at this timestamp
            active_outages = cnec_outages.filter(
                (pl.col('start_time') <= ts) & (pl.col('end_time') >= ts)
            )

            if len(active_outages) > 0:
                # Take the outage with earliest end time (most immediate constraint)
                outage_row = active_outages.sort('end_time').row(0, named=True)

                days_remaining = (outage_row['end_time'] - ts).total_seconds() / 86400

                outage_series.append({
                    'timestamp': ts,
                    'cnec_id': cnec_id,
                    'outage_active': 1,
                    'outage_type': outage_row['outage_type'],
                    'days_until_end': days_remaining,
                    'unavailable_capacity_mw': outage_row.get('unavailable_capacity_mw')
                })
            else:
                # No active outage
                outage_series.append({
                    'timestamp': ts,
                    'cnec_id': cnec_id,
                    'outage_active': 0,
                    'outage_type': None,
                    'days_until_end': None,
                    'unavailable_capacity_mw': None
                })

    return pl.DataFrame(outage_series)

# Generate outage time series for all 200 CNECs
timestamps = pl.datetime_range(
    start=datetime(2023, 10, 1),
    end=datetime(2025, 9, 30),
    interval='1h',
    eager=True  # materialize as a Series
)

outage_features = create_outage_time_series(
    matched_cnecs=pl.read_parquet('data/processed/cnec_outage_matched.parquet'),
    outages=pl.read_parquet('data/raw/entsoe_outages_12m.parquet'),
    timestamps=timestamps
)

outage_features.write_parquet('data/processed/outage_time_series_200cnecs.parquet')
print(f"Generated outage features: {outage_features.shape}")
```

**Outage Feature Categories:**

**1.
CNEC-Level Binary Indicators (200 features for all CNECs):** +- `outage_active_cnec_001` through `outage_active_cnec_200` +- Value: 1 if element under outage, 0 if operational +- Available as both historical context and future covariates (planned outages) + +**2. Border-Level Aggregations (20 features):** +- `outage_count_de_cz`: Number of CNECs on DE-CZ border with active outages +- `outage_mw_de_cz`: Total unavailable capacity on DE-CZ border +- `forced_outage_ratio_de_cz`: Ratio of forced vs planned outages +- Repeat for 10 major borders × 2 metrics = 20 features + +**3. System-Level Aggregations (8 features):** +- `total_outage_mw_fbmc`: Total transmission capacity unavailable +- `n_cnecs_with_outage`: Count of CNECs affected by outages +- `forced_outage_count`: Count of unplanned outages (stress indicator) +- `max_outage_duration_remaining`: Days until longest outage ends +- `avg_outage_duration`: Average remaining outage duration +- `outage_stress_index`: Weighted measure (count × avg_duration × forced_ratio) +- `critical_line_outage_count`: Count of high-voltage (380kV) outages +- `outage_geographic_spread`: Number of unique borders affected + +**4. Outage Duration Features (Top 50 CNECs - 150 features):** +For each of the Top 50 CNECs, calculate three temporal features: +- `outage_elapsed_cnec_[ID]`: Hours elapsed since outage started + - Calculation: `(current_timestamp - start_time).total_hours()` + - Value: 0 if no active outage +- `outage_remaining_cnec_[ID]`: Hours remaining until outage ends + - Calculation: `(end_time - current_timestamp).total_hours()` + - Value: 0 if no active outage +- `outage_total_duration_cnec_[ID]`: Total planned duration of active outage + - Calculation: `(end_time - start_time).total_hours()` + - Value: 0 if no active outage + +**Rationale:** +- Binary `outage_active` doesn't capture temporal stress accumulation +- A 2-hour outage has different impact than 48-hour outage +- Enables model to learn: "When outage_elapsed > 24h, constraint severity increases" +- Direct causal chain: duration → network stress → RAM reduction → Max BEX impact +- Research indicates ~15% of BEX anomalies correlate with extended outage durations + +**5. 
Outage Duration Aggregates (Tier-2 150 CNECs - 30 features):**
For tier-2 CNECs, aggregate duration metrics by border:
- `avg_outage_duration_[BORDER]_tier2`: Average duration of active outages (hours)
- `max_outage_duration_[BORDER]_tier2`: Longest active outage on border (hours)
- `forced_outage_duration_ratio_[BORDER]_tier2`: (forced_duration / total_duration)
- 10 major borders × 3 metrics = 30 features

**Future Covariates (Planned Outages):**
- Same 200 CNEC-level binary indicators for D+1 to D+14 horizon
- Only includes planned/scheduled outages (status = 'scheduled' in ENTSO-E)
- These are KNOWN with certainty (gold for forecasting)

### 10.6 Integration with Main Pipeline

**Day 1 Timeline Addition:**
- **Morning (+30 min):** Extract EIC codes from JAO CNEC files
- **Morning (+20 min):** Collect ENTSO-E outage data via API
- **Afternoon (+15 min):** Execute EIC-based matching (exact + fuzzy)
- **Afternoon (+15 min):** Generate outage time series for 200 CNECs

**Total Additional Time: ~1.5 hours**

**Data Flow:**
```
JAO CNEC XML → Extract EIC codes → cnec_eic_lookup.parquet
        ↓
ENTSO-E API → Outage data → entsoe_outages_12m.parquet
        ↓
  EIC Matching (exact + fuzzy)
        ↓
  cnec_outage_matched.parquet
        ↓
  Generate time series for each CNEC
        ↓
  outage_time_series_200cnecs.parquet
        ↓
  Feature Engineering Pipeline (Day 2)
```
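To make the per-CNEC duration metrics from §10.5 concrete, here is a minimal polars sketch deriving the three temporal features from the outage intervals. It assumes the outage time series has been joined back to its interval bounds (`start_time`, `end_time`); the helper name is illustrative:

```python
import polars as pl

def add_outage_duration_features(outage_ts: pl.DataFrame) -> pl.DataFrame:
    """Derive elapsed/remaining/total outage hours per (timestamp, cnec_id).

    Rows without an active outage get zeros, matching the feature definitions.
    """
    active = pl.col('outage_active') == 1
    return outage_ts.with_columns([
        pl.when(active)
          .then((pl.col('timestamp') - pl.col('start_time')).dt.total_hours())
          .otherwise(0)
          .alias('outage_elapsed'),
        pl.when(active)
          .then((pl.col('end_time') - pl.col('timestamp')).dt.total_hours())
          .otherwise(0)
          .alias('outage_remaining'),
        pl.when(active)
          .then((pl.col('end_time') - pl.col('start_time')).dt.total_hours())
          .otherwise(0)
          .alias('outage_total_duration'),
    ])
```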
---

## 11. Final Feature Architecture (Option A with Tier-2 Binding)

### 11.1 Feature Count Summary

**Total Features: ~1,735** (Hybrid PTDF + Max BEX + LTN + Net Positions + ATC + Outage Duration)
- Historical Context: ~1,000 features
- Future Covariates: ~380 features
- Real-Time Aggregates: ~355 features

**New Feature Categories Added:**
- Target History (Max BEX): +20 features (historical context)
- LTN Allocations: +40 features (20 historical + 20 future covariates)
- Net Position Features: +48 features (12 zones × 4 metrics)
- Non-Core ATC: +28 features (14 borders × 2 directions)

### 11.2 Historical Context Features (1,000 features)

#### **Top 50 CNECs - Full Detail + Individual PTDFs (1,000 features)**
For each of the 50 most impactful CNECs, 8 core metrics + 12 zonal PTDF sensitivities:
- `ram_after_cnec_[ID]`: RAM value (MW)
- `margin_ratio_cnec_[ID]`: RAM / Fmax (normalized)
- `presolved_cnec_[ID]`: Binding status (1 = binding, 0 = not binding)
- `shadow_price_cnec_[ID]`: Congestion cost (€/MW)
- `outage_active_cnec_[ID]`: Outage status (1 = element under outage, 0 = operational)
- `outage_elapsed_cnec_[ID]`: Hours elapsed since outage started (0 if no outage)
- `outage_remaining_cnec_[ID]`: Hours remaining until outage ends (0 if no outage)
- `outage_total_duration_cnec_[ID]`: Total planned duration of active outage (0 if no outage)

**Total Top 50 CNEC Features: 50 CNECs × (8 core metrics + 12 PTDF sensitivities) = 1,000 features**

**Selection Criteria for Top 50:**
```python
top_50_score = (
    0.25 * (days_appeared / 365) +             # Consistency
    0.30 * (times_binding / times_appeared) +  # Binding frequency
    0.20 * (avg_shadow_price / 100) +          # Economic impact
    0.15 * (hours_ram_low / total_hours) +     # Operational tightness
    0.10 * geographic_importance               # Border coverage
)
```

#### **Tier-2 150 CNECs - Selective Detail (300 features)**

**Individual Binary Indicators (300 features):**
- `presolved_cnec_[ID]`: Binding status for each of 150 CNECs (150 features)
  - Preserves constraint activation patterns
  - Model learns which tier-2 CNECs bind under which conditions

- `outage_active_cnec_[ID]`: Outage status for each of 150 CNECs (150 features)
  - Preserves EIC matching benefit
  - Future covariates: planned outages known ahead
  - Model learns outage → binding relationships

**Rationale for Preserving These Two:**
- Both are **discrete/binary** (low redundancy across CNECs)
- Both are **high-signal**:
  - `presolved`: Indicates current constraint state
  - `outage_active`: Predicts future constraint likelihood
- Allows learning **cross-CNEC interactions**: "When CNEC_X has outage, CNEC_Y binds"

**Aggregated Continuous Metrics (60 features):**
Grouped by border/region for remaining metrics:
- `avg_ram_de_cz_tier2`: Average RAM for tier-2 DE-CZ CNECs
- `avg_margin_ratio_de_cz_tier2`: Average margin ratio
- `total_shadow_price_de_cz_tier2`: Sum of shadow prices
- `ram_volatility_de_cz_tier2`: Standard deviation of RAM
- `avg_outage_duration_de_cz_tier2`: Average duration of active outages (hours)
- `max_outage_duration_de_cz_tier2`: Longest active outage on border (hours)
- Repeat for 10 major borders × 6 metrics = 60 features

**Why Aggregate These:**
- **RAM, shadow prices** are continuous and correlated within a region
- Preserves regional capacity patterns without full redundancy
- Collapses hundreds of per-CNEC continuous series into 60 border-level aggregates

#### **PTDF Patterns (10 features)**
- `ptdf_pc1` through `ptdf_pc10`: Principal components
- Compressed from (200 CNECs × 12 zones = 2,400 values) → 10 components
- Captures ~92% of PTDF variance

#### **Border Capacity Historical (20 features)**
- `capacity_hist_de_fr`, `capacity_hist_de_nl`, etc.
- One feature per FBMC border
- Actual historical cross-border flow capacity (from ENTSO-E)

#### **System-Level Aggregates (20 features)**
- `system_min_margin`: Tightest CNEC margin across all 200
- `n_binding_cnecs_total`: Count of binding CNECs
- `n_binding_cnecs_top50`: Count of top-50 CNECs binding
- `n_binding_cnecs_tier2`: Count of tier-2 CNECs binding
- `margin_std`: Standard deviation of margins
- `total_congestion_cost`: Sum of all shadow prices
- `max_shadow_price`: Highest shadow price
- `avg_shadow_price_binding`: Average price when CNECs bind
- `total_outage_mw_fbmc`: Total capacity under outage
- `n_cnecs_with_outage`: Count of CNECs affected
- `forced_outage_count`: Unplanned outages
- `outage_stress_index`: Composite stress metric
- Additional 8 features (criticality scores, violation counts, etc.)

#### **Temporal Features (10 features)**
- `hour_of_day`: 0-23
- `day_of_week`: 0-6 (Monday=0)
- `month`: 1-12
- `day_of_year`: 1-365
- `is_weekend`: Binary
- `is_peak_hour`: Binary (8am-8pm)
- `is_holiday_de`, `is_holiday_fr`, `is_holiday_be`, `is_holiday_nl`

#### **Weather Features (50 features)**
- Key grid points (10-12 strategic locations)
- Per point: temperature, wind speed (10m, 100m), wind direction, solar radiation, cloud cover
- ~5 metrics × 10 points = 50 features

#### **Target History Features (20 features) - NEW**
- `max_bex_hist_[BORDER]`: Historical Max BEX per border (20 FBMC Core borders)
- Used in context window (past 21 days)
- Model learns patterns in capacity evolution
- Example features: `max_bex_hist_de_fr`, `max_bex_hist_de_nl`, etc.
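A minimal sketch of how the per-border Max BEX history columns could be laid out from the raw MaxBEX parquet (the long-format schema `timestamp`/`border`/`max_bex` is an assumption about the collected file, and the pivot call uses the polars >= 1.0 signature):

```python
import polars as pl

# Build the max_bex_hist_[BORDER] context columns: long -> wide pivot,
# one column per Core border on an hourly index
maxbex = pl.read_parquet('data/raw/jao/maxbex_2023_2025.parquet')

target_history = (
    maxbex
    .pivot(on='border', index='timestamp', values='max_bex')
    .sort('timestamp')
)
# e.g. rename 'DE-FR' -> 'max_bex_hist_de_fr' to match the feature list
target_history = target_history.rename({
    c: f"max_bex_hist_{c.lower().replace('-', '_')}"
    for c in target_history.columns if c != 'timestamp'
})
```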
#### **LTN Allocation Features (20 features) - NEW**
- `ltn_allocated_[BORDER]`: Long-term capacity already committed per border
- Values from yearly/monthly JAO auctions
- Used in both historical context (what was allocated) and future covariates (known allocations ahead)
- Example: if 500 MW of LTN is allocated on DE-FR, Max BEX will be ~500 MW lower

#### **Net Position Features (48 features) - NEW**
- `net_pos_min_[ZONE]`: Minimum feasible net position per zone (12 features)
- `net_pos_max_[ZONE]`: Maximum feasible net position per zone (12 features)
- `net_pos_range_[ZONE]`: Degrees of freedom (max - min) per zone (12 features)
- `net_pos_margin_[ZONE]`: Utilization ratio per zone (12 features)
- Zones: DE_LU, FR, BE, NL, AT, CZ, PL, SK, HU, SI, HR, RO

#### **Non-Core ATC Features (28 features) - NEW**
- `atc_[NON_CORE_BORDER]_forward`: Forward direction capacity (14 features)
- `atc_[NON_CORE_BORDER]_backward`: Backward direction capacity (14 features)
- Key borders: FR-UK, FR-ES, FR-CH, DE-CH, DE-DK, AT-CH, AT-IT, PL-SE, PL-LT, etc.
- These capture loop flow drivers that affect Core CNECs

#### **Interaction Features (40 features)**
- `high_wind_low_margin`: Interaction of wind > threshold & margin < threshold
- `weekend_low_demand_pattern`: Weekend × low demand indicator
- `outage_binding_correlation`: Correlation of outage presence with binding events
- Various cross-feature products and ratios

### 11.3 Future Covariates (300 features)

#### **Top 50 CNEC Planned Outages (50 features)**
- `planned_outage_cnec_[ID]`: Binary indicator for D+1 to D+14
- Known with certainty (scheduled maintenance)

#### **Tier-2 150 CNEC Planned Outages (150 features)**
- `planned_outage_cnec_[ID]`: Binary for each tier-2 CNEC
- Preserves full EIC matching benefit for future horizon

#### **Weather Forecasts (50 features)**
- Same structure as historical weather
- OpenMeteo provides 14-day forecasts

#### **Temporal (10 features)**
- Same temporal features projected for D+1 to D+14
- Known with certainty

#### **LTN Future Allocations (20 features) - NEW**
- `ltn_allocated_[BORDER]_future`: Known LT capacity allocations for D+1 to D+14
- Values from yearly/monthly auction results (KNOWN IN ADVANCE)
- Yearly auction results known for entire year ahead
- Monthly auction results known for month ahead
- **Gold standard future covariate** - 100% certain values
- Directly impacts Max BEX: higher LTN = lower available day-ahead capacity

#### **Border Capacity Adjustments (20 features)**
- `planned_ntc_de_fr_d1`: Day-ahead NTC publication per border
- TSOs publish these D-1 for D-day

### 11.4 Feature Engineering Workflow

**Input to Chronos 2:**
```python
# Historical context window (21 days before prediction)
historical_features: np.ndarray  # Shape: (512 hours, 1000 features) - UPDATED WITH OUTAGE DURATION

# Future covariates (14 days ahead)
future_features: np.ndarray  # Shape: (336 hours, 380 features) - NO CHANGE

# Combine for inference
chronos_input = {
    'context': historical_features,         # What happened
    'future_covariates': future_features    # What we know will happen
}

# Chronos 2 predicts Max BEX for each border
forecast = pipeline.predict(
    context=chronos_input['context'],
    future_covariates=chronos_input['future_covariates'],
    prediction_length=336  # 14 days × 24 hours
)
```

---

## Questions & Clarifications

**CRITICAL UPDATES FROM PREVIOUS CONVERSATION:**
1.
**Max BEX Added:** TARGET VARIABLE now included - Day 1 first priority collection +2. **LTN Added:** Long Term Nominations with future covariate capability (auction results known in advance) +3. **Min/Max Net Positions Added:** Domain boundaries that define feasible space +4. **ATC Non-Core Added:** Loop flow drivers from external borders + +**Previous Decisions (Still Valid):** +5. **D2CF Decision:** Confirmed SKIP - not needed for forecasting +6. **CNEC Count:** Top 200 total (50 full detail + 150 selective detail) +7. **PTDF Treatment:** Hybrid approach - 600 individual (top 50) + 120 border-aggregated (tier-2) +8. **Historical Period:** 24 months (Oct 2023 - Sept 2025) +9. **Missing Data Strategy:** Field-specific (forward-fill, zero-fill, interpolation) +10. **Outage Matching:** EIC code-based (85-95% exact) + fuzzy fallback (>95% total) +11. **Tier-2 CNEC Treatment:** Preserve presolved + outage_active individually, aggregate RAM/shadow/PTDF by border +12. **Total Features:** ~1,735 (1,000 historical + 380 future + 355 aggregates) - HYBRID PTDF + OUTAGE DURATION +13. **Outage Duration:** Added 3 duration features per top-50 CNEC (elapsed, remaining, total) + border aggregates for tier-2 + +**Verification Status:** +- ✅ Max BEX: Confirmed available in JAO Publication Tool ("Max Exchanges" page) +- ✅ LTN: Confirmed available and CAN be used as future covariate (auction results known ahead) +- ✅ Min/Max Net Positions: Confirmed published per hub on JAO +- ✅ ATC Non-Core: Confirmed published for external borders on JAO + +--- + +**Document Version:** 5.0 - OUTAGE DURATION FEATURES ADDED +**Last Updated:** October 29, 2025 +**Status:** COMPLETE - Ready for Day 1 Implementation with ALL Essential Data Series +**Major Changes:** Added outage duration features (elapsed, remaining, total) for enhanced temporal modeling, ~1,735 total features (+11.6% from v4.0) +**Contact:** Maintain as living document throughout MVP development