Choi jun hyeok committed on
Commit d4a3b8b · 1 Parent(s): 1003766

Deploy Flask app to HF Space

.env ADDED
@@ -0,0 +1 @@
+ GEMINI_API_KEY = AIzaSyDUK2u7PljRBqzj9lhM6Ydxm4-SdxCU-vY
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ data/*.xlsx filter=lfs diff=lfs merge=lfs -text
37
+ data_csv/*.csv filter=lfs diff=lfs merge=lfs -text
Dockerfile ADDED
@@ -0,0 +1,33 @@
+ FROM python:3.10-slim
+
+ ENV DEBIAN_FRONTEND=noninteractive
+
+ # System packages required for KoNLPy (Java) and MeCab tokenizer
+ RUN apt-get update \
+     && apt-get install -y --no-install-recommends \
+         build-essential \
+         default-jdk \
+         mecab \
+         libmecab-dev \
+         mecab-ipadic-utf8 \
+         python3-dev \
+     && apt-get clean \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ # Install Python dependencies
+ COPY requirements.txt ./
+ RUN pip install --upgrade pip \
+     && pip install --no-cache-dir -r requirements.txt
+
+ # Copy application source and artifacts
+ COPY . .
+
+ ENV PYTHONPATH=/app
+ ENV PORT=7860
+
+ EXPOSE 7860
+
+ # Gunicorn respects the PORT provided by Hugging Face Spaces.
+ CMD ["sh", "-c", "gunicorn --bind 0.0.0.0:$PORT --workers ${GUNICORN_WORKERS:-2} wsgi:application"]
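Note: the container's CMD points Gunicorn at `wsgi:application`, but `wsgi.py` itself does not appear in this commit. A minimal sketch of what that entrypoint presumably contains, assuming the Flask instance created in `app.py` is named `app` (as it is in the `app.py` diff below):

```python
# wsgi.py -- hypothetical sketch; the actual file is not part of this diff.
# Gunicorn's "wsgi:application" target resolves module `wsgi`, attribute `application`.
import os

from app import app as application  # re-export the Flask instance under the WSGI name

if __name__ == "__main__":
    # Convenience path for `python wsgi.py`; Gunicorn never executes this branch.
    application.run(host="0.0.0.0", port=int(os.environ.get("PORT", "7860")))
```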
EDA_analysis.html ADDED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -1,10 +1,134 @@
- ---
- title: Broadcast Paper
- emoji: 🏆
- colorFrom: gray
- colorTo: pink
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Dacon Broadcast Article Performance Predictor
+
+ This project hosts a Flask web application that predicts article performance and provides AI-powered SEO recommendations.
+
+ ## Local development
+
+ 1. Create a virtual environment and install dependencies.
+ ```powershell
+ python -m venv .venv
+ .\.venv\Scripts\Activate.ps1
+ pip install -r requirements.txt
+ ```
+ 2. Ensure the model artifacts are generated:
+ ```powershell
+ .\.venv\Scripts\python.exe train_and_save_models.py
+ ```
+ 3. Add your Google Generative AI key to a `.env` file:
+ ```ini
+ GEMINI_API_KEY=your-api-key
+ ```
+ 4. Run the development server:
+ ```powershell
+ .\.venv\Scripts\python.exe app.py
+ ```
+
+ ## Production deployment (Gunicorn + Nginx)
+
+ 1. **Copy project to server** (e.g., `/srv/dacon_broadcast_paper`).
+ 2. **Create virtual environment** and install requirements as above.
+ 3. **Generate artifacts** on the server or copy them from local build.
+ 4. **Configure environment variables**:
+ ```bash
+ echo "GEMINI_API_KEY=your-api-key" | sudo tee /etc/dacon_app.env
+ ```
+ 5. **Test Gunicorn manually**:
+ ```bash
+ cd /srv/dacon_broadcast_paper
+ source .venv/bin/activate
+ gunicorn --bind 127.0.0.1:8000 --workers 3 --timeout 120 wsgi:application
+ ```
+
+ ### systemd service
+
+ Use `deploy/dacon_app.service` as a template:
+ ```bash
+ sudo cp deploy/dacon_app.service /etc/systemd/system/dacon_app.service
+ sudo systemctl daemon-reload
+ sudo systemctl enable dacon_app
+ sudo systemctl start dacon_app
+ sudo systemctl status dacon_app
+ ```
+ Adjust `WorkingDirectory`, `ExecStart`, and `Environment` entries to match your server paths or reference `/etc/dacon_app.env` with `EnvironmentFile=` if preferred.
+
+ ### Nginx reverse proxy
+
+ 1. Install Nginx (`sudo apt install nginx`).
+ 2. Copy the provided config:
+ ```bash
+ sudo cp deploy/dacon_app.nginx.conf /etc/nginx/sites-available/dacon_app
+ sudo ln -s /etc/nginx/sites-available/dacon_app /etc/nginx/sites-enabled/
+ sudo nginx -t
+ sudo systemctl reload nginx
+ ```
+ 3. Update `server_name` and any path aliases before reloading.
+ 4. (Optional) Enable HTTPS via Certbot:
+ ```bash
+ sudo apt install certbot python3-certbot-nginx
+ sudo certbot --nginx -d your-domain.com
+ ```
+
+ ### Firewall and health checks
+
+ - Open ports 80/443 via `ufw` or your cloud provider’s security group.
+ - Use the `/healthz` endpoint for health monitoring.
+ - Logs:
+   - Application: `journalctl -u dacon_app`
+   - Nginx: `/var/log/nginx/access.log`, `/var/log/nginx/error.log`
+
+ ## File overview
+
+ - `app.py` – Flask application with prediction and SEO endpoints.
+ - `wsgi.py` – WSGI entrypoint for production servers.
+ - `deploy/dacon_app.service` – sample systemd unit for Gunicorn.
+ - `deploy/dacon_app.nginx.conf` – sample Nginx reverse proxy configuration.
+ - `train_and_save_models.py` – pipeline that creates required artifacts.
+ - `data_csv/` – CSV inputs used by the app.
+
+ ## Troubleshooting
+
+ - If Gunicorn crashes, check for missing artifacts under `artifacts/`.
+ - Ensure the `.env` file or environment variables include `GEMINI_API_KEY`.
+ - Increase `client_max_body_size` in Nginx if large payloads are expected.
+ - For Windows hosting, consider running Gunicorn/Nginx via WSL2 or using IIS + FastCGI with `wsgi.py`.
+
+ ## Hugging Face Spaces deployment (Docker Space)
+
+ Hugging Face Spaces support custom web apps through Docker. Use the provided `Dockerfile` to containerize the app and expose it via Gunicorn.
+
+ 1. **Prepare the repository**
+    - Ensure all required artifacts (`*.pkl`) and the `data_csv/` folder are committed (Spaces pull the repo directly).
+    - Keep individual files under 1 GB (Spaces limit); use Git LFS for large artifacts if needed.
+
+ 2. **Create a new Space**
+    - On Hugging Face, click **Create Space** → type `Docker` → name it (e.g., `username/dacon-predictor`).
+    - Leave hardware as default unless more RAM is required (~16 GB recommended because of NLP dependencies).
+
+ 3. **Push the code**
+    - Initialize the Space as a Git repo locally:
+ ```bash
+ huggingface-cli repo create username/dacon-predictor --type=space --space-sdk=docker
+ git remote add space https://huggingface.co/spaces/username/dacon-predictor
+ git push space main
+ ```
+    - Alternatively, clone the empty Space repo and copy the project files into it before pushing.
+
+ 4. **Secrets & configuration**
+    - In the Space settings, add a secret named `GEMINI_API_KEY` with your Google Generative AI key.
+    - Optional: set `GUNICORN_WORKERS` to tune concurrency.
+
+ 5. **Container build**
+    - Spaces will build the `Dockerfile`. It installs system deps (OpenJDK, MeCab) and Python requirements, then launches Gunicorn binding to `$PORT` (HF uses port 7860 by default).
+    - The app serves `index.html` via Flask, so no additional frontend wiring is required.
+
+ 6. **Testing & monitoring**
+    - Once the build finishes, open the Space URL to verify predictions and SEO generation.
+    - Check the Space logs (Settings → Logs) for build/runtime issues, especially MeCab/Java errors.
+
+ ### Space-specific tips
+
+ - **Cold start latency**: Spaces sleep when idle; first request may take longer as the model artifacts load.
+ - **Resource usage**: If memory spikes occur (pandas + scikit-learn + MeCab), upgrade to a larger hardware tier.
+ - **Background tasks**: This setup serves HTTP requests only; long-running offline jobs should be run outside Spaces.
+ - **Security**: Secrets set in HF UI aren’t exposed in the repo. Avoid committing `.env` with real keys.
+ - **Custom domains**: Hugging Face supports domain mapping on paid tiers if you need branding.
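For the health-check and smoke-testing steps in the README above, a short script can exercise the running server. This is only an illustrative sketch: the base URL is a placeholder and it assumes the `requests` package is available in your environment.

```python
# smoke_test.py -- hypothetical helper, not part of this commit.
import requests

BASE_URL = "http://127.0.0.1:8000"  # placeholder: local Gunicorn, Nginx host, or Space URL

# /healthz is the monitoring endpoint referenced under "Firewall and health checks".
health = requests.get(f"{BASE_URL}/healthz", timeout=10)
health.raise_for_status()
print(health.json())  # expected: {"status": "ok"}

# /categories returns the options the frontend dropdown is populated from.
print(requests.get(f"{BASE_URL}/categories", timeout=10).json())
```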
age_prediction_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b88f49ca67c9b539dd935b49ae451a4a1524765ebfec865d7ec594e8ebcb6da6
+ size 4373490
app.py ADDED
@@ -0,0 +1,432 @@
1
+ """Flask backend for the "신문과방송" article performance prediction web app.
2
+
3
+ This server exposes prediction and metadata endpoints that rely on the
4
+ pre-trained artifacts produced during the offline training pipeline.
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import json
9
+ import logging
10
+ import os
11
+ import re
12
+ from pathlib import Path
13
+ from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union, cast
14
+
15
+ from flask import Flask, jsonify, request, send_from_directory # type: ignore[import]
16
+ from konlpy.tag import Okt
17
+
18
+ import joblib # type: ignore[import]
19
+ import numpy as np
20
+ from scipy.sparse import csr_matrix, hstack
21
+ from sklearn.metrics.pairwise import cosine_similarity # type: ignore[import]
22
+
23
+ from dotenv import load_dotenv
24
+ import google.generativeai as genai
25
+
26
+ # Optional dependency: pandas is only required for category input handling.
27
+ try:
28
+ import pandas as pd
29
+ except ImportError: # pragma: no cover - pandas should be available, but we guard just in case.
30
+ pd = None # type: ignore
31
+
32
+
33
+ if pd is None:
34
+ raise RuntimeError(
35
+ "pandas is required for this application. Please install pandas in the runtime environment."
36
+ )
37
+
38
+
39
+ load_dotenv()
40
+
41
+ API_KEY = os.getenv("GEMINI_API_KEY")
42
+ if not API_KEY:
43
+ raise RuntimeError("GEMINI_API_KEY is not set. Please define it in your .env file.")
44
+
45
+ genai.configure(api_key=API_KEY) # type: ignore[attr-defined]
46
+ SEO_MODEL_NAME = "gemma-3-27b-it"
47
+ SEO_GENERATIVE_MODEL = genai.GenerativeModel(SEO_MODEL_NAME) # type: ignore[attr-defined]
48
+
49
+ BASE_DIR = Path(__file__).resolve().parent
50
+ ARTIFACT_DIR = BASE_DIR / "artifacts"
51
+ DATA_DIR = BASE_DIR / "data_csv"
52
+ CONTENTS_CSV = DATA_DIR / "contents.csv"
53
+
54
+ logger = logging.getLogger(__name__)
55
+ logging.basicConfig(level=logging.INFO)
56
+
57
+ # Initialize the Okt tokenizer once as a module-level singleton.
58
+ OKT = Okt()
59
+
60
+ def okt_tokenizer(text: str) -> list[str]:
61
+ """
62
+ Tokenize text using Okt to extract nouns and verbs.
63
+ This function must be defined in the same way as in the training script
64
+ for the TfidfVectorizer to be loaded correctly.
65
+ """
66
+ if not isinstance(text, str) or not text.strip():
67
+ return []
68
+     # `stem=True` restores each token to its base form (e.g., '달렸다' -> '달리다').
+     # Keep this identical to the training script.
70
+ return [
71
+ word
72
+ for word, tag in OKT.pos(text, stem=True)
73
+ if tag in ["Noun", "Verb"]
74
+ ]
75
+
76
+ def _resolve_artifact_path(filename: str) -> Path:
77
+ """Return the most likely path for a persisted artifact.
78
+
79
+ The training pipeline may save artifacts either in the project root or in an
80
+ `artifacts/` sub-directory. We attempt both locations for convenience and to
81
+ provide a clear error if the file cannot be found.
82
+ """
83
+ direct_path = BASE_DIR / filename
84
+ if direct_path.exists():
85
+ return direct_path
86
+
87
+ artifacts_path = ARTIFACT_DIR / filename
88
+ if artifacts_path.exists():
89
+ return artifacts_path
90
+
91
+ search_locations = [str(direct_path), str(artifacts_path)]
92
+ raise FileNotFoundError(
93
+ f"Artifact '{filename}' could not be located. Looked in: {search_locations}"
94
+ )
95
+
96
+
97
+ def _load_artifact(filename: str) -> Any:
98
+ """Load a pickled artifact using joblib with helpful error messaging."""
99
+ path = _resolve_artifact_path(filename)
100
+ return joblib.load(path)
101
+
102
+
103
+ app = Flask(__name__, static_folder=".", template_folder=".")
104
+
105
+ # --- Artifact Loading -------------------------------------------------------------------------
106
+ try:
107
+ tfidf_vectorizer = _load_artifact("tfidf_vectorizer.pkl")
108
+ onehot_encoder = _load_artifact("onehot_encoder.pkl")
109
+ label_encoder = _load_artifact("label_encoder.pkl")
110
+ view_prediction_model = _load_artifact("view_prediction_model.pkl")
111
+ age_prediction_model = _load_artifact("age_prediction_model.pkl")
112
+ text_features_matrix = _load_artifact("text_features_matrix.pkl")
113
+ article_mapping = _load_artifact("article_mapping.pkl")
114
+ except FileNotFoundError as exc: # pragma: no cover - occurs only if artifacts missing.
115
+ # Fail fast during startup so the issue can be resolved immediately.
116
+ raise RuntimeError(
117
+ "Required model artifacts are missing. Ensure Phase 1-2 outputs are saved before "
118
+ "starting the server."
119
+ ) from exc
120
+
121
+
122
+ if not isinstance(text_features_matrix, csr_matrix):
123
+ # Convert any compatible sparse matrix to CSR format for efficient row slicing.
124
+ text_features_matrix = csr_matrix(text_features_matrix)
125
+
126
+ try:
127
+ contents_dataframe = pd.read_csv(CONTENTS_CSV)
128
+ except FileNotFoundError as exc:
129
+ raise RuntimeError(
130
+ f"Required contents dataset not found at {CONTENTS_CSV}."
131
+ ) from exc
132
+
133
+ article_content_lookup: Dict[Any, str] = {}
134
+ if "article_id" in contents_dataframe.columns and "content" in contents_dataframe.columns:
135
+ article_content_lookup = {
136
+ str(row.article_id): (row.content if isinstance(row.content, str) else "")
137
+ for row in contents_dataframe.itertuples()
138
+ }
139
+ else: # pragma: no cover - dataset schema mismatch
140
+ raise RuntimeError(
141
+ "contents.csv must contain 'article_id' and 'content' columns."
142
+ )
143
+
144
+
145
+ _encoded_categories: List[str]
146
+ try:
147
+ categories_arr = getattr(onehot_encoder, "categories_", None)
148
+ if categories_arr:
149
+ _encoded_categories = sorted(str(cat) for cat in categories_arr[0])
150
+ else:
151
+ _encoded_categories = []
152
+ except AttributeError: # pragma: no cover
153
+ _encoded_categories = []
154
+
155
+
156
+ def _ensure_dataframe(category: str) -> Any:
157
+ """Create a minimal DataFrame for category encoding.
158
+
159
+ Falls back to a simple dictionary-based DataFrame when pandas is available,
160
+ otherwise raises a clear ImportError with remediation guidance.
161
+ """
162
+ if pd is None:
163
+ raise ImportError(
164
+ "pandas is required to prepare categorical inputs. Please install pandas or "
165
+ "ensure the training environment's dependencies are mirrored here."
166
+ )
167
+ return pd.DataFrame({"category": [category]})
168
+
169
+
170
+ def _lookup_article_metadata(index: int) -> Dict[str, Any]:
171
+ """Return metadata for a given row index in the text feature matrix.
172
+
173
+ During Phase 2 the pipeline should persist either a pandas DataFrame or a
174
+ dictionary-like mapping that provides `article_id` and `title` fields. This
175
+ helper normalises the structure so the API can rely on consistent keys.
176
+ """
177
+ if isinstance(article_mapping, dict):
178
+ entry = article_mapping.get(index)
179
+ if entry is None:
180
+ entry = article_mapping.get(str(index))
181
+ if isinstance(entry, dict):
182
+ return {
183
+ "id": entry.get("article_id") or entry.get("id"),
184
+ "title": entry.get("title") or entry.get("article_title"),
185
+ }
186
+ if isinstance(entry, (list, tuple)) and entry:
187
+ entry_seq = cast(Sequence[Any], entry)
188
+ article_id = entry_seq[0]
189
+ article_title = entry_seq[1] if len(entry_seq) > 1 else entry_seq[0]
190
+ return {"id": article_id, "title": article_title}
191
+
192
+ mapping_obj = cast(Any, article_mapping)
193
+ if hasattr(mapping_obj, "iloc"):
194
+ # Supports pandas DataFrame or Series-like objects with iloc.
195
+ try:
196
+ row = mapping_obj.iloc[index]
197
+ except Exception: # pragma: no cover - defensive guard.
198
+ row = None
199
+ if row is not None:
200
+ if hasattr(row, "to_dict"):
201
+ row_dict = row.to_dict()
202
+ return {
203
+ "id": row_dict.get("article_id") or row_dict.get("id"),
204
+ "title": row_dict.get("title") or row_dict.get("article_title"),
205
+ }
206
+
207
+ # Fallback: surface the index so downstream consumers can still show something.
208
+ return {"id": int(index), "title": f"Article #{index}"}
209
+
210
+
211
+ def _find_similar_articles(query_vector: csr_matrix, top_k: int = 5) -> List[Dict[str, Any]]:
212
+ """Return the top-k most similar articles for the provided query vector."""
213
+ similarities = cosine_similarity(query_vector, text_features_matrix).ravel()
214
+
215
+ # Sort indices by descending similarity.
216
+ ranked_indices = np.argsort(similarities)[::-1]
217
+
218
+ similar_articles: List[Dict[str, Any]] = []
219
+ for idx in ranked_indices:
220
+ score = float(similarities[idx])
221
+
222
+ # Skip near-identical matches (useful if querying an existing article).
223
+ if score >= 0.9999 and not similar_articles:
224
+ continue
225
+
226
+ metadata = _lookup_article_metadata(int(idx))
227
+ metadata.update({"similarity": round(score, 4)})
228
+ similar_articles.append(metadata)
229
+ if len(similar_articles) >= top_k:
230
+ break
231
+
232
+ return similar_articles
233
+
234
+
235
+ def _extract_first_sentence(text: str) -> str:
236
+ """Return the first sentence-like fragment from the provided text."""
237
+ if not text:
238
+ return ""
239
+ cleaned = re.sub(r"\s+", " ", text).strip()
240
+ if not cleaned:
241
+ return ""
242
+ sentence_endings = re.split(r"(?<=[.!??!。])\s+", cleaned)
243
+ for fragment in sentence_endings:
244
+ fragment = fragment.strip()
245
+ if fragment:
246
+ return fragment
247
+ return cleaned[:80]
248
+
249
+
250
+ def generate_seo_suggestions(content: str) -> Dict[str, str]:
251
+ """Generate SEO title and description using Google Gemini."""
252
+ safe_content = content or ""
253
+ safe_content = re.sub(r"\s+", " ", safe_content).strip()
254
+
255
+ prompt = (
256
+ "You are a lead digital editor for a korean prestigious online media company that bridges in-depth analysis with current trends. "
257
+ "Your mission is to craft an SEO title and description that are both intelligent and highly shareable. The goal is to highlight the article's most timely, newsworthy, and debate-sparking elements to maximize public interest and social engagement.\n\n"
258
+ "Guidelines:\n"
259
+ "1. **'title' (under 60 characters):** Frame the core topic as a compelling thesis or a provocative question. Connect it to a current conversation or a surprising trend to make it feel urgent and relevant *today*. It should make people think, 'This is an interesting take.'\n"
260
+ "2. **'description' (under 150 characters, in Korean):** Go beyond summary. Contextualize the article's importance. Explain *why* this topic matters *now* and what new perspective the article offers on a familiar issue. It should persuade readers that this article will give them a crucial viewpoint for today's conversations.\n"
261
+ "3. **Format:** Respond strictly with a valid JSON object with 'title' and 'description' keys. Avoid generic phrases, clickbait, and anything that undermines the intellectual integrity of the brand.\n\n"
262
+ f"Article Content:\n{safe_content}\n\n"
263
+ "Return exactly: {\"title\": \"<생성된 제목>\", \"description\": \"<생성된 설명>\"}"
264
+ )
265
+ try:
266
+ response = SEO_GENERATIVE_MODEL.generate_content(prompt)
267
+ raw_text = getattr(response, "text", "") or ""
268
+
269
+ if not raw_text and getattr(response, "candidates", None):
270
+ collected_parts: List[str] = []
271
+ for candidate in response.candidates: # type: ignore[attr-defined]
272
+ candidate_content: Any = getattr(candidate, "content", None)
273
+ parts = getattr(candidate_content, "parts", None) if candidate_content else None
274
+ if parts:
275
+ for part in parts:
276
+ text_part = getattr(part, "text", None)
277
+ if text_part:
278
+ collected_parts.append(str(text_part))
279
+ raw_text = " ".join(collected_parts)
280
+
281
+ cleaned_text = raw_text.strip()
282
+ if not cleaned_text:
283
+ raise ValueError("SEO model returned an empty response")
284
+
285
+ if cleaned_text.startswith("```"):
286
+ cleaned_text = re.sub(r"^```(?:json)?", "", cleaned_text, flags=re.IGNORECASE).strip()
287
+ cleaned_text = re.sub(r"```$", "", cleaned_text).strip()
288
+
289
+ match = re.search(r"{.*}", cleaned_text, re.DOTALL)
290
+ json_payload = match.group(0) if match else cleaned_text
291
+
292
+ seo_json = json.loads(json_payload)
293
+ suggested_title = str(seo_json.get("title", "")).strip()
294
+ suggested_description = str(seo_json.get("description", "")).strip()
295
+
296
+ if not suggested_title or not suggested_description:
297
+ raise ValueError("SEO model response missing required fields")
298
+
299
+ return {
300
+ "suggested_title": suggested_title[:60],
301
+ "suggested_description": suggested_description[:150],
302
+ }
303
+ except Exception as exc: # pragma: no cover - external API failures
304
+ logger.error("SEO model (%s) generation failed: %s", SEO_MODEL_NAME, exc)
305
+ fallback_title = _extract_first_sentence(safe_content) or safe_content[:60]
306
+ fallback_description = safe_content[:150]
307
+ return {
308
+ "suggested_title": fallback_title,
309
+ "suggested_description": fallback_description,
310
+ }
311
+
312
+
313
+ @app.route("/", methods=["GET"])
314
+ def serve_index() -> Any:
315
+ """Serve the single-page frontend."""
316
+ template_dir = app.template_folder or "."
317
+ return send_from_directory(template_dir, "index.html")
318
+
319
+
320
+ @app.route("/healthz", methods=["GET"])
321
+ def healthcheck() -> Any:
322
+ """Simple health check endpoint."""
323
+ return jsonify({"status": "ok"})
324
+
325
+
326
+ @app.route("/categories", methods=["GET"])
327
+ def list_categories() -> Any:
328
+ """Expose category options inferred from the fitted OneHotEncoder."""
329
+ return jsonify({"categories": _encoded_categories})
330
+
331
+
332
+ @app.route("/predict", methods=["POST"])
333
+ def predict() -> Any:
334
+ payload = request.get_json(silent=True) or {}
335
+
336
+ required_fields = {"title", "content", "category"}
337
+ missing = [field for field in required_fields if not payload.get(field)]
338
+ if missing:
339
+ return (
340
+ jsonify({"error": f"Missing required fields: {', '.join(missing)}"}),
341
+ 400,
342
+ )
343
+
344
+ title: str = str(payload.get("title", "")).strip()
345
+ content: str = str(payload.get("content", "")).strip()
346
+ category: str = str(payload.get("category", "")).strip()
347
+
348
+ combined_text = f"{title} {content}".strip()
349
+ if not combined_text:
350
+ return jsonify({"error": "Title and content cannot both be empty."}), 400
351
+
352
+ text_vector = tfidf_vectorizer.transform([combined_text])
353
+
354
+ try:
355
+ category_frame = _ensure_dataframe(category)
356
+ category_vector = onehot_encoder.transform(category_frame[["category"]])
357
+ except Exception as exc:
358
+ return jsonify({"error": f"Failed to encode category: {exc}"}), 400
359
+
360
+ feature_vector = hstack([text_vector, category_vector])
361
+
362
+ view_prediction = view_prediction_model.predict(feature_vector)[0]
363
+ predicted_views = int(round(float(view_prediction)))
364
+
365
+ age_prediction = age_prediction_model.predict(feature_vector)[0]
366
+ predicted_age_index = int(age_prediction)
367
+
368
+ similar_articles_raw = _find_similar_articles(text_vector, top_k=5)
369
+
370
+ similar_articles: List[Dict[str, Any]] = []
371
+ for article in similar_articles_raw:
372
+ article_id = article.get("id")
373
+ article_title = article.get("title")
374
+ lookup_key = str(article_id) if article_id is not None else ""
375
+ content_text = article_content_lookup.get(lookup_key, "")
376
+ summary = content_text.strip()[:100]
377
+ similar_articles.append(
378
+ {
379
+ "id": article_id,
380
+ "title": article_title,
381
+ "summary": summary,
382
+ }
383
+ )
384
+
385
+ try:
386
+ decoded_age_group = label_encoder.inverse_transform([predicted_age_index])[0]
387
+ except Exception:
388
+ decoded_age_group = str(predicted_age_index)
389
+
390
+ seo_recommendation = generate_seo_suggestions(content)
391
+ seo_simulation = {
392
+ "current_state": {
393
+ "issue": "메타 정보에 핵심 키워드가 부족하고 설명이 너무 길어 SERP에서 잘립니다.",
394
+ "title": title[:70] or "제목이 입력되지 않았습니다.",
395
+ "description": (content[:150] + ("..." if len(content) > 150 else "")) if content else "본문이 입력되지 않았습니다.",
396
+ },
397
+ "recommended_state": {
398
+ "title": seo_recommendation["suggested_title"],
399
+ "description": seo_recommendation["suggested_description"],
400
+ },
401
+ }
402
+
403
+ response_payload = {
404
+ "ai_prediction": {
405
+ "predicted_views": predicted_views,
406
+ "predicted_age_group": decoded_age_group,
407
+ "similar_articles": similar_articles,
408
+ },
409
+ "seo_simulation": seo_simulation,
410
+ }
411
+
412
+ return jsonify(response_payload)
413
+
414
+
415
+ @app.route("/generate-description", methods=["POST"])
416
+ def generate_description() -> Any:
417
+ payload = request.get_json(silent=True) or {}
418
+ content_text = str(payload.get("content", ""))
419
+ if not content_text.strip():
420
+ return jsonify({"error": "Content is required to generate a description."}), 400
421
+
422
+ suggestions = generate_seo_suggestions(content_text)
423
+ return jsonify(
424
+ {
425
+ "title": suggestions.get("suggested_title", ""),
426
+ "description": suggestions.get("suggested_description", ""),
427
+ }
428
+ )
429
+
430
+
431
+ if __name__ == "__main__": # pragma: no cover - manual execution only.
432
+ app.run(host="0.0.0.0", port=5000, debug=False)
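For reference, a hedged sketch of how a client could call the two POST endpoints defined above. Field names follow the handlers in `app.py`; the base URL and the sample values are placeholders, and the category should ideally come from `GET /categories`.

```python
# Illustrative API client; not part of this commit. Assumes the `requests` package.
import requests

BASE_URL = "http://localhost:7860"  # placeholder deployment URL

# /predict requires non-empty "title", "content", and "category".
payload = {
    "title": "AI 기술, 언론 제작 방식 대전환",   # sample value
    "content": "기사 본문 예시 텍스트입니다.",   # sample value
    "category": "미디어",                         # pick one returned by GET /categories
}
pred = requests.post(f"{BASE_URL}/predict", json=payload, timeout=120).json()
# Response keys mirror response_payload in predict():
#   pred["ai_prediction"]["predicted_views"]      -> int
#   pred["ai_prediction"]["predicted_age_group"]  -> decoded age-group label
#   pred["ai_prediction"]["similar_articles"]     -> list of {"id", "title", "summary"}
#   pred["seo_simulation"]["recommended_state"]   -> {"title": ..., "description": ...}

# /generate-description only needs "content"; it returns {"title", "description"}.
seo = requests.post(
    f"{BASE_URL}/generate-description",
    json={"content": payload["content"]},
    timeout=120,
).json()
print(seo["title"])
print(seo["description"])
```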
article_mapping.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5edb3af7db4f3664dd57d9a0f38460154d421b27c2e8f034533f04c90008330c
+ size 212049
convert_to_csv.py ADDED
@@ -0,0 +1,83 @@
+ """
+ Script that converts the raw Excel data files to CSV format.
+ """
4
+ import pandas as pd
5
+ import os
6
+ from datetime import datetime
7
+
8
+ # 디렉토리 설정
9
+ data_dir = r'c:\Users\korea\Desktop\dacon_broadcast_paper\data'
10
+ output_dir = r'c:\Users\korea\Desktop\dacon_broadcast_paper\data_csv'
11
+
12
+ # 출력 디렉토리 생성
13
+ if not os.path.exists(output_dir):
14
+ os.makedirs(output_dir)
15
+ print(f"CSV 출력 디렉토리 생성: {output_dir}")
16
+
17
+ # 변환할 파일 목록
18
+ files = [
19
+ 'article_metrics_monthly.xlsx',
20
+ 'contents.xlsx',
21
+ 'demographics_part001.xlsx',
22
+ 'demographics_part002.xlsx',
23
+ 'referrer.xlsx'
24
+ ]
25
+
26
+ print("=" * 80)
27
+ print("Excel → CSV 변환 시작")
28
+ print("=" * 80)
29
+
30
+ for file in files:
31
+ start_time = datetime.now()
32
+ file_path = os.path.join(data_dir, file)
33
+ csv_filename = file.replace('.xlsx', '.csv')
34
+ csv_path = os.path.join(output_dir, csv_filename)
35
+
36
+ print(f"\n[처리 중] {file}")
37
+
38
+ try:
39
+ # Excel 파일 읽기
40
+ df = pd.read_excel(file_path)
41
+
42
+ # CSV로 저장 (UTF-8 with BOM for Excel compatibility)
43
+ df.to_csv(csv_path, index=False, encoding='utf-8-sig')
44
+
45
+ # 처리 시간 계산
46
+ elapsed = (datetime.now() - start_time).total_seconds()
47
+
48
+ # 결과 출력
49
+ print(f" ✓ 완료: {csv_filename}")
50
+ print(f" - 행 개수: {len(df):,}")
51
+ print(f" - 열 개수: {len(df.columns)}")
52
+ print(f" - 처리 시간: {elapsed:.2f}초")
53
+ print(f" - 저장 경로: {csv_path}")
54
+
55
+ except Exception as e:
56
+ print(f" ✗ 오류 발생: {str(e)}")
57
+
58
+ # demographics 파일 병합 (선택사항)
59
+ print("\n" + "=" * 80)
60
+ print("[추가 작업] demographics 파일 병합")
61
+ print("=" * 80)
62
+
63
+ try:
64
+ demo_part1 = pd.read_csv(os.path.join(output_dir, 'demographics_part001.csv'))
65
+ demo_part2 = pd.read_csv(os.path.join(output_dir, 'demographics_part002.csv'))
66
+
67
+ # 두 파일 병합
68
+ demographics_merged = pd.concat([demo_part1, demo_part2], ignore_index=True)
69
+
70
+ # 병합된 파일 저장
71
+ merged_path = os.path.join(output_dir, 'demographics_merged.csv')
72
+ demographics_merged.to_csv(merged_path, index=False, encoding='utf-8-sig')
73
+
74
+ print(f"✓ demographics 병합 완료")
75
+ print(f" - 총 행 개수: {len(demographics_merged):,}")
76
+ print(f" - 저장 경로: {merged_path}")
77
+
78
+ except Exception as e:
79
+ print(f"✗ 병합 중 오류 발생: {str(e)}")
80
+
81
+ print("\n" + "=" * 80)
82
+ print("모든 변환 작업 완료!")
83
+ print("=" * 80)
data/article_metrics_monthly.xlsx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc5d86bdddcb0c22eb82b52d167739a2f548d9d47aad6809154bfda1776962a0
+ size 857389
data/contents.xlsx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:023d9619f1232a138572f19c4411a1e1a226435544710f8b9a68ef2e8f708238
+ size 8973538
data/demographics_part001.xlsx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ebd47a1657bddb8b43c6f8015f39ee546bf2bd4adceb3287443bee7a54c99bf1
+ size 22162808
data/demographics_part002.xlsx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d005ab673df2c4e5a5fbea078609276ddf092e520e207aeb5636c3a8857a1ae2
+ size 1973477
data/referrer.xlsx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f349b7c362f1a5d9b598e2d7bde46b4b0417866e84423f9dcf9d736b3b1e0c32
+ size 9025763
data_csv/article_metrics_monthly.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7f526da60ce292fc4fde55d66c5b2314363b6cdb050390902f6ac21d02c288c5
+ size 1370521
data_csv/contents.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00278be4f929cc7f2d4e1afc9f8ee0222f16c5edee8eec908361dccd54c5019d
+ size 21806940
data_csv/demographics_merged.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:59d65448bbc7953f3d06f6b0d2a081b442fcc6b6f5955c3f72fbbad85a8d41aa
+ size 41352606
data_csv/demographics_part001.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c94a682cee7a5d81ce4be323a385331180da40552595d6cedfb8008f6f81026d
+ size 37956547
data_csv/demographics_part002.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aeefbdf084d02890ee6f948b916d0c158f93d2c0c57c43c56528435343f654f1
+ size 3396110
data_csv/referrer.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:751a2be72ae6f95cd344b8b63ebcf8a521fff3a14352d05b9c3e8e643c946c6c
+ size 43647022
data_structure_analysis.py ADDED
@@ -0,0 +1,60 @@
+ """
+ Script that analyzes the structure of the "신문과방송" reader datasets.
+ """
4
+ import pandas as pd
5
+ import os
6
+
7
+ # 데이터 폴더 경로
8
+ data_dir = r'c:\Users\korea\Desktop\dacon_broadcast_paper\data'
9
+
10
+ # 각 파일 분석
11
+ files = [
12
+ 'article_metrics_monthly.xlsx',
13
+ 'contents.xlsx',
14
+ 'demographics_part001.xlsx',
15
+ 'demographics_part002.xlsx',
16
+ 'referrer.xlsx'
17
+ ]
18
+
19
+ print("=" * 80)
20
+ print("신문과방송 독자 데이터 구조 분석")
21
+ print("=" * 80)
22
+
23
+ for file in files:
24
+ file_path = os.path.join(data_dir, file)
25
+ print(f"\n{'='*80}")
26
+ print(f"파일명: {file}")
27
+ print(f"{'='*80}")
28
+
29
+ try:
30
+ # Excel 파일 읽기
31
+ df = pd.read_excel(file_path)
32
+
33
+ # 기본 정보
34
+ print(f"\n[기본 정보]")
35
+ print(f"행 개수: {len(df):,}")
36
+ print(f"열 개수: {len(df.columns)}")
37
+ print(f"전체 크기: {df.shape}")
38
+
39
+ # 컬럼 정보
40
+ print(f"\n[컬럼 목록 및 데이터 타입]")
41
+ for idx, (col, dtype) in enumerate(zip(df.columns, df.dtypes), 1):
42
+ non_null = df[col].notna().sum()
43
+ null_count = df[col].isna().sum()
44
+ null_pct = (null_count / len(df)) * 100
45
+ print(f"{idx:2d}. {col:40s} | Type: {str(dtype):15s} | Non-Null: {non_null:,} | Null: {null_count:,} ({null_pct:.1f}%)")
46
+
47
+ # 샘플 데이터 (처음 3행)
48
+ print(f"\n[샘플 데이터 (처음 3행)]")
49
+ print(df.head(3).to_string())
50
+
51
+ # 메모리 사용량
52
+ memory_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
53
+ print(f"\n[메모리 사용량]: {memory_mb:.2f} MB")
54
+
55
+ except Exception as e:
56
+ print(f"오류 발생: {str(e)}")
57
+
58
+ print("\n" + "=" * 80)
59
+ print("분석 완료!")
60
+ print("=" * 80)
index.html ADDED
@@ -0,0 +1,581 @@
1
+ <!DOCTYPE html>
2
+ <html lang="ko">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
6
+ <title>기사 성과 예측</title>
7
+ <style>
8
+ *, *::before, *::after {box-sizing: border-box;}
9
+ :root {
10
+ /* Color Palette */
11
+ --font-sans: "Pretendard", "Segoe UI", system-ui, -apple-system, sans-serif;
12
+
13
+ /* Light Mode */
14
+ --bg-light: #f4f6f8; /* 약간 더 부드러운 회색 배경 */
15
+ --card-light: #ffffff;
16
+ --text-primary-light: #111827; /* 진한 차콜 */
17
+ --text-secondary-light: #6b7280; /* 중간 회색 */
18
+ --border-light: #e5e7eb; /* 옅은 회색 테두리 */
19
+ --accent-primary: #3b82f6; /* 차분한 블루 */
20
+ --accent-secondary: #1d4ed8;
21
+ --accent-glow: rgba(59, 130, 246, 0.2);
22
+
23
+ /* Dark Mode */
24
+ --bg-dark: #111827; /* 딥 네이비/차콜 */
25
+ --card-dark: #1f2937; /* 약간 밝은 네이비/차콜 */
26
+ --text-primary-dark: #f9fafb; /* 거의 흰색 */
27
+ --text-secondary-dark: #9ca3af; /* 밝은 회색 */
28
+ --border-dark: #374151;
29
+ }
30
+
31
+ /* Basic Setup */
32
+ body {
33
+ font-family: var(--font-sans);
34
+ margin: 0;
35
+ padding: 0;
36
+ background-color: var(--bg-light);
37
+ color: var(--text-primary-light);
38
+ display: flex;
39
+ justify-content: center;
40
+ min-height: 100vh;
41
+ -webkit-font-smoothing: antialiased;
42
+ -moz-osx-font-smoothing: grayscale;
43
+ transition: background-color 0.3s ease, color 0.3s ease;
44
+ }
45
+
46
+ @media (prefers-color-scheme: dark) {
47
+ body {
48
+ background-color: var(--bg-dark);
49
+ color: var(--text-primary-dark);
50
+ }
51
+ }
52
+
53
+ .container {
54
+ width: min(960px, 100% - 40px);
55
+ padding: 48px 20px 64px;
56
+ }
57
+
58
+ /* Typography */
59
+ h1 {
60
+ font-size: 2.25rem;
61
+ font-weight: 800;
62
+ letter-spacing: -0.04em;
63
+ color: var(--text-primary-light);
64
+ }
65
+
66
+ p.subtitle {
67
+ font-size: 1.125rem;
68
+ margin: 8px 0 40px;
69
+ color: var(--text-secondary-light);
70
+ }
71
+
72
+ h2 {
73
+ font-size: 1.5rem;
74
+ font-weight: 700;
75
+ border-bottom: 1px solid var(--border-light);
76
+ padding-bottom: 12px;
77
+ margin-bottom: 24px;
78
+ }
79
+
80
+ h3 {
81
+ font-size: 1.125rem;
82
+ font-weight: 600;
83
+ color: var(--text-primary-light);
84
+ }
85
+
86
+ @media (prefers-color-scheme: dark) {
87
+ h1, h3 { color: var(--text-primary-dark); }
88
+ p.subtitle { color: var(--text-secondary-dark); }
89
+ h2 { border-color: var(--border-dark); }
90
+ }
91
+
92
+ /* Main Card */
93
+ .card {
94
+ background-color: var(--card-light);
95
+ border-radius: 24px;
96
+ padding: 40px;
97
+ box-shadow: 0 8px 16px rgba(0, 0, 0, 0.02), 0 20px 40px rgba(23, 33, 73, 0.08);
98
+ transition: background-color 0.3s ease, box-shadow 0.3s ease;
99
+ }
100
+
101
+ @media (prefers-color-scheme: dark) {
102
+ .card {
103
+ background-color: var(--card-dark);
104
+ box-shadow: 0 20px 40px rgba(0, 0, 0, 0.25);
105
+ }
106
+ }
107
+
108
+ /* Form Elements */
109
+ form {
110
+ display: grid;
111
+ gap: 24px;
112
+ }
113
+
114
+ label {
115
+ display: block;
116
+ font-weight: 600;
117
+ font-size: 0.9rem;
118
+ margin-bottom: 8px;
119
+ color: var(--text-primary-light);
120
+ }
121
+
122
+ @media (prefers-color-scheme: dark) {
123
+ label { color: var(--text-primary-dark); }
124
+ }
125
+
126
+ input[type="text"],
127
+ select,
128
+ textarea {
129
+ width: 100%;
130
+ padding: 12px 16px;
131
+ border-radius: 10px;
132
+ border: 1px solid var(--border-light);
133
+ font-size: 1rem;
134
+ background-color: var(--bg-light);
135
+ color: var(--text-primary-light);
136
+ transition: border-color 0.2s ease, box-shadow 0.2s ease;
137
+ resize: vertical;
138
+ }
139
+
140
+ textarea { min-height: 180px; }
141
+
142
+ input:focus,
143
+ textarea:focus,
144
+ select:focus {
145
+ outline: none;
146
+ border-color: var(--accent-primary);
147
+ box-shadow: 0 0 0 3px var(--accent-glow);
148
+ }
149
+
150
+ @media (prefers-color-scheme: dark) {
151
+ input[type="text"], select, textarea {
152
+ background-color: #2a3647;
153
+ border-color: var(--border-dark);
154
+ color: var(--text-primary-dark);
155
+ }
156
+ }
157
+
158
+ button {
159
+ background: linear-gradient(135deg, var(--accent-primary), var(--accent-secondary));
160
+ color: #fff;
161
+ border: none;
162
+ padding: 14px 24px;
163
+ border-radius: 10px;
164
+ font-size: 1rem;
165
+ font-weight: 600;
166
+ cursor: pointer;
167
+ transition: all 0.2s ease;
168
+ box-shadow: 0 4px 12px rgba(59, 130, 246, 0.2);
169
+ }
170
+
171
+ button:hover {
172
+ transform: translateY(-2px);
173
+ box-shadow: 0 8px 20px rgba(59, 130, 246, 0.3);
174
+ }
175
+
176
+ button:disabled {
177
+ opacity: 0.5;
178
+ cursor: not-allowed;
179
+ transform: none;
180
+ box-shadow: none;
181
+ }
182
+
183
+ #summarize-btn {
184
+ margin-top: 8px;
185
+ width: 100%;
186
+ background: transparent;
187
+ color: var(--accent-secondary);
188
+ border: 1px solid var(--accent-secondary);
189
+ box-shadow: none;
190
+ }
191
+
192
+ #summarize-btn:hover {
193
+ background: rgba(59, 130, 246, 0.1);
194
+ }
195
+
196
+ #summarize-btn:disabled {
197
+ opacity: 0.6;
198
+ background: rgba(59, 130, 246, 0.05);
199
+ }
200
+
201
+ /* Results Section */
202
+ #results-container {
203
+ margin-top: 40px;
204
+ padding-top: 40px;
205
+ border-top: 1px solid var(--border-light);
206
+ line-height: 1.65;
207
+ }
208
+ @media (prefers-color-scheme: dark) {
209
+ #results-container { border-color: var(--border-dark); }
210
+ }
211
+ #results-container.hidden { display: none; }
212
+
213
+ .spinner {
214
+ width: 36px;
215
+ height: 36px;
216
+ border: 4px solid var(--accent-glow);
217
+ border-top-color: var(--accent-primary);
218
+ border-radius: 50%;
219
+ animation: spin 1s linear infinite;
220
+ margin: 32px auto;
221
+ }
222
+
223
+ @keyframes spin { to { transform: rotate(360deg); } }
224
+
225
+ .error-message { color: #f43f5e; font-weight: 600; }
226
+ .results-section { margin-bottom: 40px; }
227
+
228
+ .ai-card {
229
+ background-color: #eff6ff;
230
+ border-radius: 16px;
231
+ padding: 24px;
232
+ border: 1px solid #dbeafe;
233
+ }
234
+ .ai-card ul { padding-left: 20px; color: var(--text-secondary-light); }
235
+ .ai-card li strong { color: var(--text-primary-light); }
236
+ .ai-card li .summary {
237
+ font-size: 0.9rem;
238
+ color: var(--text-secondary-light);
239
+ margin: 6px 0 0;
240
+ }
241
+
242
+ @media (prefers-color-scheme: dark) {
243
+ .ai-card {
244
+ background-color: #1e293b;
245
+ border-color: #334155;
246
+ }
247
+ .ai-card ul { color: var(--text-secondary-dark); }
248
+ .ai-card li strong { color: var(--text-primary-dark); }
249
+ .ai-card li .summary { color: var(--text-secondary-dark); }
250
+ }
251
+
252
+ /* SEO Simulation */
253
+ .seo-grid {
254
+ display: grid;
255
+ grid-template-columns: 1fr 1fr;
256
+ gap: 24px;
257
+ }
258
+ @media (max-width: 768px) {
259
+ .seo-grid { grid-template-columns: 1fr; }
260
+ }
261
+
262
+ .seo-column {
263
+ border: 1px solid var(--border-light);
264
+ border-radius: 16px;
265
+ padding: 24px;
266
+ }
267
+ @media (prefers-color-scheme: dark) {
268
+ .seo-column { border-color: var(--border-dark); }
269
+ }
270
+ .seo-column h3 { margin-top: 0; }
271
+
272
+ /* SERP Preview */
273
+ .serp-preview {
274
+ background-color: var(--card-light);
275
+ border-radius: 8px;
276
+ padding: 16px;
277
+ border: 1px solid var(--border-light);
278
+ margin-bottom: 24px;
279
+ font-family: Arial, sans-serif;
280
+ }
281
+ .serp-preview .serp-url { color: #1e8e3e; font-size: 0.875rem; }
282
+ .serp-preview .serp-title { color: #1a0dab; font-size: 1.25rem; font-weight: 400; margin: 4px 0; }
283
+ .serp-preview .serp-description { color: #4d5156; font-size: 0.875rem; line-height: 1.5; }
284
+
285
+ @media (prefers-color-scheme: dark) {
286
+ .serp-preview {
287
+ background-color: #2d3748;
288
+ border-color: var(--border-dark);
289
+ }
290
+ .serp-preview .serp-url { color: #98c379; }
291
+ .serp-preview .serp-title { color: #8ab4f8; }
292
+ .serp-preview .serp-description { color: #bdc1c6; }
293
+ }
294
+
295
+ .seo-input-group { margin-top: 16px; }
296
+
297
+ .counter {
298
+ font-size: 0.8rem;
299
+ color: var(--text-secondary-light);
300
+ text-align: right;
301
+ margin-top: 4px;
302
+ }
303
+
304
+ @media (prefers-color-scheme: dark) {
305
+ .counter { color: var(--text-secondary-dark); }
306
+ }
307
+ </style>
308
+ </head>
309
+ <body>
310
+ <main class="container">
311
+ <section class="card">
312
+ <h1>기사 성과 예측</h1>
313
+ <p class="subtitle">기사 내용을 입력하면 예상 조회수, 핵심 독자층, 유사 기사를 추천해 드립니다.</p>
314
+
315
+ <form id="prediction-form" autocomplete="off">
316
+ <div>
317
+ <label for="title">기사 제목</label>
318
+ <input type="text" id="title" name="title" placeholder="예: AI 기술, 언론 제작 방식 대전환" required />
319
+ </div>
320
+
321
+ <div>
322
+ <label for="category">카테고리</label>
323
+ <select id="category" name="category" required>
324
+ <option value="" disabled selected>카테고리를 선택하세요</option>
325
+ </select>
326
+ </div>
327
+
328
+ <div>
329
+ <label for="content">기사 본문</label>
330
+ <textarea id="content" name="content" placeholder="핵심 내용과 주요 포인트를 입력하세요." required></textarea>
331
+ </div>
332
+
333
+ <button type="submit">분석하기</button>
334
+ </form>
335
+
336
+ <div id="results-container" class="hidden"></div>
337
+ </section>
338
+ </main>
339
+
340
+ <script>
341
+ const form = document.getElementById("prediction-form");
342
+ const resultsContainer = document.getElementById("results-container");
343
+ const categorySelect = document.getElementById("category");
344
+ const articleContentInput = document.getElementById("content");
345
+
346
+ const MAX_TITLE_LENGTH = 60;
347
+ const MAX_DESCRIPTION_LENGTH = 150;
348
+
349
+ const renderLoading = () => {
350
+ resultsContainer.classList.remove("hidden");
351
+ resultsContainer.innerHTML = `
352
+ <div class="spinner"></div>
353
+ <p style="text-align:center;">AI가 분석 중입니다...</p>
354
+ `;
355
+ };
356
+
357
+ const renderError = (message) => {
358
+ resultsContainer.classList.remove("hidden");
359
+ resultsContainer.innerHTML = `<p class="error-message">⚠️ ${message}</p>`;
360
+ };
361
+
362
+ const createSimilarArticlesList = (similarArticles = []) => {
363
+ if (!similarArticles.length) {
364
+ return "<li>유사 기사를 찾지 못했습니다.</li>";
365
+ }
366
+ return similarArticles
367
+ .map((article) => {
368
+ const title = article.title ?? "제목 미상";
369
+ const summary = article.summary ? `${article.summary}...` : "요약을 제공할 수 없습니다.";
370
+ return `
371
+ <li>
372
+ <strong>${title}</strong>
373
+ <p class="summary">${summary}</p>
374
+ </li>
375
+ `;
376
+ })
377
+ .join("");
378
+ };
379
+
380
+ const renderResults = (data) => {
381
+ const { ai_prediction, seo_simulation } = data;
382
+ const { predicted_views, predicted_age_group, similar_articles } = ai_prediction || {};
383
+ const { current_state, recommended_state } = seo_simulation || {};
384
+
385
+ resultsContainer.classList.remove("hidden");
386
+ resultsContainer.innerHTML = `
387
+ <section class="results-section ai-card">
388
+ <h2>AI 예측 결과</h2>
389
+ <h3>예상 조회수: <strong>${(predicted_views ?? 0).toLocaleString()}회</strong></h3>
390
+ <h3>핵심 독자층: <strong>${predicted_age_group ?? "N/A"}</strong></h3>
391
+ <h3>유사 기사 추천</h3>
392
+ <ul>
393
+ ${createSimilarArticlesList(similar_articles)}
394
+ </ul>
395
+ </section>
396
+
397
+ <section class="results-section">
398
+ <h2>SEO 시뮬레이션</h2>
399
+ <div class="seo-grid">
400
+ <div class="seo-column">
401
+ <h3>현재 노출 상태 (Before)</h3>
402
+ <div class="serp-preview">
403
+ <p class="serp-url">press.example.com/article</p>
404
+ <p class="serp-title">${"한국언론진흥재단 전체기사"}</p>
405
+ <p class="serp-description">${"설명이 제공되지 않았습니다."}</p>
406
+ </div>
407
+ <p>${current_state?.issue ?? "현재 문제를 파악하지 못했습니다."}</p>
408
+ </div>
409
+
410
+ <div class="seo-column">
411
+ <h3>추천 개선안 (After)</h3>
412
+ <div class="serp-preview" id="after-preview">
413
+ <p class="serp-url">press.example.com/article</p>
414
+ <p class="serp-title" id="after-title-preview"></p>
415
+ <p class="serp-description" id="after-description-preview"></p>
416
+ </div>
417
+
418
+ <div class="seo-input-group">
419
+ <div>
420
+ <label for="seo-title-input">추천 제목</label>
421
+ <input type="text" id="seo-title-input" maxlength="80" />
422
+ <div class="counter" id="title-counter"></div>
423
+ </div>
424
+
425
+ <div>
426
+ <label for="seo-description-input">추천 설명</label>
427
+ <textarea id="seo-description-input" rows="4" maxlength="200"></textarea>
428
+ <div class="counter" id="description-counter"></div>
429
+ <button type="button" id="summarize-btn">본문 내용으로 설명 다시 생성하기 ↺</button>
430
+ </div>
431
+ </div>
432
+
433
+ <p>추천 제목과 설명을 조정하여 SERP에서의 가독성과 클릭 유도 문구를 최적화할 수 있습니다.</p>
434
+ </div>
435
+ </div>
436
+ </section>
437
+ `;
438
+
439
+ const seoTitleInput = document.getElementById("seo-title-input");
440
+ const seoDescriptionInput = document.getElementById("seo-description-input");
441
+ const afterTitlePreview = document.getElementById("after-title-preview");
442
+ const afterDescriptionPreview = document.getElementById("after-description-preview");
443
+ const titleCounter = document.getElementById("title-counter");
444
+ const descriptionCounter = document.getElementById("description-counter");
445
+ const summarizeBtn = document.getElementById("summarize-btn");
446
+
447
+ const initialTitle = recommended_state?.title ?? "검색 엔진 친화적인 기사 제목";
448
+ const initialDescription = recommended_state?.description ?? "핵심 키워드와 요약 내용을 포함하는 설명을 작성해 보세요.";
449
+
450
+ seoTitleInput.maxLength = MAX_TITLE_LENGTH;
451
+ seoDescriptionInput.maxLength = MAX_DESCRIPTION_LENGTH;
452
+ seoTitleInput.value = initialTitle;
453
+ seoDescriptionInput.value = initialDescription;
454
+
455
+ const updateTitlePreview = () => {
456
+ const value = seoTitleInput.value;
457
+ afterTitlePreview.textContent = value;
458
+ titleCounter.textContent = `${value.length} / ${MAX_TITLE_LENGTH}`;
459
+ };
460
+
461
+ const updateDescriptionPreview = () => {
462
+ const value = seoDescriptionInput.value;
463
+ afterDescriptionPreview.textContent = value;
464
+ descriptionCounter.textContent = `${value.length} / ${MAX_DESCRIPTION_LENGTH}`;
465
+ };
466
+
467
+ seoTitleInput.addEventListener("input", updateTitlePreview);
468
+ seoDescriptionInput.addEventListener("input", updateDescriptionPreview);
469
+
470
+ if (summarizeBtn) {
471
+ summarizeBtn.addEventListener("click", async () => {
472
+ const contentValue = articleContentInput.value.trim();
473
+ if (!contentValue) {
474
+ alert("본문 내용을 입력한 뒤 다시 시도해주세요.");
475
+ return;
476
+ }
477
+
478
+ const originalText = summarizeBtn.textContent;
479
+ summarizeBtn.disabled = true;
480
+ summarizeBtn.textContent = "생성 중...";
481
+
482
+ try {
483
+ const response = await fetch("/generate-description", {
484
+ method: "POST",
485
+ headers: {
486
+ "Content-Type": "application/json",
487
+ },
488
+ body: JSON.stringify({ content: contentValue }),
489
+ });
490
+
491
+ if (!response.ok) {
492
+ const { error } = await response.json();
493
+ throw new Error(error || "설명을 생성하지 못했습니다.");
494
+ }
495
+
496
+ const seoData = await response.json();
497
+ seoTitleInput.value = seoData.title ?? "";
498
+ seoDescriptionInput.value = seoData.description ?? "";
499
+ seoTitleInput.dispatchEvent(new Event("input"));
500
+ seoDescriptionInput.dispatchEvent(new Event("input"));
501
+ } catch (error) {
502
+ console.error(error);
503
+ alert("AI 설명 생성 중 문제가 발생했습니다. 잠시 후 다시 시도해주세요.");
504
+ } finally {
505
+ summarizeBtn.disabled = false;
506
+ summarizeBtn.textContent = originalText;
507
+ }
508
+ });
509
+ }
510
+
511
+ updateTitlePreview();
512
+ updateDescriptionPreview();
513
+ };
514
+
515
+ const fetchCategories = async () => {
516
+ try {
517
+ const response = await fetch("/categories");
518
+ if (!response.ok) throw new Error("카테고리 정보를 불러오지 못했습니다.");
519
+ const { categories } = await response.json();
520
+
521
+ if (Array.isArray(categories)) {
522
+ categories.forEach((category) => {
523
+ const option = document.createElement("option");
524
+ option.value = category;
525
+ option.textContent = category;
526
+ categorySelect.appendChild(option);
527
+ });
528
+ }
529
+ } catch (error) {
530
+ console.error(error);
531
+ renderError("카테고리 로딩 중 문제가 발생했습니다. 새로고침 후 다시 시도해주세요.");
532
+ }
533
+ };
534
+
535
+ form.addEventListener("submit", async (event) => {
536
+ event.preventDefault();
537
+
538
+ const title = form.title.value.trim();
539
+ const content = form.content.value.trim();
540
+ const category = form.category.value;
541
+
542
+ if (!title || !content || !category) {
543
+ renderError("제목, 카테고리, 본문을 모두 입력해주세요.");
544
+ return;
545
+ }
546
+
547
+ renderLoading();
548
+ const submitButton = form.querySelector("button[type='submit']");
549
+ submitButton.disabled = true;
550
+ const originalButtonText = submitButton.textContent;
551
+ submitButton.textContent = "AI로 분석 중...";
552
+
553
+ try {
554
+ const response = await fetch("/predict", {
555
+ method: "POST",
556
+ headers: {
557
+ "Content-Type": "application/json",
558
+ },
559
+ body: JSON.stringify({ title, content, category }),
560
+ });
561
+
562
+ if (!response.ok) {
563
+ const { error } = await response.json();
564
+ throw new Error(error || "예측 요청에 실패했습니다.");
565
+ }
566
+
567
+ const payload = await response.json();
568
+ renderResults(payload);
569
+ } catch (error) {
570
+ console.error(error);
571
+ renderError(error.message || "서버 요청 중 오류가 발생했습니다.");
572
+ } finally {
573
+ submitButton.disabled = false;
574
+ submitButton.textContent = originalButtonText;
575
+ }
576
+ });
577
+
578
+ fetchCategories();
579
+ </script>
580
+ </body>
581
+ </html>
label_encoder.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:180c6e96a643e0da6875e8327c9da6e651ba88b216a373f6e9d85679b51e2bb8
+ size 558
onehot_encoder.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7eb648fceb784e272c35c0eb24810303b39143c1a5f130e2c50ae7bb23405c25
+ size 1452
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ Flask>=3.0.0
+ pandas>=2.1.0
+ numpy>=1.25.0
+ scikit-learn>=1.3.0
+ scipy>=1.11.0
+ joblib>=1.3.0
+ xgboost>=1.7.6
+ konlpy>=0.6.0
+ mecab-python3>=1.0.6
+ python-dotenv>=1.0.0
+ google-generativeai>=0.3.1
+ gunicorn>=21.2  # launched by the Dockerfile CMD and deployment docs, but missing from the original list
text_features_matrix.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7a71eb8a520c1b4dda4b05ba436c653225aa689e9d13b68d7a572013f6d62ba
+ size 8837035
tfidf_vectorizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74b3b6ff74dd41b1357013d6326534ee0a3fd0df6c0b9ff57e9bb5563869fc5f
+ size 185133
train_and_save_models.py ADDED
@@ -0,0 +1,321 @@
1
+ """Training pipeline for the "신문과방송" article performance prediction project.
2
+
3
+ This script prepares the datasets, engineers features using Okt-powered
4
+ TF-IDF and categorical encodings, trains XGBoost models for view-count and
5
+ primary audience prediction, and persists all artifacts required by the Flask
6
+ inference service.
7
+
8
+ The script is intended to be executed once the raw CSV files are available in
9
+ `data_csv/`. Running it will generate the following files in the project root:
10
+
11
+ - tfidf_vectorizer.pkl
12
+ - onehot_encoder.pkl
13
+ - label_encoder.pkl
14
+ - view_prediction_model.pkl
15
+ - age_prediction_model.pkl
16
+ - text_features_matrix.pkl
17
+ - article_mapping.pkl
18
+ """
19
+ from __future__ import annotations
20
+
21
+ import sys
22
+ from pathlib import Path
23
+ from typing import List, Optional, Tuple, cast
24
+
25
+ import joblib
26
+ import numpy as np
27
+ import pandas as pd
28
+ from konlpy.tag import Okt
29
+ from scipy.sparse import csr_matrix, hstack
30
+ from sklearn.feature_extraction.text import TfidfVectorizer
31
+ from sklearn.metrics import accuracy_score, mean_absolute_error
32
+ from sklearn.model_selection import train_test_split
33
+ from sklearn.preprocessing import LabelEncoder, OneHotEncoder
34
+ from xgboost import XGBClassifier, XGBRegressor
35
+
36
+ DATA_DIR = Path("data_csv")
37
+ CONTENTS_PATH = DATA_DIR / "contents.csv"
38
+ METRICS_PATH = DATA_DIR / "article_metrics_monthly.csv"
39
+ DEMOGRAPHICS_PATH = DATA_DIR / "demographics_merged.csv"
40
+
41
+
42
+ def ensure_files_exist(paths: List[Path]) -> None:
43
+ """Raise a helpful error if any expected data file is missing."""
44
+ missing = [str(path) for path in paths if not path.exists()]
45
+ if missing:
46
+ raise FileNotFoundError(
47
+ "Missing required data files: " + ", ".join(missing)
48
+ )
49
+
50
+ OKT = Okt()
51
+
52
+ def okt_tokenizer(text):
53
+ """Define tokenizer using Okt that extracts nouns and verbs."""
54
+ if not text.strip():
55
+ return []
56
+ # Extract nouns and verbs
57
+ return [word for word, tag in OKT.pos(text, stem=True) if tag in ['Noun', 'Verb']]
58
+
59
+
+ def load_datasets() -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+     print("[1/6] Loading datasets...")
+     contents = pd.read_csv(CONTENTS_PATH)
+     metrics = pd.read_csv(METRICS_PATH)
+     demographics = pd.read_csv(DEMOGRAPHICS_PATH)
+     return contents, metrics, demographics
+
+
+ def aggregate_metrics(metrics: pd.DataFrame) -> pd.DataFrame:
+     print("[2/6] Aggregating article metrics...")
+     agg = (
+         metrics.groupby("article_id", as_index=False)[["views_total", "comments", "likes"]]
+         .sum()
+         .rename(columns={
+             "views_total": "views_total",
+             "comments": "comments_total",
+             "likes": "likes_total",
+         })
+     )
+     return agg
+
+
+ def identify_primary_audience(demographics: pd.DataFrame) -> pd.DataFrame:
+     print("[3/6] Identifying primary audience age groups...")
+     filtered = demographics[demographics["age_group"] != "전체"].copy()
+     if filtered.empty:
+         raise ValueError(
+             "No demographic records found after excluding '전체'."
+         )
+     filtered.sort_values(["article_id", "views"], ascending=[True, False], inplace=True)
+     idx = filtered.groupby("article_id")["views"].idxmax()
+     primary = (
+         filtered.loc[idx, ["article_id", "age_group"]]
+         .rename(columns={"age_group": "primary_age_group"})
+         .reset_index(drop=True)
+     )
+     return primary
+
+
+ def build_master_dataframe(
+     contents: pd.DataFrame,
+     metrics_agg: pd.DataFrame,
+     primary_audience: pd.DataFrame,
+ ) -> pd.DataFrame:
+     print("[4/6] Merging datasets...")
+     df_master = contents.merge(metrics_agg, on="article_id", how="left")
+     df_master = df_master.merge(primary_audience, on="article_id", how="left")
+
+     # Replace missing numeric metrics with zeros for downstream processing.
+     for column in ["views_total", "comments_total", "likes_total"]:
+         if column in df_master.columns:
+             df_master[column] = df_master[column].fillna(0)
+
+     return df_master
+
+
+ def engineer_features(df_master: pd.DataFrame) -> tuple[csr_matrix, csr_matrix, TfidfVectorizer, OneHotEncoder]:
+     print("[5/6] Engineering features (text + category)...")
+     text_series = (
+         df_master["title"].fillna("") + " " + df_master["content"].fillna("")
+     ).str.strip()
+
+     vectorizer = TfidfVectorizer(
+         tokenizer=okt_tokenizer,
+         max_features=5000,
+         lowercase=False,
+     )
+     X_text = vectorizer.fit_transform(text_series)
+     X_text_csr = csr_matrix(X_text)
+
+     category_series = df_master["category"].fillna("미분류")
+     onehot_encoder = OneHotEncoder(handle_unknown="ignore")
+     X_cat = onehot_encoder.fit_transform(category_series.to_frame())
+
+     X_combined = cast(csr_matrix, hstack([X_text_csr, X_cat]).tocsr())
+     return X_combined, X_text_csr, vectorizer, onehot_encoder
+
+
+ def prepare_targets(
+     df_master: pd.DataFrame,
+     X_combined: csr_matrix,
+     X_text: csr_matrix,
+ ) -> tuple[csr_matrix, csr_matrix, np.ndarray, np.ndarray, LabelEncoder, pd.DataFrame]:
+     print("[6/6] Preparing targets and filtering valid samples...")
+     y_views = df_master["views_total"].fillna(0).to_numpy(dtype=np.float32)
+     y_age = df_master["primary_age_group"]
+
+     valid_mask = y_age.notna().to_numpy()
+     if not valid_mask.any():
+         raise ValueError(
+             "No samples contain a primary audience label. Unable to train the classification model."
+         )
+
+     X_combined_valid = X_combined[valid_mask]
+     X_text_valid = X_text[valid_mask]
+     y_views_valid = y_views[valid_mask]
+     y_age_valid = y_age[valid_mask].astype(str)
+
+     label_encoder = LabelEncoder()
+     y_age_encoded = np.asarray(label_encoder.fit_transform(y_age_valid), dtype=np.int32)
+
+     article_mapping = df_master.loc[valid_mask, ["article_id", "title"]].reset_index(drop=True)
+
+     return (
+         X_combined_valid,
+         X_text_valid,
+         y_views_valid,
+         y_age_encoded,
+         label_encoder,
+         article_mapping,
+     )
+
+
+ def train_models(
+     X_features: csr_matrix,
+     y_views: np.ndarray,
+     y_age_encoded: np.ndarray,
+     num_classes: int,
+ ) -> tuple[XGBRegressor, XGBClassifier]:
+     print("Training XGBoost models with validation split...")
+
+     stratify_target = y_age_encoded if len(np.unique(y_age_encoded)) > 1 else None
+
+     (
+         X_train,
+         X_valid,
+         y_views_train,
+         y_views_valid,
+         y_age_train,
+         y_age_valid,
+     ) = train_test_split(
+         X_features,
+         y_views,
+         y_age_encoded,
+         test_size=0.2,
+         random_state=42,
+         stratify=stratify_target,
+     )
+
+     view_model = XGBRegressor(
+         objective="reg:squarederror",
+         n_estimators=200,
+         learning_rate=0.1,
+         max_depth=6,
+         subsample=0.8,
+         colsample_bytree=0.8,
+         random_state=42,
+         tree_method="hist",
+         n_jobs=-1,
+     )
+     view_model.fit(X_train, y_views_train)
+
+     age_model = XGBClassifier(
+         objective="multi:softprob",
+         num_class=num_classes,
+         n_estimators=300,
+         learning_rate=0.1,
+         max_depth=6,
+         subsample=0.8,
+         colsample_bytree=0.8,
+         random_state=42,
+         tree_method="hist",
+         n_jobs=-1,
+         eval_metric="mlogloss",
+         use_label_encoder=False,
+     )
+     age_model.fit(X_train, y_age_train)
+
+     if X_valid.shape[0] > 0:
+         view_pred = view_model.predict(X_valid)
+         mae = mean_absolute_error(y_views_valid, view_pred)
+         age_pred = age_model.predict(X_valid)
+         acc = accuracy_score(y_age_valid, age_pred)
+         print(f" - Validation MAE (views): {mae:,.2f}")
+         print(f" - Validation Accuracy (audience): {acc:.4f}")
+
+     # Refit on the full dataset to maximise performance for saved artifacts.
+     view_model.fit(X_features, y_views)
+     age_model.fit(X_features, y_age_encoded)
+
+     return view_model, age_model
+
+
+ def save_artifacts(
+     vectorizer: TfidfVectorizer,
+     onehot_encoder: OneHotEncoder,
+     label_encoder: LabelEncoder,
+     view_model: XGBRegressor,
+     age_model: XGBClassifier,
+     text_features: csr_matrix,
+     article_mapping: pd.DataFrame,
+ ) -> None:
+     print("Saving artifacts...")
+
+     joblib.dump(vectorizer, "tfidf_vectorizer.pkl")
+     print("- Saved tfidf_vectorizer.pkl")
+
+     joblib.dump(onehot_encoder, "onehot_encoder.pkl")
+     print("- Saved onehot_encoder.pkl")
+
+     joblib.dump(label_encoder, "label_encoder.pkl")
+     print("- Saved label_encoder.pkl")
+
+     joblib.dump(view_model, "view_prediction_model.pkl")
+     print("- Saved view_prediction_model.pkl")
+
+     joblib.dump(age_model, "age_prediction_model.pkl")
+     print("- Saved age_prediction_model.pkl")
+
+     joblib.dump(text_features, "text_features_matrix.pkl")
+     print("- Saved text_features_matrix.pkl")
+
+     joblib.dump(article_mapping, "article_mapping.pkl")
+     print("- Saved article_mapping.pkl")
+
+
+ def main() -> None:
+     np.random.seed(42)
+
+     ensure_files_exist([CONTENTS_PATH, METRICS_PATH, DEMOGRAPHICS_PATH])
+
+     contents, metrics, demographics = load_datasets()
+     metrics_agg = aggregate_metrics(metrics)
+     primary_audience = identify_primary_audience(demographics)
+     df_master = build_master_dataframe(contents, metrics_agg, primary_audience)
+
+     X_combined, X_text, vectorizer, onehot_encoder = engineer_features(df_master)
+     (
+         X_features,
+         X_text_filtered,
+         y_views,
+         y_age_encoded,
+         label_encoder,
+         article_mapping,
+     ) = prepare_targets(df_master, X_combined, X_text)
+
+     view_model, age_model = train_models(
+         X_features,
+         y_views,
+         y_age_encoded,
+         num_classes=len(label_encoder.classes_),
+     )
+
+     save_artifacts(
+         vectorizer,
+         onehot_encoder,
+         label_encoder,
+         view_model,
+         age_model,
+         X_text_filtered,
+         article_mapping,
+     )
+
+     print("All artifacts saved successfully.")
+
+
+ if __name__ == "__main__":
+     try:
+         main()
+     except Exception as exc:  # pragma: no cover - top-level execution guard.
+         print(f"Error: {exc}", file=sys.stderr)
+         raise
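
The artifacts listed in the module docstring are what the Flask service consumes at inference time. Below is a minimal sketch of that consumption path: the artifact filenames and the feature order come from the script above, while the sample `title`/`content`/`category` values and the assumption that `train_and_save_models` is importable in the serving environment are illustrative, not taken from `app.py`.

```python
# Hedged sketch: load the saved artifacts and score one hypothetical article.
import joblib
import pandas as pd
from scipy.sparse import hstack

# The pickled TfidfVectorizer references okt_tokenizer by module path, so the
# training module must be importable when the vectorizer is unpickled.
import train_and_save_models  # noqa: F401

vectorizer = joblib.load("tfidf_vectorizer.pkl")
onehot_encoder = joblib.load("onehot_encoder.pkl")
label_encoder = joblib.load("label_encoder.pkl")
view_model = joblib.load("view_prediction_model.pkl")
age_model = joblib.load("age_prediction_model.pkl")

# Hypothetical inputs; real values would come from the web form.
title, content, category = "예시 제목", "예시 본문", "미분류"

# Mirror the training-time features: TF-IDF(title + content) stacked with one-hot(category).
X_text = vectorizer.transform([f"{title} {content}".strip()])
X_cat = onehot_encoder.transform(pd.DataFrame({"category": [category]}))
X = hstack([X_text, X_cat]).tocsr()

predicted_views = float(view_model.predict(X)[0])
predicted_audience = label_encoder.inverse_transform(age_model.predict(X))[0]
print(predicted_views, predicted_audience)
```
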
view_prediction_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9b2aa61b6a5c09b71cd709a7e1c550511078c1c887ea9d6ac4e7afc11fa9b37
+ size 502931
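
This entry is a Git LFS pointer file (the `version`/`oid`/`size` triplet), so the ~500 KB model binary lives in LFS storage rather than in the git history. A quick way to confirm the real artifact is present after checkout, as a hedged sketch that assumes the LFS objects have already been pulled (e.g. via `git lfs pull`):

```python
# Hedged sketch: confirm the LFS-tracked artifact resolved to a real pickle, not pointer text.
import joblib

model = joblib.load("view_prediction_model.pkl")
print(type(model).__name__)  # expected: XGBRegressor, per train_and_save_models.py
```
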
wsgi.py ADDED
@@ -0,0 +1,9 @@
+ """WSGI entrypoint for production servers.
+
+ Expose the Flask application as ``application`` so Gunicorn or other WSGI-compatible
+ servers can import it via ``wsgi:application``.
+ """
+ from app import app as application
+
+ if __name__ == "__main__":  # pragma: no cover
+     application.run()
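
Since this module is the import target for Gunicorn (`wsgi:application`), it can also be smoke-tested in-process with Flask's test client before standing up a real server. A minimal sketch, assuming `app.py` imports cleanly with its model artifacts present and that the app serves an index page at `/`:

```python
# Hedged sketch: exercise the WSGI application without a WSGI server.
from wsgi import application

with application.test_client() as client:
    response = client.get("/")  # hypothetical index route
    print(response.status_code)  # 200 expected if the app and its artifacts loaded
```
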