Upload sentiment analysis model

Files changed (8)

README.md +193 -0
config.json +41 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0
training_args.bin +3 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,193 @@

+---
+language: zh
+license: apache-2.0
+tags:
+- sentiment-analysis
+- chinese
+- finance
+- finbert
+- crypto
+- text-classification
+datasets:
+- custom
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+model-index:
+- name: Chinese Financial Sentiment Analysis (Crypto)
+  results:
+  - task:
+      type: text-classification
+      name: Sentiment Analysis
+    metrics:
+    - type: accuracy
+      value: 0.645
+      name: Accuracy
+    - type: f1
+      value: 0.6365
+      name: F1 Score
+    - type: precision
+      value: 0.6394
+      name: Precision
+    - type: recall
+      value: 0.645
+      name: Recall
+---
+# Chinese Financial Sentiment Analysis Model (Crypto Focus)
+## Model Description
+This model is fine-tuned from `yiyanghkust/finbert-tone-chinese` for sentiment analysis of Chinese-language cryptocurrency news and social media content. It classifies text into three sentiment categories: Positive, Neutral, and Negative.
+## Training Data
+- **Size**: 1,000 manually annotated Chinese financial news items
+- **Source**: Cryptocurrency-related news and tweets
+- **Annotation**: AI-assisted labeling followed by manual correction
+- **Distribution**:
+  - Positive: 420 (42.0%)
+  - Neutral: 420 (42.0%)
+  - Negative: 160 (16.0%)
+## Performance Metrics
+Performance on a 200-sample test set:
+| Metric | Value |
+|--------|-------|
+| Accuracy | 64.50% |
+| F1 Score | 63.65% |
+| Precision | 63.94% |
+| Recall | 64.50% |
+## Usage
+### Quick Start
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+model_name = "YOUR_USERNAME/sentiment-finetuned-1000"  # replace with your username
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Analyze text ("Bitcoin breaks $100,000, hitting an all-time high")
+text = "比特币突破10万美元创历史新高"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+# Predict
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class = torch.argmax(predictions, dim=-1).item()
+# Map the class index to its label via the model config
+# (config.json defines 0 = Neutral, 1 = Positive, 2 = Negative)
+sentiment = model.config.id2label[predicted_class]
+confidence = predictions[0][predicted_class].item()
+print(f"Sentiment: {sentiment}")
+print(f"Confidence: {confidence:.4f}")
+```
+### Batch Processing
+```python
+# Reuses `tokenizer`, `model`, and `torch` from the Quick Start example
+texts = [
+    "币安获得阿布扎比监管授权",      # Binance obtains regulatory authorization in Abu Dhabi
+    "以太坊完成Fusaka升级",          # Ethereum completes the Fusaka upgrade
+    "某交易所遭攻击损失100万美元"    # An exchange loses $1 million in an attack
+]
+inputs = tokenizer(texts, return_tensors="pt", truncation=True,
+                   max_length=128, padding=True)
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_classes = torch.argmax(predictions, dim=-1)
+# Look up labels from the model config rather than a hard-coded list
+for text, pred in zip(texts, predicted_classes):
+    print(f"{text} -> {model.config.id2label[pred.item()]}")
+```
+## Training Configuration
+- **Base Model**: yiyanghkust/finbert-tone-chinese
+- **Epochs**: 5
+- **Batch Size**: 16
+- **Learning Rate**: 2e-5
+- **Max Length**: 128
+- **Device**: NVIDIA GeForce RTX 3060 Laptop GPU
+- **Training Time**: ~5 minutes
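+## Use Cases
+- ✅ Sentiment analysis of cryptocurrency news
+- ✅ Social media sentiment monitoring
+- ✅ Financial market sentiment indicators
+- ✅ Real-time news sentiment tracking
+- ✅ Supporting reference for investment decisions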
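+## Limitations
+- ⚠️ Trained mainly on cryptocurrency-related financial news; it may perform poorly in other financial domains
+- ⚠️ Negative samples are relatively scarce (16%), so the model may be less sensitive to negative sentiment
+- ⚠️ Accuracy may drop on short texts (fewer than 10 characters)
+- ⚠️ Supports Simplified Chinese only
+- ⚠️ The model is not a substitute for human judgment; treat its output as a reference only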
+## License
+Apache-2.0
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{watchtower-sentiment-2025,
+  title={Chinese Financial Sentiment Analysis Model (Crypto Focus)},
+  author={WatchTower Team},
+  year={2025},
+  howpublished={\url{https://huggingface.co/YOUR_USERNAME/sentiment-finetuned-1000}},
+  note={Fine-tuned from yiyanghkust/finbert-tone-chinese}
+}
+```
+## Base Model
+This model is fine-tuned from:
+- [yiyanghkust/finbert-tone-chinese](https://huggingface.co/yiyanghkust/finbert-tone-chinese)
+Thanks to the original authors for their contribution!
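+## Changelog
+### v2.0 (2025-12-09)
+- ✅ Expanded the training data to 1,000 samples
+- ✅ Corrected annotation errors to improve data quality
+- ✅ Adjusted the class distribution to improve model balance
+- ✅ F1 score improved by 2.0 percentage points (0.6165 → 0.6365)
+### v1.0 (Initial Release)
+- Initial version based on 500 annotated samples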
+## Contact
+For questions or suggestions, please open an issue or PR.
+---
+**Maintainer**: WatchTower Team
+**Last Updated**: 2025-12-09
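
The README reports a recall (0.645) identical to its accuracy, which is what support-weighted averaging produces. The README does not say which averaging mode was used, so treat that as an inference. The sketch below shows the relationship on a small confusion matrix whose counts are invented purely for illustration and are not taken from the model's evaluation.

```python
def weighted_metrics(cm):
    """Support-weighted precision, recall, and F1 from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                        # true samples of class i
        predicted = sum(cm[r][i] for r in range(n))  # samples predicted as class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
example_cm = [
    [5, 2, 1],
    [1, 6, 1],
    [1, 1, 2],
]
metrics = weighted_metrics(example_cm)
print(metrics)
```

With weighted averaging, each class's recall is multiplied by its share of the samples, so the sum collapses to correct predictions divided by total samples, which is the accuracy.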

config.json ADDED

	@@ -0,0 +1,41 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "directionality": "bidi",
+  "dtype": "float32",
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "Neutral",
+    "1": "Positive",
+    "2": "Negative"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "Negative": 2,
+    "Neutral": 0,
+    "Positive": 1
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "transformers_version": "4.57.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 21128
+}
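
The `id2label` block above fixes the order of the output classes as 0 = Neutral, 1 = Positive, 2 = Negative, so any code that decodes predictions should read the mapping from the config instead of assuming an order. A minimal sketch in plain Python follows; the `decode` helper and the example logits are hypothetical, and only the mapping itself comes from `config.json`.

```python
import json
import math

# The id2label fragment as it appears in config.json (JSON keys are strings)
CONFIG_FRAGMENT = '{"id2label": {"0": "Neutral", "1": "Positive", "2": "Negative"}}'

# Convert the string keys to integers so they can be indexed by class id
id2label = {int(k): v for k, v in json.loads(CONFIG_FRAGMENT)["id2label"].items()}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, confidence) for one row of logits (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits in which index 1 is the largest
label, confidence = decode([0.1, 2.0, -1.0])
print(label, round(confidence, 4))
```

A hard-coded list such as `['positive', 'neutral', 'negative']` would swap the first two classes relative to this config.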

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:201a42ab8939395ab1923d8eb7bd5505c645a8331572e4cec417007a7853e761
+size 409103316
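
As a sanity check on the pointer size above, the parameter count can be estimated from the values in `config.json`. The sketch below assumes the standard BERT layout (word, position, and token-type embeddings, 12 encoder layers, a pooler, and a 3-way linear classifier) stored as float32; the function name is hypothetical, and the layer breakdown is an assumption about the checkpoint rather than something read from the file.

```python
def bert_classifier_param_count(vocab_size=21128, hidden=768, layers=12,
                                intermediate=3072, max_positions=512,
                                type_vocab=2, num_labels=3):
    """Estimate parameters of a BERT sequence classifier (assumed standard layout)."""
    layer_norm = 2 * hidden  # weight + bias
    # Word, position, and token-type embeddings plus their LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block plus LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


params = bert_classifier_param_count()
float32_bytes = params * 4  # "dtype": "float32" in config.json
print(f"{params:,} parameters, {float32_bytes:,} bytes of weights")
```

Under these assumptions the estimate is roughly 102.3 million parameters, or about 409.08 MB of float32 weights, slightly below the 409,103,316 bytes in the LFS pointer. The small remainder would be consistent with the file's metadata header.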

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
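
The special-token IDs above ([PAD] = 0, [CLS] = 101, [SEP] = 102) determine how a single sequence is framed before it reaches the model. The following is a rough sketch of what truncation to `max_length` and padding do to one sequence. The `build_inputs` helper and the content token IDs are hypothetical, and the real tokenizer handles more cases (sentence pairs, other truncation strategies).

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # IDs from tokenizer_config.json


def build_inputs(token_ids, max_length=128, pad=False):
    """Frame one sequence as [CLS] tokens [SEP], truncating and optionally padding.

    token_ids: content token IDs without special tokens (hypothetical values).
    """
    # Reserve two positions for [CLS] and [SEP], then cut the tail
    content = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad:
        missing = max_length - len(input_ids)
        input_ids += [PAD_ID] * missing
        attention_mask += [0] * missing  # padding positions are masked out
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A 200-token input is cut down to 126 content tokens plus the two specials
long_example = build_inputs(list(range(200, 400)), max_length=128)
print(len(long_example["input_ids"]))

# A short input padded to a fixed length of 8
short_example = build_inputs([200, 201], max_length=8, pad=True)
print(short_example)
```

With the README's `max_length=128`, at most 126 content tokens of any input reach the model.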

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92dfe961fb52dc1b69afe75a1a36ee5850f2525ca6066e2863f577c3d75cba51
+size 5841

vocab.txt ADDED

The diff for this file is too large to render. See raw diff