LQ-FSE-base: Korean Financial Sentence Extractor

LangQuant(λž­ν€€νŠΈ)μ—μ„œ κ³΅κ°œν•œ 금육 리포트, 금육 κ΄€λ ¨ λ‰΄μŠ€μ—μ„œ λŒ€ν‘œλ¬Έμž₯을 μΆ”μΆœν•˜κ³  μ—­ν• (outlook, event, financial, risk)을 λΆ„λ₯˜ν•˜λŠ” λͺ¨λΈμž…λ‹ˆλ‹€.

Model Description

  • Base Model: klue/roberta-base
  • Architecture: Sentence Encoder (RoBERTa) + Inter-sentence Transformer (2 layers) + Dual Classifiers
  • Task: Extractive Summarization + Role Classification (Multi-task)
  • Language: Korean
  • Domain: Korean securities research reports and financial news

Input Constraints

Parameter                   Value        Description
Max sentence length         128 tokens   Maximum tokens per sentence (longer sentences are truncated)
Max sentences per document  30           Maximum sentences per document (only the first 30 are used)
Input format                Plain text   Sentences are split automatically on punctuation marks (.!?)

  • Input: Korean financial text (securities research reports, financial news, etc.)
  • Output: a representativeness score (0–1) plus a role label (outlook/event/financial/risk) for each sentence
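The splitting and truncation rules above can be sketched in plain Python. This is a minimal illustration, not the model's own preprocessing code; the regex mirrors the one used in the Usage section, and the 30-sentence cap matches the documented limit:

```python
import re

MAX_SENTENCES = 30  # documented per-document cap

def split_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    """Split on sentence-final punctuation (.!?) and keep at most max_sentences."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    return sentences[:max_sentences]

print(split_sentences("First sentence. Second one! A third?"))
# ['First sentence.', 'Second one!', 'A third?']
```

Documents longer than 30 sentences are silently truncated, so very long reports should be chunked before inference.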

Performance

Metric Score
Extraction F1 0.705
Role Accuracy 0.851

Role Labels

Label      Description
outlook    Forecast/outlook sentences
event      Event/incident sentences
financial  Financials/earnings sentences
risk       Risk-factor sentences

Usage

import re
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "LangQuant/LQ-FSE-base"

# λͺ¨λΈ λ‘œλ“œ
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model.eval()

# μž…λ ₯ ν…μŠ€νŠΈ
text = (
    "μ‚Όμ„±μ „μžμ˜ 2024λ…„ 4λΆ„κΈ° 싀적이 μ‹œμž₯ μ˜ˆμƒμ„ μƒνšŒν–ˆλ‹€. "
    "λ©”λͺ¨λ¦¬ λ°˜λ„μ²΄ 가격 μƒμŠΉμœΌλ‘œ μ˜μ—…μ΄μ΅μ΄ μ „λΆ„κΈ° λŒ€λΉ„ 30% μ¦κ°€ν–ˆλ‹€. "
    "HBM3E 양산이 λ³Έκ²©ν™”λ˜λ©΄μ„œ AI λ°˜λ„μ²΄ μ‹œμž₯ 점유율이 ν™•λŒ€λ  전망이닀."
)

# Sentence splitting and tokenization
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
max_len, max_sent = config.max_length, config.max_sentences

padded = sentences[:max_sent]
num_real = len(padded)
while len(padded) < max_sent:
    padded.append("")

ids_list, mask_list = [], []
for s in padded:
    if s:
        enc = tokenizer(s, max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    else:
        enc = {"input_ids": torch.zeros(1, max_len, dtype=torch.long),
               "attention_mask": torch.zeros(1, max_len, dtype=torch.long)}
    ids_list.append(enc["input_ids"])
    mask_list.append(enc["attention_mask"])

input_ids = torch.cat(ids_list).unsqueeze(0)
attention_mask = torch.cat(mask_list).unsqueeze(0)
doc_mask = torch.zeros(1, max_sent)
doc_mask[0, :num_real] = 1

# Inference
with torch.no_grad():
    scores, role_logits = model(input_ids, attention_mask, doc_mask)

role_labels = config.role_labels
for i, sent in enumerate(sentences):
    score = scores[0, i].item()
    role = role_labels[role_logits[0, i].argmax().item()]
    marker = "*" if score >= 0.5 else " "
    print(f"  {marker} [{score:.4f}] [{role:10s}] {sent}")
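To turn the per-sentence scores into an extractive summary, the representative sentences can be filtered by the same 0.5 threshold used for the marker above. The helper below is an illustrative post-processing sketch that operates on plain Python lists (after calling `.tolist()` on the tensors), so it is independent of the model itself:

```python
def select_representatives(sentences, scores, roles, threshold=0.5):
    """Return (sentence, score, role) triples whose score meets the threshold,
    sorted by score in descending order."""
    picked = [(s, sc, r) for s, sc, r in zip(sentences, scores, roles) if sc >= threshold]
    return sorted(picked, key=lambda t: t[1], reverse=True)

# Hypothetical scores and roles, for illustration only:
summary = select_representatives(
    ["sentence A", "sentence B", "sentence C"],
    [0.91, 0.32, 0.77],
    ["financial", "event", "outlook"],
)
print(summary)  # [('sentence A', 0.91, 'financial'), ('sentence C', 0.77, 'outlook')]
```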

Model Architecture

Input Sentences
    ↓
[klue/roberta-base] β†’ [CLS] embeddings per sentence
    ↓
[Inter-sentence Transformer] (2 layers, 8 heads)
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Binary Classifierβ”‚  Role Classifier    β”‚
β”‚ (representative?)β”‚  (outlook/event/    β”‚
β”‚                  β”‚   financial/risk)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
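The diagram can be read as the following minimal PyTorch skeleton. This is a sketch of the described architecture only (names and the stand-in input are illustrative; the real implementation, including the klue/roberta-base sentence encoder, lives in model.py):

```python
import torch
import torch.nn as nn

class DocumentEncoderSketch(nn.Module):
    """Illustrative skeleton: per-sentence [CLS] embeddings ->
    2-layer inter-sentence Transformer -> two classifier heads."""

    def __init__(self, hidden=768, num_roles=4):
        super().__init__()
        # In the real model, klue/roberta-base produces one [CLS] vector per
        # sentence; here we take those vectors directly as input.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=2)
        self.extract_head = nn.Linear(hidden, 1)       # representative? (binary)
        self.role_head = nn.Linear(hidden, num_roles)  # outlook/event/financial/risk

    def forward(self, cls_embeddings, doc_mask):
        # cls_embeddings: (batch, max_sentences, hidden); doc_mask: 1 = real sentence
        h = self.inter_sentence(cls_embeddings, src_key_padding_mask=(doc_mask == 0))
        scores = torch.sigmoid(self.extract_head(h)).squeeze(-1)  # (batch, sentences)
        role_logits = self.role_head(h)                           # (batch, sentences, roles)
        return scores, role_logits

m = DocumentEncoderSketch()
scores, role_logits = m(torch.randn(1, 30, 768), torch.ones(1, 30))
print(scores.shape, role_logits.shape)  # torch.Size([1, 30]) torch.Size([1, 30, 4])
```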

Training

  • Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
  • Scheduler: Linear warmup (10%)
  • Loss: BCE (extraction) + CrossEntropy (role), role_weight=0.5
  • Max sentence length: 128 tokens
  • Max sentences per document: 30
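The multi-task objective above (binary cross-entropy for extraction plus cross-entropy for roles, with role_weight=0.5) can be written out numerically. The sketch below uses plain Python for a single sentence with illustrative values; it shows how the two losses combine, not the training code itself:

```python
import math

def bce(score, target):
    """Binary cross-entropy for one extraction score in (0, 1)."""
    return -(target * math.log(score) + (1 - target) * math.log(1 - score))

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy over the four role logits (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_idx]

ROLE_WEIGHT = 0.5  # role_weight from the Training section

extraction_loss = bce(score=0.8, target=1.0)              # sentence is representative
role_loss = cross_entropy([2.0, 0.1, -1.0, 0.3], 0)      # true role: outlook (index 0)
total = extraction_loss + ROLE_WEIGHT * role_loss
print(total)
```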

Files

  • model.py: Model definition (DocumentEncoderConfig, DocumentEncoderForExtractiveSummarization)
  • config.json: Model configuration
  • model.safetensors: Model weights
  • inference_example.py: Inference helper with usage example
  • convert_checkpoint.py: Script to convert original .pt checkpoint

Disclaimer

  • This model is provided for research and informational purposes only.
  • Its outputs do not constitute investment advice, financial advisory services, or trading recommendations.
  • LangQuant and the developers accept no legal liability for investment decisions made on the basis of the model's predictions.
  • No guarantee is made of the model's accuracy, completeness, or timeliness; always seek professional advice before making real investment decisions.
  • Financial markets are inherently uncertain, and a model trained on historical data does not guarantee future performance.

Usage Restrictions

  • Prohibited:
    • Use for illegal purposes such as market manipulation or generating false information
    • Use as the sole decision-making mechanism in an automated trading system
    • Presenting model outputs to third parties as professional financial advice
  • Permitted:
    • Academic research and educational use
    • Use as an auxiliary tool in financial text-analysis pipelines
    • Use as reference material for internal research and analysis work
  • For commercial use, contacting LangQuant in advance is recommended.
