Dataset Use Extraction

A fine-tuned GLiNER2 adapter for extracting structured dataset mentions from research documents and policy papers.

Developed as part of the AI for Data—Data for AI program, a collaboration between the World Bank and UNHCR, to monitor and measure data use across development research.

Overview

This model identifies and extracts structured information about datasets mentioned in text, including formal survey names, descriptive data references, and vague data allusions. It extracts rich metadata for each mention including the dataset name, acronym, producer, geography, data type, and usage context.

Performance

Evaluated on a held-out test set of 199 annotated text passages:

| Metric    | Score |
|-----------|-------|
| F1        | 84.8% |
| Precision | 90.0% |
| Recall    | 80.2% |
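As a quick sanity check, the reported F1 is the harmonic mean of the precision and recall above:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.900
recall = 0.802
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.1%}")  # 84.8%
```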

Performance by mention type

| Tag         | Total | Found | Recall |
|-------------|-------|-------|--------|
| Named       | 394   | 317   | 80.5%  |
| Descriptive | 135   | 108   | 80.0%  |
| Vague       | 87    | 70    | 80.5%  |

Extracted Fields

For each dataset mention, the model extracts up to 13 structured fields:

| Field                | Type   | Description |
|----------------------|--------|-------------|
| dataset_name         | string | Name or description of the dataset |
| acronym              | string | Abbreviation (e.g., "DHS", "LSMS") |
| author               | string | Individual author(s) |
| producer             | string | Organization that created the dataset |
| publication_year     | string | Year published |
| reference_year       | string | Year the data was collected |
| reference_population | string | Target population |
| geography            | string | Geographic coverage |
| description          | string | Content description |
| data_type            | choice | survey, census, database, administrative, indicator, geospatial, microdata, report, other |
| dataset_tag          | choice | named, descriptive, vague |
| usage_context        | choice | primary, supporting, background |
| is_used              | choice | True, False |
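For illustration, a single extracted mention could look like the record below. The field values are hypothetical (not model output), chosen only to show the shape of the 13-field schema:

```python
# Hypothetical extracted record illustrating the 13-field schema.
mention = {
    "dataset_name": "Demographic and Health Survey",
    "acronym": "DHS",
    "author": "",  # empty when no individual author is mentioned
    "producer": "USAID",  # producer assumed for illustration
    "publication_year": "2021",
    "reference_year": "2020",
    "reference_population": "women aged 15-49",
    "geography": "Ghana",
    "description": "nationally representative household survey",
    "data_type": "survey",
    "dataset_tag": "named",
    "usage_context": "primary",
    "is_used": "True",
}
print(mention["dataset_name"], mention["dataset_tag"])
```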

Usage

With the ai4data library (recommended)

pip install git+https://github.com/rafmacalaba/monitoring_of_datause.git

from ai4data import extract_from_text, extract_from_document

# Extract from text
text = """We use the Demographic and Health Survey (DHS) from 2020 as our
primary data source to analyze outcomes in Ghana. For robustness checks,
we also reference the Ghana Living Standard Survey (GLSS) from 2012."""

results = extract_from_text(text)
for ds in results["datasets"]:
    print(f"  {ds['dataset_name']} [{ds['dataset_tag']}]")

# Extract from PDF (URL or local file)
url = "https://documents1.worldbank.org/curated/en/.../report.pdf"
results = extract_from_document(url)

With GLiNER2 directly

from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

# Load base model + adapter
model = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
adapter_path = snapshot_download("ai4data/datause-extraction")
model.load_adapter(adapter_path)

# Define the extraction schema (a subset of the 13 fields, for brevity)
schema = (
    model.create_schema()
    .structure("dataset_mention")
        .field("dataset_name", dtype="str")
        .field("acronym", dtype="str")
        .field("producer", dtype="str")
        .field("geography", dtype="str")
        .field("description", dtype="str")
        .field("data_type", dtype="str",
               choices=["survey", "census", "database", "administrative",
                        "indicator", "geospatial", "microdata", "report", "other"])
        .field("dataset_tag", dtype="str",
               choices=["named", "descriptive", "vague"])
        .field("usage_context", dtype="str",
               choices=["primary", "supporting", "background"])
        .field("is_used", dtype="str", choices=["True", "False"])
)

# Run extraction (reusing `text` from the example above)
results = model.extract(text, schema)
for mention in results["dataset_mention"]:
    print(mention)
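A common post-processing step is keeping only mentions the paper actually relies on. Assuming results carry a list of dicts as in the loop above (the sample records here are illustrative, not model output), a minimal filter might be:

```python
# Keep only named datasets that the paper actually uses as a data source.
# `results` mimics the structure shown above; sample data for illustration.
results = {
    "dataset_mention": [
        {"dataset_name": "DHS 2020", "dataset_tag": "named", "is_used": "True"},
        {"dataset_name": "administrative records", "dataset_tag": "vague", "is_used": "False"},
    ]
}
used = [
    m for m in results["dataset_mention"]
    if m.get("is_used") == "True" and m.get("dataset_tag") == "named"
]
print([m["dataset_name"] for m in used])  # ['DHS 2020']
```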

Training Details

  • Base model: fastino/gliner2-large-v1 (DeBERTa-v3-large encoder)
  • Method: LoRA (r=16, alpha=32)
  • Training data: ~3,400 synthetic examples (v8 dataset) generated with GPT-4o and Gemini 2.5 Flash
  • Max context: 512 tokens (aligned with DeBERTa-v3 position embeddings)
  • Data format: Context-aware passages with markdown formatting, footnotes, and structured annotations
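Because the adapter was trained with a 512-token context, long documents should be split into passages before extraction. A naive whitespace-based sketch follows; a real pipeline would count subword tokens with the model's tokenizer, since subword counts usually exceed word counts:

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping word-window passages.

    max_words stays below 512 to leave headroom for subword
    tokenization, which yields more tokens than whitespace words.
    """
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)] or [""]

passages = chunk_words("word " * 1000)
print(len(passages))  # 3 overlapping passages for a 1000-word document
```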

Limitations

  • Optimized for English-language research documents and policy papers
  • Best suited for World Bank-style development research documents
  • May not generalize well to non-research text (news articles, social media, etc.)
  • Requires the fastino/gliner2-large-v1 base model

Citation

If you use this model, please cite:

@misc{ai4data-datause-extraction,
  title={Dataset Use Extraction Model},
  author={AI for Data—Data for AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ai4data/datause-extraction}
}
