Dataset Use Extraction

A fine-tuned GLiNER2 adapter for extracting structured dataset mentions from research documents and policy papers.

Developed as part of the AI for Data—Data for AI program, a collaboration between the World Bank and UNHCR, to monitor and measure data use across development research.

Overview

This model identifies and extracts structured information about datasets mentioned in text, including formal survey names, descriptive data references, and vague data allusions. It extracts rich metadata for each mention including the dataset name, acronym, producer, geography, data type, and usage context.

Performance

Evaluated on a held-out test set of 199 annotated text passages:

| Metric    | Score |
|-----------|-------|
| F1        | 84.8% |
| Precision | 90.0% |
| Recall    | 80.2% |
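As a quick sanity check, the reported F1 is the harmonic mean of the precision and recall above:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.900
recall = 0.802
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.1%}")  # 84.8%
```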

Performance by mention type

| Tag         | Total | Found | Recall |
|-------------|-------|-------|--------|
| Named       | 394   | 317   | 80.5%  |
| Descriptive | 135   | 108   | 80.0%  |
| Vague       | 87    | 70    | 80.5%  |

Extracted Fields

For each dataset mention, the model extracts up to 13 structured fields:

| Field                | Type   | Description |
|----------------------|--------|-------------|
| dataset_name         | string | Name or description of the dataset |
| acronym              | string | Abbreviation (e.g., "DHS", "LSMS") |
| author               | string | Individual author(s) |
| producer             | string | Organization that created the dataset |
| publication_year     | string | Year published |
| reference_year       | string | Year the data was collected |
| reference_population | string | Target population |
| geography            | string | Geographic coverage |
| description          | string | Content description |
| data_type            | choice | survey, census, database, administrative, indicator, geospatial, microdata, report, other |
| dataset_tag          | choice | named, descriptive, vague |
| usage_context        | choice | primary, supporting, background |
| is_used              | choice | True, False |
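For illustration, a single extracted mention could look like the record below. The field values are hypothetical (not model output), chosen only to show the shape of the 13-field schema:

```python
# Hypothetical extracted record illustrating the 13-field schema.
mention = {
    "dataset_name": "Demographic and Health Survey",
    "acronym": "DHS",
    "author": "",  # empty when no individual author is mentioned
    "producer": "USAID",  # producer assumed for illustration
    "publication_year": "2021",
    "reference_year": "2020",
    "reference_population": "women aged 15-49",
    "geography": "Ghana",
    "description": "nationally representative household survey",
    "data_type": "survey",
    "dataset_tag": "named",
    "usage_context": "primary",
    "is_used": "True",
}
print(mention["dataset_name"], mention["dataset_tag"])
```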

Usage

With the ai4data library (recommended)

pip install git+https://github.com/rafmacalaba/monitoring_of_datause.git

from ai4data import extract_from_text, extract_from_document

# Extract from text
text = """We use the Demographic and Health Survey (DHS) from 2020 as our
primary data source to analyze outcomes in Ghana. For robustness checks,
we also reference the Ghana Living Standard Survey (GLSS) from 2012."""

results = extract_from_text(text)
for ds in results["datasets"]:
    print(f"  {ds['dataset_name']} [{ds['dataset_tag']}]")

# Extract from PDF (URL or local file)
url = "https://documents1.worldbank.org/curated/en/.../report.pdf"
results = extract_from_document(url)

With GLiNER2 directly

from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

# Load base model + adapter
model = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
adapter_path = snapshot_download("ai4data/datause-extraction")
model.load_adapter(adapter_path)

# Define the extraction schema (a subset of the 13 fields, for brevity)
schema = (
    model.create_schema()
    .structure("dataset_mention")
        .field("dataset_name", dtype="str")
        .field("acronym", dtype="str")
        .field("producer", dtype="str")
        .field("geography", dtype="str")
        .field("description", dtype="str")
        .field("data_type", dtype="str",
               choices=["survey", "census", "database", "administrative",
                        "indicator", "geospatial", "microdata", "report", "other"])
        .field("dataset_tag", dtype="str",
               choices=["named", "descriptive", "vague"])
        .field("usage_context", dtype="str",
               choices=["primary", "supporting", "background"])
        .field("is_used", dtype="str", choices=["True", "False"])
)

# Run extraction (reusing `text` from the example above)
results = model.extract(text, schema)
for mention in results["dataset_mention"]:
    print(mention)
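A common post-processing step is keeping only mentions the paper actually relies on. Assuming results carry a list of dicts as in the loop above (the sample records here are illustrative, not model output), a minimal filter might be:

```python
# Keep only named datasets that the paper actually uses as a data source.
# `results` mimics the structure shown above; sample data for illustration.
results = {
    "dataset_mention": [
        {"dataset_name": "DHS 2020", "dataset_tag": "named", "is_used": "True"},
        {"dataset_name": "administrative records", "dataset_tag": "vague", "is_used": "False"},
    ]
}
used = [
    m for m in results["dataset_mention"]
    if m.get("is_used") == "True" and m.get("dataset_tag") == "named"
]
print([m["dataset_name"] for m in used])  # ['DHS 2020']
```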

Training Details

  • Base model: fastino/gliner2-large-v1 (DeBERTa-v3-large encoder)
  • Method: LoRA (r=16, alpha=32)
  • Training data: ~3,400 synthetic examples (v8 dataset) generated with GPT-4o and Gemini 2.5 Flash
  • Max context: 512 tokens (aligned with DeBERTa-v3 position embeddings)
  • Data format: Context-aware passages with markdown formatting, footnotes, and structured annotations
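Because the adapter was trained with a 512-token context, long documents should be split into passages before extraction. A naive whitespace-based sketch follows; a real pipeline would count subword tokens with the model's tokenizer, since subword counts usually exceed word counts:

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping word-window passages.

    max_words stays below 512 to leave headroom for subword
    tokenization, which yields more tokens than whitespace words.
    """
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)] or [""]

passages = chunk_words("word " * 1000)
print(len(passages))  # 3 overlapping passages for a 1000-word document
```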

Limitations

  • Optimized for English-language research documents and policy papers
  • Best suited for World Bank-style development research documents
  • May not generalize well to non-research text (news articles, social media, etc.)
  • Requires the fastino/gliner2-large-v1 base model

Citation

If you use this model, please cite:

@misc{ai4data-datause-extraction,
  title={Dataset Use Extraction Model},
  author={AI for Data—Data for AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ai4data/datause-extraction}
}
