---
language:
- en
license: apache-2.0
tags:
- gliner2
- ner
- dataset-extraction
- lora
- world-bank
base_model: fastino/gliner2-large-v1
library_name: gliner2
pipeline_tag: token-classification
datasets:
- rafmacalaba/datause-v8
model-index:
- name: datause-extraction
  results:
  - task:
      type: token-classification
      name: Dataset Mention Extraction
    metrics:
    - type: f1
      value: 84.8
      name: F1 (max_tokens=512)
    - type: precision
      value: 90.0
      name: Precision
    - type: recall
      value: 80.2
      name: Recall
---
# Dataset Use Extraction
A fine-tuned GLiNER2 adapter for extracting structured dataset mentions from research documents and policy papers.
Developed as part of the AI for Data—Data for AI program, a collaboration between the World Bank and UNHCR, to monitor and measure data use across development research.
## Overview
This model identifies and extracts structured information about datasets mentioned in text, including formal survey names, descriptive data references, and vague data allusions. It extracts rich metadata for each mention including the dataset name, acronym, producer, geography, data type, and usage context.
## Performance
Evaluated on a held-out test set of 199 annotated text passages:
| Metric | Score |
|---|---|
| F1 | 84.8% |
| Precision | 90.0% |
| Recall | 80.2% |
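As a quick sanity check, the reported F1 is the harmonic mean of the precision and recall above:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.900
recall = 0.802
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 1))  # → 84.8
```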
### Performance by mention type
| Tag | Total | Found | Recall |
|---|---|---|---|
| Named | 394 | 317 | 80.5% |
| Descriptive | 135 | 108 | 80.0% |
| Vague | 87 | 70 | 80.5% |
## Extracted Fields
For each dataset mention, the model extracts up to 13 structured fields:
| Field | Type | Description |
|---|---|---|
| dataset_name | string | Name or description of the dataset |
| acronym | string | Abbreviation (e.g., "DHS", "LSMS") |
| author | string | Individual author(s) |
| producer | string | Organization that created the dataset |
| publication_year | string | Year published |
| reference_year | string | Year the data was collected |
| reference_population | string | Target population |
| geography | string | Geographic coverage |
| description | string | Content description |
| data_type | choice | survey, census, database, administrative, indicator, geospatial, microdata, report, other |
| dataset_tag | choice | named, descriptive, vague |
| usage_context | choice | primary, supporting, background |
| is_used | choice | True, False |
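Because the choice fields are restricted to fixed vocabularies, extracted records are easy to validate downstream. A minimal sketch (the `validate_mention` helper and the sample record, including its producer value, are illustrative, not model output):

```python
# Allowed values for the choice fields, taken from the table above.
DATA_TYPES = {"survey", "census", "database", "administrative",
              "indicator", "geospatial", "microdata", "report", "other"}
DATASET_TAGS = {"named", "descriptive", "vague"}
USAGE_CONTEXTS = {"primary", "supporting", "background"}

def validate_mention(m: dict) -> list[str]:
    """Return a list of problems with an extracted mention (empty if valid)."""
    errors = []
    if m.get("data_type") not in DATA_TYPES:
        errors.append(f"bad data_type: {m.get('data_type')}")
    if m.get("dataset_tag") not in DATASET_TAGS:
        errors.append(f"bad dataset_tag: {m.get('dataset_tag')}")
    if m.get("usage_context") not in USAGE_CONTEXTS:
        errors.append(f"bad usage_context: {m.get('usage_context')}")
    if m.get("is_used") not in {"True", "False"}:
        errors.append(f"bad is_used: {m.get('is_used')}")
    return errors

# Hand-written record in the shape the table describes (not model output)
mention = {
    "dataset_name": "Demographic and Health Survey",
    "acronym": "DHS",
    "geography": "Ghana",
    "data_type": "survey",
    "dataset_tag": "named",
    "usage_context": "primary",
    "is_used": "True",
}
print(validate_mention(mention))  # → []
```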
## Usage

### With ai4data library (recommended)

```bash
pip install git+https://github.com/rafmacalaba/monitoring_of_datause.git
```
```python
from ai4data import extract_from_text, extract_from_document

# Extract from text
text = """We use the Demographic and Health Survey (DHS) from 2020 as our
primary data source to analyze outcomes in Ghana. For robustness checks,
we also reference the Ghana Living Standard Survey (GLSS) from 2012."""
results = extract_from_text(text)
for ds in results["datasets"]:
    print(f"  {ds['dataset_name']} [{ds['dataset_tag']}]")

# Extract from a PDF (URL or local file)
url = "https://documents1.worldbank.org/curated/en/.../report.pdf"
results = extract_from_document(url)
```
### With GLiNER2 directly
```python
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

# Load the base model and attach the fine-tuned adapter
model = GLiNER2.from_pretrained("fastino/gliner2-large-v1")
adapter_path = snapshot_download("ai4data/datause-extraction")
model.load_adapter(adapter_path)

# Define the extraction schema
schema = (
    model.create_schema()
    .structure("dataset_mention")
    .field("dataset_name", dtype="str")
    .field("acronym", dtype="str")
    .field("producer", dtype="str")
    .field("geography", dtype="str")
    .field("description", dtype="str")
    .field("data_type", dtype="str",
           choices=["survey", "census", "database", "administrative",
                    "indicator", "geospatial", "microdata", "report", "other"])
    .field("dataset_tag", dtype="str",
           choices=["named", "descriptive", "vague"])
    .field("usage_context", dtype="str",
           choices=["primary", "supporting", "background"])
    .field("is_used", dtype="str", choices=["True", "False"])
)

text = "We use the Demographic and Health Survey (DHS) from 2020 to analyze outcomes in Ghana."
results = model.extract(text, schema)
for mention in results["dataset_mention"]:
    print(mention)
```
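A common follow-up is tallying mentions by tag. Assuming `results` is shaped the way the schema above defines it (the sample mentions here are hand-written for illustration, not model output):

```python
from collections import Counter

# Hypothetical output in the schema's shape (hand-written, not model output)
results = {"dataset_mention": [
    {"dataset_name": "Demographic and Health Survey", "dataset_tag": "named"},
    {"dataset_name": "household survey data", "dataset_tag": "descriptive"},
    {"dataset_name": "the data", "dataset_tag": "vague"},
    {"dataset_name": "Ghana Living Standard Survey", "dataset_tag": "named"},
]}

# Count how many mentions fall into each tag category
counts = Counter(m["dataset_tag"] for m in results["dataset_mention"])
print(counts.most_common())  # → [('named', 2), ('descriptive', 1), ('vague', 1)]
```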
## Training Details
- Base model: fastino/gliner2-large-v1 (DeBERTa-v3-large encoder)
- Method: LoRA (r=16, alpha=32)
- Training data: ~3,400 synthetic examples (v8 dataset) generated with GPT-4o and Gemini 2.5 Flash
- Max context: 512 tokens (aligned with DeBERTa-v3 position embeddings)
- Data format: Context-aware passages with markdown formatting, footnotes, and structured annotations
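Because the context window is capped at 512 tokens, longer documents need to be split into passages before extraction. A rough sketch using whitespace words as a stand-in for tokens (a real pipeline should count with the model's own tokenizer; `chunk_words` and its window sizes are illustrative):

```python
def chunk_words(text: str, max_words: int = 350, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows.

    Whitespace words are only a rough proxy for tokens; pick max_words
    conservatively so each window stays under the 512-token limit.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # step forward, keeping some overlap
    return chunks

print(len(chunk_words("word " * 1000)))  # → 4
```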
## Limitations
- Optimized for English-language research documents and policy papers
- Best suited for World Bank-style development research documents
- May not generalize well to non-research text (news articles, social media, etc.)
- Requires the `fastino/gliner2-large-v1` base model
## Citation
If you use this model, please cite:
```bibtex
@misc{ai4data-datause-extraction,
  title={Dataset Use Extraction Model},
  author={AI for Data—Data for AI},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ai4data/datause-extraction}
}
```
## Links
- Library: ai4data
- Base model: fastino/gliner2-large-v1
- Program: AI for Data—Data for AI (World Bank & UNHCR)