
AI for Data Use: Dataset Extraction

This tool identifies dataset mentions (e.g., Demographic and Health Surveys, Living Standards Measurement Study) and extracts contextual metadata such as:

publisher
publication year
reference year
geography
acronym
reference population
data description
data type
usage context
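
As a rough illustration, the attributes above can be pictured as one record per dataset mention, with every attribute optional because a given passage rarely states all of them. This is a minimal sketch only: the class name `DatasetMention`, the field names, and the sample values are assumptions made for this example, not the tool's actual output schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DatasetMention:
    """Hypothetical record for one detected dataset mention."""
    name: str
    publisher: Optional[str] = None
    publication_year: Optional[str] = None
    reference_year: Optional[str] = None
    geography: Optional[str] = None
    acronym: Optional[str] = None
    reference_population: Optional[str] = None
    data_description: Optional[str] = None
    data_type: Optional[str] = None
    usage_context: Optional[str] = None

    def found(self) -> dict:
        """Return only the attributes that were actually extracted."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Made-up values, for illustration only.
mention = DatasetMention(
    name="Demographic and Health Survey",
    acronym="DHS",
    geography="Kenya",
    usage_context="primary",
)
print(mention.found())
```

Keeping unextracted attributes as `None` and filtering them on output avoids implying that the text said something it did not.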

Usage Context Definitions

Primary mention – the dataset is the main source of analysis or results in the study.
Supporting mention – the dataset is used alongside other data to complement or validate findings.
Background mention – the dataset is mentioned for context or comparison but not used in the actual analysis.
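
The three categories above can be sketched as an enumeration, which makes the key distinction explicit: primary and supporting mentions are used in the analysis, background mentions are not. The names `UsageContext` and `is_used_in_analysis` and the string values are hypothetical; the labels the model actually emits may differ.

```python
from enum import Enum


class UsageContext(Enum):
    """Hypothetical labels mirroring the three definitions above."""
    PRIMARY = "primary"        # main source of analysis or results
    SUPPORTING = "supporting"  # complements or validates findings
    BACKGROUND = "background"  # context or comparison only


def is_used_in_analysis(context: UsageContext) -> bool:
    """True when the dataset contributes to the study's analysis."""
    return context in (UsageContext.PRIMARY, UsageContext.SUPPORTING)


for context in UsageContext:
    print(context.value, is_used_in_analysis(context))
```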

How to Use

Paste or type text into the input box (left), or select one of the provided examples.
Click 🚀 Run Extraction to process the text.
The model will highlight all detected dataset mentions and related entities (e.g., publisher, geography, year, usage context) directly in the text.
Below the highlights, a deduplicated relation tree will automatically appear, showing each dataset with its extracted metadata and filtered attributes.
Click 🧭 Show / Refresh Relation Tree at any time to rebuild or inspect the deduplicated metadata view.
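
To make the idea of a deduplicated relation tree concrete, the sketch below groups (dataset, attribute, value) triples under each dataset and drops repeats. It is not the app's implementation: the function name `build_relation_tree`, the triple format, and the case-insensitive matching rule are all assumptions chosen for illustration.

```python
def build_relation_tree(relations):
    """Group (dataset, attribute, value) triples by dataset, dropping repeats.

    Assumption: two triples are duplicates when they match after trimming
    whitespace and ignoring case. The first spelling seen is kept.
    """
    canonical = {}  # lowercased dataset name -> first spelling seen
    seen = set()    # normalized triples already added
    tree = {}
    for dataset, attribute, value in relations:
        dataset, value = dataset.strip(), value.strip()
        key = (dataset.lower(), attribute, value.lower())
        if key in seen:
            continue
        seen.add(key)
        name = canonical.setdefault(dataset.lower(), dataset)
        tree.setdefault(name, {}).setdefault(attribute, []).append(value)
    return tree


# Made-up triples, for illustration only.
relations = [
    ("Demographic and Health Survey", "geography", "Kenya"),
    ("demographic and health survey", "geography", "kenya"),
    ("Demographic and Health Survey", "acronym", "DHS"),
]
tree = build_relation_tree(relations)
for dataset, attributes in tree.items():
    print(dataset)
    for attribute, values in attributes.items():
        print(f"  {attribute}: {', '.join(values)}")
```

Here the second triple is treated as a repeat of the first, so the dataset appears once with a single geography value.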

Resources

Model: https://huggingface.co/rafmacalaba/datause-extraction-v3-finetuned
Paper (arXiv): https://arxiv.org/pdf/2502.10263
GLiNER Repo: https://github.com/urchade/GLiNER
Project Docs: https://worldbank.github.io/ai4data-use/docs/introduction.html
