DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
Abstract
The DLT-Corpus dataset, containing 2.98 billion tokens from diverse sources, enables analysis of technology emergence patterns and market-innovation correlations in the distributed ledger technology sector.
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus's utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
Community
In summary, our contributions are:
DLT-Corpus: 2.98 billion tokens from 22.12 million documents (37,440 scientific publications, 49,023 patents, 22M social media posts) with rich metadata enabling cross-disciplinary research.
Innovation diffusion analysis: Evidence that Distributed Ledger Technologies (DLTs) follow traditional technology transfer patterns, with research preceding market expansion and creating a virtuous funding cycle.
Sentiment analysis dataset: 23,301 cryptocurrency news headlines and brief descriptions with crowdsourced annotations from active community members, addressing the need for domain-specific labeled data.
LedgerBERT: A domain-adapted language model achieving a 23% improvement over BERT-base on a DLT-specific NER task, developed through continued pre-training of SciBERT.
DLT-Corpus, sentiment analysis dataset, models, and code are publicly available to support reproducibility and future research.
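The sentiment dataset above pairs each headline with labels from multiple community annotators. The paper does not specify its label-aggregation rule, so the following is a minimal sketch of one common approach, majority voting with a neutral fallback on ties; the function name, the tie-breaking rule, and the example headlines are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def aggregate_annotations(annotations):
    """Majority-vote aggregation of crowdsourced sentiment labels.

    annotations: dict mapping a headline to the list of labels its
    annotators assigned. Returns a dict mapping each headline to
    (majority label, agreement ratio). Ties fall back to 'neutral'
    (an illustrative choice, not the paper's documented rule).
    """
    resolved = {}
    for headline, labels in annotations.items():
        ranked = Counter(labels).most_common()
        # A tie between the two most frequent labels -> 'neutral'
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            resolved[headline] = ("neutral", ranked[0][1] / len(labels))
        else:
            resolved[headline] = (ranked[0][0], ranked[0][1] / len(labels))
    return resolved

example = {
    "Bitcoin ETF approval expected this quarter": ["bullish", "bullish", "neutral"],
    "Exchange halts withdrawals amid outage": ["bearish", "bearish", "bearish"],
}
print(aggregate_annotations(example))
```

The agreement ratio offers a simple per-item confidence signal, which is useful for filtering low-agreement headlines before fine-tuning a sentiment classifier.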
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-training (2026)
- Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain (2026)
- UrduLM: A Resource-Efficient Monolingual Urdu Language Model (2026)
- Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data (2026)
- Comparative Efficiency Analysis of Lightweight Transformer Models: A Multi-Domain Empirical Benchmark for Enterprise NLP Deployment (2026)
- ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction (2026)
- PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification (2026)
Models citing this paper 2
Datasets citing this paper 4