---
license: mit
language:
- en
tags:
- finance
- roberta
---

## Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models (EMNLP 2023 Findings)

Paper: https://aclanthology.org/2023.findings-emnlp.138/

GitHub: https://github.com/deep-over/FiLM

### **FiLM** (**Fi**nancial **L**anguage **M**odel) Models 🌟

FiLM is a Pre-trained Language Model (PLM) optimized for the financial domain and built on a diverse range of financial corpora. Initialized from RoBERTa-base, FiLM undergoes further pre-training and is the first model to surpass RoBERTa-base in the financial domain. To train FiLM, we categorized our financial corpus into specific groups and gathered a diverse range of corpora to ensure optimal performance. The model can also be called Fin-RoBERTa (Financial RoBERTa).

We offer two versions of the FiLM model, each tailored to specific use cases in the financial domain:

**FiLM (2.4B): Our base model**

This is our foundational model, trained on the full range of corpora shown in the corpus table below. It is suited to a wide array of financial applications. 📊

[**FiLM (5.5B): Optimized for SEC filings**](https://huggingface.co/HYdsl/FiLM-SEC)

This model is specialized for handling SEC filings. We expanded the training set with 3.1 billion additional tokens from an SEC filings corpus. The dataset is sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021) and can be downloaded from Zenodo. 📑

**Loading the tokenizer and model**

FiLM shares its vocabulary with RoBERTa-base, so load the tokenizer from 'roberta-base' and the model weights from 'HYdsl/FiLM':

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModel.from_pretrained('HYdsl/FiLM')
```

**Types of Training Corpora 📚**

![image.png](https://cdn-uploads.huggingface.co/production/uploads/65254614785092cd47b1110b/-cT_wOabHugsct1mogOpa.png)
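**Example: extracting sentence embeddings**

A minimal usage sketch, assuming `transformers` and `torch` are installed. The example sentence and the mean-pooling step are illustrative choices, not part of the original model card:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# FiLM uses the RoBERTa-base vocabulary, so the tokenizer comes from 'roberta-base'.
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModel.from_pretrained('HYdsl/FiLM')
# For the SEC-filings variant, 'HYdsl/FiLM-SEC' is assumed to load the same way.

sentence = "The company's quarterly revenue exceeded analyst expectations."  # illustrative input
inputs = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token-level hidden states into a single sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```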