IMDb WordPiece Tokenizer

A custom WordPiece tokenizer trained on the IMDb movie review dataset. This tokenizer is used in a research-style learning project where a Single Neuron RNN is implemented from scratch for sentiment classification.

Dataset

  • IMDb Movie Reviews (stanfordnlp/imdb)
  • 50,000 labeled reviews (25k train / 25k test) for binary sentiment classification

Tokenizer Details

  • Algorithm: WordPiece

  • Library: tokenizers (Hugging Face)

  • Text normalization:

    • HTML tag removal
    • Lowercasing
    • Whitespace cleanup
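
The normalization steps above can be sketched with the `tokenizers` normalizer API. The exact regexes used during training are not published in this card, so the patterns below are assumptions that match the described behavior:

```python
from tokenizers import Regex, normalizers

# Assumed normalization pipeline: HTML removal, lowercasing, whitespace cleanup.
normalizer = normalizers.Sequence([
    normalizers.Replace(Regex(r"<[^>]+>"), " "),  # HTML tag removal
    normalizers.Lowercase(),                      # lowercasing
    normalizers.Replace(Regex(r"\s+"), " "),      # collapse runs of whitespace
    normalizers.Strip(),                          # trim leading/trailing spaces
])

print(normalizer.normalize_str("Great <br /> MOVIE!"))  # "great movie!"
```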

Special Tokens

  • [UNK] — unknown token
  • [CLS] — classification token
  • [SEP] — sentence separator
  • [PAD] — padding token
  • [MASK] — masked token
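
In BERT-style tokenizers, [CLS] and [SEP] are attached by a post-processor rather than appearing in the raw text. A minimal sketch of that wiring, using a tiny stand-in corpus in place of the IMDb reviews:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

# Hypothetical two-sentence corpus; the real tokenizer was trained on IMDb.
corpus = ["the movie was great", "the movie was terrible"]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap every single sequence as [CLS] ... [SEP].
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

enc = tokenizer.encode("the movie was great")
print(enc.tokens)  # first token is [CLS], last is [SEP]
```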

Files

tokenizer.json
tokenizer_config.json

tokenizer.json contains the full tokenizer pipeline: vocabulary, normalization, pre-tokenization, and the WordPiece model.
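
Because tokenizer.json serializes the whole pipeline, a saved tokenizer round-trips through a single file. A small self-contained sketch (the toy corpus stands in for the IMDb reviews):

```python
import os
import tempfile
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(special_tokens=["[UNK]", "[PAD]"])
tok.train_from_iterator(["good movie", "bad movie"], trainer=trainer)

# The entire pipeline (vocab, normalizer, pre-tokenizer, model) in one file.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)
reloaded = Tokenizer.from_file(path)
print(reloaded.get_vocab() == tok.get_vocab())  # True
```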

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Keyurjotaniya007/bert-imdb-tokenizer")

text = "This MOVie WAS AMazing!"
print(tokenizer.tokenize(text))  # normalized, lowercased WordPiece tokens
print(tokenizer(text))           # full encoding (input_ids, attention_mask, ...)