# IMDb WordPiece Tokenizer
A custom WordPiece tokenizer trained on the IMDb movie review dataset. This tokenizer is used in a research-style learning project where a Single Neuron RNN is implemented from scratch for sentiment classification.
## Dataset
- IMDb Movie Reviews (`stanfordnlp/imdb`): 50k labeled reviews for sentiment classification
## Tokenizer Details
- Algorithm: WordPiece
- Library: `tokenizers` (Hugging Face)
- Text normalization:
  - HTML tag removal
  - Lowercasing
  - Whitespace cleanup
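The three normalization steps above can be approximated with a small stand-alone function. This is a sketch using Python's `re` module under the assumption that tags are replaced by a space; it is not the tokenizer's actual normalizer pipeline:

```python
import re

def normalize(text: str) -> str:
    """Approximate the card's normalization: strip HTML tags,
    lowercase, and collapse whitespace (illustrative sketch only)."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tag removal
    text = text.lower()                        # lowercasing
    text = re.sub(r"\s+", " ", text).strip()   # whitespace cleanup
    return text

print(normalize("This movie was <br /><b>GREAT</b>!"))  # this movie was great !
```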
## Special Tokens
- `[UNK]`: unknown token
- `[CLS]`: classification token
- `[SEP]`: sentence separator
- `[PAD]`: padding token
- `[MASK]`: masked token
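To illustrate how these tokens are typically combined when preparing fixed-length inputs for a classifier, here is a hypothetical helper (not part of this repository): the sequence is wrapped with `[CLS]`/`[SEP]` and padded with `[PAD]` to a fixed length.

```python
def encode_for_rnn(tokens, max_len=8):
    # Hypothetical helper: wrap with [CLS]/[SEP], truncate if needed,
    # then pad with [PAD] up to max_len.
    seq = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    seq += ["[PAD]"] * (max_len - len(seq))
    return seq

print(encode_for_rnn(["this", "movie", "was", "great"]))
# ['[CLS]', 'this', 'movie', 'was', 'great', '[SEP]', '[PAD]', '[PAD]']
```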
## Files
- `tokenizer.json`
- `tokenizer_config.json`

`tokenizer.json` contains the full tokenizer pipeline, including the vocabulary, normalization, pre-tokenization, and the WordPiece model.
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Keyurjotaniya007/bert-imdb-tokenizer")

text = "This MOVie WAS AMazing!"
print(tokenizer.tokenize(text))
print(tokenizer(text))
```
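`tokenize` splits out-of-vocabulary words into subword pieces. The underlying WordPiece lookup is greedy longest-match-first; a minimal sketch with a toy vocabulary (the vocabulary and helper below are illustrative, not this tokenizer's actual vocabulary):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first WordPiece segmentation (toy version).
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation prefix, as in BERT
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no valid segmentation for this word
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"amaz", "##ing", "movie", "was"}
print(wordpiece_tokenize("amazing", toy_vocab))  # ['amaz', '##ing']
```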
## Model Tree

- Base model: `google-bert/bert-base-uncased`