# IMDb WordPiece Tokenizer
A custom WordPiece tokenizer trained on the IMDb movie review dataset. This tokenizer is used in a research-style learning project where a Single Neuron RNN is implemented from scratch for sentiment classification.
## Dataset
- IMDb Movie Reviews (`stanfordnlp/imdb`): 50k labeled reviews for sentiment classification
## Tokenizer Details
- Algorithm: WordPiece
- Library: `tokenizers` (Hugging Face)
- Text normalization:
  - HTML tag removal
  - Lowercasing
  - Whitespace cleanup
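The three normalization steps above can be approximated with a small stand-alone function. This is a sketch using Python's `re` module under the assumption that tags are replaced by a space; it is not the tokenizer's actual normalizer pipeline:

```python
import re

def normalize(text: str) -> str:
    """Approximate the card's normalization: strip HTML tags,
    lowercase, and collapse whitespace (illustrative sketch only)."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tag removal
    text = text.lower()                        # lowercasing
    text = re.sub(r"\s+", " ", text).strip()   # whitespace cleanup
    return text

print(normalize("This movie was <br /><b>GREAT</b>!"))  # this movie was great !
```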
## Special Tokens
- `[UNK]`: unknown token
- `[CLS]`: classification token
- `[SEP]`: sentence separator
- `[PAD]`: padding token
- `[MASK]`: masked token
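To illustrate how these tokens are typically combined when preparing fixed-length inputs for a classifier, here is a hypothetical helper (not part of this repository): the sequence is wrapped with `[CLS]`/`[SEP]` and padded with `[PAD]` to a fixed length.

```python
def encode_for_rnn(tokens, max_len=8):
    # Hypothetical helper: wrap with [CLS]/[SEP], truncate if needed,
    # then pad with [PAD] up to max_len.
    seq = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    seq += ["[PAD]"] * (max_len - len(seq))
    return seq

print(encode_for_rnn(["this", "movie", "was", "great"]))
# ['[CLS]', 'this', 'movie', 'was', 'great', '[SEP]', '[PAD]', '[PAD]']
```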
## Files
- `tokenizer.json`
- `tokenizer_config.json`

`tokenizer.json` contains the full tokenizer pipeline, including the vocabulary, normalization, pre-tokenization, and the WordPiece model.
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Keyurjotaniya007/bert-imdb-tokenizer")

text = "This MOVie WAS AMazing!"
print(tokenizer.tokenize(text))
print(tokenizer(text))
```
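`tokenize` splits out-of-vocabulary words into subword pieces. The underlying WordPiece lookup is greedy longest-match-first; a minimal sketch with a toy vocabulary (the vocabulary and helper below are illustrative, not this tokenizer's actual vocabulary):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    # Greedy longest-match-first WordPiece segmentation (toy version).
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation prefix, as in BERT
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no valid segmentation for this word
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"amaz", "##ing", "movie", "was"}
print(wordpiece_tokenize("amazing", toy_vocab))  # ['amaz', '##ing']
```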
## Model Tree

- Base model: `google-bert/bert-base-uncased`