8 Questions
What is the type of the normalizer in the provided JSON?
BertNormalizer
What is the purpose of the '[MASK]' token?
To represent a masked token
What is the prefix used by the decoder for subwords?
##
What is the maximum number of input characters per word?
100
What is the ID of the '[CLS]' token?
2
What is the type of the post processor?
TemplateProcessing
What is the purpose of the '[SEP]' token?
To separate sequences
What is the ID of the '[UNK]' token?
1
Study Notes
Tokenizer Configuration
- The tokenizer configuration is in JSON format with a version of "1.0".
- The configuration has no truncation or padding specified.
- There are 5 special tokens added to the tokenizer:
- [PAD] with id 0
- [UNK] with id 1
- [CLS] with id 2
- [SEP] with id 3
- [MASK] with id 4
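The top-level settings and special tokens listed above can be sketched as a small JSON fragment read with Python's standard library. The field names (`version`, `truncation`, `padding`, `added_tokens`, `content`, `special`) are assumptions based on the usual Hugging Face `tokenizers` file layout; the original JSON is not reproduced in these notes, and a real file carries more fields per token.

```python
import json

# Hypothetical fragment that mirrors the notes; not the original file.
CONFIG_TEXT = """
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {"id": 0, "content": "[PAD]",  "special": true},
    {"id": 1, "content": "[UNK]",  "special": true},
    {"id": 2, "content": "[CLS]",  "special": true},
    {"id": 3, "content": "[SEP]",  "special": true},
    {"id": 4, "content": "[MASK]", "special": true}
  ]
}
"""

config = json.loads(CONFIG_TEXT)

# Map each special token to its id, e.g. "[CLS]" -> 2.
special_ids = {
    tok["content"]: tok["id"]
    for tok in config["added_tokens"]
    if tok["special"]
}

print(config["version"], config["truncation"], config["padding"])
print(special_ids)
```

JSON `null` becomes Python `None`, which is how "no truncation or padding specified" shows up after loading.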
Normalizer
- The normalizer is of type "BertNormalizer".
- It cleans text, handles Chinese characters, and strips accents.
- It also performs lowercase conversion.
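The four normalization steps can be approximated with the standard library alone. This is a simplified sketch, not the library's implementation: the function names are hypothetical, and the Chinese-character check assumes only the main CJK Unified Ideographs block, whereas BERT-style normalizers cover more ranges.

```python
import unicodedata


def is_cjk(ch):
    # Assumption: only U+4E00..U+9FFF; real normalizers check more ranges.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def normalize(text):
    out = []
    for ch in text:
        if ch.isspace():
            # Clean text: map every whitespace character to a plain space.
            out.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Clean text: drop control and other non-printing characters.
            continue
        elif is_cjk(ch):
            # Handle Chinese characters: surround each one with spaces.
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    text = "".join(out)
    # Strip accents: decompose, then drop combining marks (category Mn).
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Lowercase conversion.
    return text.lower()


print(normalize("Héllo\tWörld"))  # hello world
```

Padding Chinese characters with spaces lets the later pre-tokenization step split them into one token per character.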
Pre-tokenizer
- The pre-tokenizer is of type "BertPreTokenizer".
Post-processor
- The post-processor is of type "TemplateProcessing".
- It has multiple templates for processing, but their details are not specified.
Special Tokens
- [CLS] is a special token with id "[CLS]" and type id 0.
- [SEP] is a special token with id "[SEP]" and type id 0.
- There are multiple sequences defined, including "A" and "B", with type ids 0 and 1 respectively.
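What the post-processor does with these tokens and type ids can be sketched in a few lines. The notes do not record the templates themselves, so the layout here is an assumption taken from the conventional BERT format: `[CLS] A [SEP]` for a single sequence and `[CLS] A [SEP] B [SEP]` for a pair, with the final `[SEP]` sharing type id 1 with sequence B. The function name `post_process` is hypothetical.

```python
CLS_ID = 2  # id of [CLS] from the special tokens list
SEP_ID = 3  # id of [SEP] from the special tokens list


def post_process(ids_a, ids_b=None):
    """Wrap one or two id sequences in special tokens and build type ids."""
    # Sequence A, with its [CLS] and [SEP], gets type id 0.
    ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    type_ids = [0] * len(ids)
    if ids_b is not None:
        # Assumed: sequence B and its closing [SEP] get type id 1.
        ids += list(ids_b) + [SEP_ID]
        type_ids += [1] * (len(ids_b) + 1)
    return ids, type_ids


print(post_process([10, 11]))
print(post_process([10, 11], [12]))
```

The type ids let a model tell the first sequence from the second when both are packed into one input.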
Decoder
- The decoder is of type "WordPiece".
- It uses the prefix "##" for subwords.
- It performs cleanup of the output.
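Decoding reverses the subword split: a piece that starts with "##" is glued onto the previous piece, and everything else is joined with spaces. The sketch below assumes a much smaller cleanup step than a real decoder performs (it only removes the space before a few punctuation marks), and the function name `decode` is hypothetical.

```python
def decode(tokens, prefix="##", cleanup=True):
    """Join WordPiece tokens back into text."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            # Continuation piece: append to the previous word without a space.
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    text = " ".join(words)
    if cleanup:
        # Simplified cleanup: drop the space before common punctuation.
        for mark in (".", ",", "!", "?"):
            text = text.replace(" " + mark, mark)
    return text


print(decode(["token", "##izer", "##s", "work", "."]))  # tokenizers work.
```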
Model
- The model is also of type "WordPiece".
- It uses [UNK] as the unknown token.
- It uses the prefix "##" for continuing subwords.
- It has a maximum input character limit of 100 per word.
- The vocabulary includes the special tokens, as well as other tokens, with their corresponding ids.
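WordPiece models are usually described as greedy longest-match-first: at each position, take the longest vocabulary entry that fits, and mark every non-initial piece with the "##" prefix. The sketch below follows that description under stated assumptions. The vocabulary beyond the five special tokens is invented for illustration, and the function name `wordpiece` is hypothetical.

```python
# Hypothetical vocabulary: special tokens from the notes plus made-up entries.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "token": 5, "##izer": 6, "##s": 7,
}


def wordpiece(word, vocab=VOCAB, unk="[UNK]", prefix="##", max_chars=100):
    """Split one word into subword pieces by greedy longest match."""
    if len(word) > max_chars:
        # Words over the character limit map straight to the unknown token.
        return [unk]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("tokenizers"))  # ['token', '##izer', '##s']
print([VOCAB[p] for p in wordpiece("tokenizers")])  # [5, 6, 7]
```

A word with any unmatched position becomes a single `[UNK]` rather than a mix of pieces and unknowns, which is why one missing character can hide a whole word.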
This quiz is about understanding the configuration of a tokenizer in JSON format. It covers the different parameters and settings used to customize the tokenizer's behavior.