JSON Tokenizer Configuration

ThrillingAlbuquerque

8 Questions

What is the type of the normalizer in the provided JSON?

BertNormalizer

What is the purpose of the '[MASK]' token?

To represent a masked token

What is the prefix used by the decoder for subwords?

##

What is the maximum number of input characters per word?

100

What is the ID of the '[CLS]' token?

2

What is the type of the post processor?

TemplateProcessing

What is the purpose of the '[SEP]' token?

To separate sequences

What is the ID of the '[UNK]' token?

1

Study Notes

Tokenizer Configuration

  • The tokenizer configuration is in JSON format with a version of "1.0".
  • The configuration has no truncation or padding specified.
  • There are 5 special tokens added to the tokenizer:
    • [PAD] with id 0
    • [UNK] with id 1
    • [CLS] with id 2
    • [SEP] with id 3
    • [MASK] with id 4
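The settings above can be pictured as a small JSON fragment. The sketch below is an assumption about the layout: the key names (`added_tokens`, `content`, `special`) follow the common Hugging Face `tokenizer.json` convention and are not quoted from the quiz, and each entry is trimmed to a few fields.

```python
import json

# Hypothetical, trimmed fragment; key names are assumed, not quoted from the quiz.
config_text = """
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {"id": 0, "content": "[PAD]", "special": true},
    {"id": 1, "content": "[UNK]", "special": true},
    {"id": 2, "content": "[CLS]", "special": true},
    {"id": 3, "content": "[SEP]", "special": true},
    {"id": 4, "content": "[MASK]", "special": true}
  ]
}
"""

config = json.loads(config_text)

# Map each special token to its id, e.g. "[CLS]" -> 2.
special_ids = {tok["content"]: tok["id"] for tok in config["added_tokens"]}

# JSON null becomes Python None, which is how "not specified" shows up here.
has_truncation = config["truncation"] is not None
has_padding = config["padding"] is not None
```

Looking up `special_ids["[UNK]"]` in this fragment gives the id listed in the notes.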

Normalizer

  • The normalizer is of type "BertNormalizer".
  • It cleans text, handles Chinese characters, and strips accents.
  • It also performs lowercase conversion.
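As a rough illustration of two of these steps, lowercasing and accent stripping, here is a minimal standard-library sketch. It is not the BertNormalizer implementation: the function name `normalize` is hypothetical, and text cleaning and Chinese character handling are left out.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate lowercase + strip-accents normalization (illustrative only)."""
    lowered = text.lower()
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


example = normalize("Héllo Wörld")
```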

Pre-tokenizer

  • The pre-tokenizer is of type "BertPreTokenizer".

Post-processor

  • The post-processor is of type "TemplateProcessing".
  • It defines multiple processing templates, but these notes do not reproduce their details.

Special Tokens

  • In the post-processor, [CLS] is referenced by the string id "[CLS]" and assigned type id 0. This string id is separate from its numeric vocabulary id (2).
  • [SEP] is referenced by the string id "[SEP]" and assigned type id 0.
  • Two sequences are defined, "A" and "B", with type ids 0 and 1 respectively.
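To show how sequences and type ids fit together, here is a minimal sketch of template-style post-processing in plain Python. The function `apply_template` is hypothetical. The layout `[CLS] A [SEP]` for one sequence and `[CLS] A [SEP] B [SEP]` for a pair is the usual BERT convention and is an assumption here, as is giving the final [SEP] of a pair type id 1; the notes only state type id 0 for the special tokens.

```python
from typing import List, Optional, Tuple


def apply_template(
    a: List[str], b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Wrap one or two token sequences with [CLS]/[SEP] and build type ids."""
    # Single sequence: [CLS] A [SEP], everything with type id 0.
    tokens = ["[CLS]"] + a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if b is not None:
        # Pair: append B and a closing [SEP], both with type id 1 (assumed).
        tokens += b + ["[SEP]"]
        type_ids += [1] * (len(b) + 1)
    return tokens, type_ids


single = apply_template(["hello"])
pair = apply_template(["hello"], ["world", "!"])
```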

Decoder

  • The decoder is of type "WordPiece".
  • It uses the prefix "##" for subwords.
  • It performs cleanup of the output.
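Decoding can be sketched as gluing each "##" piece onto the previous token and then tidying the spacing. The `decode` function below is a hypothetical illustration; in particular, the cleanup step here only removes the space before a few punctuation marks, which is an assumption about what "cleanup" covers.

```python
from typing import List


def decode(tokens: List[str], prefix: str = "##") -> str:
    """Join WordPiece tokens back into text (illustrative sketch)."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            # Continuation piece: attach to the previous word without the prefix.
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    text = " ".join(words)
    # Assumed cleanup: remove the space inserted before common punctuation.
    for mark in (".", ",", "!", "?"):
        text = text.replace(" " + mark, mark)
    return text


decoded = decode(["token", "##izer", "config", "."])
```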

Model

  • The model is also of type "WordPiece".
  • It uses [UNK] as the unknown token.
  • It uses the prefix "##" for continuing subwords.
  • It has a maximum input character limit of 100 per word.
  • The vocabulary includes the special tokens, as well as other tokens, with their corresponding ids.
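The model settings above map onto the usual greedy longest-match-first WordPiece procedure. The sketch below assumes that procedure; the function name `wordpiece` and the toy vocabulary are hypothetical, and only the five special tokens and their ids come from the notes.

```python
from typing import Dict, List


def wordpiece(
    word: str,
    vocab: Dict[str, int],
    unk: str = "[UNK]",
    prefix: str = "##",
    max_chars: int = 100,
) -> List[str]:
    """Split one word into subword pieces, longest match first."""
    # Words over the character limit are mapped straight to the unknown token.
    if len(word) > max_chars:
        return [unk]
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary: special tokens from the notes plus made-up word pieces.
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "token": 5, "##izer": 6, "##s": 7,
}

pieces = wordpiece("tokenizers", toy_vocab)
```

A word with no matching pieces, or one longer than the limit, comes back as a single [UNK].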

This quiz is about understanding the configuration of a tokenizer in JSON format. It covers the different parameters and settings used to customize the tokenizer's behavior.
