JSON Tokenizer Configuration
8 Questions

Questions and Answers

What is the type of the normalizer in the provided JSON?

  • BertNormalizer (correct)
  • TemplateProcessing
  • WordPiece
  • BertPreTokenizer

What is the purpose of the '[MASK]' token?

  • To represent a masked token (correct)
  • To represent a pad token
  • To represent a classification token
  • To represent an unknown token

What is the prefix used by the decoder for subwords?

  • ## (correct)
  • ^
  • #
  • %

What is the maximum number of input characters per word?

  • 100 (correct)

What is the ID of the '[CLS]' token?

  • 2 (correct)

What is the type of the post processor?

  • TemplateProcessing (correct)

What is the purpose of the '[SEP]' token?

  • To separate sequences (correct)

What is the ID of the '[UNK]' token?

  • 1 (correct)

    Study Notes

    Tokenizer Configuration

    • The tokenizer configuration is a JSON document whose version is "1.0".
    • No truncation or padding is configured.
    • Five special tokens are added to the tokenizer:
      • [PAD] with id 0
      • [UNK] with id 1
      • [CLS] with id 2
      • [SEP] with id 3
      • [MASK] with id 4
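
The top-level settings above can be sketched as a Python dictionary and serialized back to JSON. This is only an illustration: the key names (`truncation`, `padding`, `added_tokens`, `content`, `special`) are assumptions based on the layout such files commonly use, and real files usually carry more fields per token than shown here.

```python
import json

# Skeleton of the configuration described in the notes.
# Key names are assumed; only the values come from the notes.
config_skeleton = {
    "version": "1.0",
    "truncation": None,  # no truncation configured
    "padding": None,     # no padding configured
    "added_tokens": [
        {"id": 0, "content": "[PAD]", "special": True},
        {"id": 1, "content": "[UNK]", "special": True},
        {"id": 2, "content": "[CLS]", "special": True},
        {"id": 3, "content": "[SEP]", "special": True},
        {"id": 4, "content": "[MASK]", "special": True},
    ],
}

# Build a token -> id lookup from the added tokens.
special_ids = {tok["content"]: tok["id"] for tok in config_skeleton["added_tokens"]}

print(json.dumps(config_skeleton, indent=2))
print(special_ids["[CLS]"], special_ids["[UNK]"])
```

The lookup table makes it easy to check quiz answers such as the ids of [CLS] and [UNK].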

    Normalizer

    • The normalizer is of type "BertNormalizer".
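    • It cleans the text (removing control characters and normalizing whitespace), puts spaces around Chinese characters so that each one is tokenized separately, and strips accents.
    • It also lowercases the text.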
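
A rough standard-library sketch of three of these steps (text cleaning, accent stripping, lowercasing) may help make them concrete. It is not the library's implementation, the function name `normalize` is hypothetical, and Chinese-character handling is left out for brevity.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate the cleaning, accent-stripping and lowercasing steps."""
    cleaned = []
    for ch in text:
        if ch.isspace():
            # Any whitespace character becomes a plain space.
            cleaned.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop control and other non-printable characters.
            continue
        else:
            cleaned.append(ch)
    text = "".join(cleaned)

    # Strip accents: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

    return text.lower()


print(normalize("H\u00e9llo\tW\u00d6RLD"))  # hello world
```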

    Pre-tokenizer

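    • The pre-tokenizer is of type "BertPreTokenizer", which splits text on whitespace and punctuation and keeps each punctuation character as its own piece.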

    Post-processor

    • The post-processor is of type "TemplateProcessing".
    • It defines templates for single sequences and for sequence pairs; their contents are not reproduced in these notes.

    Special Tokens

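    • In the post-processor's templates, [CLS] is referenced by the string id "[CLS]" and assigned type id 0. This string id is separate from its numeric vocabulary id of 2.
    • [SEP] is likewise referenced by the string id "[SEP]" and assigned type id 0; its numeric vocabulary id is 3.
    • The templates also refer to two input sequences, "A" and "B", with type ids 0 and 1 respectively.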
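
As a sketch of what such templates typically produce, the function below wraps one or two token sequences in [CLS] and [SEP] and assigns type ids. The notes do not give the actual templates, so the layout is an assumption (the common BERT arrangement), including the choice to give the trailing [SEP] of a pair type id 1. The name `apply_template` is hypothetical.

```python
from typing import List, Optional, Tuple


def apply_template(
    a: List[str], b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Wrap sequence A (and optionally B) in special tokens and assign type ids."""
    tokens = ["[CLS]"] + a + ["[SEP]"]
    # [CLS], sequence A and the first [SEP] all use type id 0.
    type_ids = [0] * len(tokens)
    if b is not None:
        tokens += b + ["[SEP]"]
        # Assumption: sequence B and the trailing [SEP] use type id 1.
        type_ids += [1] * (len(b) + 1)
    return tokens, type_ids


tokens, type_ids = apply_template(["hello"], ["world", "!"])
print(tokens)    # ['[CLS]', 'hello', '[SEP]', 'world', '!', '[SEP]']
print(type_ids)  # [0, 0, 0, 1, 1, 1]
```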

    Decoder

    • The decoder is of type "WordPiece".
    • It uses the prefix "##" for subwords.
    • Cleanup is enabled, so the decoder removes tokenization artifacts from the output, such as spaces before punctuation.
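
The decoding step can be illustrated with a small sketch: pieces that start with the "##" prefix are glued to the previous piece, other pieces are joined with a space, and a simplified cleanup pass removes spaces before a few punctuation marks. The function name `decode_wordpiece` and the reduced cleanup rule are assumptions for illustration, not the library's exact behavior.

```python
from typing import List


def decode_wordpiece(tokens: List[str], prefix: str = "##", cleanup: bool = True) -> str:
    """Join WordPiece tokens back into text."""
    out = ""
    for tok in tokens:
        if tok.startswith(prefix):
            # Continuation piece: attach directly to the previous piece.
            out += tok[len(prefix):]
        elif out:
            out += " " + tok
        else:
            out = tok
    if cleanup:
        # Simplified cleanup: remove the space before common punctuation.
        for mark in (".", ",", "!", "?"):
            out = out.replace(" " + mark, mark)
    return out


print(decode_wordpiece(["token", "##izer", "##s", "are", "fun", "!"]))
# tokenizers are fun!
```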

    Model

    • The model is also of type "WordPiece".
    • It uses [UNK] as the unknown token.
    • It uses the prefix "##" for continuing subwords.
    • It allows at most 100 input characters per word; a longer word is mapped to [UNK] instead of being split into subwords.
    • The vocabulary includes the special tokens, as well as other tokens, with their corresponding ids.
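
The model settings above fit the usual greedy longest-match-first WordPiece procedure, sketched below. The vocabulary is a toy example: only the five special-token ids come from the notes, while the entries with ids 5 to 7 and the function name `wordpiece` are hypothetical.

```python
from typing import Dict, List


def wordpiece(
    word: str,
    vocab: Dict[str, int],
    unk: str = "[UNK]",
    prefix: str = "##",
    max_chars: int = 100,
) -> List[str]:
    """Split one word into subwords by greedy longest-match-first lookup."""
    if len(word) > max_chars:
        # Words over the character limit are not split at all.
        return [unk]
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark continuation pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary: ids 0-4 follow the notes, ids 5-7 are made up.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "token": 5, "##izer": 6, "##s": 7}

print(wordpiece("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
print(wordpiece("a" * 101, toy_vocab))     # ['[UNK]']
```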

    Quiz Team

    Description

    This quiz is about understanding the configuration of a tokenizer in JSON format. It covers the different parameters and settings used to customize the tokenizer's behavior.
