
"""Tokenization class for Perceiver."""

from typing import Optional

from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging


logger = logging.get_logger(__name__)


class PerceiverTokenizer(PreTrainedTokenizer):
    """
    Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes utf-8 encoding.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        pad_token (`str`, *optional*, defaults to `"[PAD]"`):
            The token used for padding, for example when batching sequences of different lengths.
        bos_token (`str`, *optional*, defaults to `"[BOS]"`):
            The BOS token (reserved in the vocab, but not actually used).
        eos_token (`str`, *optional*, defaults to `"[EOS]"`):
            The end of sequence token (reserved in the vocab, but not actually used).

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        mask_token (`str`, *optional*, defaults to `"[MASK]"`):
            The MASK token, useful for masked language modeling.
        cls_token (`str`, *optional*, defaults to `"[CLS]"`):
            The CLS token (reserved in the vocab, but not actually used).
        sep_token (`str`, *optional*, defaults to `"[SEP]"`):
            The separator token, which is used when building a sequence from two sequences.
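        model_max_length (`int`, *optional*, defaults to 2048):
            The maximum sentence length the model accepts.

    Example (a minimal usage sketch; the ids follow the byte-plus-six scheme implemented below):

    ```python
    >>> from transformers import PerceiverTokenizer

    >>> tokenizer = PerceiverTokenizer()
    >>> tokenizer("hello").input_ids  # [CLS]=4, one id per utf-8 byte (byte value + 6), [SEP]=5
    [4, 110, 107, 114, 114, 117, 5]
    ```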
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        pad_token="[PAD]",
        bos_token="[BOS]",
        eos_token="[EOS]",
        mask_token="[MASK]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        model_max_length=2048,
        **kwargs,
    ) -> None:
        # Plain strings become AddedToken objects so their stripping behavior is explicit.
        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        mask_token = AddedToken(mask_token, lstrip=False, rstrip=False) if isinstance(mask_token, str) else mask_token
        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token

        self._utf_vocab_size = 2**8  # utf is 8 bits

        # Since these tokens are not part of the byte vocabulary, we add them manually.
        self._added_tokens_decoder: dict[int, AddedToken] = {
            0: pad_token,
            1: bos_token,
            2: eos_token,
            3: mask_token,
            4: cls_token,
            5: sep_token,
        }
        self._num_special_tokens = len(self._added_tokens_decoder)
        super().__init__(
            pad_token=pad_token,
            bos_token=bos_token,
            eos_token=eos_token,
            mask_token=mask_token,
            cls_token=cls_token,
            sep_token=sep_token,
            model_max_length=model_max_length,
            **kwargs,
        )
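
    # Resulting id space: ids 0-5 are the special tokens above; each utf-8 byte b
    # maps to id b + 6 (see _convert_token_to_id / _convert_id_to_token below).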
    def get_vocab(self) -> dict[str, int]:
        vocab = {}
        for i in range(self._utf_vocab_size):
            token = chr(i)
            vocab[token] = i + self._num_special_tokens
        vocab.update(self.added_tokens_encoder)
        return vocab

    @property
    def vocab_size(self):
        return self._utf_vocab_size

    def get_special_tokens_mask(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None, already_has_special_tokens: bool = False
    ) -> list[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`list[int]`):
                List of IDs.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `list[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        # normal case: some special tokens
        if token_ids_1 is None:
            return [1] + [0] * len(token_ids_0) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
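
    # Note: bos_token and eos_token are reserved ids but are never inserted by this
    # tokenizer; sequences are delimited with cls_token and sep_token instead.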
    def build_inputs_with_special_tokens(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
    ) -> list[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks. A sequence has the
        following format:

        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`

        Args:
            token_ids_0 (`list[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `list[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
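
        Example (illustrative; `4` and `5` are the default `[CLS]` and `[SEP]` ids):

        ```python
        >>> tokenizer.build_inputs_with_special_tokens([110, 107])
        [4, 110, 107, 5]
        ```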
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        else:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + [self.sep_token_id]

    def _tokenize(self, text: str) -> list[str]:
        """Take as input a string and return a list of strings (tokens) for words/sub-words"""
        tokens = [chr(i) for i in text.encode("utf-8")]
        return tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        if len(token) != 1:
            token_id = self.unk_token_id
        else:
            token_id = ord(token) + self._num_special_tokens
        return token_id

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = chr(index - self._num_special_tokens)
        return token
    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        bstring = b""
        for token in tokens:
            if token in self.added_tokens_encoder:
                tok_string = str(token).encode("utf-8")
            else:
                tok_string = bytes([ord(token)])
            bstring += tok_string
        string = bstring.decode("utf-8", errors="replace")
        return string
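
    # PerceiverTokenizer keeps no vocabulary file on disk, so there is nothing to save.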
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> tuple[str]:
        return ()


__all__ = ["PerceiverTokenizer"]