"""Tokenization class for model ByT5."""

import warnings
from typing import Optional

from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging


logger = logging.get_logger(__name__)


class ByT5Tokenizer(PreTrainedTokenizer):
    """
    Construct a ByT5 tokenizer. ByT5 simply uses raw bytes utf-8 encoding.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        extra_ids (`int`, *optional*, defaults to 125):
            The number of extra ids added to the end of the vocabulary, for use as sentinels. These tokens are
            accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are
            indexed from the end of the vocabulary up to the beginning ("<extra_id_0>" is the last token in the
            vocabulary), as in ByT5 preprocessing (see
            [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)).
        additional_special_tokens (`list[str]`, *optional*):
            Additional special tokens used by the tokenizer.
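
    Example (a minimal usage sketch; the ids follow directly from the code below, where the three special tokens
    pad=0, eos=1 and unk=2 precede the 256 byte values):

    ```python
    >>> from transformers import ByT5Tokenizer

    >>> tokenizer = ByT5Tokenizer()
    >>> # "h" and "i" are utf-8 bytes 104 and 105, each shifted up by the offset of 3 special tokens.
    >>> tokenizer("hi")["input_ids"]
    [107, 108, 1]
    ```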
    """

    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        eos_token="</s>",
        unk_token="<unk>",
        pad_token="<pad>",
        extra_ids=125,
        additional_special_tokens=None,
        **kwargs,
    ) -> None:
        # Add the extra_ids to the special token list
        if extra_ids > 0 and additional_special_tokens is None:
            additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
        elif extra_ids > 0 and additional_special_tokens is not None and len(additional_special_tokens) > 0:
            # Check that we have the right number of extra_id special tokens
            extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
            if extra_tokens != extra_ids:
                raise ValueError(
                    f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are"
                    " provided to ByT5Tokenizer. In this case the additional_special_tokens must include the"
                    " extra_ids tokens"
                )

        pad_token = AddedToken(pad_token, lstrip=True, rstrip=True) if isinstance(pad_token, str) else pad_token
        # We force left and right stripping for backward compatibility; the ByT5 tests depend on this.
        eos_token = AddedToken(eos_token, lstrip=True, rstrip=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=True, rstrip=True) if isinstance(unk_token, str) else unk_token
        # The unk token needs to be in the vocab with the correct index.
        self._added_tokens_decoder = {0: pad_token, 1: eos_token, 2: unk_token}
        self.offset = len(self._added_tokens_decoder)
        self._utf_vocab_size = 2**8  # utf is 8 bits
        super().__init__(
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            extra_ids=0,  # the extra ids were already added to additional_special_tokens above
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )

    @property
    def vocab_size(self):
        return self._utf_vocab_size

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size + self.offset)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def get_special_tokens_mask(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None, already_has_special_tokens: bool = False
    ) -> list[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`list[int]`):
                List of IDs.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `list[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
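
        Example (an illustrative sketch; `[107, 108]` are the byte ids ByT5 assigns to "hi", and the trailing 1
        marks the position where the eos id would be appended):

        ```python
        >>> from transformers import ByT5Tokenizer

        >>> tokenizer = ByT5Tokenizer()
        >>> tokenizer.get_special_tokens_mask([107, 108])
        [0, 0, 1]
        ```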
T)r@   rA   rB   r   r   )r-   get_special_tokens_maskr%   )r/   r@   rA   rB   r3   s       r   rD   %ByT5Tokenizer.get_special_tokens_maskn   sw    $ &72']a 3  
 C#k**qc11c+&&1#-!s;7G1GHA3NNr   	token_idsc                     [        U5      S:  a9  US   U R                  :X  a&  [        R                  " SU R                   S35        U$ XR                  /-   $ )z.Do not add eos again if user already added it.r   zThis sequence already has zQ. In future versions this behavior may lead to duplicated eos tokens being added.)r%   eos_token_idwarningswarnr   )r/   rF   s     r   _add_eos_if_not_present%ByT5Tokenizer._add_eos_if_not_present   s[    y>A)B-43D3D"DMM,T^^,< =+ +  1 1222r   c                 r    U R                   /nUc  [        X-   5      S/-  $ [        X-   U-   U-   5      S/-  $ )ay  
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. ByT5 does not
        make use of token type ids, therefore a list of zeros is returned.

        Args:
            token_ids_0 (`list[int]`):
                List of IDs.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `list[int]`: List of zeros.
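
        Example (a small sketch; the output length is `len(token_ids_0) + 1` because the appended eos id is counted):

        ```python
        >>> from transformers import ByT5Tokenizer

        >>> tokenizer = ByT5Tokenizer()
        >>> tokenizer.create_token_type_ids_from_sequences([107, 108])
        [0, 0, 0]
        ```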
        """
        eos = [self.eos_token_id]

        if token_ids_1 is None:
            return len(token_ids_0 + eos) * [0]
        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]

    def build_inputs_with_special_tokens(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None
    ) -> list[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
        and adding special tokens. A sequence has the following format:

        - single sequence: `X </s>`
        - pair of sequences: `A </s> B </s>`

        Args:
            token_ids_0 (`list[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`list[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `list[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
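
        Example (a brief sketch; 1 is the eos id, appended once per sequence unless it is already present):

        ```python
        >>> from transformers import ByT5Tokenizer

        >>> tokenizer = ByT5Tokenizer()
        >>> tokenizer.build_inputs_with_special_tokens([107, 108])
        [107, 108, 1]
        >>> tokenizer.build_inputs_with_special_tokens([107, 108], [109])
        [107, 108, 1, 109, 1]
        ```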
)rL   )r/   r@   rA   s      r    build_inputs_with_special_tokens.ByT5Tokenizer.build_inputs_with_special_tokens   s9    & 22;?66{CK,,r   textc                 b    UR                  S5       Vs/ sH  n[        U5      PM     nnU$ s  snf )zPTake as input a string and return a list of strings (tokens) for words/sub-wordsutf-8)encodechr)r/   rU   r1   tokenss       r   	_tokenizeByT5Tokenizer._tokenize   s/    "&++g"67"6Q#a&"67 8s   ,c                 \    [        U5      S:w  a  SnU$ [        U5      U R                  -   nU$ )z0Converts a token (str) in an id using the vocab.r   N)r%   ordr+   )r/   tokentoken_ids      r   _convert_token_to_id"ByT5Tokenizer._convert_token_to_id   s4     u:?H  5zDKK/Hr   c                 4    [        XR                  -
  5      nU$ )z=Converts an index (integer) in a token (str) using the vocab.)rY   r+   )r/   indexr_   s      r   _convert_id_to_token"ByT5Tokenizer._convert_id_to_token   s    EKK'(r   c                    SnU Hk  nX0R                   ;   a  U R                   U   R                  S5      nO6X0R                  ;   a  UR                  S5      nO[        [	        U5      /5      nX$-  nMm     UR                  SSS9nU$ )z:Converts a sequence of tokens (string) in a single string.r   rW   ignore)errors)added_tokens_decoderrX   r<   bytesr^   decode)r/   rZ   bstringr_   
tok_stringstrings         r   convert_tokens_to_string&ByT5Tokenizer.convert_tokens_to_string   s    E111!66u=DDWM
333"\\'2
"CJ<0
!G  9r   save_directoryfilename_prefixc                     g)Nr#   r#   )r/   rr   rs   s      r   save_vocabularyByT5Tokenizer.save_vocabulary   s    r   )r*   r,   r+   )z</s>z<unk>z<pad>}   N)r   N)NFr6   )__name__
__module____qualname____firstlineno____doc__model_input_namesr.   propertyr7   r>   listintr   r   rD   rL   rP   rS   r   r[   ra   re   rp   tupleru   __static_attributes____classcell__)r3   s   @r   r	   r	      sw   @ %&67 "&%
 
%
 %
N $ $ sxO9O3;DI3FOkoO	cO O8	3c 	3tCy 	3 JN@9@3;DI3F@	c@. JN-9-3;DI3F-	c-4c d3i 

c HSM ]bcf]g  r   r	   )r|   rJ   typingr   tokenization_utilsr   r   utilsr   
get_loggerrx   loggerr	   __all__r#   r   r   <module>r      sB    )   A  
		H	%N' Nb 
r   