ó <±h•7ãóÔ•SSKrSSKJr SSKJrJrJr SSKrSSK J r Jr SSKJ r SSKJr \(aSSKJr \ R$"\5rS S 0rSr\"SS 9"SS\55rS/rg)éN)Úcopyfile)Ú TYPE_CHECKINGÚAnyÚOptionalé)Ú AddedTokenÚPreTrainedTokenizer)Úlogging)Úrequires)Ú TextInputÚ vocab_fileztokenizer.modeluâ–)Ú sentencepiece)Úbackendsc ó†^•\rSrSrSr\rSS/rS S\\ \ \44U4SjjjrSr Sr\S 5rS rSSS \\ 4U4SjjrSrSrSrSrS!S\\ S \\ 4SjjrS!SjrS"S\\S\\\S\S \\4U4SjjjrS!S\\S\\\S \\4SjjrS#S\\S\S\S \ 4SjjrSr U=r!$)$ÚGemmaTokenizeré+aa Construct a Gemma tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model. Args: vocab_file (`str`): Path to the vocabulary file. unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`): The end of sequence token. pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `""`): A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by attention mechanisms or loss computation. sp_model_kwargs (`dict[str, Any]`, `Optional`, *optional*): Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set: - `enable_sampling`: Enable subword regularization. - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout. - `nbest_size = {0,1}`: No sampling is performed. - `nbest_size > 1`: samples from the nbest_size results. - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice) using forward-filtering-and-backward-sampling algorithm. - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. add_bos_token (`bool`, *optional*, defaults to `True`): Whether or not to add an `bos_token` at the start of sequences. add_eos_token (`bool`, *optional*, defaults to `False`): Whether or not to add an `eos_token` at the end of sequences. clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`): Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces. use_default_system_prompt (`bool`, *optional*, defaults to `False`): Whether or not the default system prompt for Gemma should be used. spaces_between_special_tokens (`bool`, *optional*, defaults to `False`): Whether or not to add spaces between special tokens. Ú input_idsÚattention_maskÚsp_model_kwargscó>•Uc0OUUl[U[5(a[USSS9OUn[U[5(a[USSS9OUn[U[5(a[USSS9OUn[U[5(a[USSS9OUnXlXplX€lX l[R"S0URD6Ul URRU5 [T U]4"SUUUUUUUU U US. UD6 g)NFT)Ú normalizedÚspecial) Ú bos_tokenÚ eos_tokenÚ unk_tokenÚ pad_tokenÚ add_bos_tokenÚ add_eos_tokenrÚclean_up_tokenization_spacesÚuse_default_system_promptÚspaces_between_special_tokens©)rÚ isinstanceÚstrrr rrr ÚspmÚSentencePieceProcessorÚsp_modelÚLoadÚsuperÚ__init__)Úselfr rrrrrrrrr r!ÚkwargsÚ __class__s €Úd/var/www/html/shao/venv/lib/python3.13/site-packages/transformers/models/gemma/tokenization_gemma.pyr*ÚGemmaTokenizer.__init__^sø€ð&5Ñ%<™rÀ/ˆÔÜMWÐXaÔcf×MgÑMg”J˜y°UÀDÒIÐmvˆ ÜMWÐXaÔcf×MgÑMg”J˜y°UÀDÒIÐmvˆ ÜMWÐXaÔcf×MgÑMg”J˜y°UÀDÒIÐmvˆ ÜMWÐXaÔcf×MgÑMg”J˜y°UÀDÒIÐmvˆ à$ŒØ*ÔØ*ÔØ)BÔ&Ü×2Ò2ÑJ°T×5IÑ5IÑJˆŒ Ø ‰ ×Ñ˜:Ô&ä ‰Òð ØØØØØ'Ø'Ø+Ø)EØ&?Ø*Gñ ðó ócó~•URR5nSUS'URR5US'U$)Nr'Úsp_model_proto)Ú__dict__Úcopyr'Úserialized_model_proto)r+Ústates r.Ú__getstate__ÚGemmaTokenizer.__getstate__ˆs;€Ø— ‘ ×"Ñ"Ó$ˆØ ˆˆjÑØ"&§-¡-×"FÑ"FÓ"HˆÐÑØˆr0cóÎ•URRU5 [R"S0URD6UlUR R UR5 g)Nr")r3Úupdater%r&rr'ÚLoadFromSerializedProtor2)r+Úds r.Ú__setstate__ÚGemmaTokenizer.__setstate__ŽsG€Ø ‰ ×Ñ˜QÔÜ×2Ò2ÑJ°T×5IÑ5IÑJˆŒ Ø ‰ ×-Ñ-¨d×.AÑ.AÕBr0có6•URR5$)zReturns vocab size)r'Úget_piece_size)r+s r.Ú vocab_sizeÚGemmaTokenizer.vocab_size“s€ð}‰}×+Ñ+Ó-Ð-r0có¬•[UR5Vs0sHoRU5U_M nnURUR5 U$s snf)zReturns vocab as a dict)ÚrangerAÚconvert_ids_to_tokensr:Úadded_tokens_encoder)r+ÚiÚvocabs r.Ú get_vocabÚGemmaTokenizer.get_vocab˜sL€ä;@ÀÇÁÔ;QÓRÑ;Q°a×+Ñ+¨AÓ.°Ò1Ñ;QˆÐRØ ‰T×.Ñ.Ô/ØˆùòSs˜AÚtextrÚreturncó&>•[TU]"U40UD6$)zE Args: text: TextInput Simply calls PreTrainedTokenizer's method )r)Útokenize)r+rKr,r-s €r.rNÚGemmaTokenizer.tokenizežsø€ô‰wÒ Ñ/¨Ñ/Ð/r0có>•URRU[S9$)zf Args: text: TextInput Returns a tokenized string. The Gemma tokenizer never adds a prefix space. )Úout_type)r'Úencoder$)r+rKr,s r.Ú _tokenizeÚGemmaTokenizer._tokenize¦s€ð}‰}×#Ñ# D´3Ð#Ð7Ð7r0có8•URRU5$)z0Converts a token (str) in an id using the vocab.)r'Úpiece_to_id)r+Útokens r.Ú_convert_token_to_idÚ#GemmaTokenizer._convert_token_to_id®s€à}‰}×(Ñ(¨Ó/Ð/r0có<•URRU5nU$)z=Converts an index (integer) in a token (str) using the vocab.)r'Ú IdToPiece)r+ÚindexrWs r.Ú_convert_id_to_tokenÚ#GemmaTokenizer._convert_id_to_token²s€à— ‘ ×'Ñ'¨Ó.ˆØˆr0cóâ•/nSnUHGnX@R;a$X0RRU5U-- n/nM6URU5 MI X0RRU5- nU$)z:Converts a sequence of tokens (string) in a single string.Ú)Ú_added_tokens_encoderr'ÚdecodeÚappend)r+ÚtokensÚcurrent_sub_tokensÚ out_stringrWs r.Úconvert_tokens_to_stringÚ'GemmaTokenizer.convert_tokens_to_string·st€àÐØˆ ÛˆEà×2Ñ2Ó2ØŸm™m×2Ñ2Ð3EÓFÈÑNÑN Ø%'Ò"à"×)Ñ)¨%Ö0ñ ð —m‘m×*Ñ*Ð+=Ó>Ñ>ˆ ØÐr0Úfilename_prefixcó•[RRU5(d[R SUS35 g[RRX(aUS-OS[S-5n[RRUR5[RRU5:waG[RRUR5(a[URU5 U4$[RRUR5(dC[US5nURR5nURU5 SSS5 U4$U4$!,(df U4$=f)zÍ Save the vocabulary and special tokens file to a directory. Args: save_directory (`str`): The directory in which to save the vocabulary. Returns: `Tuple(str)`: Paths to the files saved. zVocabulary path (z) should be a directoryNÚ-r`r Úwb)ÚosÚpathÚisdirÚloggerÚerrorÚjoinÚVOCAB_FILES_NAMESÚabspathr ÚisfilerÚopenr'r5Úwrite)r+Úsave_directoryriÚout_vocab_fileÚfiÚcontent_spiece_models r.Úsave_vocabularyÚGemmaTokenizer.save_vocabularyÅs.€ôw‰w}‰}˜^×,Ñ,ÜL‰LÐ,¨^Ð,<Ð•U(a[TU]XSS9$UR(aS/O/nUR(aS/O/nUcUS/[ U5--U-$US/[ U5--U-U-S/[ U5--U-$)ad Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method. Args: token_ids_0 (`list[int]`): List of IDs. token_ids_1 (`list[int]`, *optional*): Optional second list of IDs for sequence pairs. already_has_special_tokens (`bool`, *optional*, defaults to `False`): Whether or not the token list is already formatted with special tokens for the model. Returns: `list[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. T)rƒr„rˆér)r)Úget_special_tokens_maskrrÚlen)r+rƒr„rˆr€rr-s €r.r‹Ú&GemmaTokenizer.get_special_tokens_maskës¹ø€ö$&Ü‘7Ñ2Ø'Ð]að3ðð ð#×0×0˜‘s°bˆØ"×0×0˜‘s°bˆàÑØ A 3¬¨[Ó)9Ñ#9Ñ:¸\ÑIÐIàØˆs”S˜Ó%Ñ%ñ 'àñ ðñ ðˆs”S˜Ó%Ñ%ñ 'ð ñ ð r0cóâ•UR(a UR/O/nUR(a UR/O/nS/[ X1-U-5-nUbUS/[ X2-U-5-- nU$)aM Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format: ``` 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second sequence | ``` if token_ids_1 is None, only returns the first portion of the mask (0s). Args: token_ids_0 (`list[int]`): List of ids. token_ids_1 (`list[int]`, *optional*): Optional second list of IDs for sequence pairs. Returns: `list[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s). rrŠ)rr€rrrŒr‚s r.Ú$create_token_type_ids_from_sequencesÚ3GemmaTokenizer.create_token_type_ids_from_sequencessv€ð./3×.@×.@˜×)Ñ)Ñ*ÀbˆØ.2×.@×.@˜×)Ñ)Ñ*Àbˆà”s˜<Ñ5¸ÑDÓEÑEˆàÑ"ØqcœC Ñ :¸\Ñ IÓJÑJÑJˆFàˆ r0Ú token_idsÚskip_special_tokensr!có(•/n/nUH˜nU(aXpR;aMXpR;a]U(a*URURR U55 URURUR 5 /nM‡URU5 Mš U(a*URURR U55 U(aSR U5nOSR U5nUR[S5$)NÚ r`) Úall_special_idsÚ_added_tokens_decoderrcr'rbÚcontentrrÚreplaceÚSPIECE_UNDERLINE)r+r‘r’r!r,Ú sub_textsÚcurrent_sub_textÚidss r.Ú_decodeÚGemmaTokenizer._decode1sà€ðˆ ØÐÛˆCÞ" s×.BÑ.BÓ'BÙØ×0Ñ0Ó0Þ#Ø×$Ñ$ T§]¡]×%9Ñ%9Ð:JÓ%KÔLØ× Ñ ×!;Ñ!;¸CÑ!@×!HÑ!HÔIØ#%Ò à ×'Ñ'¨Ö,ñöØ×Ñ˜TŸ]™]×1Ñ1Ð2BÓCÔDæ(ØŸ™ Ó+‰IàŸ™ Ó*ˆIà× Ñ Ô!1°3Ó7Ð7r0)rrr'rr r ) zzzzNTFFFFr)NF)FF)"Ú__name__Ú __module__Ú__qualname__Ú__firstlineno__Ú__doc__rsÚvocab_files_namesÚmodel_input_namesrÚdictr$rr*r7r=ÚpropertyrArIÚlistrNrSrXr]rgÚtupler|r†ÚintÚboolr‹rrÚ__static_attributes__Ú __classcell__)r-s@r.rr+s†ø†ñ,ð\*ÐØ$Ð&6Ð7Ðð ØØØØ48ØØØ%*Ø"'Ø&+ñ( ð" $ s¨C x¡.Ñ1÷( ð( òTòCð ñ.óð.òð0˜[ð0°t¸C±y÷0ò8ò0òò ñ!¸xÈ¹}ð!ÐX]Ð^aÑXbõ!ô6 ðsxñ# Ø ™9ð# Ø3;¸DÀ¹IÑ3Fð# Økoð# à ˆc‰÷# ð# ðLJNñØ ™9ðØ3;¸DÀ¹IÑ3Fðà ˆc‰õðH%*Ø.3ñ 8à˜‘9ð8ð"ð8ð(,ð 8ð ÷ 8ó8r0r)rmÚshutilrÚtypingrrrrr%Útokenization_utilsrr Úutilsr Úutils.import_utilsrÚtokenization_utils_baserÚ get_loggerrŸrprsr™rÚ__all__r"r0r.Úr¶syðó, Ýß/Ñ/ãçAÝÝ*öÝ4à × Ò ˜HÓ %€à!Ð#4Ð5ÐàÐñ Ð%Ñ&ô`8Ð(ó`8ó'ð`8ðF Ð r0