
"""Tokenization class for SpeechT5."""

import os
from shutil import copyfile
from typing import Any, Optional

import sentencepiece as spm

from ...tokenization_utils import PreTrainedTokenizer
from ...utils import logging
from ...utils.import_utils import requires
from .number_normalizer import EnglishNumberNormalizer


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "spm_char.model"}


@requires(backends=("sentencepiece",))
class SpeechT5Tokenizer(PreTrainedTokenizer):
    """
    Construct a SpeechT5 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).

    This tokenizer inherits from [`PreTrainedTokenizer`], which contains most of the main methods. Users should refer
    to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The beginning-of-sequence token.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end-of-sequence token.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be
            this token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        normalize (`bool`, *optional*, defaults to `False`):
            Whether to convert numeric quantities in the text to their spelled-out English counterparts.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other
            things, to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

              - `nbest_size = {0,1}`: No sampling is performed.
              - `nbest_size > 1`: Samples from the `nbest_size` best results.
              - `nbest_size < 0`: Assumes the candidate set is infinite and samples from the whole hypothesis
                (lattice) using the forward-filtering-and-backward-sampling algorithm.

            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
              BPE-dropout.

    Attributes:
        sp_model (`SentencePieceProcessor`):
            The *SentencePiece* processor that is used for every conversion (string, tokens, and IDs).
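
    Example (a minimal usage sketch; the checkpoint name is illustrative, and any SpeechT5 checkpoint that ships a
    `spm_char.model` vocabulary behaves the same way):

    ```python
    >>> from transformers import SpeechT5Tokenizer

    >>> tokenizer = SpeechT5Tokenizer.from_pretrained("microsoft/speecht5_tts")
    >>> inputs = tokenizer("I owe you 12 dollars.", normalize=True)  # "12" is spelled out as "twelve"
    >>> tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])  # character-level pieces plus a trailing </s>
    ```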
    """

    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        bos_token="<s>",
        eos_token="</s>",
        unk_token="<unk>",
        pad_token="<pad>",
        normalize=False,
        sp_model_kwargs: Optional[dict[str, Any]] = None,
        **kwargs,
    ) -> None:
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        self.vocab_file = vocab_file
        self.normalize = normalize
        self._normalizer = None

        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            normalize=normalize,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )

    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        normalize = kwargs.pop("normalize", self.normalize)
        if is_split_into_words:
            text = " " + text
        if normalize:
            text = self.normalizer(text)
        return (text, kwargs)

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()

    @property
    def normalizer(self):
        if self._normalizer is None:
            self._normalizer = EnglishNumberNormalizer()
        return self._normalizer

    @normalizer.setter
    def normalizer(self, value):
        self._normalizer = value

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def __getstate__(self):
        state = self.__dict__.copy()
        # the SentencePiece processor is not picklable; it is rebuilt in __setstate__
        state["sp_model"] = None
        return state

    def __setstate__(self, d):
        self.__dict__ = d

        # for backward compatibility
        if not hasattr(self, "sp_model_kwargs"):
            self.sp_model_kwargs = {}

        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(self.vocab_file)

    def _tokenize(self, text: str) -> list[str]:
        """Take as input a string and return a list of strings (tokens) for words/sub-words"""
        return self.sp_model.encode(text, out_type=str)

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        current_sub_tokens = []
        out_string = ""
        prev_is_special = False
        for token in tokens:
            # make sure that special tokens are not decoded using the sentencepiece model
            if token in self.all_special_tokens:
                if not prev_is_special:
                    out_string += " "
                out_string += self.sp_model.decode(current_sub_tokens) + token
                prev_is_special = True
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
                prev_is_special = False
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> list[int]:
        """Build model inputs from a sequence by appending eos_token_id."""
        if token_ids_1 is None:
            return token_ids_0 + [self.eos_token_id]
        return token_ids_0 + token_ids_1 + [self.eos_token_id]

    def get_special_tokens_mask(
        self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None, already_has_special_tokens: bool = False
    ) -> list[int]:
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        suffix_ones = [1]
        if token_ids_1 is None:
            return ([0] * len(token_ids_0)) + suffix_ones
        return ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + suffix_ones

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)


__all__ = ["SpeechT5Tokenizer"]
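
# Illustrative sketch (not part of the library): `sp_model_kwargs`, documented in the class
# docstring above, is forwarded verbatim to `spm.SentencePieceProcessor`, so subword
# regularization can be enabled at construction time. The vocabulary path below is hypothetical.
#
#   tokenizer = SpeechT5Tokenizer(
#       "spm_char.model",
#       sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
#   )
#   tokenizer.tokenize("hello world")  # sampled segmentations can vary between calls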