
import json
import os
import warnings
from pathlib import Path
from shutil import copyfile
from typing import Any, Optional, Union

import sentencepiece

from ...tokenization_utils import PreTrainedTokenizer
from ...utils import logging
from ...utils.import_utils import requires


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {
    "source_spm": "source.spm",
    "target_spm": "target.spm",
    "vocab": "vocab.json",
    "target_vocab_file": "target_vocab.json",
    "tokenizer_config_file": "tokenizer_config.json",
}

SPIECE_UNDERLINE = "▁"


@requires(backends=("sentencepiece",))
class MarianTokenizer(PreTrainedTokenizer):
    r"""
    Construct a Marian tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        source_spm (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that
            contains the vocabulary for the source language.
        target_spm (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that
            contains the vocabulary for the target language.
        source_lang (`str`, *optional*):
            A string representing the source language.
        target_lang (`str`, *optional*):
            A string representing the target language.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        model_max_length (`int`, *optional*, defaults to 512):
            The maximum sentence length the model accepts.
        additional_special_tokens (`list[str]`, *optional*, defaults to `["<eop>", "<eod>"]`):
            Additional special tokens used by the tokenizer.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
            to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-dropout.

              - `nbest_size = {0,1}`: No sampling is performed.
              - `nbest_size > 1`: samples from the nbest_size results.
              - `nbest_size < 0`: assuming that nbest_size is infinite and samples from all hypotheses (lattice)
                using the forward-filtering-and-backward-sampling algorithm.

            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
              BPE-dropout.

    Examples:

    ```python
    >>> from transformers import MarianForCausalLM, MarianTokenizer

    >>> model = MarianForCausalLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    >>> tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    >>> src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
    >>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]  # optional
    >>> inputs = tokenizer(src_texts, text_target=tgt_texts, return_tensors="pt", padding=True)

    >>> outputs = model(**inputs)  # should work
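
    >>> # A hedged sketch (skipped as a doctest): `sp_model_kwargs` is forwarded to
    >>> # `SentencePieceProcessor`, e.g. to enable subword regularization, in which
    >>> # case sampled segmentations differ from call to call.
    >>> sampling_tokenizer = MarianTokenizer.from_pretrained(
    ...     "Helsinki-NLP/opus-mt-en-de",
    ...     sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
    ... )  # doctest: +SKIP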
    ```
    """

    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        source_spm,
        target_spm,
        vocab,
        target_vocab_file=None,
        source_lang=None,
        target_lang=None,
        unk_token="<unk>",
        eos_token="</s>",
        pad_token="<pad>",
        model_max_length=512,
        sp_model_kwargs: Optional[dict[str, Any]] = None,
        separate_vocabs=False,
        **kwargs,
    ) -> None:
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        assert Path(source_spm).exists(), f"cannot find spm source {source_spm}"

        self.separate_vocabs = separate_vocabs
        self.encoder = load_json(vocab)
        if str(unk_token) not in self.encoder:
            raise KeyError("<unk> token must be in the vocab")
        assert str(pad_token) in self.encoder

        if separate_vocabs:
            self.target_encoder = load_json(target_vocab_file)
            self.decoder = {v: k for k, v in self.target_encoder.items()}
            self.supported_language_codes = []
        else:
            self.decoder = {v: k for k, v in self.encoder.items()}
            self.supported_language_codes: list = [k for k in self.encoder if k.startswith(">>") and k.endswith("<<")]

        self.source_lang = source_lang
        self.target_lang = target_lang
        self.spm_files = [source_spm, target_spm]

        # load SentencePiece models for pre-processing
        self.spm_source = load_spm(source_spm, self.sp_model_kwargs)
        self.spm_target = load_spm(target_spm, self.sp_model_kwargs)
        self.current_spm = self.spm_source
        self.current_encoder = self.encoder

        self._setup_normalizer()

        super().__init__(
            source_lang=source_lang,
            target_lang=target_lang,
            unk_token=unk_token,
            eos_token=eos_token,
            pad_token=pad_token,
            model_max_length=model_max_length,
            sp_model_kwargs=self.sp_model_kwargs,
            target_vocab_file=target_vocab_file,
            separate_vocabs=separate_vocabs,
            **kwargs,
        )

    def _setup_normalizer(self):
        try:
            from sacremoses import MosesPunctNormalizer

            self.punc_normalizer = MosesPunctNormalizer(self.source_lang).normalize
        except (ImportError, FileNotFoundError):
            warnings.warn("Recommended: pip install sacremoses.")
            self.punc_normalizer = lambda x: x

    def normalize(self, x: str) -> str:
        """Cover moses empty string edge case. They return empty list for '' input!"""
        return self.punc_normalizer(x) if x else ""

    def _convert_token_to_id(self, token):
        return self.current_encoder.get(token, self.current_encoder[self.unk_token])

    def remove_language_code(self, text: str):
        """Remove language codes like >>fr<< before sentencepiece"""
        code = []
        if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
            code.append(text[: end_loc + 2])
            text = text[end_loc + 2 :]
        return code, text

    def _tokenize(self, text: str) -> list[str]:
        code, text = self.remove_language_code(text)
        pieces = self.current_spm.encode(text, out_type=str)
        return code + pieces

    def _convert_id_to_token(self, index: int) -> str:
        """Converts an index (integer) in a token (str) using the decoder."""
        return self.decoder.get(index, self.unk_token)

    def batch_decode(self, sequences, **kwargs):
        r"""
        Convert a list of lists of token ids into a list of strings by calling decode.

        Args:
            sequences (`Union[list[int], list[list[int]], np.ndarray, torch.Tensor, tf.Tensor]`):
                List of tokenized input ids. Can be obtained using the `__call__` method.
            skip_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not to remove special tokens in the decoding.
            clean_up_tokenization_spaces (`bool`, *optional*):
                Whether or not to clean up the tokenization spaces. If `None`, will default to
                `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
            use_source_tokenizer (`bool`, *optional*, defaults to `False`):
                Whether or not to use the source tokenizer to decode sequences (only applicable in sequence-to-sequence
                problems).
            kwargs (additional keyword arguments, *optional*):
                Will be passed to the underlying model specific decode method.

        Returns:
            `list[str]`: The list of decoded sentences.
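
        Example (a hedged sketch, skipped as a doctest, reusing the tokenizer and texts from the class docstring;
        target-side ids such as `labels` are decoded with the target SentencePiece model by default):

        ```python
        >>> batch = tokenizer(src_texts, text_target=tgt_texts, return_tensors="pt", padding=True)
        >>> tokenizer.batch_decode(batch["labels"], skip_special_tokens=True)  # doctest: +SKIP
        ['Ich bin ein kleiner Frosch.', 'Tom bat seinen Lehrer um Rat.']
        ```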
        """
        return super().batch_decode(sequences, **kwargs)

    def decode(self, token_ids, **kwargs):
        r"""
        Converts a sequence of ids into a string, using the tokenizer and vocabulary with options to remove special
        tokens and clean up tokenization spaces.

        Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.

        Args:
            token_ids (`Union[int, list[int], np.ndarray, torch.Tensor, tf.Tensor]`):
                List of tokenized input ids. Can be obtained using the `__call__` method.
            skip_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not to remove special tokens in the decoding.
            clean_up_tokenization_spaces (`bool`, *optional*):
                Whether or not to clean up the tokenization spaces. If `None`, will default to
                `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
            use_source_tokenizer (`bool`, *optional*, defaults to `False`):
                Whether or not to use the source tokenizer to decode sequences (only applicable in sequence-to-sequence
                problems).
            kwargs (additional keyword arguments, *optional*):
                Will be passed to the underlying model specific decode method.

        Returns:
            `str`: The decoded sentence.
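
        Example (a hedged sketch, skipped as a doctest; `use_source_tokenizer=True` decodes with the source-side
        SentencePiece model):

        ```python
        >>> ids = tokenizer("I am a small frog.")["input_ids"]
        >>> tokenizer.decode(ids, skip_special_tokens=True, use_source_tokenizer=True)  # doctest: +SKIP
        'I am a small frog.'
        ```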
        """
        return super().decode(token_ids, **kwargs)

    def convert_tokens_to_string(self, tokens: list[str]) -> str:
        """Uses source spm if _decode_use_source_tokenizer is True, and target spm otherwise"""
        sp_model = self.spm_source if self._decode_use_source_tokenizer else self.spm_target
        current_sub_tokens = []
        out_string = ""
        for token in tokens:
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                out_string += sp_model.decode_pieces(current_sub_tokens) + token + " "
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
        out_string += sp_model.decode_pieces(current_sub_tokens)
        out_string = out_string.replace(SPIECE_UNDERLINE, " ")
        return out_string.strip()

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> list[int]:
        """Build model inputs from a sequence by appending eos_token_id."""
        if token_ids_1 is None:
            return token_ids_0 + [self.eos_token_id]
        # We don't expect to process pairs, but leave the pair logic for API consistency
        return token_ids_0 + token_ids_1 + [self.eos_token_id]

    def _switch_to_input_mode(self):
        self.current_spm = self.spm_source
        self.current_encoder = self.encoder

    def _switch_to_target_mode(self):
        self.current_spm = self.spm_target
        if self.separate_vocabs:
            self.current_encoder = self.target_encoder
    @property
    def vocab_size(self) -> int:
        return len(self.encoder)

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        saved_files = []

        if self.separate_vocabs:
            out_src_vocab_file = os.path.join(
                save_directory,
                (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab"],
            )
            out_tgt_vocab_file = os.path.join(
                save_directory,
                (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["target_vocab_file"],
            )
            save_json(self.encoder, out_src_vocab_file)
            save_json(self.target_encoder, out_tgt_vocab_file)
            saved_files.append(out_src_vocab_file)
            saved_files.append(out_tgt_vocab_file)
        else:
            out_vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab"]
            )
            save_json(self.encoder, out_vocab_file)
            saved_files.append(out_vocab_file)

        for spm_save_filename, spm_orig_path, spm_model in zip(
            [VOCAB_FILES_NAMES["source_spm"], VOCAB_FILES_NAMES["target_spm"]],
            self.spm_files,
            [self.spm_source, self.spm_target],
        ):
            spm_save_path = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + spm_save_filename
            )
            if os.path.abspath(spm_orig_path) != os.path.abspath(spm_save_path) and os.path.isfile(spm_orig_path):
                copyfile(spm_orig_path, spm_save_path)
                saved_files.append(spm_save_path)
            elif not os.path.isfile(spm_orig_path):
                with open(spm_save_path, "wb") as fi:
                    content_spiece_model = spm_model.serialized_model_proto()
                    fi.write(content_spiece_model)
                saved_files.append(spm_save_path)

        return tuple(saved_files)

    def get_vocab(self) -> dict:
        return self.get_src_vocab()

    def get_src_vocab(self):
        return dict(self.encoder, **self.added_tokens_encoder)

    def get_tgt_vocab(self):
        return dict(self.target_encoder, **self.added_tokens_decoder)

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        # SentencePiece processors and the normalizer are not reliably picklable; they are reloaded in __setstate__
        state.update(
            dict.fromkeys(["spm_source", "spm_target", "current_spm", "punc_normalizer", "target_vocab_file"])
        )
        return state

    def __setstate__(self, d: dict) -> None:
        self.__dict__ = d

        # for backward compatibility
        if not hasattr(self, "sp_model_kwargs"):
            self.sp_model_kwargs = {}

        self.spm_source, self.spm_target = (load_spm(f, self.sp_model_kwargs) for f in self.spm_files)
        self.current_spm = self.spm_source
        self._setup_normalizer()

    def num_special_tokens_to_add(self, *args, **kwargs):
        """Just EOS"""
        return 1

    def _special_token_mask(self, seq):
        all_special_ids = set(self.all_special_ids)  # call it once instead of inside list comp
        all_special_ids.remove(self.unk_token_id)  # <unk> is only sometimes special
        return [1 if x in all_special_ids else 0 for x in seq]

    def get_special_tokens_mask(
        self, token_ids_0: list, token_ids_1: Optional[list] = None, already_has_special_tokens: bool = False
    ) -> list[int]:
        """Get list where entries are [1] if a token is [eos] or [pad] else 0."""
        if already_has_special_tokens:
            return self._special_token_mask(token_ids_0)
        elif token_ids_1 is None:
            return self._special_token_mask(token_ids_0) + [1]
        else:
            return self._special_token_mask(token_ids_0 + token_ids_1) + [1]


def load_spm(path: str, sp_model_kwargs: dict[str, Any]) -> sentencepiece.SentencePieceProcessor:
    spm = sentencepiece.SentencePieceProcessor(**sp_model_kwargs)
    spm.Load(path)
    return spm


def save_json(data, path: str) -> None:
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


def load_json(path: str) -> Union[dict, list]:
    with open(path, "r") as f:
        return json.load(f)


__all__ = ["MarianTokenizer"]