
import torch

from ..generation.continuous_batching import PagedAttentionCache
from ..utils import is_flash_attn_2_available


try:
    if is_flash_attn_2_available():
        from flash_attn import flash_attn_varlen_func
except Exception:
    pass


def paged_attention_forward(
    module: torch.nn.Module,
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    attention_mask: torch.Tensor = None,
    cache: PagedAttentionCache = None,
    cumulative_seqlens_q=None,
    cumulative_seqlens_k=None,
    max_seqlen_q=None,
    max_seqlen_k=None,
    block_tables=None,
    implementation=None,
    **kwargs,
) -> torch.Tensor:
    r"""Perform the forward pass of attention with paged key-value cache.

This function handles the cache updates and performs the attention computation
using the flash_attn_varlen_func for efficient processing.

Args:
    q: (total_q, nheads, headdim), where total_q = total number of query tokens in the batch.
    k: (total_k, nheads_k, headdim), where total_k = total number of key tokens in the batch. However, if a block table is provided, this can be the full paged key cache.
    v: (total_k, nheads_k, headdim), where total_k = total number of key tokens in the batch. However, if a block table is provided, this can be the full paged value cache.
    cumulative_seqlens_q: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths
       of the sequences in the batch, used to index into q.
    cumulative_seqlens_k: (batch_size + 1,), dtype torch.int32. The cumulative sequence lengths
       of the sequences in the batch, used to index into kv.
    max_seqlen_q: int. Maximum query sequence length in the batch.
    max_seqlen_k: int. Maximum key sequence length in the batch.
    dropout_p: float. Dropout probability.
    softmax_scale: float. The scaling of QK^T before applying softmax.
        Default to 1 / sqrt(headdim).
    causal: bool. Whether to apply causal attention mask (e.g., for auto-regressive modeling).
    window_size: (left, right). If not (-1, -1), implements sliding window local attention.
    softcap: float. Anything > 0 activates softcapping attention.
    """
    # Write the new key/value states into the paged cache and get back the
    # gathered key/value states for this layer.
    k, v = cache.update(k, v, module.layer_idx, cumulative_seqlens_k=cumulative_seqlens_k, **kwargs)

    sliding_window = (-1, -1) if not getattr(module, "sliding_window", False) else (module.sliding_window, 0)
    if implementation is not None:
        flash_attn_varlen_func = implementation.flash_attn_varlen_func
    custom_kwargs = {"s_aux": kwargs.get("s_aux")}
    attn_output = flash_attn_varlen_func(
        q.transpose(1, 2).squeeze(0).contiguous(),
        k.transpose(1, 2).squeeze(0).contiguous(),
        v.transpose(1, 2).squeeze(0).contiguous(),
        cumulative_seqlens_q.to(torch.int32),
        cumulative_seqlens_k.to(torch.int32).clone(),
        max_seqlen_q,
        max_seqlen_k,
        softmax_scale=module.scaling,
        causal=True,  # the varlen kernel aligns the causal mask to the bottom right when q is shorter than k
        window_size=sliding_window,  # (-1, -1) means an unbounded attention window
        **custom_kwargs,
    )
    if isinstance(attn_output, tuple):
        attn_output = attn_output[0]
    return attn_output, None
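

# ---------------------------------------------------------------------------
# Usage sketch (illustrative only, not part of the library module): shows how
# `paged_attention_forward` could be driven end to end. `_PassThroughCache`,
# `_EagerVarlenImpl`, and `_DummyAttention` below are hypothetical stand-ins,
# not real transformers classes; a real run would use a `PagedAttentionCache`
# and the flash-attn kernels. Here a naive eager attention is injected through
# the `implementation` hook so the sketch runs on CPU with PyTorch >= 2.1
# (for SDPA's `scale` argument) and without flash-attn installed.
#
#     import torch
#
#     class _PassThroughCache:
#         def update(self, key_states, value_states, layer_idx, cumulative_seqlens_k=None, **kwargs):
#             # A real PagedAttentionCache scatters key/value states into paged
#             # blocks and returns the gathered cache content; just pass through.
#             return key_states, value_states
#
#     class _EagerVarlenImpl:
#         @staticmethod
#         def flash_attn_varlen_func(q, k, v, cu_q, cu_k, max_q, max_k,
#                                    softmax_scale=None, causal=True,
#                                    window_size=(-1, -1), **kwargs):
#             # q/k/v arrive as (total_tokens, num_heads, head_dim); this toy
#             # version assumes a single sequence and ignores window_size/s_aux.
#             q_, k_, v_ = (t.transpose(0, 1) for t in (q, k, v))  # (heads, tokens, dim)
#             out = torch.nn.functional.scaled_dot_product_attention(
#                 q_, k_, v_, is_causal=causal, scale=softmax_scale
#             )
#             return out.transpose(0, 1)  # back to (tokens, heads, dim)
#
#     class _DummyAttention(torch.nn.Module):
#         layer_idx = 0
#         scaling = 64 ** -0.5  # 1 / sqrt(head_dim)
#
#     total_tokens, num_heads, head_dim = 8, 4, 64
#     # Inputs follow the (1, num_heads, total_tokens, head_dim) layout that the
#     # transpose/squeeze in the forward converts to flash-attn's varlen layout.
#     q = torch.randn(1, num_heads, total_tokens, head_dim)
#     k = torch.randn(1, num_heads, total_tokens, head_dim)
#     v = torch.randn(1, num_heads, total_tokens, head_dim)
#     cu_seqlens = torch.tensor([0, total_tokens], dtype=torch.int32)
#
#     out, _ = paged_attention_forward(
#         _DummyAttention(), q, k, v,
#         cache=_PassThroughCache(),
#         cumulative_seqlens_q=cu_seqlens,
#         cumulative_seqlens_k=cu_seqlens,
#         max_seqlen_q=total_tokens,
#         max_seqlen_k=total_tokens,
#         implementation=_EagerVarlenImpl,
#     )
#     print(out.shape)  # torch.Size([8, 4, 64])
# ---------------------------------------------------------------------------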