
DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue


Yijun Liao
Independent Researcher

Read the Paper (PDF) | Twitter Thread


📢 News

  • [2025/12/5] 🔥 Baseline Repair Evaluation Pipeline Released! Added eval_judge.py (LLM-as-a-Judge) and freeze_data.py.
  • [2025/12/2] 💻 Code Released! The full training framework and evaluation benchmarks are now available.
  • [2025/12/2] 🚀 Paper Released! We have uploaded the paper PDF directly to this repo.

⚡️ Abstract

In long-context dialogue systems, models suffer from State Inertia, where static constraints prevent resolving conflicts between evolving user intents (e.g., "I'm now Vegan") and established historical context. Standard alignment methods like DPO incur a massive "Alignment Tax" (perplexity explosion >100) when trying to force these updates.

We propose DZ-TDPO, a non-destructive alignment framework that synergizes:

  1. Conflict-Aware Dynamic KL Constraints (TDPO-DKL): optimization-level adjustment.
  2. Learnable Temporal Attention Bias (Dual-Zone Temporal Attention): representation-level filtering powered by semantic conflict detection.

Result: DZ-TDPO achieves State-of-the-Art win rates (50.8% on Qwen2.5-7B) on the Multi-Session Chat (MSC) dataset while maintaining robust zero-shot generalization and negligible perplexity overhead.


🌟 Key Results

1. SOTA Performance on Mutable State Tracking

DZ-TDPO significantly outperforms Standard DPO and SimPO on the MSC dataset, solving the "State Inertia" problem without destroying the model's general capabilities.

Method             Win Rate (MSC)   PPL (Validation)   Alignment Tax
Standard DPO       45.8%            102.3              💥 High
SimPO              46.4%            101.2              High
DZ-TDPO (Ours)     55.4%            26.0               ✅ Negligible

Note: On Qwen2.5-7B, DZ-TDPO achieves a 50.8% Win Rate.

2. Robustness (Needle-in-a-Haystack)

Does the temporal decay make the model "forgetful"? No. Our "Non-Conflicting Needle-in-a-Haystack" test confirms that DZ-TDPO retains 100% retrieval accuracy for non-conflicting facts, demonstrating precise attention regulation.
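
For context, such a probe buries a neutral fact at varying depths inside long filler text and then asks the model to recall it. The sketch below is illustrative only: the filler, needle, and scoring are assumptions, and the real prompts and sweep live in benchmarks/eval_needle.py.

# Hypothetical probe construction for the non-conflicting needle test.
FILLER = "The user chatted casually about their weekend plans. "  # assumed filler text
NEEDLE = "For reference, my locker code is 4815."                 # assumed neutral fact
QUESTION = "What is my locker code?"

def build_probe(approx_words: int, depth: float) -> str:
    """Bury the needle at a relative depth inside ~approx_words of filler."""
    n = max(1, approx_words // len(FILLER.split()))
    haystack = [FILLER] * n
    haystack.insert(int(depth * len(haystack)), NEEDLE + " ")
    return "".join(haystack) + "\n" + QUESTION

def is_retrieved(response: str) -> bool:
    return "4815" in response

# Sweep depths 0.0-1.0 at several context lengths (e.g., 2k-16k tokens) and
# report retrieval accuracy; eval_needle.py automates this for the checkpoints.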


🛠️ Methodology

DZ-TDPO introduces a Conflict-Aware Adaptive Decay mechanism.

  • Dual-Zone Temporal Attention: We map dialogue turns into a latent semantic space using SBERT. A learnable scalar $\lambda$ applies a temporal bias only when a semantic conflict is detected between the current instruction and history.
  • System Prompt Shielding: We enforce a hard constraint so that the "System Prompt" (Safety Constitution) is never decayed, preserving safety even during aggressive state updates.
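
The sketch below illustrates both ideas. It is not the repository's implementation (that lives in dz_tdpo/model.py and dz_tdpo/utils.py); the SBERT checkpoint, similarity threshold, and bias shape are assumptions chosen purely for illustration.

import torch
from sentence_transformers import SentenceTransformer, util

# Assumed encoder; the repository may use a different SBERT checkpoint.
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def conflict_mask(current_turn: str, history_turns: list[str]) -> torch.Tensor:
    """Crude per-turn conflict detector: high topical similarity between the
    current instruction and a past turn is used here as a stand-in for 'this
    turn may carry a stale, conflicting state'. Threshold 0.5 is an assumption."""
    emb = sbert.encode([current_turn] + history_turns, convert_to_tensor=True)
    sims = util.cos_sim(emb[0], emb[1:]).squeeze(0)   # [num_history]
    return (sims > 0.5).float()

def dual_zone_bias(conflicts: torch.Tensor, lam: torch.Tensor,
                   system_mask: torch.Tensor) -> torch.Tensor:
    """Additive attention bias per history turn: older turns flagged as
    conflicting are decayed by the learnable scalar `lam`; turns covered by
    `system_mask` (the system prompt) are shielded and never decayed."""
    num_history = conflicts.shape[0]
    age = torch.arange(num_history - 1, -1, -1, dtype=torch.float)  # older => larger
    bias = -lam * age * conflicts
    return torch.where(system_mask.bool(), torch.zeros_like(bias), bias)

# The per-turn bias would then be broadcast onto the attention logits of the
# corresponding history tokens before softmax, leaving non-conflicting turns
# (and the system prompt) untouched.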

📂 Project Structure

DZ-TDPO/
├── dz_tdpo/               # Core implementation package
│   ├── config.py          # Unified Configuration (TDPO & SimPO)
│   ├── loss.py            # Loss functions (TDPO-DKL, SimPO)
│   ├── model.py           # TemporalCausalLM & Attention Bias mechanism
│   ├── trainer.py         # Custom Trainer implementation
│   ├── utils.py           # Metrics & Semantic helpers
│   └── data/              # Data processing pipelines
│       ├── dataset.py     # Torch Dataset wrappers
│       ├── msc.py         # Multi-Session Chat (MSC) loader
│       └── ultrachat.py   # UltraChat loader
├── benchmarks/            # Comprehensive Evaluation Suite
│   ├── eval_tab60.py      # The "TAB-60" Adversarial Benchmark
│   ├── eval_needle.py     # Needle-in-a-Haystack Robustness Test
│   ├── eval_ppl.py        # Perplexity & Alignment Tax Evaluation
│   ├── eval_safety.py     # Context Flooding & Jailbreak Defense Test
│   ├── eval_pingpong.py   # Rapid Intent Switching Test (State Flip-Flop)
│   ├── eval_RULER.py      # Long-context Retrieval Stress Test
│   ├── merge_adapter.py   # Utility: Merge Custom Weights into HF Base Model
│   ├── eval_gen.py        # Quantitative: Generation Quality (BLEU, ROUGE, BERTScore)
│   ├── eval_judge.py      # LLM-as-a-Judge (DeepSeek v3.2)
│   ├── freeze_data.py     # Generate fixed test sets for reproducibility
│   └── eval_generation_universal.py # Batch generation script
├── baselines/
│   └── train_baselines.py # Training DPO and SimPO
├── train.py               # Main unified training entry point
├── test_cpu_dryrun.py     # Architecture integrity verification script
└── requirements.txt       # Project dependencies

💻 Installation & Usage

  1. Clone the repository:
git clone https://github.com/lyj20071013/DZ-TDPO.git
cd DZ-TDPO
  2. Install dependencies:
pip install -r requirements.txt

Note: For GPU acceleration, we recommend installing flash-attn separately.

🚀 Training

We provide a unified training script train.py that supports both DZ-TDPO and SimPO.

  1. Prepare Data: Download the Multi-Session Chat (MSC) dataset and place it in your data directory (e.g., ./data/msc).

  2. Run Training (DZ-TDPO): To train with Temporal Bias and Adaptive Decay (requires sentence-transformers):

python train.py \
    --model_name_or_path microsoft/Phi-3.5-mini-instruct \
    --data_dir ./data/msc \
    --output_dir ./checkpoints/dz_tdpo \
    --use_temporal_bias \
    --use_adaptive_tau \
    --batch_size 2 \
    --epochs 4
  3. Run Training (DPO/SimPO Baseline): Set --mode to dpo or simpo and point --output_dir at the matching checkpoint directory:
python baselines/train_baselines.py \
    --mode dpo \
    --model_name_or_path microsoft/Phi-3.5-mini-instruct \
    --data_dir ./data/msc \
    --output_dir ./checkpoints/dpo
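
For reference, the two baseline objectives in their standard form. This is a minimal sketch for orientation only; the repository's actual implementations live in dz_tdpo/loss.py and baselines/train_baselines.py and may differ in details such as label smoothing, averaging, or default hyperparameters.

import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO on summed log-probs of the chosen/rejected responses,
    contrasted against a frozen reference model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def simpo_loss(pi_chosen, pi_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: reference-free, length-normalized log-prob margin with a target
    margin gamma (hyperparameter values here are illustrative)."""
    margin = pi_chosen / len_chosen - pi_rejected / len_rejected
    return -F.logsigmoid(beta * margin - gamma).mean()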

🔬 Reproducible Evaluation Pipeline

To ensure the reproducibility of the Win Rate results reported in Table 1, we provide a complete evaluation pipeline consisting of data freezing, batch generation, and LLM-as-a-Judge automation.

1. Prepare Fixed Test Set

To guarantee that all models are evaluated on the exact same prompts, we first freeze a subset of the validation data.

# Generates 'fixed_msc_test.json' and 'fixed_ood_test.json' in ./data/
python benchmarks/freeze_data.py \
    --msc_data_dir ./data/msc \
    --output_dir ./data \
    --model_name microsoft/Phi-3.5-mini-instruct

2. Generate Responses

Use the universal generation script to generate model responses. For DZ-TDPO (Ours):

python benchmarks/eval_generation_universal.py \
    --ckpt_path ./checkpoints/dz_tdpo/final_model.pt \
    --base_model_path microsoft/Phi-3.5-mini-instruct \
    --data_path ./data/fixed_msc_test.json \
    --output_file results_dz_tdpo.json \
    --use_tab \
    --batch_size 1

Note: The --use_tab flag is critical to enable the Dual-Zone Temporal Attention mechanism during inference.

For Baselines (e.g., Standard DPO):

python benchmarks/eval_generation_universal.py \
    --ckpt_path ./checkpoints/dpo/final_model.pt \
    --base_model_path microsoft/Phi-3.5-mini-instruct \
    --data_path ./data/fixed_msc_test.json \
    --output_file results_baseline.json \
    --batch_size 8

3. Automated Judgment (LLM-as-a-Judge)

We employ DeepSeek-V3.2 (or GPT-5) as an impartial judge to compute the Win Rate. The script handles position bias mitigation (response-order swapping) and concurrent requests automatically; a minimal sketch of the swap logic appears at the end of this step.

Set up the API key:

export DEEPSEEK_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxx"
# Or for OpenAI: export OPENAI_API_KEY="sk-..."

Run Evaluation:

python benchmarks/eval_judge.py \
    --ours results_dz_tdpo.json \
    --baseline results_baseline.json \
    --output judge_verdict.json \
    --model deepseek-chat \
    --workers 10
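
To show what the swap step does, here is a minimal sketch of one judging call with position-swap debiasing. It is illustrative only: the actual prompt template, verdict parsing, and concurrency handling are in benchmarks/eval_judge.py, and the endpoint and model name below assume DeepSeek's OpenAI-compatible API.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")  # OpenAI-compatible endpoint

JUDGE_PROMPT = ("You are an impartial judge. Given the dialogue context and two "
                "responses, answer 'A' or 'B' for the better state tracking.\n\n"
                "Context:\n{context}\n\nResponse A:\n{a}\n\nResponse B:\n{b}")

def judge_once(context: str, a: str, b: str) -> str:
    reply = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, a=a, b=b)}],
        temperature=0.0,
    )
    return reply.choices[0].message.content.strip()

def judge_pair(context: str, ours: str, baseline: str) -> float:
    """Query twice with swapped positions to mitigate position bias; return
    the win credit for `ours` (1.0 win, 0.5 split, 0.0 loss)."""
    first = judge_once(context, ours, baseline).startswith("A")
    second = judge_once(context, baseline, ours).startswith("B")
    return (float(first) + float(second)) / 2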

📊 Evaluation

We provide a comprehensive suite of benchmarks to evaluate State Tracking, Robustness, and Safety.

0. Model Merging (Optional)

If you want to merge the learned temporal bias and weights into the base model for easier deployment (HuggingFace format):

python benchmarks/merge_adapter.py \
    --base_model_path microsoft/Phi-3.5-mini-instruct \
    --ckpt_path ./checkpoints/dz_tdpo/final_model.pt \
    --save_path ./merged_model
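
Once merged, the directory should load like any other HuggingFace checkpoint. A minimal usage sketch, assuming merge_adapter.py writes a standard transformers model directory as described above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./merged_model")
model = AutoModelForCausalLM.from_pretrained(
    "./merged_model", torch_dtype=torch.bfloat16, device_map="auto")

prompt = "System: You are a helpful assistant.\nUser: I'm vegan now. Suggest dinner.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))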

1. Qualitative Analysis (TAB-60 & PingPong)

  • TAB-60: Evaluates the model on 60 adversarial scenarios (e.g., rapid preference toggling, role reversal).
    python benchmarks/eval_tab60.py --ckpt_path ./checkpoints/dz_tdpo/final_model.pt
  • PingPong Test: Tests the model's stability under high-frequency state updates (e.g., Vegan <-> Meat eater every turn).
    python benchmarks/eval_pingpong.py --ckpt_path ./checkpoints/dz_tdpo/final_model.pt

2. Robustness (Needle & RULER)

  • Needle-in-a-Haystack: Checks if the model can retrieve non-conflicting facts from long contexts (2k-16k tokens).
    python benchmarks/eval_needle.py --ckpt_path ./checkpoints/dz_tdpo/final_model.pt
  • RULER: Stress tests long-context retrieval capabilities.
    python benchmarks/eval_RULER.py --ckpt_path ./checkpoints/dz_tdpo/final_model.pt

3. Quantitative Metrics (PPL & Generation)

  • Perplexity (Alignment Tax):
    python benchmarks/eval_ppl.py --data_dir ./data/msc --ckpt_path ./checkpoints/dz_tdpo/final_model.pt
  • Generation Quality (BLEU/ROUGE/BERTScore):
    python benchmarks/eval_gen.py --data_dir ./data/msc --ckpt_path ./checkpoints/dz_tdpo/final_model.pt
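
For orientation, the same metric families can also be computed with the HuggingFace evaluate library. A minimal sketch assuming plain lists of prediction and reference strings (eval_gen.py may use different backends or settings):

import evaluate

predictions = ["I'll suggest a vegan pasta tonight."]
references  = ["I will suggest a vegan pasta for dinner."]

bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(
    predictions=predictions, references=references)
bert = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en")

print(bleu["score"], rouge["rougeL"], sum(bert["f1"]) / len(bert["f1"]))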

📜 Citation

If you find this work helpful, please consider citing:

@misc{liao2025dztdponondestructivetemporalalignment,
      title={DZ-TDPO: Non-Destructive Temporal Alignment for Mutable State Tracking in Long-Context Dialogue}, 
      author={Yijun Liao},
      year={2025},
      eprint={2512.03704},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.03704}, 
}
