A key challenge in LLM post-training is the lack of large, diverse, public synthetic datasets for analyzing data generating model (DGM) quality.
This paper addresses that gap by introducing WildChat-50m, a large dataset of chat transcripts from 50 diverse open-weight models, and investigates how the choice of DGM affects downstream supervised fine-tuning (SFT).
In short, WildChat-50m enables systematic analysis of synthetic data quality and its impact on SFT performance.
-----
https://arxiv.org/abs/2501.18511
📌 The WildChat-50m dataset is a valuable public resource. It allows direct comparative analysis of diverse Data Generating Models, enabling researchers to empirically optimize synthetic data selection for Supervised Fine-Tuning.
📌 The paper highlights the crucial role of Data Generating Model selection. High Synthetic Data Quality, driven by factors like clarity and comprehensiveness, is more impactful for Supervised Fine-Tuning than dataset size or blending.
📌 Counter to common intuition, blending diverse Data Generating Model responses provides no Supervised Fine-Tuning advantage over using a single, high-quality Data Generating Model. This suggests optimizing Data Generating Model quality is key.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces WildChat-50m, the largest public dataset of chat transcripts.
→ It extends the original WildChat dataset by including responses from 50 open-weight LLMs, ranging from 0.5B to 104B parameters.
→ Data was collected using vLLM for efficient LLM inference on a 12x8 H100 cluster (see the vLLM sketch after this list).
→ Each model participated in over 1 million multi-turn conversations, totaling over 125 million transcripts.
→ The dataset facilitates comparative analysis of runtime, VRAM efficiency, and response similarity across diverse LLMs.
→ Supervised Fine-Tuning (SFT) experiments were conducted using a new data mix called Re-Wild (RWD).
→ RWD combines high-quality DGM data from WildChat-50m with datasets enhancing world knowledge and math skills.
→ The SFT experiments fine-tuned the Llama-3.1 8B base model on RWD and evaluated it against strong baselines such as Tulu-3 (a fine-tuning sketch also follows this list).
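For context, here is a minimal sketch of how regenerating chat responses with vLLM could look. The model name, prompts, and sampling settings are illustrative assumptions, not the paper's actual collection pipeline.

```python
# Minimal sketch of regenerating chat responses with vLLM (assumed setup,
# not the paper's exact pipeline). Requires: pip install vllm
from vllm import LLM, SamplingParams

# Hypothetical user prompts; the real pipeline replays full multi-turn
# WildChat conversations through each of the 50 open-weight DGMs.
prompts = [
    "Explain the difference between supervised fine-tuning and RLHF.",
    "Write a short poem about gradient descent.",
]

# Any open-weight chat model can play the role of the data generating model.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# vLLM batches requests internally, which is what makes large-scale
# transcript regeneration feasible on a multi-GPU cluster.
conversations = [[{"role": "user", "content": p}] for p in prompts]
outputs = llm.chat(conversations, sampling_params=params)

for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out.outputs[0].text[:200])
```

The paper's exact Re-Wild recipe and hyperparameters are likewise not reproduced here; the sketch below only illustrates how SFT of a Llama-3.1 8B base model on a transcript-style mix could be wired up with Hugging Face TRL. The data file name and hyperparameters are placeholder assumptions.

```python
# Minimal SFT sketch (assumed setup, not the paper's training code).
# Requires: pip install trl transformers datasets
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder: a chat-transcript dataset in the conversational "messages"
# format; the paper's Re-Wild mix combines high-quality WildChat-50m
# responses with world-knowledge and math-focused datasets.
dataset = load_dataset("json", data_files="rewild_mix.jsonl", split="train")

config = SFTConfig(
    output_dir="llama31-8b-rewild-sft",
    per_device_train_batch_size=2,     # placeholder hyperparameters
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=2,
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",   # base model, as in the paper's setup
    train_dataset=dataset,
    args=config,
)
trainer.train()
```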
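After training, the resulting checkpoint would be evaluated on generalist chat and instruction-following benchmarks against baselines such as Tulu-3, as described in the paper.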
-----
Key Insights 💡:
→ The choice of Data Generating Model (DGM) significantly impacts the synthetic data quality (SDQ) and downstream SFT performance on generalist chat benchmarks.
→ Selecting a good DGM can compensate for smaller dataset size and outperform more complex methods in SFT.
→ Comprehensiveness, clarity, tone, and prompt responsiveness of a DGM are highly heritable during the SFT process.
→ World knowledge and mathematics skills are only heritable when the data is specifically curated for those skills.
→ Responses from diverse LLMs exhibit high pairwise similarity, suggesting broadly predictable output generation (see the similarity sketch after this list).
→ Larger models tend to generate more similar responses, indicating convergence towards a consensus response.
→ Blending different DGMs does not offer significant benefits over using a single high-quality DGM for SFT.
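The paper's specific similarity metric is not reproduced here; as a rough illustration, cross-model response similarity could be estimated with sentence-embedding cosine similarity over responses to the same prompts. The embedding model and example responses below are assumptions.

```python
# Rough sketch of measuring response similarity between two DGMs
# (illustrative; not the paper's exact metric).
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical responses from two different DGMs to the same two prompts.
responses_model_a = [
    "SFT adapts a pretrained model with labeled demonstrations.",
    "Gradient descent iteratively minimizes a loss function.",
]
responses_model_b = [
    "Supervised fine-tuning trains a base model on curated examples.",
    "Gradient descent reduces the loss step by step using gradients.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb_a = encoder.encode(responses_model_a, normalize_embeddings=True)
emb_b = encoder.encode(responses_model_b, normalize_embeddings=True)

# Cosine similarity per prompt (embeddings are already L2-normalized).
per_prompt_sim = np.sum(emb_a * emb_b, axis=1)
print("mean cross-model similarity:", float(per_prompt_sim.mean()))
```

Averaging such per-prompt similarities over many prompts and model pairs is one simple way to see whether larger models converge toward a consensus response, as the paper reports.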
-----
Results 📊:
→ Re-Wild (RWD) outperforms the Tulu 3 SFT mix despite using only about 40% of its data.
→ RWD achieves strong performance on generalist chat and instruction-following benchmarks, as shown in the paper's spider plot.
→ Qwen2.5-72B-Instruct is the slowest model at 3,163 tokens/second, while Llama-2-7b-chat-hf is the fastest at 37,357 tokens/second.
→ Input token processing is significantly faster than output generation, with a mean input-to-output ratio of 4.68 to 1 across the pre-trained models.