
"WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training"

The accompanying podcast on this paper was generated with Google's Illuminate.

A key challenge in LLM post-training research is the lack of large, diverse, public synthetic datasets for analyzing data generating model (DGM) quality.

This paper addresses the gap by introducing WildChat-50m, a large corpus of chat transcripts generated by over 50 open-weight models, and uses it to study how the choice of DGM shapes synthetic data quality and downstream supervised fine-tuning (SFT) performance.

-----

https://arxiv.org/abs/2501.18511

📌 The WildChat-50m dataset offers a valuable public resource: it enables direct comparative analysis of diverse Data Generating Models, letting researchers empirically optimize synthetic data selection for Supervised Fine-Tuning.

📌 The paper highlights the crucial role of Data Generating Model selection. High Synthetic Data Quality, driven by factors like clarity and comprehensiveness, is more impactful for Supervised Fine-Tuning than dataset size or blending.

📌 Counter to common intuition, blending diverse Data Generating Model responses provides no Supervised Fine-Tuning advantage over using a single, high-quality Data Generating Model. This suggests optimizing Data Generating Model quality is key.

-----

Methods Explored in this Paper 🔧:

→ The paper introduces WildChat-50m, the largest public dataset of chat transcripts.

→ It extends the original WildChat dataset by including responses from over 50 open-weight LLMs, ranging from 0.5B to 104B parameters.

→ Data was collected using vLLM for efficient LLM inference on a 12x8 H100 GPU cluster; a data-collection sketch follows this list.

→ Each model responded to over 1 million multi-turn conversations; in total, the dataset contains over 125 million chat transcripts.

→ The dataset facilitates comparative analysis of runtime, VRAM efficiency, and response similarity across diverse LLMs.

→ Supervised Fine-Tuning (SFT) experiments were conducted using a new data mix called Re-Wild (RWD).

→ RWD combines high-quality DGM data from WildChat-50m with datasets enhancing world knowledge and math skills.

→ The SFT experiments fine-tuned the Llama-3.1 8B base model on RWD and evaluated it against strong baselines such as Tulu-3; an SFT sketch also follows this list.
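
To make the collection pipeline concrete: a minimal sketch of replaying WildChat user turns through one DGM with vLLM. The model name, prompts, and sampling settings here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: regenerate WildChat assistant turns with one DGM via vLLM.
# Model name, prompts, and sampling settings are illustrative only.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # one of the ~50 DGMs (assumption)

# WildChat supplies the user turns; the DGM regenerates the assistant turns.
conversations = [
    [{"role": "user", "content": "Explain post-training in one paragraph."}],
    [{"role": "user", "content": "Write a haiku about synthetic data."}],
]

tok = AutoTokenizer.from_pretrained(MODEL)
prompts = [
    tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    for msgs in conversations
]

# tensor_parallel_size=8 mirrors one 8xH100 node; adjust to your hardware.
llm = LLM(model=MODEL, tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)

for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:200])
```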
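And for the fine-tuning step: a hedged sketch using Hugging Face TRL's SFTTrainer, assuming a recent TRL version and a conversational dataset. The dataset file and hyperparameters are placeholders, not the exact Re-Wild recipe.

```python
# Sketch: SFT of the base model on a chat mix with TRL's SFTTrainer.
# Dataset path and hyperparameters are placeholders, not the Re-Wild recipe.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumption: a JSONL file with a "messages" column of chat turns,
# standing in for the Re-Wild (RWD) mix.
train_ds = load_dataset("json", data_files="re_wild_mix.jsonl", split="train")

config = SFTConfig(
    output_dir="llama31-8b-rewild-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=2,
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # base (not instruct) model, per the paper
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```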

-----

Key Insights 💡:

→ The choice of Data Generating Model (DGM) significantly impacts the synthetic data quality (SDQ) and downstream SFT performance on generalist chat benchmarks.

→ Selecting a good DGM can compensate for smaller dataset size and outperform more complex methods in SFT.

→ Comprehensiveness, clarity, tone, and prompt responsiveness of a DGM are highly heritable during the SFT process.

→ World knowledge and mathematics skills are only heritable when the data is specifically curated for those skills.

→ LLM responses from diverse models exhibit high similarity, suggesting predictable output generation; a toy similarity measurement is sketched after this list.

→ Larger models tend to generate more similar responses, indicating convergence towards a consensus response.

→ Blending different DGMs does not offer significant benefits over using a single high-quality DGM for SFT.
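
To make the similarity claim concrete, here is one simple way such cross-model similarity could be measured; the paper's exact metric may differ, and the responses below are invented.

```python
# Toy measurement of cross-model response similarity: pairwise cosine
# similarity over TF-IDF vectors. The paper's exact metric may differ.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical responses from three DGMs to the same WildChat prompt.
responses = {
    "model_a": "Post-training adapts a pretrained LLM using curated data.",
    "model_b": "Post-training refines a pretrained model with curated data.",
    "model_c": "It is the stage after pretraining where the model is tuned.",
}

vectors = TfidfVectorizer().fit_transform(responses.values())
sims = cosine_similarity(vectors)

names = list(responses)
for i, j in combinations(range(len(names)), 2):
    print(f"{names[i]} vs {names[j]}: {sims[i, j]:.3f}")
```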

-----

Results 📊:

→ Re-Wild (RWD) outperforms the Tulu 3 SFT mix despite using only 40% of the data.

→ RWD achieves strong performance on generalist chat and instruction-following benchmarks, shown as a spider plot in the paper.

→ Among the models benchmarked, Qwen2.5-72B-Instruct is the slowest at 3,163 tokens/second, while Llama-2-7b-chat-hf is the fastest at 37,357 tokens/second.

→ Input token processing (prefill) is significantly faster than output generation (decode), with a mean throughput ratio of 4.68 to 1 across pre-trained models; a toy calculation below illustrates the arithmetic.
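
For intuition on that 4.68:1 figure, a back-of-envelope calculation with invented per-phase timings; real numbers would come from instrumented inference runs.

```python
# Back-of-envelope illustration of the prefill/decode asymmetry.
# Timings below are made up; real ratios come from measured runs.
runs = [
    # (input_tokens, prefill_seconds, output_tokens, decode_seconds)
    (2048, 0.50, 512, 0.60),
    (4096, 0.90, 768, 0.80),
]

ratios = []
for in_tok, t_in, out_tok, t_out in runs:
    input_tps = in_tok / t_in     # tokens/s while ingesting the prompt
    output_tps = out_tok / t_out  # tokens/s while generating the reply
    ratios.append(input_tps / output_tps)

print(f"mean input:output throughput ratio = {sum(ratios) / len(ratios):.2f}:1")
```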