Abstract
- Continuing the approach of phi-2, the report presents parameter-scaling results with phi-3-mini (3.8B), phi-3-small (7B), and phi-3-medium (14B).
    - phi-3-mini (3.8B) is trained on 3.3T tokens.
    - phi-3-small (7B) and phi-3-medium (14B) are trained on 4.8T tokens.
- Unlike the previous model series, versions that have gone through alignment (post-training) are released as well.
- Going from phi-3-mini → phi-3-small → phi-3-medium, (MMLU, MT-bench) improves from (69, 8.38) → (75, 8.7) → (78, 8.9).
- Also releases phi-3-vision (4.2B, image-text input, text output), built on top of phi-3-mini (3.8B).
Introduction
- Further extends the approach of the previous model series (phi-1, phi-1.5, phi-2).
In our previous works on the phi models [GZA+23, LBE+23, JBA+23] it was shown that a combination of LLM-based filtering of publicly available web data, and LLM-created synthetic data, enable performance in smaller language models that were typically seen only in much larger models.
- Compared to phi-2, phi-3-mini scales the model size from 2.7B → 3.8B and the training tokens from 1.4T → 3.3T.
- In particular, the report emphasizes that phi-3-mini runs in a mobile environment with performance comparable to GPT-3.5 and Mixtral 8x7B.
Technical Specifications
Model
- Common
    - Regardless of model, context length is extended using LongRoPE (see the sketch below).
phi-3-mini
phi-3-small
phi-3-medium
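A minimal sketch of the general idea behind RoPE-based context extension, assuming plain uniform position interpolation; this is not the actual LongRoPE procedure, which searches for non-uniform per-frequency rescale factors. All names (`rope_angles`, `apply_rope`, the 4K → 128K scale of 32) are illustrative.

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Return RoPE rotation angles of shape (seq_len, head_dim // 2).

    scale > 1 compresses positions (position interpolation), so a model trained
    with context L can be run at roughly scale * L without the rotation angles
    leaving the range seen during training. Uniform scaling is used here purely
    for illustration; LongRoPE instead learns non-uniform per-frequency factors.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_pos = positions.float() / scale
    return torch.outer(scaled_pos, inv_freq)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors x of shape (seq_len, head_dim) by `angles`."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: stretching a 4K-trained model to 128K context -> scale factor 32.
positions = torch.arange(8)                    # a few positions for illustration
angles = rope_angles(positions, head_dim=64, scale=32.0)
q = torch.randn(8, 64)
q_rot = apply_rope(q, angles)
```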
Highly capable language model running locally on a cell-phone
Thanks to its small size, phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second.
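As a rough sanity check of that figure: 3.8B parameters × 0.5 bytes ≈ 1.9 GB of weights, in line with the reported ≈ 1.8 GB. Below is a minimal sketch of loading a 4-bit-quantized phi-3-mini with Hugging Face transformers + bitsandbytes (NF4); this is an illustrative assumption, not the on-device iPhone runtime used in the paper, and the checkpoint ID `microsoft/Phi-3-mini-4k-instruct` is the public Hugging Face release rather than necessarily the exact model evaluated.

```python
# Sketch only: illustrates 4-bit weight quantization on a CUDA machine,
# not the native, fully offline iPhone deployment described in the paper.
# Requires: pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # public HF checkpoint (assumed)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Back-of-the-envelope memory: 3.8e9 params * 0.5 byte ≈ 1.9 GB of weights,
# consistent with the ≈ 1.8 GB reported after 4-bit quantization.
inputs = tokenizer("Explain why the sky is blue.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```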
Training Methodology
pre-training
post-training
Data Optimal Regime
- For small-scale LLMs, data that can teach reasoning ability is far more important (see the filtering sketch after this list).
We try to calibrate the training data to be closer to the “data optimal” regime for small models. In particular, we filter the publicly available web data to contain the correct level of “knowledge” and keep more web pages that could potentially improve the “reasoning ability” for the model. As an example, the result of a game in premier league in a particular day might be good training data for frontier models, but we need to remove such information to leave more model capacity for “reasoning” for the mini size models.
- For larger-scale LLMs, factual knowledge also becomes important data.
We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the “data optimal regime” for 14B parameters model.
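A hedged sketch of what the LLM-based "reasoning vs. ephemeral knowledge" filtering could look like: score each web page and keep only the high-scoring ones. The prompt, the `llm_score` helper, and the threshold are all assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative only: the paper does not publish its filtering pipeline.
# `llm_score(prompt) -> str` is a hypothetical helper that queries a scoring LLM.
from typing import Callable, Iterable, Iterator

SCORING_PROMPT = """Rate the following web page from 1 to 5 for how much it would
teach a small language model general reasoning (math, logic, step-by-step
explanations), as opposed to ephemeral facts such as sports results or news.
Answer with a single digit.

PAGE:
{page}
"""

def filter_for_reasoning(
    pages: Iterable[str],
    llm_score: Callable[[str], str],   # hypothetical LLM call
    threshold: int = 4,
) -> Iterator[str]:
    """Yield pages judged useful for 'reasoning ability'; drop the rest."""
    for page in pages:
        reply = llm_score(SCORING_PROMPT.format(page=page[:4000]))  # truncate long pages
        try:
            score = int(reply.strip()[0])
        except (ValueError, IndexError):
            continue  # unparseable rating -> drop conservatively
        if score >= threshold:
            yield page

# With a real scorer, a Premier League match report would score low and be
# dropped, while a worked geometry proof would be kept.
```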
Academic benchmarks
Safety & Weakness
- phi-3-mini shows somewhat more issues with factual knowledge and hallucination than phi-3-small and phi-3-medium.
💡 I think this is related to the data-optimal regime: it is much easier for a larger-scale LLM to learn factual knowledge than for a small-scale one. → Perhaps reasoning-related abilities are acquired at a moderate scale, and further increases in size can then reduce hallucination through the acquisition of factual knowledge..?
- For phi-3-mini, this can be mitigated by combining it well with a search engine (see the retrieval sketch below).
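A minimal sketch of the search-engine mixing idea: retrieve a few snippets for the question and prepend them to the prompt so phi-3-mini answers from retrieved context instead of its limited stored knowledge. `web_search` and `generate` are hypothetical stand-ins, not APIs from the paper.

```python
# Hypothetical retrieval-augmented setup; `web_search` and `generate` are
# stand-ins for whatever search API and phi-3-mini runtime is actually used.
from typing import Callable, List

def answer_with_search(
    question: str,
    web_search: Callable[[str, int], List[str]],  # returns top-k text snippets
    generate: Callable[[str], str],               # phi-3-mini text generation
    k: int = 3,
) -> str:
    """Ground phi-3-mini on retrieved snippets to reduce factual hallucination."""
    snippets = web_search(question, k)
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    prompt = (
        "Answer the question using only the search results below. "
        "If they are not sufficient, say you don't know.\n\n"
        f"Search results:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```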