Compact Latent and Reward-Based World Models as the Next Systems Substrate

Subtitle

Why pixel-level reconstruction is giving way to compressed latent simulation, and how this will connect to memory and self-evolving agents

URL

https://arxiv.org/pdf/2511.08544

Date

2026/03/26

세종양's short take

The future of world models is abstraction, not reconstruction, and the future of abstraction is the combination of compression, reward, memory, and continual editing.

ChatGPT Generated Abstract

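The center of gravity in recent world model research is shifting from how well pixels can be reconstructed to how small and stable an actionable latent state can be kept. This is less a change in taste than a near-inevitable consequence of system constraints: compute, memory, planning latency, and online adaptation. QJL, PolarQuant, and TurboQuant are not world model papers as such, but they strongly illustrate a trend across AI systems: "rather than carrying the high-precision raw signal all the way through, move to a compact representation that preserves geometry." QJL removes the overhead of storing quantization constants in KV cache quantization and reports a more than 5× memory reduction at 3 bits, while PolarQuant and TurboQuant propose compression that better preserves inner products and geometry through random rotation, a polar transform, and residual correction. This can be read as saying that what a model needs in order to remember and reason about the world is not the raw pixels themselves but a structured state sufficient for decision-making. (arXiv)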

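This view shows up even more directly on the latent world model side. Planning in 8 Tokens points out that existing tokenizers represent a single observation with hundreds of tokens, which makes planning slow and expensive, and proposes CompACT, a discrete tokenizer that compresses an observation to roughly 8 tokens. The paper claims that this greatly accelerates decision-time planning while maintaining planning performance. In other words, the bottleneck for world models is no longer "can it render the world?" but "how small and manipulable can the state needed for planning be made?" (arXiv)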

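LeJEPA also carries an important message at a level below the world model itself. The paper addresses the central problem of learning representations that can handle the world and its dynamics, and holds that JEPA-style representations should be trained stably and with linear time and memory complexity. Its argument is to build a representation space that trains stably without heuristics such as stop-gradient, teacher-student setups, or elaborate schedulers, which fits closely with the manipulable latent substrate that future world models will need. Put differently, future world models are likely to be built not on "a generator that reconstructs well" but on "a state space that compresses well, predicts well, and can be manipulated well." (arXiv)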

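The reason reward-based world models are rising in this context is clear. Pixel reconstruction carries too much information, forces the model to learn every detail of the environment with equal weight, and overvalues visual fidelity relative to the causal variables that matter for actual decision-making. A reward-centric latent world model, by contrast, pushes state abstraction according to "what matters for future reward and controllability." The representation then gets smaller, rollouts can run longer, and online updates get easier. In the long run, the natural direction is to couple this with a memory system and, instead of storing every experience in raw form, accumulate episodic memory mainly around reward-relevant events, novel transitions, uncertainty spikes, and policy failure cases. The interpretation in this paragraph is not a direct claim of the papers above but a reasonable synthesis built on their results. (arXiv)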

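Ultimately, future self-evolving agents are likely to be a combination of "world model + memory + compression + reward." Compact latents lower the cost of storage and rollout, the reward model decides which experiences to keep and which to discard, memory provides long-term cumulative structure, and the self-evolving loop continually updates the tokenizer, predictor, and value estimator from this memory. What matters here is likely to be selective replay rather than full replay, counterfactual sufficiency rather than full reconstruction, and a continually editable latent simulator rather than a static model. It is more accurate to see this not as a finished single paradigm but as the next direction that recent work on compression, latent planning, and the JEPA family jointly points toward. (Google Research)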

ChatGPT Generated Content

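1. Why pixel-level reconstruction world models hit their limits

Pixel-level reconstruction intuitively looks like "understanding the world well," but in practice it comes at too high a cost.

First, the training objective is too broad.

The model tries to match textures, background noise, viewpoint changes, and lighting changes that the agent does not need for decision-making, all with equal effort. From a representation learning standpoint, this amounts to taking on unnecessary entropy.

Second, the planning-time cost is too high.

To actually use a world model, the future has to be rolled out along many branches, and doing this in pixel space makes every single rollout too heavy. As Planning in 8 Tokens points out, even existing latent tokenizers make real-time planning difficult when they carry hundreds of tokens per observation. (arXiv)

Third, it is a poor fit for memory and continual update.

For a self-evolving agent to keep accumulating experience, storage cost, retrieval cost, and replay cost all matter. Pixel-based representations make all three worse.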

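2. Why compact latents become the alternative

The core of a compact latent is: "do not store the whole world; store only the sufficient state needed for action and prediction."

This view lines up with recent compression research. QJL removes the overhead of storing quantization constants in the KV cache, PolarQuant quantizes angles in polar coordinates without explicit normalization, and TurboQuant combines two stages to keep distortion small. They address different problems, but all show that, at the system level, a compact form that preserves geometry and inner products matters more than keeping the entire raw signal. (arXiv)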
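To make "preserve inner products, not the raw signal" concrete, here is a minimal sketch of a sign-bit JL inner-product estimator in the spirit of QJL: the key is reduced to one sign bit per projected coordinate plus its norm, while the query stays in full precision. This is a toy illustration, not the paper's implementation; the function names, the dimensions `d` and `m`, and the pure-Python matrix code are all assumptions made for readability.

```python
import math
import random

def gaussian_matrix(m, d, rng):
    """Random projection S with i.i.d. N(0, 1) entries (m rows, d columns)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def project(S, x):
    """Compute S @ x with plain lists."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]

def encode_key(S, k):
    """Keep only one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if v >= 0 else -1 for v in project(S, k)]
    norm = math.sqrt(sum(v * v for v in k))
    return signs, norm

def estimate_inner_product(S, q, encoded_key):
    """Estimate <q, k> from a full-precision query and the 1-bit key code.

    Uses E[(g.q) * sign(g.k)] = sqrt(2/pi) * <q, k> / ||k|| for g ~ N(0, I),
    so rescaling by sqrt(pi/2) * ||k|| gives an unbiased estimate.
    """
    signs, norm = encoded_key
    sq = project(S, q)
    return math.sqrt(math.pi / 2) * norm * sum(a * b for a, b in zip(sq, signs)) / len(S)

rng = random.Random(0)
d, m = 16, 4000                                  # hypothetical sizes
k = [rng.gauss(0, 1) for _ in range(d)]
q = [ki + 0.5 * rng.gauss(0, 1) for ki in k]     # query correlated with the key
S = gaussian_matrix(m, d, rng)

exact = sum(a * b for a, b in zip(q, k))
approx = estimate_inner_product(S, q, encode_key(S, k))
print(f"exact={exact:.3f}  approx={approx:.3f}")
```

The key is never reconstructed: only the quantity the downstream computation needs (the inner product with a query) is recovered, which is the same stance a compact world model takes toward observations.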

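Applying this to world models makes the logic even clearer.

• A good world model is not a "pretty future-frame generator";

• it should be a compact simulator that preserves the variables the policy needs.

Planning in 8 Tokens shows this trend very directly. Cutting the number of latent tokens drastically lowers planning latency, and only then can a world model actually fit inside a real control stack. (arXiv)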
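A back-of-the-envelope calculation shows why token count dominates planning latency. The cost model below is an assumption, not something taken from the paper: it supposes a sampling-based planner over an autoregressive transformer world model, one generated token per step, and attention cost proportional to the context length. The 256-token baseline, the horizon, and the candidate count are hypothetical numbers standing in for "hundreds of tokens per observation."

```python
def planning_cost(tokens_per_obs, horizon, num_candidates):
    """Rough cost of evaluating `num_candidates` action sequences of length `horizon`.

    Assumed cost model (not from the paper): tokens are generated one at a
    time, and generating the i-th token attends over i context positions.
    """
    tokens_per_rollout = tokens_per_obs * horizon
    # Sum of context lengths 1..N, i.e. quadratic in rollout length.
    attention_per_rollout = tokens_per_rollout * (tokens_per_rollout + 1) // 2
    return {
        "tokens": num_candidates * tokens_per_rollout,
        "attention_units": num_candidates * attention_per_rollout,
    }

dense = planning_cost(tokens_per_obs=256, horizon=10, num_candidates=100)
compact = planning_cost(tokens_per_obs=8, horizon=10, num_candidates=100)

print("token ratio:", dense["tokens"] // compact["tokens"])   # 32
print("attention ratio:", round(dense["attention_units"] / compact["attention_units"]))
```

Under these assumptions the generated-token count drops 32×, and the attention work drops by roughly three orders of magnitude because it grows quadratically with rollout length.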

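3. Why reward-based world models are on the rise

Compact latents alone are not enough, because something still has to decide what to keep and what to discard.

This is where reward comes in.

A reward-based world model changes the priority of representation learning as follows.

• Not "what is visually important?"

• But "what matters for future value, success or failure, and controllability?"

With this shift, the world model increasingly becomes a task-aware abstraction engine.

That is, it compresses the state not arbitrarily but around reward-relevant directions.

The resulting benefits are substantial.

• Rollouts become lighter

• It becomes easier to extend the planning horizon

• There is a criterion for memory selection

• It becomes clear what to update during online adaptation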
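The difference between "visually important" and "reward-relevant" can be seen in a toy feature-selection experiment. In this sketch, a reconstruction-style criterion is approximated by "keep the highest-variance features" and a reward-centric criterion by "keep the features most correlated with reward." Both proxies, the synthetic data, and all names are assumptions for illustration; real reward-centric world models learn the abstraction end to end.

```python
import random

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    return sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

rng = random.Random(0)
n = 2000
# Features 0 and 1 drive the reward but have low variance;
# features 2..5 are loud distractors (think texture or lighting).
task = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(2)]
distractors = [[rng.gauss(0, 5) for _ in range(n)] for _ in range(4)]
features = task + distractors
reward = [task[0][i] - task[1][i] + 0.1 * rng.gauss(0, 1) for i in range(n)]

by_variance = top_k([variance(f) for f in features], 2)
by_reward = top_k([abs(pearson(f, reward)) for f in features], 2)

print("variance-based selection keeps:", by_variance)
print("reward-based selection keeps:  ", by_reward)
```

With a budget of two dimensions, the variance criterion spends it entirely on distractors, while the reward criterion keeps exactly the two features that determine the outcome.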

4. Why JEPA and representation theory matter

Underneath this shift lies the problem of representation stability.

LeJEPA is not a world model paper itself, but it shows the foundation that future latent world models will rest on. The paper argues that JEPA-style representations should be manipulable, that training should be stable without heuristics, and that it should scale with linear time and memory complexity. This connects to the direction that "future world models will run on a latent space that prioritizes manipulability and stability over generation quality." (arXiv)

In my reading, the JEPA family could change the world model's encoder in the following ways.

• reconstructive encoder → predictive encoder

• dense sensory code → task-sufficient latent

• static pretraining feature → continually regularized state manifold

In other words, the core of the world model moves back from the decoder toward the encoder.

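5. How will this combine with memory?

This is the most important part.

Future agents are unlikely to store every experience as equally valuable.

Instead, memory will likely accumulate mainly around experiences such as the following.

• reward spikes and penalty spikes

• transitions with high novelty

• segments with high model uncertainty

• policy failure and recovery episodes

• events that contributed to long-horizon credit assignment

In other words, memory becomes not a "warehouse of past data" but a curriculum buffer for updating the world model.

A compact latent world model will also lead this memory to be stored as small state trajectories rather than raw pixels.

That makes retrieval faster, summarization easier, and counterfactual replay easier.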
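The write criteria above can be sketched as a small priority-gated buffer. This is only one possible reading: the additive scoring rule, the threshold, the capacity, and every name here (`SelectiveMemory`, `MemoryItem`, and so on) are hypothetical, since the note names the signals but not how to combine them.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class MemoryItem:
    priority: float
    step: int
    latent: tuple = field(compare=False)   # compact latent state, not raw pixels

class SelectiveMemory:
    def __init__(self, capacity, threshold):
        self.capacity = capacity
        self.threshold = threshold
        self.heap = []   # min-heap, so the lowest-priority item is evicted first

    @staticmethod
    def score(reward, novelty, uncertainty, failed):
        # Hypothetical weighting: every signal counts equally.
        return abs(reward) + novelty + uncertainty + (1.0 if failed else 0.0)

    def write(self, step, latent, reward, novelty, uncertainty, failed=False):
        """Store the transition only if it is informative enough; return True if stored."""
        priority = self.score(reward, novelty, uncertainty, failed)
        if priority < self.threshold:
            return False                       # routine experience: skip
        item = MemoryItem(priority, step, latent)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
            return True
        if item > self.heap[0]:
            heapq.heapreplace(self.heap, item)  # evict the least informative item
            return True
        return False

memory = SelectiveMemory(capacity=2, threshold=0.5)
events = [
    # step, latent,      reward, novelty, uncertainty, failed
    (0, (0.1, 0.2), 0.0, 0.1, 0.1, False),   # routine step
    (1, (0.3, 0.1), 1.0, 0.0, 0.1, False),   # reward spike
    (2, (0.9, 0.4), 0.0, 0.2, 0.5, False),   # moderately uncertain segment
    (3, (0.5, 0.5), -1.0, 0.3, 0.2, True),   # penalized policy failure
]
for step, latent, reward, novelty, uncertainty, failed in events:
    memory.write(step, latent, reward, novelty, uncertainty, failed)

kept = sorted(item.step for item in memory.heap)
print(kept)   # [1, 3]
```

The routine step never enters memory, and under capacity pressure the moderately uncertain segment is evicted in favor of the failure episode, so the buffer behaves like a curriculum rather than a log.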

As I see it, the architecture is likely to converge toward the following.

Perception → Compact Latent State → Reward/Value Attribution → Episodic Memory Write → Selective Replay → World Model Update

Once this loop is running, self-evolution becomes possible.

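6. What picture emerges when this is combined with self-evolving

What matters in a self-evolving agent is not simple online finetuning.

What matters is that the agent revises the state abstraction of its own world model from its own experience.

Broken down a little further:

The agent interacts with the environment

It produces a compact latent trajectory

A reward model / verifier marks which segments are important

Only the important trajectories are kept in memory

The world model readjusts its transition prior, uncertainty head, and tokenizer from this memory

The policy plans on top of the new latent simulator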
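The six steps above can be walked through on a deliberately tiny example. Everything here is a stand-in chosen for illustration: a 5-state chain as the environment, an identity function as the "tokenizer," a lookup table as the "world model," prediction error as the importance signal, and a scripted exploration policy. None of it comes from the cited papers.

```python
GOAL = 4   # toy 1-D chain with states 0..4; reward is given on reaching GOAL

def env_step(state, action):
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def encode(state):
    return state   # stand-in for a learned compact tokenizer

def predict(model, z, action):
    return model.get((z, action), z)   # prior: the latent does not change

def plan(model, z):
    # One-step lookahead in latent space: choose the action whose
    # predicted next latent is closest to GOAL (the largest value).
    return max((-1, 1), key=lambda a: predict(model, z, a))

model, memory = {}, []
state, steps = 0, 30
for t in range(steps):
    action = -1 if t % 3 == 0 else 1               # step 1: scripted interaction
    z = encode(state)                              # step 2: compact latent
    nxt, reward = env_step(state, action)
    z_next = encode(nxt)
    surprising = predict(model, z, action) != z_next
    if reward != 0.0 or surprising:                # step 3: mark important segments
        memory.append((z, action, z_next, reward)) # step 4: selective memory write
        for mz, ma, mz_next, _ in memory:          # step 5: update model from memory
            model[(mz, ma)] = mz_next
    state = nxt

print(len(memory), "of", steps, "transitions stored")
print("planned action at latent 3:", plan(model, 3))   # step 6: plan on the updated model
```

Only rewarded or mispredicted transitions reach memory, the model is rebuilt from that memory alone, and the planner then uses the updated model; a real system would replace each stand-in with a learned component.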

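The core update targets of self-evolving may then be not just the policy but the following three.

• tokenizer: which information to compress further and which to keep at finer detail

• dynamics model: which transitions are harder to predict

• memory policy: which experiences to store, delete, or summarize

In other words, future self-evolving looks closer to dynamically reorganizing representation granularity and memory policy than to "continual learning that overhauls all of the model's parameters."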

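7. One-line conclusion

The world models that will draw attention going forward are likely to be not models that draw pixels well, but models that keep reward-relevant structure in a compact latent and keep revising their own state space through memory.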

References

[1] TurboQuant, Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni, arXiv:2504.19874. Near-optimal vector quantization with random rotation plus residual QJL correction; reports quality-neutral KV-cache quantization at 3.5 bits/channel. (arXiv)

[2] QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead, Amir Zandieh, Majid Daliri, Insu Han, arXiv:2406.03482. Removes quantization-constant overhead and reports over 5× KV-cache memory reduction at 3-bit quantization without accuracy loss. (arXiv)

[3] PolarQuant: Quantizing KV Caches with Polar Transformation, Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh, arXiv:2502.02617. Uses random preconditioning plus polar transformation; reports over 4.2× KV-cache compression with strong long-context quality. (arXiv)

[4] TurboQuant blog, Google Research. Explains TurboQuant as a two-stage method combining PolarQuant-like high-quality compression and QJL residual correction, framing the work around AI memory bottlenecks and vector-search efficiency. (Google Research)

[5] Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model, Dongwon Kim et al., arXiv:2603.05438. Argues conventional tokenizers make planning too expensive and proposes an 8-token observation representation for much faster decision-time planning. (arXiv)

[6] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics, Randall Balestriero, Yann LeCun, arXiv:2511.08544. Not a world-model paper per se, but highly relevant as a scalable, stable representation-learning substrate for manipulable latent dynamics. (arXiv)