# Quickstart

Get started with World Engine in minutes.
## Installation

First, set up a virtual environment (recommended):

```bash
python3 -m venv .env
source .env/bin/activate
```

Install World Engine with PyTorch CUDA support:

```bash
pip install \
  --index-url https://download.pytorch.org/whl/test/cu128 \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --extra-index-url https://pypi.org/simple \
  --upgrade --ignore-installed \
  "world_engine @ git+https://github.com/Overworldai/world_engine.git"
```

Set your HuggingFace token to access Waypoint models:

```bash
export HF_TOKEN=<your access token>
```

Get your token at https://huggingface.co/settings/tokens.
## Basic Usage

### Understanding CtrlInput

The `CtrlInput` class represents controller input for a single frame. It encapsulates button presses and mouse/pointer movement:

```python
from world_engine import CtrlInput

# Create control input with button presses and mouse movement
ctrl = CtrlInput(
    button={48, 42},   # Set of pressed button IDs
    mouse=[0.4, 0.3],  # (x, y) mouse velocity vector
)

# Empty control input (no buttons, no mouse movement)
ctrl = CtrlInput()
```

**Button Keycodes:** Button IDs are defined by Owl-Control.

**Mouse Input:** The `mouse` parameter is a raw velocity vector `(x, y)` representing pointer movement.
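One common way to produce that velocity vector is to rescale per-frame pointer deltas. A minimal sketch — the `mouse_velocity` helper and its 100 px/unit divisor are illustrative choices, not part of the World Engine API:

```python
def mouse_velocity(dx: float, dy: float, px_per_unit: float = 100.0) -> list[float]:
    """Rescale raw per-frame pointer deltas (in pixels) into a small
    (x, y) velocity vector suitable for CtrlInput's mouse parameter.
    The 100 px/unit divisor is an illustrative choice, not an API constant."""
    return [dx / px_per_unit, dy / px_per_unit]

# A 40 px rightward, 30 px downward pointer move in one frame:
vel = mouse_velocity(40, 30)  # [0.4, 0.3], matching the example above
```

The result can then be passed directly as `CtrlInput(mouse=vel)`.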
### Using WorldEngine

The `WorldEngine` class is the main interface for generating frames. It maintains state across frames, including the visual history and the current prompt.

#### Loading a Model

Create an engine instance by specifying a Waypoint model from HuggingFace:

```python
from world_engine import WorldEngine

engine = WorldEngine(
    "Overworld/Waypoint-1-Small",
    device="cuda",
)
```

Browse all available Waypoint models at Overworld on HuggingFace.
#### Setting the Prompt

Use `set_prompt()` to specify a text prompt that conditions frame generation. The prompt persists until you call `set_prompt()` again:

```python
engine.set_prompt("A fun game")
# All subsequent frames will be conditioned on this prompt
```
#### Appending Frames

Use `append_frame()` to manually add a frame to the sequence without generating. This is useful for:

- Setting an initial frame
- Forcing a specific image at any point in the sequence
- Seeding generation with reference images
**Single Frame:**

```python
import torch

# Append a single frame: an (H, W, 3) uint8 tensor
img = torch.randint(0, 256, (512, 512, 3), dtype=torch.uint8)
returned_img = engine.append_frame(img)
```
**Multiple Frames:**

```python
# Append multiple frames at once: a (T, H, W, 3) uint8 tensor
frames = torch.randint(0, 256, (4, 512, 512, 3), dtype=torch.uint8)
engine.append_frame(frames)
```
**Frames with Controls:**

```python
# Append frames with corresponding control inputs
frames = torch.randint(0, 256, (2, 512, 512, 3), dtype=torch.uint8)
ctrls = [
    CtrlInput(button={48}),
    CtrlInput(button={42}),
]
engine.append_frame(frames, ctrls=ctrls)
```
> **Note:** `append_frame()` returns the appended image(s) on the same device as `engine.device`.
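For the reference-image use case above, a decoded image just needs to end up as an (H, W, 3) uint8 tensor before being appended. A minimal sketch, using a synthetic NumPy array as a stand-in for an image you might load with Pillow or OpenCV:

```python
import numpy as np
import torch

# Stand-in for a decoded reference image; in practice this array would come
# from an image loader, e.g. Pillow's np.asarray(Image.open(...).convert("RGB"))
art = np.zeros((512, 512, 3), dtype=np.uint8)
art[:, :256] = (30, 120, 200)  # give the left half some content

# append_frame() expects an (H, W, 3) uint8 tensor
frame = torch.from_numpy(art)
```

Passing `frame` to `engine.append_frame(frame)` then seeds subsequent `gen_frame()` calls with that image.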
#### Generating Frames

Use `gen_frame()` to generate a new frame conditioned on:

- The current prompt (set via `set_prompt()`)
- Visual history (all previously generated/appended frames)
- Controller input (optional)
```python
from world_engine import CtrlInput

# Generate with control input
ctrl = CtrlInput(button={48, 42}, mouse=[0.4, 0.3])
img = engine.gen_frame(ctrl=ctrl)

# Generate without control input
img = engine.gen_frame()

# Generate multiple frames in sequence
for ctrl_input in [
    CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
    CtrlInput(mouse=[0.1, 0.2]),
    CtrlInput(button={95, 32, 105}),
]:
    img = engine.gen_frame(ctrl=ctrl_input)
```
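Since each `gen_frame()` call returns a single (H, W, 3) uint8 tensor, a rollout like the loop above can be collected into a (T, H, W, 3) clip with `torch.stack`. A sketch, with random tensors standing in for real engine output:

```python
import torch

# Stand-ins for successive engine.gen_frame(...) results
frames = [torch.randint(0, 256, (512, 512, 3), dtype=torch.uint8)
          for _ in range(3)]

# Stack along a new leading time axis: (T, H, W, 3)
clip = torch.stack(frames)
```

From there the clip can be saved frame by frame, or encoded to video with a library such as imageio or torchvision if you have one installed.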
#### Resetting State

Use `reset()` to clear the visual history and start fresh. The prompt is preserved:

```python
engine.reset()
# Visual history cleared, but prompt remains set
```
## Complete Example: Interactive OpenCV Application

Here's a complete example using OpenCV to create an interactive world model experience:
```python
from typing import AsyncIterable, AsyncIterator
import asyncio
import contextlib
import sys

import cv2
import torch

from world_engine import WorldEngine, CtrlInput


async def render(frames: AsyncIterable[torch.Tensor],
                 win_name: str = "World Engine Demo (ESC to exit)") -> None:
    """Render a stream of RGB tensor images."""
    cv2.namedWindow(win_name, cv2.WINDOW_AUTOSIZE | cv2.WINDOW_GUI_NORMAL)
    async for t in frames:
        # OpenCV expects BGR channel order, so convert from RGB before display
        cv2.imshow(win_name, cv2.cvtColor(t.cpu().numpy(), cv2.COLOR_RGB2BGR))
        await asyncio.sleep(0)
    cv2.destroyAllWindows()


async def frame_stream(engine: WorldEngine,
                       ctrls: AsyncIterable[CtrlInput]) -> AsyncIterator[torch.Tensor]:
    """Generate a frame by calling the engine for each ctrl."""
    yield await asyncio.to_thread(engine.gen_frame)
    async for ctrl in ctrls:
        yield await asyncio.to_thread(engine.gen_frame, ctrl=ctrl)


async def ctrl_stream(delay: int = 1) -> AsyncIterator[CtrlInput]:
    """Accumulate key presses asynchronously."""
    q: asyncio.Queue[int] = asyncio.Queue()

    async def producer() -> None:
        while True:
            k = cv2.waitKey(delay)
            if k != -1:
                await q.put(k)
            await asyncio.sleep(0)

    prod_task = asyncio.create_task(producer())
    while True:
        buttons: set[int] = set()
        # Drain everything currently in the queue into this batch
        with contextlib.suppress(asyncio.QueueEmpty):
            while True:
                k = q.get_nowait()
                if k == 27:  # ESC key
                    prod_task.cancel()
                    return
                buttons.add(k)
        yield CtrlInput(button=buttons)


async def main() -> None:
    uri = sys.argv[1] if len(sys.argv) > 1 else "Overworld/Waypoint-1-Small"

    # Create engine
    engine = WorldEngine(uri, device="cuda")

    # Set initial prompt
    engine.set_prompt("A fun platformer game")

    # Stream controls and frames
    ctrls = ctrl_stream()
    frames = frame_stream(engine, ctrls)
    await render(frames)


if __name__ == "__main__":
    asyncio.run(main())
```
This example creates a real-time interactive experience where:

- OpenCV captures keyboard input asynchronously
- Inputs are batched into `CtrlInput` objects
- Each frame is generated based on input and visual history
- Frames are rendered to an OpenCV window
- Press ESC to exit
## Next Steps

- Explore the Main API Reference for detailed API documentation
- Check out Quantization for model compression options
- Review Modeling Specifics if you want to understand internals