stable_worldmodel.envs package
Submodules
stable_worldmodel.envs.image_positioning module
- class ImagePositioning(resolution: int, images: list[PIL.Image], render_mode: str | None = None, background_power_decay: float | None = 1.0)[source]
Bases: Env
- close()[source]
After the user has finished using the environment, close contains the code necessary to “clean up” the environment.
This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.
- render(mode='current')[source]
Compute the render frames as specified by render_mode during the initialization of the environment.
The environment’s metadata render modes (env.metadata["render_modes"]) should contain the possible ways to implement the render modes. In addition, list versions for most render modes are achieved through gymnasium.make, which automatically applies a wrapper to collect rendered frames.
Note
As render_mode is known during __init__, the objects used to render the environment state should be initialised in __init__.
By convention, if render_mode is:
None (default): no render is computed.
“human”: The environment is continuously rendered in the current display or terminal, usually for human consumption. This rendering should occur during step() and render() doesn’t need to be called. Returns None.
“rgb_array”: Return a single frame representing the current state of the environment. A frame is an np.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
“ansi”: Return a string (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colors).
“rgb_array_list” and “ansi_list”: List-based versions of render modes are possible (except human) through the wrapper gymnasium.wrappers.RenderCollection, which is automatically applied during gymnasium.make(..., render_mode="rgb_array_list"). The frames collected are popped after render() or reset() is called.
Note
Make sure that your class’s metadata "render_modes" key includes the list of supported modes.
Changed in version 0.25.0: The render function was changed to no longer accept parameters; rather, these parameters should be specified when the environment is initialised, i.e., gymnasium.make("CartPole-v1", render_mode="human").
- reset(seed: int | None = None, options: dict | None = None)[source]
Start a new episode.
- Parameters:
seed – Random seed for reproducible episodes
options – Additional configuration (currently unused)
- Returns:
(observation, info) for the initial state
- Return type:
tuple
- step(action)[source]
Execute one timestep within the environment.
- Parameters:
action – The action to take (0-3 for directions)
- Returns:
(observation, reward, terminated, truncated, info)
- Return type:
tuple
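A minimal usage sketch (hedged: the environment is constructed here directly with a single synthetic PIL image, and the action is sampled from the action space; actions 0-3 select a direction as documented above):
from PIL import Image

from stable_worldmodel.envs.image_positioning import ImagePositioning

# Placeholder image list; any PIL images should work here.
images = [Image.new("RGB", (64, 64), color="red")]
env = ImagePositioning(resolution=224, images=images, render_mode="rgb_array")

obs, info = env.reset(seed=0)
for _ in range(10):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        break
env.close()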
stable_worldmodel.envs.ogbench_cube module
OGBench Cube manipulation environment with multiple task variants.
This module implements a robotic manipulation environment using cubes with various task configurations ranging from single cube pick-and-place to complex multi-cube stacking and rearrangement tasks. The environment supports visual variations including object colors, sizes, lighting, and camera angles for robust policy learning.
The environment is built on top of the ManipSpaceEnv from OGBench and uses MuJoCo for physics simulation. It provides both pixel-based and state-based observations, with support for goal-conditioned learning and data collection modes.
Example
Basic usage of the cube environment:
from stable_worldmodel.envs.ogbench_cube import CubeEnv
# Create a double cube environment with pixel observations
env = CubeEnv(env_type='double', ob_type='pixels', multiview=True)
# Reset with variation sampling
obs, info = env.reset(options={'variation': ['all']})
# Run an episode
for _ in range(100):
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
if info['success']:
break
- class CubeEnv(env_type='single', ob_type='pixels', permute_blocks=True, multiview=False, height=224, width=224, *args, **kwargs)[source]
Bases: ManipSpaceEnv
Robotic manipulation environment with cube objects and multiple task variants.
This environment provides a suite of manipulation tasks involving 1-8 colored cubes that must be moved to target positions. It supports various task types including pick-and-place, stacking, swapping, and cyclic rearrangement. The environment includes comprehensive variation spaces for visual domain randomization.
- The environment operates in two modes:
‘task’: Goal-conditioned mode where the robot must achieve specific configurations
‘data_collection’: Mode for collecting demonstrations with random targets
- Variables:
_env_type (str) – Type of environment determining number of cubes. One of: ‘single’, ‘double’, ‘triple’, ‘quadruple’, ‘octuple’.
_num_cubes (int) – Number of cubes in the environment (1, 2, 3, 4, or 8).
_permute_blocks (bool) – Whether to randomly permute cube order at task init.
_multiview (bool) – Whether to render from both front and side cameras.
_ob_type (str) – Observation type, either ‘pixels’ or state-based.
_cube_colors (ndarray) – Array of RGB colors for cubes.
_target_task (str) – Task type identifier, set to ‘cube’.
_target_block (int) – Index of the target cube in data collection mode.
variation_space (Dict) – Hierarchical space defining variation ranges for: cube (color RGB, size half-extents); agent (color RGB); floor (two RGB values for the checkerboard); camera (angle_delta, yaw/pitch offsets); light (intensity, diffuse lighting strength).
task_infos (list) – List of dictionaries defining task configurations, each containing ‘task_name’, ‘init_xyzs’, and ‘goal_xyzs’.
cameras (dict) – Dictionary of camera configurations with position and orientation.
_cube_geoms_list (list) – MuJoCo geom objects for each cube.
_cube_target_geoms_list (list) – MuJoCo geom objects for target visualizations.
_cube_geom_ids_list (list) – MuJoCo geom IDs for each cube.
_cube_target_mocap_ids (list) – MuJoCo mocap body IDs for target positions.
_cube_target_geom_ids_list (list) – MuJoCo geom IDs for target visualizations.
_success (bool) – Whether the current task has been completed successfully.
_cur_goal_ob (ndarray) – Goal observation for goal-conditioned learning.
_cur_goal_rendered (ndarray) – Rendered image of goal state, if enabled.
Note
Inherits from ManipSpaceEnv which provides the underlying robotic arm control, physics simulation, and base functionality.
- __init__(env_type='single', ob_type='pixels', permute_blocks=True, multiview=False, height=224, width=224, *args, **kwargs)[source]
Initialize the CubeEnv with specified configuration.
Sets up the manipulation environment with the specified number of cubes and configures observation type, block permutation, and camera views. Initializes the variation space for visual domain randomization.
- Parameters:
env_type (str) – Environment type corresponding to number of cubes. Must be one of: ‘single’ (1 cube), ‘double’ (2 cubes), ‘triple’ (3 cubes), ‘quadruple’ (4 cubes), ‘octuple’ (8 cubes).
ob_type (str, optional) – Type of observation to return. Either ‘pixels’ for image observations or ‘state’ for proprioceptive/object states. Defaults to ‘pixels’.
permute_blocks (bool, optional) – Whether to randomly shuffle the order of cubes at the start of each episode. Helps with generalization. Defaults to True.
multiview (bool, optional) – Whether to render the scene from both front and side camera views simultaneously. Returns stacked images when True. Defaults to False.
*args – Variable length argument list passed to parent ManipSpaceEnv.
**kwargs – Arbitrary keyword arguments passed to parent ManipSpaceEnv.
- Raises:
ValueError – If env_type is not one of the supported values.
Note
The variation_space is automatically configured with appropriate ranges for all visual variations including colors, sizes, lighting, and cameras.
- add_object_info(ob_info)[source]
Add cube-specific information to the observation info dictionary.
Augments the info dictionary with privileged state information about all cubes including positions, orientations, and target information (in data collection mode).
- Parameters:
ob_info (dict) – Observation info dictionary to augment. Modified in-place.
- Adds to ob_info:
‘privileged/block_{i}_pos’: 3D position (x, y, z) of cube i
‘privileged/block_{i}_quat’: Quaternion (w, x, y, z) of cube i
‘privileged/block_{i}_yaw’: Yaw angle in radians of cube i
‘privileged/target_task’: Task type string (data collection mode only)
‘privileged/target_block’: Index of target cube (data collection mode only)
‘privileged/target_block_pos’: Target position (data collection mode only)
‘privileged/target_block_yaw’: Target yaw angle (data collection mode only)
Note
All positions are in world coordinates. Quaternions use (w, x, y, z) format. Privileged information is typically not available to policies during deployment.
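A short sketch of reading these privileged entries (hedged: it assumes the privileged keys are propagated into the info dictionary returned by reset()/step(), as suggested by get_reset_info() and get_step_info() below):
obs, info = env.reset()
block_0_pos = info["privileged/block_0_pos"]    # (x, y, z) world position of cube 0
block_0_quat = info["privileged/block_0_quat"]  # (w, x, y, z) orientation of cube 0
block_0_yaw = info["privileged/block_0_yaw"]    # yaw angle of cube 0 in radians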
- add_objects(arena_mjcf)[source]
Add cube objects and cameras to the MuJoCo scene.
Constructs the manipulation scene by loading cube XML descriptions and positioning them appropriately. Sets up multiple camera viewpoints for rendering observations.
- Parameters:
arena_mjcf (mjcf.RootElement) – The MuJoCo XML root element representing the arena where objects and cameras will be added.
Note
Cubes are positioned with 0.05m spacing along the y-axis
Each cube has both a physical object and a semi-transparent target marker
Three cameras are added: ‘front’, ‘front_pixels’, and ‘side_pixels’
All cube geoms are stored for later color and property modifications
- compute_observation()[source]
Compute the current observation based on observation type.
Generates either pixel-based or state-based observations depending on the ob_type configuration. State observations include scaled proprioceptive and object state information.
- Returns:
- Observation array. If ob_type is ‘pixels’, returns an image array with shape (H, W, C), or (N, H, W, C) for multiview. If ob_type is not ‘pixels’, returns a flattened state vector containing:
Arm joint positions (6D) and velocities (6D)
End-effector position (3D, scaled) and yaw angle (2D: cos/sin)
Gripper opening (1D, scaled) and contact (binary)
For each cube: position (3D, scaled), quaternion (4D), yaw (2D: cos/sin)
- Return type:
ndarray
Note
State observations use a centering offset (0.425, 0.0, 0.0) and scaling factors (10.0 for positions, 3.0 for gripper) to normalize values. Yaw angles are encoded as (cos, sin) pairs for continuity.
- compute_oracle_observation()[source]
Compute oracle goal representation containing only cube positions.
Returns a compact state representation containing only the positions of all cubes, useful for goal-conditioned learning where the goal is defined by object configurations rather than the full state.
- Returns:
- Concatenated cube positions with shape (num_cubes * 3,).
Each cube contributes its (x, y, z) position, centered and scaled by the same factors used in compute_observation().
- Return type:
ndarray
Note
This representation excludes robot state and object orientations, focusing only on cube positions. Used primarily in task mode for goal specification.
- compute_reward(ob, action)[source]
Compute the reward for the current step.
Calculates reward based on task success. If a specific reward_task_id is set, uses a custom reward function that counts successful cube placements minus the total number of cubes (range: -num_cubes to 0). Otherwise defers to parent class reward computation.
- Parameters:
ob (ndarray) – Current observation (not used in custom reward computation).
action (ndarray) – Action taken in this step (not used in custom reward).
- Returns:
- Scalar reward value. Custom reward ranges from -num_cubes (all
cubes far from targets) to 0 (all cubes at targets). Parent class reward depends on its implementation.
- Return type:
float
Note
The custom reward provides dense feedback about task progress by counting how many cubes are successfully positioned. Each successful cube adds 1 to the base value of -num_cubes.
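An illustrative sketch of the reward arithmetic described above (not the actual implementation): with n cubes of which k are at their targets, the reward is k - n.
def illustrative_cube_reward(cube_at_target):
    """cube_at_target is a hypothetical list of booleans, one per cube.

    Example: 2 of 3 cubes placed -> 2 - 3 = -1.0; all 3 placed -> 0.0.
    """
    num_cubes = len(cube_at_target)
    return float(sum(cube_at_target) - num_cubes)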
- get_reset_info()[source]
Get information dictionary at environment reset.
Compiles observation info along with goal observations (in task mode) and success status to provide comprehensive reset information.
- Returns:
- Dictionary containing:
All keys from compute_ob_info() (proprioception and object states)
‘goal’: Goal observation (task mode only)
‘success’: Boolean indicating current success status
- Return type:
dict
Note
Called after initialize_episode() to provide initial state information.
- get_step_info()[source]
Get information dictionary after each environment step.
Compiles current observation info along with goal observations (in task mode) and success status to provide comprehensive step information.
- Returns:
- Dictionary containing:
All keys from compute_ob_info() (proprioception and object states)
‘goal’: Goal observation (task mode only)
‘success’: Boolean indicating whether task is completed
- Return type:
dict
Note
Called after each step to provide feedback about current state and progress.
- initialize_episode()[source]
Initialize the environment state at the start of an episode.
Sets up cube colors, arm position, and object placements based on the current mode (task or data_collection). In task mode, creates goal observations and places cubes according to the current task definition. In data collection mode, randomizes cube placements and sets a random target.
- The initialization process:
Apply cube colors from variation space
Reset arm to home position
In task mode: Set cubes to task-specific positions, generate goal observation
In data_collection mode: Randomize cube positions and orientations
Run forward kinematics to stabilize the scene
Note
In task mode, goal observation is computed by first placing cubes at goal positions, rendering/observing, then resetting to initial positions
Small random perturbations (±0.01m) are added to initial positions
Random yaw rotations (0-2π) are applied to all cubes
- modify_mjcf_model(mjcf_model)[source]
Apply visual variations to the MuJoCo model based on variation space.
Modifies the MJCF model XML to apply sampled variations including floor colors, robot arm colors, cube sizes, camera angles, and lighting. Only variations that are enabled in the variation_options are applied.
- Parameters:
mjcf_model (mjcf.RootElement) – The MuJoCo XML model to modify.
- Returns:
The modified model with variations applied.
- Return type:
mjcf.RootElement
Note
Variations are only applied if specified in variation_options during reset
Some variations (size, light) call self.mark_dirty() to trigger recompilation
Camera angle perturbations use the perturb_camera_angle helper function
- post_compilation_objects()[source]
Extract MuJoCo object IDs after model compilation.
Retrieves and stores the integer IDs for all cube geoms, target mocap bodies, and target geoms. These IDs are used for efficient access during simulation and rendering.
Note
Must be called after the MuJoCo model has been compiled. IDs are needed for direct manipulation of model properties like colors and positions.
- post_step()[source]
Update environment state after each simulation step.
Computes success status and adjusts target marker visibility based on the current mode. In task mode, all cube targets are evaluated. In data collection mode, only the designated target cube is evaluated.
- Updates:
self._success: Set to True if success conditions are met
Target geom alpha: Made visible (0.2) for relevant targets when visualization is enabled, invisible (0.0) otherwise
Note
Success in task mode requires ALL cubes to reach their targets. Success in data collection mode requires only the target cube to succeed.
- render(camera='front_pixels', *args, **kwargs)[source]
Render the current scene from a specified camera view.
Generates an RGB image of the current environment state from a single camera viewpoint. This method renders from one camera at a time.
- Parameters:
camera (str, optional) – Camera name to render from. Defaults to ‘front_pixels’. Supports any camera defined in self.cameras (e.g., ‘front_pixels’, ‘side_pixels’).
*args – Additional positional arguments passed to parent render method.
**kwargs – Additional keyword arguments passed to parent render method.
- Returns:
- Rendered image with shape (H, W, C) where H is height,
W is width, and C is the number of color channels (typically 3 for RGB).
- Return type:
ndarray
Note
For rendering from multiple cameras simultaneously, use the render_multiview() method instead.
- render_multiview(camera='front_pixels', *args, **kwargs)[source]
Render the current scene from multiple camera views or a fallback single view.
When multiview mode is enabled (_multiview=True), renders the scene from both ‘front_pixels’ and ‘side_pixels’ cameras and returns them as a dictionary. When multiview is disabled, falls back to rendering from a single camera.
- Parameters:
camera (str, optional) – Camera name to use for fallback rendering when multiview is disabled. Defaults to ‘front_pixels’. Ignored when multiview is enabled.
*args – Additional positional arguments passed to the render method.
**kwargs – Additional keyword arguments passed to the render method.
- Returns:
- If multiview is enabled, returns a dictionary with camera
names as keys (‘front_pixels’, ‘side_pixels’) and rendered images as values, where each image has shape (H, W, C). If multiview is disabled, returns a single rendered image array with shape (H, W, C).
- Return type:
dict or ndarray
Note
The multiview dictionary format is useful for policies that process multiple viewpoints separately. The ‘front_pixels’ camera provides an oblique view while ‘side_pixels’ shows a perpendicular side view.
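A short usage sketch (assuming the environment was created with multiview=True):
views = env.render_multiview()
front = views["front_pixels"]  # (H, W, C) oblique front view
side = views["side_pixels"]    # (H, W, C) perpendicular side view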
- reset(seed=None, options=None, *args, **kwargs)[source]
Reset the environment to an initial state.
Resets the environment and optionally samples from the variation space to create visual diversity. Handles both task mode (with predefined goals) and data collection mode (with random targets).
- Parameters:
options (dict, optional) – Dictionary of reset options. Supported keys:
‘variation’: List/tuple of variation names to sample. Use [‘all’] to sample all variations, or specify individual ones like [‘cube.color’, ‘light.intensity’]. Defaults to None (no variation).
*args – Variable length argument list passed to parent reset.
**kwargs – Arbitrary keyword arguments passed to parent reset.
- Returns:
- (observation, info) where:
observation: Current observation based on ob_type configuration
info: Dictionary containing reset information and task state
- Return type:
tuple
- Raises:
AssertionError – If variation option is not a list/tuple, or if variation values are outside their defined spaces.
Example
Reset with all variations enabled:
obs, info = env.reset(options={'variation': ['all']})
Reset with specific variations:
obs, info = env.reset(options={'variation': ['cube.color', 'camera.angle_delta']})
- set_new_target(return_info=True, p_stack=0.5)[source]
Set a new random target for data collection mode.
Randomly selects one of the “top” cubes (not stacked under another) as the target and assigns it a random goal position. The goal can be either a flat surface position or stacked on top of another cube.
- Parameters:
return_info (bool, optional) – Whether to return the observation and reset info after setting the new target. Defaults to True.
p_stack (float, optional) – Probability of setting the target to stack on top of another block (when multiple blocks are available). Must be in range [0, 1]. Defaults to 0.5.
- Returns:
- If return_info is True, returns (observation, reset_info).
Otherwise returns None.
- Return type:
tuple or None
- Raises:
AssertionError – If called when mode is not ‘data_collection’.
Note
Only cubes that are not underneath other cubes can be selected as targets
Target markers are made visible for the selected cube, invisible for others
Stacking targets are positioned 0.04m above the base cube’s z-position
Non-stacking targets are randomly sampled from the target sampling bounds
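A data-collection sketch (hedged: the argument that selects ‘data_collection’ mode is defined by the parent ManipSpaceEnv and is not documented here, so mode='data_collection' below is an assumption):
# 'mode' is assumed to be the ManipSpaceEnv flag selecting data collection.
env = CubeEnv(env_type="double", ob_type="pixels", mode="data_collection")
obs, info = env.reset()
for _ in range(5):
    # Pick a fresh random target, stacking it on another cube half of the time.
    obs, reset_info = env.set_new_target(return_info=True, p_stack=0.5)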
- set_tasks()[source]
Define all task configurations for the environment.
Initializes the task_infos list with predefined manipulation tasks appropriate for the current env_type. Each task specifies initial and goal positions for all cubes. Tasks increase in complexity from simple pick-and-place to multi-object stacking and cyclic rearrangements.
- Task types by environment:
single: 5 tasks (horizontal, vertical movements, diagonals)
double: 5 tasks (single/double pick-place, swap, stack)
triple: 5 tasks (single/triple pick-place, unstack, cycle, stack)
quadruple: 5 tasks (double/quad pick-place, unstack, cycle, stack)
octuple: 5 tasks (quad/octuple pick-place, unstacking, stacking)
Note
Also sets the default reward_task_id to 2 if not already configured. All positions are in MuJoCo world coordinates (x, y, z) in meters.
stable_worldmodel.envs.ogbench_scene module
- class SceneEnv(env_type, ob_type='pixels', permute_blocks=True, multiview=False, *args, **kwargs)[source]
Bases: ManipSpaceEnv
Scene environment.
This environment consists of a cube, two buttons, a drawer, and a window. The goal is to manipulate the objects to a target configuration. The buttons toggle the lock state of the drawer and window.
In addition to qpos and qvel, it maintains the following state variables:
button_states: A binary array of size num_buttons representing the state of each button. Stored in _cur_button_states.
- __init__(env_type, ob_type='pixels', permute_blocks=True, multiview=False, *args, **kwargs)[source]
Initialize the Scene environment.
- Parameters:
env_type – Unused; defined for compatibility with the other environments.
permute_blocks – Whether to randomly permute the order of the blocks at task initialization.
*args – Additional arguments to pass to the parent class.
**kwargs – Additional keyword arguments to pass to the parent class.
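A minimal construction sketch (hedged: env_type is unused by SceneEnv, so the value passed below is a placeholder, and the action is sampled):
from stable_worldmodel.envs.ogbench_scene import SceneEnv

env = SceneEnv(env_type="scene", ob_type="pixels", multiview=False)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())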
- add_object_info(ob_info)[source]
- add_objects(arena_mjcf)[source]
- compute_observation()[source]
Compute the observation at each timestep.
- Returns:
A dictionary of observation arrays.
- compute_oracle_observation()[source]
Return the oracle goal representation of the current state.
- compute_reward(ob, action)[source]
Compute the reward at each timestep.
- get_reset_info()[source]
Return a dictionary of information to be included in the reset return.
- get_step_info()[source]
Return a dictionary of information to be included in the step return.
- initialize_episode()[source]
Initialize the environment at the beginning of each episode.
- modify_mjcf_model(mjcf_model)[source]
Modify the MJCF model at the beginning of each episode.
This is useful for domain randomization or other forms of model modifications that may require recompilation of the MjModel and MjData objects. If the operation performed requires recompilation, call mark_dirty to force recompilation.
- Parameters:
mjcf_model – Root element of the MJCF model.
- Returns:
The root element of the modified MJCF model.
- post_compilation_objects()[source]
- post_step()[source]
Perform any post-step operations.
This can be useful for updating the environment state after the simulation has been stepped. By default, this method does nothing.
- pre_step()[source]
Perform any pre-step operations.
This can be useful for saving information. By default, this method does nothing.
- render(camera='front_pixels', *args, **kwargs)[source]
Render the current state of the environment.
- reset(options=None, *args, **kwargs)[source]
Reset the environment to the initial state.
If this is the first call to reset, build the MJCF model with build_mjcf_model.
Modify the MJCF model by calling modify_mjcf_model.
If the environment is dirty, the MjModel and MjData objects will be recompiled. Otherwise, compilation will be skipped unless this is the first call to reset.
Reset the simulation with mujoco.mj_resetData.
Initialize the episode with initialize_episode.
Compute the first observation with compute_observation.
- set_new_target(return_info=True, p_stack=0.5)[source]
Set a new random target for data collection.
- Parameters:
return_info – Whether to return the observation and reset info.
p_stack – Probability of stacking the target block on top of another block when there are multiple blocks and the target task is ‘cube’.
- set_state(qpos, qvel, button_states)[source]
Reset the environment to a specific state.
- set_tasks()[source]
stable_worldmodel.envs.pusht module
- class PushT(block_cog=None, damping=None, render_action=False, resolution=224, with_target=True, render_mode='rgb_array', fix_action_sample=True, relative=True)[source]
Bases: Env
- add_I(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- add_L(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- add_Z(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- add_box(position, height, width, color='LightSlateGray', scale=1, angle=0)[source]
- add_circle(position, angle=0, scale=1, color='RoyalBlue')[source]
- add_plus(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- add_shape(shape, *args, **kwargs)[source]
- add_small_tee(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- add_square(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- add_tee(position, angle, scale=30, color='LightSlateGray', mask=4294967295)[source]
- close()[source]
After the user has finished using the environment, close contains the code necessary to “clean up” the environment.
This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.
- eval_state(goal_state, cur_state)[source]
- fix_action_sample()[source]
- metadata: dict[str, Any] = {'render_fps': 10, 'render_modes': ['human', 'rgb_array'], 'video.frames_per_second': 10}
- render()[source]
Compute the render frames as specified by render_mode during the initialization of the environment.
The environment’s metadata render modes (env.metadata["render_modes"]) should contain the possible ways to implement the render modes. In addition, list versions for most render modes are achieved through gymnasium.make, which automatically applies a wrapper to collect rendered frames.
Note
As render_mode is known during __init__, the objects used to render the environment state should be initialised in __init__.
By convention, if render_mode is:
None (default): no render is computed.
“human”: The environment is continuously rendered in the current display or terminal, usually for human consumption. This rendering should occur during step() and render() doesn’t need to be called. Returns None.
“rgb_array”: Return a single frame representing the current state of the environment. A frame is an np.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
“ansi”: Return a string (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colors).
“rgb_array_list” and “ansi_list”: List-based versions of render modes are possible (except human) through the wrapper gymnasium.wrappers.RenderCollection, which is automatically applied during gymnasium.make(..., render_mode="rgb_array_list"). The frames collected are popped after render() or reset() is called.
Note
Make sure that your class’s metadata "render_modes" key includes the list of supported modes.
Changed in version 0.25.0: The render function was changed to no longer accept parameters; rather, these parameters should be specified when the environment is initialised, i.e., gymnasium.make("CartPole-v1", render_mode="human").
- reset(seed=None, options=None)[source]
Resets the environment to an initial internal state, returning an initial observation and info.
This method generates a new starting state, often with some randomness, to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.
Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.
For custom environments, the first line of reset() should be super().reset(seed=seed), which implements the seeding correctly.
Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.
- Parameters:
seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random) and the read-only attribute np_random_seed. If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset and the env’s np_random_seed will not be altered. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again.
options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment).
- Returns:
observation (ObsType): Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
info (dictionary): This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().
- Return type:
tuple
- step(action)[source]
Run one timestep of the environment’s dynamics using the agent actions.
When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.
Changed in version 0.26: The Step API was changed, removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated, which is critical for reinforcement learning bootstrapping algorithms.
- Parameters:
action (ActType) – an action provided by the agent to update the environment state.
- Returns:
observation (ObsType): An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
reward (SupportsFloat): The reward as a result of taking the action.
terminated (bool): Whether the agent reaches the terminal state (as defined under the MDP of the task), which can be positive or negative. An example is reaching the goal state or moving into the lava from the Sutton and Barto Gridworld. If true, the user needs to call reset().
truncated (bool): Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a timelimit, but could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
info (dict): Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
done (bool): (Deprecated) A boolean value for if the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of terminated and truncated attributes. A done signal may be emitted for different reasons: maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
- Return type:
tuple
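A minimal usage sketch for PushT (hedged: the action semantics are not documented in this module, so a random action is sampled):
from stable_worldmodel.envs.pusht import PushT

env = PushT(resolution=224, render_mode="rgb_array", with_target=True)
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
frame = env.render()  # RGB frame, per metadata["render_modes"]
env.close()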
stable_worldmodel.envs.simple_point_maze module
- class SimplePointMazeEnv(max_walls=6, min_walls=4, wall_min_size=0.5, wall_max_size=1.5, render_mode=None, show_goal: bool = True)[source]
Bases: Env
- close()[source]
After the user has finished using the environment, close contains the code necessary to “clean up” the environment.
This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.
- render(mode=None)[source]
Compute the render frames as specified by render_mode during the initialization of the environment.
The environment’s metadata render modes (env.metadata["render_modes"]) should contain the possible ways to implement the render modes. In addition, list versions for most render modes are achieved through gymnasium.make, which automatically applies a wrapper to collect rendered frames.
Note
As render_mode is known during __init__, the objects used to render the environment state should be initialised in __init__.
By convention, if render_mode is:
None (default): no render is computed.
“human”: The environment is continuously rendered in the current display or terminal, usually for human consumption. This rendering should occur during step() and render() doesn’t need to be called. Returns None.
“rgb_array”: Return a single frame representing the current state of the environment. A frame is an np.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
“ansi”: Return a string (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colors).
“rgb_array_list” and “ansi_list”: List-based versions of render modes are possible (except human) through the wrapper gymnasium.wrappers.RenderCollection, which is automatically applied during gymnasium.make(..., render_mode="rgb_array_list"). The frames collected are popped after render() or reset() is called.
Note
Make sure that your class’s metadata "render_modes" key includes the list of supported modes.
Changed in version 0.25.0: The render function was changed to no longer accept parameters; rather, these parameters should be specified when the environment is initialised, i.e., gymnasium.make("CartPole-v1", render_mode="human").
- reset(seed=None, options=None)[source]
Resets the environment to an initial internal state, returning an initial observation and info.
This method generates a new starting state, often with some randomness, to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.
Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.
For custom environments, the first line of reset() should be super().reset(seed=seed), which implements the seeding correctly.
Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.
- Parameters:
seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random) and the read-only attribute np_random_seed. If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset and the env’s np_random_seed will not be altered. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again.
options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment).
- Returns:
observation (ObsType): Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
info (dictionary): This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().
- Return type:
tuple
- step(action)[source]
Run one timestep of the environment’s dynamics using the agent actions.
When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.
Changed in version 0.26: The Step API was changed, removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated, which is critical for reinforcement learning bootstrapping algorithms.
- Parameters:
action (ActType) – an action provided by the agent to update the environment state.
- Returns:
observation (ObsType): An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
reward (SupportsFloat): The reward as a result of taking the action.
terminated (bool): Whether the agent reaches the terminal state (as defined under the MDP of the task), which can be positive or negative. An example is reaching the goal state or moving into the lava from the Sutton and Barto Gridworld. If true, the user needs to call reset().
truncated (bool): Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a timelimit, but could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
info (dict): Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
done (bool): (Deprecated) A boolean value for if the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of terminated and truncated attributes. A done signal may be emitted for different reasons: maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
- Return type:
tuple
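A minimal usage sketch for SimplePointMazeEnv (hedged: actions are sampled from the action space):
from stable_worldmodel.envs.simple_point_maze import SimplePointMazeEnv

env = SimplePointMazeEnv(max_walls=6, min_walls=4, render_mode="rgb_array", show_goal=True)
obs, info = env.reset(seed=0)
for _ in range(50):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        break
env.close()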
stable_worldmodel.envs.two_room module
- class TwoRoomEnv(render_size=224, render_mode='rgb_array')[source]
Bases: Env
A simple navigation two-room environment.
- add_circle(position, radius, color, *, is_goal=False)[source]
- check_collide(x, entity='agent')[source]
- check_one_door_fit(x)[source]
- check_other_room(x)[source]
- close()[source]
After the user has finished using the environment, close contains the code necessary to “clean up” the environment.
This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.
- render()[source]
Compute the render frames as specified by render_mode during the initialization of the environment.
The environment’s metadata render modes (env.metadata["render_modes"]) should contain the possible ways to implement the render modes. In addition, list versions for most render modes are achieved through gymnasium.make, which automatically applies a wrapper to collect rendered frames.
Note
As render_mode is known during __init__, the objects used to render the environment state should be initialised in __init__.
By convention, if render_mode is:
None (default): no render is computed.
“human”: The environment is continuously rendered in the current display or terminal, usually for human consumption. This rendering should occur during step() and render() doesn’t need to be called. Returns None.
“rgb_array”: Return a single frame representing the current state of the environment. A frame is an np.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
“ansi”: Return a string (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colors).
“rgb_array_list” and “ansi_list”: List-based versions of render modes are possible (except human) through the wrapper gymnasium.wrappers.RenderCollection, which is automatically applied during gymnasium.make(..., render_mode="rgb_array_list"). The frames collected are popped after render() or reset() is called.
Note
Make sure that your class’s metadata "render_modes" key includes the list of supported modes.
Changed in version 0.25.0: The render function was changed to no longer accept parameters; rather, these parameters should be specified when the environment is initialised, i.e., gymnasium.make("CartPole-v1", render_mode="human").
- reset(seed=None, options=None)[source]
Resets the environment to an initial internal state, returning an initial observation and info.
This method generates a new starting state, often with some randomness, to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.
Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.
For custom environments, the first line of reset() should be super().reset(seed=seed), which implements the seeding correctly.
Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.
- Parameters:
seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random) and the read-only attribute np_random_seed. If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset and the env’s np_random_seed will not be altered. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again.
options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment).
- Returns:
observation (ObsType): Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
info (dictionary): This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().
- Return type:
tuple
- seed(seed=None)[source]
- step(action)[source]
Run one timestep of the environment’s dynamics using the agent actions.
When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.
Changed in version 0.26: The Step API was changed, removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated, which is critical for reinforcement learning bootstrapping algorithms.
- Parameters:
action (ActType) – an action provided by the agent to update the environment state.
- Returns:
observation (ObsType): An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
reward (SupportsFloat): The reward as a result of taking the action.
terminated (bool): Whether the agent reaches the terminal state (as defined under the MDP of the task), which can be positive or negative. An example is reaching the goal state or moving into the lava from the Sutton and Barto Gridworld. If true, the user needs to call reset().
truncated (bool): Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a timelimit, but could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
info (dict): Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
done (bool): (Deprecated) A boolean value for if the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of terminated and truncated attributes. A done signal may be emitted for different reasons: maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
- Return type:
tuple
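A minimal usage sketch for TwoRoomEnv (hedged: action semantics are not documented here, so the action is sampled):
from stable_worldmodel.envs.two_room import TwoRoomEnv

env = TwoRoomEnv(render_size=224, render_mode="rgb_array")
obs, info = env.reset(seed=0)
frame = env.render()  # RGB frame of the two-room layout
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()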
stable_worldmodel.envs.utils module
- class DrawOptions(surface: Surface)[source]
Bases: SpaceDebugDrawOptions
- __init__(surface: Surface) → None[source]
Draw a pymunk.Space on a pygame.Surface object.
Typical usage:
>>> import pymunk
>>> surface = pygame.Surface((10, 10))
>>> space = pymunk.Space()
>>> options = pymunk.pygame_util.DrawOptions(surface)
>>> space.debug_draw(options)
You can control the color of a shape by setting shape.color to the color you want it drawn in:
>>> c = pymunk.Circle(None, 10)
>>> c.color = pygame.Color("pink")
See pygame_util.demo.py for a full example.
Since pygame uses a coordinate system where y points down (in contrast to many other cases), you either have to make the physics simulation with Pymunk also behave in that way, or flip everything when you draw.
The easiest is probably to just make the simulation behave the same way as Pygame does. In that way all coordinates used are in the same orientation and easy to reason about:
>>> space = pymunk.Space()
>>> space.gravity = (0, -1000)
>>> body = pymunk.Body()
>>> body.position = (0, 0)  # will be positioned in the top left corner
>>> space.debug_draw(options)
To flip the drawing, it is possible to set the module property positive_y_is_up to True. Then the pygame drawing will flip the simulation upside down before drawing:
>>> positive_y_is_up = True
>>> body = pymunk.Body()
>>> body.position = (0, 0)
>>> # Body will be positioned in the bottom left corner
- Parameters:
surface (pygame.Surface) – Surface that the objects will be drawn on.
- draw_circle(pos: Vec2d, angle: float, radius: float, outline_color: SpaceDebugColor, fill_color: SpaceDebugColor) → None[source]
- draw_dot(size: float, pos: tuple[float, float], color: SpaceDebugColor) → None[source]
- draw_fat_segment(a: tuple[float, float], b: tuple[float, float], radius: float, outline_color: SpaceDebugColor, fill_color: SpaceDebugColor) → None[source]
- draw_polygon(verts: Sequence[tuple[float, float]], radius: float, outline_color: SpaceDebugColor, fill_color: SpaceDebugColor) → None[source]
- draw_segment(a: Vec2d, b: Vec2d, color: SpaceDebugColor) → None[source]
- from_pygame(p: tuple[float, float], surface: Surface) → tuple[int, int][source]
Convenience method to convert pygame surface local coordinates to pymunk coordinates
- get_mouse_pos(surface: Surface) → tuple[int, int][source]
Get position of the mouse pointer in pymunk coordinates.
- light_color(color: SpaceDebugColor)[source]
- perturb_camera_angle(xyaxis, deg_dif=[3, 3])[source]
For OGBench Environments: Perturb the camera angle by a small random rotation.
- pymunk_to_shapely(body, shapes)[source]
- to_pygame(p: tuple[float, float], surface: Surface) → tuple[int, int][source]
Convenience method to convert pymunk coordinates to pygame surface local coordinates.
Note that in case positive_y_is_up is False, this function won’t actually do anything except converting the point to integers.
stable_worldmodel.envs.voidrun module
- class Action(LEFT: 'int' = 0, RIGHT: 'int' = 1, DOWN: 'int' = 2, UP: 'int' = 3)[source]
Bases:
object
- class VoidRunEnv(seed: int | None = None, render_mode: str = 'human')[source]
Bases: Env
Discrete grid environment with a 1x1 agent cell.
- check_location(x)[source]
- check_termination() → bool[source]
Success = all blocks are void except under the agent, AND the agent is at the goal position. For 1x1 agent, ‘under the agent’ is just its current cell.
- close() → None[source]
After the user has finished using the environment, close contains the code necessary to “clean up” the environment.
This is critical for closing rendering windows, database or HTTP connections. Calling close on an already closed environment has no effect and won’t raise an error.
- generate_board() → ndarray[source]
- generate_goal(*, cell_value: int = 3) → None[source]
- render(mode: str | None = None)[source]
Compute the render frames as specified by render_mode during the initialization of the environment.
The environment’s metadata render modes (env.metadata["render_modes"]) should contain the possible ways to implement the render modes. In addition, list versions for most render modes are achieved through gymnasium.make, which automatically applies a wrapper to collect rendered frames.
Note
As render_mode is known during __init__, the objects used to render the environment state should be initialised in __init__.
By convention, if render_mode is:
None (default): no render is computed.
“human”: The environment is continuously rendered in the current display or terminal, usually for human consumption. This rendering should occur during step() and render() doesn’t need to be called. Returns None.
“rgb_array”: Return a single frame representing the current state of the environment. A frame is an np.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image.
“ansi”: Return a string (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colors).
“rgb_array_list” and “ansi_list”: List-based versions of render modes are possible (except human) through the wrapper gymnasium.wrappers.RenderCollection, which is automatically applied during gymnasium.make(..., render_mode="rgb_array_list"). The frames collected are popped after render() or reset() is called.
Note
Make sure that your class’s metadata "render_modes" key includes the list of supported modes.
Changed in version 0.25.0: The render function was changed to no longer accept parameters; rather, these parameters should be specified when the environment is initialised, i.e., gymnasium.make("CartPole-v1", render_mode="human").
- render_board(ax: Axes | None = None) None[source]
- reset(*, seed: int | None = None, options: dict[str, Any] | None = None)[source]
Resets the environment to an initial internal state, returning an initial observation and info.
This method generates a new starting state, often with some randomness, to ensure that the agent explores the state space and learns a generalised policy about the environment. This randomness can be controlled with the seed parameter; otherwise, if the environment already has a random number generator and reset() is called with seed=None, the RNG is not reset.
Therefore, reset() should (in the typical use case) be called with a seed right after initialization and then never again.
For custom environments, the first line of reset() should be super().reset(seed=seed), which implements the seeding correctly.
Changed in version v0.25: The return_info parameter was removed and now info is expected to be returned.
- Parameters:
seed (optional int) – The seed that is used to initialize the environment’s PRNG (np_random) and the read-only attribute np_random_seed. If the environment does not already have a PRNG and seed=None (the default option) is passed, a seed will be chosen from some source of entropy (e.g. timestamp or /dev/urandom). However, if the environment already has a PRNG and seed=None is passed, the PRNG will not be reset and the env’s np_random_seed will not be altered. If you pass an integer, the PRNG will be reset even if it already exists. Usually, you want to pass an integer right after the environment has been initialized and then never again.
options (optional dict) – Additional information to specify how the environment is reset (optional, depending on the specific environment).
- Returns:
observation (ObsType): Observation of the initial state. This will be an element of observation_space (typically a numpy array) and is analogous to the observation returned by step().
info (dictionary): This dictionary contains auxiliary information complementing observation. It should be analogous to the info returned by step().
- Return type:
tuple
- set_state(board: ndarray, player_pos: tuple[int, int], *, validate: bool = True, render: bool = False) → dict[str, Any][source]
- step(action: int)[source]
Run one timestep of the environment’s dynamics using the agent actions.
When the end of an episode is reached (terminated or truncated), it is necessary to call reset() to reset this environment’s state for the next episode.
Changed in version 0.26: The Step API was changed, removing done in favor of terminated and truncated to make it clearer to users when the environment had terminated or truncated, which is critical for reinforcement learning bootstrapping algorithms.
- Parameters:
action (ActType) – an action provided by the agent to update the environment state.
- Returns:
observation (ObsType): An element of the environment’s observation_space as the next observation due to the agent actions. An example is a numpy array containing the positions and velocities of the pole in CartPole.
reward (SupportsFloat): The reward as a result of taking the action.
terminated (bool): Whether the agent reaches the terminal state (as defined under the MDP of the task), which can be positive or negative. An example is reaching the goal state or moving into the lava from the Sutton and Barto Gridworld. If true, the user needs to call reset().
truncated (bool): Whether the truncation condition outside the scope of the MDP is satisfied. Typically, this is a timelimit, but could also be used to indicate an agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached. If true, the user needs to call reset().
info (dict): Contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent’s performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. In OpenAI Gym <v26, it contains “TimeLimit.truncated” to distinguish truncation and termination; however, this is deprecated in favour of returning terminated and truncated variables.
done (bool): (Deprecated) A boolean value for if the episode has ended, in which case further step() calls will return undefined results. This was removed in OpenAI Gym v26 in favor of terminated and truncated attributes. A done signal may be emitted for different reasons: maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
- Return type:
tuple
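A minimal usage sketch for VoidRunEnv (hedged: the integer action below follows the Action constants defined above, and the observation format is not documented in this module):
from stable_worldmodel.envs.voidrun import VoidRunEnv

env = VoidRunEnv(seed=0, render_mode="rgb_array")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(2)  # 2 == DOWN per the Action constants
env.close()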
Module contents
- register(id, entry_point)[source]
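A hedged sketch of registering an environment with this helper (the id string and entry-point path below are hypothetical):
from stable_worldmodel.envs import register

# Hypothetical id and entry point; adjust to the environment you expose.
register(
    id="ImagePositioning-v0",
    entry_point="stable_worldmodel.envs.image_positioning:ImagePositioning",
)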