AI-Generated Gymnasium Environments

RL Environment Generation
powered by AI.

Describe what you need. Train agents.
Track experiments. Export results.

Gymnasium v0.29+ · 8 Automated Tests · PPO / SAC / DQN · Continue / Fine-Tune / Curriculum · AI Smart Suggestions · GitHub Export

What is an RL Environment?

A reinforcement learning environment defines the world an agent interacts with. The agent observes a state, takes an action, and receives a reward — learning to maximize cumulative reward over time.

[Diagram: the agent-environment loop. The agent's policy π(a|s) selects an action; env.step(action) returns an observation and a reward from the environment: s', r = env(s, a).]

Observation

What the agent sees — sensor data, positions, velocities, or any state representation the environment exposes.

Action

What the agent does — discrete choices or continuous control signals that affect the environment state.

Reward

The feedback signal — a scalar value that tells the agent how good its action was. The agent learns to maximize total reward.
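Concretely, the loop above looks like this in code. The sketch below follows the standard Gymnasium reset()/step() protocol; SimpleCorridorEnv is an illustrative toy, not one of the platform's environments.

```python
# Toy environment following the Gymnasium reset()/step() protocol.
# SimpleCorridorEnv is illustrative only, not a platform-generated env.
class SimpleCorridorEnv:
    """Agent starts at position 0 and must reach position `length`.
    Actions: 0 = left, 1 = right."""

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps

    def reset(self, seed=None):
        self.pos = 0
        self.steps = 0
        return self.pos, {}  # (observation, info), as in Gymnasium >= 0.26

    def step(self, action):
        self.steps += 1
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        terminated = self.pos >= self.length       # goal reached
        truncated = self.steps >= self.max_steps   # time limit hit
        reward = 1.0 if terminated else -0.01      # sparse goal reward, small step cost
        return self.pos, reward, terminated, truncated, {}

env = SimpleCorridorEnv()
obs, info = env.reset(seed=0)
total_reward, terminated, truncated = 0.0, False, False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(1)  # always move right
    total_reward += reward
print(obs, round(total_reward, 2))  # 5 0.96
```

The agent accumulates four step penalties plus the goal reward: the "maximize cumulative reward" objective in miniature.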

How It Works

From natural language to a trained agent in five steps. No boilerplate. No setup headaches.

01

Describe

Tell the AI what environment you need in plain English. Specify observations, actions, and goals.

02

Generate

The Architect Agent writes Gymnasium-compatible Python code, validated by 8 automated tests.

03

Test

Syntax, imports, reset, step, observation space, action space, reward sanity, and determinism — all checked.

04

Iterate

Chat with the AI or use smart suggestions to refine rewards, dynamics, and observations. Every change versioned.

05

Train

Continue, Fine-Tune, or Curriculum modes with configurable hyperparameters. Live metrics and experiment reports.
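Under the hood, the Train step is a standard RL loop. The platform trains PPO, SAC, or DQN; the dependency-free sketch below shows the same loop shape with tabular Q-learning on a 6-state corridor (all names and constants are illustrative):

```python
# Tabular Q-learning on a 6-state corridor: a minimal stand-in for the
# training loop (the platform itself uses PPO / SAC / DQN).
import random

N_STATES, GOAL = 6, 5
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; 0 = left, 1 = right
rng = random.Random(0)
alpha, gamma, epsilon = 0.1, 0.95, 0.2

for episode in range(500):
    s = 0
    for _ in range(50):
        # epsilon-greedy action selection
        a = rng.randrange(2) if rng.random() < epsilon else max((0, 1), key=lambda x: Q[s][x])
        s2 = min(GOAL, max(0, s + (1 if a == 1 else -1)))
        r = 1.0 if s2 == GOAL else -0.01
        # TD update: Q(s,a) += alpha * (r + gamma * max Q(s', .) - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if s == GOAL:
            break

greedy = [max((0, 1), key=lambda x: Q[s][x]) for s in range(GOAL)]
print(greedy)  # the learned greedy policy moves right toward the goal
```

Deep RL algorithms replace the table with a neural network, but the collect-experience-then-update cycle is the same.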

drone-navigation-v2 (v3, robotics) · ZIP · Train Agent
Create a drone navigation environment with obstacle avoidance

Created DroneNavEnv with an 18-dim observation (position, velocity, 8 distance sensors) and a continuous 4-dim thrust action.

✓ syntax · import · reset · step · obs · act · reward · determ.
Add wind turbulence as random perturbation

Added Gaussian wind N(0, 0.3) each step. New obs: wind_vector.

✓ syntax · import · reset · step · obs · act · reward · determ.
Dashboard · Code · Agent · History
7/8 Tests Passing: syntax · import · reset · step · obs_space · action_space · reward · determinism
Observation Space: Type Box · Shape [18] · Range [-1, 1]
Action Space: Type Box · Shape [4] · Range [-1, 1]

Builder Features

Every tool you need to design production-quality RL environments, built into one seamless workflow.

Full Version Control

Every iteration creates a new version. Browse history, compare diffs, and roll back to any previous state. Your environment has a complete audit trail.

Version History
v3: Added wind turbulence (7/8, 2m ago)
v2: Obstacle avoidance penalty (8/8, 8m ago)
v1: Initial generation (8/8, 12m ago)

8 Automated Tests

Each generation runs through syntax, import, reset, step, observation space, action space, reward sanity, and determinism checks. Catch errors before training.

8/8 Tests Passing · 0.42s: syntax · import · reset · step · obs_space · action_space · reward_sanity · determinism
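As a rough illustration, a few of these checks might look like the following. This is a hedged sketch: check_env and StubEnv are our illustrative names, not the platform's actual test harness, and they assume only the standard Gymnasium reset/step signatures.

```python
# Illustrative versions of the reset / step / determinism checks.
# check_env and StubEnv are sketches, not the platform's real harness.

def check_env(make_env):
    env = make_env()

    # reset check: must return an (obs, info) pair
    out = env.reset(seed=0)
    assert isinstance(out, tuple) and len(out) == 2, "reset() must return (obs, info)"

    # step check: must return the Gymnasium 5-tuple
    out = env.step(0)
    assert len(out) == 5, "step() must return (obs, reward, terminated, truncated, info)"
    obs, reward, terminated, truncated, info = out

    # reward sanity: finite scalar
    assert reward == reward and abs(reward) < 1e9, "reward must be a finite number"

    # determinism check: same seed + same actions => same observations
    env.reset(seed=123)
    s1 = env.step(0)[0]
    env.reset(seed=123)
    s2 = env.step(0)[0]
    assert s1 == s2, "env is not deterministic under a fixed seed"
    return True

class StubEnv:
    """Minimal deterministic env used to exercise check_env."""
    def reset(self, seed=None):
        self.pos = 0
        return self.pos, {}
    def step(self, action):
        self.pos += 1
        return self.pos, -0.01, False, False, {}

print(check_env(StubEnv))  # True
```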

Chat & AI Smart Suggestions

Refine your environment through conversation or use AI-driven smart suggestions that analyze your env and training results to recommend next steps — like increasing difficulty, tuning rewards, or trying different algorithms.

Make the reward sparse — only +1 at goal, 0 otherwise

Replaced distance-based reward with sparse signal. +1.0 within 0.1m of goal, 0.0 otherwise. Added small time penalty (-0.001/step).

All 8 tests pass.

v4 — reward function rewritten

Increase difficulty · Make rewards denser · Try SAC algorithm · Add noise to obs

Export Options

Download your environment as a ZIP file with all dependencies, or push directly to a GitHub repository. Ready for local development or CI/CD pipelines.

Download ZIP: env.py + requirements.txt + README.md
Push to GitHub: auto-creates a repo with a CI workflow
Copy Code: raw Python source to clipboard

Training & Experiments

Train agents directly in the platform. Compare runs across environment versions. Export detailed reports.

PPO · SAC · DQN

Advanced Training Modes

Choose an algorithm, configure hyperparameters (learning rate, batch size, gamma, network architecture), and select a training mode. Continue resumes from checkpoints, Fine-Tune uses a low learning rate for refinement, and Curriculum automatically increases environment difficulty.

Live reward curves
ETA & progress bar
Curriculum learning
Configurable hyperparams
Training in progress: 73.5% (73.5K / 100K steps)
Episode Reward: 78.2 · Episode Length: 55 · Success Rate: 97% · Policy Loss: 0.63
Continue: resume training · Fine-Tune: low LR, short runs · Curriculum: auto-difficulty
Step: 73,500 · Episodes: 1,247 · FPS: 4,280 · ETA: 6s · PPO
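The core idea behind the Curriculum mode, raising difficulty as the agent succeeds, can be sketched as follows. CurriculumScheduler, its window size, and its thresholds are illustrative assumptions, not the platform's internal scheduler.

```python
# Hedged sketch of a curriculum scheduler: bump environment difficulty
# once the recent success rate crosses a threshold.
from collections import deque

class CurriculumScheduler:
    def __init__(self, start=0.1, step=0.1, max_difficulty=1.0,
                 window=20, threshold=0.8):
        self.difficulty = start
        self.step = step
        self.max_difficulty = max_difficulty
        self.window = deque(maxlen=window)   # rolling success/failure buffer
        self.threshold = threshold

    def report(self, success):
        """Record an episode outcome; raise difficulty when the agent is ready."""
        self.window.append(success)
        full = len(self.window) == self.window.maxlen
        if full and sum(self.window) / len(self.window) >= self.threshold:
            self.difficulty = min(self.max_difficulty, self.difficulty + self.step)
            self.window.clear()              # restart measurement at the new difficulty
        return self.difficulty

sched = CurriculumScheduler()
for _ in range(20):
    sched.report(True)                       # 20 straight successes
print(round(sched.difficulty, 1))            # 0.2
```

The environment reads `sched.difficulty` at each reset to scale obstacle counts, noise, or goal distance.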

Experiment Tracking & Reports

Each training run is an experiment linked to a specific environment version. Browse an expandable experiment history with full metrics, compare runs side by side, and export to GitHub or download reports.

Side-by-side run comparison with metric diffs
Configurable hyperparameters: LR, batch size, gamma, net arch
Environment version tracking across training modes
GitHub export and PDF reports for documentation
Experiments · Compare Runs

Comparing Run #5 vs #4:
Reward: 78.2 (#5, PPO v3) vs 45.6 (#4, PPO v2)
Success: 97% vs 68%
Length: 55 vs 120

#5 · completed · PPO · v3 · 100K steps · R: 78.2 · 97% · 24s
#4 · completed · PPO · v2 · 100K steps · R: 45.6 · 68% · 22s
#3 · completed · SAC · v2 · 50K steps · R: 38.1 · 52% · 18s
#2 · completed · PPO · v1 · 50K steps · R: 12.4 · 15% · 11s

Template Environments

Explore published environments built with Kualia. Use them as starting points or study their design for inspiration.

finance · medium

5-Stock Trading

5-stock portfolio trading with synthetic GBM prices, transaction costs, and Sharpe-based reward.

View details
optimization · medium

Inventory Management

Warehouse inventory optimization for 3 products with stochastic demand.

View details
game · easy

Gridworld Maze

10x10 grid maze navigation with random walls. Agent must find path to goal.

View details
robotics · hard

Drone Grid Navigation

3D drone obstacle avoidance. Navigate to goal while avoiding random obstacles.

View details
control · easy

CartPole with Wind

Classic CartPole with random wind disturbance. Keep pole balanced despite wind forces.

View details
custom · medium

TurkishEditorEnv

A text editing environment where an RL agent learns to proofread Turkish documents containing translation errors from English and spelling mistakes violating Turkish "imla kuralları" (spelling rules). The agent navigates through documents using a sliding window, detecting and correcting character-level errors including Turkish-specific distinctions (ı/i, ğ/g, ş/s) and translation false friends, while managing a limited editing budget.

Obs: Box(0, 1, shape=(window_size * alphabet_size + 3,)): one-hot encoded text window (20 chars × 35 Turkish alphabet chars), normalized cursor position (1), remaining budget ratio (1), and error-density indicator (1).
Act: Discrete(5 + alphabet_size): 0=move_next, 1=mark_spelling_error, 2=mark_translation_error, 3=accept_text, 4=delete_char, 5-39=insert_specific_char (Turkish alphabet including ç, ğ, ı, ö, ş, ü).
View details
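The space sizes in the card above follow directly from its own parameters (a 20-character window over a 35-character alphabet); a quick sanity check:

```python
# Sanity-check the TurkishEditorEnv space sizes from the card's parameters.
WINDOW_SIZE, ALPHABET_SIZE = 20, 35

obs_dim = WINDOW_SIZE * ALPHABET_SIZE + 3  # one-hot window + cursor + budget + error density
n_actions = 5 + ALPHABET_SIZE              # move/mark/accept/delete + insert-specific-char
print(obs_dim, n_actions)  # 703 40
```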
navigation · medium

dynamic-goal-navigation-v0

A 2D point-mass navigation task where the target goal location evolves continuously to test adaptation to changing reward structures while maintaining fixed transition dynamics. The agent controls a point mass with double-integrator dynamics (mass=1.0, friction=0.1) in a 10x10 bounded arena. The goal follows configurable dynamics: static, linear drift with reflection, Brownian random walk, or periodic teleportation.

Obs: Box(shape=[6]) · Act: Box(shape=[2])
View details
classic_control · medium

non-stationary-cartpole

A CartPole environment with continuously drifting physical parameters (pole length and mass) to test adaptation to non-stationary dynamics. Parameters evolve via configurable schedules (sinusoidal, random walk, or abrupt steps). Observation space optionally includes temporal awareness features (sin/cos of phase) to help the agent anticipate parameter changes.

Obs: Box(shape=[6]) · Act: Discrete(shape=[1])
View details
navigation · medium

adaptive-goal-nav

A 2D point-mass navigation environment where a holonomic robot tracks a smoothly moving goal. The reward structure continuously morphs between dense (distance-based) and sparse (proximity-based) according to a time-varying alpha parameter, testing continual adaptation to non-stationary reward functions.

Obs: Box(shape=[6]) · Act: Box(shape=[2])
View details
navigation · medium

custom-env

Gymnasium-compatible continuous 2D navigation in a 10m x 10m arena.
Observation: Box(low=-10, high=10, shape=(14,), dtype=float32) containing [agent_x, agent_y, agent_vx, agent_vy, goal_relative_x, goal_relative_y, 8 lidar ray-cast distances].
Action: Box(low=-1, high=1, shape=(2,), dtype=float32) representing [force_x, force_y].
Physics: Euler integration with velocity damping 0.9.
Adaptation: a competency buffer tracks success/failure over the last 20 episodes to compute a score c ∈ [0, 1]; obstacle count N = 5 + floor(15·c). The placement policy evolves with c: (1) random uniform when c < 0.33; (2) corridor-blocking when 0.33 ≤ c < 0.66, using k-means clustering on recent agent trajectories to identify high-traffic zones and place obstacles that minimize passage width; (3) adversarial when c ≥ 0.66, using trajectory-distribution analysis to maximize the expected path length to the goal. Obstacles are static circles with radii 0.3-0.6m.
Reward: r_t = -0.1·||pos - goal||_2 - 0.01·||action||^2 + 10·success_flag - 5·collision_flag.
Termination: goal reached (distance < 0.5m), collision, or 500 steps.
Reset: randomizes start/goal positions (min separation 8m) and regenerates obstacles via the current placement policy based on c.

Obs: Box(shape=[14]) · Act: Box(shape=[2])
View details
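The card above states its reward function explicitly, so it can be transcribed directly; nav_reward is our illustrative helper name.

```python
# Direct transcription of the card's reward:
# r_t = -0.1*||pos - goal||_2 - 0.01*||action||^2 + 10*success - 5*collision
import math

def nav_reward(pos, goal, action, success, collision):
    dist = math.hypot(pos[0] - goal[0], pos[1] - goal[1])   # ||pos - goal||_2
    action_cost = action[0] ** 2 + action[1] ** 2           # ||action||^2
    return -0.1 * dist - 0.01 * action_cost + 10.0 * success - 5.0 * collision

# 5 m from the goal, unit force, no terminal event:
print(nav_reward((0.0, 0.0), (3.0, 4.0), (1.0, 0.0), False, False))
```

The dense distance term gives gradient everywhere, while the large terminal bonuses dominate at episode end.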
resource_management · medium

custom-env

Gymnasium-compatible continuous resource management with 3 interdependent resources (A, B, C).
Observation: Box(low=0, high=100, shape=(15,), dtype=float32): [storage_A, storage_B, storage_C, demand_A, demand_B, demand_C, demand_derivative_A, demand_derivative_B, demand_derivative_C, coupling_AB, coupling_BC, coupling_CA, time_since_shock, rolling_efficiency_score, normalized_step].
Action: Box(low=0, high=10, shape=(6,), dtype=float32): [produce_A, produce_B, produce_C, convert_A_to_B, convert_B_to_C, convert_C_to_A].
Dynamics: storage_{t+1} = storage_t + production + conversion_in - conversion_out - demand_t - waste. Demand follows the non-stationary process d_t = d_base + α·sin(ω·t), where ω = ω_base·(1 + e) scales with efficiency e ∈ [0, 1] (rolling satisfied_demand / total_demand over 100 steps). Shock events occur with probability p = 0.01 + 0.2·max(0, e - 0.7). Coupling coefficients C_ij (resource i requires resource j) evolve as C_ij = C_base·e, creating progressive interdependencies; higher e increases production complexity and demand non-stationarity.
Reward: r_t = -Σ|demand_t - satisfied_t| - 0.5·Σ waste - 0.01·||action||^2.
Episode length: 1000 steps. Reset initializes storage at 50 units, sets the coupling matrix from performance history (persistence across episodes), and samples new demand phase parameters.

Obs: Box(shape=[15]) · Act: Box(shape=[6])
View details
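The card's non-stationary demand process, d_t = d_base + α·sin(ω·t) with ω scaled by efficiency, can be sketched directly. The concrete constants (d_base, alpha, omega_base) are placeholders we chose for illustration, since the card leaves them symbolic.

```python
# Hedged sketch of the card's demand process; constants are illustrative.
import math

def demand(t, e, d_base=10.0, alpha=3.0, omega_base=0.05):
    # omega = omega_base * (1 + e): higher efficiency -> faster demand cycles
    omega = omega_base * (1.0 + e)
    return d_base + alpha * math.sin(omega * t)

print(round(demand(0, 0.0), 2), round(demand(100, 0.9), 2))
```

As the agent's rolling efficiency e rises, the same timestep t sits at a different phase of a faster cycle, so a policy tuned to the old rhythm degrades: the intended adaptation pressure.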
navigation · hard

Morphing Grid Navigation

A 10x10 partially observable grid world where the agent navigates from bottom-left to a dynamically relocating goal while collecting resources. After every action, the environment stochastically morphs: walls toggle with 30% probability per cell, the goal teleports, and resources shift positions. The agent receives a 5x5 local view of walls and relative vectors to the goal and nearest resources. Features anti-oscillation and stagnation penalties to prevent reward hacking.

Obs: Box(shape=[41]) · Act: Discrete(shape=[1])
View details
finance · medium

Dynamic Resource Market

A medium-complexity economic simulation where an agent manages a portfolio of 5 resources. The agent observes resource quantities, market prices, and demand levels to make buy/sell/hold decisions. Market conditions exhibit volatility cycles (20-50% fluctuation ranges) and random scarcity/abundance events. The agent must optimize portfolio value while minimizing transaction costs and drawdowns.

Obs: Box(shape=[17]) · Act: Discrete(shape=[15])
View details
navigation · hard

DynamicMazeWithSelfObservation

RESEARCH HYPOTHESIS: In dynamically changing environments where the environment state transitions after each agent action, agents that incorporate self-observation data (recent action sequences, reward histories, environment change patterns, and strategy-effectiveness metrics) into their observation space will show superior adaptation compared to agents that only observe the external environment state.

SUB-HYPOTHESES: H1: agents with access to their own recent action history and corresponding rewards adapt faster to dynamic environment changes than agents without this capability. H2: agents that track how the environment responds to their actions develop more robust strategies in dynamic settings. H3: agents that monitor their own strategy effectiveness (progress toward goals over recent timesteps) avoid repeating ineffective action sequences and converge to better policies.

USER'S ORIGINAL IDEA (translated from Turkish): a dynamic RL environment and an adaptive agent. Everything in the environment changes after each action, so the agent must adapt to the new situation every time. The hypothesis has three layers: (1) the environment must change dynamically; (2) the agent must store experience, explicitly keeping the knowledge "in this state I did X, this happened, and the environment changed like this"; (3) the agent must perform self-observation. Instead of observing only the external environment (walls, goal, resources), the agent should also observe its own internal state: "what did I do in the last 5 steps, what happened, how did the environment change, did my strategy work?" Concretely, the env code must add a summary of the agent's own experience to the observation space: the action sequence over the last N steps (what I did), the reward sequence over the last N steps (what happened), an environment-change vector (how the environment changed: wall toggle rate, how far the goal moved), strategy effectiveness (did I get closer to or farther from the goal over the last K steps), and experience patterns (which strategies worked under this kind of environment change). The agent's observation should carry not only "what the external world looks like now" but also "what I did, what happened, and how the environment responded".

CRITICAL: The environment MUST implement ALL aspects of the hypothesis, including agent-side mechanisms (self-observation, experience storage, adaptive-behavior tracking), as part of the OBSERVATION SPACE and REWARD FUNCTION. Do not just build the environment dynamics; also embed the agent-side requirements into the env's observation/reward design.

ENVIRONMENT SPECIFICATION. OBSERVATION SPACE (147 dims): (1) current position (x, y) [2]; (2) goal position (x, y) [2]; (3) 10x10 grid maze layout (walls=1, empty=0) [100]; (4) last 5 actions taken [5: 0=up, 1=down, 2=left, 3=right, 4=stay]; (5) last 5 rewards received [5]; (6) environment-change vector: wall toggle frequency in the 3x3 local area over the last 10 steps [9]; (7) goal displacement distance over the last 10 steps [10]; (8) strategy effectiveness: distance-to-goal change over the last 5 steps [5]; (9) action-pattern effectiveness: reward per action type over the last 10 steps [4]; (10) local exploration coverage: unique cells visited in the last 15 steps [5]. ACTION SPACE: Discrete(5): up, down, left, right, stay. TRANSITION DYNAMICS: after each action, (a) each wall in the 5x5 area around the agent toggles with 30% probability, (b) the goal shifts 1-2 cells in a random direction with 40% probability, (c) new walls may block the current path to the goal. REWARD FUNCTION: +10 for reaching the goal; -0.1 per timestep; -1 for hitting a wall; +0.5 for moving closer to the goal and -0.5 for moving away; +1.0 bonus if the agent reaches the goal in fewer actions than its previous 3 attempts (strategy-improvement reward); -0.2 penalty for repeating the same failed 3-action sequence that previously led to a wall collision. EPISODE TERMINATION: goal reached, 200 timesteps elapsed, or agent stuck (same position for 10 consecutive steps). AGENT-SIDE REQUIREMENTS: the agent must maintain internal buffers for action history and reward history and compute the strategy-effectiveness metrics that feed the observation space.

Obs: Box(shape=[147]) · Act: Discrete(5)
View details
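The "self-observation" portion of the observation space above can be sketched with a small rolling buffer. SelfObservationTracker, its window size, and its feature layout are illustrative assumptions, not the card's actual implementation.

```python
# Hedged sketch of self-observation features: append the agent's own recent
# actions, rewards, and a strategy-effectiveness signal to the external state.
from collections import deque

class SelfObservationTracker:
    def __init__(self, n=5):
        self.actions = deque([0] * n, maxlen=n)    # last N actions (what I did)
        self.rewards = deque([0.0] * n, maxlen=n)  # last N rewards (what happened)
        self.dists = deque([0.0] * n, maxlen=n)    # distance-to-goal history

    def record(self, action, reward, dist_to_goal):
        self.actions.append(action)
        self.rewards.append(reward)
        self.dists.append(dist_to_goal)

    def features(self):
        # Strategy effectiveness: did I move toward the goal over the window?
        trend = self.dists[0] - self.dists[-1]
        return list(self.actions) + list(self.rewards) + [trend]

trk = SelfObservationTracker()
for step, d in enumerate([9.0, 8.0, 8.5, 7.0, 6.0]):
    trk.record(action=step % 5, reward=-0.1, dist_to_goal=d)
print(len(trk.features()), trk.features()[-1])  # 11 3.0
```

The environment would concatenate `features()` onto the external state vector at every step, so the policy network sees its own recent behavior alongside the world.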
custom · hard

AdaptiveResourceGatheringWithExperienceTracking

RESEARCH HYPOTHESIS (shared with DynamicMazeWithSelfObservation above): agents that incorporate self-observation data (recent action sequences, reward histories, environment change patterns, and strategy-effectiveness metrics) into their observation space will adapt better to environments that change after every action than agents that observe only the external state; the environment must embed these agent-side mechanisms (self-observation, experience storage, adaptive-behavior tracking) directly in its observation space and reward function.

ENVIRONMENT SPECIFICATION. OBSERVATION SPACE (89 dims): (1) agent position (x, y) [2]; (2) resource locations (5 resources, each with x, y, type, quantity) [20]; (3) agent inventory (4 resource types) [4]; (4) market prices per resource type [4]; (5) last 8 actions (0=move_up, 1=move_down, 2=move_left, 3=move_right, 4=gather, 5=sell) [8]; (6) last 8 rewards [8]; (7) resource-regeneration pattern: quantity change per resource over the last 5 timesteps [25]; (8) price volatility: price change per resource over the last 4 timesteps [16]; (9) gathering efficiency: resources gathered per gather action over the last 6 attempts [6]; (10) market-timing effectiveness: profit per sell action over the last 4 sales [4]; (11) exploration diversity: number of different resource types interacted with in the last 10 actions [1]; (12) strategy-consistency score: correlation between action sequences and positive rewards over the last 15 actions [1]. ACTION SPACE: Discrete(6): move in 4 directions, gather at the current location, sell inventory at market. TRANSITION DYNAMICS: after each action, (a) resource quantities change by ±20-50% with 60% probability, (b) resource types may change (wood→stone, etc.) with 25% probability, (c) market prices fluctuate ±10-30% based on the agent's recent selling behavior, (d) new resources spawn randomly while others deplete. REWARD FUNCTION: + quantity × price for selling resources; +2 for gathering rare resources; -0.05 per timestep; +3.0 bonus for selling when prices are in the top 25% of recent history (market timing); +1.5 bonus for maintaining a diverse resource portfolio; -1.0 penalty for repeating the same failed 4-action gathering sequence that previously yielded zero resources. EPISODE TERMINATION: 300 timesteps elapsed, total profit exceeds 100 units, or profit falls below -20 (bankruptcy). AGENT-SIDE REQUIREMENTS: the agent must track market-timing patterns and resource-availability changes and maintain an experience buffer linking action sequences to profitability outcomes.

Obs: Box(shape=[89]) · Act: Discrete(6)
View details
navigation · hard

MorphingMazeAdaptiveIntelligence

A 10x10 grid maze with continuous morphing dynamics where walls toggle probabilistically after every action, the goal relocates periodically, and resources respawn. The 63-dimensional observation space includes external state (position, local wall view), experience storage (action/reward history), self-observation metrics (distance trends, collision rates, exploration efficiency), and meta-awareness signals (environment change magnitude, pattern familiarity). Designed to test whether self-observational capabilities improve adaptation speed in non-stationary environments.

Obs: Box(shape=[63]) · Act: Discrete(shape=[4])
View details
navigation · hard

VariableMorphingComplexityEnvironment

A 12x12 morphing maze with parametric volatility (10-30% wall toggle), variable goal relocation (50-200 steps), and dynamic resource counts (2-5). Features a 75-dimensional observation space combining external state, experience storage, and enhanced self-observation metrics for studying adaptation to continuous environmental changes. Tests whether self-observing agents outperform external-only agents under varying morphing intensities.

Obs: Box(shape=[75]) · Act: Discrete(shape=[4])
View details

Ready to build your environment?

Describe your RL problem, generate the environment, train an agent, and export your results. All in one place.

Start Building