This project implements two actor variants that extend the standard Gaussian actor used in Soft Actor-Critic (SAC) style agents. The goal of both designs is to transform and enrich the action-noise distribution in a non-linear, learned fashion to improve stability and exploration.
- Standard stochastic policies in actor-critic methods sample actions from simple parametric distributions (e.g. diagonal Gaussian). While effective, these distributions can limit the kinds of exploration the agent can perform.
- By learning a richer, invertible or decodable transformation of the base noise distribution, we get more expressive action distributions while retaining tractable density evaluation via change-of-variables (important for off-policy algorithms that need log probabilities).
- This repository contains two approaches: a LatentActor (decoded latent action) and a FlowActor (normalizing-flow based transformation).
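Before the two variants, here is a minimal, self-contained illustration of the change-of-variables pattern they rely on, using the familiar tanh squashing from SAC-style actors. This is not the repository's code, only the generic idea the actors below build on.

```python
# Change-of-variables sketch with tanh squashing (illustrative, not repo code).
import torch
from torch.distributions import Normal

mean, log_std = torch.zeros(2), torch.zeros(2)
base = Normal(mean, log_std.exp())   # diagonal Gaussian over the latent z
z = base.rsample()                   # reparameterized sample
a = torch.tanh(z)                    # invertible transform z -> a

# log p(a) = log p(z) - sum_i log|d tanh(z_i)/dz_i|, with d tanh(z)/dz = 1 - tanh(z)^2
log_prob = base.log_prob(z).sum() - torch.log(1.0 - a.pow(2) + 1e-6).sum()
```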
- **LatentActor** (in `project/models/actor.py`)
  - Samples a latent action from the base diagonal Gaussian produced by the underlying `Actor`.
  - Optionally decodes the latent action through a small MLP decoder (conditioned on features if `conditional_decoder=True`) to produce the final environment action.
  - Useful when you want a low-dimensional latent policy with richer decoding to the action space, or when you want to constrain the latent space (e.g., with `tanh` to keep values within [-1, 1]).
  - Provides `get_latent_action`, `get_decoded_action`, and `action_log_prob`, which return both a decoded action and the latent log-probability.
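The following is a minimal sketch of the latent-decoding idea. The module and method names are hypothetical; the actual `LatentActor` in `project/models/actor.py` integrates with the stable_baselines3 `Actor` and may differ in structure.

```python
# Illustrative latent head: Gaussian latent + feature-conditioned MLP decoder.
import torch
import torch.nn as nn
from torch.distributions import Normal


class TinyLatentHead(nn.Module):
    def __init__(self, feature_dim: int, latent_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(feature_dim, latent_dim)
        self.log_std = nn.Linear(feature_dim, latent_dim)
        # Decoder conditioned on features (analogue of conditional_decoder=True).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + feature_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, features: torch.Tensor):
        dist = Normal(self.mean(features), self.log_std(features).exp())
        z = dist.rsample()                   # latent action
        log_prob = dist.log_prob(z).sum(-1)  # latent log-probability
        action = self.decoder(torch.cat([z, features], dim=-1))
        return action, log_prob
```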
- **FlowActor** (in `project/models/actor.py`)
  - Samples a latent action from the base diagonal Gaussian.
  - Transforms that latent sample through a conditional or unconditional normalizing flow (RealNVP implementation provided in `project/models/flow.py`) to obtain a flexible final action distribution.
  - When using flows, the log-probability of the final action is obtained via change of variables: log p(x) = log p(z) - log|det df/dz| (or equivalently with the inverse-Jacobian sign convention used by the flow implementation).
  - Provides `get_flow_action` and a custom `action_log_prob` that adjusts the latent log-prob with the flow log-determinant.
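The core of that adjustment can be sketched as follows. The `flow(z, features)` callable and its return value are assumptions for illustration; the actual `FlowActor.action_log_prob` may be organized differently.

```python
# Sketch of a flow-adjusted log-prob (forward-Jacobian convention assumed).
import torch
from torch.distributions import Normal


def flow_action_log_prob(mean, log_std, flow, features=None):
    """Return an action and its log-prob under a flow-transformed Gaussian."""
    base = Normal(mean, log_std.exp())
    z = base.rsample()                                # latent sample from the base Gaussian
    x, fwd_logdet = flow(z, features)                 # (conditional) flow maps z -> action
    log_prob = base.log_prob(z).sum(-1) - fwd_logdet  # change of variables
    return x, log_prob
```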
- Latent decoding keeps the policy in a compact latent space while allowing a non-linear mapping to the action — this can regularize learning and reduce variance of the policy network outputs.
- Normalizing flows give an exact (tractable) density under flexible transformations. They can represent multimodal and skewed distributions that diagonal Gaussians cannot.
- Both methods try to improve exploration by changing how stochasticity is injected into actions, and they can improve numerical stability when the decoder/flow is trained alongside the actor.
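For intuition on why the density stays exact under a flexible transform, here is a minimal unconditional affine coupling layer in the spirit of RealNVP. It is illustrative only and not the (optionally conditional) implementation in `project/models/flow.py`.

```python
# Minimal RealNVP-style affine coupling layer with an exact log-determinant.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z: torch.Tensor):
        z1, z2 = z[..., :self.half], z[..., self.half:]
        scale, shift = self.net(z1).chunk(2, dim=-1)
        x2 = z2 * torch.exp(scale) + shift   # elementwise affine transform of the second half
        logdet = scale.sum(-1)               # log|det dx/dz| reduces to the sum of scales
        return torch.cat([z1, x2], dim=-1), logdet
```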
- The actors extend the base stable_baselines3 `Actor` class. They can be plugged into existing SAC training loops by replacing the policy actor with the `LatentActor` or `FlowActor` variant where the rest of the agent expects the same interface (see the sketch after this list).
- Key constructor args in `LatentActor`/`FlowActor`:
  - `latent_dim`: dimensionality of the latent action space (smaller => stronger bottleneck).
  - `latent_arch`/`flow_arch`: MLP sizes for the decoder or flow hidden layers.
  - `conditional_decoder`/`conditional_flow`: whether to condition the decoder/flow on observation features.
  - `constrain_latent_space`: if True, applies `tanh` to mean actions to keep them in [-1, 1].
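One way to wire a variant in is through stable_baselines3's `SACPolicy.make_actor` hook, sketched below. The extra keyword arguments mirror the constructor args listed above, but the exact `LatentActor` signature in this repository may differ; treat this as a template, not the repository's wiring.

```python
# Hypothetical policy override that builds a LatentActor instead of the default Actor.
from typing import Optional

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.sac.policies import SACPolicy

from project.models.actor import LatentActor  # repository actor variant


class LatentSACPolicy(SACPolicy):
    def __init__(self, *args, latent_dim: int = 4, latent_arch=(64, 64),
                 conditional_decoder: bool = True, **kwargs):
        # Store the extra actor kwargs before super().__init__ triggers actor construction.
        self._latent_kwargs = dict(
            latent_dim=latent_dim,
            latent_arch=list(latent_arch),
            conditional_decoder=conditional_decoder,
        )
        super().__init__(*args, **kwargs)

    def make_actor(self, features_extractor: Optional[BaseFeaturesExtractor] = None) -> LatentActor:
        actor_kwargs = self._update_features_extractor(self.actor_kwargs, features_extractor)
        return LatentActor(**actor_kwargs, **self._latent_kwargs).to(self.device)
```

Passing `policy=LatentSACPolicy` (with extra kwargs via `policy_kwargs`) to `SAC(...)` would then use the custom actor; an analogous override works for `FlowActor`.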
- Careful: the sign used when applying the flow log-determinant depends on the convention used in the flow implementation (a worked toy example follows this list):
  - If your flow returns `logdet = log|det dx/dz|` (forward Jacobian), use `log p(x) = log p(z) - logdet`.
  - If it returns `logdet = log|det dz/dx|` (inverse Jacobian), add the `logdet` instead.
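A worked check of the two conventions, using the toy transform x = 2z (purely illustrative; it does not reference the repository's flow code):

```python
# Both sign conventions give the same density for x = 2 * z with z ~ N(0, 1).
import torch
from torch.distributions import Normal

base = Normal(torch.zeros(1), torch.ones(1))
z = base.sample()
x = 2.0 * z

fwd_logdet = torch.log(torch.tensor(2.0))   # log|det dx/dz|
inv_logdet = -fwd_logdet                    # log|det dz/dx|

log_px_fwd = base.log_prob(z) - fwd_logdet  # subtract the forward Jacobian
log_px_inv = base.log_prob(z) + inv_logdet  # add the inverse Jacobian
assert torch.allclose(log_px_fwd, log_px_inv)

# Reference: x = 2z with z ~ N(0, 1) is distributed as N(0, 2^2).
assert torch.allclose(log_px_fwd, Normal(0.0, 2.0).log_prob(x))
```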
- See the `FlowActor.action_log_prob` implementation for how the repository currently applies this term, and confirm that it matches the convention used in `project/models/flow.py`.
- `project/models/actor.py`: `LatentActor` and `FlowActor` implementations.
- `project/models/flow.py`: flow layers and the RealNVP / ConditionalRealNVP implementations.
- `project/models/policy.py`: higher-level policy wiring that consumes actor outputs.
- `scripts/train.py`: example training entrypoint (if present) showing how the model is instantiated for experiments.
- No experimental results are included yet. Recommended first experiments:
- Compare the SAC baseline (diagonal Gaussian actor) vs LatentActor vs FlowActor on a simple continuous control task (e.g., Pendulum or continuous LunarLander).
- Ablate latent dimensionality and conditional vs unconditional decoders/flows.
- Track training stability (variance over seeds) and sample efficiency (reward vs wall-clock).
- Add clear unit tests for the flow log-determinant sign and for `action_log_prob` correctness. A small density check (sample z -> x and compare log-probs via change of variables) is a cheap sanity test; a pytest-style sketch follows this list.
- Add a CLI example or tutorial notebook showing how to instantiate and train with each actor.
- Run experiments and populate a `results/` subfolder with plots and metrics.
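A hypothetical pytest sketch for that sanity check. It assumes a `flow` fixture exposing `forward(z) -> (x, logdet)` and `inverse(x) -> (z, logdet)`; the actual API in `project/models/flow.py` may differ.

```python
# Sanity checks for invertibility, Jacobian consistency, and change of variables.
import torch
from torch.distributions import Normal


def test_flow_logdet_and_density(flow) -> None:
    latent_dim = 2  # assumed latent size for the check
    base = Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
    z = base.sample((16,))
    x, fwd_logdet = flow.forward(z)
    z_rec, inv_logdet = flow.inverse(x)

    # Invertibility: mapping forward and back should recover the latent sample.
    assert torch.allclose(z, z_rec, atol=1e-5)
    # Jacobian consistency: log|det dx/dz| + log|det dz/dx| should be ~0.
    assert torch.allclose(fwd_logdet + inv_logdet, torch.zeros_like(fwd_logdet), atol=1e-5)
    # Change of variables: the transformed log-prob must be finite and per-sample.
    log_px = base.log_prob(z).sum(-1) - fwd_logdet
    assert log_px.shape == (16,) and torch.isfinite(log_px).all()
```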
This project uses a dedicated environment implementation for the 2D inverse kinematics benchmark. The environment is provided as the `ik_rl` subpackage inside `project/environment/ik_rl` and has its own packaging and documentation in that folder.
Install the environment
There are two ways to make the IK environment available to experiments:
- Install the local package (recommended for development). From the repository root, run (this uses the Poetry-managed Python environment):

  ```bash
  poetry install --no-root
  poetry run pip install -e project/environment/ik_rl
  ```

  This installs the `ik_rl` package in editable mode so you can modify the environment code and test changes immediately.
- Use `PYTHONPATH` during development (quick, no install):

  ```bash
  export PYTHONPATH="$PYTHONPATH:$(pwd)/project/environment/ik_rl"
  python -m project ...
  ```

The `project` package provides a small CLI wrapper that delegates to the `Entrypoint` class in `project/entrypoint.py`. Two main commands are provided:
- `train-sac`: start a SAC training run using the configured actor/policy.
- `render-sac`: render a trained checkpoint.
Examples (from repository root):
```bash
# Train (uses Hydra config at configs/train_sac.yaml)
python -m project train-sac --help
python -m project train-sac

# Render a checkpoint
python -m project render-sac --checkpoint path/to/checkpoint --device cpu
```

Internals: the CLI uses pyargwriter's Hydra wrapper to pass the Hydra config to `Entrypoint.train_sac`. The `Entrypoint` class in `project/entrypoint.py` calls `project.scripts.train.train_sac(config, force, device)` under the hood.
- Make sure the `ik_rl` environment is importable (either installed or on `PYTHONPATH`). Missing environment imports will raise at runtime when constructing envs in `scripts/train.py`.
- Use the `--help` flags to reveal the available config overrides supported by pyargwriter and the entrypoint.
- The project relies on Poetry and specific pinned dependencies (see `pyproject.toml`). If you prefer pip/venv, install the packages listed under `[project].dependencies` in `pyproject.toml` into your virtualenv.
- To iterate quickly on environments, use the editable install (`pip install -e project/environment/ik_rl`) and run the entrypoint from the repo root.
- Unit tests for the environment are available in `project/environment/ik_rl/tests/` and can be run with `pytest` once dependencies are installed.
This project uses the MIT license. For questions, open an issue or contact the maintainer.
If you use this code in your research, please cite the repository:
```bibtex
@misc{uhrich2025spark,
  title={SPARK: Stochastic Policies Augmented for Robust Knowledge/Exploration},
  author={Robin Uhrich},
  year={2025},
  publisher={GitHub},
  journal={GitHub repository},
  url={https://github.com/RobinU434/SPARK},
}
```