Assessing Dataset Quality

In this tutorial, we will show how to use our pre-implemented metrics for assessing dataset quality. One can use these metrics to determine the type of offline RL algorithm to use [Schweighofer et al., 2022] or to choose between different available datasets [Swazinna et al., 2021]. We will generate two datasets for the custom MountainCar environment introduced in the “Introduction.ipynb”:

Dataset A: generated by a random policy
Dataset B: generated by a policy trained online using PPO.

We will then compute two coverage metrics:

State-action coverage (SACo) [Schweighofer et al., 2022]
Behavioral entropy (BE) [Suttle et al., 2025]

as well as three reward-based metrics:

Trajectory quality (TQ) [Schweighofer et al., 2022],
Estimated return improvement (ERI) [Swazinna et al., 2021], and
Average Q-value [Asadulaev et al., 2025]

for these datasets. Finally, we will use offline RL (CQL) to train one agent per dataset and compare their performance. Can the quality metrics serve as an indicator for offline RL performance?
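
For rough intuition about two of these metrics, here is a minimal sketch. It assumes the definitions of Schweighofer et al. [2022] in simplified form: SACo relates the number of unique state-action pairs in a dataset to a reference count, and TQ rescales the average return between a minimum and a maximum return. The function names and toy data are hypothetical, and this is not PyTupli's implementation.

```python
def saco_sketch(transitions, reference_count):
    """Unique (state, action) pairs in the dataset, relative to a reference count."""
    unique_pairs = {(tuple(state), action) for state, action in transitions}
    return len(unique_pairs) / reference_count


def tq_sketch(avg_return, min_return, max_return):
    """Average return rescaled between a minimum and a maximum return."""
    return (avg_return - min_return) / (max_return - min_return)


# toy data: discretized states as tuples of bin indices, actions as integers
toy_transitions = [((0, 1), 0), ((0, 1), 0), ((0, 2), 1), ((3, 1), 2)]
saco_value = saco_sketch(toy_transitions, reference_count=6)  # 3 unique pairs out of 6
tq_value = tq_sketch(avg_return=-150.0, min_return=-200.0, max_return=-100.0)
print(saco_value, tq_value)
```

Both quantities are ratios, so their values depend heavily on the chosen reference count and return range.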

Note: This is merely an example to introduce the usage of our quality metrics. We do not claim our offline RL performance evaluation to be comprehensive, as we only use a single algorithm and a single seed. For more background on how to assess the predictive quality of performance metrics, please see the references below.

References

Asadulaev, Arip, Fakhri Karray, and Martin Takac. “Expert or not? assessing data quality in offline reinforcement learning.” arXiv preprint arXiv:2510.12638, 2025.

Schweighofer, Kajetan, et al. “A dataset perspective on offline reinforcement learning.” Conference on Lifelong Learning Agents. PMLR, 2022.

Suttle, Wesley A., Aamodh Suresh, and Carlos Nieto-Granda. “Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning.” arXiv preprint arXiv:2502.04141, 2025.

Swazinna, Phillip, Steffen Udluft, and Thomas Runkler. “Measuring data quality for dataset selection in offline reinforcement learning.” 2021 IEEE Symposium Series on Computational Intelligence (SSCI).

[ ]:

import torch
import sys
from pathlib import Path

import numpy as np
import pandas as pd

from gymnasium.wrappers import TimeLimit

from functools import partial

from stable_baselines3 import PPO

import d3rlpy
from d3rlpy.algos import DiscreteCQLConfig
from d3rlpy.dataset import MDPDataset

from pytupli.storage import TupliAPIClient, FileStorage
from pytupli.schema import FilterEQ
from pytupli.dataset import TupliDataset, NumpyTupleParser
from pytupli.quality_metrics import (
    QFunctionMetric,
    MLP,
    SACoMetric,
    GeneralizedBehavioralEntropyMetric,
    EstimatedReturnImprovementMetric,
    AverageReturnMetric,
)

# locate repository / tutorials folder robustly
def find_repo_root():
    p = Path.cwd()
    for parent in [p] + list(p.parents):
        if (parent / '.git').exists():
            return parent
    try:
        import pytupli

        return Path(pytupli.__file__).resolve().parent.parent
    except Exception:
        return Path.cwd()


REPO_ROOT = find_repo_root()
TUTORIALS_ROOT = REPO_ROOT / 'docs' / 'source' / 'tutorials'
if str(TUTORIALS_ROOT) not in sys.path:
    sys.path.insert(0, str(TUTORIALS_ROOT))

# import the local tutorial helpers only after the tutorials folder is on sys.path
from custom_env import CustomMountainCarEnv, MyTupliEnvWrapper, MyCallback, discretize_observation  # noqa: E402
# canonical data / model locations (relative to tutorials folder)
DATA_DIR = TUTORIALS_ROOT / 'data'
DATA_PATH = DATA_DIR / 'wind_data.csv'
MODEL_PATH = DATA_DIR / 'mountain_car_ppo_model'
# convert to strings for APIs that expect str
DATA_PATH = str(DATA_PATH)
MODEL_PATH = str(MODEL_PATH)

PyTupli has two storage options: a local FileStorage and the TupliAPIClient, which uses MongoDB as a backend. You can run this notebook with either storage type by adjusting the flag below. If you want to use the TupliAPIClient, follow the instructions in the Readme to start the application.

[ ]:

STORAGE_FLAG = 'file'  # "api"

Benchmark Creation

We first have to prepare a benchmark for which to record the datasets. The following steps are explained in more detail in “Introduction.ipynb”, which is why we will not go into detail here.

[ ]:

# which storage to use
if STORAGE_FLAG == 'api':
    storage = TupliAPIClient()
elif STORAGE_FLAG == 'file':
    storage = FileStorage()
else:
    raise ValueError(f"Unknown storage flag: {STORAGE_FLAG}. Has to be 'api' or 'file'.")

[ ]:

# instantiate the environment
max_eps_length = 999
# data path relative to repository/tutorials
env = TimeLimit(
    CustomMountainCarEnv(render_mode=None, data_path=DATA_PATH), max_episode_steps=max_eps_length
)
# Now we can create the benchmark
tupli_env = MyTupliEnvWrapper(env, storage=storage)

[ ]:

# store the benchmark in the storage
tupli_env.store(name='mountain-car-v0', description='Mountain Car v0 benchmark')

Recording Datasets

We will now record two datasets. Dataset A contains tuples generated by a random policy. To generate Dataset B, we first train an agent online using PPO. Then, we generate trajectories using this trained policy.

[ ]:

n_tuples = 10_000

[ ]:

# load the environment and define a callback for recording metadata
is_expert = False  # for random policy
loaded_tupli_env = MyTupliEnvWrapper.load(
    storage=storage, benchmark_id=tupli_env.id, metadata_callback=MyCallback(is_expert=is_expert)
)

[ ]:

# For reproducibility when generating episodes
np.random.seed(42)
obs, info = loaded_tupli_env.reset(seed=42)

# generate n_tuples of data with a random policy
for step in range(n_tuples):
    action = np.int64(np.random.randint(low=0, high=3))
    obs, reward, done, truncated, info = loaded_tupli_env.step(action)
    if done or truncated:
        print(f'Episode finished after {step + 1} timesteps')
        obs, info = loaded_tupli_env.reset()

We pre-trained a model with PPO using the code below. Since this takes a considerable amount of time, we will simply load the pre-trained model.

[ ]:

# loaded_tupli_env.deactivate_recording()
# torch.manual_seed(42)
# model = PPO("MlpPolicy", loaded_tupli_env, verbose=1)
# model.learn(total_timesteps=500_000)

[ ]:

model = PPO.load(MODEL_PATH)

Now, we generate a dataset with the trained policy. We have to replace the callback to convey that we are now using an expert policy.

[ ]:

loaded_tupli_env.set_wrapper_attr('metadata_callback', MyCallback(is_expert=True))
loaded_tupli_env.activate_recording()

# For reproducibility when generating episodes
np.random.seed(42)
obs, info = loaded_tupli_env.reset(seed=42)

# generate n_tuples of data with the trained PPO policy
for step in range(n_tuples):
    action = model.predict(obs, deterministic=True)[0]
    obs, reward, done, truncated, info = loaded_tupli_env.step(action)
    if done or truncated:
        print(f'Episode finished after {step + 1} timesteps')
        obs, info = loaded_tupli_env.reset()

Let us load the two different datasets by filtering for the “is_expert” flag.

[ ]:

# Create dataset A
dataset_A = (
    TupliDataset(storage=storage)
    .with_benchmark_filter(FilterEQ(key='id', value=loaded_tupli_env.id))
    .with_episode_filter(FilterEQ(key='metadata.is_expert', value=False))
)
dataset_A.load()

[ ]:

# Create dataset B
dataset_B = (
    TupliDataset(storage=storage)
    .with_benchmark_filter(FilterEQ(key='id', value=loaded_tupli_env.id))
    .with_episode_filter(FilterEQ(key='metadata.is_expert', value=True))
)
dataset_B.load()

[ ]:

print(f'Dataset A - number of trajectories: {len(dataset_A.episodes)}')
print(f'Dataset B - number of trajectories: {len(dataset_B.episodes)}')

Compute Quality Metrics

We will now compute different quality metrics for the two datasets.

[ ]:

# For the average return metric, we could specify a normalization range if needed
avg_return_metric = AverageReturnMetric(gamma=0.99)
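
The gamma argument is a discount factor. As a minimal sketch of the underlying quantity, the code below computes the discounted return of each episode and averages over episodes. The function names and toy rewards are hypothetical, and whether AverageReturnMetric aggregates in exactly this way is an assumption.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma**t for its time step t."""
    total = 0.0
    # iterate backwards so that each step is a single multiply-add
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def average_discounted_return(episodes, gamma):
    """Mean discounted return over a list of per-episode reward lists."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# toy rewards for two short episodes
toy_episodes = [[-1.0, -1.0, -1.0], [-1.0, 0.0]]
avg = average_discounted_return(toy_episodes, gamma=0.5)
print(avg)
```

With gamma close to 1, as in this tutorial, rewards late in an episode still contribute noticeably to the return.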

[ ]:

# For the estimated return improvement metric, we could provide a minimum return.
# Otherwise, it is inferred from the datasets.
eri_metric = EstimatedReturnImprovementMetric(gamma=0.99)

[ ]:

# For the generalized behavioral entropy, we need to specify the representation dimensionality.
# If your observations are images, you have to provide an encoder to extract latent space representations.
gbe_metric = GeneralizedBehavioralEntropyMetric(
    rep_dim=2, alpha=0.3, num_knn=5, use_mean_normalization=True
)
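
The num_knn argument suggests that the metric relies on k-nearest-neighbor distances. As background only, the sketch below computes a generic particle-based spread proxy: the mean log-distance from each point to its k-th nearest neighbor. It is a building block for illustration, not the generalized behavioral entropy of Suttle et al. [2025] and not PyTupli's implementation; the function name and toy points are hypothetical.

```python
import math


def knn_log_distance_proxy(points, k):
    """Mean log-distance from each point to its k-th nearest neighbor."""
    log_distances = []
    for i, point in enumerate(points):
        # Euclidean distances to all other points, sorted ascending
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        # small constant avoids log(0) for duplicate points
        log_distances.append(math.log(distances[k - 1] + 1e-8))
    return sum(log_distances) / len(log_distances)


# two toy point sets in a 2D representation space
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
proxy_clustered = knn_log_distance_proxy(clustered, k=1)
proxy_spread = knn_log_distance_proxy(spread, k=1)
print(proxy_clustered, proxy_spread)
```

A larger value indicates that the points are more spread out, which is the intuition behind using neighbor distances to quantify coverage.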

[ ]:

# The SACo metric assumes discrete state and action spaces.
# Therefore, we have to provide a preprocessor that discretizes the observations.
saco_metric = SACoMetric(
    environment=loaded_tupli_env,
    observation_preprocessor=partial(
        discretize_observation,
        20,
        (loaded_tupli_env.observation_space.low, loaded_tupli_env.observation_space.high),
    ),
)
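
To illustrate what such a preprocessor might do, here is a minimal sketch of a uniform binning function with the same positional argument order as in the partial call above (number of bins, bounds, observation). It is a hypothetical stand-in; the actual discretize_observation in custom_env may work differently.

```python
from functools import partial


def discretize_observation_sketch(n_bins, bounds, observation):
    """Map each observation dimension to an integer bin index in [0, n_bins - 1]."""
    low, high = bounds
    indices = []
    for value, lo, hi in zip(observation, low, high):
        # position of the value within [lo, hi], scaled to the number of bins
        fraction = (value - lo) / (hi - lo)
        index = int(fraction * n_bins)
        # clamp so that value == hi (or an out-of-range value) maps to a valid bin
        indices.append(min(max(index, 0), n_bins - 1))
    return tuple(indices)


# toy bounds for a two-dimensional observation space
toy_bounds = ([0.0, 0.0], [1.0, 2.0])
preprocessor = partial(discretize_observation_sketch, 4, toy_bounds)
print(preprocessor((0.5, 2.0)))
```

Returning a tuple makes the discretized state hashable, so it can be counted in a set or dictionary.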

[ ]:

# Finally, we can compute the average Q value over the dataset.
# We need to define a Q-function approximator for this.
# Furthermore, we could provide an evaluation policy. For continuous actions, this would be strictly necessary.
# Without an evaluation policy, the greedy policy w.r.t. the Q-function is used.
torch.manual_seed(42)
net_arch = MLP
state_dim = loaded_tupli_env.observation_space.shape[0]
action_dim = loaded_tupli_env.action_space.n
net_kwargs = dict(in_dim=state_dim, out_dim=action_dim, hidden=(128, 128))
q_function_metric = QFunctionMetric(
    env=loaded_tupli_env,
    network_arch=net_arch,
    network_kwargs=net_kwargs,
    batch_size=256,
    iterations=40000,
)
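
As a minimal sketch of the final aggregation step, the code below averages the greedy value max_a Q(s, a) over a set of states, given Q-values that have already been estimated. The function name and toy Q-values are hypothetical, and how QFunctionMetric trains the network and aggregates its outputs internally is not shown here.

```python
def average_greedy_q_value(q_values_per_state):
    """Mean over states of max_a Q(s, a), the value under the greedy policy."""
    greedy_values = [max(q_values) for q_values in q_values_per_state]
    return sum(greedy_values) / len(greedy_values)


# toy Q-values for three states and three discrete actions
toy_q_values = [[-3.0, -1.0, -2.0], [-5.0, -4.0, -6.0], [0.0, -1.0, -2.0]]
avg_q = average_greedy_q_value(toy_q_values)
print(avg_q)
```

With an explicit evaluation policy, the maximum would be replaced by the Q-value of the action that the policy selects in each state.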

[ ]:

# Dataset A metrics
avg_return_A = avg_return_metric.evaluate(dataset_A)
print(f'Dataset A - Average Return: {avg_return_A}')
eri_A = eri_metric.evaluate(dataset_A)
print(f'Dataset A - Estimated Return Improvement: {eri_A}')
gbe_A = gbe_metric.evaluate(dataset_A)
print(f'Dataset A - Generalized Behavioral Entropy: {gbe_A}')
saco_A = saco_metric.evaluate(dataset_A)
print(f'Dataset A - SACo: {saco_A}')

[ ]:

q_value_A = q_function_metric.evaluate(dataset_A)
print(f'Dataset A - Average Q-Value: {q_value_A}')

You can see that we often get high loss values when training the Q-function on this dataset. That is most likely because we have sparse rewards, and the random policy never reaches the goal, making it difficult to estimate Q-values.

[ ]:

# Dataset B metrics
avg_return_B = avg_return_metric.evaluate(dataset_B)
print(f'Dataset B - Average Return: {avg_return_B}')
eri_B = eri_metric.evaluate(dataset_B)
print(f'Dataset B - Estimated Return Improvement: {eri_B}')
gbe_B = gbe_metric.evaluate(dataset_B)
print(f'Dataset B - Generalized Behavioral Entropy: {gbe_B}')
saco_B = saco_metric.evaluate(dataset_B)
print(f'Dataset B - SACo: {saco_B}')

[ ]:

q_value_B = q_function_metric.evaluate(dataset_B)
print(f'Dataset B - Average Q-Value: {q_value_B}')

Train Offline RL Agents

We train one RL agent per dataset and evaluate their final performance.

[ ]:

# Dataset A
obs, act, rew, terminal, truncated = dataset_A.convert_to_tensors(parser=NumpyTupleParser)
# create d3rlpy dataset
d3rlpy_dataset_A = MDPDataset(
    observations=obs, actions=act, rewards=rew, terminals=terminal, timeouts=truncated
)
# algorithm for offline training: CQL from d3rlpy
d3rlpy.seed(1)  # for reproducibility
algo_A = DiscreteCQLConfig(batch_size=64, alpha=2.0, target_update_interval=1000).create(
    device='cpu'
)
# train
algo_A.fit(dataset=d3rlpy_dataset_A, n_steps=10000, n_steps_per_epoch=100)

[ ]:

# deactivate recording of episodes
loaded_tupli_env.deactivate_recording()
# run the environment
np.random.seed(seed=42)
obs, info = loaded_tupli_env.reset(seed=42)
returns = []
undiscounted_return = 0.0
for step in range(n_tuples):
    action = np.int64(algo_A.predict(np.expand_dims(obs, axis=0))[0])
    obs, reward, done, truncated, info = loaded_tupli_env.step(action)
    undiscounted_return += reward
    if done or truncated:
        print(f'Episode finished after {step + 1} timesteps')
        returns.append(undiscounted_return)
        undiscounted_return = 0.0
        obs, info = loaded_tupli_env.reset()
# store the average return obtained by the trained agent
average_return_A = np.mean(returns)
print(f'Average undiscounted return of trained agent on dataset A: {average_return_A}')

[ ]:

# Dataset B
obs, act, rew, terminal, truncated = dataset_B.convert_to_tensors(parser=NumpyTupleParser)
# create d3rlpy dataset
d3rlpy_dataset_B = MDPDataset(
    observations=obs, actions=act, rewards=rew, terminals=terminal, timeouts=truncated
)
# algorithm for offline training: CQL from d3rlpy
d3rlpy.seed(1)  # for reproducibility
algo_B = DiscreteCQLConfig(batch_size=64, alpha=2.0, target_update_interval=1000).create(
    device='cpu'
)
# train
algo_B.fit(dataset=d3rlpy_dataset_B, n_steps=10000, n_steps_per_epoch=100)

[ ]:

# run the environment
np.random.seed(seed=42)
obs, info = loaded_tupli_env.reset(seed=42)
returns = []
undiscounted_return = 0.0
for step in range(n_tuples):
    action = np.int64(algo_B.predict(np.expand_dims(obs, axis=0))[0])
    obs, reward, done, truncated, info = loaded_tupli_env.step(action)
    undiscounted_return += reward
    if done or truncated:
        print(f'Episode finished after {step + 1} timesteps')
        returns.append(undiscounted_return)
        undiscounted_return = 0.0
        obs, info = loaded_tupli_env.reset()
# store the average return obtained by the trained agent
average_return_B = np.mean(returns)
print(f'Average undiscounted return of trained agent on dataset B: {average_return_B}')
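
The two evaluation loops above differ only in the agent that selects the action. One way to avoid the duplication is a small helper, sketched below under the assumption that the environment follows the Gymnasium reset/step interface used in this notebook. The helper name and the stand-in environment are hypothetical.

```python
def average_episode_return(env, predict, n_steps, seed=None):
    """Run `predict` in `env` for n_steps and average the returns of completed episodes."""
    obs, info = env.reset(seed=seed)
    returns, episode_return = [], 0.0
    for _ in range(n_steps):
        obs, reward, done, truncated, info = env.step(predict(obs))
        episode_return += reward
        if done or truncated:
            returns.append(episode_return)
            episode_return = 0.0
            obs, info = env.reset()
    # an unfinished final episode is dropped, as in the loops above
    if not returns:
        raise ValueError('No episode finished within n_steps.')
    return sum(returns) / len(returns)


class CountdownEnv:
    """Hypothetical stand-in environment: every episode lasts 3 steps with reward -1."""

    def reset(self, seed=None):
        self.steps_left = 3
        return 0, {}

    def step(self, action):
        self.steps_left -= 1
        return 0, -1.0, self.steps_left == 0, False, {}


result = average_episode_return(CountdownEnv(), predict=lambda obs: 0, n_steps=10)
print(result)
```

With the objects from this notebook, `predict` could be `lambda obs: np.int64(algo_A.predict(np.expand_dims(obs, axis=0))[0])`, called with `loaded_tupli_env`, `n_tuples`, and `seed=42`.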

Let us summarize the results in a table:

[ ]:

df = pd.DataFrame(
    {
        'Dataset': ['A', 'B'],
        'Metric: SACo': [saco_A, saco_B],
        'Metric: GBE': [gbe_A, gbe_B],
        'Metric: Avg Return': [avg_return_A, avg_return_B],
        'Metric: ERI': [eri_A, eri_B],
        'Metric: Avg Q-Value': [q_value_A, q_value_B],
        'Final Performance CQL': [average_return_A, average_return_B],
    }
)
# format only numeric columns
numeric_cols = [
    'Metric: Avg Return',
    'Metric: ERI',
    'Metric: GBE',
    'Metric: SACo',
    'Metric: Avg Q-Value',
    'Final Performance CQL',
]
display(df.style.format({c: '{:.2f}' for c in numeric_cols}))

This is not a comprehensive evaluation, so we cannot draw any conclusions about the predictive quality of the different metrics for offline RL performance. Nevertheless, we encourage you to experiment with this in a more complete evaluation setting where you train multiple algorithms and use more and larger datasets for the same benchmark problem!
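
In such a larger study, one common way to quantify how well a metric ranks datasets is a rank correlation between metric values and final agent performance. The sketch below implements Spearman's rho for tie-free data. The function names are hypothetical, and the numbers are placeholders chosen only to exercise the code, not measured results.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(metric_values, performances):
    """Spearman rank correlation for tie-free data."""
    n = len(metric_values)
    rank_m, rank_p = ranks(metric_values), ranks(performances)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_m, rank_p))
    return 1.0 - 6.0 * squared_diffs / (n * (n**2 - 1))


# placeholder numbers purely to exercise the function, not measured results
placeholder_metric = [0.1, 0.4, 0.2, 0.9]
placeholder_performance = [-200.0, -160.0, -150.0, -110.0]
rho = spearman_rho(placeholder_metric, placeholder_performance)
print(rho)
```

A value near 1 would mean that the metric orders the datasets almost the same way as the final performance does; with only two datasets, as in this tutorial, such a correlation is not informative.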

Deleting Benchmarks

To clean up our storage, we now delete the benchmark and all related artifacts. Episodes will automatically be deleted, too.

[ ]:

# Delete the benchmark and all remaining related artifacts
# Episodes will automatically be deleted too
print(f'Deleting benchmark: {loaded_tupli_env.id}')
loaded_tupli_env.delete(delete_artifacts=True)

print('Cleanup completed!')