Configuration and Troubleshooting

GPU Usage

Pass the device parameter to train() to run Aerial on a GPU. Because Aerial uses only a shallow Autoencoder, it also runs on a CPU without a major performance penalty.

Aerial also uses the device specified in train() for rule extraction, e.g., when performing forward passes on the trained Autoencoder with the test vectors.

from aerial import model, rule_extraction
from ucimlrepo import fetch_ucirepo

# a categorical tabular dataset
breast_cancer = fetch_ucirepo(id=14).data.features

# run Aerial on GPU
trained_autoencoder = model.train(breast_cancer, device="cuda")

# during the rule extraction stage, Aerial will continue to use the device specified above
result = rule_extraction.generate_rules(trained_autoencoder)
print(f"Mined {result['statistics']['rule_count']} rules on GPU")
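
If the same script has to run on machines with and without a GPU, the device string can be chosen at runtime. The sketch below is not part of Aerial: pick_device is a hypothetical helper that relies only on PyTorch's torch.cuda.is_available() and falls back to "cpu" when CUDA (or PyTorch itself) is not available.

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return `preferred` if it is usable, otherwise fall back to "cpu".

    Hypothetical helper for illustration; not part of Aerial.
    """
    if preferred == "cpu":
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    if preferred.startswith("cuda") and torch.cuda.is_available():
        return preferred
    return "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

The returned string can then be passed on, e.g., model.train(breast_cancer, device=device).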

Logging Configuration

Aerial's source code emits additional debug statements that mark the start and end of major steps such as training and rule extraction. The log level can be changed as follows:

import logging
import aerial

# setting the log levels to DEBUG level
aerial.setup_logging(logging.DEBUG)
...
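
The levels are the standard ones from Python's logging module. As a library-independent illustration of how the chosen level decides which messages get through, consider the sketch below. The logger name "demo" and the ListHandler class are made up for this example and say nothing about Aerial's internals.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages in a list (illustration only)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("demo")  # hypothetical logger name
logger.propagate = False            # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.setLevel(logging.INFO)
logger.debug("training started")    # dropped: DEBUG is below INFO
logger.info("training finished")    # kept

logger.setLevel(logging.DEBUG)
logger.debug("rule extraction started")  # kept now that the level is DEBUG

print(handler.messages)
```

Only the second and third messages are collected, which is why setting the level to DEBUG is needed to see the start and end notifications of each step.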

Training Parameters

The train() function allows you to customize various training parameters:

  • autoencoder: You can implement your own Autoencoder and use it for ARM as part of Aerial, as long as the last layer matches the original version (see our paper or the source code)

  • noise_factor (default=0.5): amount of random noise (±) added to each neuron of the denoising Autoencoder before the training process

  • lr (default=5e-3): learning rate

  • epochs (default=2): number of training epochs. Shorter training produces fewer, higher-quality rules

  • batch_size (default=auto): automatically determined based on dataset size

  • loss_function (default=torch.nn.BCELoss()): loss function

  • num_workers (default=1): number of workers for parallel execution
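
To make the role of noise_factor more concrete, here is a minimal sketch of how a denoising setup can perturb a one-hot encoded row. It assumes uniform noise in [-noise_factor, +noise_factor] followed by clipping to [0, 1]; these are illustrative choices, and the noise distribution Aerial uses internally may differ. add_noise is a hypothetical helper.

```python
import random


def add_noise(vector, noise_factor=0.5, rng=None):
    """Perturb each value by uniform noise in [-noise_factor, +noise_factor],
    then clip the result to [0, 1].

    Assumption: uniform noise and clipping are illustrative choices only.
    """
    rng = rng or random.Random()
    noisy = (value + rng.uniform(-noise_factor, noise_factor) for value in vector)
    return [min(1.0, max(0.0, value)) for value in noisy]


# One-hot encoding of a row with two categorical columns (3 and 2 categories).
one_hot_row = [0.0, 1.0, 0.0, 1.0, 0.0]
noisy_row = add_noise(one_hot_row, noise_factor=0.5, rng=random.Random(0))
print([round(value, 2) for value in noisy_row])
```

The Autoencoder then sees the noisy row as input and is trained to reconstruct the clean one, so a larger noise_factor makes the reconstruction task harder.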

Example:

from aerial import model, rule_extraction
from ucimlrepo import fetch_ucirepo

breast_cancer = fetch_ucirepo(id=14).data.features

# Customize training parameters
trained_autoencoder = model.train(
    breast_cancer,
    epochs=5,
    lr=1e-3,
    batch_size=4
)

result = rule_extraction.generate_rules(trained_autoencoder)
print(f"Found {result['statistics']['rule_count']} rules")

Note: Longer training may lead to overfitting, which results in rules with low association strength (Zhang’s metric). See Advanced: Training and Architecture Tuning for more details.

Debugging

The following is a step by step debugging guide for Aerial.

What to do when Aerial does not learn any rules?

The following are some recommendations for when Aerial cannot find rules, assuming that the data preparation is done correctly (e.g., the data is discretized).

  • Longer training. Increasing the number of epochs can make Aerial capture associations better. However, training for too long may lead to overfitting, which means non-informative rules with low association strength.

  • Adding more parameters. Increasing the number of layers and/or the dimensions of the layers can allow Aerial to discover associations that it could not capture with fewer parameters. This may require longer training as well.

  • Reducing antecedent similarity threshold. The antecedent similarity threshold in Aerial is analogous to the minimum support threshold in exhaustive ARM methods. Reducing it will result in more rules with potentially lower support.

  • Reducing consequent similarity threshold. The consequent similarity threshold in Aerial is analogous to the minimum confidence threshold in exhaustive ARM methods. Reducing it will result in more rules with potentially lower confidence.
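
The threshold recommendations above amount to a simple search: start strict and relax step by step until rules appear. The sketch below shows that loop in isolation. relax_until_rules and fake_mine are hypothetical names, and fake_mine is only a stand-in; with Aerial, the mine callable would wrap a call to rule_extraction.generate_rules using the threshold argument of your installed version.

```python
def relax_until_rules(mine, start=0.9, floor=0.1, step=0.1):
    """Lower a threshold step by step until `mine(threshold)` returns rules.

    `mine` is any callable that takes a threshold and returns a list of rules.
    Returns (threshold, rules), or (None, []) if nothing is found above `floor`.
    """
    threshold = start
    while threshold >= floor - 1e-9:
        current = round(threshold, 10)  # hide floating-point drift such as 0.7000000000000001
        rules = mine(current)
        if rules:
            return current, rules
        threshold -= step
    return None, []


# Stand-in miner for demonstration: pretends rules only appear at 0.6 or lower.
def fake_mine(threshold):
    return ["A -> B"] if threshold <= 0.6 else []


threshold, rules = relax_until_rules(fake_mine)
print(threshold, rules)
```

Stopping at the first threshold that yields rules keeps the result set small, which also helps with the execution time issues discussed in the next section.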

What to do when Aerial takes too much time and learns too many rules?

As with any other ARM algorithm, the input parameters can result in a huge search space during knowledge discovery, and the underlying hardware may not allow the algorithm to terminate in a reasonable time.

To overcome this, we suggest starting with a smaller search space and gradually increasing it. In the scope of Aerial, this can be done as follows:

  1. Start with max_antecedents=2 and observe the execution time and the usefulness of the rules you learn. Then gradually increase this number if the task requires it.

  2. Start with min_rule_frequency=0.5, or even higher if necessary. A high rule frequency means you first discover the most prominent patterns in the data, which are usually easier to find. This parameter is analogous to the minimum support threshold of exhaustive ARM methods such as Apriori or FP-Growth.

  3. Do not set min_rule_strength low. The rule strength is analogous to a combination of minimum confidence and Zhang’s metric thresholds. There is no reason to set this parameter low, e.g., lower than 0.5. As with min_rule_frequency, start with a high value such as 0.9 and then gradually decrease it if necessary.

  4. Train less or use fewer parameters. If Aerial does not terminate within a reasonable time, it could also mean that the model overfitted the data and is finding many non-informative rules, which increases the execution time. To prevent that, start with a smaller number of epochs and parameters. For datasets where the number of rows n is much larger than the number of columns d (n >> d), training for 2 epochs with 2 layers of decreasing dimensions per encoder and decoder is usually enough.

  5. Apply ideas from the ARM rule explosion literature. One such idea is to learn rules for items of interest rather than for all items (columns). This can be done with Aerial, as shown in the Specifying Item Constraints section.

  6. If the dataset is big and you need a deeper neural network with many parameters, use a GPU rather than a CPU. Please see the GPU Usage section for details.

Note that it is also always possible that there are no prominent patterns in the data to discover.
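
To see why max_antecedents is the first knob to turn, it helps to count how many antecedent combinations exist at all. The sketch below computes a combinatorial upper bound under simplifying assumptions: every column has the same number of categories and an antecedent uses at most one value per column. candidate_antecedents is a hypothetical helper, and the bound is not a statement about how many candidates Aerial actually evaluates.

```python
from math import comb


def candidate_antecedents(n_columns, values_per_column, max_antecedents):
    """Upper bound on antecedent itemsets with at most `max_antecedents` items.

    Simplifying assumptions: every column has the same number of categories,
    and an antecedent uses at most one value per column.
    """
    return sum(
        comb(n_columns, k) * values_per_column ** k
        for k in range(1, max_antecedents + 1)
    )


for m in (1, 2, 3, 4):
    count = candidate_antecedents(n_columns=9, values_per_column=5, max_antecedents=m)
    print(f"max_antecedents={m}: up to {count} antecedents")
```

For 9 columns with 5 categories each, the bound grows from 45 candidates at max_antecedents=1 to 945 at 2 and 11445 at 3, which is why increasing this parameter gradually is the safer route.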

What to do if Aerial produces error messages?

Please create an issue in this repository with the error message and/or send an email to e.karabulut@uva.nl.

Advanced: Training and Architecture Tuning

This section is for advanced users who want fine-grained control over Aerial’s behavior. For most use cases, the default settings and the Parameter Tuning Guide are sufficient.

Understanding Overfitting in Knowledge Discovery

Overfitting in knowledge discovery is fundamentally different from overfitting in traditional machine learning:

Traditional Machine Learning:

  • High training accuracy but low test accuracy

  • Model memorizes training data instead of learning generalizable patterns

  • Solution: Early stopping, regularization, more training data

Knowledge Discovery (Association Rule Mining):

  • More rules with lower average quality

  • Model captures spurious correlations instead of meaningful associations

  • Rules may have high support and confidence but low association strength (Zhang’s metric), or there may be too many rules with low support and confidence.

  • Solution: Shorter training, stronger compression, higher quality thresholds (in addition to early stopping, regularization, more training data). PyAerial defaults to epochs=2 for this reason.

Key Insight: In knowledge discovery, overfitting doesn’t mean poor generalization to new data—it means discovering non-informative patterns that lack genuine associations.

Signs of Overfitting in Aerial:

  • Many rules with low Zhang’s metric (association strength near 0)

  • High support and confidence but weak association strength

  • A larger number of rules with low support and confidence
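
The difference between confidence and association strength is easiest to see on a toy example. The sketch below computes support, confidence, and Zhang's metric from scratch, using the standard definition Z(A → C) = (P(A,C) − P(A)P(C)) / max(P(A,C)(1 − P(A)), P(A)(P(C) − P(A,C))). The data and the rule_metrics helper are made up for illustration and are independent of Aerial's own quality calculations.

```python
def rule_metrics(rows, antecedent, consequent):
    """Support, confidence and Zhang's metric of the rule antecedent -> consequent.

    `rows` is a list of sets of items; `antecedent` and `consequent` are sets.
    """
    n = len(rows)
    p_a = sum(antecedent <= row for row in rows) / n
    p_c = sum(consequent <= row for row in rows) / n
    p_ac = sum((antecedent | consequent) <= row for row in rows) / n
    confidence = p_ac / p_a if p_a else 0.0
    denominator = max(p_ac * (1 - p_a), p_a * (p_c - p_ac))
    zhang = (p_ac - p_a * p_c) / denominator if denominator else 0.0
    return {"support": p_ac, "confidence": confidence, "zhangs_metric": zhang}


# Toy data (made up): 20 rows over the items "a", "c" and "d".
rows = (
    [{"a", "c", "d"}] * 8   # a, c and d together
    + [{"a"}] * 2           # a alone
    + [{"c"}] * 8           # c without a
    + [set()] * 2           # none of the items
)

common = rule_metrics(rows, {"a"}, {"c"})    # c is frequent regardless of a
genuine = rule_metrics(rows, {"a"}, {"d"})   # d only ever occurs together with a
print("a -> c:", common)
print("a -> d:", genuine)
```

Both rules have a support of 0.4 and a confidence of 0.8, yet Zhang's metric is 0 for a -> c (c is just as likely without a) and 1 for a -> d. The first rule is the kind of non-informative pattern that an overfitted model tends to produce.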

Impact of Training Duration (Epochs)

Training duration has the following effects on rule quality in knowledge discovery.

Shorter Training (1-3 epochs):

  • ✅ Fewer, higher-quality rules

  • ✅ Captures strong, meaningful associations

  • ✅ Higher average Zhang’s metric (association strength)

  • ✅ Faster execution

  • ⚠️ May miss some patterns if data is complex

Longer Training (5+ epochs):

  • ⚠️ More rules but lower average quality

  • ⚠️ Captures spurious correlations and noise

  • ⚠️ Lower average Zhang’s metric

  • ❌ Overfitting to data peculiarities

Recommendation:

  • Short training (1-2 epochs; the default is 2) works well for most datasets

  • For datasets where n >> d (many rows, few columns): 2 epochs is usually sufficient

  • Only increase epochs if you’re getting no rules and suspect underfitting

  • If rules have low Zhang’s metric: reduce epochs, don’t increase

Example:

from aerial import model, rule_extraction
from ucimlrepo import fetch_ucirepo

breast_cancer = fetch_ucirepo(id=14).data.features

# Shorter training for higher quality rules
trained_autoencoder = model.train(breast_cancer, epochs=2)
result = rule_extraction.generate_rules(trained_autoencoder)

print(f"Rule count: {result['statistics']['rule_count']}")
print(f"Avg Zhang's metric: {result['statistics']['average_zhangs_metric']}")

Impact of Architecture (layer_dims and Compression)

The autoencoder’s architecture controls how aggressively it compresses information, which directly affects rule quality.

Compression Ratio: How much the autoencoder reduces dimensionality in the bottleneck layer. This is controlled by setting the last layer’s dimension in layer_dims.

layer_dims = [4, 2] means 2 hidden layers of dimensions 4 and 2.
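
As a rough way to reason about compression, the layer sizes and the compression ratio can be worked out by hand. The sketch below assumes that the input size equals the sum of the column cardinalities (one-hot encoding) and that the decoder mirrors the encoder. autoencoder_shape and the column cardinalities are hypothetical and only serve the illustration.

```python
def autoencoder_shape(column_cardinalities, layer_dims):
    """Layer sizes of a symmetric autoencoder over one-hot encoded columns.

    Assumptions for this sketch: the input size is the sum of the column
    cardinalities (one-hot encoding) and the decoder mirrors the encoder.
    """
    input_dim = sum(column_cardinalities)
    encoder = [input_dim, *layer_dims]
    decoder = encoder[::-1]
    compression_ratio = input_dim / layer_dims[-1]
    return encoder, decoder, compression_ratio


# Hypothetical table: 9 categorical columns with these numbers of categories.
cardinalities = [6, 3, 11, 7, 2, 3, 2, 5, 2]
encoder, decoder, ratio = autoencoder_shape(cardinalities, layer_dims=[4, 2])
print("encoder:", encoder)
print("decoder:", decoder)
print("compression ratio:", ratio)
```

With these numbers the 41 one-hot inputs are squeezed into a bottleneck of 2, a ratio of 20.5. Choosing layer_dims=[50, 25] for the same table would not compress at the first layer at all, which is the weak-compression setting described below.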

More Aggressive Compression (smaller layer_dims, e.g., [4, 2]):

  • ✅ Fewer, higher-quality rules

  • ✅ Forces model to preserve only essential feature relationships

  • ✅ Filters out weak or spurious associations

  • ✅ Higher average rule quality metrics

  • ⚠️ May miss some nuanced patterns

Less Aggressive Compression (larger layer_dims, e.g., [50, 25]):

  • ⚠️ More rules but lower average quality

  • ⚠️ Preserves weaker associations and noise

  • ⚠️ May capture spurious correlations

  • ✅ Can discover more nuanced patterns

Number of Layers:

  • Deeper networks (more layers) allow for more gradual compression

  • Shallower networks (fewer layers) force more aggressive compression

  • For most tabular datasets: 1-2 hidden layers per encoder/decoder is sufficient

Recommendation:

  • Let Aerial decide automatically (don’t specify layer_dims) for most use cases

  • If you get too many low-quality rules: Use smaller layer_dims for stronger compression

  • If you get no rules: Use larger layer_dims to preserve more associations

  • Rule of thumb: Bottleneck dimension should be much smaller than input dimension

Example:

from aerial import model, rule_extraction
from ucimlrepo import fetch_ucirepo

breast_cancer = fetch_ucirepo(id=14).data.features

# Aggressive compression for higher quality rules
# layer_dims=[4] means: encoder has a single hidden layer of size 4, decoder mirrors this
trained_autoencoder = model.train(breast_cancer, layer_dims=[4], epochs=2)

result = rule_extraction.generate_rules(trained_autoencoder)
print(f"Rule count: {result['statistics']['rule_count']}")
print(f"Avg Zhang's metric: {result['statistics']['average_zhangs_metric']}")

Balancing Training and Architecture

Best Practices:

  1. Start conservative: Short training (2 epochs) + moderate compression (auto or [8, 4])

  2. Evaluate quality: Check Zhang’s metric and rule count

  3. Adjust based on results:

    • Too many low-quality rules → Reduce epochs OR increase compression

    • Too few rules → Reduce compression OR slightly increase epochs

    • Low Zhang’s metric → Reduce epochs (likely overfitting)
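
The adjustment rules above can be written down as a small decision helper. This is only a sketch: suggest_adjustment is a hypothetical function, and the cut-offs min_rules, max_rules, and min_zhang are made-up defaults that you should replace with values that suit your data and task.

```python
def suggest_adjustment(rule_count, avg_zhang, min_rules=1, max_rules=500, min_zhang=0.3):
    """Map a rule count and an average Zhang's metric to one of the adjustments above.

    `min_rules`, `max_rules` and `min_zhang` are hypothetical cut-offs.
    """
    if rule_count < min_rules:
        return "reduce compression or slightly increase epochs"
    if rule_count > max_rules and avg_zhang < min_zhang:
        return "reduce epochs or increase compression"
    if avg_zhang < min_zhang:
        return "reduce epochs (likely overfitting)"
    return "keep current settings"


# Made-up statistics, shaped like the values printed in the examples above.
stats = {"rule_count": 1200, "average_zhangs_metric": 0.12}
print(suggest_adjustment(stats["rule_count"], stats["average_zhangs_metric"]))
```

Changing one setting at a time and re-checking both numbers makes it easier to tell which adjustment actually helped.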

Anti-patterns:

  • ❌ Long training (10+ epochs) with weak compression → Maximum overfitting

  • ❌ Increasing epochs when Zhang’s metric is already low

  • ❌ Using very large layer_dims on small datasets

For more details on these experiments, see this blog post on scalable knowledge discovery.

Advanced: Boosting Rule Quality with Tabular Foundation Models

For domains with limited data—such as gene expression datasets with thousands of features but only dozens of samples—traditional rule mining algorithms struggle to discover meaningful patterns. Aerial addresses this challenge by enabling transfer learning in knowledge discovery, a paradigm shift from conventional algorithmic methods like Apriori, FP-Growth, and ECLAT that inherently lack this capability.

The Challenge: High-Dimensional Small Tabular Data

In specialized domains (biomedical research, rare disease analysis, materials science), practitioners often face extreme dimensional imbalance where the number of features far exceeds available samples. Classical ARM algorithms fail in these settings because they cannot leverage knowledge from related domains. Aerial overcomes this limitation by incorporating tabular foundation models—neural networks pre-trained on diverse tabular datasets—that transfer learned representations to discover high-quality rules even from scarce data.

Transfer Learning Strategies in Aerial

Aerial supports two fine-tuning strategies to adapt foundation models for rule mining:

1. Weight Initialization (WI)

Figure: Weight Initialization Strategy

The foundation model’s pre-trained weights initialize Aerial’s autoencoder, preserving learned feature relationships while adapting to the specific rule mining task. This strategy enables the model to leverage patterns from large-scale pre-training while specializing for the target domain.

2. Projection-Guided Fine-Tuning via Double Loss (DL)

Figure: Double Loss Strategy

This strategy uses a projection encoder to align Aerial’s autoencoder reconstructions with embeddings from a tabular foundation model (e.g., TabPFN), jointly optimizing two complementary objectives:

  • Reconstruction loss (L_recon): Binary cross-entropy loss ensuring the autoencoder accurately reconstructs the original tabular data

  • Projection loss (L_proj): Cosine distance loss aligning the autoencoder’s representations with the foundation model’s meta-learned embedding space

The combined double loss function is: L(θ) = L_recon + L_proj

By optimizing both objectives simultaneously, this strategy encourages the autoencoder to not only reconstruct the input data but also produce representations that are semantically consistent with the foundation model’s learned knowledge, leading to higher-quality rules with better generalization.
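
To make the combined objective tangible, the sketch below evaluates both terms for a single row using plain Python. It is not the reference implementation from the paper: the function names and all numbers are made up, and a real setup would compute the same quantities on batches of tensors with automatic differentiation, with a projection encoder producing the projected vector.

```python
import math


def bce_loss(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)


def cosine_distance(u, v):
    """1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def double_loss(targets, reconstruction, projected, foundation_embedding):
    """L = L_recon + L_proj, with equal weights as in the formula above."""
    l_recon = bce_loss(targets, reconstruction)
    l_proj = cosine_distance(projected, foundation_embedding)
    return l_recon + l_proj, l_recon, l_proj


# Made-up numbers for a single row.
targets = [1.0, 0.0, 0.0, 1.0]            # one-hot encoded input
reconstruction = [0.9, 0.2, 0.1, 0.8]     # autoencoder output
projected = [0.5, 0.1, -0.3]              # output of the projection encoder
foundation_embedding = [0.4, 0.2, -0.1]   # embedding from the foundation model

total, l_recon, l_proj = double_loss(targets, reconstruction, projected, foundation_embedding)
print(f"L_recon={l_recon:.4f}  L_proj={l_proj:.4f}  L={total:.4f}")
```

Because the two terms are simply added, a reconstruction that is accurate but poorly aligned with the foundation model's embedding still incurs a loss, and vice versa.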

Why This Matters

Traditional algorithmic methods operate without learned representations—they mine rules directly from raw data statistics. Aerial fundamentally changes this by:

  • Leveraging pre-trained models: Enables rule discovery from small specialized datasets by transferring knowledge from foundation models

  • Enabling cross-domain transfer: Knowledge learned from diverse tabular data transfers to new domains, even with minimal samples

  • Improving rule quality: Foundation models capture semantic relationships that pure algorithmic methods miss in low-data regimes

Implementation Note

Important: PyAerial does not yet provide out-of-the-box support for tabular foundation model integration. To use these transfer learning strategies, you will need to implement them yourself by:

  1. Following the methodology described in Karabulut et al. (2025)

  2. Referring to the implementation in the paper’s companion repository

  3. Adapting Aerial’s autoencoder architecture to incorporate the weight initialization or double loss strategies

The paper provides comprehensive implementation details and the repository contains reference code for both fine-tuning approaches. Future versions of PyAerial may include built-in support for these advanced capabilities.