# Configuration and Troubleshooting ## GPU Usage The `device` parameter in `train()` can be used to run Aerial on GPU. Note that Aerial only uses a shallow Autoencoder and therefore can also run on CPU without a major performance hindrance. Furthermore, Aerial will also use the device specified in `train()` function for rule extraction, e.g., when performing forward runs on the trained Autoencoder with the test vectors. ```python from aerial import model, rule_extraction from ucimlrepo import fetch_ucirepo # a categorical tabular dataset breast_cancer = fetch_ucirepo(id=14).data.features # run Aerial on GPU trained_autoencoder = model.train(breast_cancer, device="cuda") # during the rule extraction stage, Aerial will continue to use the device specified above result = rule_extraction.generate_rules(trained_autoencoder) print(f"Mined {result['statistics']['rule_count']} rules on GPU") ``` ## Logging Configuration Aerial source code prints extra debug statements notifying the beginning and ending of major functions such as the training process or rule extraction. The log levels can be changed as follows: ```python import logging import aerial # setting the log levels to DEBUG level aerial.setup_logging(logging.DEBUG) ... ``` ## Training Parameters The `train()` function allows you to customize various training parameters: - `autoencoder`: You can implement your own Autoencoder and use it for ARM as part of Aerial, as long as the last layer matches the original version (see our paper or the source code) - `noise_factor` (default=0.5): amount of random noise (`+-`) added to each neuron of the denoising Autoencoder before the training process - `lr` (default=5e-3): learning rate - `epochs` (default=2): number of training epochs. Shorter training produces fewer, higher-quality rules - `batch_size` (default=auto): automatically determined based on dataset size - `loss_function` (default=torch.nn.BCELoss()): loss function - `num_workers` (default=1): number of workers for parallel execution **Example:** ```python from aerial import model, rule_extraction from ucimlrepo import fetch_ucirepo breast_cancer = fetch_ucirepo(id=14).data.features # Customize training parameters trained_autoencoder = model.train( breast_cancer, epochs=5, lr=1e-3, batch_size=4 ) result = rule_extraction.generate_rules(trained_autoencoder) print(f"Found {result['statistics']['rule_count']} rules") ``` **Note:** Longer training may lead to overfitting, which results in rules with low association strength (Zhang's metric). See [Advanced: Training and Architecture Tuning](#advanced-training-and-architecture-tuning) for more details. ## Debugging The following is a step by step debugging guide for Aerial. ### What to do when Aerial does not learn any rules? Following are some recommendations when Aerial can not find rules, assuming that the data preparation is done correctly (e.g., the data is discretized). - **Longer training.** Increasing the number of epochs can make Aerial capture associations better. However, training for too long may lead to overfitting, which means non-informative rules with low association strength. - **Adding more parameters.** Increasing the number of layers and/or dimension of the layers can again allow Aerial to discover associations that was not possible with lower number of parameters. This may require training longer as well. - **Reducing antecedent similarity threshold.** Antecedent similarity threshold in Aerial is synonymous to minimum support threshold in exhaustive ARM methods. Reducing antecedent similarity threshold will result in more rules with potentially lower support. - **Reducing consequent similarity threshold.** Consequent similarity threshold of Aerial is synonymous to minimum confidence threshold in exhaustive ARM methods. Reducing this threshold will result in more rules with potentially lower confidence. ### What to do when Aerial takes too much time and learns too many rules? Similar to any other ARM algorithm, when performing knowledge discovery by learning rules, it could be the case that the input parameters of the algorithm results in a huge search space and that the underlying hardware does not allow terminating in a reasonable time. To overcome this, we suggest starting with smaller search spaces and gradually increasing. In the scope of Aerial, this can be done as follows: 1. Start with `max_antecedents=2`, observe the execution time and usefulness of the rules you learned. Then gradually increase this number if necessary for the task you want to achieve. 2. Start with `min_rule_frequency=0.5`, or even higher if necessary. A high rule frequency means you start discovering the most prominent patterns in the data first, that are usually easier to discover. This parameter is analogous to the minimum support threshold of exhaustive ARM methods such as Apriori or FP-Growth. 3. Do not set low `min_rule_strength`. The rule strength is analogous to a combination of minimum confidence and zhang's metric thresholds. There is no reason to set this parameter low, e.g., lower than 0.5. Similar to `min_rule_frequency`, start with a high number such as `0.9` and then gradually decrease if necessary. 4. Train less or use less parameters. If Aerial does not terminate for an unreasonable duration, it could also mean that the model over-fitted the data and is finding many non-informative rules which increase the execution time. To prevent that, start with smaller number of epochs and parameters. For datasets where the number of rows `n` is much bigger than the number columns `d`, such that `n >> d`, usually training for 2 epochs with 2 layers of decreasing dimensions per encoder and decoder is enough. 5. Another alternative is to apply ideas from the ARM rule explosion literature. One of the ideas is to learn rules for items of interest rather than all items (columns). This can be done with Aerial as it is exemplified in [Specifying Item Constraints](user_guide.md#2-specifying-item-constraints) section. 6. If the dataset is big and you needed to create a deeper neural network with many parameters, use GPU rather than a CPU. Please see the [GPU Usage](#gpu-usage) section for details. Note that it is also always possible that there are no prominent patterns in the data to discover. ### What to do if Aerial produces error messages? Please create an issue in this repository with the error message and/or send an email to e.karabulut@uva.nl. ## Advanced: Training and Architecture Tuning This section is for advanced users who want fine-grained control over Aerial's behavior. For most use cases, the default settings and the [Parameter Tuning Guide](parameter_guide.md) are sufficient. ### Understanding Overfitting in Knowledge Discovery Overfitting in knowledge discovery is fundamentally different from overfitting in traditional machine learning: **Traditional Machine Learning:** - High training accuracy but low test accuracy - Model memorizes training data instead of learning generalizable patterns - Solution: Early stopping, regularization, more training data **Knowledge Discovery (Association Rule Mining):** - **More rules with lower average quality** - Model captures spurious correlations instead of meaningful associations - Rules may have high support and confidence but **low association strength (Zhang's metric)** or **too many low support and confidence rules**. - **Solution**: Shorter training, stronger compression, higher quality thresholds (in addition to early stopping, regularization, more training data). PyAerial defaults to epochs=2 for this reason. **Key Insight:** In knowledge discovery, overfitting doesn't mean poor generalization to new data—it means discovering non-informative patterns that lack genuine associations. **Signs of Overfitting in Aerial:** - Many rules with low Zhang's metric (association strength near 0) - High support and confidence but weak correlations (association strength) - or higher number of rules with low support and confidence ### Impact of Training Duration (Epochs) Training duration has the following effects on rule quality in knowledge discovery. **Shorter Training (1-3 epochs):** - ✅ Fewer, higher-quality rules - ✅ Captures strong, meaningful associations - ✅ Higher average Zhang's metric (association strength) - ✅ Faster execution - ⚠️ May miss some patterns if data is complex **Longer Training (5+ epochs):** - ⚠️ More rules but lower average quality - ⚠️ Captures spurious correlations and noise - ⚠️ Lower average Zhang's metric - ❌ Overfitting to data peculiarities **Recommendation:** - **Default (1-2 epochs)** works well for most datasets - For datasets where `n >> d` (many rows, few columns): 2 epochs is usually sufficient - Only increase epochs if you're getting no rules and suspect underfitting - If rules have low Zhang's metric: **reduce epochs**, don't increase **Example:** ```python from aerial import model, rule_extraction from ucimlrepo import fetch_ucirepo breast_cancer = fetch_ucirepo(id=14).data.features # Shorter training for higher quality rules trained_autoencoder = model.train(breast_cancer, epochs=2) result = rule_extraction.generate_rules(trained_autoencoder) print(f"Rule count: {result['statistics']['rule_count']}") print(f"Avg Zhang's metric: {result['statistics']['average_zhangs_metric']}") ``` ### Impact of Architecture (layer_dims and Compression) The autoencoder's architecture controls how aggressively it compresses information, which directly affects rule quality. **Compression Ratio:** How much the autoencoder reduces dimensionality in the bottleneck layer. This is controlled by setting the last layer's dimension in `layer_dims`. `layer_dims = [4, 2]` means 2 hidden layers of dimensions 4 and 2. **More Aggressive Compression** (smaller `layer_dims`, e.g., `[4, 2]`): - ✅ **Fewer, higher-quality rules** - ✅ Forces model to preserve only essential feature relationships - ✅ Filters out weak or spurious associations - ✅ Higher average rule quality metrics - ⚠️ May miss some nuanced patterns **Less Aggressive Compression** (larger `layer_dims`, e.g., `[50, 25]`): - ⚠️ More rules but lower average quality - ⚠️ Preserves weaker associations and noise - ⚠️ May capture spurious correlations - ✅ Can discover more nuanced patterns **Number of Layers:** - Deeper networks (more layers) allow for more gradual compression - Shallower networks (fewer layers) force more aggressive compression - For most tabular datasets: 1-2 hidden layers per encoder/decoder is sufficient **Recommendation:** - **Let Aerial decide automatically** (don't specify `layer_dims`) for most use cases - If you get too many low-quality rules: Use **smaller layer_dims** for stronger compression - If you get no rules: Use **larger layer_dims** to preserve more associations - Rule of thumb: Bottleneck dimension should be much smaller than input dimension **Example:** ```python from aerial import model, rule_extraction from ucimlrepo import fetch_ucirepo breast_cancer = fetch_ucirepo(id=14).data.features # Aggressive compression for higher quality rules # layer_dims=[4] means: encoder has a single hidden layer of size 4, decoder mirrors this trained_autoencoder = model.train(breast_cancer, layer_dims=[4], epochs=2) result = rule_extraction.generate_rules(trained_autoencoder) print(f"Rule count: {result['statistics']['rule_count']}") print(f"Avg Zhang's metric: {result['statistics']['average_zhangs_metric']}") ``` ### Balancing Training and Architecture **Best Practices:** 1. **Start conservative**: Short training (2 epochs) + moderate compression (auto or `[8, 4]`) 2. **Evaluate quality**: Check Zhang's metric and rule count 3. **Adjust based on results**: - Too many low-quality rules → Reduce epochs OR increase compression - Too few rules → Reduce compression OR slightly increase epochs - Low Zhang's metric → Reduce epochs (likely overfitting) **Anti-patterns:** - ❌ Long training (10+ epochs) with weak compression → Maximum overfitting - ❌ Increasing epochs when Zhang's metric is already low - ❌ Using very large layer_dims on small datasets For more details on these experiments, see this [blog post on scalable knowledge discovery](https://erkankarabulut.github.io/blog/uva-dsc-seminar-scalable-knowledge-discovery/). ## Advanced: Boosting Rule Quality with Tabular Foundation Models For domains with limited data—such as gene expression datasets with thousands of features but only dozens of samples—traditional rule mining algorithms struggle to discover meaningful patterns. Aerial addresses this challenge by enabling **transfer learning in knowledge discovery**, a paradigm shift from conventional algorithmic methods like Apriori, FP-Growth, and ECLAT that inherently lack this capability. ### The Challenge: High-Dimensional Small Tabular Data In specialized domains (biomedical research, rare disease analysis, materials science), practitioners often face extreme dimensional imbalance where the number of features far exceeds available samples. Classical ARM algorithms fail in these settings because they cannot leverage knowledge from related domains. Aerial overcomes this limitation by incorporating tabular foundation models—neural networks pre-trained on diverse tabular datasets—that transfer learned representations to discover high-quality rules even from scarce data. ### Transfer Learning Strategies in Aerial Aerial supports two fine-tuning strategies to adapt foundation models for rule mining: #### 1. Weight Initialization (WI) Weight Initialization Strategy The foundation model's pre-trained weights initialize Aerial's autoencoder, preserving learned feature relationships while adapting to the specific rule mining task. This strategy enables the model to leverage patterns from large-scale pre-training while specializing for the target domain. #### 2. Projection-Guided Fine-Tuning via Double Loss (DL) Double Loss Strategy This strategy uses a projection encoder to align Aerial's autoencoder reconstructions with embeddings from a tabular foundation model (e.g., TabPFN), jointly optimizing two complementary objectives: - **Reconstruction loss (L_recon)**: Binary cross-entropy loss ensuring the autoencoder accurately reconstructs the original tabular data - **Projection loss (L_proj)**: Cosine distance loss aligning the autoencoder's representations with the foundation model's meta-learned embedding space The combined double loss function is: **L(θ) = L_recon + L_proj** By optimizing both objectives simultaneously, this strategy encourages the autoencoder to not only reconstruct the input data but also produce representations that are semantically consistent with the foundation model's learned knowledge, leading to higher-quality rules with better generalization. ### Why This Matters Traditional algorithmic methods operate without learned representations—they mine rules directly from raw data statistics. Aerial fundamentally changes this by: - **Leveraging pre-trained models**: Enables rule discovery from small specialized datasets by transferring knowledge from foundation models - **Enabling cross-domain transfer**: Knowledge learned from diverse tabular data transfers to new domains, even with minimal samples - **Improving rule quality**: Foundation models capture semantic relationships that pure algorithmic methods miss in low-data regimes ### Implementation Note **Important:** PyAerial does not yet provide out-of-the-box support for tabular foundation model integration. To use these transfer learning strategies, you will need to implement them yourself by: 1. Following the methodology described in [Karabulut et al. (2025)](https://arxiv.org/pdf/2509.20113) 2. Referring to the implementation in the [paper's companion repository](https://github.com/DiTEC-project/rule_learning_high_dimensional_small_tabular_data) 3. Adapting Aerial's autoencoder architecture to incorporate the weight initialization or double loss strategies The paper provides comprehensive implementation details and the repository contains reference code for both fine-tuning approaches. Future versions of PyAerial may include built-in support for these advanced capabilities.