Core Concepts

This chapter provides a comprehensive overview of the theoretical foundations and key concepts underlying GeneExpressionProgramming.jl. Understanding these concepts will help you make informed decisions about parameter settings, interpret results effectively, and apply the package to complex real-world problems.

Gene Expression Programming Fundamentals

Historical Context and Evolution

Gene Expression Programming (GEP) was introduced by Cândida Ferreira in 2001 as an evolutionary algorithm that combines the advantages of both genetic algorithms and genetic programming [1]. GEP addresses several limitations of traditional genetic programming, particularly the issues of closure and structural complexity that can hinder the evolution of mathematical expressions.

The key innovation of GEP lies in its unique representation scheme that separates the genotype (linear chromosome) from the phenotype (expression tree). This separation allows for more efficient genetic operations while maintaining the expressiveness needed for symbolic regression tasks. Unlike genetic programming, which directly manipulates tree structures, GEP operates on fixed-length linear chromosomes that are then translated into expression trees of varying sizes and shapes.

Genotype-Phenotype Mapping

The fundamental concept in GEP is the translation from linear chromosomes (genotype) to expression trees (phenotype). This process involves several key components:

Chromosomes: In GEP, a chromosome consists of one or more genes, each representing a sub-expression. Each gene has a fixed length and is divided into a head and a tail. The head can contain both functions and terminals, while the tail contains only terminals.

Head and Tail Structure: The head length (h) determines the maximum complexity of the expression, while the tail length is calculated as t = h × (n-1) + 1, where n is the maximum arity of the functions in the function set. This ensures that any expression can be properly formed regardless of the arrangement of functions and terminals.

Translation Process: The linear chromosome is read from left to right, and the expression tree is constructed using a breadth-first approach. Functions require operands according to their arity, and the translation continues until all function nodes have been satisfied.

Expression Trees and Evaluation

Once translated from the linear chromosome, the expression tree represents the mathematical formula that will be evaluated. The tree structure follows standard mathematical conventions:

Internal nodes represent functions (operators)
Leaf nodes represent terminals (variables or constants)
Evaluation proceeds from the leaves to the root using standard mathematical precedence

The expression tree evaluation in GeneExpressionProgramming.jl is optimized using DynamicExpressions.jl, which provides fast compilation and evaluation of symbolic expressions. This optimization is crucial for the performance of the evolutionary algorithm, as fitness evaluation typically dominates the computational cost.

Evolutionary Operators

Selection Mechanisms

GeneExpressionProgramming.jl implements several selection mechanisms to choose parents for reproduction:

Tournament Selection: The default selection method where k individuals are randomly chosen from the population, and the best among them is selected as a parent. Tournament size affects selection pressure - larger tournaments increase pressure toward fitter individuals.

NSGA-II Selection: For multi-objective optimization, the package implements the Non-dominated Sorting Genetic Algorithm II (NSGA-II), which maintains diversity while progressing toward the Pareto front. This is particularly useful when balancing accuracy against expression complexity.

Roulette Wheel Selection: Probabilistic selection based on fitness proportions, though this is less commonly used due to potential issues with fitness scaling.

Genetic Operators

Mutation: GEP uses several mutation operators that work directly on the linear chromosome:

Point Mutation: Randomly changes individual symbols in the chromosome
Inversion: Reverses a sequence of symbols within a gene
IS Transposition: Inserts a sequence from one location to another
RIS Transposition: Similar to IS but with root insertion
Gene Transposition: Moves entire genes within the chromosome

Crossover: Recombination operators that combine genetic material from two parents:

One-Point Crossover: Exchanges genetic material at a single crossover point
Two-Point Crossover: Uses two crossover points to exchange a middle segment
Gene Crossover: Exchanges entire genes between parents
Uniform Crossover: Exchanges individual symbols with a certain probability

Recombination: Higher-level operators that combine sub-expressions:

One-Point Recombination: Exchanges sub-trees at randomly chosen points
Two-Point Recombination: Exchanges sub-trees between two points

Population Dynamics

The evolutionary process in GEP follows a generational model where the entire population is replaced in each generation, except for elite individuals that are preserved through elitism. The population dynamics are controlled by several parameters:

Population Size: Determines the number of individuals in each generation. Larger populations provide better exploration but require more computational resources.

Generation Gap: The proportion of the population replaced in each generation. A generation gap of 1.0 means the entire population (except elites) is replaced.

Elitism: The number of best individuals preserved from one generation to the next, ensuring that good solutions are not lost during evolution.

Multi-Objective Optimization

Pareto Optimality

In many real-world applications, there are multiple conflicting objectives that need to be optimized simultaneously. For symbolic regression, common objectives include:

Accuracy: Minimizing prediction error (MSE, MAE, etc.)
Complexity: Minimizing expression size or depth
Interpretability: Favoring simpler, more understandable expressions

GeneExpressionProgramming.jl implements multi-objective optimization using NSGA-II, which maintains a diverse set of solutions along the Pareto front. A solution is Pareto optimal if no other solution exists that is better in all objectives simultaneously.

Crowding Distance

To maintain diversity in the population, NSGA-II uses crowding distance as a secondary selection criterion. Individuals in less crowded regions of the objective space are preferred, encouraging the algorithm to explore the entire Pareto front rather than converging to a single region.

Objective Function Design

When designing multi-objective problems, consider:

Objective Scaling: Ensure objectives are on comparable scales
Objective Correlation: Highly correlated objectives may not provide additional information
Objective Conflict: Objectives should represent genuine trade-offs

Physical Dimensionality and Semantic Backpropagation

Dimensional Analysis

One of the unique features of GeneExpressionProgramming.jl is its support for physical dimensionality constraints through semantic backpropagation [2]. This feature ensures that evolved expressions respect physical units and dimensional homogeneity.

Dimensional Representation: Physical dimensions are represented as vectors where each component corresponds to a fundamental unit (mass, length, time, electric current, temperature, amount of substance, luminous intensity).

Forward Propagation: During expression evaluation, dimensions are propagated forward through the expression tree according to the rules of dimensional analysis:

Addition/Subtraction: Operands must have the same dimensions
Multiplication: Dimensions are added
Division: Dimensions are subtracted
Power: Dimensions are multiplied by the exponent

Backward Propagation: When dimensional inconsistencies are detected, the algorithm can backpropagate corrections to ensure dimensional validity.

Semantic Constraints

Semantic backpropagation goes beyond simple dimensional analysis to enforce more complex semantic constraints:

Type Consistency: Ensuring that operations are semantically meaningful (e.g., not taking the logarithm of a dimensioned quantity)

Domain Constraints: Restricting function domains (e.g., ensuring arguments to sqrt are non-negative)

Physical Plausibility: Enforcing constraints based on physical laws and principles

Tensor Operations and Advanced Data Types

Tensor Regression

For problems involving vector or matrix data, GeneExpressionProgramming.jl provides tensor regression capabilities through the GepTensorRegressor. This functionality is built on top of Flux.jl and supports:

Vector Operations: Element-wise operations, dot products, cross products, and norms

Matrix Operations: Matrix multiplication, transposition, determinants, and eigenvalue operations

Tensor Contractions: General tensor operations for higher-dimensional data

Performance Considerations

Tensor operations are computationally more expensive than scalar operations, so several optimizations are implemented:

Lazy Evaluation: Operations are only computed when needed

Batch Processing: Multiple evaluations are batched together for efficiency

GPU Acceleration: not yet implemented

Loss Functions and Fitness Evaluation

Standard Loss Functions

GeneExpressionProgramming.jl provides various regression losses:

Mean Squared Error (MSE): Standard for continuous regression
Mean Absolute Error (MAE): Robust to outliers
Root Mean Squared Error (RMSE): Scale-dependent alternative to MSE
Huber Loss: Combines MSE and MAE properties

Custom Loss Functions

For specialized applications, you can define custom loss functions:

function custom_loss(y_true, y_pred)
    # Your custom loss calculation
    return loss_value
end

Moreover there is a more general wrapper function, where the user can define the evalulation strategy complete independent.

Fitness Evaluation Strategies

Single-Objective: Direct optimization of a single loss function

Multi-Objective: Simultaneous optimization of multiple objectives using Pareto dominance

Lexicographic: Hierarchical optimization where objectives are prioritized

Weighted Sum: Linear combination of multiple objectives with user-defined weights

Algorithm Parameters and Tuning

Population Parameters

Population Size: Affects exploration vs. exploitation trade-off

Small populations (50-200): Fast convergence, limited exploration
Medium populations (200-1000): Balanced performance
Large populations (1000+): Extensive exploration, slower convergence

Number of Generations: Determines total computational budget

Early stopping can be used to prevent overfitting
Monitor convergence to determine appropriate stopping criteria

Genetic Parameters

Mutation Rate: Controls exploration intensity

Low rates (0.01-0.05): Conservative exploration
Medium rates (0.05-0.15): Balanced exploration
High rates (0.15+): Aggressive exploration, risk of disruption

Crossover Rate: Controls exploitation of existing solutions

Typically set between 0.6-0.9 for good performance
Higher rates emphasize recombination over mutation

Expression Parameters

Head Length: Determines maximum expression complexity

Shorter heads: Simpler expressions, faster evaluation
Longer heads: More complex expressions, higher computational cost

Gene Count: Number of sub-expressions per chromosome

Single genes: Simple expressions
Multiple genes: Complex, modular expressions (e.g different terms within the problem considered)

Function Set: Available operations for expression construction

Basic arithmetic: +, -, *, /
Transcendental: sin, cos, exp, log
Specialized: domain-specific functions

Performance Optimization

Computational Efficiency

Expression Compilation: DynamicExpressions.jl compiles expressions for fast evaluation

Parallel Evaluation: Population evaluation can be parallelized across multiple cores

Memory Management: Efficient memory usage for large populations and long chromosomes

Caching: Intermediate results are cached to avoid redundant computations (cache blow up might remain an issue)

Scalability Considerations

Problem Size: Algorithm complexity scales with:

Population size
Expression complexity
Number of generations
Dataset size

Memory Requirements: Dominated by population storage and expression evaluation

Parallel Processing: Multiple levels of parallelization available:

Population-level: Evaluate individuals in parallel
Expression-level: Parallel evaluation of sub-expressions
Data-level: Parallel evaluation across data samples

Theoretical Foundations

Convergence Properties

GEP belongs to the class of evolutionary algorithms with the following theoretical properties:

Schema Theorem: Building blocks (schemas) with above-average fitness tend to receive exponentially increasing numbers of trials in subsequent generations [3]

No Free Lunch Theorem: No single algorithm performs best on all possible problems, emphasizing the importance of problem-specific tuning [4]

Exploration vs. Exploitation: The algorithm must balance exploring new regions of the search space with exploiting known good solutions

Search Space Characteristics

Representation Space: The space of all possible linear chromosomes

Phenotype Space: The space of all possible expression trees

Fitness Landscape: The mapping from expressions to fitness values, which can be:

Smooth: Gradual fitness changes with small modifications
Rugged: Many local optima and fitness discontinuities
Neutral: Large regions with similar fitness values

Complexity Analysis

Time Complexity: O(G × P × L × N) where:

G: Number of generations
P: Population size (the one evaluated wihtin each generation)
L: Average expression complexity (traversing the expression tree)
N: Dataset size

Space Complexity: O(P × L) for population storage plus O(N) for dataset storage (assuming the )

Advanced Topics

Hybrid Approaches

GeneExpressionProgramming.jl can be combined with other optimization techniques:

Local Search: Gradient-based optimization of numerical constants

Ensemble Methods: Combining multiple evolved expressions

Transfer Learning: Using knowledge from related problems

Domain-Specific Extensions

Time Series: Specialized operators for temporal data

Image Processing: Operators for spatial data and convolution

Signal Processing: Frequency domain operations and filtering

Scientific Computing: Domain-specific function sets for physics, chemistry, biology

References

[1] Ferreira, C. (2001). Gene Expression Programming: a New Adaptive Algorithm for Solving Problems. Complex Systems, 13.

[2] Reissmann, M., Fang, Y., Ooi, A. S. H., & Sandberg, R. D. (2025). Constraining genetic symbolic regression via semantic backpropagation. Genetic Programming and Evolvable Machines, 26(1), 12.

[3] Holland, J. H. (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press.

[4] Wolpert, D. H., & Macready, W. G. (1997). No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

This chapter provides the theoretical foundation for understanding GeneExpressionProgramming.jl. For practical applications, continue to the API Reference.