Troubleshooting

Common issues and their solutions when using Ajna Analytical Engine.

Installation Issues

GitHub Authentication Failed

Problem: Can’t install from private repository
ERROR: Repository not found or access denied
Solutions:
# Generate SSH key
ssh-keygen -t ed25519 -C "your_email@example.com"

# Add to SSH agent
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

# Copy public key and add to GitHub
cat ~/.ssh/id_ed25519.pub

# Test connection
ssh -T git@github.com

# Install package
pip install git+ssh://git@github.com/ajnacloud-ksj/ajna-analytical-engine-py-lib.git

Dependency Conflicts

Problem: Package version conflicts during installation
# Check for conflicts
pip check

# Create clean environment
python -m venv ajna_env
source ajna_env/bin/activate  # On Windows: ajna_env\Scripts\activate

# Install with specific versions
pip install polars==0.20.31
pip install git+ssh://git@github.com/ajnacloud-ksj/ajna-analytical-engine-py-lib.git

Query Execution Issues

Memory Errors

Problem: Out of memory errors with large datasets
# Error message
MemoryError: Unable to allocate array with shape (1000000, 100)
Solutions:
from ajna_analytical_engine.config import EngineConfig

# Configure memory limits
config = EngineConfig(
    memory_limit_mb=2048,  # Set memory limit
    streaming_chunk_size=10000,  # Process in smaller chunks
    polars_thread_pool_size=2  # Reduce thread count
)

engine = AnalyticalEngine(config=config)

# Use streaming for large datasets
request = QueryRequest(
    sources=["large_file.parquet"],
    select=["id", "amount"],  # Select only needed columns
    filters={"large_file.parquet": [
        {"column": "date", "op": ">=", "value": "2024-01-01"}  # Filter early
    ]},
    limit=10000  # Limit results
)

Slow Query Performance

Problem: Queries take too long to execute
Diagnostic Steps:
# Enable debug mode
engine = AnalyticalEngine(debug=True)

# Check query execution plan
plan = engine.explain_query(request)
print(f"Estimated cost: {plan.estimated_cost}")
print(f"Execution steps: {plan.execution_steps}")

# Monitor performance
result = engine.execute_query(request)
print(f"Execution time: {result.metadata.execution_time_ms}ms")
print(f"Rows processed: {result.metadata.rows_processed}")
print(f"Memory used: {result.metadata.memory_usage_mb}MB")
Performance Optimization:
# ✅ Good practices
request = QueryRequest(
    sources=["sales.parquet"],  # Use Parquet for better performance
    select=["id", "amount", "date"],  # Select only needed columns
    filters={"sales.parquet": [
        {"column": "date", "op": ">=", "value": "2024-01-01"}  # Filter early
    ]},
    limit=1000  # Reasonable limit
)

# ❌ Avoid these
request = QueryRequest(
    sources=["sales.csv"],  # CSV is slower than Parquet
    select=["*"],  # Selecting all columns
    limit=1000000  # Very large limit
    # No filters - processing entire dataset
)

File Not Found Errors

Problem: Cannot find data source files
# Error message
DataLoadingError: File not found: 'data/sales.csv'
Solutions:
import os
from pathlib import Path

# Check current working directory
print(f"Current directory: {os.getcwd()}")

# Check if file exists
file_path = Path("data/sales.csv")
print(f"File exists: {file_path.exists()}")

# Use absolute paths
request = QueryRequest(
    sources=[str(Path.cwd() / "data" / "sales.csv")],
    select=["*"],
    limit=10
)

# Or relative to project root
request = QueryRequest(
    sources=["./data/sales.csv"],
    select=["*"],
    limit=10
)

Data Type Issues

Schema Mismatch Errors

Problem: Data type conflicts between sources
# Error message
QueryExecutionError: Cannot join on columns with different types: int64 vs string
Solution:
# Check data types first
request = QueryRequest(
    sources=["table1.parquet"],
    select=["column_name", "typeof(column_name) as data_type"],
    limit=1
)

result = engine.execute_query(request)
print(f"Data types: {result.data}")

# Cast columns to compatible types
request = QueryRequest(
    sources=["orders.parquet", "customers.csv"],
    select=["orders.id", "customers.name"],
    joins=[{
        "left": "CAST(orders.customer_id AS STRING)",
        "right": "customers.id",
        "type": "inner"
    }]
)

Date Parsing Issues

Problem: Date columns are not recognized correctly
Solution:
# Parse dates explicitly
request = QueryRequest(
    sources=["sales.csv"],
    select=[
        "id",
        "CAST(date_column AS DATE) as sale_date",
        "amount"
    ],
    filters={"sales.csv": [
        {"column": "CAST(date_column AS DATE)", "op": ">=", "value": "2024-01-01"}
    ]}
)

Database Connection Issues

Connection String Problems

Problem: Cannot connect to database
# Error message
DataLoadingError: Connection failed: could not connect to server
Solutions:
from ajna_analytical_engine.config import ConfigManager

# Test connection string format
config = ConfigManager()

# PostgreSQL
config.add_database_connection(
    name="postgres_db",
    connection_string="postgresql://username:password@localhost:5432/database_name",
    pool_size=5,
    timeout_seconds=30
)

# MySQL
config.add_database_connection(
    name="mysql_db",
    connection_string="mysql://username:password@localhost:3306/database_name",
    pool_size=5,
    timeout_seconds=30
)

# Test connection
engine = AnalyticalEngine(config_manager=config)
status = engine.get_health_status()
print(f"Database connections: {status.database_connections}")

Permission Denied

Problem: Database access denied
# Error message
DataLoadingError: Access denied for user 'username'@'localhost'
Check database permissions:
-- For PostgreSQL
GRANT SELECT ON ALL TABLES IN SCHEMA public TO username;

-- For MySQL
GRANT SELECT ON database_name.* TO 'username'@'localhost';

Query Validation Errors

Invalid Filter Operators

Problem: Using unsupported filter operators
# Error message
QueryValidationError: Invalid operator 'contains' for column filter
Supported operators:
# Comparison operators
{"column": "age", "op": ">=", "value": 18}
{"column": "price", "op": "between", "value": [10, 100]}

# Set operations
{"column": "status", "op": "in", "value": ["active", "pending"]}
{"column": "category", "op": "not in", "value": ["test", "demo"]}

# Pattern matching
{"column": "name", "op": "like", "value": "John%"}
{"column": "email", "op": "ilike", "value": "%@gmail.com"}

# Null checks
{"column": "deleted_at", "op": "is null"}
{"column": "updated_at", "op": "is not null"}

Invalid Aggregation Functions

Problem: Using unsupported aggregation functions
# ✅ Supported functions
aggregations = [
    {"function": "sum", "column": "revenue"},
    {"function": "avg", "column": "price"},
    {"function": "count", "column": "*"},
    {"function": "min", "column": "date"},
    {"function": "max", "column": "date"},
    {"function": "stddev", "column": "amount"},
    {"function": "variance", "column": "amount"},
    {"function": "string_agg", "column": "name", "separator": ", "}
]

# ❌ Unsupported - will cause validation error
aggregations = [
    {"function": "median", "column": "price"},  # Use percentile instead
    {"function": "mode", "column": "category"}   # Not supported
]
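
For a median, the comment above points to a percentile aggregation. The exact parameter name is not documented here, so the "percentile" key below is an assumption; check the library's aggregation reference:
# Hypothetical median via a percentile aggregation
# NOTE: the "percentile" parameter name is assumed, not confirmed by these docs
aggregations = [
    {"function": "percentile", "column": "price", "percentile": 0.5}
]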

Cache Issues

Cache Not Working

Problem: Queries not using cache
# Check cache configuration
from ajna_analytical_engine.config import EngineConfig

config = EngineConfig(
    cache_enabled=True,
    cache_size_mb=512,
    cache_ttl_seconds=3600
)

engine = AnalyticalEngine(config=config)

# Check if query is cacheable
result = engine.execute_query(request)
print(f"Cache hit: {result.metadata.cache_hit}")
print(f"Query hash: {result.metadata.query_hash}")

Cache Memory Issues

Problem: Cache using too much memory
# Reduce cache size
config = EngineConfig(
    cache_enabled=True,
    cache_size_mb=256,  # Reduced from default
    cache_ttl_seconds=1800  # Shorter TTL
)

# Or disable cache for memory-constrained environments
config = EngineConfig(cache_enabled=False)

Common Error Messages

"Table not found"

# Check available tables
config = ConfigManager()
config.add_database_connection(
    name="db",
    connection_string="postgresql://user:pass@localhost:5432/mydb"
)

engine = AnalyticalEngine(config_manager=config)

# List available tables
status = engine.get_health_status()
print(f"Available tables: {status.available_tables}")

"Column does not exist"

# Check table schema
request = QueryRequest(
    sources=["table_name"],
    select=["*"],
    limit=1
)

result = engine.execute_query(request)
print(f"Available columns: {list(result.schema.keys())}")

"Cannot resolve column reference”

# Use proper table prefixes in joins
request = QueryRequest(
    sources=["orders.parquet", "customers.csv"],
    select=[
        "orders.id",
        "orders.amount",
        "customers.name"  # Prefix with table name
    ],
    joins=[{
        "left": "orders.customer_id",
        "right": "customers.id",
        "type": "inner"
    }]
)

Performance Debugging

Enable Detailed Logging

import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("ajna_analytical_engine")
logger.setLevel(logging.DEBUG)

# Run query with debug output
engine = AnalyticalEngine(debug=True)
result = engine.execute_query(request)

Memory Profiling

import psutil
import os

def monitor_memory():
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Memory usage: {memory_mb:.1f} MB")

# Monitor memory during query execution
monitor_memory()
result = engine.execute_query(request)
monitor_memory()

Query Optimization Checklist

  • ✅ Use Parquet files instead of CSV when possible
  • ✅ Select only columns you need
  • ✅ Apply filters early in the query
  • ✅ Use appropriate data types
  • ✅ Set reasonable limits
  • ✅ Enable caching for repeated queries
  • ✅ Use database indexes for join columns
  • ✅ Consider partitioning large datasets
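
A minimal sketch that combines several of these items, using the EngineConfig and QueryRequest interfaces shown earlier (file and column names are illustrative):
# Enable caching with a modest footprint
config = EngineConfig(
    cache_enabled=True,
    cache_size_mb=512,
    cache_ttl_seconds=3600
)
engine = AnalyticalEngine(config=config)

# Parquet source, narrow column selection, early filter, reasonable limit
request = QueryRequest(
    sources=["sales.parquet"],
    select=["id", "amount", "date"],
    filters={"sales.parquet": [
        {"column": "date", "op": ">=", "value": "2024-01-01"}
    ]},
    limit=1000
)

result = engine.execute_query(request)
print(f"Execution time: {result.metadata.execution_time_ms}ms")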

Getting Help

If you’re still having issues:
  1. Check the error message carefully - it usually contains the specific problem
  2. Enable debug mode to get more detailed information
  3. Check your data types - many issues stem from type mismatches
  4. Start with a simple query and add complexity gradually
  5. Verify your data sources exist and are accessible
For complex issues, create a minimal reproducible example to help diagnose the problem.
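
A minimal reproducible example can be as small as one tiny data file, one query, and the resulting output or error. A sketch (assuming a small sample.csv that still triggers the problem):
# minimal_repro.py - the smallest script that still shows the issue
from ajna_analytical_engine.config import EngineConfig

config = EngineConfig(cache_enabled=False)  # rule out caching effects
engine = AnalyticalEngine(config=config)

request = QueryRequest(
    sources=["sample.csv"],   # small file that reproduces the issue
    select=["id", "amount"],
    limit=10
)

result = engine.execute_query(request)
print(result.data)
print(result.metadata)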