Troubleshooting

Common issues and their solutions when using Ajna Analytical Engine.

Installation Issues

GitHub Authentication Failed

Problem: Can’t install from private repository
ERROR: Repository not found or access denied
Solutions:
# Generate SSH key
ssh-keygen -t ed25519 -C "your_email@example.com"

# Add to SSH agent
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

# Copy public key and add to GitHub
cat ~/.ssh/id_ed25519.pub

# Test connection
ssh -T git@github.com

# Install package
pip install git+ssh://git@github.com/ajnacloud-ksj/ajna-analytical-engine-py-lib.git

Dependency Conflicts

Problem: Package version conflicts during installation
# Check for conflicts
pip check

# Create clean environment
python -m venv ajna_env
source ajna_env/bin/activate  # On Windows: ajna_env\Scripts\activate

# Install with specific versions
pip install polars==0.20.31
pip install git+ssh://git@github.com/ajnacloud-ksj/ajna-analytical-engine-py-lib.git

Query Execution Issues

Memory Errors

Problem: Out of memory errors with large datasets
# Error message
MemoryError: Unable to allocate array with shape (1000000, 100)
Solutions:
from ajna_analytical_engine.config import EngineConfig

# Configure memory limits
config = EngineConfig(
    memory_limit_mb=2048,  # Set memory limit
    streaming_chunk_size=10000,  # Process in smaller chunks
    polars_thread_pool_size=2  # Reduce thread count
)

engine = AnalyticalEngine(config=config)

# Use streaming for large datasets
request = QueryRequest(
    sources=["large_file.parquet"],
    select=["id", "amount"],  # Select only needed columns
    filters={"large_file.parquet": [
        {"column": "date", "op": ">=", "value": "2024-01-01"}  # Filter early
    ]},
    limit=10000  # Limit results
)

Slow Query Performance

Problem: Queries take too long to execute
Diagnostic Steps:
# Enable debug mode
engine = AnalyticalEngine(debug=True)

# Check query execution plan
plan = engine.explain_query(request)
print(f"Estimated cost: {plan.estimated_cost}")
print(f"Execution steps: {plan.execution_steps}")

# Monitor performance
result = engine.execute_query(request)
print(f"Execution time: {result.metadata.execution_time_ms}ms")
print(f"Rows processed: {result.metadata.rows_processed}")
print(f"Memory used: {result.metadata.memory_usage_mb}MB")
Performance Optimization:
# ✅ Good practices
request = QueryRequest(
    sources=["sales.parquet"],  # Use Parquet for better performance
    select=["id", "amount", "date"],  # Select only needed columns
    filters={"sales.parquet": [
        {"column": "date", "op": ">=", "value": "2024-01-01"}  # Filter early
    ]},
    limit=1000  # Reasonable limit
)

# ❌ Avoid these
request = QueryRequest(
    sources=["sales.csv"],  # CSV is slower than Parquet
    select=["*"],  # Selecting all columns
    limit=1000000  # Very large limit
    # No filters - processing entire dataset
)

File Not Found Errors

Problem: Cannot find data source files
# Error message
DataLoadingError: File not found: 'data/sales.csv'
Solutions:
import os
from pathlib import Path

# Check current working directory
print(f"Current directory: {os.getcwd()}")

# Check if file exists
file_path = Path("data/sales.csv")
print(f"File exists: {file_path.exists()}")

# Use absolute paths
request = QueryRequest(
    sources=[str(Path.cwd() / "data" / "sales.csv")],
    select=["*"],
    limit=10
)

# Or relative to project root
request = QueryRequest(
    sources=["./data/sales.csv"],
    select=["*"],
    limit=10
)

Data Type Issues

Schema Mismatch Errors

Problem: Data type conflicts between sources
# Error message
QueryExecutionError: Cannot join on columns with different types: int64 vs string
Solution:
# Check data types first
request = QueryRequest(
    sources=["table1.parquet"],
    select=["column_name", "typeof(column_name) as data_type"],
    limit=1
)

result = engine.execute_query(request)
print(f"Data types: {result.data}")

# Cast columns to compatible types
request = QueryRequest(
    sources=["orders.parquet", "customers.csv"],
    select=["orders.id", "customers.name"],
    joins=[{
        "left": "CAST(orders.customer_id AS STRING)",
        "right": "customers.id",
        "type": "inner"
    }]
)

Date Parsing Issues

Problem: Date columns are not recognized correctly
Solution:
# Parse dates explicitly
request = QueryRequest(
    sources=["sales.csv"],
    select=[
        "id",
        "CAST(date_column AS DATE) as sale_date",
        "amount"
    ],
    filters={"sales.csv": [
        {"column": "CAST(date_column AS DATE)", "op": ">=", "value": "2024-01-01"}
    ]}
)

Database Connection Issues

Connection String Problems

Problem: Cannot connect to database
# Error message
DataLoadingError: Connection failed: could not connect to server
Solutions:
from ajna_analytical_engine.config import ConfigManager

# Test connection string format
config = ConfigManager()

# PostgreSQL
config.add_database_connection(
    name="postgres_db",
    connection_string="postgresql://username:password@localhost:5432/database_name",
    pool_size=5,
    timeout_seconds=30
)

# MySQL
config.add_database_connection(
    name="mysql_db",
    connection_string="mysql://username:password@localhost:3306/database_name",
    pool_size=5,
    timeout_seconds=30
)

# Test connection
engine = AnalyticalEngine(config_manager=config)
status = engine.get_health_status()
print(f"Database connections: {status.database_connections}")

Permission Denied

Problem: Database access denied
# Error message
DataLoadingError: Access denied for user 'username'@'localhost'
Check database permissions:
-- For PostgreSQL
GRANT SELECT ON ALL TABLES IN SCHEMA public TO username;

-- For MySQL
GRANT SELECT ON database_name.* TO 'username'@'localhost';

Query Validation Errors

Invalid Filter Operators

Problem: Using unsupported filter operators
# Error message
QueryValidationError: Invalid operator 'contains' for column filter
Supported operators:
# Comparison operators
{"column": "age", "op": ">=", "value": 18}
{"column": "price", "op": "between", "value": [10, 100]}

# Set operations
{"column": "status", "op": "in", "value": ["active", "pending"]}
{"column": "category", "op": "not in", "value": ["test", "demo"]}

# Pattern matching
{"column": "name", "op": "like", "value": "John%"}
{"column": "email", "op": "ilike", "value": "%@gmail.com"}

# Null checks
{"column": "deleted_at", "op": "is null"}
{"column": "updated_at", "op": "is not null"}

Invalid Aggregation Functions

Problem: Using unsupported aggregation functions
# ✅ Supported functions
aggregations = [
    {"function": "sum", "column": "revenue"},
    {"function": "avg", "column": "price"},
    {"function": "count", "column": "*"},
    {"function": "min", "column": "date"},
    {"function": "max", "column": "date"},
    {"function": "stddev", "column": "amount"},
    {"function": "variance", "column": "amount"},
    {"function": "string_agg", "column": "name", "separator": ", "}
]

# ❌ Unsupported - will cause validation error
aggregations = [
    {"function": "median", "column": "price"},  # Use percentile instead
    {"function": "mode", "column": "category"}   # Not supported
]
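
For a median, the comment above points to a percentile aggregation. The exact parameter name is not documented here, so the "percentile" key below is an assumption; check the library's aggregation reference:
# Hypothetical median via a percentile aggregation
# NOTE: the "percentile" parameter name is assumed, not confirmed by these docs
aggregations = [
    {"function": "percentile", "column": "price", "percentile": 0.5}
]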

Cache Issues

Cache Not Working

Problem: Queries not using cache
# Check cache configuration
from ajna_analytical_engine.config import EngineConfig

config = EngineConfig(
    cache_enabled=True,
    cache_size_mb=512,
    cache_ttl_seconds=3600
)

engine = AnalyticalEngine(config=config)

# Check if query is cacheable
result = engine.execute_query(request)
print(f"Cache hit: {result.metadata.cache_hit}")
print(f"Query hash: {result.metadata.query_hash}")

Cache Memory Issues

Problem: Cache using too much memory
# Reduce cache size
config = EngineConfig(
    cache_enabled=True,
    cache_size_mb=256,  # Reduced from default
    cache_ttl_seconds=1800  # Shorter TTL
)

# Or disable cache for memory-constrained environments
config = EngineConfig(cache_enabled=False)

Common Error Messages

"Table not found"

# Check available tables
config = ConfigManager()
config.add_database_connection(
    name="db",
    connection_string="postgresql://user:pass@localhost:5432/mydb"
)

engine = AnalyticalEngine(config_manager=config)

# List available tables
status = engine.get_health_status()
print(f"Available tables: {status.available_tables}")

"Column does not exist"

# Check table schema
request = QueryRequest(
    sources=["table_name"],
    select=["*"],
    limit=1
)

result = engine.execute_query(request)
print(f"Available columns: {list(result.schema.keys())}")

"Cannot resolve column reference”

# Use proper table prefixes in joins
request = QueryRequest(
    sources=["orders.parquet", "customers.csv"],
    select=[
        "orders.id",
        "orders.amount",
        "customers.name"  # Prefix with table name
    ],
    joins=[{
        "left": "orders.customer_id",
        "right": "customers.id",
        "type": "inner"
    }]
)

Performance Debugging

Enable Detailed Logging

import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("ajna_analytical_engine")
logger.setLevel(logging.DEBUG)

# Run query with debug output
engine = AnalyticalEngine(debug=True)
result = engine.execute_query(request)

Memory Profiling

import psutil
import os

def monitor_memory():
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Memory usage: {memory_mb:.1f} MB")

# Monitor memory during query execution
monitor_memory()
result = engine.execute_query(request)
monitor_memory()

Query Optimization Checklist

  • ✅ Use Parquet files instead of CSV when possible
  • ✅ Select only columns you need
  • ✅ Apply filters early in the query
  • ✅ Use appropriate data types
  • ✅ Set reasonable limits
  • ✅ Enable caching for repeated queries
  • ✅ Use database indexes for join columns
  • ✅ Consider partitioning large datasets
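
A minimal sketch that combines several of these items, using the EngineConfig and QueryRequest interfaces shown earlier (file and column names are illustrative):
# Enable caching with a modest footprint
config = EngineConfig(
    cache_enabled=True,
    cache_size_mb=512,
    cache_ttl_seconds=3600
)
engine = AnalyticalEngine(config=config)

# Parquet source, narrow column selection, early filter, reasonable limit
request = QueryRequest(
    sources=["sales.parquet"],
    select=["id", "amount", "date"],
    filters={"sales.parquet": [
        {"column": "date", "op": ">=", "value": "2024-01-01"}
    ]},
    limit=1000
)

result = engine.execute_query(request)
print(f"Execution time: {result.metadata.execution_time_ms}ms")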

Getting Help

If you’re still having issues:
  1. Check the error message carefully - it usually contains the specific problem
  2. Enable debug mode to get more detailed information
  3. Check your data types - many issues stem from type mismatches
  4. Start with a simple query and add complexity gradually
  5. Verify your data sources exist and are accessible
For complex issues, create a minimal reproducible example to help diagnose the problem.
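
A minimal reproducible example can be as small as one tiny data file, one query, and the resulting output or error. A sketch (assuming a small sample.csv that still triggers the problem):
# minimal_repro.py - the smallest script that still shows the issue
from ajna_analytical_engine.config import EngineConfig

config = EngineConfig(cache_enabled=False)  # rule out caching effects
engine = AnalyticalEngine(config=config)

request = QueryRequest(
    sources=["sample.csv"],   # small file that reproduces the issue
    select=["id", "amount"],
    limit=10
)

result = engine.execute_query(request)
print(result.data)
print(result.metadata)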