Environmental Sensor Data Pipeline with Databricks and Expanso

A production-ready data pipeline for processing environmental sensor data (temperature, humidity, pressure, vibration, voltage) with real-time anomaly detection, multi-stage transformation, and Databricks Unity Catalog integration.

🚀 Quick Start

For complete setup and demo instructions, see: MASTER_SETUP_AND_DEMO.md

This is the single, authoritative guide that will take you from zero to a fully working demo.

πŸ“ Project Structure

.
├── MASTER_SETUP_AND_DEMO.md    # ⭐ START HERE - Complete setup guide
├── .env.example                # Environment configuration template
├── databricks-notebooks/       # Databricks AutoLoader notebooks
│   └── setup-and-run-autoloader.py  # Main AutoLoader notebook
├── scripts/                    # Automation and utility scripts
│   ├── validate-env.sh         # Validate environment configuration
│   ├── create-all-buckets.sh   # Create S3 buckets
│   ├── seed-buckets-for-autoloader.py  # Seed buckets with sample data
│   ├── fix-external-locations-individual.py  # Fix external location URLs
│   ├── turbo-delete-buckets.py # Fast bucket deletion
│   ├── clean-all-data.sh       # Clean bucket contents
│   ├── start-environmental-sensor.sh  # Start environmental sensor
│   └── run-anomaly-demo.sh     # Run complete demo
├── docs/                       # Additional documentation
│   ├── DEVELOPMENT_RULES.md
│   ├── ENVIRONMENT_SETUP.md
│   └── QUICK_START_CHECKLIST.md
├── jobs/                       # Expanso job specifications
│   └── edge-processing-job.yaml
└── spot/                       # Edge node deployment files
    └── instance-files/         # Files deployed to edge nodes
        ├── etc/                # Configuration files
        ├── opt/                # Service files and scripts
        └── setup.sh            # Node setup script

🎯 Key Features

  • Edge-First Architecture: Distributed data processing using Expanso Edge nodes
  • Anomaly Detection: Physics-based validation of sensor readings at the edge
  • Real-Time Processing: Streaming ingestion with Databricks AutoLoader
  • Schema Validation: Automatic JSON schema validation at the edge
  • Unity Catalog: Enterprise governance and data management
  • Containerized Edge Services: Docker-based sensors running on edge nodes

🔧 Architecture

graph LR
    A[Edge Node] --> B[Sensor]
    B --> C[SQLite]
    C --> D[Expanso Edge]
    D --> E{Validator}
    E -->|Valid| F[Validated S3]
    E -->|Invalid| G[Anomalies S3]
    F --> H[AutoLoader]
    G --> H
    H --> I[Databricks]
    I --> J[Unity Catalog]

📊 Pipeline Stages

  1. Raw Data: All sensor readings as received at edge nodes
  2. Validated Data: Readings that pass physics validation at the edge
  3. Anomalies: Readings that violate physics rules detected at the edge
  4. Schematized Data: Structured with enforced schema in Databricks
  5. Aggregated Data: Analytics-ready summaries in Databricks

🚦 Anomaly Detection Rules

The system detects anomalies in environmental sensor data:

  • Temperature anomalies: Values outside the -20°C to 60°C range
  • Humidity anomalies: Values outside the 5% to 95% range
  • Pressure anomalies: Values outside the 950-1050 hPa range
  • Vibration anomalies: Values exceeding 10 mm/s²
  • Voltage anomalies: Values outside the 20-25V range
  • Sensor-flagged anomalies: Records with anomaly_flag = 1 from the sensor
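
The edge validator routes each reading to the validated or anomalies bucket based on these rules. Below is a minimal Python sketch of that check, assuming the field names temperature, humidity, pressure, vibration, voltage, and anomaly_flag; the validator deployed in the Expanso job is the source of truth.

# Minimal sketch of the edge validation rules listed above.
# Field names are assumed from the anomaly rules; the deployed validator may differ.

def find_anomalies(reading: dict) -> list[str]:
    """Return the list of rules a sensor reading violates (empty means valid)."""
    rules = {
        "temperature_out_of_range": not (-20.0 <= reading["temperature"] <= 60.0),  # °C
        "humidity_out_of_range":    not (5.0 <= reading["humidity"] <= 95.0),       # %
        "pressure_out_of_range":    not (950.0 <= reading["pressure"] <= 1050.0),   # hPa
        "vibration_too_high":       reading["vibration"] > 10.0,                    # mm/s²
        "voltage_out_of_range":     not (20.0 <= reading["voltage"] <= 25.0),       # V
        "sensor_flagged":           reading.get("anomaly_flag", 0) == 1,
    }
    return [name for name, violated in rules.items() if violated]

if __name__ == "__main__":
    sample = {"temperature": 72.5, "humidity": 45.0, "pressure": 1012.0,
              "vibration": 2.1, "voltage": 24.1, "anomaly_flag": 0}
    print(find_anomalies(sample))  # ['temperature_out_of_range']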

Prerequisites

  • Python 3.11+
  • Docker
  • AWS Account with S3 access
  • Databricks Workspace with Unity Catalog enabled
  • uv package manager (pip install uv)
  • Expanso CLI (latest version)

πŸƒ Complete Setup Process

Phase 1: Environment Setup

  1. Clone and Configure:
# Clone the repository and copy the environment template
git clone https://github.com/bacalhau-project/aerolake.git
cd aerolake
cp .env.example .env   # fill in your AWS and Databricks values
  2. Validate Configuration:
./scripts/validate-env.sh

Phase 2: AWS Infrastructure

  1. Create S3 Buckets:
./scripts/create-all-buckets.sh

This creates all required buckets (a boto3 sketch of the naming pattern appears at the end of this phase):

  • expanso-raw-data-{region}
  • expanso-validated-data-{region}
  • expanso-anomalies-data-{region}
  • expanso-schematized-data-{region}
  • expanso-aggregated-data-{region}
  • expanso-checkpoints-{region}
  • expanso-metadata-{region}
  2. Setup IAM Role:
./scripts/create-databricks-iam-role.sh
./scripts/update-iam-role-for-new-buckets.sh
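
For reference, the bucket names follow an expanso-<stage>-<region> pattern. The following is a minimal boto3 sketch of that naming scheme; the region value and exact stage list are assumptions, and create-all-buckets.sh remains authoritative.

import boto3

REGION = "us-west-2"  # assumed example region; use your own
STAGES = ["raw-data", "validated-data", "anomalies-data",
          "schematized-data", "aggregated-data", "checkpoints", "metadata"]

s3 = boto3.client("s3", region_name=REGION)
for stage in STAGES:
    bucket = f"expanso-{stage}-{REGION}"
    # us-east-1 must omit CreateBucketConfiguration; other regions require it
    if REGION == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(Bucket=bucket,
                         CreateBucketConfiguration={"LocationConstraint": REGION})
    print(f"created {bucket}")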

Phase 3: Databricks Setup

  1. Setup Unity Catalog:
cd scripts
uv run -s setup-unity-catalog-storage.py
  2. Fix External Locations (Critical!):
# External locations may have wrong URLs - fix them
uv run -s fix-external-locations-individual.py
  3. Seed Buckets with Sample Data:
# AutoLoader needs sample files to infer schemas
uv run -s seed-buckets-for-autoloader.py
  4. Upload and Run AutoLoader Notebook:
uv run -s upload-and-run-notebook.py \
  --notebook ../databricks-notebooks/setup-and-run-autoloader.py
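
The notebook configures Databricks AutoLoader (cloudFiles) streams from the stage buckets into Unity Catalog tables. The following is a minimal PySpark sketch of that pattern, with assumed bucket and checkpoint paths; the uploaded notebook is authoritative.

# Minimal AutoLoader sketch (runs inside a Databricks notebook where `spark`
# is predefined). Bucket and checkpoint paths below are illustrative.
source_path     = "s3://expanso-validated-data-us-west-2/"
checkpoint_path = "s3://expanso-checkpoints-us-west-2/validated/"

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference needs seeded files
    .load(source_path)
)

query = (
    stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # processes the current backlog; the notebook may stream continuously
    .toTable("expanso_catalog.sensor_data.sensor_readings_validated")
)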

Phase 4: Run the Demo

  1. Start Environmental Sensor (an illustrative reading format is sketched after this list):
# Normal sensor (no anomalies)
./scripts/start-environmental-sensor.sh 300

# With anomalies (25% probability)
./scripts/start-environmental-sensor.sh 300 --with-anomalies
  2. Monitor Processing:

    • Open Databricks workspace
    • Navigate to the uploaded notebook
    • Watch as data flows through all 5 pipeline stages
    • View anomalies being detected and routed
  3. Query Results:

-- View ingested data
SELECT * FROM expanso_catalog.sensor_data.sensor_readings_ingestion;

-- View detected anomalies
SELECT * FROM expanso_catalog.sensor_data.sensor_readings_anomalies;

-- View aggregated metrics
SELECT * FROM expanso_catalog.sensor_data.sensor_readings_aggregated;
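
For orientation, a reading emitted by the sensor might look like the sketch below. Field names and ranges are assumed from the anomaly rules above, and --with-anomalies pushes roughly 25% of readings out of range; see spot/instance-files/ for the real sensor implementation.

import json, random
from datetime import datetime, timezone

def make_reading(with_anomalies: bool = False) -> dict:
    """Generate one illustrative reading; ~25% are out of range when with_anomalies is set."""
    anomalous = with_anomalies and random.random() < 0.25
    return {
        "timestamp":    datetime.now(timezone.utc).isoformat(),
        "temperature":  random.uniform(70, 90) if anomalous else random.uniform(15, 30),  # °C
        "humidity":     random.uniform(30, 60),    # %
        "pressure":     random.uniform(990, 1020), # hPa
        "vibration":    random.uniform(0, 3),      # mm/s²
        "voltage":      random.uniform(22, 24),    # V
        "anomaly_flag": 1 if anomalous else 0,
    }

print(json.dumps(make_reading(with_anomalies=True), indent=2))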

πŸ” Troubleshooting

Common Issues and Solutions

  1. AutoLoader Schema Inference Error:

    • Cause: Empty buckets
    • Solution: Run scripts/seed-buckets-for-autoloader.py
  2. UNAUTHORIZED_ACCESS Error:

    • Cause: External locations pointing to wrong bucket URLs
    • Solution: Run scripts/fix-external-locations-individual.py
  3. Permission Denied on External Locations:

    • Cause: IAM role not updated for new buckets
    • Solution: Run scripts/update-iam-role-for-new-buckets.sh
  4. No Data Flowing:

    • Cause: Checkpoints preventing reprocessing
    • Solution: Clean checkpoints with scripts/clean-all-data.sh
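
If you prefer to clear checkpoints by hand instead of running clean-all-data.sh, the sketch below shows the idea with boto3. The bucket name and prefix are assumptions; verify them against your environment first, since deleting checkpoints forces AutoLoader to reprocess everything under the matching source path.

import boto3

# Assumed names; check your .env before running anything destructive.
BUCKET = "expanso-checkpoints-us-west-2"
PREFIX = "validated/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if objects:
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": objects})
        print(f"deleted {len(objects)} checkpoint objects")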

Verification Scripts

# Check bucket structure
./scripts/check-bucket-structure.sh

# Verify Databricks setup
cd scripts && uv run -s verify-databricks-setup.py

# List external locations
uv run --with databricks-sdk --with python-dotenv python3 -c "
from databricks.sdk import WorkspaceClient
from dotenv import load_dotenv
import os
from pathlib import Path

load_dotenv(Path.cwd().parent / '.env')  # .env lives in the repo root; this snippet runs from scripts/
w = WorkspaceClient(host=os.getenv('DATABRICKS_HOST'), token=os.getenv('DATABRICKS_TOKEN'))

print('External Locations:')
for loc in w.external_locations.list():
    if 'expanso' in loc.name.lower():
        print(f'  {loc.name}: {loc.url}')
"

📈 Monitoring

Key Metrics to Track

  • Ingestion Rate: Files/second being processed
  • Anomaly Rate: Percentage of readings flagged
  • Processing Latency: Time from sensor to Unity Catalog
  • Schema Evolution: New columns being added

Dashboard Queries

-- Anomaly detection rate
SELECT 
    DATE(processing_timestamp) as date,
    COUNT(*) as anomaly_count,
    AVG(temperature) as avg_temperature,
    AVG(vibration) as avg_vibration
FROM expanso_catalog.sensor_data.sensor_readings_anomalies
GROUP BY DATE(processing_timestamp)
ORDER BY date DESC;

-- Pipeline throughput
SELECT 
    stage,
    COUNT(*) as record_count,
    MAX(processing_timestamp) as last_update
FROM (
    SELECT 'ingestion' as stage, processing_timestamp 
    FROM expanso_catalog.sensor_data.sensor_readings_ingestion
    UNION ALL
    SELECT 'validated' as stage, processing_timestamp 
    FROM expanso_catalog.sensor_data.sensor_readings_validated
    UNION ALL
    SELECT 'anomalies' as stage, processing_timestamp 
    FROM expanso_catalog.sensor_data.sensor_readings_anomalies
)
GROUP BY stage;

🧹 Cleanup

To completely clean up all resources:

# Delete all S3 buckets and contents
./scripts/turbo-delete-buckets.py

# Clean Unity Catalog objects
cd scripts && uv run -s cleanup-unity-catalog.py

# Remove local artifacts
rm -rf .flox/run .env *.db *.log

📚 Additional Resources

  • MASTER_SETUP_AND_DEMO.md - complete setup and demo guide
  • docs/ENVIRONMENT_SETUP.md - environment configuration details
  • docs/QUICK_START_CHECKLIST.md - condensed setup checklist
  • docs/DEVELOPMENT_RULES.md - coding standards for contributors

🤝 Contributing

  1. Follow the coding standards in docs/DEVELOPMENT_RULES.md
  2. Always use uv for Python scripts
  3. Test with both normal and anomaly data
  4. Update documentation for any new features

πŸ“ License

MIT License - See LICENSE file for details

🆘 Support

For issues or questions:

  1. Check the Master Setup Guide
  2. Review the troubleshooting section above
  3. Open an issue with:
    • Environment details (.env without secrets)
    • Error messages and logs
    • Steps to reproduce

🎉 Success Indicators

You know the pipeline is working when:

  • ✅ All five pipeline-stage S3 buckets have data flowing
  • ✅ Unity Catalog tables are populated
  • ✅ Anomalies are being detected and routed
  • ✅ AutoLoader is processing files continuously
  • ✅ No errors in Databricks notebook execution
