Arus — Data Pipeline Platform

Data flows without the cluster.

Arus is a lightweight, self-hosted CDC and ETL framework built for teams running on VPS-class infrastructure without Kubernetes. It ingests data from MySQL, MariaDB, PostgreSQL, and MongoDB sources, applies transformations, and lands the results in a PostgreSQL, MySQL, or ClickHouse data warehouse, with a visual DAG interface for monitoring and troubleshooting.

Features

Connector Framework

Pluggable source and destination connectors based on abstract base classes.

Feature	Description	Documentation
Source Connectors	MySQL, MariaDB, PostgreSQL, MongoDB — watermark-based batch extraction	Connectors Guide →
Destination Connectors	PostgreSQL, MySQL, ClickHouse — raw + normalized load modes	Connectors Guide →
Auto-discover Tables	Scan source databases, detect tables, columns, and sync modes automatically	Console Guide →
Column Type Mapping	Auto-map source types to destination types (e.g., MySQL `TINYINT` → PostgreSQL `BOOLEAN`)	Connectors Guide →
Custom Connectors	Implement `BaseSource` / `BaseDestination` for any database	Development Guide →
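
As a rough illustration of the Custom Connectors row above, a source connector could be shaped like the sketch below. The `BaseSource` class here is a stand-in defined only for this example; its method names (`discover_tables`, `extract`) and signatures are assumptions, not the actual Arus interface, which is documented in the Development Guide.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any, Iterator, Optional


class BaseSource(ABC):
    """Hypothetical stand-in for the Arus base class (assumed shape)."""

    @abstractmethod
    def discover_tables(self) -> list[str]:
        """Return the names of tables available for sync."""

    @abstractmethod
    def extract(self, table: str, watermark: Optional[str]) -> Iterator[dict[str, Any]]:
        """Yield rows changed after `watermark` (all rows if it is None)."""


class InMemorySource(BaseSource):
    """Toy connector backed by a dict, for illustration only."""

    def __init__(self, tables: dict[str, list[dict[str, Any]]]) -> None:
        self._tables = tables

    def discover_tables(self) -> list[str]:
        return sorted(self._tables)

    def extract(self, table: str, watermark: Optional[str]) -> Iterator[dict[str, Any]]:
        for row in self._tables[table]:
            # ISO-8601 timestamps in the same format compare correctly as strings.
            if watermark is None or row["updated_at"] > watermark:
                yield row


source = InMemorySource({
    "users": [
        {"id": 1, "updated_at": "2024-01-01T00:00:00"},
        {"id": 2, "updated_at": "2024-02-01T00:00:00"},
    ]
})
changed = list(source.extract("users", watermark="2024-01-15T00:00:00"))
```

A real connector would replace the dict lookup with a database query, but the split between discovery and watermark-driven extraction would likely stay the same.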

Pipeline Orchestration

Core engine for scheduling, executing, and monitoring data syncs.

Feature	Description	Documentation
Incremental Sync	Watermark-based batch CDC using timestamp columns (`updated_at`, `created_at`)	Pipelines →
Full Refresh	Truncate and reload entire tables on demand or on schedule	Pipelines →
Backfill	Re-sync historical data from a specific date	Pipelines →
Scheduling	APScheduler cron-based with configurable intervals (default: every 5 minutes)	Pipelines →
Pipeline Dependencies	Chain pipelines — B waits for A's successful run	Pipelines →
Load Modes	Direct (source → analytics) or Raw → Normalize (staging JSONB → analytics)	Pipelines →
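
The watermark-based incremental sync described above can be sketched as follows. This is a minimal illustration using an in-memory SQLite database as the source; the function name `sync_incremental`, the `users` table, and its columns are assumptions made for the example, not Arus internals.

```python
import sqlite3


def sync_incremental(conn, table, watermark):
    """Fetch rows with updated_at > watermark; return (rows, new_watermark).

    `table` is interpolated into the SQL, so it must come from trusted
    configuration, never from user input.
    """
    rows = conn.execute(
        f"SELECT id, name, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest row seen; keep it if nothing changed.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "ana", "2024-01-01 10:00:00"),
        (2, "budi", "2024-01-02 10:00:00"),
        (3, "citra", "2024-01-03 10:00:00"),
    ],
)

# First run picks up rows 2 and 3; the second run finds nothing new.
first_batch, wm = sync_incremental(conn, "users", "2024-01-01 12:00:00")
second_batch, wm2 = sync_incremental(conn, "users", wm)
```

Two general limits of this approach are worth keeping in mind: a strict `>` comparison can skip rows that share the watermark's exact timestamp but commit later, and hard deletes never show up in a timestamp query.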

Transform Engine

Process data between extraction and loading.

Feature	Description	Documentation
Built-in Steps	Rename, remove, compute, filter, map values, type cast, concat fields	Pipelines →
Python Scripts	Custom `transform(row)` functions per pipeline	Pipelines →
Re-orderable Steps	Drag-and-drop step ordering in the Console UI	Console Guide →
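
For a sense of what a per-pipeline Python script might look like, here is a small `transform(row)` sketch. The contract shown (a dict in, a dict out, `None` to drop the row) and the field names are assumptions for illustration; check the Pipelines guide for the actual contract.

```python
from typing import Any, Optional


def transform(row: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Assumed contract: return the modified row, or None to drop it."""
    # Filter: skip rows flagged as test data.
    if row.get("status") == "test":
        return None

    out = dict(row)  # copy so the caller's row is not mutated

    # Concat fields: merge first_name and last_name into full_name.
    first = out.pop("first_name", "")
    last = out.pop("last_name", "")
    out["full_name"] = f"{first} {last}".strip()

    # Compute: normalize email casing, tolerating missing or NULL values.
    out["email"] = (out.get("email") or "").lower()
    return out


result = transform({
    "first_name": "Ana",
    "last_name": "Putri",
    "email": "Ana@Example.com",
    "status": "active",
})
```

The same logic could also be expressed with the built-in concat, compute, and filter steps; a script is mainly useful when the steps need to share intermediate values.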

Reliability & Quality

Production-grade error handling and data validation.

Feature	Description	Documentation
Retry with Backoff	Exponential backoff via `tenacity` (default: 3 attempts, 2s → 16s max)	Pipelines →
Dead Letter Queue	Failed rows stored in `staging._dead_letters` for review and reprocessing	Pipelines →
Data Quality Checks	Row count validation + null checks on NOT NULL columns (threshold: 5%)	Pipelines →
Schema Drift Detection	Detect new columns in source, optionally auto-ALTER warehouse tables	Pipelines →
Soft-Delete Reconciliation	Track `deleted_at` columns and propagate deletions to warehouse	Pipelines →
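
To make the retry defaults above concrete, the sketch below shows one way a "3 attempts, 2s base, 16s cap" schedule could behave. It uses only the standard library rather than `tenacity`, and the exact mapping of the documented defaults onto delays is an assumption; the helper names `backoff_delays` and `retry` are hypothetical.

```python
import time


def backoff_delays(attempts, base=2.0, cap=16.0):
    """Waits between attempts: base * 2**n, capped at `cap` seconds."""
    return [min(base * 2 ** n, cap) for n in range(attempts - 1)]


def retry(func, attempts=3, base=2.0, cap=16.0, sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])


calls = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


slept = []  # record the waits instead of actually sleeping
result = retry(flaky, sleep=slept.append)
```

With three attempts there are only two waits (2s, then 4s); the 16s cap only comes into play when the attempt count is raised.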

Alerting & Notifications

Stay informed about pipeline health.

Feature	Description	Documentation
Notification Targets	Telegram, Discord, Slack — configurable per pipeline	Pipelines →
Alert Events	Failure, success, dead letter, schema drift, quality breach	Pipelines →
Pipeline Linking	Link multiple targets to a pipeline with specific event types	Console Guide →
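
Linking targets to a pipeline with specific event types amounts to a simple routing rule, sketched below. The `TargetLink` structure, the `targets_for` helper, and the event identifiers are hypothetical names chosen for this example and may not match how Arus stores links.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed identifiers for the five alert events listed above.
EVENTS = {"failure", "success", "dead_letter", "schema_drift", "quality_breach"}


@dataclass(frozen=True)
class TargetLink:
    """One notification target linked to a pipeline (hypothetical shape)."""

    target: str              # e.g. "ops-telegram"
    channel: str             # "telegram", "discord", or "slack"
    events: frozenset[str]   # events this target subscribes to


def targets_for(links: list[TargetLink], event: str) -> list[str]:
    """Return the targets that should be notified for `event`."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    return [link.target for link in links if event in link.events]


links = [
    TargetLink("ops-telegram", "telegram", frozenset({"failure", "dead_letter"})),
    TargetLink("data-slack", "slack", frozenset({"failure", "success", "schema_drift"})),
]
on_failure = targets_for(links, "failure")
on_success = targets_for(links, "success")
```

Keeping the subscription per link rather than per target lets the same Slack channel receive only failures from one pipeline and every event from another.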

Web Console

Browser-based management UI.

Feature	Description	Documentation
Dashboard	Stats cards, sync performance chart, recent runs feed, sources overview	Console Guide →
Source Management	Add, test, rescan, edit, delete source connections	Console Guide →
Pipeline Management	Create, configure, pause, resume, trigger pipelines	Console Guide →
Pipeline Detail	Run history, logs, transforms, dead letters, notifications per pipeline	Console Guide →
DAG View	Interactive SVG asset graph with zoom/pan and color-coded status	Console Guide →
Run History	Global view of all pipeline runs with filters and actions	Console Guide →
User Management	CRUD users with Admin/Editor/Viewer roles (admin only)	Console Guide →
Settings	Global runtime settings — schedule, retry, quality, notifications (admin only)	Console Guide →

Authentication & Security

Feature	Description	Documentation
JWT Authentication	Access token (15 min) + refresh token (7 days)	Architecture →
Role-Based Access	Admin, Editor, Viewer roles with granular permissions	Architecture →
Password Hashing	bcrypt via `passlib`	Architecture →
Credential Encryption	Fernet (AES-128-CBC with HMAC-SHA256) for stored source/destination passwords	Security →
Rate Limiting	Login: 10 attempts per 60 seconds per IP	Security →
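
The login limit of 10 attempts per 60 seconds per IP can be pictured as a sliding-window counter. The sketch below is an in-memory, single-process illustration; the `LoginRateLimiter` class is a hypothetical name, and the actual Arus implementation may work differently (for example, with a fixed window or a shared store).

```python
import time
from collections import defaultdict, deque


class LoginRateLimiter:
    """Sliding-window limiter: at most `max_attempts` per window per IP."""

    def __init__(self, max_attempts=10, window_seconds=60.0, clock=time.monotonic):
        self._max = max_attempts
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # ip -> timestamps of recent attempts

    def allow(self, ip):
        """Record an attempt and return True if it is within the limit."""
        now = self._clock()
        hits = self._hits[ip]
        # Drop attempts that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True


now = [0.0]  # fake clock value, advanced manually below
limiter = LoginRateLimiter(clock=lambda: now[0])

# Eleven attempts at t=0: the first ten pass, the eleventh is rejected.
results = [limiter.allow("203.0.113.7") for _ in range(11)]

# A different IP is unaffected, and the first IP recovers after 60 seconds.
other_ip_ok = limiter.allow("203.0.113.8")
now[0] = 60.0
after_window = limiter.allow("203.0.113.7")
```

Rejected attempts are not recorded here, so a blocked client regains access as soon as its oldest attempt ages out of the window.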

Quick Comparison

Feature	Arus	Airbyte OSS	Debezium	Custom Scripts
Infrastructure	Docker Compose (3 containers)	Kubernetes or Docker + workers	Kafka + ZooKeeper + Connect	Anything
Setup time	~2 minutes	30-60 minutes	1-2 hours	Varies
RAM (idle)	~200 MB	~2 GB	~3 GB+	~100 MB
CDC method	Watermark-based batch	Watermark + log-based	Log-based (WAL/binlog)	Custom
Schema drift	✅ Auto	✅ Auto	✅	❌
Dead letter queue	✅	✅	❌	❌
DAG / Pipeline UI	✅ Built-in	✅ Basic	❌	❌
Transform engine	✅ Built-in steps + Python	✅ Basic	❌	❌

Getting Started

```bash
# Deploy Arus in under 2 minutes
docker compose up -d

# Access the console (`open` is macOS-only; on Linux, use xdg-open or visit the URL in a browser)
open http://localhost:8082
```

New to Arus? Start with the Quickstart Guide →

Architecture at a Glance

                    Docker Host
                    ┌─────────────────────────────────────────┐
                    │  arus-console    arus-api               │
                    │  :8082 (nginx)   :8081 (FastAPI)        │
                    │       │               │                  │
                    │       └───────┬───────┘                  │
                    │               ▼                          │
                    │  arus-db (PostgreSQL)                    │
                    │  ├─ arus_config.*    (auth, sources,     │
                    │  │                   pipelines, settings)│
                    │  ├─ arus_state.*     (watermarks)        │
                    │  ├─ arus_run_logs.*  (run history)       │
                    │  ├─ staging.*        (raw landing zone)  │
                    │  └─ analytics.*      (normalized tables) │
                    └─────────────────────────────────────────┘

See the Architecture Guide → for a deep dive.

Project Status

Phase	Focus	Status
Phase 1	Foundation — connectors, auth, console, core pipelines	✅ Complete
Phase 2	Reliability — retry, DLQ, quality checks, schema drift, transforms, notifications	✅ Complete
Phase 3	Production hardening — CLI, backfill UI, secrets, multi-env	🔄 In Progress
Phase 4	Advanced — log-based CDC, cloud warehouses, BI integration	📋 Planned

See the Roadmap → for details.
