Arus — Data Pipeline Platform

Data flows without the cluster.

Arus is a lightweight, self-hosted CDC & ETL framework purpose-built for teams running on VPS-class infrastructure (no Kubernetes). It ingests data from MySQL, MariaDB, PostgreSQL, and MongoDB sources, applies transformations, and lands the results in a PostgreSQL, MySQL, or ClickHouse data warehouse, with a visual DAG interface for monitoring and troubleshooting.


Features

Connector Framework

Pluggable source and destination connectors based on abstract base classes.

| Feature | Description | Documentation |
| --- | --- | --- |
| Source Connectors | MySQL, MariaDB, PostgreSQL, MongoDB; watermark-based batch extraction | Connectors Guide → |
| Destination Connectors | PostgreSQL, MySQL, ClickHouse; raw + normalized load modes | Connectors Guide → |
| Auto-discover Tables | Scan source databases, detect tables, columns, and sync modes automatically | Console Guide → |
| Column Type Mapping | Auto-map source types to destination types (e.g., MySQL TINYINT → PostgreSQL BOOLEAN) | Connectors Guide → |
| Custom Connectors | Implement BaseSource / BaseDestination for any database | Development Guide → |
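
As a rough illustration of the abstract-base-class pattern behind custom connectors, the sketch below defines a hypothetical `BaseSource` and a toy in-memory subclass. The method names (`discover_tables`, `extract`) and their signatures are assumptions made for this example, not Arus's actual interface; the Development Guide is the authority on the real one.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator


class BaseSource(ABC):
    """Hypothetical stand-in for Arus's BaseSource; the real interface may differ."""

    @abstractmethod
    def discover_tables(self) -> list[str]:
        """Return the names of tables available for syncing."""

    @abstractmethod
    def extract(self, table: str, watermark: Any = None) -> Iterator[dict]:
        """Yield rows from `table` that are newer than `watermark`."""


class InMemorySource(BaseSource):
    """Toy connector backed by a dict, standing in for a real database driver."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def discover_tables(self) -> list[str]:
        return sorted(self._tables)

    def extract(self, table, watermark=None):
        for row in self._tables[table]:
            # With no watermark yet, every row counts as new.
            if watermark is None or row["updated_at"] > watermark:
                yield row


source = InMemorySource({"users": [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-02-01"},
]})
new_rows = list(source.extract("users", watermark="2024-01-15"))
```

Because the base class is abstract, a subclass that forgets one of the methods fails at instantiation rather than in the middle of a sync run.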

Pipeline Orchestration

Core engine for scheduling, executing, and monitoring data syncs.

| Feature | Description | Documentation |
| --- | --- | --- |
| Incremental Sync | Watermark-based batch CDC using timestamp columns (updated_at, created_at) | Pipelines → |
| Full Refresh | Truncate and reload entire tables on demand or on schedule | Pipelines → |
| Backfill | Re-sync historical data from a specific date | Pipelines → |
| Scheduling | APScheduler cron-based with configurable intervals (default: every 5 minutes) | Pipelines → |
| Pipeline Dependencies | Chain pipelines: B waits for A's successful run | Pipelines → |
| Load Modes | Direct (source → analytics) or Raw → Normalize (staging JSONB → analytics) | Pipelines → |
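
To make the watermark idea concrete, here is a minimal sketch of incremental extraction against an in-memory SQLite table. It is not Arus's implementation: the `orders` table, the `incremental_sync` helper, and the empty-string starting watermark are all assumptions chosen to keep the example self-contained.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01 00:00:00"), (2, 20.0, "2024-01-02 00:00:00")],
)


def incremental_sync(conn, watermark):
    """Fetch rows changed after `watermark`; return them with the advanced watermark."""
    rows = conn.execute(
        "SELECT id, total, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark only if something was extracted.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


# First run: an empty-string watermark sorts before any timestamp, so all rows are picked up.
batch1, wm = incremental_sync(src, "")
# A row is updated at the source between runs.
src.execute("UPDATE orders SET total = 25.0, updated_at = '2024-01-03 00:00:00' WHERE id = 2")
# Second run: only the changed row is extracted.
batch2, wm = incremental_sync(src, wm)
```

A timestamp watermark only sees rows that still exist and whose timestamp moved forward, so hard deletes at the source are invisible to this kind of query.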

Transform Engine

Process data between extraction and loading.

| Feature | Description | Documentation |
| --- | --- | --- |
| Built-in Steps | Rename, remove, compute, filter, map values, type cast, concat fields | Pipelines → |
| Python Scripts | Custom transform(row) functions per pipeline | Pipelines → |
| Re-orderable Steps | Drag-and-drop step ordering in the Console UI | Console Guide → |
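
A custom `transform(row)` script might look something like the sketch below. The contract is assumed rather than documented here: the example treats `row` as a dict and returns the dict to be loaded, and the field names are invented for illustration.

```python
def transform(row):
    """Normalize one extracted row before it is loaded.

    Assumed contract (not confirmed by the docs): `row` is a dict and the
    returned dict is what gets loaded.
    """
    out = dict(row)  # copy so the input row is not mutated
    out["email"] = out.get("email", "").strip().lower()
    # Concatenate two fields into one, then remove the originals.
    out["full_name"] = f"{out.pop('first_name', '')} {out.pop('last_name', '')}".strip()
    # Flags stored as 0/1 integers are cast to real booleans.
    out["is_active"] = bool(out.get("is_active", 0))
    return out


result = transform({
    "first_name": "Ada", "last_name": "Lovelace",
    "email": "  Ada@Example.COM ", "is_active": 1,
})
```

The same effect could probably be reached with the built-in concat, remove, and type-cast steps; a script is mainly useful when the logic does not decompose into those steps.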

Reliability & Quality

Production-grade error handling and data validation.

| Feature | Description | Documentation |
| --- | --- | --- |
| Retry with Backoff | Exponential backoff via tenacity (default: 3 attempts, 2s → 16s max) | Pipelines → |
| Dead Letter Queue | Failed rows stored in staging._dead_letters for review and reprocessing | Pipelines → |
| Data Quality Checks | Row count validation + null checks on NOT NULL columns (threshold: 5%) | Pipelines → |
| Schema Drift Detection | Detect new columns in source, optionally auto-ALTER warehouse tables | Pipelines → |
| Soft-Delete Reconciliation | Track deleted_at columns and propagate deletions to warehouse | Pipelines → |
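
The retry behaviour can be pictured with a small standard-library sketch. Arus itself uses tenacity according to the table above; this example deliberately does not, so that it runs without extra dependencies. The `retry_with_backoff` helper and its parameters are hypothetical, with defaults chosen to mirror the documented ones (3 attempts, delays starting at 2s and capped at 16s).

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=2.0, max_delay=16.0, sleep=time.sleep):
    """Call `func` until it succeeds or `attempts` is exhausted.

    The wait doubles after each failure (2s, 4s, 8s, ...) and is capped at `max_delay`.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))


calls = {"count": 0}
waits = []


def flaky_extract():
    """Fail twice, then succeed, to simulate a transient connection error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return "ok"


# Record the delays instead of actually sleeping so the example runs instantly.
outcome = retry_with_backoff(flaky_extract, sleep=waits.append)
```

With only three attempts there are at most two waits, so the 16s cap matters only when the attempt count is raised.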

Alerting & Notifications

Stay informed about pipeline health.

| Feature | Description | Documentation |
| --- | --- | --- |
| Notification Targets | Telegram, Discord, Slack; configurable per pipeline | Pipelines → |
| Alert Events | Failure, success, dead letter, schema drift, quality breach | Pipelines → |
| Pipeline Linking | Link multiple targets to a pipeline with specific event types | Console Guide → |

Web Console

Browser-based management UI.

| Feature | Description | Documentation |
| --- | --- | --- |
| Dashboard | Stats cards, sync performance chart, recent runs feed, sources overview | Console Guide → |
| Source Management | Add, test, rescale, edit, delete source connections | Console Guide → |
| Pipeline Management | Create, configure, pause, resume, trigger pipelines | Console Guide → |
| Pipeline Detail | Run history, logs, transforms, dead letters, notifications per pipeline | Console Guide → |
| DAG View | Interactive SVG asset graph with zoom/pan and color-coded status | Console Guide → |
| Run History | Global view of all pipeline runs with filters and actions | Console Guide → |
| User Management | CRUD users with Admin/Editor/Viewer roles (admin only) | Console Guide → |
| Settings | Global runtime settings: schedule, retry, quality, notifications (admin only) | Console Guide → |

Authentication & Security

| Feature | Description | Documentation |
| --- | --- | --- |
| JWT Authentication | Access token (15 min) + refresh token (7 days) | Architecture → |
| Role-Based Access | Admin, Editor, Viewer roles with granular permissions | Architecture → |
| Password Hashing | bcrypt via passlib | Architecture → |
| Credential Encryption | Fernet AES-128-CBC for stored source/destination passwords | Security → |
| Rate Limiting | Login: 10 attempts per 60 seconds per IP | Security → |
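
One common way to enforce a limit such as "10 attempts per 60 seconds per IP" is a sliding-window counter. The sketch below shows that technique in plain Python; whether Arus uses a sliding window, a fixed window, or a library is not stated here, and the `LoginRateLimiter` class is hypothetical.

```python
import time
from collections import defaultdict, deque


class LoginRateLimiter:
    """Allow at most `limit` attempts per `window` seconds for each client IP."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._attempts = defaultdict(deque)  # ip -> timestamps of recent attempts

    def allow(self, ip):
        now = self.clock()
        recent = self._attempts[ip]
        # Drop attempts that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# A fake clock makes the behaviour deterministic for the example.
fake_now = [0.0]
limiter = LoginRateLimiter(clock=lambda: fake_now[0])
first_ten = [limiter.allow("203.0.113.7") for _ in range(10)]
eleventh = limiter.allow("203.0.113.7")   # over the limit within the window
other_ip = limiter.allow("198.51.100.2")  # limits are tracked per IP
fake_now[0] = 60.0
after_window = limiter.allow("203.0.113.7")  # old attempts have expired
```

An in-process dictionary like this only works for a single API process; a deployment with several workers would need shared state.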

Quick Comparison

| Feature | Arus | Airbyte OSS | Debezium | Custom Scripts |
| --- | --- | --- | --- | --- |
| Infrastructure | Docker Compose (3 containers) | Kubernetes or Docker + workers | Kafka + Zookeeper + Connect | Anything |
| Setup time | ~2 minutes | 30-60 minutes | 1-2 hours | Varies |
| RAM idle | ~200MB | ~2GB | ~3GB+ | ~100MB |
| CDC method | Watermark-based batch | Watermark + log-based | Log-based (WAL/binlog) | Custom |
| Schema drift | ✅ Auto | ✅ Auto | | |
| Dead letter queue | | | | |
| DAG / Pipeline UI | ✅ Built-in | ✅ Basic | | |
| Transform engine | ✅ Built-in steps + Python | ✅ Basic | | |

Getting Started

```bash
# Deploy Arus in under 2 minutes
docker compose up -d

# Access the console
open http://localhost:8082
```

New to Arus? Start with the Quickstart Guide →


Architecture at a Glance

                    Docker Host
                    ┌──────────────────────────────────────────┐
                    │  arus-console    arus-api                │
                    │  :8082 (nginx)   :8081 (FastAPI)         │
                    │       │               │                  │
                    │       └───────┬───────┘                  │
                    │               ▼                          │
                    │  arus-db (PostgreSQL)                    │
                    │  ├─ arus_config.*    (auth, sources,     │
                    │  │                   pipelines, settings)│
                    │  ├─ arus_state.*     (watermarks)        │
                    │  ├─ arus_run_logs.*  (run history)       │
                    │  ├─ staging.*        (raw landing zone)  │
                    │  └─ analytics.*      (normalized tables) │
                    └──────────────────────────────────────────┘

See the Architecture Guide → for a deep dive.


Project Status

| Phase | Focus | Status |
| --- | --- | --- |
| Phase 1 | Foundation: connectors, auth, console, core pipelines | ✅ Complete |
| Phase 2 | Reliability: retry, DLQ, quality checks, schema drift, transforms, notifications | ✅ Complete |
| Phase 3 | Production hardening: CLI, backfill UI, secrets, multi-env | 🔄 In Progress |
| Phase 4 | Advanced: log-based CDC, cloud warehouses, BI integration | 📋 Planned |

See the Roadmap → for details.