
Tooling Recommendations

EDP should prefer open-source, PostgreSQL-centered, container-friendly, Git-managed tools where practical. Add complexity only when a category earns it through scale, governance needs, operational risk, or user demand.

Suggested Tooling by Category

| Category | Recommended Starting Tooling | Later or Optional |
| --- | --- | --- |
| Connectors and Data Movement | Custom Python connectors, Airbyte, Meltano | Debezium for CDC, Kafka or Redpanda for event streams |
| Orchestration and Workflow | Apache Airflow | Dagster for asset-centric orchestration, Temporal for durable business workflows |
| Storage and Persistence | PostgreSQL | Object storage plus Parquet, ClickHouse, TimescaleDB |
| Transformation and Modeling | dbt | SQLMesh, custom Python transforms |
| Data Quality and Validation | dbt tests, Great Expectations, Soda Core | Deequ for Spark-heavy environments |
| Catalog, Metadata, and Lineage | OpenMetadata | DataHub for larger metadata and governance programs |
| Semantic Layer and Metrics | dbt semantic layer concepts, Superset semantic layer | Cube, Lightdash, commercial semantic layers |
| Consumer Experience | Apache Superset, Grafana, FastAPI plus Vue | Metabase, Evidence.dev, Power BI |
| Custom Applications and APIs | FastAPI, Vue, NGINX | React, .NET APIs, internal developer portals |
| Identity and Access | Existing identity provider, Keycloak if self-hosting | Open Policy Agent for policy-as-code |
| Secrets and Configuration | Kubernetes Secrets for a basic start, SOPS plus age | HashiCorp Vault or OpenBao |
| Observability and Monitoring | Prometheus, Grafana, Loki | OpenTelemetry, Tempo, Alertmanager |
| DevOps, GitOps, and CI/CD | GitHub Actions, Docker, Kubernetes manifests | Argo CD, Flux, Helm, Renovate |
| Search and Indexing | PostgreSQL full-text search | OpenSearch when search becomes central |
| Documentation and Knowledge | VitePress, Markdown, ADRs | OpenMetadata docs, Evidence.dev reports |
| Privacy and Governance Controls | PostgreSQL roles and views, dbt tests, documented classifications | Data masking, policy engines, catalog-driven classification |

Start with:

  • PostgreSQL for persistence
  • PostgreSQL schemas for raw, ODS, Data Vault, Data Marts, platform metadata, and application state
  • Airflow for orchestration
  • dbt for transformations, tests, lineage, and model documentation
  • Superset for the main BI portal
  • Grafana for platform and operational dashboards
  • FastAPI plus Vue for workflow-oriented applications
  • VitePress for documentation
  • Docker, Kubernetes-ready manifests, and GitHub Actions for delivery
  • GitOps-friendly deployment definitions that can be applied repeatedly to a target environment and produce the same result each time
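
The PostgreSQL schema layout in the list above could be bootstrapped with DDL that is safe to apply repeatedly. The following is a minimal sketch, not a prescribed layout: the schema identifiers (`raw`, `ods`, `data_vault`, `marts`, `platform_meta`, `app_state`) and the helper `schema_ddl` are assumptions chosen for illustration, since the list names the layers but not their identifiers.

```python
# Hypothetical schema identifiers mapped to the layer each one represents.
# The layer names come from the list above; the identifiers are assumptions.
EDP_SCHEMAS = {
    "raw": "Raw layer",
    "ods": "ODS layer",
    "data_vault": "Data Vault layer",
    "marts": "Data Marts layer",
    "platform_meta": "Platform metadata",
    "app_state": "Application state",
}


def schema_ddl(schemas: dict[str, str]) -> list[str]:
    """Build DDL statements that can be re-applied without error."""
    statements = []
    for name, description in schemas.items():
        # Reject anything that is not a plain identifier before interpolating it.
        if not name.isidentifier():
            raise ValueError(f"unsafe schema name: {name!r}")
        escaped = description.replace("'", "''")  # escape quotes for the SQL literal
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {name};")
        statements.append(f"COMMENT ON SCHEMA {name} IS '{escaped}';")
    return statements


if __name__ == "__main__":
    print("\n".join(schema_ddl(EDP_SCHEMAS)))
```

The output can be committed to Git and applied by whatever migration or deployment step the team already uses. Because every statement is idempotent, the same file can be run against a fresh or an existing database.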

This stack is broad enough to prove the EDP concept while staying understandable for a small platform team.

See Runtime Infrastructure for recommended hosting, operating system, container, and GitOps patterns.

Phase Guidance

Phase 1 should establish the platform core:

  • PostgreSQL
  • Airflow
  • dbt
  • Superset
  • Grafana
  • VitePress
  • GitHub Actions

Phase 2 should improve trust, operations, and governance:

  • Great Expectations or Soda Core
  • OpenMetadata
  • SOPS plus age
  • Prometheus
  • Loki
  • Alertmanager

Phase 3 should scale integration and user-facing value:

  • Airbyte or Meltano for connector scale
  • Debezium for change data capture
  • Evidence.dev for curated report pages
  • FastAPI and Vue applications for operational workflows

Phase 4 should add specialized scale and mature operations:

  • Object storage plus Parquet for large raw history
  • ClickHouse for high-volume analytical workloads
  • TimescaleDB for heavy time-series workloads
  • Temporal for durable workflow applications
  • OpenSearch for cross-entity search
  • Vault or OpenBao for mature secrets management
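
The phase lists above can also be kept as data in the repository, so that the expected tool set for a given phase can be checked in CI or shown in documentation. The following is a minimal sketch under that assumption; `PHASES` mirrors the lists above, while the function name `tools_through` is hypothetical.

```python
# Tools introduced in each phase, copied from the phase lists above.
PHASES = {
    1: ["PostgreSQL", "Airflow", "dbt", "Superset", "Grafana",
        "VitePress", "GitHub Actions"],
    2: ["Great Expectations or Soda Core", "OpenMetadata", "SOPS plus age",
        "Prometheus", "Loki", "Alertmanager"],
    3: ["Airbyte or Meltano", "Debezium", "Evidence.dev",
        "FastAPI and Vue applications"],
    4: ["Object storage plus Parquet", "ClickHouse", "TimescaleDB",
        "Temporal", "OpenSearch", "Vault or OpenBao"],
}


def tools_through(phase: int) -> list[str]:
    """Return every tool expected to be in place once `phase` is complete."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    # Phases are cumulative: later phases add to earlier ones.
    return [tool for p in sorted(PHASES) if p <= phase for tool in PHASES[p]]


if __name__ == "__main__":
    for phase in sorted(PHASES):
        print(f"Phase {phase}: {len(tools_through(phase))} tools in total")
```

Treating the phases as cumulative is a design choice of this sketch; a team that retires tools between phases would need a different structure.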

Selection Principles

Start with the fewest tools that can operate the full data lifecycle safely.

Prefer tools that integrate cleanly with Git, containers, PostgreSQL, SQL, Python, and the platform documentation workflow.

Avoid duplicating responsibilities. For example, dashboards should not own transformation logic, custom applications should not maintain private reporting models, and notebooks should not become production pipelines.
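
One lightweight way to watch for duplicated responsibilities is to keep an inventory of which component performs which responsibility and flag any responsibility with more than one owner. The following is a minimal sketch; the inventory entries in `CLAIMS` and the function name `duplicated_responsibilities` are hypothetical, not part of the platform.

```python
from collections import defaultdict

# Hypothetical inventory of (component, responsibility it currently performs).
CLAIMS = [
    ("dbt", "transformation"),
    ("Superset", "dashboards"),
    ("Superset", "transformation"),  # e.g. business logic buried in a dashboard query
    ("Airflow", "orchestration"),
    ("FastAPI app", "workflow"),
]


def duplicated_responsibilities(claims):
    """Map each responsibility owned by more than one component to its owners."""
    owners = defaultdict(set)
    for component, responsibility in claims:
        owners[responsibility].add(component)
    return {
        responsibility: sorted(components)
        for responsibility, components in owners.items()
        if len(components) > 1
    }


if __name__ == "__main__":
    for responsibility, components in duplicated_responsibilities(CLAIMS).items():
        print(f"{responsibility}: owned by {', '.join(components)}")
```

In this example, `transformation` is flagged because both dbt and a Superset dashboard claim it, which is the kind of overlap the principle above is meant to prevent.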

Introduce a specialized system only when it solves a specific, demonstrated problem: scale, durability, governance, lineage, search, real-time events, or consumer usability.