Research Data Integration & Analytics Platform

Aug 1, 2025

Introduction

A data integration and analytics solution was developed within the Research & Innovation department. The project brought together three self-hosted applications (eLabFTW, Apache Airflow, and Apache Superset) into a centralized, secure platform. Data workflows were automated to streamline research operations, improve reproducibility, and support advanced analytics. The system was deployed on cloud infrastructure, using containerization, reverse proxy configuration, and single sign-on authentication to give users secure access to every tool with one login. The result was an environment in which researchers could capture, manage, and analyze data more efficiently and reliably.

Repository Description

Within the Research & Innovation department, a fully integrated research data management and analytics environment was designed, deployed, and optimized. The work was carried out using containerized services orchestrated via Docker, ensuring portability and reproducibility across environments.

The implementation included:

eLabFTW configured as the primary electronic lab notebook (ELN), customized for departmental research workflows, metadata standards, and timestamped data certification.
Apache Airflow deployed for ETL pipeline automation, enabling scheduled ingestion, transformation, and synchronization of data between systems.
Apache Superset integrated for interactive dashboards and visual analytics, connected to both operational and analytical databases for real-time reporting.
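The ingestion, transformation, and synchronization steps described for the ETL pipelines can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the record fields, the `experiments` table, and the use of SQLite as a stand-in for the analytical database are all assumptions. In an Airflow deployment, each function would typically become one task in a scheduled DAG.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical records, shaped like entries exported from an ELN.
SAMPLE_RECORDS = [
    {"id": 1, "title": "Buffer prep", "team": "chem",
     "modified_at": "2025-07-01T09:30:00+00:00"},
    {"id": 2, "title": "  Assay run 7 ", "team": "BIO",
     "modified_at": "2025-07-02T14:05:00+00:00"},
]


def extract():
    """Stand-in for an API call or query against the source system."""
    return list(SAMPLE_RECORDS)


def transform(records):
    """Trim titles, lowercase team names, and reduce timestamps to UTC dates."""
    rows = []
    for rec in records:
        modified = datetime.fromisoformat(rec["modified_at"]).astimezone(timezone.utc)
        rows.append((
            rec["id"],
            rec["title"].strip(),
            rec["team"].lower(),
            modified.date().isoformat(),
        ))
    return rows


def load(conn, rows):
    """Write rows keyed by id, so re-running the pipeline does not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS experiments ("
        "id INTEGER PRIMARY KEY, title TEXT, team TEXT, modified_date TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO experiments (id, title, team, modified_date) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def run_pipeline(conn):
    """Run the three steps in order; an Airflow DAG would schedule this chain."""
    load(conn, transform(extract()))
```

Keying the load step on the record id makes the pipeline idempotent, which matters for scheduled runs that may be retried after a failure.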
Security and accessibility were addressed through the deployment of Keycloak as the identity provider, using OpenID Connect and SAML protocols to enable single sign-on (SSO) across all services. Reverse proxy rules and SSL/TLS encryption were implemented via NGINX to ensure secure, domain-based access.
The cloud deployment was hosted on Azure Virtual Machines, with performance tuning applied to handle concurrent workloads from multiple research teams. Data persistence and backups were managed through mounted volumes and automated snapshot policies. The system design also allowed for role-based access control, ensuring that data governance policies were upheld without limiting research agility.
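One way role-based access control can build on single sign-on is to map the roles carried in the identity provider's token onto application roles. The sketch below assumes that realm roles arrive in a `realm_access.roles` claim, which is where Keycloak access tokens place them by default. The role names on the left of `ROLE_MAP` are hypothetical, and the application roles on the right are borrowed from Superset's built-in roles purely for illustration.

```python
import base64
import json

# Hypothetical mapping from identity-provider roles to application roles.
ROLE_MAP = {
    "research-admin": "Admin",
    "research-analyst": "Alpha",
    "research-viewer": "Gamma",
}
DEFAULT_ROLE = "Public"


def decode_claims(token):
    """Decode the payload segment of a JWT WITHOUT verifying its signature.

    Illustration only: a real service must verify the signature against the
    identity provider's published keys before trusting any claim.
    """
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def resolve_roles(claims):
    """Return the sorted application roles for a set of token claims."""
    idp_roles = claims.get("realm_access", {}).get("roles", [])
    mapped = sorted({ROLE_MAP[role] for role in idp_roles if role in ROLE_MAP})
    return mapped or [DEFAULT_ROLE]
```

Unknown roles are ignored and users with no mapped role fall back to the least-privileged role, so a misconfigured account cannot gain access by default.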

This project not only consolidated tools into a single platform but also reduced administrative overhead, improved collaboration between research staff, and laid the foundation for future machine learning applications by standardizing the data lifecycle from capture to visualization.

No further information can be shared because the project involves confidential information.

Tools: SQL Server Management Studio, Anaconda, Jupyter Notebook, Git, Azure DevOps
Languages: SQL, Python
Python Libraries: Pandas, NumPy, Seaborn, scikit-learn, Prophet, ARIMA, traceback

Andres Camilo Viloria Garcia

Data Scientist | Data Analyst