Sofia Escobedo
Data processing pipeline

Cloud and backend infrastructure to measure the electrical usage and cost of industrial factories through IoT electrical sensors

Next-E

Technologies Used

Serverless, Project Management, IoT, AWS CDK, AWS CloudFormation, Kinesis, Architecture


• I built the entire infrastructure as code using AWS CDK, which kept everything scalable, flexible, and easy to maintain, especially given the wide variety of sensors and hardware we planned to integrate in the future (a CDK sketch follows this list).

• The IoT sensors in the industrial factories send electrical measurements every 5 minutes. I set up a local SQLite database on each device to temporarily store records whenever the connection dropped, then automatically upload them to the cloud later, ensuring zero data loss (see the buffering sketch after this list).

• On AWS, I set up a serverless infrastructure using services like Lambda, Step Functions, and Kinesis Data Streams and Firehose to process the information as it arrived. While we didn’t need exact real-time analytics, it was important to automatically calculate the electrical cost of each measurement within our data pipeline (a consumer sketch is shown below).

• Using Python and Pandas, I developed ETLs that run on AWS Step Functions to integrate external data sources our clients provided, and I also built a library that greatly simplifies queries to InfluxDB using the Flux query language (sketched after this list).
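
To make the first point concrete, here is a minimal sketch of what such a CDK stack can look like in Python. The construct names, the Lambda asset path, and the exact wiring are illustrative assumptions, not our production stack.

```python
# Minimal sketch of an ingestion stack defined as code with AWS CDK v2.
# All names (IngestionStack, SensorStream, handler path) are hypothetical.
from aws_cdk import Stack, Duration
from aws_cdk import aws_kinesis as kinesis
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_lambda_event_sources as event_sources
from constructs import Construct


class IngestionStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Stream that receives the raw sensor measurements.
        stream = kinesis.Stream(
            self, "SensorStream",
            stream_name="sensor-measurements",
            retention_period=Duration.hours(24),
        )

        # Lambda that enriches each measurement (e.g. cost calculation).
        processor = _lambda.Function(
            self, "MeasurementProcessor",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="handler.process",
            code=_lambda.Code.from_asset("lambda/processor"),
        )

        # Wire the stream to the Lambda with batching.
        processor.add_event_source(
            event_sources.KinesisEventSource(
                stream,
                starting_position=_lambda.StartingPosition.TRIM_HORIZON,
                batch_size=100,
            )
        )
```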
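
The on-device buffering from the second point follows a write-locally-first pattern: every measurement lands in SQLite, and a separate step flushes pending rows once connectivity returns. Below is a minimal sketch of that idea; the schema, file path, and `upload` callback are assumptions for illustration.

```python
# Sketch of on-device buffering: write locally first, upload when possible.
import sqlite3
import time

DB_PATH = "/var/lib/sensor/buffer.db"  # hypothetical location


def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS measurements (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               taken_at REAL NOT NULL,
               kwh REAL NOT NULL,
               uploaded INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, kwh: float) -> None:
    """Always persist locally first, so nothing is lost if the uplink is down."""
    conn.execute(
        "INSERT INTO measurements (taken_at, kwh) VALUES (?, ?)",
        (time.time(), kwh),
    )
    conn.commit()


def flush(conn: sqlite3.Connection, upload) -> None:
    """Push pending rows to the cloud; mark each one only after success."""
    rows = conn.execute(
        "SELECT id, taken_at, kwh FROM measurements WHERE uploaded = 0"
    ).fetchall()
    for row_id, taken_at, kwh in rows:
        upload({"taken_at": taken_at, "kwh": kwh})  # assumed to raise on failure
        conn.execute("UPDATE measurements SET uploaded = 1 WHERE id = ?", (row_id,))
        conn.commit()
```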
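
For the cost calculation in the pipeline, a Lambda consuming the Kinesis stream can enrich each record before it moves downstream. This is a minimal sketch under an assumed flat tariff and record layout; real tariffs depended on each client’s contract.

```python
# Sketch of a Kinesis-triggered Lambda that adds a cost to each measurement.
import base64
import json

TARIFF_PER_KWH = 0.12  # assumed flat tariff, for illustration only


def process(event, context):
    """Decode each Kinesis record and enrich it with its electrical cost."""
    enriched = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["cost"] = payload["kwh"] * TARIFF_PER_KWH
        enriched.append(payload)
    # Downstream delivery (e.g. to Firehose) would happen here.
    return {"processed": len(enriched)}
```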
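
And a sketch of the kind of helper the InfluxDB library provided: compose a common Flux query from plain arguments and run it through the official `influxdb_client` package, getting a Pandas DataFrame back. The bucket and measurement names are illustrative.

```python
# Sketch of a Flux helper: build and run a common query without hand-writing Flux.
from influxdb_client import InfluxDBClient


def measurements_query(bucket: str, measurement: str, start: str = "-24h") -> str:
    """Compose a basic Flux range/filter query from plain arguments."""
    return (
        f'from(bucket: "{bucket}")'
        f' |> range(start: {start})'
        f' |> filter(fn: (r) => r._measurement == "{measurement}")'
    )


def fetch_dataframe(url: str, token: str, org: str, flux: str):
    """Run a Flux query and return the result as a Pandas DataFrame."""
    with InfluxDBClient(url=url, token=token, org=org) as client:
        return client.query_api().query_data_frame(flux)
```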

As part of the project, I developed a Python script to easily check the execution status of our AWS Step Functions. It lets you see in real time how a process is going and the estimated time to completion, fetches the results automatically, and best of all: if something fails, you get a clear, friendly description of the error so you can fix it quickly without digging through confusing logs.
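
Below is a minimal sketch of such a checker built on boto3’s Step Functions client; the polling interval and printed messages are assumptions, and the original script’s time estimation is left out for brevity.

```python
# Sketch of a Step Functions execution watcher using boto3.
import time
import boto3

sfn = boto3.client("stepfunctions")


def watch_execution(execution_arn: str, poll_seconds: int = 10) -> dict:
    """Poll an execution until it finishes, printing a friendly summary."""
    while True:
        desc = sfn.describe_execution(executionArn=execution_arn)
        status = desc["status"]
        if status == "RUNNING":
            elapsed = time.time() - desc["startDate"].timestamp()
            print(f"Still running... {elapsed:.0f}s elapsed")
            time.sleep(poll_seconds)
            continue
        if status == "SUCCEEDED":
            print("Done ✅")
            return desc  # 'output' holds the result JSON
        # Failed/aborted/timed out: surface a readable error instead of raw logs.
        print(f"Execution {status}: {desc.get('error')} - {desc.get('cause')}")
        return desc
```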

Finally, something that stood out was how we made use of IaaS together with visualization in Grafana and Power BI on top of AWS Athena, all while maintaining full visibility with tools like CloudWatch and Sentry 🔎.

This combination allowed us to maintain an efficient, economical, and easily scalable operation, especially for a startup working with an agile methodology. 💡 Our goal was to deliver key performance indicators (KPIs) from the information we collected across all our Data Lake sources, with a business intelligence (BI) objective, combining transactional (OLTP) databases with analytical (OLAP) workloads.

Why AWS Kinesis instead of Apache Kafka?

We needed a simple, scalable solution that was quick to deploy from scratch. I chose AWS Kinesis because, compared to Apache Kafka, it is serverless, which means zero additional infrastructure management, lower upfront costs, and a much faster start.

Note on Event-Driven Architectures: These architectures are based on generating, detecting, and reacting to events in real time, allowing systems to respond asynchronously and flexibly to specific changes or actions. This facilitates application scalability and responsiveness.

Lambda automatically scales on demand and follows a pay-as-you-go model, resulting in a more efficient and cost-effective solution for applications with variable workloads.

Check out the AWS infrastructure diagram 👇

Architecture Diagram
