Adore Me


Medium blog: https://adoreme.tech

Careers site: https://bucharest.adoreme.com

(P)TL, a new data engineering architecture

31.03.2022

Dragos C

Dec 2, 202

Introduction

In the classic data engineering architecture, data is extracted from the source system and lands in a staging database or a data lake (usually as files on storage); after that, it is transformed and delivered to the business. The problem with this architecture is that it is not sustainable in the long run: with tens or hundreds of systems, each with its own data and specific business logic, a single data engineering team cannot handle all the data.

Why the “E” in ETL is bad

Extracting data from source systems has a lot of problems. Here are some reasons why you shouldn’t do it:

You will need to handle various types of authentication (because you can extract data from APIs, databases, SFTP servers, or virtual machines, and every system has its own auth method).
You will need to spend a lot of time analyzing the source systems' tables and understanding their business and technical logic.
You will become a blocker for any integration because your team will be the only team that extracts data using an ETL tool.
Extracting data can also become a mind-numbing activity, and you will grow more bored with every new source.
You will never have streaming/real-time data because the extractions run as cron jobs (at a specific time).

What (P) means in (P)TL

We have seen that extraction is not a good approach from an infrastructure perspective, but what other options do we have? Well, (P) comes from pushing the data from the source into the data lake or staging database, using a message broker such as RabbitMQ, Kafka, or Pub/Sub, and automating the transfer of data from the queue to the database.

Basically, every source team will own the transfer of its data to the message broker, and the data engineering team will automate the insertion into the database and create tools and standards for the source teams' programmers. In this way, the data engineering team becomes a platform team, like the DevOps team.
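To make the push flow concrete, here is a minimal sketch of both sides of it. It is only an illustration under stated assumptions: an in-memory `queue.Queue` stands in for the message broker (RabbitMQ, Kafka, or Pub/Sub), SQLite stands in for the staging database, and the names `publish_event`, `drain_to_staging`, and `staging_events` are hypothetical, not part of any real tool.

```python
import json
import queue
import sqlite3

# Assumption: an in-memory queue stands in for a real broker topic.
broker = queue.Queue()


def publish_event(source, payload):
    """Source team side: push one event to the broker as a JSON message."""
    broker.put(json.dumps({"source": source, "payload": payload}))


def drain_to_staging(conn):
    """Platform side: move every queued message into the staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_events (source TEXT, payload TEXT)"
    )
    inserted = 0
    while True:
        try:
            message = broker.get_nowait()
        except queue.Empty:
            break  # nothing left to consume
        event = json.loads(message)
        conn.execute(
            "INSERT INTO staging_events (source, payload) VALUES (?, ?)",
            (event["source"], json.dumps(event["payload"])),
        )
        inserted += 1
    conn.commit()
    return inserted


# Usage: two source teams push events; the platform loads them.
conn = sqlite3.connect(":memory:")  # assumption: SQLite as the staging database
publish_event("orders", {"order_id": 1, "total": 59.9})
publish_event("payments", {"payment_id": 7, "order_id": 1})
rows_inserted = drain_to_staging(conn)
```

In a real deployment the consumer would be triggered by the broker as messages arrive rather than called by hand, which is what removes the cron job from the picture.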

(P)TL architecture using Google Cloud services

Advantages of (P)TL:

Real-time/streaming data.
Event-triggered processing instead of cron jobs.
The source teams will have ownership of the data.
No more SQL (because it will be automated based on a JSON schema or another configuration file).
Data will be part of the products.
Data engineering becomes a platform team.
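The "no more SQL" point can be sketched as follows: the statements are generated from a schema that the source team provides. This is a rough illustration, not a definitive implementation. The `ORDER_SCHEMA` content, the `TYPE_MAP` type mapping, and the helper names `create_table_sql` and `insert_sql` are all assumptions made for the example, and SQLite is used only so the sketch runs on its own.

```python
import sqlite3

# Hypothetical JSON schema a source team would register for its events.
ORDER_SCHEMA = {
    "title": "orders",
    "properties": {
        "order_id": {"type": "integer"},
        "customer_email": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Assumed mapping from JSON Schema types to SQLite column types.
TYPE_MAP = {
    "integer": "INTEGER",
    "number": "REAL",
    "string": "TEXT",
    "boolean": "INTEGER",
}


def create_table_sql(schema):
    """Build a CREATE TABLE statement from the schema's properties."""
    columns = ", ".join(
        f'"{name}" {TYPE_MAP[spec["type"]]}'
        for name, spec in schema["properties"].items()
    )
    return f'CREATE TABLE IF NOT EXISTS "{schema["title"]}" ({columns})'


def insert_sql(schema):
    """Build a parameterized INSERT with one named placeholder per property."""
    names = list(schema["properties"])
    columns = ", ".join(f'"{n}"' for n in names)
    placeholders = ", ".join(f":{n}" for n in names)
    return f'INSERT INTO "{schema["title"]}" ({columns}) VALUES ({placeholders})'


# Usage: nobody writes SQL by hand; both statements come from the schema.
conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql(ORDER_SCHEMA))
conn.execute(
    insert_sql(ORDER_SCHEMA),
    {"order_id": 1, "customer_email": "a@example.com", "total": 59.9},
)
row = conn.execute('SELECT * FROM "orders"').fetchone()
```

Because the table and column names come from a schema file rather than from user input at runtime, the platform team can review and validate them once when the schema is registered.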

Conclusion

The (P)TL architecture is more sustainable and reliable than classic ETL, and it transforms the data engineering team from an integration team into a platform team.
