As a big data enthusiast and pipeline developer, I have been using Google Cloud Dataflow for a while now and have deployed many batch and streaming pipelines, most of them with Spotify’s Scio, a Scala API for Apache Beam and Google Cloud Dataflow. I found it very well suited to our use case: analyzing physiological data, sent from wearable smart devices, into emotional data. I’m happy to share the result of our extensive work: our architecture and the integration of Google Cloud Dataflow with other excellent Google Cloud services.
I think Scala and Spotify’s Scio were the right choice of language and framework for creating our Google Cloud Dataflow pipelines. Here is how our pipeline works, step by step:
1. Data acquisition
The processing starts when the wearable smart devices send the user’s physiological data. The data is sent to the Data Gateway, which publishes a Google Pub/Sub message to a topic so that other services may fetch the raw data as well. Our Dataflow pipeline is a streaming job that subscribes to the raw data Pub/Sub topic; the pipeline is up 24/7 and starts analyzing data as soon as it arrives from Pub/Sub.
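The first thing such a streaming job has to do is turn raw Pub/Sub payloads into typed records. The exact message format our Data Gateway publishes is not shown here, so the comma-separated payload below is purely an illustrative assumption, as are the field names:

```scala
// Hypothetical record shape -- the actual message format published by the
// Data Gateway is an assumption for illustration.
final case class RawSample(userId: String, timestampMs: Long, heartRate: Double, eda: Double)

object RawSample {
  // Parse a comma-separated Pub/Sub payload into a RawSample; malformed input yields None,
  // so bad messages can be dropped instead of crashing the streaming job.
  def parse(payload: String): Option[RawSample] =
    payload.split(",") match {
      case Array(uid, ts, hr, eda) =>
        try Some(RawSample(uid.trim, ts.trim.toLong, hr.trim.toDouble, eda.trim.toDouble))
        catch { case _: NumberFormatException => None }
      case _ => None
    }
}
```

In the actual pipeline, a step like this would sit right after the Pub/Sub read, mapping each incoming message into a typed `SCollection` element.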
2. Data analysis
The first step is windowing the data by timestamp using a sliding window (required by the algorithm), then grouping the data by user id for per-user emotional analysis. After the grouping, a data validation step filters out invalid data. Lastly, an emotional load algorithm, provided as a package by the data science team, is executed on the data. This separation between the data science and big data code lets the two teams proceed independently as long as they do not break the integration point. Now the applications can consume the emotional load analysis and provide an emotionally aware, much richer user experience.
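To make the window-group-validate sequence concrete, here is a minimal sketch of the per-window logic modeled with plain Scala collections (in the real pipeline, sliding windows and grouping are handled by Beam/Scio primitives). The window parameters and the validity range are illustrative assumptions, not the data science team’s actual thresholds:

```scala
final case class Sample(userId: String, timestampMs: Long, value: Double)

object WindowedAnalysis {
  // Slide a window of `sizeMs` every `periodMs` over the samples; within each
  // window, drop implausible readings and group the rest by user id.
  // The validity range (0, 250) is a hypothetical placeholder.
  def slidingWindows(
      samples: Seq[Sample],
      sizeMs: Long,
      periodMs: Long
  ): Seq[(Long, Map[String, Seq[Sample]])] =
    if (samples.isEmpty) Seq.empty
    else {
      val minTs = samples.map(_.timestampMs).min
      val maxTs = samples.map(_.timestampMs).max
      (minTs to maxTs by periodMs).map { start =>
        val inWindow = samples.filter(s => s.timestampMs >= start && s.timestampMs < start + sizeMs)
        val valid    = inWindow.filter(s => s.value > 0 && s.value < 250) // hypothetical validation
        start -> valid.groupBy(_.userId)
      }
    }
}
```

Each resulting per-user group is what the emotional load algorithm would then consume.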
3. Data Consumers
To provide results in realtime, we use Firebase as a realtime database. The pipeline sends the result via a Pub/Sub message, and a Firebase Cloud Function is triggered to save the result to Firestore. Decoupling the pipeline from Firestore with Pub/Sub keeps all writes to Firestore in Firebase Cloud Functions instead of mixing them into the Dataflow steps; it also lets other consumers fetch the realtime data and act on it as they wish. Another use case is BI applications and offline analysis: aggregated results are saved every few minutes to PostgreSQL on Google Cloud SQL using Scio’s JDBC API.
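The aggregation feeding Cloud SQL can be sketched as bucketing results into fixed intervals and averaging the emotional load per user per bucket. The record shape and the five-minute interval are assumptions for illustration; the real pipeline would produce rows like these and write them with Scio’s JDBC support:

```scala
// Hypothetical result record -- field names are illustrative assumptions.
final case class EmotionalLoad(userId: String, timestampMs: Long, load: Double)

object Aggregation {
  // Bucket results into fixed aggregation intervals (e.g. a few minutes) and
  // average the emotional load per user per bucket. Each (userId, bucketStart)
  // entry corresponds to one row written to PostgreSQL.
  def aggregate(results: Seq[EmotionalLoad], intervalMs: Long): Map[(String, Long), Double] =
    results
      .groupBy(r => (r.userId, r.timestampMs / intervalMs * intervalMs))
      .map { case (key, rs) => key -> rs.map(_.load).sum / rs.size }
}
```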
Things were not always that bright and shiny. During development we hit a performance issue that caused a major delay in the pipeline. To find which step was causing the delay, we used the system lag and wall time metrics, which pointed to the validation step. Investigating, we found that the validation code was implemented inefficiently, so we had to reimplement it with extra care for performance: for example, accessing arrays directly by index instead of searching for the value in the whole array, and using tail recursion instead of built-in functions like foldLeft or reduceLeft. Our conclusion from that experience is simple: when you write code for a Dataflow pipeline that needs to process millions of records, you need to think about the complexity of all parts of your code, even trivial things, especially loops over arrays.
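The style of rewrite described above can be sketched as follows. This is not our actual validation code, just a minimal example of the pattern: a tail-recursive loop with direct index access next to the equivalent foldLeft version it would replace:

```scala
import scala.annotation.tailrec

object FastValidation {
  // Tail-recursive scan with direct index access: compiles down to a loop,
  // with no intermediate collections allocated per element.
  @tailrec
  def countInRange(xs: Array[Double], lo: Double, hi: Double, i: Int = 0, acc: Int = 0): Int =
    if (i >= xs.length) acc
    else countInRange(xs, lo, hi, i + 1, if (xs(i) >= lo && xs(i) <= hi) acc + 1 else acc)

  // Equivalent higher-order version for comparison: same result, but it goes
  // through a function value on every element.
  def countInRangeFold(xs: Array[Double], lo: Double, hi: Double): Int =
    xs.foldLeft(0)((acc, x) => if (x >= lo && x <= hi) acc + 1 else acc)
}
```

On a single call the difference is negligible; it only starts to matter when the loop runs inside a step that processes millions of records, which is exactly where our validation step sat.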
This is our Google Cloud Dataflow streaming pipeline for analyzing physiological data into emotional data.