What a Developer Can Do to Get the Whole Team Involved with Data

Soo Min Jeong
Analytics Vidhya
Published in
6 min readMar 10, 2020

--

Teaching SQL and Setting up Zeppelin

Just like any other startup does, my team had to make tens of decisions every day. Quite often they were as trivial as where a button should go, sometimes as enormous as setting up a yearly KPI. When the talks get lukewarm, we could feel that we need a convincing argument based on numbers.

The hidden gems of insights

I collected and analyzed the data for decision-making since I was the only backend developer on the team. While I was sharing a statistical report every week, I witnessed how operators and UI designers had accurate predictions on the result without even looking at it. That led me to believe that those close to the end-users have better instincts on the user data. The following is how I took the belief into implementation for the last couple of months.

What to overcome to share the access to the database

I had to get over three obstacles; to teach non-developers the SQL syntax, to set up a data analytics tool for them, and to find a safe way to open the database not affecting our service.

1) SQL seminar for non-developers

I taught SQL to the Operations team and UI Development team for two hours a week. It took me 2 sessions to cover the basic SQL syntax focusing on SELECT queries and another 3 sessions to practice with real data.

I restored the snapshot of the master database, and we accessed the data with Sequel Pro. I provided with some assignments to solve the problems about the product (i.e. the number of those who joined the app last week, the number of images uploaded by the new users), and solved them all together in the next session.

Sequel Pro offers simple UI and easy installation

1W: the basic SQL syntax I (select, from, where, limit, order by, and, or, not)

2W: the basic SQL syntax II (join, group by, having, count, sum)

3W: the understanding of the database schema

  • I explained where the data sets are located, especially the ones which the Operations team might be interested in.

4W: the understanding of the statistics with SQL

  • I shared the raw queries to yield weekly reports so that they could practice interpreting what they mean.

5W: proving hypotheses

  • I asked for some hypotheses the participants would like to prove with the data from the database. One was curious about the economic impact of being featured on our main page. It turned out to be more than 17 times of revenue in comparison with control groups.

2) Get the most out of teamwork with Apache Zeppelin

Nope not this one
Yup this one with two E’s

We enjoyed the interactive data analytics with Apache Zeppelin. Here are why I chose it over anything else:

  1. DRY with queries

Some of the main indexes (i.e. newly-joined social action, revenue, etc) are processed under the same query but only with a different date. I eliminated the redundant work by placing the variables on the top so that the teammates could easily adjust them.

The convenient markdown on Zeppelin

2. Collecting a great variety of perspectives on the same dataset

Saving queries means saving the perspectives on the data. When we encounter similar problems consecutively, the previous solutions are robust references for later on.

For example, we needed a key index to evaluate our project on launching the ‘cart’ feature. I went through some of the queries used to evaluate the ‘market’ feature so far, and I could come up with ‘the number of users who bothered to buy several contents within a minute (due to the absence of cart)’ by compounding the indexes there.

3. Visualization

No code to be written for simple charts

The reason why I chose Zeppelin over any other products with a notebook (i.e. Jupyter Notebook) was the visualization. A simple chart offers you an upfront sense of the data flow.

The queries saved on the notebook show the team how one tried to solve or finally solved the problem. Sharing them on a notebook will boost the whole team to learn from each other.

3) Last but not least: Applying the project on a production level

  1. Restore a replica database with a snapshot

The biggest challenge was to figure out how to let non-developers access the database without losing any data or affecting the on-going service. I considered offering the read-only accounts in the first place. However, it cannot prevent the complicated queries from hindering the database ending up with getting the service slower. So, I simply restored a database from a snapshot only for data analytics.

Since it may hold some sensitive data, I restricted the access by registering the security group with the IP address of the office. Also, I set the configuration to go through authentication by default.

Zeppelin accesses only the replica not affecting the main database

2. Automate restoring a database with a snapshot

At first, I automated two tasks with AWS Lambda; setting up a replica database with the snapshot and starting an EC2 instance with Zeppelin.

Lambda with botocore restores a replica DB with a snapshot and start an EC2 for Zeppelin

3. Improving the cost-efficiency by running them only in the working hours

Running another server stack for data analytics adds up some cost. To reduce it, I scheduled starting the servers every morning and shutting them down in the evening with Cloudwatch and Lambda. This way we could run them only with 25% of the cost of 24/7 server hosting.

Every morning Lambda Sets up the Data Analytics Stack

When the domain changes due to running a new instance every morning, a Telegram bot sends the new public domain.

Telegram bot shares a new public domain

Lastly, when the working hours are over, the Lambda deletes the database and shuts down the instance.

Do not stay late, or you will lose them

Expectations Are Never Too High

  1. Quick notice on abnormality

Though the data in a snapshot is not in realtime, it is never worse than staying completely out of touch with the database. Now our team has a new crew to detect something different except for developers.

2. Sharing the process of solving a problem

We are running several products selling digital assets under the same mission for creators. The team on another product would likely bump into similar problems repeatedly. Collecting the methodology to solve the problem will be a precious heritage to look up.

3. Data-driven Decision Making

Data is a great first step when making a decision. Sometimes our instinct and experience may be a breakthrough but we all know they can be misleading. Data can show us why we should do it, how we can evaluate our achievement, and where we currently are. It is a fair standard regardless of the hierarchy or professional position.

One of the UI developers who took the SQL seminar collected data and processed it to bring a good reason why we should encourage users to report any inappropriate content on our app. I wish I could note such a clear argument more frequently on my team.

--

--