Email to Bassel telling him about my internship and thesis during July 2019


The internship is really enriching and enjoyable, for the following reasons:

  • Novel: I'm learning new technologies and touching new stuff every day. So far I've learned: how to build a deep learning model with Keras, how to build a deep learning pipeline (feeding the data into the model), a little bit about Bayesian inference, multi-armed bandits, Thompson sampling, databases (relational algebra/relational design theory/SQL), and currently starting on Scala/functional programming/Apache Spark (distributed computing). This brings me to my next point, which is...
  • Self-paced: Being the only data scientist on the team, I'm not "blocked" by other people. How fast I can make progress depends solely on me: the faster I learn, the faster I go. I learn things quite quickly, so I have been pushing myself incredibly hard to learn as much and as quickly as I can.
  • Challenging: The projects are interesting and challenging. More on this later.
  • Open-ended: I have enormous latitude to tackle the problems the way I see fit.
  • Holistic: I have to plan the "grand strategy" of how everything should fit together (more on this later), which I find very rewarding.
  • No bullshit work: I haven't been given any administrative or accounting or "get me coffee" tasks.
  • Very short commute: It takes me about 10 minutes to get to the office from my bed.

My job role is "data scientist", but in reality I do a bit of software engineering as well. First, a little bit about the company. It's a small B2B startup (<20 people) in the field of auto insurance. They sell white-label apps to insurers. The app tracks their customers' driving patterns and can identify low- and high-risk drivers: it records one GPS lat/long reading per second on every trip. This is obviously very attractive to insurers since, as you know, it helps to bridge the gap of asymmetric information. Their key moat is this thing called the "Driver Profiler", a very complicated rule-based system that converts the raw GPS coordinates into a "driver safety score". Basically, it works as follows: from the raw GPS data you can get velocity, acceleration, turning, braking, etc. You can also do a database lookup of the coordinates to find the speed limit of the road the car is on, to check whether the driver is speeding. Mix this all up, apply some "magic", and you get a driver safety score.

There are three main projects that I'm working on at the moment:

  • SMS intervention "pipeloop" to triple active users of a client company's app
  • Deep learning driver profiler
  • Big data analysis with 5 million trips (6 billion (!) lat/long points)

SMS intervention pipeloop

One of the company's clients is Thailand's 2nd largest motor insurer. They've got about 4000 people who bought a policy which gives them 15% cashback if they use the app to log their driving. The problem is that people who buy the policy are not installing the app, and even after they install it they don't really use it to record their driving. See the graph below: out of 2005 customers, only 922 installed the app.

The insurer's CEO gave my CEO an ultimatum: greatly increase the number of active drivers on the app by the end of September 2019, or have our contract with them terminated. This would be a significant loss to our startup.

So I identified the problems and wrote an intervention plan. I also came up with some quantitative metrics by which to judge the success of the intervention.

My CEO Richard suggested that we send SMSes to encourage people to install the app and record trips actively. Immediately my mind jumped to behavioural economics. There was a hypothesis I wanted to test: that SMS messages with a loss-averse framing would be more effective than SMS messages with a positive or neutral framing. Here's how I went about it: I randomly sent people one of three different SMSes: neutral, positive, or negative. At the end of the message there's a personalised link: "Click here to install:<SMS_ID>". Clicking that link brings you to my server, which logs the SMS ID and redirects you to the App Store.

Each SMS either gets clicked or it doesn't, so the outcome of each treatment is a Bernoulli trial with parameter p. In N independent trials, the number of clicks therefore follows a binomial distribution B(N,p).

Suppose we conduct N trials and observe α successes and β failures. Starting from a uniform prior, the posterior for p is the beta distribution Beta(α+1,β+1), which gives us the range of "reasonable estimates" of the binomial parameter p.
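To make the update concrete, here's a minimal Python sketch. The function names and the click counts (5 clicks out of 100 SMSes) are made up purely for illustration; they are not the real results.

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior Beta parameters for p, starting from a Beta(prior_a, prior_b) prior.

    The default Beta(1, 1) prior is the uniform prior.
    """
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution; it shrinks as more data comes in."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


# Hypothetical batch: 100 SMSes sent, 5 clicks, 95 non-clicks.
a, b = beta_posterior(successes=5, failures=95)  # Beta(6, 96)
print(a, b, beta_mean(a, b), beta_variance(a, b))
```

With ten times the data at the same click rate (50 clicks, 950 non-clicks) the mean barely moves but the variance drops, which is why the curves get narrower as more SMSes go out.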

This is where the multi-armed bandit problem I was asking you about came into play. In the traditional algorithms to solve MABs, you pull one lever at a time, check its efficacy, and then update your knowledge about the distributions of the arms. Here you can't really do that because time is short---there are 2000 SMSes to send out! That's why you have to pull N arms at a time.

Here's a simplified picture of the architecture. Basically every day I check the server for clicks, then update my database (the blue cylinder labelled "PostgreSQL"). I do some data analysis on the relative performance of each SMS, then feed that into my Thompson sampling algorithm (the slot machine labelled "MAB"), and finally send the SMSes out to the next batch of customers. I call it a "pipeloop" because it loops onto itself --- the results of the previous day's SMSes are used to generate today's SMSes.
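Roughly, one turn of the loop looks like the sketch below. This is a simplified illustration rather than my production code: the function names, the batch size and the click counts are all hypothetical, and the database and SMS gateway are left out entirely. For every SMS in the batch it draws one sample from each variant's Beta posterior and sends the variant with the highest draw.

```python
import random


def allocate_batch(counts, batch_size, rng):
    """Decide how many SMSes of each variant to send in the next batch.

    counts maps variant -> [clicks, non_clicks] observed so far.
    Each slot in the batch gets the variant whose Beta(clicks + 1, non_clicks + 1)
    draw is highest (Thompson sampling with a uniform prior).
    """
    allocation = {variant: 0 for variant in counts}
    for _ in range(batch_size):
        draws = {
            variant: rng.betavariate(clicks + 1, non_clicks + 1)
            for variant, (clicks, non_clicks) in counts.items()
        }
        best = max(draws, key=draws.get)
        allocation[best] += 1
    return allocation


def update_counts(counts, sent, clicks):
    """Fold one day's results back into the running counts (the 'loop' part)."""
    for variant in counts:
        counts[variant][0] += clicks[variant]
        counts[variant][1] += sent[variant] - clicks[variant]


# Hypothetical state after an even first batch of 100/100/100.
counts = {"neutral": [8, 92], "positive": [3, 97], "negative": [7, 93]}
rng = random.Random(2019)
todays_plan = allocate_batch(counts, batch_size=300, rng=rng)
print(todays_plan)
```

Because the whole batch is allocated from the same posteriors, the allocation is only as fresh as yesterday's clicks; that is the price of pulling many arms at once.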

SMK SMS intervention architecture.png

Anyway, I just finished sending the last batch of SMSes on Friday. Just for fun, here is a subset of the results. In the following graphs I've plotted beta distributions of each SMS variant. I start with an uninformative (uniform) prior.


After sending the first batch of 300 SMSes (100/100/100), the beta distributions update. Red is neutral, green is positive, blue is negative.

[Two images: beta distribution plots]

They keep updating as more SMSes are sent. The Thompson sampling algorithm makes sure we send more of the SMS variant we think is best, essentially maximising our response rate. You can see in the following picture that the red distribution is very tight: that's because we sent a lot of them, so our uncertainty about its actual probability is small.


Results: sadly, I did not get the results I wanted. There seems to be no appreciable difference between the neutral (red) and negative (blue) SMSes, although both did better than the positive (green) SMSes. My hypothesis is that the green SMS sounded a bit too spammy/marketing, and people are used to ignoring those.

This is not the end of the world, as I have/am going to send way more SMSes, not just this round. For instance reminding "lapsed" users to start recording trips regularly again, or just kicking these never-installers again with more strongly-worded messages.

What I learned: I learned a lot from this project. First of all, I had to learn about databases and SQL to figure out how to create a database, identify the customers to send the SMS to, and how to record customers' clicks into the database. Second, it was a real high-level exercise to plan out the architecture, how everything would fit together, and then getting my hands dirty building all the systems. Lastly, I had to learn about probability and statistics and bandits and algorithms for bandits, most of which went over my head, but I find the idea of Bayesian inference super fascinating.

Value to company: if the intervention succeeds, then I will have directly caused the company to keep the patronage of a large client of theirs. (We charge the insurer a flat fee per active user per month, so even if they cancel our contract at the end of September, we still get some money from users who have been persuaded to install the app and use it. 25 users have installed the app because of the SMSes, so that's about 100 pounds in revenue already, assuming they stay active for the 2 months.)

Deep learning driver profiler

I am trying to use a neural network to approximate the score given by the Driver Profiler. Currently, the device sends raw GPS data to our servers at the end of a trip; the servers crunch the numbers, do a lot of database lookups, and return a trip score to the device. Because of network latency and DB lookups, this process takes about 4-7 seconds per trip.

Using a neural network has two key advantages over the status quo. First, if we don't need to do expensive database lookups, we can do much faster inference. Second, and more importantly, it enables on-device inference: we could possibly deploy the deep learning model onto the phone, which could then generate a trip score on its own and send the trip data to the server at a quiet time.

In fact, if the deep learning model is fast enough, we could even do real-time, on-device inference: at every time step (or maybe every 10 seconds), calculate a rolling trip score. This would give drivers real-time feedback on their driving, and would be a huge step up from the current model.

My current progress: The training and test losses are still quite high, although I've significantly outperformed a naive baseline. I need to get more data, I think. Right now, I'm training on 50,000 trips, but we have 5 million trips available. Iteration is a bit slow, because it takes about 12 hours to train a model. So there are 12 hours of latency between my having an idea and seeing how effective it is. This significantly bogs me down.

What I learned: I already knew the theory of deep learning, but there was still a lot to learn about wrangling the data into the proper format and building something called a Keras Sequence to feed the data into the deep learning model.
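For a flavour of what that involves, here's a stripped-down sketch of the idea. It is a plain-Python stand-in, not my actual class: a real Keras Sequence subclasses keras.utils.Sequence and returns NumPy arrays, but it follows the same protocol shown here (`__len__` gives the number of batches, `__getitem__` returns one batch). The class name, the zero-padding scheme and the toy data are assumptions for illustration.

```python
import math


class TripBatches:
    """Serves (inputs, targets) batches of trips, padded to equal length.

    trips:  list of trips, each a list of (lat, lon) points of varying length
    scores: one Driver Profiler score per trip (the regression target)
    """

    def __init__(self, trips, scores, batch_size=32, pad_value=(0.0, 0.0)):
        self.trips = trips
        self.scores = scores
        self.batch_size = batch_size
        self.pad_value = pad_value

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.trips) / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        batch_trips = self.trips[start:end]
        batch_scores = self.scores[start:end]
        # Pad every trip to the length of the longest trip in this batch.
        longest = max(len(trip) for trip in batch_trips)
        padded = [
            list(trip) + [self.pad_value] * (longest - len(trip))
            for trip in batch_trips
        ]
        return padded, batch_scores


# Toy data: three very short "trips" with made-up scores.
toy_trips = [[(13.75, 100.50)], [(13.75, 100.50), (13.76, 100.51)], [(13.70, 100.49)]]
toy_scores = [0.9, 0.7, 0.8]
batches = TripBatches(toy_trips, toy_scores, batch_size=2)
inputs, targets = batches[0]
```

The point of loading batch by batch is that the full trip set never has to sit in memory at once.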

Value to company: Servers are expensive to run, and decreasing computation or offloading it onto the device would save the company (a lot of) money. It'd also give them some sexy marketing material ("We use deep learning AI to score trips!").

Big Data analysis on 6 billion data points

This is the most nebulous one. The idea is that we have 5 million trips recorded: if each trip is on average 20 minutes long (1200 seconds, at one GPS point per second), then we have 6 billion data points. Doing data analysis on this on a single machine is impossible because 5 million trips don't even fit on the disk of a single machine, much less in its memory.

But the insights are much richer. With 5 million trips we can do things like calculate traffic speeds/traffic flow on every road at every time of day. We can see which roads are most accident-prone. To be honest, Richard hasn't really been clear about what sort of Big Data queries he wants to run, but whatever it may be, I need to build a distributed infrastructure that enables the queries.

Literally just started this third big project yesterday, so not much to say here, but I'm learning Scala and Apache Spark as fast as I can, as I only have five weeks left in this internship.

What I'm learning: functional programming, distributed computing, Scala, Apache Spark, ... I'm sure I will have to learn more stuff as I go.

Value to company: enable hitherto-impossible data queries ... new insights ... I'm sure Richard has plans for this.


First, the good news: Andy helped me get in touch with Jonathan Rodden and his co-authors, and we're exploring the possibility of collaboration. I'm very glad this is happening! The problem is that they're very slow with their email replies, so I'm kind of blocked waiting for them.

I'm currently working on two possible extensions to Rodden and co's measure of political dislocation. PD is a voter-level metric, measured as (share of Democrats among the k nearest neighbours of voter X) - (share of Democrats in the electoral district of voter X). I identified that this PD metric is sensitive to both geography and voter homophily, and have written two (very short! < 100 words) Google Docs that explain the problem (here) and (here).
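To pin down what I mean by the metric, here's a toy brute-force sketch. I'm reading PD as a difference of Democratic shares (fractions rather than raw counts, which is my assumption about the normalisation), and I exclude the voter from their own neighbourhood, which is also an assumption. The function name and the four-voter example are made up; this is not Rodden and co's code.

```python
import math


def partisan_dislocation(voters, k):
    """PD score per voter.

    voters: list of (x, y, is_democrat, district), with is_democrat as 1 or 0.
    PD = (Democratic share among the k nearest other voters)
         - (Democratic share in the voter's district).
    """
    # Democratic count and total size for each district.
    district_totals = {}
    for _, _, is_dem, district in voters:
        totals = district_totals.setdefault(district, [0, 0])
        totals[0] += is_dem
        totals[1] += 1

    scores = []
    for i, (x, y, _, district) in enumerate(voters):
        # Distances to every other voter (brute force, fine for toy data).
        others = [
            (math.hypot(x - ox, y - oy), other_dem)
            for j, (ox, oy, other_dem, _) in enumerate(voters)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        nearest = others[:k]
        knn_share = sum(dem for _, dem in nearest) / len(nearest)
        dems, size = district_totals[district]
        scores.append(knn_share - dems / size)
    return scores


# Two Democrats on the left, two Republicans far to the right, one shared district.
toy = [(0, 0, 1, "A"), (1, 0, 1, "A"), (10, 0, 0, "A"), (11, 0, 0, "A")]
print(partisan_dislocation(toy, k=1))  # [0.5, 0.5, -0.5, -0.5]
```

Even in this tiny example every voter is "dislocated" purely because the two groups are perfectly clustered and share a district, which is the homophily sensitivity I'm worried about.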

I think I have a solution for the first problem (geography). And I know this sounds very hubristic, but I think the voter homophily one is a fatal flaw for their political dislocation measure. You know we were working hard on how voter homophily affects seat share for like a month back in Trinity? It seems like we are circling back to it again! Basically, I think whether or not PD is high depends on the district size, the number of clusters, and the degree of homophily. And this means that the PD metric would give false positives if we don't adjust for all these things. Worse still, I could construct several toy examples where a non-gerrymandered plan actually had a higher PD score than a deliberately gerrymandered district plan.

The intuition is as follows:

Consider a "donut-shaped", perfectly homophilous distribution, with 1000 blues in the middle and 1000 reds surrounding them. Let's coin a term, "gerrymanderability": given someone who wants to gerrymander and draws the districts "optimally", how many districts could they win?

It's all about being able to draw districts that win by a hair. If districts are small compared to cluster size, then gerrymanderers will have their hands tied.

How district magnitude affects gerrymanderability

The size of the district affects gerrymanderability. Imagine the district size were 10. Then gerrymanderability would be low, because lots of inefficient districts would be drawn in the middle of the donut (all blue) and at the outer rim (all red). Only where red meets blue would gerrymandering be possible.

But if the size of a district increases, then gerrymanderability starts to increase. You can start to draw districts that include reds: think of a very long, thin strip that emanates from the center and picks up the reds.

And in the degenerate case where there is only one national district, then obviously gerrymanderability would be 0 as well. So I think there is an inverted-U-shaped relationship between magnitude and gerrymanderability.

How nature of homophily (number/location of clusters) affects gerrymanderability

Again, a similar concept here. With the same value of homophily, one can have one big central cluster, or multiple smaller clusters. Given the same district magnitude, smaller clusters mean more gerrymanderability, for the reason outlined above.

In conclusion, I think there is some sort of super complicated interaction term between district magnitude, type of clustering, and homophily, which fundamentally fucks the PD metric (I think). I want to bring this up and I really pray to God the smart postdocs and Rodden can find a solution or show how it's irrelevant. And hopefully they will involve me in it...

(The following is somewhat tangential)

You know we were talking about how to measure homophily --- I asked this question on CrossValidated and got an interesting answer. One commenter suggested the coefficient of overlapping, or OVL --- basically the degree of overlap between two distributions. With perfect homophily (Ds and Rs never mix), OVL would be 0, and with perfect homogeneity, OVL would be 1.

How I guess this would work in practice: given the observed geographical distribution of Democrats and Republicans, I'd use nonparametric estimation to generate the distribution of Ds and Rs. (These would be two-dimensional distributions). Then I'd superpose one distribution over another and integrate to obtain OVL.
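As a first pass, the same idea can be sketched with a crude grid histogram standing in for a proper nonparametric density estimate (that swap is my simplification, and the function name, cell size and toy points are made up). OVL is then the sum over grid cells of the smaller of the two groups' shares in that cell.

```python
from collections import Counter


def overlap_coefficient(points_a, points_b, cell_size):
    """Histogram estimate of OVL between two 2D point sets.

    Each point is an (x, y) location. Space is cut into square cells of side
    cell_size; OVL is the sum over cells of min(share of A, share of B).
    """

    def cell_shares(points):
        # Fraction of the group's points that fall in each grid cell.
        counts = Counter(
            (int(x // cell_size), int(y // cell_size)) for x, y in points
        )
        total = len(points)
        return {cell: count / total for cell, count in counts.items()}

    shares_a = cell_shares(points_a)
    shares_b = cell_shares(points_b)
    # Cells that only one group occupies contribute nothing to the overlap.
    return sum(
        min(share, shares_b.get(cell, 0.0)) for cell, share in shares_a.items()
    )


# Toy check: completely separated groups versus perfectly mixed groups.
democrats = [(0.5, 0.5), (1.5, 0.5)]
republicans_far = [(10.5, 10.5), (11.5, 10.5)]
print(overlap_coefficient(democrats, republicans_far, cell_size=1.0))  # 0.0
print(overlap_coefficient(democrats, democrats, cell_size=1.0))  # 1.0
```

The catch is that the answer depends on the cell size (just as a kernel estimate depends on the bandwidth), so the scale at which homophily is measured has to be chosen deliberately.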

Rodden promised to reply 8 days ago, so I'm keen to hear his thoughts.

Hello all: I care deeply about all of this and have thought a lot about it. And I’ve been in the courtroom a good deal, so I have thoughts on that part as well. I am having a week from hell, and am about to leave for vacation, but will send some thoughts as soon as I can. But in the short term, I am enjoying reading your thoughts! Thanks for all of this.