Improving Diner Listings Using Machine Learning

Harsh Gadhiya
Dineout Tech
Jun 3, 2022

I landed my first job as a product analyst at Dineout in September 2021. Since then, I have had the opportunity to work on diverse projects, from descriptive analytics, such as analysing user journey funnels, to machine learning. Today, I would like to walk readers through the first machine learning project I undertook here, which is most probably also the first full-fledged ML project implemented at Dineout.

We’ll use the following roadmap to go through the project.

Project roadmap

Problem statement

Making the app more customer-centric by personalising the listings for diners based on their past data.

The first project that came my way was regarding the personalisation of the app. We have a lot of data, and we want to leverage it to offer a better experience to diners. A personalised restaurant listing nudges them to explore more restaurants, which means there is a higher chance that a diner's monetary contribution to the GMV (Gross Merchandise Value) increases.

We are in unsupervised learning territory, using clustering algorithms to achieve our goal of bucketing similar restaurants and creating user personas.

Variable selection

Now that we have a problem statement in place, it is time to identify the data points (or variables) that would provide us with enough information about the restaurants and diners, so that we can bucket them as per their characteristics.

After carefully analysing the data, we decided to include the following variables in our modelling process.

Restaurant-level Data

  1. Cuisines: Cuisines were of course the first variable to look at. The number of cuisines in our system was around 107, but we reduced the number of cuisines by mapping various cuisines to one parent cuisine. For example, Japanese cuisine, Korean cuisine and Chinese cuisine all came under Asian cuisine. This activity reduced the number of cuisines to 70. Based on descriptive analysis, only the top 20 cuisines were considered, and all the remaining cuisines were categorised as “Others”.
  2. Establishment type: After cuisine, a very important variable is the establishment type, which tells us what kind of restaurant it is: a bar, QSR, casual dining, fine dining, and so on. Just like with cuisines, the top occurring tags were kept, and the rest were termed "Others".
  3. Localities: Then we selected localities in which the restaurants are located. Diners generally tend to have affinities towards certain localities, ones which may be close to their house, for instance. In this case, too, the top occurring localities were taken, and the rest were termed as “Others”.
  4. Cost for two (CFT): The next variable considered was the cost for two, which tells us what the average spend at a restaurant would be if two people were to dine there. It serves as a good proxy for the spending appetite of diners. The bill amount, by contrast, depends on the number of diners and the occasion, and can be misleading at times.

The main reason for taking only the top occurring tags, cuisines, and localities was to keep the number of dimensions in check. If all of them had been considered, then the number of dimensions after one hot encoding would have exploded.
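As a minimal sketch of this top-N-plus-"Others" bucketing followed by one-hot encoding, here is a pandas version with a made-up cuisine column (real restaurants can carry multiple cuisines; `bucket_top_n` and the sample data are hypothetical, not our production code):

```python
import pandas as pd

def bucket_top_n(series: pd.Series, n: int, other_label: str = "Others") -> pd.Series:
    """Keep the n most frequent categories and map everything else to one bucket."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Hypothetical frame: one cuisine per restaurant for simplicity.
restaurants = pd.DataFrame({
    "cuisine": ["North Indian", "Asian", "Fast Food", "Mughlai", "Asian",
                "North Indian", "Lebanese", "Fast Food", "North Indian"],
})

# Keep the top 3 cuisines here (top 20 in the real data), then one-hot encode.
restaurants["cuisine"] = bucket_top_n(restaurants["cuisine"], n=3)
features = pd.get_dummies(restaurants, columns=["cuisine"])
```

With the top-3 cap, the rare cuisines (Mughlai, Lebanese) collapse into a single `cuisine_Others` column, so the dimensionality stays fixed no matter how long the tail of cuisines grows.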

Diner level Data

After considering the restaurant level variables, we turned to the diner level data and variables to be incorporated into the model.

  1. Bookings and Transactions: We had the past data of diners' bookings and transactions, which gave us information about which restaurants they prefer and their spending habits.
  2. RDP views: Another very important source of information is the RDP (restaurant detail page) views. There are a lot of diners that come on the app, just look at the restaurants, and leave. They do not make any transactions or bookings. But in this case, the restaurants that they viewed can give us a little info about what kind of restaurants they might like. It is not as strong an indicator as a transaction or a booking, but it can give us a fair idea.
  3. Membership status: Then another variable taken into account was their membership status, whether they were DO pay members or not. In general, we found that DO pay members (the ones who had purchased DineOut passport memberships), exhibited a different kind of behaviour as compared to the non-members, which isn’t a head-scratcher. So whether a diner is a member or not, can be a valuable point.
  4. Time preference: One derived variable we used for the diners was which time of the week they preferred to go out, meaning whether the diner went out more often during the weekdays or weekends.
  5. Number of Occurrences: This counts the bookings, transactions, and RDP views a given diner has performed, which basically tells us how often the diner engages with the app.

Approach

The clustering problem was broken down into two parts, restaurant clustering and diner clustering.

So the basic thought process was as follows: cluster the restaurants of a city together, so that restaurants with similar characteristics fall in the same buckets. Then cluster the diners based on their affinities towards these restaurant clusters.

One important decision here was to perform the clustering separately for each city. The main reason was that diner preferences differ from city to city. For example, diners in cities like Chennai were more into South Indian cuisine, whereas moving towards the north, cuisines like Fast Food, North Indian and Mughlai became more and more prominent. Hence, we decided it would be more sensible to perform this activity city-wise.

Methodology

So, if I were to put the process down step by step in simple words, it would be:

  1. Restaurant clustering: Using the restaurant-level variables, i.e., the cuisines, the establishment type, the CFT, and others, the restaurants were clustered using the K-means algorithm (PCA and certain feature engineering steps were performed to improve the silhouette score, the metric used to gauge how good the clustering is). So let's say there are 5 different restaurant clusters in Delhi. The 1st cluster mainly contains restaurants that are Bars and Fine Dining, with an average CFT of 2300 (well, these are fancy restaurants!). The 2nd cluster contains QSR and fast-food restaurants, with an average CFT of 800, and so on. Every restaurant falls into one of these clusters based on its characteristics.
  2. Diner clustering: Now that we have restaurant clusters, we use them as an input for the diner clustering, as a proxy for past bookings, transactions and RDP views, along with the membership column, the number of occurrences, and the time preference. For example, let's say a diner has made just 2 transactions on the app, both in bars. And I happen to know that all bars lie in restaurant cluster 5. So for this diner, a very high weightage will be given to restaurant cluster 5, and the other restaurant clusters will get 0 weightage (the cell values of each row are normalised, so they add up to 1). Had this diner also transacted or booked in restaurants of other clusters, the weightage would have been assigned proportionately.
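As a sketch of the restaurant-clustering step, the scale-reduce-cluster pipeline might look like this in scikit-learn. The feature matrix here is random stand-in data, and the dimensions and cluster count are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical feature matrix for one city's restaurants: one-hot cuisine,
# establishment-type and locality columns plus a CFT column.
X = rng.random((300, 25))

# Scale, reduce dimensionality with PCA, then cluster with K-means.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10, random_state=42).fit_transform(X_scaled)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_reduced)

# Silhouette score in [-1, 1]; higher means tighter, better-separated clusters.
score = silhouette_score(X_reduced, kmeans.labels_)
```

Running PCA before K-means trims the noisy, correlated one-hot dimensions, which is one common way to lift the silhouette score.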

Using these parameters, the K-means algorithm would help us cluster diners with similar tastes together, giving us the affinities of each diner cluster towards certain restaurant clusters.
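The affinity weightages described above reduce to a simple row normalisation. A minimal sketch with hypothetical interaction counts:

```python
import numpy as np

# Hypothetical counts of bookings/transactions/RDP views per diner (rows)
# across five restaurant clusters (columns).
counts = np.array([
    [0.0, 0.0, 0.0, 0.0, 2.0],  # two transactions, both in cluster-5 bars
    [1.0, 3.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 2.0, 0.0, 4.0],
])

# Normalise each row so the weightages add up to 1.
affinity = counts / counts.sum(axis=1, keepdims=True)
```

The first diner ends up with full weightage on cluster 5 and zero elsewhere, matching the bar-loving example above; each row of `affinity` then becomes a feature vector for the diner-level K-means.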

Let’s take a sample chart below to understand this better.

In the above picture, the row index shows the diner clusters and the column index shows the restaurant clusters, the ones we got in step 1. I can look at this diagram and say that diners in diner cluster 1 have a very high affinity towards restaurants in cluster number 5 (you can see the value is 0.8967), and not much affinity towards the rest. Now cluster-5 restaurants happen to be bars, so I know that the diners in this cluster prefer bars, and we would suggest bars to these diners.

In this way, we are able to create user personas for the diners, and we can get to understand the diners better.

Based on your activity carousel

Using the restaurant and diner clusters that we got above, we introduced a new carousel on the homepage, the “Based on your activity” carousel in January 2022.

If there is a diner whose data we already have in our database, the diner is clustered in one of the buckets. Based on the affinity that the diner has towards the restaurant clusters, the restaurants are recommended to them in the carousel.

In order to monitor the performance of the carousel, the main metric that we are looking at is the CTR (or the click-through rate). The CTR is basically given by (clicks/impressions) x 100. The carousel has been performing well on this front. In the months of Feb and March, this was the top-performing carousel, with the highest CTR, above the “Restaurants near you” carousel by 1.46 percentage points!
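For reference, the CTR formula above as a tiny helper (the example numbers are made up, not our actual figures):

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a percentage: (clicks / impressions) x 100."""
    if impressions == 0:
        return 0.0
    return clicks / impressions * 100

# e.g. a carousel shown 40,000 times and clicked 1,200 times
rate = ctr(1200, 40000)  # 3.0
```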

Another metric we're looking at is the transaction funnel, starting right from the carousel interaction to the payment. The catch here is that the diner might not make the payment in the very same session as their carousel interaction. So we consider 3-day and 7-day conversion funnels: if a diner who interacted with the carousel and looked at a certain restaurant transacts at that restaurant within the next 3 or 7 days, they are counted as a converted diner. On this metric too, the carousel has been performing increasingly well, with both the 3-day and 7-day conversions up by around 2 percentage points.
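A minimal sketch of the windowed conversion check, using hypothetical diners and dates (and ignoring the restaurant-level match for brevity; the real funnel would also join on the restaurant):

```python
import pandas as pd

# Hypothetical carousel interactions and payments.
interactions = pd.DataFrame({
    "diner_id": [1, 2, 3],
    "interacted_at": pd.to_datetime(["2022-02-01", "2022-02-01", "2022-02-01"]),
})
transactions = pd.DataFrame({
    "diner_id": [1, 2],
    "paid_at": pd.to_datetime(["2022-02-03", "2022-02-06"]),
})

funnel = interactions.merge(transactions, on="diner_id", how="left")
days = (funnel["paid_at"] - funnel["interacted_at"]).dt.days

# A diner converts if the payment lands within the window after interacting.
funnel["converted_3d"] = days.between(0, 3)
funnel["converted_7d"] = days.between(0, 7)
```

Here diner 1 (paid after 2 days) converts in both windows, diner 2 (paid after 5 days) only in the 7-day window, and diner 3 (no payment) in neither.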

Currently, we have implemented the carousel in Delhi, Bangalore, Mumbai, and Hyderabad. We are planning to roll it out for the other cities as well once we have good results for these, and it is sufficiently enhanced.

Next Steps

This carousel is performing decently, but there is still a lot more we can do to improve it. We are planning to add other data points such as favourites, search bar usage and rating data to enhance the model, and I will keep you posted on the progress we make. I hope you enjoyed the read.
