Sql puzzle: Returning Customers

It's the end of the month, and your store manager comes to you with a request. The store ran a special promotion in the third week of September with the idea of luring back existing customers who hadn't shopped in a while. He has all the sales data for the month, but he can't tell which customers had slipped and only came back during the promotion period. He asks for your help.

"We ran the promotion from Sept 22-28. I'd like to know - which customers came in sometime during the first two weeks of the month, didn't come in at all during the third week, but returned during our promo period."

Let's throw together some data and see how to solve this sort of problem. We'll use the month of Sept 2013, say that the promo period was the 22-28th. The "pause" is the 15th-21st; an existing customer is one who came in BEFORE the 15th, but NOT during the "pause".

CREATE TABLE `visits` (
  `customer_id` int(11) NOT NULL,
  `visit_date` date NOT NULL,
  PRIMARY KEY (`id`)
) ;

INSERT INTO `visits` (`id`, `customer_id`, `visit_date`) VALUES
(1, 1, '2013-09-03'),
(2, 1, '2013-09-11'),
(3, 1, '2013-09-25'),
(4, 2, '2013-09-10'),
(5, 2, '2013-09-18'),
(6, 2, '2013-09-25'),
(7, 3, '2013-09-26');

Our first inclination to solve this is probably to try to come up with some sort of WHERE clause that looks for a customer_id where the visit_date is BETWEEN Sept 22 and 28 AND NOT BETWEEN Sept 15 AND 21 AND is less than Sept 15. Well, obviously that can't work for a single record, so we want to somehow GROUP BY customer_id and then apply the filters. But how do we apply 3 different filters to the same group of records simultaneously?

HAVING might jump to mind, but HAVING only works on the grouped result rows, not within the group. The rule of thumb for WHERE vs. HAVING is:

- first WHERE filters out rows
- then GROUP BY aggregates them
- then HAVING filters the aggregated rows

So we need a different solution. Three filters means three queries, roughly speaking. UNIONS? Get the previous cutomers, UNION to the last week return customers, subtract out any who came in during the pause...No, not going to work out.

Let's back up a moment. We want our customer to be in two different time periods... two different result sets...When we have two different sets of data, how do we find out what the matching data is? With a JOIN, of course! This is just like finding the customer_id that exists in two different tables.

Now we're making progress! Let's look at a simple query for the return week and the pre-pause times:

SELECT DISTINCT v.customer_id 
FROM `visits` v
WHERE v.`visit_date` BETWEEN '2013-09-22' AND '2013-09-28';

SELECT DISTINCT v.customer_id 
FROM `visits` v
WHERE v.`visit_date` < '2013-09-15';

Okay, so we could wrap those both in table aliases and join them, but we are still left with how to ignore the results that also came in during the pause. So we need something like this pseudo code:


SELECT customer_id
	(SELECT DISTINCT v.customer_id 
	FROM `visits` v,
	WHERE v.`visit_date` BETWEEN '2013-09-22' AND '2013-09-28') as returnWeek 
JOIN (SELECT DISTINCT v.customer_id 
	FROM `visits` v
	WHERE v.`visit_date` < '2013-09-15') as previousCustomer ON (previousCustomer.customer_id = returnWeek.customer_id)
WHERE customer_id NOT IN 
	(SELECT v.customer_id 
	FROM `visits` v,
	WHERE v.`visit_date` BETWEEN '2013-09-15' AND '2013-09-22')	

I suppose that would probably work, but I don't really see it scaling well (and I'm personally not a big fan of NOT IN when it can be helped). However, we are on the right track. If we could just get rid of that NOT IN clause...

What if we think of our second query as a table of ALL rows before the return week (ie, before Sept 22) and then filtered that to get our subset that excludes the pause? This would mean that inside this "table", a customer could have many visits, but the latest one must be BEFORE Sept 15. Something like:

SELECT `customer_id` 
FROM `visits` 
WHERE `visit_date` < '2013-09-22' 
GROUP BY `customer_id`
HAVING MAX(`visit_date`) < DATE_ADD('2013-09-22', INTERVAL -7 DAY);

We've done it! This will now exclude all the "pause week" customers, so we can just join this to our first query to see who is included in both:

SELECT DISTINCT v.customer_id 
FROM `visits` v
JOIN (	SELECT `customer_id` 
		FROM `visits` 
		WHERE `visit_date` < '2013-09-22' 
		GROUP BY `customer_id`
		HAVING MAX(`visit_date`) < DATE_ADD('2013-09-22',INTERVAL -7 DAY)) as PreviousData ON (PreviousData.customer_id = v.customer_id)
WHERE v.`visit_date` BETWEEN '2013-09-22' AND '2013-09-28';

There you have it! Took some thinking, but we got there.

Final Words: While the solution was fun to track down, in the end it may not work for us. Our tables may be too big, or our servers too small, or other issues may crop up. The point of doing exercises like this is not to simply memorize a bunch of solutions to apply in certain cases, but to teach ourselves how to think about how the data is stored and formed and queried, so when we come across something complex we know what to try.

Hope that helped!

Jeff Madsen is a web applications programmer living in Japan. 

Follow him on Twitter if that's your thing,
or drop him a line about an interesting project. 

Contact me