RecLab Tutorial

Introduction

Welcome to the RecLab tutorial. This tutorial is designed to lead you through writing your first simple recommender in the RecLab environment. RecLab's goal is to empower you to write recommenders for e-commerce systems that take a variety of shopping contexts into account. It also enables researchers, in cooperation with online retailers and personalization service providers like RichRelevance, to build their algorithms with real data, put them in front of real users in real shopping environments, and study their real reactions. Historically, most efforts to do this have tripped over issues ranging from the failure of research recommenders to meet production SLAs to concerns over data privacy. RecLab addresses these issues by saying, "if you can't bring the data to the code, bring the code to the data."

We will not go through anywhere near all the contexts RecLab offers in this tutorial. Nor will we build a world-class recommender. But by building a simple recommender, we will illustrate the key points of interaction with the RecLab system and give you a foundation to build upon in your pursuit of more advanced models and algorithms.

As you work through the tutorial, you will learn a few basics: how the map reduce computing paradigm works; how Maven works; how to unit test with JUnit; and how to build a simple item-to-item collaborative filtering recommender. If you are already familiar with one or more of these topics, you may breeze through certain sections. However, we encourage you to at least skim each section of the tutorial as you go through it, and to actually compile and run the sample code. Expect to spend an hour or two going through this tutorial in detail. Should you wish to download a complete copy of the tutorial's source code, you can do so with Subversion (see also the Version Control with Subversion book). Once you have Subversion installed, simply type

# svn co http://code.richrelevance.com/svn/reclab/reclab-tutorial/trunk reclab-tutorial

at the command line to check it out. Note that in this and all shell examples, the leading # represents the command-line prompt. Don't type it.

If you prefer, you can use the same URL with your IDE or a GUI tool like RapidSVN, TortoiseSVN, or Versions. If you want to look at the code in a browser, you can do so in raw form here or in a somewhat prettier form here.

The only prerequisites for this tutorial are a Java 1.6 JDK, a working installation of Maven version 2.X (we used 2.2.0 to test), and an editor or IDE for creating and editing Java code (we use Eclipse). All of the binary code you need for the RecLab system will be pulled down from the RecLab binary repository or other public repositories as needed by Maven. If you want to check out the source code as described above, you will also need Subversion or one of the clients mentioned.

The specific recommender we will build in this example will use a source product S to recommend target products T based on how frequently they were purchased by the same shoppers who bought S. This is a recommender that we might place on S's product page, along with the message, "Shoppers who bought S also bought."

Before we dive into actually building our first recommender, we will briefly introduce the basic ideas behind how RecLab works and then show you how to set up a basic project that uses it. Once we have done that, we can dive into the fun stuff and actually write the code for the recommender. Then we can test it out.

Ideally, this whole tutorial should take about an hour to work through. Once you have done so, you will be well on your way to building and running your own models.

How RecLab Works

Any RecLab recommender runs in two phases. At model build time, we analyze data on past shopping behavior and build one or more models. A model is simply a mapping from an ID, such as a product ID, a shopper ID, or a category ID, to an ordered set of IDs, each with an associated score.
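To make that concrete, here is a small sketch of what a model amounts to, using plain Java collections rather than RecLab's actual model classes (the IDs and scores here are invented purely for illustration):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ModelSketch {

    /** A target ID paired with its score. */
    static class ScoredId {
        final int id;
        final double score;

        ScoredId(int id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    public static void main(String[] args) {
        // A model: each key ID maps to a list of scored IDs,
        // ordered from highest score to lowest.
        Map<Integer, List<ScoredId>> model = new HashMap<Integer, List<ScoredId>>();

        model.put(10001, Arrays.asList(
                new ScoredId(10001, 1.0),
                new ScoredId(20001, 0.5),
                new ScoredId(20002, 0.25)));
    }
}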

In our example, the keys for the mapping will be the IDs of source products S while the values will be ordered lists of target product IDs T, each scored by the conditional probability of purchasing T given that the same shopper also purchased S.

Once the model is built, we implement a simple runtime recommender that responds to requests for recommendations by looking into one or more models and then optionally doing a small amount of computation before returning final results.

In our example, all we will do at run time is look up the source product S in the model and return the target products T to our caller.

In order to test and debug a model, RecLab provides a convenient test harness that reads annotated clickstream logs and hands them to a map reduce job. The resulting model is written to a file that can then be used with a runtime test harness to evaluate the quality of the resulting model.

In addition to test harnesses for building and testing models on sample data, there is also a production harness that builds models with very large scale data from real online retailers in a distributed map reduce environment. The model is then run in production and takes live requests from real shoppers. This means that research models built to the RecLab API can undergo the ultimate user testing with real users on real shopping sites in their real homes and offices.

Introducing MapReduce

RecLab model building is done in a MapReduce environment. If you are familiar with the map reduce concept from using tools like Hadoop, you will feel right at home writing code to run at model build time. If you are not, don't worry; we are about to introduce the basic concepts you need to know in order to proceed.

MapReduce, as its name implies, consists of two distinct phases, mapping and reduction. There is technically a third phase in between the two, called collection, which we will discuss later, but it is of less importance in the RecLab world.

First, let's consider mapping. In RecLab, mapping is handled by classes that implement the RlMapper interface.

The basic idea behind mapping is that we start with a set of (k, v) pairs where k is a key and v is a value. We examine these pairs one by one in an arbitrary order. In some cases, we might have multiple threads doing the examination independently, with the only caveat being that each pair is examined once. Each time we examine a key/value pair, we can decide to write out one or more key/value pairs (k', v'). k' need not be the same type as k, and v' need not be the same type as v. The same key k can appear in the input many times, along with different values v1, v2, v3, or even with the same value multiple times. Similarly, we can write outputs using the same key k' many times if we want to.

So what good is this whole abstract concept of mapping? Let's look at a concrete example that frequently occurs in RecLab. The input keys are timestamps indicating when shoppers did particular things, and the values are descriptions of what they did and the context in which they did them. For example, one entry in the input might indicate that at 11:43am on the 23rd of August (the key) a shopper whose ID is 1234987 and whose session ID is ABC555 made a purchase which included three of item 5998 at $1.79 each and one of item 4392 at $59.99 (all of that is rolled into the value).

So what could a mapper do with this? One possibility is to strip out just the data that will be essential to our model. For example, all we might really want to know is which products were bought together. Thus our mapper would look at the key and value described above and output the pairs (5998, 4392) and (4392, 5998), indicating that when the shopper bought 5998 they also bought 4392, and vice versa. For reasons that will become apparent shortly, we will also output (5998, 5998) and (4392, 4392). If we apply this same mapper to all the purchases that occurred in a given month, we would generate a full mapping of what shoppers purchased together over the course of the month. There could be a lot of duplicates, since some other shopper might have also bought 5998 and 4392 together.
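If you want a rough picture of what such a mapper looks like in code, here is a sketch against the RlMapper interface. It uses context classes that we introduce properly later in the tutorial, and note that the mapper we actually build below takes a slightly different, shopper-based approach; this version only pairs items within a single purchase.

package org.reclab.tutorial.cppurchase;

import org.reclab.core.context.EventType;
import org.reclab.core.context.LineItem;
import org.reclab.core.context.RecContext;
import org.reclab.core.context.RlValue.RlDate;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlSet;
import org.reclab.core.mapreduce.RlMapper;
import org.reclab.core.mapreduce.Writer;

/**
 * A sketch of the per-purchase co-occurrence mapper described
 * above. It writes every ordered pair of product IDs bought in
 * a single purchase, including each ID paired with itself.
 */
public class CoPurchasePairMapper
implements RlMapper<RlDate, RecContext, RlInteger, RlInteger> {

    @Override
    public void map(RlDate key, RecContext value,
            Writer<RlInteger, RlInteger> writer) throws Exception {

        // Only purchase events matter here.
        if (!value.getEventType().equals(EventType.PURCHASE)) {
            return;
        }

        // Collect the unique product IDs in this purchase.
        RlSet<RlInteger> productIds =
            new RlSet<RlInteger>(value.getCartContentsContext().size());

        for (LineItem lineItem : value.getCartContentsContext().getLineItems()) {
            productIds.add(new RlInteger(lineItem.getProductId()));
        }

        // Write every ordered pair, e.g. (5998, 4392), (4392, 5998),
        // (5998, 5998), and (4392, 4392) for the example purchase.
        for (RlInteger source : productIds) {
            for (RlInteger target : productIds) {
                writer.write(source, target);
            }
        }
    }
}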

That's all well and good, but now that we have mapped this information, what do we do with it? The answer? We reduce it. In RecLab, reduction is handled by classes that implement the RlReducer interface.

Reduction is the process of taking all the values v1', v2', v3' ... which a mapper associated with a given key k' and reducing them down to one or more pairs (k'', v''). The mapper might have written the pair (k', v1') long before it ever wrote (k', v2'). It could have literally written millions of other pairs in between. It also could have written (k', v1') from one thread on one machine and (k', v2') from another thread on a completely different machine in a distributed environment. However, this actually doesn't matter. Before the reducer is called, the map reduce system will make sure that all of the pairs with the same key k' have been collected together. Just like a mapper, a reducer can write as many (k'', v'') pairs as it wants, and can reuse the same k'' as often as it wants to.

In the RecLab example from above, now that we have a map from product ID to product ID for all pairs of products purchased together, we can use a reducer to generate a mapping from each product ID that was ever purchased to all of the other products that were purchased alongside it. So, if the example purchase above occurred at the beginning of the month, and another shopper appeared at the end of the month and bought product 4392, then our reducer would be asked to do a reduction for key 4392 and the values {4392, 5998, 4392}. The first two come from the pairs (4392, 4392) and (4392, 5998) generated by the mapper from the first purchase. The third comes from the pair (4392, 4392) generated by the second purchase. If there were no other purchases of item 4392, then these would be the only three values we would be asked to reduce.

When the reducer is called for key 4392 it is given an Iterable over the three values {4392, 5998, 4392}. Now we want the reducer to output each of the unique values and the fraction of the purchases that include it. In our example, this means we would output the key 4392 and a value that maps IDs to ratios as (4392 -> 1.0; 5998 -> 0.5). As we will see when we write the code for this reducer, we can compute these ratios simply by counting the number of times each value occurs and then dividing that by the number of times the key occurs as a value. For 4392 that is 2/2 = 1.0. For 5998, that is 1/2 = 0.5.
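Here is that counting logic in plain Java, just to nail down the arithmetic. This is only a sketch; the real reducer we write below works in RecLab's own types.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RatioSketch {

    public static void main(String[] args) {
        int key = 4392;
        List<Integer> values = Arrays.asList(4392, 5998, 4392);

        // Count how many times each value occurs.
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (Integer v : values) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }

        // The number of times the key occurs as a value is the
        // total number of purchases of the source product.
        double total = counts.get(key);

        // Divide each count by the total.
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue() / total);
        }
        // Prints 4392 -> 1.0 and 5998 -> 0.5 (in some order).
    }
}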

We are getting close, but we aren't quite all the way to a model of co-purchases yet. As we will see below, we still have a little housekeeping to do in order to build the actual model, but we are very close. We'll come back to writing the actual code that implements the mapper and reducer we just described, but first we have to pause to quickly set up our development environment.

Setting Up the Project

Before we can write any actual code, we need to set up a Maven project. We will assume you already have Maven installed on your system.

Creating the Project Directory Structure

The first thing we need to do is set up a standard Maven project directory. On Linux, OS X, or Windows running Cygwin, we do the following

# cd ~
# mkdir reclab-tutorial
# cd reclab-tutorial
# mkdir -p src/main/java
# mkdir -p src/test/java

The first directory you created was for the project itself. The two subdirectories are where your source code will live. src/main/java is for the source code you will be working on. src/test/java is for unit tests. Please use exactly these directories, as they are Maven standards. If you use different directories you will have to jump through a few more hoops to tell Maven where to find your code.

Creating a Maven pom.xml File for a RecLab Project

Once these directories are in place, open up your favorite editor and create a file called pom.xml in the reclab-tutorial directory you just created. pom.xml is a Maven file that describes the project you are building.

Copy and paste the following into your pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project 
  xmlns="http://maven.apache.org/POM/4.0.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>

  <!-- 1. This section tells Maven a bit about our project           -->
  
  <groupId>reclab</groupId>
  <artifactId>reclab-tutorial</artifactId>
  <name>RecLab Tutorial</name>
  <version>0.1</version>
  <packaging>jar</packaging>
  <description>
        A sample project for code from the RecLab tutorial.
  </description>

  <!-- 2. This section tells Maven where to look for projects we     -->
  <!--    need in order to build this one.                           -->
  
  <repositories>
    <repository>
      <id>rr-code</id>
      <name>RichRelevance Code Repository</name>
      <url>http://code.richrelevance.com/maven2</url>
    </repository>
  </repositories>
  
  <!-- 3. This section tells Maven what other projects we depend on. -->
  <!--    Maven will download these for us and cache them locally.   -->
  
  <dependencies>
        <dependency>
          <groupId>reclab</groupId>
          <artifactId>reclab-core</artifactId>
          <version>0.5.1</version>
        </dependency>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.8.1</version>
          <scope>test</scope>
        </dependency>
  </dependencies>

  <!-- 4. This is a relatively verbose way of telling Maven to use   -->
  <!--    Java 1.6                                                   -->
  
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

</project>

If you don't have a lot of experience with Maven, don't worry. The pom.xml we just created essentially does four things, as the comments indicate.

  1. It tells Maven a little bit about our project.
  2. It tells Maven where to find some of the projects we depend on. Specifically, it tells it where the repository that contains RecLab code lives, so that it can fetch it before trying to build our project.
  3. It tells Maven that we depend on some other projects. In particular, we need reclab-core, which contains various interfaces and classes we will use, and when we want to compile and run unit tests we need junit.
  4. It tells Maven what Java version to use.

That's it for your initial pom.xml. To see if you got it right, go back to your shell and from the reclab-tutorial directory, type

# mvn clean

All this does is tell Maven to parse the pom.xml, download whatever it needs, and then clean out the target directory. You should see some download messages go by, especially if you have not run Maven before, but there should be no error messages, and just before Maven finishes it should say something like

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12 seconds
[INFO] Finished at: Wed Nov 03 21:48:28 PDT 2010
[INFO] Final Memory: 95M/123M
[INFO] ------------------------------------------------------------------------

Coding the Mappers and Reducers

Now that we have our project all set up, we can start writing some actual code. Using your favorite editor or IDE, create a new Java package org.reclab.tutorial.cppurchase in the src/main/java directory you created. Now create a class called PurchaseByUserMapper. Make it implement the RlMapper interface. It should look something like this:

package org.reclab.tutorial.cppurchase;

import org.reclab.core.context.RecContext;
import org.reclab.core.context.RlValue.RlDate;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlString;
import org.reclab.core.mapreduce.RlMapper;
import org.reclab.core.mapreduce.Writer;

public class PurchaseByUserMapper 
implements RlMapper<RlDate, RecContext, RlString, RlInteger> {

    @Override
    public void map(RlDate key, RecContext value, 
            Writer<RlString, RlInteger> writer) throws Exception {
        // TODO - this method needs to be written.
    }
}

Before we try to write the body of the map method, let's take a look at the parameters to the class and the map function. RlMapper is parameterized by four classes: an input key type KEYIN, an input value type VALIN, an output key type KEYOUT, and an output value type VALOUT. In our PurchaseByUserMapper the input has a key type of RlDate, which represents a date, and a value type of RecContext, which is a Java bean containing all kinds of information about the context where an event occurred. The output has a key type of RlString, which we will use for user IDs, and a value type of RlInteger, which we will use for product IDs.

The various Rl* classes we are using are RecLab-safe versions of common data types like dates, integers, strings, and doubles. They are designed to be easy to use in a wide variety of map reduce environments, across both small single-machine instances and large distributed clusters like Hadoop. They are all nested within the RlValue interface.
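Wrapping and unwrapping them is straightforward. For example, the following fragment mirrors the calls we will use later in this tutorial's code:

    RlInteger productId = new RlInteger(5998);  // wrap an int
    int rawProductId = productId.intValue();    // and unwrap it

    RlString userId = new RlString("1234987");  // wrap a String
    String rawUserId = userId.toString();       // and unwrap it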

Now comes the fun part, actually writing the body of our map method. Following the general approach we outlined in the section introducing MapReduce, for each purchase event we see we'll write a (userId, productId) pair for every product the shopper bought. The code to do so is as follows:

package org.reclab.tutorial.cppurchase;

import org.reclab.core.context.CartContentsContext;
import org.reclab.core.context.EventType;
import org.reclab.core.context.LineItem;
import org.reclab.core.context.RecContext;
import org.reclab.core.context.UserContext;
import org.reclab.core.context.RlValue.RlDate;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlSet;
import org.reclab.core.context.RlValue.RlString;
import org.reclab.core.mapreduce.AbstractDataCacheClient;
import org.reclab.core.mapreduce.RlMapper;
import org.reclab.core.mapreduce.Writer;

/**
 * A mapper that maps from raw purchase events
 * to (userId, productId) pairs for every purchase
 * of a product made by a user. All non-purchase
 * event types are ignored.
 */
public class PurchaseByUserMapper 
extends AbstractDataCacheClient 
implements RlMapper<RlDate, RecContext, RlString, RlInteger> {

    @Override
    public void map(RlDate key, RecContext value, 
            Writer<RlString, RlInteger> writer) throws Exception {

        // We only care about purchase events. Ignore all
        // others.
        if (value.getEventType().equals(EventType.PURCHASE)) {

            // If there is no user id, then we are out
            // of luck. This should not normally be the case. 
            UserContext userContext = value.getUserContext();
            
            if (userContext == null) {
                return;
            }
            
            RlString userId = new RlString(userContext.getUserId());
            
            // Get the contents of the cart.
            CartContentsContext cartContentsContext = value.getCartContentsContext();
            
            // Build a set of unique items in the cart. If a product id
            // appears in more than one line item, this effectively
            // removes duplicates.
            RlSet<RlInteger> productIds = new RlSet<RlInteger>(cartContentsContext.size());
            
            for (LineItem lineItem : cartContentsContext.getLineItems()) {
                productIds.add(new RlInteger(lineItem.getProductId()));
            }
            
            // Now loop over the set and write each product the user bought.
            for (RlInteger productId : productIds) {
                writer.write(userId, productId);
            }                    
        }
    }
}

With this, we have essentially translated the approach we outlined in English into code that produces an output containing one (userId, productId) pair for each product each customer bought. We don't care about the quantity or price, just that they bought it.

At this point you may want to dig a bit more into the guts of the purchase context to see what else is there. Feel free to look at the JavaDoc for RecContext and CartContentsContext to see more of what is available.

Next up we will work on the reducer that turns the set of products purchased by a shopper into a collection of (sourceProductId, targetProductId) pairs for all the pairs of products a shopper purchased.

You might wonder why we didn't just do this in the mapper we just wrote. After all, we explicitly constructed the set of product IDs containing exactly the products the shopper bought. The reason we can't do this is that the shopper might have made several purchases over time. Suppose the shopper bought A and B in one purchase on Tuesday and then bought C and D in another purchase on Friday. PurchaseByUserMapper.map() would be invoked once for each of these purchases. But it could be done first for one and then the other, in either order. Or it could be done in entirely different threads concurrently. If we built the co-purchase pairs in the mapper, we would construct pairs (A, B) and (B, A) from the first purchase, then (C, D) and (D, C) from the second purchase. In doing so we would entirely miss cross-purchase pairs like (A, C) and (D, B). That's where reduction comes in, and more importantly, where the hidden grouping that occurs between mapping and reduction helps us out. We'll see this in action in our reducer code.

To create the reducer, create a new class AllPairsForKeyReducer in org.reclab.tutorial.cppurchase as follows:

package org.reclab.tutorial.cppurchase;

import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlString;
import org.reclab.core.mapreduce.AbstractDataCacheClient;
import org.reclab.core.mapreduce.RlReducer;
import org.reclab.core.mapreduce.Writer;

public final class AllPairsForKeyReducer 
extends AbstractDataCacheClient 
implements RlReducer<RlString, RlInteger, RlInteger, RlInteger> {

    @Override
    public void reduce(RlString key, Iterable<RlInteger> values,
            Writer<RlInteger, RlInteger> writer) throws Exception {
        // TODO - this method needs to be written.        
    }
    
    @Override
    public void close(Writer<RlInteger, RlInteger> writer) throws Exception {
        // NOP        
    }    
}

Notice that unlike an RlMapper which deals with one (k, v) pair at a time, RlReducer takes a key k and a collection of values v1, v2 .... They are presented in the form of a Java Iterable over the input value class. The output is run through a Writer, just like it was for an RlMapper.

Now fill in the class as follows:

package org.reclab.tutorial.cppurchase;

import java.util.HashSet;
import java.util.Set;

import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlString;
import org.reclab.core.mapreduce.AbstractDataCacheClient;
import org.reclab.core.mapreduce.RlReducer;
import org.reclab.core.mapreduce.Writer;

/**
 * A reducer class that writes out all unique pairs of values
 * associated with a given string key.
 */
public final class AllPairsForKeyReducer 
extends AbstractDataCacheClient 
implements RlReducer<RlString, RlInteger, RlInteger, RlInteger> {

    @Override
    public void reduce(RlString key, Iterable<RlInteger> values,
            Writer<RlInteger, RlInteger> writer) throws Exception {
     
        Set<RlInteger> uniques = new HashSet<RlInteger>();
        
        for (RlInteger value : values) {
            uniques.add(value);
        }
        
        for (RlInteger source : uniques) {
            for (RlInteger target : uniques) {
                writer.write(source, target);
            }
        }
    }

    /* (non-Javadoc)
     * @see org.reclab.core.mapreduce.RlReducer#close(org.reclab.core.mapreduce.Writer)
     */
    @Override
    public void close(Writer<RlInteger, RlInteger> writer) throws Exception {
        // NOP        
    }
}

Notice what the reduce method does. First it uses a set to collapse out any duplicate values. Then, it uses a nested pair of for loops on the elements of the set to write out every unique pair.

The other thing to notice is that this class doesn't explicitly know about purchases or shoppers. It only knows about keys and values. Thus it can be reused for many other ID spaces besides just shoppers and products.

Finally, notice the close method. This is called when all reduction is complete, in case the reducer has any final cleanup or output generation it has to do. Our class does not have any of this.

Once we have all pairs each shopper bought, we need to process them across all shoppers to find out how often any given target ID occurred along with any given source ID in the pairs. This turns out to be another map reduce job. The mapping, however, is trivial; we don't have to change anything. This is done with a special mapper called an identity mapper that simply writes every input it sees to the output without changing or computing anything. The second reduction does have some work to do, the details of which we'll see as we build it.
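RecLab provides such a mapper for us (we will meet it as IdentityRlMapper when we wire the job together below), but to see just how little it does, here is a sketch of an identity mapper for the (RlInteger, RlInteger) pairs our first reducer writes:

package org.reclab.tutorial.cppurchase;

import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.mapreduce.RlMapper;
import org.reclab.core.mapreduce.Writer;

/**
 * A sketch of an identity mapper: every input pair is
 * written to the output completely unchanged.
 */
public class IdentityMapperSketch
implements RlMapper<RlInteger, RlInteger, RlInteger, RlInteger> {

    @Override
    public void map(RlInteger key, RlInteger value,
            Writer<RlInteger, RlInteger> writer) throws Exception {
        writer.write(key, value);
    }
}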

To create the second reducer, create a new class CoOccurrenceReducer in org.reclab.tutorial.cppurchase as follows:

package org.reclab.tutorial.cppurchase;

import org.reclab.core.context.RlValue.RlDouble;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlMap;
import org.reclab.core.mapreduce.RlReducer;
import org.reclab.core.mapreduce.Writer;

public class CoOccurrenceReducer 
implements RlReducer<RlInteger, RlInteger, RlInteger, RlMap<RlInteger, RlDouble>> {

    @Override
    public void reduce(RlInteger key, Iterable<RlInteger> values,
            Writer<RlInteger, RlMap<RlInteger, RlDouble>> writer) throws Exception {
        // TODO - this method needs to be written.       
    }
}

In CoOccurrenceReducer, the input keys and values are RlIntegers. They are the source and target product IDs that our mapper wrote. The output key type is an RlInteger as well, and again represents a product ID. But the output value type is a map from integers to doubles. We will use this to represent the fraction of the time each target product was purchased along with the source product. The value should be one when the target is the same as the source, and less than one otherwise.

package org.reclab.tutorial.cppurchase;

import java.util.HashMap;
import java.util.Map;

import org.reclab.core.context.RlValue.RlDouble;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlMap;
import org.reclab.core.mapreduce.AbstractDataCacheClient;
import org.reclab.core.mapreduce.RlReducer;
import org.reclab.core.mapreduce.Writer;

/**
 * A reducer that determines the rate of co-occurrences of arbitrary 
 * integer IDs.
 */
public class CoOccurrenceReducer 
extends AbstractDataCacheClient
implements RlReducer<RlInteger, RlInteger, RlInteger, RlMap<RlInteger, RlDouble>> {

    @Override
    public void reduce(RlInteger key, Iterable<RlInteger> values,
            Writer<RlInteger, RlMap<RlInteger, RlDouble>> writer) throws Exception {

        // Count up how many times each target occurred.
        
        Map<RlInteger, RlInteger> targetCounts = new HashMap<RlInteger, RlInteger>();  
        
        for (RlInteger target : values) {
            if (targetCounts.containsKey(target)) {
                targetCounts.get(target).increment();
            } else {
                targetCounts.put(target, new RlInteger(1));
            }
        }
        
        // The total number of times the source occurred 
        // is the number of times it mapped to itself.
        
        double totalSourceEvents = targetCounts.get(key).intValue();
        
        // Build up our final result by dividing each count by
        // the total number of source events.

        RlMap<RlInteger, RlDouble> targetFractions = new RlMap<RlInteger, RlDouble>();  

        for (RlInteger target : targetCounts.keySet()) {
            targetFractions.put(target, new RlDouble(targetCounts.get(target).intValue() / totalSourceEvents));
        }
        
        // And write it out.
        writer.write(key, targetFractions);
    }
    
    @Override
    public void close(Writer<RlInteger, RlMap<RlInteger, RlDouble>> writer)
            throws Exception {
        // NOP
    }    
}

First, we iterate through the targets, many of which probably occurred more than once due to purchases by different shoppers. As we iterate, we add up the total number of times each occurs as a target. Then, we simply have to iterate through these counts and compute the ratio of each count to the number of times the source occurred as a target, which should be exactly equal to the number of people who purchased the source.

Notice that in the code we didn't mention product IDs anywhere, and we didn't name this class CoPurchaseReducer or anything like that. The reason is that this reducer is actually far more generic than that. If we generate pairs of IDs representing anything related by behavior, whether the IDs are products, categories, or shoppers, and whether the behaviors are purchases, views, add-to-carts, or anything else, their co-occurrences can be computed by this reducer. We will use this reducer, or very similar ones, quite often for building simple conditional probability models.

Before we move on, make sure your code compiles without error in Maven. At the command line, type

# mvn clean compile

from the reclab-tutorial directory.

This tells Maven to clean up anything left behind from last time it was invoked and compile your code. You should see some messages about what Maven is up to and then a success message like

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4 seconds
[INFO] Finished at: Thu Nov 04 13:05:21 PDT 2010
[INFO] Final Memory: 17M/81M
[INFO] ------------------------------------------------------------------------

Testing the Mapper and Reducers

Before we actually try to use our mapper and reducers to do anything useful, we need to test them. You may be tempted to jump ahead and wire things together so you can build a real model, but we encourage you not to skip this section. It is much easier to test a single mapper or reducer class in isolation on a small carefully controlled data set than it is to debug it in the wild on a large data set. RecLab provides some utility classes to make this as easy as possible.

To test our code, we will write test cases in the src/test/java directory we created when we were first setting up our project. Maven knows to look there for JUnit tests and run them when it is asked to, but not to package them up as part of any binary distribution it creates.

Within src/test/java, create a Java package org.reclab.tutorial.cppurchase to mirror the one we created in src/main/java. It is generally good practice to keep test cases in the same package as the code they are testing. Our first test case will be for our mapper. Create the test case as a class called PurchaseByUserMapperTest as follows:

package org.reclab.tutorial.cppurchase;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.junit.Test;
import org.reclab.core.context.CartContentsContext;
import org.reclab.core.context.EventType;
import org.reclab.core.context.LineItem;
import org.reclab.core.context.RecContext;
import org.reclab.core.context.SessionContext;
import org.reclab.core.context.UserContext;
import org.reclab.core.context.RlValue.RlDate;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlString;
import org.reclab.core.mapreduce.Writer;

/**
 * Test case for {@link PurchaseByUserMapper}.
 */
public class PurchaseByUserMapperTest {

}

In order to test our mapper, we will use a very simple custom Writer that simply dumps everything it is asked to write into a list. We can then call our mapper with various arguments and see what gets written. Here is the code we add to PurchaseByUserMapperTest for this writer and the test that uses it:

    /**
     * A simple little helper class that represents
     * a key/value pair.
     */
    private static class KeyValue {
        String key;
        int value;

        public KeyValue(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Test our ability to map purchase events.
     * @throws Exception
     */
    @Test
    public void testMapPurchase() throws Exception {
        
        PurchaseByUserMapper mapper = new PurchaseByUserMapper();

        // A list of everything the mapper wrote.
        
        final List<KeyValue> writtenOutput = new ArrayList<KeyValue>();
        
        // Construct a writer that just appends what it writes to
        // the output.
        
        Writer<RlString, RlInteger> writer = new Writer<RlString, RlInteger>() {

            @Override
            public void write(RlString key, RlInteger value) throws Exception {
                KeyValue keyValue = new KeyValue(key.toString(), value.intValue());
                writtenOutput.add(keyValue);
            }
            
            @Override
            public void close() throws Exception {
            }
        };
        
        // Now pass it a purchase event.
        
        final String shopperId = "S123";
        final int productId1 = 10001;
        final int productId2 = 10002;
        
        List<LineItem> lineItems1 = new ArrayList<LineItem>();
        
        lineItems1.add(new LineItem(productId1, 1, 7995));
        lineItems1.add(new LineItem(productId2, 3, 9995));
        
        RecContext purchaseContext1 = new RecContext(new Date(), EventType.PURCHASE);
        
        purchaseContext1.setUserContext(new UserContext(shopperId));
        purchaseContext1.setSessionContext(new SessionContext("ABC"));
        purchaseContext1.setCartContentsContext(new CartContentsContext(lineItems1));
        
        mapper.map(new RlDate(purchaseContext1.getDate()), purchaseContext1, writer);
        
        // Now the output should contain two elements, one for each 
        // product the shopper purchased.
        
        assertEquals(2, writtenOutput.size());
        
        assertEquals(shopperId,  writtenOutput.get(0).key);
        assertEquals(productId2, writtenOutput.get(0).value);
        
        assertEquals(shopperId,  writtenOutput.get(1).key);
        assertEquals(productId1, writtenOutput.get(1).value);
    }

Essentially, we just force a call to map and make sure that the products we passed in end up getting passed through to our writer.

To run the test, we invoke Maven with

# mvn clean test

from the reclab-tutorial directory.

Just as with every previous Maven invocation, we should end up with a success message.

So we have tested how the mapper responds to purchases. For completeness, we should also test how it responds to other kinds of events. In general, our map reduce jobs will start out with streams of all kinds of events in RecContext form. Here is the test method we can add to PurchaseByUserMapperTest for this. It uses a Writer that explodes and causes the test to fail if it is ever called.

    /**
     * Test our ability to ignore events that are not purchases.
     */
    @Test
    public void testMapNonPurchase() throws Exception {

        PurchaseByUserMapper mapper = new PurchaseByUserMapper();

        // Construct a writer that should never be called.
        
        Writer<RlString, RlInteger> writer = new Writer<RlString, RlInteger>() {

            @Override
            public void write(RlString key, RlInteger value) throws Exception {
                assertFalse("We should never be called.", true);
            }
            
            @Override
            public void close() throws Exception {
                assertFalse("We should never be called.", true);
            }
        };
        
        // Test that the mapper ignores an event that is not a purchase.
        
        RecContext recContext = new RecContext(new Date(), EventType.HOME_PAGE_VIEW);
        
        mapper.map(new RlDate(recContext.getDate()), recContext, writer);        
    }

Add this test, run mvn clean test again, and it should succeed again.

Next up, we want to test our first reducer in a similar way. As you probably already guessed, we will create a test class AllPairsForKeyReducerTest in the org.reclab.tutorial.cppurchase package in src/test/java. This test class will verify that our reducer behaves as we expect it to. The code for the test is as follows:

package org.reclab.tutorial.cppurchase;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.junit.Test;
import org.reclab.core.context.RlValue.RlArray;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlString;
import org.reclab.core.mapreduce.Writer;

/**
 * Test case for {@link AllPairsForKeyReducer}.
 */
public class AllPairsForKeyReducerTest {

    /**
     * A simple little helper class that represents
     * a key/value pair.
     */
    private static class KeyValue {
        RlInteger key;
        RlInteger value;

        public KeyValue(RlInteger key, RlInteger value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (obj == null) {
                return false;
            }
            if (getClass() != obj.getClass()) {
                return false;
            }
            KeyValue other = (KeyValue) obj;
            if (key == null) {
                if (other.key != null) {
                    return false;
                }
            } else if (!key.equals(other.key)) {
                return false;
            }
            if (value == null) {
                if (other.value != null) {
                    return false;
                }
            } else if (!value.equals(other.value)) {
                return false;
            }
            return true;
        }
    }

    @Test
    public void testReduce() throws Exception {
        
        // Construct the reducer.
        AllPairsForKeyReducer reducer = new AllPairsForKeyReducer();
    
        // Construct an input key and a set of input values.
        
        final RlString inputKey = new RlString("U17");
    
        final RlInteger value1 = new RlInteger(10001);
        final RlInteger value2 = new RlInteger(10002);
        final RlInteger value3 = new RlInteger(10003);
    
        // Hold onto the input values in a set so we can
        // use them later.
        
        Set<RlInteger> allValues = new HashSet<RlInteger>();
        
        allValues.add(value1);
        allValues.add(value2);
        allValues.add(value3);
        
        // Put duplicates in the input.
        
        final RlArray<RlInteger> inputValues = new RlArray<RlInteger>();
        
        inputValues.add(value1);
        inputValues.add(value2);
        inputValues.add(value3);
        inputValues.add(value1);
        inputValues.add(value1);
        inputValues.add(value3);
        inputValues.add(value3);
        inputValues.add(value3);

        // Now reduce them and we should get each
        // possible pair of values exactly once.

        final List<KeyValue> written = new ArrayList<KeyValue>();
        
        // When reducing, we use a Writer that just puts
        // each key value pair into our array of written
        // pairs so we can verify later that we got what
        // we expected.
        
        reducer.reduce(inputKey, inputValues, new Writer<RlInteger, RlInteger>() {

            @Override
            public void close() throws Exception {
            }

            @Override
            public void write(RlInteger key, RlInteger value) throws Exception {
                written.add(new KeyValue(key, value));
            }
        });
        
        // There are three unique values, so there are nine pairs.
        
        assertEquals(9, written.size());
        
        // Make sure we have every possible pair.
        
        for (RlInteger source : allValues) {
            for (RlInteger target : allValues) {
                assertTrue(written.contains(new KeyValue(source, target)));
            }
        }
    }
}

The test emulates multiple orders from the same shopper, with some individual items (values) purchased more than once. There are only three unique values, so even though there are quite a few more item purchases, the reducer will boil them down to just those three and then produce all nine possible pairs. After running the reducer, we simply have to verify that all nine pairs are present by reproducing them from the set in which we stored the input values.

To run the new reducer test, once again run

# mvn clean test

from the reclab-tutorial directory. This will run both of the test classes in your project.

Finally, we need a unit test for our second reducer. It looks like the following:

package org.reclab.tutorial.cppurchase;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import org.reclab.core.context.RlValue.RlArray;
import org.reclab.core.context.RlValue.RlDouble;
import org.reclab.core.context.RlValue.RlInteger;
import org.reclab.core.context.RlValue.RlMap;
import org.reclab.core.mapreduce.Writer;

/**
 * Test case for {@link CoOccurrenceReducer}.
 */
public class CoOccurrenceReducerTest {

   @Test
   public void testReduce() throws Exception {

       // Construct the reducer.
       
       CoOccurrenceReducer reducer = new CoOccurrenceReducer();
       
       // Construct an input key and a list of input values.
       final RlInteger inputKey = new RlInteger(17);
       
       final RlArray<RlInteger> inputValues = new RlArray<RlInteger>();

       // Make it look more or less like four purchases.
       
       inputValues.add(inputKey);
       inputValues.add(new RlInteger(1001));
       inputValues.add(new RlInteger(1002));
       inputValues.add(new RlInteger(1003));
       
       inputValues.add(inputKey);
       inputValues.add(new RlInteger(1003));
       inputValues.add(new RlInteger(1004));
       
       inputValues.add(inputKey);
       
       inputValues.add(inputKey);
       inputValues.add(new RlInteger(1001));
       inputValues.add(new RlInteger(1003));
       inputValues.add(new RlInteger(1004));
       inputValues.add(new RlInteger(1005));
       
       // Our writer is where we validate that the input
       // key and values are passed through as they should
       // be.
       Writer<RlInteger, RlMap<RlInteger, RlDouble>> writer = 
           new Writer<RlInteger, RlMap<RlInteger, RlDouble>>() {

               @Override
               public void write(RlInteger key, RlMap<RlInteger, RlDouble> value)
               throws Exception {
                   // We should write a key that is equal to the
                   // input key, with values for each of the 
                   // input values.
                   
                   assertEquals(inputKey, key);
                   assertEquals(6, value.size());
                   
                   for(RlInteger inputValue : inputValues) {
                       assertTrue(value.containsKey(inputValue));
                   }

                   // The count should be 1.0 for the source, and 
                   // fractional based on the number occurrences for
                   // the other keys.
                   
                   assertEquals(1.0, value.get(inputKey).doubleValue(), 1e-10);
                   
                   assertEquals(0.5,  value.get(new RlInteger(1001)).doubleValue(), 1e-10);
                   assertEquals(0.25, value.get(new RlInteger(1002)).doubleValue(), 1e-10);
                   assertEquals(0.75, value.get(new RlInteger(1003)).doubleValue(), 1e-10);
                   assertEquals(0.5,  value.get(new RlInteger(1004)).doubleValue(), 1e-10);
                   assertEquals(0.25, value.get(new RlInteger(1005)).doubleValue(), 1e-10);
               }

               @Override
               public void close() throws Exception {
               }
           };
    
       // Run the reduction, which will write to the writer,
       // which will verify the reducer did its job.
           
       reducer.reduce(inputKey, inputValues, writer);
   }
}

This test exercises the CoOccurrenceReducer class by simulating how we would reduce all of the values for a given key, and verifying that we correctly compute the fraction of the time each value is associated with the key. For the key itself, this should be one; for all other values it is less than one, based on the test data we put into the test.

Once again,

# mvn clean test

from the reclab-tutorial directory will run all three of the test classes in your project. By all means copy, study, run, and extend the tests to your heart's content.

Building a Model

We have now constructed a mapper and two reducers that together can compute the probability that someone who bought a source product S also bought a target product T. Now we have to put them together and arrange for the final results to be put into a model we can use at run time.

Building a model requires three things.

  1. We have to have a source of input data to use in building the model.
  2. We have to do one or more map reduce operations to transform the data from its raw form into a map from IDs (e.g. the product IDs in this tutorial's running example) to scored IDs (e.g. the output of the example's second reducer).
  3. We need a model builder that we can pass the output of our map reduce operations to in order for it to build our model.

The exact manner in which these things occur depends tremendously on the nature of the file system in which our data is stored. RecLab supports everything from the tiny memory-resident NanoFileSystem, which is really only good for very small scale unit testing, to the local-disk-resident YamlFileSystem, which is great for debugging medium-scale data because all the files it reads and writes are human-readable, to large distributed file systems like HDFS.

RecLab was designed to let you run the same code on any of the file systems that it supports, all of which extend the abstract FileSystem class.

In order to coordinate the process of building a model, we need a class that derives from MapReduceJob. A MapReduceJob has a FileSystem and a ModelBuilder to actually build the model.

We will demonstrate model building using the YamlFileSystem, a local file system that writes data in a relatively human-readable YAML format. This is generally a very good choice when you first write a model builder, as it is easy to construct test input files and debug intermediate files that your mappers and reducers produce. We will also use the YamlModelBuildEnvironment, which writes models out to YAML files in the local file system.

We will do all of this in a static main() method in a class called CpPurchaseDemo. The method looks like this:

    public static void main(String[] args) {
        
        try {
            
            // Parsing args and setting us up takes most
            // of the space in this method.
            
            if(args.length != 2 && args.length != 3) {
                System.err.println("Usage: CpPurchaseDemo sourceFile modelDir [workingDir]");
                System.exit(-1);
            }
        
            // The source file in the local file system.
            String sourceFilePath = args[0];
            
            // The location in the local file system where
            // we want to write our model.
            String modelDirPath  = args[1];
            
            // The path to the directory to use for intermediate files.
            // If it is not specified, we'll use a temporary directory.
            String rootDirectoryPath;
            
            if(args.length == 3) {
                rootDirectoryPath = args[2];
            } else {
                // Hack to get a temp directory.
                File tempFile = File.createTempFile("RecLab.PurchaseCpDemo", "dir");
                
                tempFile.delete();
                tempFile.mkdir();
                
                rootDirectoryPath = tempFile.getAbsolutePath();
            }
        
            System.out.println("ROOT : " + rootDirectoryPath);
            System.out.println("MODEL: " + modelDirPath);
            
            // This is where the real work begins.
            
            // First, we need a file system.
            
            FileSystem fileSystem = new YamlFileSystem(rootDirectoryPath, sourceFilePath, false);
        
            // Next we need a model build environment that writes 
            // to a directory in the local file system.
            
            ModelBuildEnvironment modelBuildEnvironment = new YamlModelBuildEnvironment(modelDirPath);
            
            // Now we construct a job and run it.
            
            CpPurchaseDemo demo = new CpPurchaseDemo(fileSystem, modelBuildEnvironment);
        
            demo.run();
            
        } catch (Exception e) {
            System.err.println("CpPurchaseDemo failed. Details:");
            e.printStackTrace(System.err);
            System.exit(1);
        }
    }

As the comments indicate, most of the method is just parsing command line arguments and setting up the file system we will use as well as the input and output files. Once we have done this, we construct a CpPurchaseDemo and call its run() method. CpPurchaseDemo extends the class MapReduceJob, which is an abstract class that holds onto a file system and a model builder on behalf of its subclasses and declares an abstract run() method that does all the real work. Here is what our version of this looks like:

/**
 * A job that implements a conditional-probability of
 * co-purchase model.
 */
public class CpPurchaseDemo extends MapReduceJob {
    
    public static class TypedIdentityMapper extends IdentityRlMapper<RlInteger, RlInteger> {
    }

    /**
     * Construct the demo given a file system and a model builder.
     * @param fileSystem the file system
     * @param modelBuildEnvironment the model build environment.
     */
    protected CpPurchaseDemo(FileSystem fileSystem, ModelBuildEnvironment modelBuildEnvironment) {
        super(fileSystem, modelBuildEnvironment);
    }

    @Override
    public void run() throws Exception {

        // Get the source event file.
        
        TypedValueFile<RlDate, RecContext> eventStream = getFileSystem().getSourceEventFile();

        // First map reduce, to go from raw context events
        // to a mapping from product ids to the product ids 
        // purchased by the same shopper.

        // Construct the first map reduce factory.
        
        MapReducerFactory<RlDate, RecContext, RlString, RlInteger, RlInteger, RlInteger> mapReducerFactory1 = 
            getFileSystem().new 
                MapReducerFactory<RlDate, RecContext, RlString, RlInteger, RlInteger, RlInteger>();
        
        // Run the first map reduce.
    
        TypedValueFile<RlInteger, RlInteger> allPairsFile = 
            mapReducerFactory1.getMapReducer().mapReduce(
                eventStream,
                PurchaseByUserMapper.class,
                AllPairsForKeyReducer.class);

        // Construct the second map reduce factory.
        
        MapReducerFactory<RlInteger, RlInteger, RlInteger, RlInteger, RlInteger, RlMap<RlInteger, RlDouble>> mapReducerFactory2 = 
            getFileSystem().new 
                MapReducerFactory<RlInteger, RlInteger, RlInteger, RlInteger, RlInteger, RlMap<RlInteger, RlDouble>>();

        // Run the second map reduce.
        
        TypedValueFile<RlInteger, RlMap<RlInteger, RlDouble>> scoreFile = 
            mapReducerFactory2.getMapReducer().mapReduce(
                allPairsFile,
                TypedIdentityMapper.class,
                CoOccurrenceReducer.class);
        
        // Open a model builder and build the model.
        
        new ModelBuilderUtils<RlInteger, RlInteger>().buildModel(scoreFile, 
                getModelBuildEnvironment().new ModelBuilderFactory<RlInteger, RlInteger>().open("CpPurchaseModel"));
    }

        
    /**
     * Parse the command-line and run the job.
     * @param args command line arguments
     */
    public static void main(String[] args) {
        
        // See the listing above...
        
    }
}

Several important things happen here. First, we define an identity mapper. This mapper goes in between our two reducers. We always map and reduce in pairs (along with the implicit grouping of the output of the map by key before the reducer is invoked). We'll pair the identity mapper with our second reducer. Next, we define the constructor for CpPurchaseDemo which just delegates to its superclass.

Finally, we have the run() method, where it all happens. The run() method invokes two mappings and two reductions. Each of these begins by asking the underlying file system for a FileSystem.MapReducerFactory. This step is present primarily for type safety. It provides a mechanism for getting a MapReducer that has the right six type parameters: two for the input key and value types, two for the intermediate key and value types that the mapper produces and the reducer consumes, and two for the final output. We then do the mapping and reduction with a mapper and reducer that have matching types.

The output of each map reduce step is a file that lives in the file system. In our case, this is a local file. After you run the program, you can look in the directory it used (which main() conveniently prints) and see the intermediate files.

Finally, we take the last file and use the helper class ModelBuilderUtils to build a model, which goes back into the local file system at the location specified on the command line.

Running the Model Build

Running from an IDE

If you are building your code in an IDE such as Eclipse, then you can probably run the main() method directly from a menu. Try this, and you should see

Usage: CpPurchaseDemo sourceFile modelDir [workingDir]

on the console output. This is good news. It means main() ran properly and did what it is supposed to when it isn't given any command-line arguments.

Running From the Command Line

You can skip this section if you are using an IDE.

If you aren't using an IDE and are just running Maven directly on the command line, you will need to add a little bit of configuration to your pom.xml file to enable Maven to package up your code and all of its dependencies in runnable form. Add the following to your pom.xml inside the <plugins>...</plugins> block, just below the existing <plugin>...</plugin> block where we told the compiler to use Java 1.6.

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>org.reclab.tutorial.cppurchase.CpPurchaseDemo</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

It's fairly verbose, but essentially this block tells Maven that we want to assemble a jar containing our code and all the code it depends on and that we want CpPurchaseDemo to be the main class that gets run when we invoke the jar from the command line.

To have Maven build the complete jar, simply invoke it with

# mvn package

Now if you look in the target directory, you should see a jar file with a name like reclab-tutorial-0.1-jar-with-dependencies.jar. To run it, simply type

# java -jar target/reclab-tutorial-0.1-jar-with-dependencies.jar

at the command line. You should see the same console message

Usage: CpPurchaseDemo sourceFile modelDir [workingDir]

mentioned above. Now we just have to add some command line arguments to give the code some actual data to work with.

Running with Input and Output Files

Before we can run CpPurchaseDemo.main() to proper completion, we need an input file for it to read. Create a file called smallPurchaseLog.yaml and copy the following YAML representations of RecContexts into it.

# User 123 in session ABC buys 10001 and 20001
--- {!!timestamp '2010-02-01T19:00:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 10001, quantity: 1,
        unitPrice: 7995}, {productId: 20001, quantity: 3, unitPrice: 9995}]}, date: !!timestamp '2010-02-01T19:00:56.777Z',
  eventType: PURCHASE, sessionContext: !Session {sessionId: ABC}, userContext: !User {
    userId: 123}}}
# User 456 in session DEF buys 10001 and 20002    
--- {!!timestamp '2010-02-01T19:01:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 10001, quantity: 1,
        unitPrice: 7995}, {productId: 20002, quantity: 10, unitPrice: 995}]}, date: !!timestamp '2010-02-01T19:01:56.777Z',
  eventType: PURCHASE, sessionContext: !Session {sessionId: DEF}, userContext: !User {
    userId: 456}}}
# User 123 in session XYZ buys 10001 alone
--- {!!timestamp '2010-03-01T19:00:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 10001, quantity: 1,
        unitPrice: 7995}]}, date: !!timestamp '2010-03-01T19:00:56.777Z',
  eventType: PURCHASE, sessionContext: !Session {sessionId: XYZ}, userContext: !User {
    userId: 123}}}
# User 789 in session XYZ buys 20001 and 20002
--- {!!timestamp '2010-04-01T19:00:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 20001, quantity: 5,
        unitPrice: 7995}, {productId: 20002, quantity: 1, unitPrice: 895}]}, date: !!timestamp '2010-04-01T19:00:56.777Z',
  eventType: PURCHASE, sessionContext: !Session {sessionId: XYZ}, userContext: !User {
    userId: 789}}}
# User 777 in session XYZ buys 10001 and 20001
--- {!!timestamp '2010-04-02T19:00:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 10001, quantity: 1,
        unitPrice: 7995}, {productId: 20001, quantity: 1, unitPrice: 7995}]}, date: !!timestamp '2010-04-02T19:00:56.777Z',
  eventType: PURCHASE, sessionContext: !Session {sessionId: XYZ}, userContext: !User {
    userId: 777}}}
# User 456 in session QQQ buys 50001 and 50002    
--- {!!timestamp '2010-02-05T19:01:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 50001, quantity: 1,
        unitPrice: 995}, {productId: 50002, quantity: 1, unitPrice: 3995}]}, date: !!timestamp '2010-02-05T19:01:56.777Z',
  eventType: PURCHASE, sessionContext: !Session {sessionId: QQQ}, userContext: !User {
    userId: 456}}}

Notice that although this is not quite plain English, it is much more readable than a binary format and quite a bit less baroque than XML. Save this file somewhere on your local machine, say ~/tmp/smallPurchaseLog.yaml for the moment.

Now run the program with two command line arguments, one to specify the input file and the other to specify where to put the output model. From the command line with the jar Maven built, this will look like

# mkdir ~/tmp/models
# java -jar target/reclab-tutorial-0.1-jar-with-dependencies.jar ~/tmp/smallPurchaseLog.yaml ~/tmp/models

If you are working in an IDE, run the CpPurchaseDemo with the command line arguments

~/tmp/smallPurchaseLog.yaml ~/tmp/models

Either way you run it, the console output should be something like

ROOT : /var/folders/D6/D6o-jCPuEhyMhwnhLtjjZU+++TI/-Tmp-/RecLab.PurchaseCpDemo6551711462415584948dir
MODEL: /Users/username/tmp/models

This tells us the temporary directory where intermediate files were stored during the map reduce operation and the location of the final model that was built. If you look in the temporary directory, you will see several files.

# ls -l /var/folders/D6/D6o-jCPuEhyMhwnhLtjjZU+++TI/-Tmp-/RecLab.PurchaseCpDemo6551711462415584948dir
total 32
-rw-r--r--  1 vengroff  staff  125 Nov  6 14:18 intermediate.0
-rw-r--r--  1 vengroff  staff  532 Nov  6 14:18 intermediate.1
-rw-r--r--  1 vengroff  staff  266 Nov  6 14:18 intermediate.2
-rw-r--r--  1 vengroff  staff  416 Nov  6 14:18 intermediate.3

These are the intermediate files produced by the two map reduce operations we ran: intermediate.0 is the output of the first mapper as grouped for input to the first reducer; intermediate.1 is the output of the first reducer; intermediate.2 is the output of the second mapper as grouped for input to the second reducer; and intermediate.3 is the output of the second reducer. Take a look inside the first intermediate file, and you will see

# cat /var/folders/D6/D6o-jCPuEhyMhwnhLtjjZU+++TI/-Tmp-/RecLab.PurchaseCpDemo6551711462415584948dir/intermediate.0
--- {'123': [20001, 10001, 10001]}
--- {'456': [50002, 50001, 20002, 10001]}
--- {'777': [20001, 10001]}
--- {'789': [20002, 20001]}

Each line of this file represents one unique key written by our PurchaseByUserMapper class. It is not exactly the output of PurchaseByUserMapper for our input; that would have one (userId, productId) pair on each line. But if we took that output and grouped all the product IDs for a given user ID together into a sequence, it would look like the contents of this intermediate file. That is exactly what our map reducer did: it ran the mapper, then grouped its output by key so that the reducer could iterate over each key's set of values.
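If you want to see that grouping step in isolation, here is a minimal, self-contained sketch of what a local map-reduce driver does between the mapper and the reducer. None of these names are RecLab Core classes; this is plain Java written for illustration only.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: group raw (key, value) mapper output by key, the
// way intermediate.0 groups product IDs under each user ID.
public final class GroupByKeyDemo {

    public static void main(final String[] args) {
        // Raw mapper output: one (userId, productId) pair per purchase.
        String[][] mapped = {
            {"123", "20001"}, {"123", "10001"}, {"123", "10001"},
            {"456", "50002"}, {"456", "50001"},
        };

        // Collect all the values for a given key into one sequence.
        Map<String, List<String>> grouped = new LinkedHashMap<String, List<String>>();
        for (String[] pair : mapped) {
            List<String> values = grouped.get(pair[0]);
            if (values == null) {
                values = new ArrayList<String>();
                grouped.put(pair[0], values);
            }
            values.add(pair[1]);
        }

        // Prints {123=[20001, 10001, 10001], 456=[50002, 50001]},
        // the same shape as the lines of intermediate.0.
        System.out.println(grouped);
    }
}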

If we look at the next intermediate file, we can see that it is the output of the first reducer.

# cat /var/folders/D6/D6o-jCPuEhyMhwnhLtjjZU+++TI/-Tmp-/RecLab.PurchaseCpDemo6551711462415584948dir/intermediate.1
--- {20001: 20001}
--- {20001: 10001}
--- {10001: 20001}
--- {10001: 10001}
--- {50002: 50002}
--- {50002: 50001}
--- {50002: 20002}
--- {50002: 10001}
--- {50001: 50002}
--- {50001: 50001}
--- {50001: 20002}
--- {50001: 10001}
--- {20002: 50002}
--- {20002: 50001}
--- {20002: 20002}
--- {20002: 10001}
--- {10001: 50002}
--- {10001: 50001}
--- {10001: 20002}
--- {10001: 10001}
--- {20001: 20001}
--- {20001: 10001}
--- {10001: 20001}
--- {10001: 10001}
--- {20001: 20001}
--- {20001: 20002}
--- {20002: 20001}
--- {20002: 20002}

The first line of intermediate.0 was reduced to the first four lines of intermediate.1, each of which represents one of the unique ordered pairs of product IDs co-purchased by user 123. Next come the pairs from the second line of intermediate.0, which represent the unique pairs of products user 456 purchased. User 456 purchased four unique products, so there are sixteen pairs, which occupy the next sixteen lines of the file. Users 777 and 789 each bought two unique products, and therefore added four lines each to the file, for a total of twenty-eight lines. Notice that the lines for each user are all there, but in no particular order within the group of lines generated for that user. You may wish to go back and look at the code we wrote for AllPairsForKeyReducer to make sure you see exactly where these values came from.
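For reference, here is a simplified sketch of that pair-generation logic. It is not the actual AllPairsForKeyReducer source; the emit method here stands in for however the real reducer writes its output.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of the core loop of AllPairsForKeyReducer.
public final class AllPairsSketch {

    void reduce(final Integer userId, final List<Integer> productIds) {
        // De-duplicate first: user 123 bought 10001 twice, but each
        // ordered pair should be emitted only once per user.
        Set<Integer> unique = new HashSet<Integer>(productIds);

        // n unique products yield n * n ordered pairs, including
        // self-pairs like (10001, 10001).
        for (Integer source : unique) {
            for (Integer target : unique) {
                emit(source, target);
            }
        }
    }

    private void emit(final Integer source, final Integer target) {
        System.out.println("--- {" + source + ": " + target + "}");
    }

    public static void main(final String[] args) {
        // Reproduces the four pairs generated for user 123 above.
        new AllPairsSketch().reduce(123, Arrays.asList(20001, 10001, 10001));
    }
}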

Next, we look at the output of the second mapper. Recall that this was an identity mapper, but the file we are looking at will have been grouped by key in preparation for the second reducer. So it looks like

# cat /var/folders/D6/D6o-jCPuEhyMhwnhLtjjZU+++TI/-Tmp-/RecLab.PurchaseCpDemo6551711462415584948dir/intermediate.2
--- {10001: [20001, 10001, 50002, 50001, 20002, 10001, 20001, 10001]}
--- {20001: [20001, 10001, 20001, 10001, 20001, 20002]}
--- {20002: [50002, 50001, 20002, 10001, 20001, 20002]}
--- {50001: [50002, 50001, 20002, 10001]}
--- {50002: [50002, 50001, 20002, 10001]}

Each product ID is now associated with every instance of a product ID that was bought along with it. The total number of values is twenty-eight, one for each of the values in intermediate.1, but they have been grouped by their keys.

Finally, we have the output of the final reducer. It looks like

# cat /var/folders/D6/D6o-jCPuEhyMhwnhLtjjZU+++TI/-Tmp-/RecLab.PurchaseCpDemo6551711462415584948dir/intermediate.3
--- {10001: {20002: 0.3333333333333333, 20001: 0.6666666666666666, 10001: 1.0, 50002: 0.3333333333333333,
    50001: 0.3333333333333333}}
--- {20001: {20002: 0.3333333333333333, 20001: 1.0, 10001: 0.6666666666666666}}
--- {20002: {20002: 1.0, 20001: 0.5, 10001: 0.5, 50002: 0.5, 50001: 0.5}}
--- {50001: {20002: 1.0, 10001: 1.0, 50002: 1.0, 50001: 1.0}}
--- {50002: {20002: 1.0, 10001: 1.0, 50002: 1.0, 50001: 1.0}}

Here, we have reduced the contents of intermediate.2 to produce a mapping from each source product ID to a value that is itself a mapping from target product ID to the fraction of the time the target was co-purchased with the source. Again, you may wish to go back and look at the code we wrote for CoOccurrenceReducer to make sure you see exactly where these values came from.
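The arithmetic is easy to reproduce by hand. Here is a simplified sketch of the computation, not the actual CoOccurrenceReducer source: count how often each target appears among the values, then divide by how often the source itself appears.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the scoring step in CoOccurrenceReducer.
public final class CoOccurrenceSketch {

    Map<Integer, Double> reduce(final Integer sourceId, final List<Integer> targets) {
        // Count how many times each target was co-purchased with the source.
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (Integer target : targets) {
            Integer count = counts.get(target);
            counts.put(target, count == null ? 1 : count + 1);
        }

        // The source always co-occurs with itself, so counts.get(sourceId)
        // is the number of baskets that contained the source at all.
        double sourceCount = counts.get(sourceId);

        // Score each target by the fraction of those baskets it appeared in.
        Map<Integer, Double> scores = new HashMap<Integer, Double>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            scores.put(entry.getKey(), entry.getValue() / sourceCount);
        }
        return scores;
    }

    public static void main(final String[] args) {
        // The intermediate.2 line for source 20001 yields
        // {20001=1.0, 10001=0.666..., 20002=0.333...}.
        System.out.println(new CoOccurrenceSketch().reduce(20001,
                Arrays.asList(20001, 10001, 20001, 10001, 20001, 20002)));
    }
}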

All that is left to look at is the final model, which is written to the model directory we specified on the command line. The model file is very much like the final intermediate file, except that it is ordered by descending score and is a bit more strongly typed. It is also truncated by default to no more than twenty values per source ID, though our input data was so small that this did not affect us.

# cat ~/tmp/models/CpPurchaseModel.yaml 
--- {10001: [{id: 10001, score: 1.0}, {id: 20001, score: 0.6666666666666666}, {id: 50002,
      score: 0.3333333333333333}, {id: 50001, score: 0.3333333333333333}, {id: 20002,
      score: 0.3333333333333333}]}
--- {20001: [{id: 20001, score: 1.0}, {id: 10001, score: 0.6666666666666666}, {id: 20002,
      score: 0.3333333333333333}]}
--- {20002: [{id: 20002, score: 1.0}, {id: 50002, score: 0.5}, {id: 20001, score: 0.5},
    {id: 50001, score: 0.5}, {id: 10001, score: 0.5}]}
--- {50001: [{id: 50002, score: 1.0}, {id: 20002, score: 1.0}, {id: 50001, score: 1.0},
    {id: 10001, score: 1.0}]}
--- {50002: [{id: 50002, score: 1.0}, {id: 20002, score: 1.0}, {id: 50001, score: 1.0},
    {id: 10001, score: 1.0}]}

If you want to try your code out with a bigger data set, consult the documentation on RecLab data sets and find a data set to test with.

Loading and Using the Model

Having now built a model, we most likely want to read it into some runtime code and work with it to make real recommendations. This is done using RecLab's runtime support classes. Specifically, we will write a class that implements the ProductRecommender interface. This interface specifies some simple methods for getting at models and evaluating run-time context to return product recommendations.

The Recommender

In order to do its work, our ProductRecommender will rely on a model it accesses through a RecommenderRuntime that represents the runtime environment in which it operates. A RecommenderRuntime abstracts away the details of the runtime environment from the ProductRecommender in much the same way that the details of the underlying file system were abstracted away from our model build job. In this way, we can write a ProductRecommender once, but have it run in a variety of settings, ranging from a simple local command-line environment to a full-blown large-scale cloud environment. Each environment will provide our product recommender with a different recommender runtime.

Our ProductRecommender looks as follows:

package org.reclab.tutorial.cppurchase;

import java.util.List;

// The RecLab Core imports (ProductRecommender, RecommenderRuntime,
// ModelReader, and the context classes) are omitted here for brevity.

public final class CpPurchaseRecommender implements ProductRecommender {

    /**
     * The model reader we use to read our model.
     */
    private ModelReader modelReader;

    @Override
    public void initialize(final RecommenderRuntime recommenderRuntime) throws Exception {

        this.modelReader = recommenderRuntime.getModelReader("CpPurchaseModel");
    }

    @Override
    public List<ScoredId> getRecs(final RecContext recContext) {

        // If we have a product context, dig through and
        // find the item, then look it up in the model.

        ProductContext productContext = recContext.getProductContext();

        if (productContext != null) {
            int productId = productContext.getProductId();

            List<ScoredId> scoredIds = modelReader.getScores(productId);

            if (scoredIds != null) {
                return scoredIds;
            }
        }

        // If we have a cart context, key off the first item
        // in the cart. We can fall through to this in a lot
        // of different cases where we have something in the
        // cart, but we don't have a single product context
        // the way we do on an item page.

        CartContentsContext cartContentsContext = recContext.getCartContentsContext();

        if (cartContentsContext != null) {
            List<LineItem> lineItems = cartContentsContext.getLineItems();

            // Guard against an empty cart before keying off its first item.
            if (lineItems != null && !lineItems.isEmpty()) {
                int productId = lineItems.get(0).getProductId();

                return modelReader.getScores(productId);
            }
        }

        // Nothing to recommend in this context.
        return null;
    }
}

The first public method, initialize(RecommenderRuntime), is how we are given a RecommenderRuntime to work with. The only thing our recommender needs from the runtime is a ModelReader capable of reading the model we built. We get this model reader by calling the getModelReader(String) method on the runtime, passing the same model name we used when we built our model. The reason we have to pass a name is that not all model jobs build just a single model. A model job can build several models, each with a different name, and the runtime can then load them separately, again by name, and use them together to make recommendations.
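For example, a job that built both our purchase model and a second, hypothetical browse model could load the two side by side. The second model name and both reader fields below are invented for illustration; only CpPurchaseModel exists in this tutorial.

@Override
public void initialize(final RecommenderRuntime recommenderRuntime) throws Exception {

    // Each model built by the job is loaded separately, by name.
    this.purchaseModelReader = recommenderRuntime.getModelReader("CpPurchaseModel");

    // "CpBrowseModel" is a hypothetical second model for illustration.
    this.browseModelReader = recommenderRuntime.getModelReader("CpBrowseModel");
}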

The next method, getRecs(RecContext), is the method that actually generates recommendations. For this model, we first check whether we have a product context. If we do, we use the product ID in the product context as a key into our model and return the resulting products. If we don't have a product context, we check for a cart context. If we have one, we pull an item out of the cart and use that as our key into the model. The method returns null if the recommender has nothing to recommend in the given context.

Most runtime recommenders are quite simple like this. Our goal is to deliver recommendations quickly, so we really want just one or a very small number of model lookups and as little actual computation as possible at run time. Most of the heavy lifting should have been done at model build time.

The full source code for this recommender is available in the reclab-tutorial project.

Running the Recommender

Now that we have a recommender, it would be nice to be able to run it and see what kind of results it produces. Luckily, RecLab Core includes a tool for doing just that in the form of a class called RunRecommender.

We'll move through this a little more quickly than we went through some of our earlier classes, so if you haven't done so already, now is probably a good time to check out the full source code of the reclab-tutorial project with

# svn co http://code.richrelevance.com/svn/reclab/reclab-tutorial/trunk reclab-tutorial

Returning to running the recommender, the

# mvn package

command that we used earlier to build a jar with all our dependencies actually builds RunRecommender into the jar as well. If you just typed in the code above, you will need to run mvn package again to build a new jar. You can then run

# java -cp target/reclab-tutorial-0.1-jar-with-dependencies.jar org.reclab.core.recommender.local.RunRecommender

You should get a message telling you that you don't have the right number of command line arguments.

Usage: RunRecommender recommenderClassName modelDir dataCacheDir [sourceFile]

RunRecommender takes three required command-line arguments. The first is the fully qualified name of the recommender class we want to test, and the second is the directory where our models can be found. The third is a directory containing additional reference information relevant to the recommender, represented as a DataCache instance; currently this holds information about the merchant's product catalog. See the javadoc for additional details. There is a fourth, optional argument that specifies the source of events to make recommendations against. If it is not given, events are read from standard input.

Assuming the model we built earlier is still in the ~/tmp/models directory, we can run RunRecommender as follows

# java -cp target/reclab-tutorial-0.1-jar-with-dependencies.jar \
  org.reclab.core.recommender.local.RunRecommender \
  org.reclab.tutorial.cppurchase.CpPurchaseRecommender \
  ~/tmp/models src/test/resources/LocalDataCache src/test/resources/smallViewLog.yaml

The src/test/resources/LocalDataCache directory contains sample catalog data in YAML format, but our simple recommender does not use this data, so you can ignore it for now.

The fourth argument is the name of an input file containing contextual events that we want to make recommendations against. The file we have chosen is smallViewLog.yaml, which can be found in the src/test/resources/ directory of the reclab-tutorial project. It contains a small number of item page views.

RunRecommender works as follows. First, it instantiates a recommender of the class specified on the command line; in our case this is the CpPurchaseRecommender class we just wrote. Next, it opens up the model directory it was given (~/tmp/models in our case), builds a RecommenderRuntime around it, and passes that to our recommender via the initialize(RecommenderRuntime) method we wrote. Finally, it parses each of the events in smallViewLog.yaml into a RecContext, passes it to the getRecs(RecContext) method of our recommender, and prints the results to standard output.
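Conceptually, the flow looks something like the sketch below. This is not RunRecommender's actual source; in particular, LocalRecommenderRuntime and parseContexts are hypothetical stand-ins for RecLab Core internals.

// Illustrative sketch of the RunRecommender flow, not its real source.
public static void run(final String className, final String modelDir,
        final String sourceFile) throws Exception {

    // 1. Instantiate the recommender named on the command line.
    ProductRecommender recommender =
            (ProductRecommender) Class.forName(className).newInstance();

    // 2. Wrap the model directory in a runtime and hand it to the
    //    recommender. LocalRecommenderRuntime is a hypothetical name.
    recommender.initialize(new LocalRecommenderRuntime(modelDir));

    // 3. Replay each event and print whatever comes back.
    //    parseContexts is a hypothetical stand-in for the YAML parsing.
    for (RecContext recContext : parseContexts(sourceFile)) {
        List<ScoredId> recs = recommender.getRecs(recContext);
        System.out.println(recContext + (recs == null ? "" : " -> " + recs));
    }
}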

Before we look at the output, let's look at the input file smallViewLog.yaml. It looks like this:

# User 7701 in session AAAAA views 10001
--- {!!timestamp '2010-03-01T19:00:56.777Z': !RlContext {date: !!timestamp '2010-03-01T19:00:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: AAAAA}, userContext: !User {
    userId: 7701}, productContext: !Product {productId: 10001}}}
# User 7702 in session BBBBB views 10010
--- {!!timestamp '2010-03-01T19:01:56.777Z': !RlContext {date: !!timestamp '2010-03-01T19:01:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: BBBBB}, userContext: !User {
    userId: 7702}, productContext: !Product {productId: 10010}}}
# User 7701 in session AAAAA views 10020
--- {!!timestamp '2010-03-01T19:10:56.777Z': !RlContext {date: !!timestamp '2010-03-01T19:10:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: AAAAA}, userContext: !User {
    userId: 7701}, productContext: !Product {productId: 10020}}}
# User 7703 in session CCCCC views 20001
--- {!!timestamp '2010-03-01T19:11:56.777Z': !RlContext {date: !!timestamp '2010-03-01T19:11:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: CCCCC}, userContext: !User {
    userId: 7703}, productContext: !Product {productId: 20001}}}

Notice that the users and sessions are completely different from the ones in smallPurchaseLog.yaml that we used to build the model. But some of the products these new users are looking at are the same ones we saw before, and those ended up in our model. The output of the run looks like this

ITEM_PAGE,7701,AAAAA,10001 -> [[10001, 1.0], [20001, 0.6666666666666666], [50002, 0.3333333333333333], [50001, 0.3333333333333333], [20002, 0.3333333333333333]]
ITEM_PAGE,7702,BBBBB,10010
ITEM_PAGE,7701,AAAAA,10020
ITEM_PAGE,7703,CCCCC,20001 -> [[20001, 1.0], [10001, 0.6666666666666666], [20002, 0.3333333333333333]]

There is one line for each input event, with a short representation of some of the event's key attributes. Two of the lines also have recommendations, which appear after the ->. The two events with recommendations are the ones that passed some of the context tests we wrote into CpPurchaseRecommender. In particular, they have product contexts for products in our model, so we were able to look them up and get recommendations. The other two events also had product contexts, but for products we had never seen before, so we had no idea what to recommend. A full-fledged recommender, as opposed to the tutorial one we have built here, might have other fallbacks for things to recommend, possibly drawn from a different model built alongside the one we wrote at model build time.
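For example, a production version of getRecs(RecContext) might end with a fallback lookup instead of returning null. This is only a sketch: the popularity model reader and its key below are invented for illustration and are not part of the tutorial code.

// Hypothetical fallback at the end of getRecs(RecContext): if neither
// the product context nor the cart matched, fall back to a site-wide
// popularity model built alongside the main model. Both the
// popularityModelReader field and POPULAR_OVERALL_KEY are invented.
List<ScoredId> fallback = popularityModelReader.getScores(POPULAR_OVERALL_KEY);
if (fallback != null) {
    return fallback;
}

return null;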

Finally, for completeness, we can look at how the recommender reacts to the other kind of events it is designed to react to, namely cart contexts. We have another input file src/test/resources/smallCarts.yaml that contains the following:

# User 7702 in session BBBBB views 10010, and already has 50001 in their cart
--- {!!timestamp '2010-03-01T19:01:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 50001, quantity: 1, 
        unitPrice: 1999}]}, date: !!timestamp '2010-03-01T19:01:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: BBBBB}, userContext: !User {
    userId: 7702}, productContext: !Product {productId: 10010}}}
# User 7703 in session DDDDD views 10010, and already has 50099 in their cart
--- {!!timestamp '2010-03-01T19:01:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 50099, quantity: 1, 
        unitPrice: 999}]}, date: !!timestamp '2010-03-01T19:01:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: DDDDD}, userContext: !User {
    userId: 7703}, productContext: !Product {productId: 10010}}}
# User 7704 in session QQQQQ views 10001, and already has 50001 in their cart
--- {!!timestamp '2010-03-01T19:02:56.777Z': !RlContext {cartContentsContext: !Cart {lineItems: [{productId: 50001, quantity: 1, 
        unitPrice: 1999}]}, date: !!timestamp '2010-03-01T19:02:56.777Z',
  eventType: ITEM_PAGE, sessionContext: !Session {sessionId: QQQQQ}, userContext: !User {
    userId: 7704}, productContext: !Product {productId: 10001}}}

There are three item page events, similar to the ones in smallViewLog.yaml, but in each case the shopper already has an item in their cart, as indicated by the cartContentsContext field. In the first two entries, the shopper is looking at item 10010, which is not in our model. In the first case, they have 50001 in the cart, which is in our model, and in the second they have 50099, which is not. In the third entry, the shopper is viewing 10001 but has 50001 in their cart. Both are in our model, but the code is written so that 10001 is chosen.

The output when we run

# java -cp target/reclab-tutorial-0.1-jar-with-dependencies.jar \
  org.reclab.core.recommender.local.RunRecommender \
  org.reclab.tutorial.cppurchase.CpPurchaseRecommender \
  ~/tmp/models src/test/resources/LocalDataCache src/test/resources/smallCarts.yaml

is

ITEM_PAGE,7702,BBBBB,10010 -> [[50002, 1.0], [20002, 1.0], [50001, 1.0], [10001, 1.0]]
ITEM_PAGE,7703,DDDDD,10010
ITEM_PAGE,7704,QQQQQ,10001 -> [[10001, 1.0], [20001, 0.6666666666666666], [50002, 0.3333333333333333], [50001, 0.3333333333333333], [20002, 0.3333333333333333]]

as we would expect. In the first case we keyed off the cart item, in the second we had no match, and in the third the item being viewed trumped the item in the cart.

You may wish to experiment with the CpPurchaseRecommender code and change how it decides where and how to obtain its seed product. You can then run it on the provided input files, or experiment by changing those files around as well.

The model builder and recommender we have now written are all that is needed to run the exact same algorithm in the cloud using real data and live traffic. The model build environment and run-time environment will be completely different cloud-based implementations, but as long as your code uses the RecLab Core APIs as we have been doing, it is ready to be run in the cloud. See RichRelevance and RecLab for more details on this program.

Exercises for the Reader

Conditional Probability of Browse

The model we just built was a "people who bought also bought" model. To test your knowledge of how the model build process works, try writing a variant that does "people who viewed also viewed."

In writing this new model builder, you will probably want to group views together by session ID rather than by user ID, especially when the data you are working with spans a long interval of time and a shopper may have had several distinct sessions.

In order to get session IDs, you will want to use the RecContext.getSessionContext() method instead of the RecContext.getUserContext() method we used when we wanted the user context. You will also have to make a few changes because session IDs are strings, whereas user IDs were integers.
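A sketch of the session-keyed mapper might look like the following. The emit call, the EventType enum, and the SessionContext accessor names are illustrative guesses; match them to the Mapper interface and context classes you used for PurchaseByUserMapper.

// Illustrative sketch of a ViewBySessionMapper. Note the String key:
// session IDs are strings, where user IDs were integers.
public void map(final RecContext recContext) {

    // Only item page views count for this model; the EventType enum
    // name here is an assumption.
    if (recContext.getEventType() != EventType.ITEM_PAGE) {
        return;
    }

    SessionContext sessionContext = recContext.getSessionContext();
    ProductContext productContext = recContext.getProductContext();

    if (sessionContext != null && productContext != null) {
        // emit(String, int) stands in for however your mapper writes output.
        emit(sessionContext.getSessionId(), productContext.getProductId());
    }
}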

To test your new model builder, you will want to use a larger and more varied data set than the simple example above. At a minimum, you will want to try out your code with an input that actually includes item page views. Here is a sample to get you started.

--- {!!timestamp '2010-12-04T16:00:36.583Z': !RlContext {date: !!timestamp '2010-12-04T16:00:36.583Z',
    eventType: ITEM_PAGE, productContext: !Product {productId: 193175680}, sessionContext: !Session {
      sessionId: S3}, userContext: !User {userId: 57003}}}
--- {!!timestamp '2010-12-04T16:00:50.590Z': !RlContext {date: !!timestamp '2010-12-04T16:00:50.590Z',
    eventType: ITEM_PAGE, productContext: !Product {productId: 122401422}, sessionContext: !Session {
      sessionId: S3}, userContext: !User {userId: 57003}}}
--- {!!timestamp '2010-12-04T16:00:15.697Z': !RlContext {date: !!timestamp '2010-12-04T16:00:15.697Z',
    eventType: ITEM_PAGE, productContext: !Product {productId: 117661791}, sessionContext: !Session {
      sessionId: S4}, userContext: !User {userId: 57004}}}
--- {!!timestamp '2010-12-04T16:00:29.573Z': !RlContext {date: !!timestamp '2010-12-04T16:00:29.573Z',
    eventType: ITEM_PAGE, productContext: !Product {productId: 145460275}, sessionContext: !Session {
      sessionId: S4}, userContext: !User {userId: 57004}}}

Better yet, consult the documentation on RecLab data sets and find a data set to test with.

If you get stuck, you can always check out the source code for the reclab-tutorial with

# svn co http://code.richrelevance.com/svn/reclab/reclab-tutorial/trunk reclab-tutorial

and then look at the classes ViewBySessionMapper, AllPairsForStringKeyReducer, and CpBrowseDemo. These are the classes that implement the conditional probability of viewing by session model. You will find that very little is different between these classes and the ones we used for building the purchase model.

Finally, you can write a runtime analogous to CpPurchaseRecommender, but for browsing, and pull various bits of context out to drive recommendations. You can then test with RunRecommender as we did before.

More Examples

Additional example code can be found in the RecLab Examples project.