
Data-Based Lead Prioritization: Using Data, Statistics, and Causal Reasoning

(This is the third part of our three-part series about Data-Based Lead Prioritization.)


There’s a lot to be gained by systematically using your data to prioritize your leads on a rolling basis. Let’s say you have 20,000 prospects in your database and, after handling inbound leads, your Sales Department makes outbound calls to 400 database leads a month. By deploying a statistical model to prioritize your leads, you ensure that your salespeople are focusing on the top 2% of prospects available at any point in time. Better focus means less time wasted on low-probability prospects, no second-guessing the right prospects to go after, reduced cognitive bias, and no “paralysis by analysis.” And as the available data grows, the prioritization improves.

A mathematically derived Lead Prioritization engine will enable your salespeople to:

  1. Focus their efforts on the best leads by sorting leads in order from most likely to ‘succeed’ to least likely.
  2. Understand why a particular prospect is more likely to succeed than others. The same algorithm that determines the priority can also highlight the ‘why,’ which gives your sales staff ideas on how to open the call and provides relevant insights.
  3. Estimate, in certain circumstances, both the value of the lead and the probability that it ‘succeeds.’ Multiply these two numbers and you get the expected value of calling the lead. Knowing this information can help motivate sales reps, build a business case for hiring more reps, or provide the impetus for new tactics.
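
For example, here is a minimal sketch of that expected-value calculation in Python (the lead names, values, and probabilities are invented for illustration):

```python
# Expected value of calling a lead = P(success) * value of the lead.
# All numbers below are hypothetical, for illustration only.
leads = [
    {"name": "Acme Co", "p_success": 0.30, "value": 12_000},
    {"name": "Globex",  "p_success": 0.05, "value": 90_000},
    {"name": "Initech", "p_success": 0.55, "value": 4_000},
]

for lead in leads:
    lead["expected_value"] = lead["p_success"] * lead["value"]

# Call the highest expected-value leads first.
for lead in sorted(leads, key=lambda l: l["expected_value"], reverse=True):
    print(f"{lead['name']}: expected value ${lead['expected_value']:,.0f}")
```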

You should know, however, that there is no out-of-the-box, configure-it-yourself, software-as-a-service Data-Based Lead Prioritization tool. The data available to Sales and Marketing departments is always evolving, and the sales process and marketing tools differ from one organization to the next. As a result, only a tailored solution will work.

There are five broad steps to building a lead prioritization engine: 1) Causal Modeling, 2) Data Collection, 3) Feature Extraction, 4) Statistical Modeling and 5) Systems Integration. In the next few sections we touch on each as we explain how to build a lead prioritization engine for your organization.


Causal Modeling

The first step is to create a Causal Model. Set up a meeting between the Sales and/or Marketing managers and the technical team (a data scientist, statistician, economist, or mathematician) to discuss how sales usually unfold. If you have a dedicated trainer who handles CRM training, they are also helpful to include.

The purpose of the meeting is to provide the technical team with the understanding necessary to build a causal model of the sales process. A causal model describes the sales process and relates that description to the data you have in the CRM. Here are some of the questions you may hear:

  • “How does a lead become disqualified?”
  • “What does this status mean?”
  • “Does everybody try to sell everything?”
  • “Are all products available to all customers?”
  • “How do you handle multiple contacts at the same organization?”
  • “When do you inform prospects of pricing?”

The causal model is used to determine the structure of the statistical model.  There’s a great book by Judea Pearl, The Book of Why, that describes more about Causal Modeling.
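
As a toy illustration, a causal model can start as nothing more than a directed graph linking what happens in the sales process to the fields you track in the CRM. The variable names below are hypothetical; the meeting described above would produce your own:

```python
# A toy causal diagram of a sales process, expressed as directed edges
# (cause -> effect). All variable names are hypothetical examples.
causal_edges = [
    ("marketing_campaign", "lead_source"),
    ("lead_source",        "lead_quality"),
    ("industry",           "lead_quality"),
    ("lead_quality",       "appointment_set"),
    ("rep_experience",     "appointment_set"),
    ("appointment_set",    "closed_won"),
    ("pricing_disclosed",  "closed_won"),
]

# Which variables directly influence the outcome we care about?
outcome = "closed_won"
direct_causes = [cause for cause, effect in causal_edges if effect == outcome]
print(f"Direct causes of {outcome}: {direct_causes}")
```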


Data Collection

To build a Statistical Model, you need data. The data comes in two varieties: data that is unique to your company and data about your prospects.  

  1. You need data about your successes and failures – 500 of each is a good start. If you have fewer than 500 data points, building a statistical model is probably overkill.
    1. ‘Success’ could be a ‘closed won’ sale, setting an appointment, or converting the lead to an opportunity.
    2. ‘Failure’ could be a ‘closed lost’ opportunity, a failure to set an appointment, or a disqualified lead. Looking only at your ‘sales won,’ for instance, creates a survivorship bias that is hard to overcome with data.
    3. You also want other private contextual information about each success or failure – which marketing campaign led to the prospect, which rep handled it, which products or services were being sold, etc.
  2. With regard to prospects, you need a large group. In this context, ‘large’ means that the salespeople can only reach a small fraction of the prospects in a month. For each prospect, you will need contact information and whatever contextual data you have acquired about them.
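
As a sketch of what the labeling step might look like in practice – assuming a CRM export with a status column; your field names and status values will differ:

```python
import pandas as pd

# Hypothetical CRM export; column names and status values are placeholders.
crm = pd.DataFrame({
    "lead_id":  [101, 102, 103, 104],
    "status":   ["closed won", "closed lost", "disqualified", "appointment set"],
    "campaign": ["webinar", "cold list", "webinar", "referral"],
})

# Map statuses to a binary outcome; leads still in flight get no label.
success_statuses = {"closed won", "appointment set"}
failure_statuses = {"closed lost", "disqualified"}

crm["success"] = crm["status"].map(
    lambda s: 1 if s in success_statuses else (0 if s in failure_statuses else None)
)
labeled = crm.dropna(subset=["success"])  # keep only resolved outcomes
print(labeled[["lead_id", "success"]])
```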

If you don’t have much information on prospects, the technical team can supplement your data with online information. For instance, you may find that you only have geo-coordinates for a third of your prospects’ businesses, or that you do not know the industry classification for some of your prospects. Maybe you have their websites, but you have not grabbed their keywords or crawled their content for clues about who they are.

‘Getting the data’ could mean connecting to an API and downloading data, purchasing data from a third-party vendor, asking salespeople to categorize competing products or services, or collecting data from public websites. In this context, you want to stick to data that can be obtained on demand. In our experience, asking your sales staff to do more data collection on an ongoing basis does not work unless they already need the information to do their job.
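
For instance, an on-demand enrichment step might look like the sketch below. The endpoint and response fields are entirely hypothetical stand-ins for whatever data vendor or public source you actually use:

```python
import requests

def enrich_prospect(domain: str) -> dict:
    """Fetch supplemental firmographic data for a prospect's domain.
    The URL and response fields are placeholders for your real source."""
    resp = requests.get(
        "https://api.example-data-vendor.com/v1/companies",  # hypothetical endpoint
        params={"domain": domain},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "industry_code":  data.get("naics"),
        "employee_count": data.get("employees"),
        "latitude":       data.get("lat"),
        "longitude":      data.get("lng"),
    }
```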


Feature Extraction

Not all of your data is in the form it needs to be in. Feature extraction is about taking the data you have and converting it into a form best suited for modeling. For instance,

  • corporate titles may need to be passed through a language processor to extract key words or phrases,
  • text crawled from your target prospects’ websites, describing the products and services they offer, may need to be passed through a text-mining algorithm to reduce it to a handful of salient characteristics,
  • a list of industries (SIC or NAICS codes) may need to be grouped, or
  • geo-coordinates may need to be clustered for modeling purposes.

All of these activities fall under the category of feature extraction.  
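
To make a couple of these concrete, here is a brief sketch using scikit-learn, with invented sample inputs – one step reducing crawled text to term weights, another clustering geo-coordinates into regions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Reduce free text (e.g., crawled product descriptions) to salient terms.
docs = [
    "cloud accounting software for small businesses",
    "industrial pumps and valves distributor",
    "managed IT services and cloud hosting",
]
tfidf = TfidfVectorizer(max_features=50, stop_words="english")
text_features = tfidf.fit_transform(docs)  # one row of term weights per prospect

# 2. Cluster geo-coordinates into a handful of regions.
coords = np.array([[40.71, -74.00], [34.05, -118.24], [41.88, -87.63], [40.73, -73.99]])
regions = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(regions)  # a region label per prospect, usable as a model feature
```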


Statistical Modeling & Testing

Once the technical team understands the sales process, has collected the data, and extracted key features, they are ready to build a statistical model that will predict which prospects are most likely to succeed. Here are some things to keep in mind.

There’s no universally ‘best’ statistical model. There are a handful of very good statistical model types that can be used to prioritize leads: logistic classifiers, random forest classifiers, neural net classifiers, and gradient-boosted classifiers are common examples. Which type will work best depends on the problem at hand, whether and how the model needs to be interpreted, how much data you have, the expertise of the people producing the model, and how you expect to deploy the results.
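
As a sketch of how the technical team might compare candidate model types, here is a cross-validated comparison using scikit-learn, with synthetic data standing in for your labeled leads:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your labeled leads: features X, success flag y,
# with roughly a 90/10 failure/success imbalance.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9], random_state=0)

candidates = {
    "logistic":         LogisticRegression(max_iter=1000),
    "random forest":    RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosted": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    # ROC AUC handles the class imbalance better than raw accuracy.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f}")
```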

It may take a few tries before the right statistical model is developed. Once a model looks like it is performing well and is difficult to improve upon, the technical staff will produce sample results (both in-sample and out-of-sample), some basic model documentation, and statistics about the model. The Sales and Marketing staff should consult the documentation, evaluate the results, ask questions and, if necessary, suggest changes to the model. Sometimes the question being answered is different from the question you need answered. For instance, the model may answer the question “What is the likelihood that this prospect will become an opportunity?” when the question you need answered is “What is the likelihood that this prospect will become an opportunity within two weeks of a call?” A change like this has important implications for the model.

Expect to see back-testing results. After a final model is given the thumbs-up, you should know how well it would have worked had it been implemented in the past. A back-test compares model predictions with actual outcomes using past observations. Generally, you want your back-tests to be out-of-sample, meaning the model is built using most of the data and tested on the rest. The out-of-sample test results should be similar to real-world model results.
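
In code, a simple out-of-sample back-test can be as modest as holding out the most recent observations. A sketch, again with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in data, assumed sorted oldest-to-newest.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9], random_state=0)

# Train on the past (first 80%), test on the most recent 20%,
# as if the model had been deployed back then.
split = int(len(X) * 0.8)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
print(classification_report(y[split:], model.predict(X[split:])))
```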

Ask for the confusion matrix. At the core of all these lead prioritization models is a classification model that classifies a lead as either “likely to succeed” or “not likely to succeed”.  The confusion matrix shows how often the classification model has been wrong and looks something like this:


                     Predict: Success    Predict: Failure
Actual: Success      1,000               700
Actual: Failure      800                 14,000

In the testing data, there were 14,800 failures and 1,700 successes in total – a base success rate of roughly 1 in 10. The classifier is not perfect, but it is still quite useful. When failure is predicted, failure follows 19 times out of 20 (14,000 of 14,700). When success is predicted, success follows about 55% of the time (1,000 of 1,800).
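
As a quick check, the rates quoted above fall straight out of the matrix:

```python
import numpy as np

#                        pred success  pred failure
matrix = np.array([[1_000,    700],    # actual success
                   [  800, 14_000]])   # actual failure

base_rate      = matrix[0].sum() / matrix.sum()     # 1,700 / 16,500  ~ 0.10
precision_succ = matrix[0, 0] / matrix[:, 0].sum()  # 1,000 / 1,800   ~ 0.55
precision_fail = matrix[1, 1] / matrix[:, 1].sum()  # 14,000 / 14,700 ~ 0.95
print(f"base: {base_rate:.2f}, success: {precision_succ:.2f}, failure: {precision_fail:.2f}")
```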

Even fantastic models are not perfect. The above model, with a 55% ‘success’ rate, could be a fantastic improvement over the status quo. In this hypothetical, the improvement over the base success rate is a healthy 5x (from roughly 10% to 55%). In general, both false positives and false negatives are costly, and you can tune the model somewhat if you prefer one type of error over the other, as sketched below.
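
One common way to do that tuning is to move the probability threshold at which a lead gets classified as a likely success. A brief sketch:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, as in the earlier sketches.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# The default threshold is 0.5; raising it trades false positives for
# false negatives (fewer, but more reliable, "likely success" calls).
probabilities = model.predict_proba(X)[:, 1]
flagged_default = (probabilities >= 0.5).sum()
flagged_strict = (probabilities >= 0.7).sum()
print(f"Leads flagged at 0.5: {flagged_default}, at 0.7: {flagged_strict}")
```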

Usually, this is enough information to make the decision to implement the model. However, it is occasionally useful to do a beta roll-out to a subset of salespeople in order to fine-tune usability. This beta roll-out, in our opinion, should not be used to evaluate whether the model ‘works,’ but instead to tease out the best way to use the model. Once you know that it would have worked in the past and that it is usable, you’re ready to implement.

Pro-Tip: Don’t bother with a live test to evaluate whether the model works now. First, these tests almost always validate the original model’s results. Second, they are time-consuming – you have to wait for the previous leads to work their way out of the funnel before you get a clear signal on the impact of prioritization. And third, if you separate salespeople into two groups to run the test, live testing can either create tension between the groups or the groups will share information, contaminating the results.

Instead, we suggest you integrate the model into the workflow and evaluate progress monthly or quarterly.


Integrating the Model Predictions into the Workflow

Once you have a tested and trusted statistical model, it’s time to put the predictions to use! Sometimes, if the model is simple, it can be coded directly into a CRM. Usually, however, the numbers are crunched outside of the CRM and the results are pushed back into it. This can be accomplished either with a scheduled importer or by creating a RESTful API that the CRM can talk to. Regardless of the method, it usually makes sense to roll out the model in a two-stage process.
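
As one illustration of the “crunch outside, push back in” pattern, the scores could be exposed through a small REST service that the CRM (or the importer) pulls from. The sketch below uses Flask; the route and payload are illustrative, and the real integration depends on your CRM’s import or webhook capabilities:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In production these would be refreshed by the scheduled scoring job;
# here they are hard-coded placeholders.
LEAD_SCORES = {"101": 0.72, "102": 0.08, "103": 0.41}

@app.route("/scores/<lead_id>")
def get_score(lead_id):
    """Return the latest priority score for a lead, for the CRM to pull."""
    score = LEAD_SCORES.get(lead_id)
    if score is None:
        return jsonify({"error": "unknown lead"}), 404
    return jsonify({"lead_id": lead_id, "score": score})

if __name__ == "__main__":
    app.run(port=5000)
```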

Stage One – Manual Updates

For the first dozen updates (or the first few months), it is helpful to have the technical staff directly engaged in creating the updates. Through trial and error, they may find that they need to make their system more flexible than they previously imagined. Or, after a few updates, the salespeople may have usability suggestions. Over time, however, these changes become far less common, at which point having the technical staff manually run updates creates more cost and risk (updates could be late) than reward.

Stage Two – Automatic Updates and/or creating an API

Once the updating process is mundane and predictable, the technical staff should automate model updates. Since they are no longer directly updating the system, they will need to create reports that ensure the program is continuing to work properly. These reports should:

  1. Check that new data is of comparable quality to previous data.
  2. Report on model health:
    1. Does the model continue to fit the data well?
    2. Do variables deemed to be statistically irrelevant continue to be irrelevant?
    3. Are the model’s recent successful-prediction rates in line with past results?
  3. Report on API usage statistics and uptime.
  4. Compare before and after sales results.
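
As a sketch of what the first two checks might look like in code (column names, thresholds, and numbers are placeholders):

```python
import pandas as pd

def data_quality_report(new_batch: pd.DataFrame, reference: pd.DataFrame) -> dict:
    """Compare a new batch of lead data against a historical reference.
    Here: missing-value rates per column; extend with range checks, etc."""
    report = {}
    for col in reference.columns:
        report[col] = {
            "missing_rate_new": float(new_batch[col].isna().mean()),
            "missing_rate_ref": float(reference[col].isna().mean()),
        }
    return report

def model_health_ok(recent_hits: int, recent_calls: int,
                    baseline_rate: float, tolerance: float = 0.10) -> bool:
    """Flag the model if its recent success-prediction rate drifts more
    than `tolerance` below the back-tested baseline."""
    recent_rate = recent_hits / max(recent_calls, 1)
    return recent_rate >= baseline_rate - tolerance

# Example: the back-test said 55% of predicted successes convert;
# 40 of the last 90 flagged leads converting (~44%) would trip the alarm.
print(model_health_ok(recent_hits=40, recent_calls=90, baseline_rate=0.55))
```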

With this in place, your sales team will always be pursuing the best possible leads! That’s it for our Data-Based Lead Prioritization series. Happy Hunting!