Predicting Lending Rates: An Intro to AWS Machine Learning
“Machine learning for X” is a major trend in the startup space1. CEOs of the largest tech companies have cited machine learning as a strategic component of their strength and future growth2. Jeff Bezos, while highlighting machine learning applications in Amazon’s consumer-facing products, notes the “less visible” but pervasive impact on internal business processes:
At Amazon, we’ve been engaged in the practical application of machine learning for many years now. Some of this work is highly visible: our autonomous Prime Air delivery drones; the Amazon Go convenience store that uses machine vision to eliminate checkout lines; and Alexa, our cloud-based AI assistant. … Much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type – quietly but meaningfully improving core operations.
Jeff Bezos - Letter to Shareholders - April 12, 2017
Let’s imagine you’re in the consumer credit business looking to grow accounts with a particular sector of the market. Savvy consumers have a myriad of choices and their key metric for discriminating among their options will be a personalized rate. The quicker you can provide that rate the better chance you have of converting potential applicants to customers instead of seeing them abandon you for a competitor. Perhaps your loan application process has an element of human review of applicant info and credit data before a rate can be offered to a potential customer. The delay introduced by a human review may be the difference in whether your applicants will stick around or abandon you for another alternative. A machine learning solution that predicts the appropriate rate based on data from all past applications would offer your applicants a more responsive experience and could help your conversions.
Understanding the ins and outs of selecting, applying and tuning machine learning algorithms for your business can be a daunting task. However, the AWS Machine Learning offering may be a suitable entry point for helping your business get or retain an edge. This service packages up some basic machine learning functionality in a scalable, robust solution that integrates with your existing AWS infrastructure. Let’s take a high-level look at how you might use the service to build a machine learning model to predict a personalized rate for your customers in near real-time.
We’re going to use some publicly available loan applicant data from Lending Club. The data set I’ve chosen contains over 420,000 loans with data such as applicant annual income, loan term, state of residence, credit score range and a number of other credit attributes. The dataset also contains a letter grade classification of the loan in the range A to G. We’re going to use AWS machine learning to build two models: one that will predict an appropriate interest rate and a second that will infer the letter grade for the loan application. The general flow and some highlights are presented in this article; refer to the accompanying repo on GitHub for the raw data and some additional details.
AWS machine learning data sources are comma-separated (.csv) data files stored in an S3 bucket. Some integration with Redshift and RDS sources is also available but ultimately those sources will be exporting .csv files to S3. The AWS console Datasource wizard will guide you through selecting your .csv location in S3, correcting the schema it infers for you and identifying which column from your data is the “target” value that your machine learning model is going to try to predict.
After the datasource is created you’ll have access to a few descriptive statistics such as distributions of all the variables in your dataset; here’s a look at the distribution for the interest rate variable we’re trying to predict, where the interest rates are shown on the X axis and the frequency of loans offered at each rate is on the Y axis.
Also provided are the correlation of each variable in your dataset to your target variable and warnings about missing values, which can negatively impact the effectiveness of your predictive model.
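The wizard steps above can also be scripted. The following is a minimal sketch, assuming the boto3 `machinelearning` client is used in place of the console; the column names (`int_rate`, `loan_amnt`, `annual_inc`, `term`, `addr_state`), the datasource id and the S3 path are illustrative assumptions rather than the schema used in the demo.

```python
import json


def build_schema(numeric, categorical, target):
    """Build an Amazon ML data schema for a CSV file that has a header row."""
    attributes = (
        [{"attributeName": name, "attributeType": "NUMERIC"} for name in numeric]
        + [{"attributeName": name, "attributeType": "CATEGORICAL"} for name in categorical]
    )
    return {
        "version": "1.0",
        "targetAttributeName": target,  # the column the model will try to predict
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": attributes,
    }


def create_datasource(ml_client, datasource_id, s3_uri, schema):
    """Register the CSV at s3_uri as a datasource.

    ml_client is expected to be boto3.client("machinelearning").
    """
    return ml_client.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName=datasource_id,
        DataSpec={
            "DataLocationS3": s3_uri,
            "DataSchema": json.dumps(schema),
        },
        ComputeStatistics=True,  # statistics are needed if the datasource will train a model
    )


# Hypothetical column names, chosen only for illustration.
schema = build_schema(
    numeric=["int_rate", "loan_amnt", "annual_inc"],
    categorical=["term", "addr_state"],
    target="int_rate",
)
```

Passing the client in as an argument keeps the sketch free of credentials and makes it easy to try against a stub before pointing it at a real account.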
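If you want a rough preview of these statistics before uploading anything, Pandas can compute comparable figures locally. This is only a sketch on a tiny made-up sample; the column names other than `inq_fi` are assumptions, and the numbers the AWS console reports may be calculated differently.

```python
import pandas as pd

# Tiny made-up sample standing in for the real loan file.
loans = pd.DataFrame({
    "int_rate": [7.5, 12.0, 18.5, 10.0],
    "annual_inc": [90000, 60000, 30000, 75000],
    "inq_fi": [0.0, None, 3.0, 1.0],
})

# Number of missing values per column.
missing_counts = loans.isna().sum()

# Pearson correlation of every other column with the target.
correlations = loans.corr()["int_rate"].drop("int_rate")

print(missing_counts)
print(correlations)
```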
Before we actually create our machine learning model, we can’t avoid the most time consuming and least enjoyable part of data science: cleaning our data3. In our loan data set, several variables have missing values or are not formatted so that we can use them properly. For example, our variables for “number of personal finance inquiries” (inq_fi) and “months since borrower’s last delinquency” (mths_since_last_delinq) are missing values in several rows. Replacing missing values with zero or with the mean of the existing values is common practice, but these decisions should really be driven by knowledge of the data and business domain. For this open dataset I’ve made some judgement calls about when to replace with zero, the mean or the max value. For “number of personal finance inquiries” I’m replacing missing values with zero. For “months since borrower’s last delinquency” I’m using the max value from the dataset, on the assumption that a missing value indicates there has been no delinquency and because this is a continuous/numeric variable rather than a categorical one. We’ll take a look at how to transform certain numeric variables to categorical in the next section.
A close look at the data can also give us some ideas for additional features that might be useful in our predictive model4, a.k.a. “feature engineering”. For example, we have an earliest_cr_line variable for the month the borrower’s earliest reported credit line was opened. However, this data is in the format MMM-YYYY e.g. “APR-2011” and is treated as a categorical variable by default. It might be more useful if we presented it as a numeric “days since first credit line” variable by simply converting the month and year to a number of days prior to today. Another potentially useful feature that’s not present in the data is the ratio of the proposed loan amount to the borrower’s income, even though we have both the loan amount and the borrower’s income separately in the data.
This process can be iterative as we experiment with possible data transformations. The eventual outcome is another data file, this time one that should be better suited for building our predictive model of loan interest rates. The cleaning work done in this step for our demo took the form of a Python script that uses Pandas to read the raw CSV file, scrub the data and save the output; it’s available in the repository. At this point we’ll upload our enhanced data file to S3 and again create a Datasource using the AWS console.
With a datasource built on our scrubbed data file, we’re finally ready to create a machine learning model. The AWS console walks you through the steps but we’ve got a few decisions to make beyond just selecting our datasource. AWS refers to the mix of variables, from those available in the datasource, that will be used to train the model as a “recipe” and offers a default recipe.
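The two replacement rules described above might look like this in Pandas. It is a sketch of the idea on a three-row made-up frame, not the script from the repository; the function name `fill_missing` is hypothetical.

```python
import pandas as pd


def fill_missing(loans):
    """Return a copy of loans with the two judgement-call replacements applied."""
    cleaned = loans.copy()
    # No recorded personal finance inquiries is treated as zero inquiries.
    cleaned["inq_fi"] = cleaned["inq_fi"].fillna(0)
    # A missing delinquency is assumed to mean "never delinquent", so use the
    # largest number of months seen anywhere in the data.
    max_months = cleaned["mths_since_last_delinq"].max()
    cleaned["mths_since_last_delinq"] = cleaned["mths_since_last_delinq"].fillna(max_months)
    return cleaned


raw = pd.DataFrame({
    "inq_fi": [2.0, None, 1.0],
    "mths_since_last_delinq": [12.0, 48.0, None],
})
cleaned = fill_missing(raw)
print(cleaned)
```

Working on a copy keeps the raw frame intact, which helps when you want to compare several replacement strategies side by side.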
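Both engineered features can be sketched in a few lines of Pandas. The column names `loan_amnt` and `annual_inc`, the output column names and the helper `add_features` are assumptions made for illustration. The reference date is passed in explicitly instead of using today so that the result is reproducible.

```python
import pandas as pd


def add_features(loans, as_of):
    """Return a copy of loans with two engineered numeric features added."""
    out = loans.copy()
    # "APR-2011" -> "Apr-2011" so the month abbreviation parses regardless of case.
    opened = pd.to_datetime(out["earliest_cr_line"].str.title(), format="%b-%Y")
    out["days_since_first_credit"] = (as_of - opened).dt.days
    # Proposed loan amount relative to the borrower's annual income.
    out["loan_to_income"] = out["loan_amnt"] / out["annual_inc"]
    return out


sample = pd.DataFrame({
    "earliest_cr_line": ["APR-2011", "JAN-2011"],
    "loan_amnt": [10000, 5000],
    "annual_inc": [50000, 100000],
})
features = add_features(sample, as_of=pd.Timestamp("2011-05-01"))
print(features[["days_since_first_credit", "loan_to_income"]])
```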
However, our loan data set, drawn from existing loans, includes some data we don’t want to incorporate into the model such as the next loan payment, whether the borrower has been late, how the principal was financed by the loan issuer, etc. We only want to use data that will be available when predicting interest rates so only applicant and credit report data is valid for use. We’re going to need to create a custom “recipe” (see custom-model-recipe.json). Check out the AWS docs for details on the recipe format; it’s not the most straightforward and the editor in the AWS console is a little raw from a user experience standpoint but essentially the recipe consists of groups of variables you want the model to consider. There is a validation step to help identify recipe errors as well.
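To give a feel for the recipe format, here is a sketch that assembles a small recipe in Python and serializes it to JSON. The group names and the column lists are hypothetical; the recipe actually used in the demo is the custom-model-recipe.json file mentioned above. The point is that only columns known at application time are listed, so payment and late-status columns never reach the model.

```python
import json

# Hypothetical applicant and credit report columns available at prediction time.
APPLICANT_NUMERIC = ["annual_inc", "inq_fi", "mths_since_last_delinq"]
APPLICANT_CATEGORICAL = ["term", "addr_state"]


def build_recipe(numeric, categorical):
    """Build a recipe that exposes only the listed columns to the model."""
    recipe = {
        # Named groups of variables that can be referenced in the outputs.
        "groups": {
            "NUMERIC_INPUTS": "group({})".format(", ".join(numeric)),
            "CATEGORICAL_INPUTS": "group({})".format(", ".join(categorical)),
        },
        # No intermediate variables are defined in this sketch.
        "assignments": {},
        # Only what appears in the outputs is used for training.
        "outputs": ["NUMERIC_INPUTS", "CATEGORICAL_INPUTS"],
    }
    return json.dumps(recipe, indent=2)


recipe_json = build_recipe(APPLICANT_NUMERIC, APPLICANT_CATEGORICAL)
print(recipe_json)
```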
A custom recipe also allows you to use data transformations that are built into the AWS machine learning service. Here are a couple that we used in our demo:
normalize() - Our loan data contains some variables measured in dollars and others that are counts of open credit lines. However, comparing two applications, we know intuitively that a $10 change in a credit card balance isn’t as significant as 10 additional derogatory remarks on a credit report. The value ranges for these two variables are drastically different and normalize() helps the model avoid giving undue weight to variables which simply have a larger range of values.
quantile_bin() - We also know that a $10 variation in the loan amount is probably not extremely influential in two otherwise equal loan applications. It may be more useful to group some of our numeric data into “bins” e.g. loans between $0 and $10K, $10K and $20K, etc. and then just consider which group the application falls into. The quantile_bin() transformation creates these groups on the fly, and they are then treated as categorical variables rather than numeric.
Next up, the console will prompt for some additional training parameters, such as regularization and how to divide your data into a training set on which the model will learn to make predictions and a test set on which it will evaluate how accurate the predictions are. Refer to the AWS training parameter docs for details, as discussion of these parameters is outside the scope of our demo.
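To make the two transformations concrete, here is a local sketch of what each one does to a column of numbers, using only the Python standard library. In a recipe they are written as expressions such as normalize(var) and quantile_bin(var, N). The functions below are illustrative stand-ins: the service may compute bin boundaries differently, and quantile bins hold roughly equal numbers of rows instead of covering equal dollar ranges.

```python
from bisect import bisect_right
from statistics import fmean, pstdev, quantiles


def normalize(values):
    """Rescale values to have a mean of zero and a variance of one."""
    mean = fmean(values)
    spread = pstdev(values)
    return [(v - mean) / spread for v in values]


def quantile_bin(values, bins):
    """Replace each value with the index of the quantile bin it falls into."""
    cut_points = quantiles(values, n=bins)  # bins - 1 boundaries
    return [bisect_right(cut_points, v) for v in values]


loan_amounts = [1000, 5000, 9000, 12000, 15000, 20000, 25000, 35000]
print(normalize(loan_amounts))
print(quantile_bin(loan_amounts, bins=4))
```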
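For readers who script this step, the same choices map onto a data split and a small dictionary of training parameters. This is a sketch, assuming the boto3 `machinelearning` client; the 70/30 split and every parameter value shown are illustrative assumptions, not the settings used in the demo.

```python
import json


def split_rearrangement(percent_begin, percent_end):
    """Describe which slice of the file a datasource should use."""
    return json.dumps(
        {"splitting": {"percentBegin": percent_begin, "percentEnd": percent_end}}
    )


# Passed as DataRearrangement in the DataSpec when creating each datasource.
TRAIN_SPLIT = split_rearrangement(0, 70)
TEST_SPLIT = split_rearrangement(70, 100)

# Parameter values are passed as strings; these values are only examples.
TRAINING_PARAMETERS = {
    "sgd.maxPasses": "10",
    "sgd.l2RegularizationAmount": "1e-6",
    "sgd.shuffleType": "auto",
}


def create_regression_model(ml_client, model_id, training_datasource_id, recipe_json):
    """Start training a regression model.

    ml_client is expected to be boto3.client("machinelearning").
    """
    return ml_client.create_ml_model(
        MLModelId=model_id,
        MLModelName=model_id,
        MLModelType="REGRESSION",
        Parameters=TRAINING_PARAMETERS,
        TrainingDataSourceId=training_datasource_id,
        Recipe=recipe_json,
    )
```

A model that predicts the loan grade would use the "MULTICLASS" model type in place of "REGRESSION".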
Having stepped through the AWS machine learning model wizard, our newly created model will now be in a pending state for a period of time while AWS trains our model and evaluates it using the portion of our data designated as the “test set”. For the loan data set this typically took around 30 minutes.
After the training and evaluation the AWS console shows us how the interest rate predictions from our model compare with the actual interest rates from our test set. In this case, the model is “pretty good”; it predicts an interest rate that is not off by much for most of the data. It does appear to be skewed high, i.e. the prediction is higher than the actual interest rate more often than not.
The second model we trained for the demo attempts to predict the loan grade rather than the interest rate. In this case our model’s performance is presented as a “confusion matrix” which identifies where our model predicted the correct loan grade and where it went awry. Here the blue diagonal line signals that the model predicts correctly most of the time. The dark orange square at the intersection of G on the left axis and F on the top axis indicates that our model incorrectly predicted a large number of G grade loans as being F. This would prompt some follow-up analysis of the data and/or discussions with domain experts who know what distinguishes between an F and a G grade. Perhaps we need additional data captured in the model or can calculate some additional features that will improve the performance.
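Two simple numbers capture what that comparison shows: how far off the predictions are on average, and how often they land above the actual rate. The sketch below computes root mean square error (RMSE), a common accuracy measure for regression models, and the share of over-predictions on a made-up set of four loans.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between actual and predicted rates."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def share_over_predicted(actual, predicted):
    """Fraction of loans where the prediction is higher than the actual rate."""
    over = sum(p > a for a, p in zip(actual, predicted))
    return over / len(actual)


# Made-up rates, in percent.
actual_rates = [10.0, 12.0, 15.0, 8.0]
predicted_rates = [11.0, 12.0, 17.0, 9.0]

print(rmse(actual_rates, predicted_rates))
print(share_over_predicted(actual_rates, predicted_rates))
```

A share well above one half would match the "skewed high" pattern described above.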
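A confusion matrix is straightforward to build yourself once you have actual and predicted grades side by side. The sketch below uses made-up grades, with rows for the actual grade (the left axis) and columns for the predicted grade (the top axis), and also pulls out the most frequent mistake.

```python
from collections import Counter

GRADES = "ABCDEFG"


def confusion_matrix(actual, predicted):
    """Rows are actual grades, columns are predicted grades."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in GRADES] for a in GRADES]


def most_common_error(actual, predicted):
    """Return ((actual, predicted), count) for the most frequent mistake."""
    errors = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return errors.most_common(1)[0] if errors else None


# Made-up grades for seven loans.
actual_grades = list("AABGGGF")
predicted_grades = list("AABFFGF")

matrix = confusion_matrix(actual_grades, predicted_grades)
for grade, row in zip(GRADES, matrix):
    print(grade, row)
print(most_common_error(actual_grades, predicted_grades))
```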
The performance data offered here helps us determine if our model is suitable for our business needs. Are these predictions good enough for some kind of contingent loan approval that might keep customers from looking to the next option in their search results? The answer is of course a business decision but if they are not, we may need to iterate again through some analysis of the data and work to improve the quality of the model.
If we are at an acceptable level of predictive performance, the next step is integrating the model with our business applications and systems. There are two approaches for using our trained machine learning model.
- Batch Predictions - We can generate predictions on a bulk set of data at once and save the results back out to S3. This option has the benefit of saving costs since we are using our AWS resources efficiently.
- Real-Time Predictions - In this approach our trained machine learning model is kept available for applications to invoke and make real-time predictions. For example, as a credit-seeking loan applicant perhaps I can get a real-time contingent approval based on the prediction of the trained machine learning model. Details on making real-time predictions, including sample request/response payloads, can be found at Creating a Real-time Endpoint in the AWS documentation.
The AWS machine learning service does offer a fairly easy entry point into harnessing the power of machine learning for your business processes. While you won’t be spared the work of capturing and probably some amount of scrubbing your data, you do get to leverage a scalable and affordable machine learning regression pipeline and enable API endpoints that can be easily integrated with your business systems to improve their intelligence and ultimately provide better value for internal or external customers.
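Both approaches can be driven from code as well as from the console. The sketch below assumes the boto3 `machinelearning` client and that a real-time endpoint has already been created for the model; the helper names, ids and field names are hypothetical.

```python
def start_batch_prediction(ml_client, batch_id, model_id, datasource_id, output_uri):
    """Score every row of a datasource and write the results under output_uri in S3."""
    return ml_client.create_batch_prediction(
        BatchPredictionId=batch_id,
        BatchPredictionName=batch_id,
        MLModelId=model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=output_uri,
    )


def predict_rate(ml_client, model_id, endpoint_url, application):
    """Ask the real-time endpoint for a rate for one application."""
    # Record values are sent as strings, whatever their original type.
    record = {name: str(value) for name, value in application.items()}
    response = ml_client.predict(
        MLModelId=model_id,
        Record=record,
        PredictEndpoint=endpoint_url,
    )
    # Regression models return their estimate as predictedValue.
    return response["Prediction"]["predictedValue"]
```

For the loan grade model, the answer would be read from the predictedLabel field of the same response.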
2 See references to machine learning in the most recent shareholder letters from Microsoft and Alphabet
3 Per Forbes in “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”