Tackling the Human Cost of R&D Tax Credit Qualification with AI-Powered Models

Co-authored by Edward Kung.

The Gusto R&D Tax Credits product helps clients identify and maximize the R&D tax credits they qualify for. One component of this process is calculating the percentage of their general ledger (GL) accounts that qualifies for tax credits under IRS rules, as well as categorizing them into a defined set of categories.
Currently, these percentages are calculated by a team of quantitative tax associates (quants) under the guidance of a CPA.
A sample GL looks like this:
+---------------------------------+---------------------------------------------------+-----------+
| Account Name                    | Expense Description                               | Amount($) |
+---------------------------------+---------------------------------------------------+-----------+
| Loan to Shareholder             | move s/h dist to s/h loans no dist w/negative r/e |  77472.55 |
| Brand Program                   | move 1/3 salaries to brand program asset          |     70158 |
| PAYROLL EXPENSES                | Payroll                                           |  20078.04 |
| Closed- First Republic Checking | System-recorded deposit for QuickBooks Payments   |     13300 |
| Wages-Employees                 | To correct coding of September monthly payroll    |  13012.64 |
| Undeposited Funds               | Paid via QuickBooks Payments                      |     10300 |
| Wages-Employees                 | To correct coding of June monthly payroll         |   9593.34 |
+---------------------------------+---------------------------------------------------+-----------+
For tax year 2022, the quant team reviewed general ledger accounts totaling several hundred million dollars and determined that roughly 10% of that amount qualified for the R&D tax credit.
Problem: The Human Cost
A GL account contains 2,840 expenses on average, and only a small subset of those expenses qualifies for R&D. This means that most review time is spent looking at expenses that don't qualify at all, and in many cases it is infeasible to manually review all of them.
To set a baseline for how time-consuming this problem can be, here are averages from Gusto's subset of the R&D Tax Credits customer base spanning 2 consecutive years:
- Average number of GL accounts per customer: 25
- Average number of expenses per account: 2,840
- Average number of qualifying expenses per account: 39
- Non-qualifying expenses: 98.6%

We also ran an internal study to measure the time spent manually reviewing the expenses of a GL:
- Time to process 100 expenses: 15 seconds
- Time spent on an account of average size: (2,840 / 100) × 15 = 426 seconds
- Time spent on non-qualifying expenses per customer: 0.986 × 426 × 25 ≈ 10,500 seconds ≈ 3 hours
Clearly, most of this time is spent on expenses that aren't even qualified for R&D. Much of it could be saved by a system that highlights the expenses that are more likely to qualify than others.
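The back-of-the-envelope arithmetic above can be written out as a short script using the reported averages:

```python
# Back-of-the-envelope estimate of manual review time, using the averages above.
ACCOUNTS_PER_GL = 25            # average GL accounts per customer
EXPENSES_PER_ACCOUNT = 2_840    # average expenses per account
NON_QUALIFYING_SHARE = 0.986    # share of expenses that do not qualify
SECONDS_PER_100_EXPENSES = 15   # measured manual review speed

seconds_per_account = (EXPENSES_PER_ACCOUNT / 100) * SECONDS_PER_100_EXPENSES
seconds_on_non_qualifying = NON_QUALIFYING_SHARE * seconds_per_account * ACCOUNTS_PER_GL

print(f"Per average account: {seconds_per_account:.0f} seconds")  # ~426 s
print(f"On non-qualifying expenses, per customer GL: {seconds_on_non_qualifying:.0f} "
      f"seconds (~{seconds_on_non_qualifying / 3600:.1f} hours)")  # ~10,500 s, ~3 hours
```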
Approach
We decided to use a Machine Learning (ML) model with two goals:
- To accelerate the GL review process by automatically surfacing and highlighting the accounts and expenses that are most likely to qualify for R&D tax credits
- To increase the accuracy of the GL review process, since highlighting the likely qualifiers makes them less likely to be missed during manual review
It is important to note the non-goals of this initial model:
- Assign a qualification percentage to expenses
- Determine the expense category
- Serve as a final judgment for whether or not an expense qualifies for R&D tax credits, without human review
Model Development
Data & Preparation
Target Variable
There are two models, one at the account level and one at the expense level. At the expense level, the target variable is whether or not the expense qualifies for R&D tax credits. At the account level, the target variable is whether or not the account contains a qualified expense.
The choice to train two models reflects the quants' workflow. When estimating credits for a client, the quant first sees a list of GL accounts. If the quant decides that an account is likely to contain qualified expenses, they can enter a detailed view of the account and qualify individual expense line items. Thus, the account-level model assists quants by surfacing the accounts most likely to contain qualified expenses; then, when the quant enters the detailed view, the expense model surfaces the line items most likely to qualify.
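To make this concrete, here is a minimal sketch of how the two models can be chained in that workflow. The scoring functions stand in for the trained classifiers; they are illustrative assumptions, not the production code.

```python
# Illustrative two-stage ranking that mirrors the quants' workflow.
# `score_account` and `score_expense` stand in for the trained account-level
# and expense-level models; each takes a record and returns the probability
# that it qualifies (or contains a qualified expense).

def rank_accounts(accounts, score_account):
    # Surface the GL accounts most likely to contain a qualified expense.
    return sorted(accounts, key=score_account, reverse=True)

def rank_expenses(expenses, score_expense):
    # Within an account's detailed view, surface the line items most likely to qualify.
    return sorted(expenses, key=score_expense, reverse=True)
```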
Input Variables
Expense Model: account_name (string), expense_description (string), account_type (string), user_provided_industry (string)
Account Model: account_name (string), account_type (string), user_provided_industry (string), avg_score (float)
avg_score: The average score of the expenses in the account, as produced by the expense model. The account model therefore depends on the expense model.
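For reference, these inputs can be summarized as simple record types. This is a sketch only; the field names mirror the variables listed above, not necessarily the production schema.

```python
from dataclasses import dataclass

@dataclass
class ExpenseFeatures:
    account_name: str
    expense_description: str
    account_type: str
    user_provided_industry: str

@dataclass
class AccountFeatures:
    account_name: str
    account_type: str
    user_provided_industry: str
    avg_score: float  # average expense-model score across the account's expenses
```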
Model Data Window
We utilized data from customers we serviced during the tax years 2021 through 2023.
The training data for the accounts model included all customer accounts with a positive balance.
The training data for the expense model included only expenses with a positive dollar amount, and only expenses from accounts that had at least one qualified expense. These sample selection criteria were chosen to save on training time, because the number of non-qualified expenses in non-qualifying accounts is orders of magnitude larger.
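A pandas sketch of those selection rules; the column names (balance, amount, account_id, is_qualified) are illustrative assumptions, not the production schema:

```python
import pandas as pd

def select_training_data(accounts_df: pd.DataFrame, expenses_df: pd.DataFrame):
    # Account model: all customer accounts with a positive balance.
    account_train = accounts_df[accounts_df["balance"] > 0]

    # Expense model: only expenses with a positive dollar amount, and only from
    # accounts that contain at least one qualified expense.
    qualified_accounts = expenses_df.loc[expenses_df["is_qualified"], "account_id"].unique()
    expense_train = expenses_df[
        (expenses_df["amount"] > 0)
        & (expenses_df["account_id"].isin(qualified_accounts))
    ]
    return account_train, expense_train
```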
Methodology
Key Model Evaluation Metrics
Both the account and the expense models are binary classifiers. They can therefore be evaluated according to the standard suite of metrics for binary classifiers: accuracy, precision, recall, AUC, etc.
As a reminder, the precision of a binary classifier measures how accurate the model’s positive predictions are. How many of the model’s positive predictions are actually qualified? Recall, on the other hand, measures how many of the truly qualified items the model identifies as positive. Generally, there is a tradeoff between precision and recall. If the model’s threshold for marking an item as qualified is high, then the model will have higher precision but lower recall, and if the model’s threshold is low then the model will have lower precision but higher recall.
The tradeoff between precision and recall can be plotted as a curve, which shows the precision and recall at various choices of the model threshold. During model development, the primary means of model selection was to plot the precision-recall curve for each candidate and select the model with the highest precision at all recall levels. (Generally, we have found that step improvements to the model result in higher precision at every level of recall. In some cases of hyperparameter tuning, some hyperparameter choices resulted in better precision at certain recall levels and worse precision at others. In these cases, we chose the model with better precision at high levels of recall, around 80% recall.)
[Figure: Precision-recall curve for the Expense (LineItem) Model]
[Figure: Precision-recall curve for the Account Model]
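Curves like the ones above can be produced with scikit-learn. A minimal sketch, where y_true holds the holdout labels and y_score the model's predicted probabilities (variable names are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve

def plot_pr_curve(y_true, y_score, label):
    # precision_recall_curve evaluates precision and recall at every score threshold.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    plt.plot(recall, precision, label=f"{label} (PR-AUC = {auc(recall, precision):.2f})")

# Usage sketch: overlay candidate models and prefer the one with higher precision
# across recall levels, paying particular attention to ~80% recall.
#   plot_pr_curve(y_valid, scores_model_a, "candidate A")
#   plot_pr_curve(y_valid, scores_model_b, "candidate B")
#   plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend(); plt.show()
```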
Validation Holdout
20% of the data was used as a validation holdout sample. The data was split by company so that the same company cannot show up in both the training and the validation samples.
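A sketch of such a company-level split using scikit-learn's GroupShuffleSplit, assuming a company_id column identifies the company for each row:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def company_level_split(df: pd.DataFrame, group_col: str = "company_id"):
    # Hold out 20% of the data, grouped by company, so the same company never
    # appears in both the training and validation samples.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, valid_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[valid_idx]
```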
Model method(s)
We used an XGBoost binary classifier for both models. We chose XGBoost because, in previous testing with the wages qualification model, it performed best compared to other classifiers.
Variable Transformation
In the account model, account_name is passed through a pre-trained language model to obtain embeddings that represent the semantic meaning of the account name. The specific model chosen was multi-qa-MiniLM-L6-cos-v1, a BERT-based model for semantic search.
In the expense model, account_name and expense_description are concatenated and passed through the BERT model.
It should be noted that in neither case do we re-train the BERT model weights. We take the weights as given. Only the parameters of the XGBoost model are being trained.
user_provided_industry is transformed into a one-hot encoded categorical variable.
avg_score, which is only used in the account model, does not come from the raw data. It is calculated by passing each non-negative expense in the account through the expense model and then averaging the resulting scores.
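Putting the transformations together, here is a minimal sketch of the expense-model feature pipeline: frozen sentence embeddings for the concatenated text, a one-hot encoding for industry, and an XGBoost classifier on top. The is_qualified label column and the DataFrame passed in are illustrative assumptions, not the production code.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import OneHotEncoder

# Frozen pre-trained embedding model; its weights are taken as given, not re-trained.
encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

def build_expense_features(df: pd.DataFrame, industry_encoder: OneHotEncoder) -> np.ndarray:
    # Concatenate account name and expense description, then embed the text.
    text = (df["account_name"] + " " + df["expense_description"]).tolist()
    embeddings = encoder.encode(text)  # shape (n_rows, 384) for this MiniLM model
    # One-hot encode the user-provided industry.
    industry = industry_encoder.transform(df[["user_provided_industry"]])
    return np.hstack([embeddings, industry])

def train_expense_model(train_df: pd.DataFrame):
    # `train_df` is assumed to contain the input columns above plus an
    # `is_qualified` label column. Only the XGBoost parameters are trained
    # on top of the fixed embeddings.
    industry_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit(
        train_df[["user_provided_industry"]]
    )
    X_train = build_expense_features(train_df, industry_encoder)
    model = xgb.XGBClassifier(objective="binary:logistic", eval_metric="logloss")
    model.fit(X_train, train_df["is_qualified"])
    return model, industry_encoder
```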
Model Results
Account Model
On test data, the account model achieved an accuracy of 95.6%, with a precision score of 67.1% and a recall score of 45.9%. The ROC AUC score was 96.2%. It should be noted that the training and testing samples were heavily imbalanced, with 94% of accounts being negative (not containing qualified expenses) and 6% being positive (containing qualified expenses).
Examples of high scoring account names include:
- Research & Development
- Software & Web Hosting Expenses
- Contractors — R&D
- R&D Consultants
- R&D — Testing Material
Examples of low scoring account names include:
- Payroll Tax Expenses
- Advertising & Marketing
- Travel Expenses
- Facilities & Office Expense
- Benefits: Employee Medical
Expense Model
The expense model achieved an accuracy of 74.4%, with a precision score of 79.4%, a recall score of 38.2%, and a ROC AUC score of 83.8%.
Examples of high scoring expense descriptions include:
- Amazon Web Services
- Heroku
- GitHub
- Med-Logics Inc.
- Orange Coast Pneumatics
Examples of low scoring expense descriptions include:
- Slack
- Intuit QuickBooks
- PayPal
- 1Password
- Adobe
Manual Review
One of the data science challenges we encountered while building this model is that the underlying training data may not be 100% reliable. The data is generated during R&D credit calculation, in which a quantitative tax associate determines whether or not an expense qualifies for the R&D tax credit based on the company's data. Because the quant is working under a time constraint and may have to review a large number of expenses, it is possible that they miss one or more qualified expenses. This is akin to a type 2 error (false negative) in the training data, because an expense that should have been qualified is not; it then shows up as a type 1 error (false positive) for the model, because the model predicts that the expense qualifies while the label says it does not.
Another data quality issue occurs when accounts or expenses that normally would not qualify are marked as qualified because of additional information the quant receives from the customer. This information is not available to the model at inference time and leads to type 2 errors (false negatives) for the model. However, it is not clear that we would want the model to score these expenses highly, since their qualification relied on idiosyncratic information from the customer.
To get a sense of the data quality, we conducted a manual review exercise using an earlier version of the model. We used the model to generate scores for all accounts and expenses in both the training and validation datasets. We then flagged accounts and expenses with a high log-loss; that is, we flagged the items with a large disagreement between what the model thinks and what the data says.
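A sketch of that flagging step: compute a per-item log loss between the label and the model score, and take the highest-loss items (the largest disagreements) for review. The column names and the cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def flag_disagreements(df: pd.DataFrame, top_n: int = 200) -> pd.DataFrame:
    # `df` is assumed to have a binary `is_qualified` label and a `model_score`
    # probability column. Per-item log loss is large when the model and the
    # label strongly disagree.
    eps = 1e-15
    p = df["model_score"].clip(eps, 1 - eps)
    y = df["is_qualified"].astype(float)
    per_item_log_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return df.assign(log_loss=per_item_log_loss).nlargest(top_n, "log_loss")
```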
We then passed these high log-loss examples to the quant team for manual review. If, upon review, the quant agreed with the model rather than the data, we asked them to relabel the item. Altogether, 85% of the accounts that the model thought should have qualified but were labeled otherwise were relabeled, as were 28% of such expenses. This suggests that the underlying data contains a significant number of missed qualified expenses, which we hope the model will help find and capture more easily.
The relabeled data was used in training the final version of the model.
Model Application in Product
Model System Process Flow
The following system process flow diagram shows how data flows from the end user to our application and to the ML model server before being surfaced to the internal ops team.
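As a purely hypothetical sketch of the interface between the application and the ML model server, the exchange looks roughly like this, assuming a simple HTTP scoring endpoint (the framework, endpoint name, and payload fields are illustrative, not the production service):

```python
# Hypothetical sketch of the ML model server's scoring interface, using FastAPI
# for illustration only; not the production architecture.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ExpenseScoreRequest(BaseModel):
    account_name: str
    expense_description: str
    account_type: str
    user_provided_industry: str

class ScoreResponse(BaseModel):
    score: float  # probability that the line item qualifies for R&D tax credits

@app.post("/score/expense", response_model=ScoreResponse)
def score_expense(req: ExpenseScoreRequest) -> ScoreResponse:
    # In production this would featurize the request and call the trained
    # expense model; a placeholder score is returned here.
    return ScoreResponse(score=0.0)
```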
User Interaction with the model
The model is used to sort and color-code accounts and expenses according to their model scores. Accounts and expenses are colored green if their model score exceeds a threshold; otherwise they keep the default gray color. The goal of the model is therefore not to definitively say whether an account or expense is qualified, but to surface to the quants' attention the items that are most likely to qualify.
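The highlighting rule itself is simple. A minimal sketch, where the default threshold value is a placeholder rather than the production setting:

```python
def highlight_color(score: float, threshold: float = 0.5) -> str:
    # Color an account or expense green when its model score exceeds the
    # threshold; otherwise keep the default gray. The 0.5 default here is a
    # placeholder, not the production threshold.
    return "green" if score > threshold else "gray"
```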
The threshold was chosen to balance the number of items that would be colored green (and thus likely to be reviewed by a quant) against the likelihood that reviewing those items would actually yield qualifying expenses. The exact value of the threshold was chosen by discussing these tradeoffs with key stakeholders, including the quant and data science teams.
Conclusion
There has been much hype about generative AI, but the expense prioritization model is an example of how large language models can also be used in a non-generative way: as an upstream feature engineering step within an otherwise standard machine learning problem (e.g., binary classification). We believe that using semantic embeddings as features for supervised learning is a promising approach that can find many more applications within Gusto products. As the size of our datasets grows and as language models develop, we also believe the accuracy of these models will improve over time.
However, an ongoing challenge is ensuring the quality of training data. This can be done, for example, by encouraging or educating clients to adhere to naming conventions and standards in their GL accounts, or by encouraging the use of software that adheres to those standards. Another challenge is preventing the machine from feeding on itself. As the model becomes an integrated part of the quants’ workflow, future training data will be influenced by model output. Thus, it may be prudent to require a certain percentage of GL accounts to be reviewed without model assistance, in order to preserve some data that is not influenced by model outputs. Although we are not currently requiring such a holdout set, this is something to carefully consider in the future.
Contributors
Shoutout to all the contributors to this work: Shannon Yates, Christopher Jo, David Lam, and Jim Lee.