
Fraud Detection


Fraud detection is relevant to many industries, including banking and the financial sector, insurance, government agencies, and law enforcement. Fraud attempts have increased drastically in recent years, making fraud detection more important than ever. Despite efforts on the part of the affected institutions, hundreds of millions of dollars are lost to fraud every year. Because only a small fraction of cases in a large population involve fraud, finding them can be difficult.

In banking, fraud can involve using stolen credit cards, forging checks, misleading accounting practices, etc. In insurance, 25% of claims contain some form of fraud, which accounts for approximately 10% of insurance payout dollars. Fraud can range from exaggerated losses to deliberately causing an accident for the payout. With so many different methods of fraud, finding it becomes harder still.

Data mining and statistics help to anticipate and quickly detect fraud and take immediate action to minimize costs. Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions. 

An important early step in fraud detection is to identify factors that can lead to fraud. What specific phenomena typically occur before, during, or after a fraudulent incident?  What other characteristics are generally seen with fraud?  When these phenomena and characteristics are pinpointed, predicting and detecting fraud becomes a much more manageable task. 

[Figure: Association Rules graph for Fraud Detection]

Using sophisticated data mining tools such as decision trees (boosting trees, classification trees, CHAID, and random forests), machine learning, association rules, cluster analysis, and neural networks, predictive models can be generated to estimate quantities such as the probability of fraudulent behavior or the dollar amount of fraud. These predictive models help to focus resources in the most efficient manner to prevent or recover fraud losses.

“Fraud” vs. “Erroneous” Claims, Information, etc.

The notion of “fraud” implies intent on the part of some party or individual planning to commit it. From the perspective of the target of that attempt, it usually matters less whether intentional fraud has occurred or whether erroneous information was simply introduced into the credit system or the process for evaluating insurance claims. From the perspective of a credit, retail, insurance, or similar business, the issue is rather whether a transaction that will be associated with loss has occurred or is about to occur, whether a claim can be subrogated or rejected, whether funds can be recovered, and so on.

While the techniques briefly outlined here are often discussed under the topic of “fraud detection”, other terms are also frequently used to describe this class of data mining (or predictive modeling; see below) applications, such as “opportunities for recovery”, “anomaly detection”, or similar terminology.

From the (predictive) modeling or data mining perspective, the distinction between “intentional fraud” and “opportunities for recovery” or “reducing loss” is also mostly irrelevant, except that the specific perspective on how losses occur may guide the search for relevant predictors (and for the databases in which relevant information can be found). For example, intentional fraud may be associated with unusually “normal” data patterns, because intentional fraud usually aims to stay undetected and thus to hide as an average or common transaction. Other opportunities for recovery of loss (other than intentional fraud), however, may simply involve the detection of duplicate claims or transactions, the identification of typical opportunities for subrogation of insurance claims, correctly predicting when consumers are accumulating too much debt, and so on.

In the following paragraphs, the term “fraud” will be used as shorthand for the types of issues briefly outlined above.

Fraud Detection as a Predictive Modeling Problem

One way to approach the issue of fraud detection is to consider it a predictive modeling problem: that of correctly anticipating a (hopefully) rare event. If historical data are available in which fraud or opportunities for preventing loss have been identified and verified, then the typical predictive modeling workflow can be directed at increasing the chances of capturing those opportunities.

In practice, for example, many insurance companies support investigative units that evaluate opportunities for saving money on submitted claims. The goal is to identify a screening mechanism so that the expensive detailed investigation of claims (requiring highly experienced personnel) is applied selectively to claims where the overall probability of recovery (detecting fraud, opportunities to save money, etc.; see the introductory paragraphs) is high. Thus, with an accurate predictive model for detecting likely fraud, the "manual" resources spent on investigating a claim in detail are more likely to reduce loss.

Predicting Rare Events

The approach to predicting the likelihood of fraud as described above essentially comes down to a standard predictive modeling problem. The goal is to identify the best predictors and a validated model providing the greatest Lift to maximize the likelihood that the observations predicted to be fraudulent will indeed be associated with fraud (loss). That knowledge can then be used to reject applications for credit, or to initiate a more detailed investigation into an insurance claim, credit application, purchase via credit card, etc.
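As a rough illustration of the Lift criterion mentioned above, the following sketch compares the fraud rate among the highest-scored cases with the overall base rate. The function name lift_at_top, the scores, and the labels are all made up for illustration; in practice the scores would come from a validated model applied to hold-out data.

```python
def lift_at_top(scores, labels, fraction=0.1):
    """Lift of the top-scored `fraction` of cases over the overall base rate."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no fraud cases")
    # Rank cases from highest to lowest predicted fraud score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / base_rate


# Hypothetical model scores for ten claims; 1 marks a verified fraud case.
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
top_20_lift = lift_at_top(scores, labels, fraction=0.2)
```

A lift well above 1 for a small top fraction indicates that investigating only the highest-scored cases would uncover fraud at a much higher rate than reviewing cases at random.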

As most types of fraud are sporadic events (less than 30% of cases are fraud), stratified sampling can be used to oversample from the fraudulent group. This technique aids in model building: with more cases from the group of interest, data mining models are better able to find the patterns and relationships that indicate fraud.

Depending on the base rate of fraudulent events in the training data, it may be necessary to apply appropriate stratified sampling strategies to create a good data set for model building, i.e., a data file where fraudulent and non-fraudulent observations are represented with approximately equal probability. As described in stratified random sampling, model building is usually easiest and most successful when the data presented to the learning algorithms include exemplars of all relevant classes in about equal proportions.
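One simple way to obtain such a balanced data file is sketched below. It is only one of several possible strategies and rests on stated assumptions: the rarer class is kept whole and the larger class is randomly downsampled to the same size, which has the same effect as sampling the fraud stratum at a much higher rate. The function name balanced_sample and the record layout are hypothetical.

```python
import random


def balanced_sample(records, is_fraud, seed=0):
    """Return a training sample with equal numbers of fraud and non-fraud records.

    The rarer class is kept whole; the larger class is randomly downsampled.
    """
    rng = random.Random(seed)
    fraud = [r for r in records if is_fraud(r)]
    legit = [r for r in records if not is_fraud(r)]
    if not fraud or not legit:
        raise ValueError("both classes must be present")
    n = min(len(fraud), len(legit))
    sample = rng.sample(fraud, n) + rng.sample(legit, n)
    rng.shuffle(sample)  # avoid presenting the classes in blocks
    return sample


# Hypothetical data: 1000 records, of which every 50th is fraudulent (2%).
records = [{"id": i, "fraud": i % 50 == 0} for i in range(1000)]
training = balanced_sample(records, is_fraud=lambda r: r["fraud"])
```

Note that a model trained on a balanced sample sees fraud far more often than it occurs in practice, so its predicted probabilities generally need to be rescaled to the true base rate before they are interpreted as probabilities.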


Fraud Detection as Anomaly Detection, Intrusion Detection

Another use case and problem definition of "fraud detection" presents itself rather as an "intrusion" or anomaly detection problem. Such cases arise when there is no good training (historical) data set that can be unambiguously assembled where known fraudulent and non-fraudulent observations are clearly identified.

For example, consider again a simple insurance use case. A claim is filed against a policy and, given existing procedures (and rule engines, see below), triggers a further investigation; in a small proportion of cases, that investigation results in some recovery for the insurance company. If one were to assemble a training dataset of all claims, some of which were further investigated and led to some recovery or perhaps uncovered fraud, then any modeling of such a dataset would likely capture to a large extent the rules and procedures that led to the investigation in the first place. (However, perhaps a more useful training dataset could be constructed only from those data referred to the investigative unit for further evaluation.) In other common cases, there is no "investigative unit" in the first place, and the data available for analysis do not contain a useful indicator of fraud, potentially recoverable loss, or potential savings.

In such cases, the available information simply consists of a large and often complex data set of claims, applications, purchases, etc., with no clear outcome "indicator variable" that would be useful for predictive modeling (and supervised learning). In those cases, another approach is to effectively perform unsupervised learning to identify in the data set (or data stream) "unusual observations" that are likely associated with fraud, unusual conditions, etc.

For example, consider the typical health insurance case. A large number of very (in fact extremely) diverse claims are filed, usually encoded via a complex and rich coding scheme to capture various health issues and common and "approved" or "accepted" therapies. Also, with each claim there can be the expectation of obvious subsequent claims (e.g., a hip replacement requires subsequent rehabilitation), and so on.

Anomaly Detection

The field of anomaly detection has many applications in industrial process monitoring, to identify "outliers" in multivariate space that may indicate a process problem. A good example of such an application for monitoring multivariate batch processes is discussed in the chapter on Multivariate Process Monitoring for batch processes, using Partial Least Squares methods. The same logic and approach can fundamentally be applied for fraud detection in other (non-industrial-process) data streams.

To return to the health care example, assume that a large number of claims are filed and entered into a database every day. The goal is to identify all claims where reduced payments (less than the claim) are due, including outright fraudulent claims. How can that be achieved?

A-priori rules.

First, there is obviously a set of complex rules that should be applied to identify inappropriately filed claims, duplicate claims, and so on. Typically, complex rules engines are in place that filter all claims to verify that they are formally correct, i.e., consistent with the applicable policies and contracts. Duplicate claims also have to be identified.
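A toy version of such a formal screening step might look as follows. This is a sketch with only three rules (duplicate detection, a known-code check, and a per-code amount limit); the field names policy_id, code, date, and amount, as well as the limits table, are hypothetical stand-ins for whatever the applicable policies and contracts define.

```python
def screen_claims(claims, max_amount_by_code):
    """Split claims into (passed, rejected) using a few simple a-priori rules."""
    seen = set()
    passed, rejected = [], []
    for claim in claims:
        # Two claims with identical key fields are treated as duplicates.
        key = (claim["policy_id"], claim["code"], claim["date"], claim["amount"])
        limit = max_amount_by_code.get(claim["code"])
        if key in seen:
            rejected.append((claim, "duplicate"))
        elif limit is None:
            rejected.append((claim, "unknown code"))
        elif claim["amount"] > limit:
            rejected.append((claim, "amount exceeds policy limit"))
        else:
            passed.append(claim)
        seen.add(key)
    return passed, rejected


# Hypothetical claims and per-code payment limits.
claims = [
    {"policy_id": "P1", "code": "A10", "date": "2024-01-05", "amount": 400.0},
    {"policy_id": "P1", "code": "A10", "date": "2024-01-05", "amount": 400.0},
    {"policy_id": "P2", "code": "B20", "date": "2024-01-06", "amount": 9000.0},
    {"policy_id": "P3", "code": "Z99", "date": "2024-01-07", "amount": 50.0},
    {"policy_id": "P4", "code": "B20", "date": "2024-01-08", "amount": 1200.0},
]
limits = {"A10": 500.0, "B20": 5000.0}
passed, rejected = screen_claims(claims, limits)
reasons = [reason for _, reason in rejected]
```

Only the claims in passed would move on to the pattern-based analysis; the recorded reasons make it possible to audit why each remaining claim was filtered out.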

What remains is a set of formally legitimate claims that nonetheless could (and probably does) contain fraudulent ones. To find those, it is necessary to identify any configurations of data fields associated with the claims that would allow us to separate the legitimate claims from those that are not. Of course, if no such patterns exist in the data, then nothing can be done; however, if such patterns do exist, then the task becomes to find those "unusual" claims.

The usual and unusual.

There are many ways to define what might constitute an "unusual" claim. But basically there are two ways to look at this problem: either by identifying outliers in the multivariate space, i.e., unusual combinations of data fields that are unlike typical claims, or by identifying "in-liers", that is, claims that are "too typical" and hence suspected of having been "made up".

How to detect the usual and unusual.

This task is one of unsupervised learning. The basic data analysis (data mining) approach is to apply some form of clustering (e.g., k-means clustering) and then use the resulting clusters to score (assign) new claims. If a new claim cannot be assigned with high confidence to a particular cluster of points in the multivariate space made up of numerous parameters (the information available with each claim), then the new claim is "unusual", an outlier of sorts, and should be considered for further evaluation. If a new claim can be assigned to a particular cluster with very high confidence, and perhaps if a large number of claims from a particular source all share that characteristic (i.e., are "in-liers"), then again these claims might warrant further evaluation, since they are uncharacteristically "normal".
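The scoring logic described above can be sketched in a few lines. The example makes several simplifying assumptions that are not part of the text: claims are reduced to two numeric features, a plain k-means (Lloyd's algorithm) is used, and the distance to the nearest cluster centroid stands in for assignment confidence, with the 5th and 95th percentiles of the training distances as hypothetical cutoffs for "too typical" and "unusual". All function names and data are illustrative.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def nearest_distance(point, centroids):
    return min(math.dist(point, c) for c in centroids)


def distance_cutoffs(points, centroids, low_q=0.05, high_q=0.95):
    """Empirical quantiles of the training distances to the nearest centroid."""
    d = sorted(nearest_distance(p, centroids) for p in points)
    return d[int(low_q * (len(d) - 1))], d[int(high_q * (len(d) - 1))]


def label_claim(point, centroids, low, high):
    d = nearest_distance(point, centroids)
    if d > high:
        return "outlier"   # unlike any typical group of claims
    if d < low:
        return "in-lier"   # suspiciously close to a cluster center
    return "typical"


# Hypothetical historical claims forming two groups in a two-feature space.
rng = random.Random(1)
history = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
history += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]
centroids = kmeans(history, k=2)
low, high = distance_cutoffs(history, centroids)
far_label = label_claim((1000.0, 1000.0), centroids, low, high)
center_label = label_claim(centroids[0], centroids, low, high)
```

A single "in-lier" is not suspicious on its own, since some legitimate claims will always lie near a cluster center. In line with the text, it is the accumulation of many such claims from one source that would warrant a closer look, so a fuller version would count in-lier labels per source.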

Anomaly detection, intrusion detection.

It should be noted that similar techniques are useful in all applications where the task is to identify atypical patterns in data, or patterns that are suspiciously typical. Such use cases exist in the area of network intrusion detection, as well as in many industrial multivariate process monitoring applications where complex manufacturing processes involving a large number of critical parameters must be monitored continuously to ensure overall quality and system health.


Rule Engines and Predictive Modeling

The previous paragraphs briefly mentioned rule engines as one component in fraud detection systems. In fact, they typically are the first and most critical component: Usually, the expertise and experience of domain experts can be translated into formal rules (that can be implemented in an automated scoring system) for pre-screening data for fraud or the possibility of reduced loss. Thus, in practice, the fraud detection analyses and systems based on data mining and predictive modeling techniques serve as the method for further improving the fraud detection system in place, and their effectiveness will be judged against the default rules created by experts. This also means that the final deployment method of the fraud detection system, e.g., in an automated scoring solution, needs to accommodate both sophisticated rules and possibly complex data mining models.
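To make the idea of judging a model against the expert rules concrete, the sketch below compares three screening strategies on the same set of verified outcomes: the rules alone, a model alone, and the two combined. The outcomes, rule flags, model scores, and the 0.5 threshold are invented for illustration; precision and recall are used here as one plausible pair of yardsticks, not as the only way to run such a comparison.

```python
def precision_recall(flags, outcomes):
    """Precision and recall of a set of flags against verified outcomes."""
    hits = sum(1 for f, o in zip(flags, outcomes) if f and o)
    flagged = sum(1 for f in flags if f)
    positives = sum(1 for o in outcomes if o)
    precision = hits / flagged if flagged else 0.0
    recall = hits / positives if positives else 0.0
    return precision, recall


# Hypothetical hold-out data: 1 marks a verified loss, flags mark referrals.
outcomes = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
rule_flags = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
model_scores = [0.9, 0.2, 0.1, 0.7, 0.3, 0.1, 0.4, 0.2, 0.6, 0.1]

model_flags = [int(s >= 0.5) for s in model_scores]
# Deployment must accommodate both: refer a case if either component flags it.
combined_flags = [int(r or m) for r, m in zip(rule_flags, model_flags)]

rules_pr = precision_recall(rule_flags, outcomes)
model_pr = precision_recall(model_flags, outcomes)
combined_pr = precision_recall(combined_flags, outcomes)
```

In this made-up data the combination catches cases that each component misses on its own, at the cost of more referrals; whether that trade-off pays off depends on the cost of an investigation relative to the expected recovery.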

Text Mining and Fraud Detection

In recent years, text mining methods have been increasingly used in conjunction with all available numeric data to improve fraud detection systems (e.g., predictive models). The motivation is simply to align all information that can be associated with a record of interest (insurance claim, purchase, credit application), and to use that information to improve the predictive accuracy of the fraud detection system. Basically, the approaches described here are applicable in the same way when used in conjunction with text mining methods, except that the respective unstructured text sources first have to be pre-processed and "numericized" so that they can be included in the data analysis (predictive modeling) activities.
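One of the simplest ways to "numericize" text is a bag-of-words representation, in which each document becomes a vector of term counts. The sketch below uses only this basic scheme; the tokenizer, the document-frequency cutoff, and the example adjuster notes are assumptions made for illustration, and real systems typically add steps such as stop-word removal, stemming, or term weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, min_docs=1):
    """Terms that occur in at least `min_docs` documents, in sorted order."""
    doc_freq = Counter(t for doc in documents for t in set(tokenize(doc)))
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)


def numericize(text, vocabulary):
    """Term-count vector for `text`, aligned with `vocabulary`."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


# Hypothetical free-text notes attached to three claims.
notes = [
    "Rear-end collision, whiplash reported",
    "Water damage in kitchen",
    "Whiplash after low-speed collision",
]
vocabulary = build_vocabulary(notes, min_docs=2)
vectors = [numericize(note, vocabulary) for note in notes]
```

The resulting count columns can be appended to the numeric fields of each record and passed to the same predictive modeling or clustering methods described earlier.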