Fraud detection using ML and PySpark framework

Introduction


Publicly available datasets on financial services are scarce, especially in the emerging domain of mobile money transactions. Financial datasets are important to many researchers, and in particular to those of us working on fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which keeps such data from being published.

A synthetic dataset was generated using a simulator called PaySim. PaySim uses aggregated data from a private dataset to generate synthetic transactions that resemble normal operation, then injects malicious behaviour so that the performance of fraud detection methods can be evaluated.

Data cleaning and understanding

The dataset is large, so we explored it thoroughly to find patterns in the data, for example which types of transactions are more likely to be fraudulent.

In addition, real-world data is inherently noisy, and identifying that noise is crucial for building a model that generalizes.

In this project, feature engineering was used for outlier detection.

Hyperparameter tuning in PySpark

Hyperparameters are critical to building robust, high-performing models, and tuning them is generally one of the trickiest parts of an ML pipeline. The goal of hyperparameter tuning is to optimize the model and achieve better performance. The most common approaches, grid search and random search, work in PySpark much as they do in a classical machine learning pipeline (scikit-learn).

To avoid overfitting, different data split strategies and cross-validation were also used. Users can choose whichever approach they prefer for a similar study, or try both to see which gives the better result.

References

PySpark ML Tuning: model selection and hyperparameter tuning.

Project link: https://github.com/tankwin08/PySpark_Fraud_detection_ML

Nifty tech tag lists from Wouter Beeftink