Tuesday 2 June 2009
Fighting Corruption in Brazil Using Data Mining and Bayesian Programming
The Brazilian public sector has been searching for new approaches to combat corruption and bolster good governance. Sound transparency strategies and good governance are key to addressing corruption in any country. One promising tool emerging in Brazil is the cross-referencing of databases to identify problems and irregularities and to monitor the quality of public expenditure; we term this ‘data mining’.
The Brazilian Court of Audit (Tribunal de Contas da União, TCU) audits the accounts of administrators and other persons responsible for federal public funds, assets, and other valuables, as well as the accounts of any person who may cause loss, misapplication, or other irregularities that harm the public treasury. The TCU is a front-line federal fiduciary agency with an oversight mandate that gives it access to the major government databases.
An R&D collaboration between the TCU, INRIA, CNRS and ProBayes is underway to implement an innovative anti-corruption approach based on data mining and Bayesian programming. This approach will help the TCU direct its auditing efforts towards the most sensitive and risky subjects, combining its own prior knowledge with the large amount of data available in the multiple databases that routinely record federal spending.
Unlike most KDD initiatives, this project must deal with a very specific context, defined by the following premises:
Available data has uncertain quality;
Public administration is very complex and is governed by an intricate legal system in which most rules have a significant number of exceptions;
There are no “learning” databases where previously detected corruption cases were registered;
Auditors have an extensive but unstructured knowledge about corruption schemes, business rules and government functioning;
Many factors can determine the importance of an auditing effort: risk, materiality, social impact, political relevance, etc.;
Useful findings must be traceable, meaning that any answer given by the analysis must be traced back to the original data, where the facts concerned can be closely audited.
Most data mining methods are not suitable for this context. Bayesian programming, by contrast, seems to address the major issues involved. Probability theory, considered as an alternative to logic for modelling rational reasoning, seems to be a well-suited mathematical framework for this difficult challenge. In a first step, the databases are used to transform incompleteness into uncertainty. Inference is then used to reason and make decisions based on the probability distributions built from the data. Prior knowledge can be embedded in the models by choosing appropriate variables and by defining conditional probabilities for specific cases where data is not available.
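As an illustration of how prior knowledge and data can be combined, here is a minimal sketch in plain Python. It is not the project's actual model or engine: the Beta-Binomial model, the prior values, the unit names, the record IDs and the scoring rule (posterior rate times amount at stake) are all assumptions made for the example. It shows a unit with no data falling back on the expert prior, and it keeps the IDs of the flagged source records so that each score can be traced back to the original data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UnitEvidence:
    """Evidence gathered from the databases for one government unit (hypothetical)."""
    unit: str
    flagged_record_ids: List[str]  # source records that look irregular (kept for traceability)
    total_records: int             # records examined for this unit
    amount: float                  # spending at stake (materiality)


def posterior_rate(flagged: int, total: int,
                   prior_a: float = 1.0, prior_b: float = 9.0) -> float:
    """Posterior mean of the irregularity rate under a Beta(prior_a, prior_b) prior.

    The prior encodes an assumed expert belief of a 10% irregularity rate.
    With no data (total == 0) the result is simply the prior mean.
    """
    return (prior_a + flagged) / (prior_a + prior_b + total)


def rank_units(evidence: List[UnitEvidence]) -> List[Tuple[str, float, float, List[str]]]:
    """Rank units by expected irregular spending (posterior rate * amount)."""
    scored = []
    for e in evidence:
        rate = posterior_rate(len(e.flagged_record_ids), e.total_records)
        scored.append((e.unit, rate, rate * e.amount, e.flagged_record_ids))
    # Highest expected irregular spending first.
    return sorted(scored, key=lambda s: s[2], reverse=True)


# Made-up example data, not real audit figures.
EXAMPLE = [
    UnitEvidence("A", [], 0, 1000.0),                               # no data: prior only
    UnitEvidence("B", ["rec-17", "rec-42", "rec-58"], 10, 2000.0),  # 3 flagged of 10
    UnitEvidence("C", ["rec-03"], 90, 10000.0),                     # 1 flagged of 90
]

if __name__ == "__main__":
    for unit, rate, score, records in rank_units(EXAMPLE):
        print(f"{unit}: rate={rate:.3f} score={score:.1f} evidence={records}")
```

A real model would involve many more variables and conditional dependencies. The point here is only that the prior dominates when data is scarce, and the data dominates as records accumulate.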
The ProBT engine is being used to implement the probabilistic models. The models and their findings are being validated by the TCU, which is sending its auditors to the government units flagged by the analysis as potential sources of irregularities. This ongoing validation provides valuable feedback for the probabilistic approach.
People involved in the project:
Remis Balaniuk, PhD: Senior Expert at the TCU, Brazil.
Emmanuel Mazer, PhD: CNRS Research Scientist, CEO (PDG) of ProBayes, France.
Pierre Bessière, PhD: CNRS Research Scientist, Research Director at the e-Motion Project, INRIA – Rhône-Alpes, France.