
Chapter 12 Fraud

Can the fabrication of research results be prevented? Can peer review be augmented with automated checking? These questions become more important with the increase in automated submission of data to portals. The potential usefulness of niche modeling for detecting at least some forms of intentional or unintentional 'result management' is examined here.

Benford's Law is a postulated distributional relationship on the frequency of digits [Ben38]. It states that, in a set of random data drawn from a set of random distributions, the frequency of each digit, or combination of digits, follows a logarithmic relationship, as shown in Figure 12.1. Benford's Law, actually more of a conjecture, suggests that the probability of occurrence of a sequence of leading digits d is given by the equation:

$$\mathrm{Prob}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

For example, the probability of the sequence of digits 1,2,3 is given by $\log_{10}\left(1 + \frac{1}{123}\right)$.

The frequency of digits can deviate from the law for a range of reasons, mostly to do with constraints on possible values. Deviations due to human fabrication or alteration of data have been shown to be useful for detecting fraud in financial data [Nig00]. Although Benford's Law holds on the first digit of some scientific datasets, particularly those covering large orders of magnitude, it is clearly not valid for data such as simple time series where the variance is small relative to the mean. As a simple example, data with a mean of 5 and a standard deviation of 1 would tend to have leading digits around 4 or 5, rather than 1. Despite this, it is possible that subsequent digits conform better. A recent experimental study suggested that the second digit was a much more reliable indicator of fabricated experimental data [Die04]. Such a relationship would be very useful on time series data such as geophysical data.
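The digit probabilities implied by the equation can be tabulated directly. The following is a minimal sketch, not code from the book: the function names `benford_prob` and `second_digit_prob` are illustrative, and the second-digit probability is obtained by summing the two-digit sequence probabilities over the nine possible first digits.

```python
import math


def benford_prob(d: int) -> float:
    """Probability that the leading digits of a number form the integer d."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    return math.log10(1 + 1 / d)


def second_digit_prob(d2: int) -> float:
    """Marginal probability that the second digit equals d2 (0 to 9)."""
    # Sum over every two-digit leading sequence d1 d2 with d1 in 1..9.
    return sum(benford_prob(10 * d1 + d2) for d1 in range(1, 10))


first = [benford_prob(d) for d in range(1, 10)]
second = [second_digit_prob(d) for d in range(10)]

print("first digit :", [round(p, 3) for p in first])
print("second digit:", [round(p, 3) for p in second])
print("sequence 123:", benford_prob(123))
```

The second-digit probabilities are much flatter than the first-digit ones, which is one reason a second-digit test can remain usable when the leading digit is constrained by a small variance relative to the mean.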
This paper reports some tests of the second digit frequency as a practical methodology for detecting 'result management' in geophysical data. It also illustrates a useful generalization of niche modeling as a form of prediction based on models of statistical distribution.

© 2007 by Taylor and Francis Group, LLC

FIGURE 12.1: Expected frequency of digits 1 to 4 predicted by Benford's Law.

12.1 Methods

We examined histograms and summary statistics, chi-square tests of deviation from Benford's Law and from a uniform distribution, and the normed distance of digit frequencies for the first and second digits. We also generated a plot of the chi-square values of digit frequencies in a moving window over a time series. This is useful for diagnosing which parts of a series deviate from the expected digit distribution.

Four datasets are tested. These are:

• a simulated dataset composed of random numbers and fabricated data,
• Climate Research Unit (CRU) data, composed of global average monthly temperatures from meteorological stations from 1856 to present,
• tree ring widths drawn from the WDCP paleoclimatology portal, and
• a tidal height dataset, collected both by hand recording and by instrumental reading.

12.1.1 Random numbers

A set of independent and identically distributed (IID) random numbers was generated. I then fabricated data to resemble the random numbers. Below are the results for the first and second digit distributions, with fits on a log-log plot and residuals. The plots show that while the first digit deviates significantly, the second does not (Figure 12.2).

A number of statistics were calculated from the digit frequencies. These include: df, indicating management of results up or down, as might occur when figures in financial records are rounded up or down; the z score; the chi-square value; and the distance of the distribution from the expected one. The first two are from Nigrini [Nig00].
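The chi-square test of second digit frequencies against Benford's Law can be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: the helper names, the two toy datasets (`spread` and `rounded`) and the use of a fixed 0.05 critical value for 9 degrees of freedom in place of a computed p-value are all choices made here to keep the example dependency-free.

```python
import math
import random

# Benford's Law probabilities for the second digit (0 to 9).
BENFORD_SECOND = [
    sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
    for d2 in range(10)
]

# Chi-square critical value for 9 degrees of freedom at the 0.05 level.
CRITICAL_9DF_05 = 16.919


def second_digit(x):
    """Return the second significant digit of x, or None for zero or non-finite x."""
    if x == 0 or not math.isfinite(x):
        return None
    # Scientific notation looks like '1.23400000000000e+02'; index 2 is the second digit.
    return int(f"{abs(x):.14e}"[2])


def chi_square_second_digit(values):
    """Chi-square statistic of observed second digit counts against Benford's Law."""
    digits = [d for d in map(second_digit, values) if d is not None]
    if not digits:
        raise ValueError("no usable values")
    n = len(digits)
    return sum(
        (digits.count(d) - n * p) ** 2 / (n * p)
        for d, p in enumerate(BENFORD_SECOND)
    )


random.seed(0)
# Hypothetical data spanning four orders of magnitude.
spread = [10 ** random.uniform(0, 4) for _ in range(1000)]
# Hypothetical 'managed' data whose second digit is always 0 or 5.
rounded = [random.choice([1.0, 1.5, 20, 25, 300, 350]) for _ in range(1000)]

for name, data in [("spread", spread), ("rounded", rounded)]:
    stat = chi_square_second_digit(data)
    verdict = "deviates" if stat > CRITICAL_9DF_05 else "consistent"
    print(name, round(stat, 1), verdict)
```

A statistic above the critical value indicates that the second digit frequencies differ from the Benford expectation at the 0.05 level; a library routine such as a chi-square survival function could be substituted to obtain the probability P directly.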
On examining the probability of conformance of digit frequency with Benford's Law (P) on the random data, the first digit appears mildly deviant, while the second digit does not (Figure 12.3). The significant deviation of the second digit is correctly identified on the fabricated data as well (Table 12.1).

FIGURE 12.2: Digit frequency of random data.

Another way of quantifying deviation is to sum the norm of the difference between the expected and observed frequencies for each digit (D). The value of D on the second digit of the fabricated data is much higher than the value for the random data. The difference between random and fabricated data in the first digit is less clear.

These results give some confidence that statistical tests of the second digit frequency can detect fabrication in datasets. The second digit method appears more useful on these types of geophysical data than Nigrini's method (df, z), which was developed primarily for detecting results management in financial data.

Figure 12.4 is the result on time series data. The solid line is the significance of the second digit, where p is calculated on a moving window of size 50. The dashed line is a benchmark probability level; values below it indicate deviation from the Benford's Law distribution. Figure 12.5 shows the differenced series. The line dipping below the dashed line shows that the differenced fabricated data are detected by the test of the second digit distribution against Benford's Law, although the results are less clear.

The simulated data consist of a random sequence on both sides, and fabricated data in the center. In the figure above, the fabricated region is clearly ...

FIGURE 12.3: Digit frequency of fabricated data.
FIGURE 12.4: Random data with section of fabricated data inserted in the middle.
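The moving-window diagnostic and the distance D can be sketched as follows. Several points are assumptions made for illustration: D is read here as the sum of absolute differences between observed and expected relative frequencies, the chi-square statistic is reported per window instead of the probability p, and the simulated series (random values on both sides, 'fabricated' values with second digit 0 or 5 in the center) is hypothetical.

```python
import math
import random

# Benford's Law probabilities for the second digit (0 to 9).
BENFORD_SECOND = [
    sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
    for d2 in range(10)
]


def digit_counts(values):
    """Count second significant digits, skipping zero and non-finite values."""
    counts = [0] * 10
    for v in values:
        if v != 0 and math.isfinite(v):
            counts[int(f"{abs(v):.14e}"[2])] += 1
    return counts


def chi_square(counts):
    """Chi-square statistic of the counts against the Benford second digit law."""
    n = sum(counts)
    if n == 0:
        raise ValueError("window contains no usable values")
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, BENFORD_SECOND))


def distance(counts):
    """D: summed absolute difference of observed and expected relative frequencies."""
    n = sum(counts)
    if n == 0:
        raise ValueError("no usable values")
    return sum(abs(c / n - p) for c, p in zip(counts, BENFORD_SECOND))


def moving_chi_square(series, window=50):
    """Chi-square statistic for every full window of the given size."""
    return [
        chi_square(digit_counts(series[i:i + window]))
        for i in range(len(series) - window + 1)
    ]


random.seed(1)
side = lambda: [random.gauss(15, 5) for _ in range(200)]
fabricated = [random.choice([0.5, 1.0, 1.5, 2.0, 2.5]) for _ in range(100)]
series = side() + fabricated + side()

stats = moving_chi_square(series, window=50)
print("windows:", len(stats))
print("statistic inside fabricated section:", round(stats[225], 1))
print("D for fabricated section:", round(distance(digit_counts(fabricated)), 3))
```

Plotting `stats` against the window start index gives a curve analogous to the solid line in Figure 12.4, with large values of the statistic playing the role of low p.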