CS-151 Labs > Lab 2. Object-Oriented Programming
Part 3. Benford Analysis
For this part of the lab, you will investigate the distribution of digits in a set of files.
Benford’s Law, also known as the “first-digit law,” is an observation about the distribution of digits in numbers in data sources. In brief, in many natural datasets the first digit will be a ‘1’ about 30% of the time with each higher digit occurring less frequently than the previous, ending with a ‘9’ occurring less than 5% of the time.
There is a nice graph of this distribution on the Wikipedia page.
The input to this program will be very similar to some of the ones you have already worked on. You will give a list of file names on the command line and then your program will work through all of them (if you need a refresher on how to process command line arguments, consult Lab 0 Part 2). If a file cannot be opened, just print a message to the user and continue to the next file.
We collected a number of pieces of sample data for you to try out your Benford
analysis tool on here: inputs.zip. This data comes from the US Census
Bureau, CIA World Factbook, and a variety of other sources around the web.
Because these are large files, we don’t want you to include the subfolder called inputs
with your submission.
To be able to use these files from the command line in your Eclipse project, you should unzip them and place the entire inputs
folder in the main directory of the lab2
project, where your src
and bin
folders are.
Create a Java class named Benford
in lab2
package.
For this program, you are only concerned about words that start with a
numerical digit, and then you need to get the value of that digit. You will
use an int array of size 10 to maintain a count of the number of words you
encounter that start with the digits 0 through 9. You should use the digit as
the index into the array (i.e., count[0]
gives you the number of words that
start with 0).
There are a number of ways you could get at the first character, determine if it is a digit, and get the value. One way is to:
- Get a word from your input (probably using
Scanner
andhasNext()
/next()
. - Use the word’s
charAt()
method to get the first characterc
and pass it intoCharacter.isDigit(c)
to determine if it is a digit. - If it is, then you can use the
String.valueOf(c)
method to convert characterc
into its String representation. - And that String can already be passed into
Integer.parseInt()
to determine its integer value. - Note that we must first convert the Character to Strng,
because
Integer.parseInt()
takes aString
and not aCharacter
.
Once you have gone through all the input files and collected counts of all the leading digits, you should print out a histogram of the various frequencies that you encountered. You should print out the following information for each row,
- the leading digit 0–9;
- the total count of words beginning with that digit;
- the relative frequency of that first digit (i.e., the count of all words beginning with that digit divided by the sum of counts for all digits);
- a bar of
*
characters representing the frequency graphically, scaled by the maximum frequency.
For example when run with the input file inputs/census-pop-est.txt
, this will print:
Welcome to the Benford analysis program
0 1116 0.9% : **
1 35294 29.8% : **************************************************
2 20825 17.6% : ******************************
3 14070 11.9% : ********************
4 11405 9.6% : ****************
5 9210 7.8% : *************
6 8019 6.8% : ***********
7 6809 5.7% : **********
8 6472 5.5% : *********
9 5259 4.4% : *******
The most frequently occurring digit should have a bar of length MAX_WIDTH
where MAX_WIDTH
is defined by a constant in your class.
public static final int MAX_WIDTH = 50;
All of the other bars’ lengths should be scaled proportionally to this bar as follows: Take
the digit’s count, divide by the max count, and multiply by MAX_WIDTH
. You
should round this number using Math.round()
to get it to the nearest integer.
If all the frequencies were about even, you’d expect all the bars to be about
50 units wide. If the distribution is not even, then the longest will be 50 and the
others will be of various shorter widths. Note that if the frequency is 0, the
bar should have 0 *
characters.
To print out the table in a formatted fashion, you might want to use
System.out.printf()
. You can pass it a format string that uses %d
for integer (decimal) numbers and %f
for floating point numbers. You can also specify a
width for each field by including a number between the %
and the letter. Here
is what we used for the example shown above.
System.out.printf("%d %8d %4.1f%% : ", digit, count[digit], freq );
%d
- is just a placeholder for a decimal number%8d
- is a decimal number where up to 8 leading spaces will be used%4.1f
- is a float with a width of 4 and 1 digit after the point%%
- because%
is used to signify a variable placeholder, you need to put 2 of them in a row to actually get a percentage%
in your output
After this format string, you list the values you want to print in the same order.
Recall that unlike in Python, we don’t have a different operator for real vs
integer division: instead, dividing one integer by another will always return
an integer. In order to get a float
, you will need to cast one of them as a
float
, like: count/(float)totalCount
.
Once you are done coding, try
running your tool on the various files. Note in your readme file
which of the input files seem to follow Benford’s law and which ones don’t. Feel free to
speculate why the ones that don’t are that way.
When run on the set of inputs inputs/census-pop-est.txt kittens
inputs/primes-10000.txt
, this will look like:
Welcome to the Benford analysis program
Could not open kittens (No such file or directory)
0 1116 0.9% : **
1 36898 28.7% : **************************************************
2 21954 17.1% : ******************************
3 15167 11.8% : *********************
4 12474 9.7% : *****************
5 10265 8.0% : **************
6 9032 7.0% : ************
7 7836 6.1% : ***********
8 7475 5.8% : **********
9 6265 4.9% : ********