Wednesday, May 21, 2014

Problem #1

Problem: Given a file name and a number, n, print out the n most frequnetly occurring words in the file. The algorithm should parse through the file and print the given number of words followed by the number of times the word occurs in the file.

Example: Here is an example of the output I got when I ran my program on the entire text of Jane Eyre (Charlotte Bronte) and an n of 10.


the : 7868
i : 6766
and : 6229
to : 5077
a : 4322
of : 4315
in : 2658
was : 2402
you : 2249
my : 2139





Exciting right? Though this does seem reasonable...I mean who uses the words "endeavoring" and "chiding" 8000 times each in a book? I have to admit that would certainly be interesting a read. 



Algorithm: In order to solve this problem I used a Priority Queue with my own comparable node class in order to draw out the word with the highest frequency.

The class looks like this:

private static class Node implements Comparable{
  String key;
  int freq;
  
  // used by the priority queue
  public int compareTo(Object other){
  Node n1 = (Node) this;
   Node n2 = (Node) other;
  if (n1.freq == n2.freq){
    return 0;
    }
  if (n1.freq > n2.freq){
    return -1;
   }
   return 1;
  }
  }

Note how the compareTo function's output seems a bit off, this is because you want to queue to return the largest frequency. The queue is implemented to pull the smallest item, the change reverses the order of the queue.

Now to find the words and to keep track of the frequencies when parsing through the file, I used a HashMap. The key of the map is the word and the value is the node. Each time you hit a word, you check if it is in the map already, if so you just increment the frequency of the node. If the word is not the map then make a new node and initialize frequency to 1. After parsing the whole file, I iterate through the keys in the map and add each node associated to the key to a priority queue. To finish off simply print n number of keys by pulling the top node from the queue. Happy coding!

Full Code:

import java.util.*;
import java.io.*;

public class PriorityParse {
 /* This class takes in a file name and the number of most freq occuring words
 the program should print. It parses through the file and then prints the given
 number of words followed by the number of times the word occurs in all the
 files. Only works with one file */

 // helper class node for the parsing
private static class Node implements Comparable{
  String key;
  int freq;
  
  // used by the priority queue
  public int compareTo(Object other){
  Node n1 = (Node) this;
   Node n2 = (Node) other;
  if (n1.freq == n2.freq){
    return 0;
    }
  if (n1.freq > n2.freq){
    return -1;
   }
   return 1;
  }
  }


  public static String removePuntc(String s){
  // takes in a word and removes the ending punction
  while(s.endsWith(".") || s.endsWith("!") || s.endsWith("?") || s.endsWith(",")){
   s = s.substring(0, s.length() - 2);
  }
  return s;
  }


  public static HashMap<String, Node> parseFile(Scanner s){
  // function parses a file and adds each word to a hashmap
  // the word is the key and the node with the freq is the value
  HashMap<String, Node> map = new HashMap<String, Node>();
  while (s.hasNextLine()){
   // disregard case when looking at a word
   String[] text = s.nextLine().toLowerCase().split("\\s+");
   // loop through the array and update the hashmap
for (int i = 0; i < text.length; i++){
String word = text[i];
word = removePuntc(word);
if (!word.equals("")){
// check if the word is in the hash map
if (map.containsKey(word)){
// if it is in the map update the freq of the node
Node temp = map.get(word);
temp.freq++;
}
else{
// if the word is not in the map add it
Node n = new Node();
n.key = word;
n.freq = 1;
map.put(word, n);
}
}
}
 
  return map;
  }

  // main takes in file names as arguments
public static void main(String[] args) {
  // print an error if not enough arguments are given
  if (args.length != 2){
   System.err.println("Invalid Arguments");
   return;
  }
  Scanner scan;
  int n;
  try{
    // try to open the file
   String filename = args[0];
   scan = new Scanner(new File(filename));
   // try to get the number
   n = Integer.parseInt(args[1]);
  }
  catch (FileNotFoundException e) {
   // if the file could not be opened print an error and quit
  System.err.println("Could not open file.");
  return;
  }
  catch (NumberFormatException ex){
   System.err.println("Argument 2 must be an integer.");
   return;
  }
  // parse the file using the helper function
  HashMap<String, Node> map = parseFile(scan);
// load the node in the map into a priority queue
PriorityQueue<Node> queue = new PriorityQueue<Node>();
Set<String> set = map.keySet(); 
for (String s: set){
queue.add(map.get(s));
}
//print the n most freq occuring words
for (int j = 0; j < n; j++){
Node word = queue.poll();
System.out.println(word.key + " : " + word.freq);
}
}

}




No comments:

Post a Comment