Thursday, April 16, 2009

Reading a directory of files into a database

When trying to test software nothing beats real data. Well, except beer. But anyway, large, real, and freely available datasets are unfortunately hard to come by. Thank god for jskill. Because of him (and nerds around the world for getting it together) us other nerds can play with a 500 million plus dataset of Enron emails.

My use for the data was to test an LSI system, but the sky's really the limit when it comes to what you could use it for. Its tar'd for 367mb and unzipped its 2.54gb (on disk).

The code itself isn't too much. I started with the fat kid approach and just barreled through the files. I pretty much yoinked this guy for the file traversal. The two things the code pads are single quotes and backslashes.

I ran it and it took about 7.5 hours to grind on through. I showed it to The Funk and he almost piddled with glee and said "Commons IO". Maaan I'm going to name my firstborn CommonsIO. I threw in prepared statements and we're in business.

To get setup with the database here is a creation script. My column names (and how I pull out the first line) are kind of unique for the software I'm using, but the idea is there.
DROP TABLE IF EXISTS `enron`.`tbl_enron`;
CREATE TABLE  `enron`.`tbl_enron` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `itemId` varchar(256) NOT NULL,
  `data` longtext NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
So here is an example email from the dataset (the attachments aren't provided).
Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 
And then finally the baby maker..
/**
 * Loader.java
 *
 * Created on Apr 19, 2009
 *
 * @author Kevin
 */

package com.noviidesign.enron;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Iterator;
import org.apache.commons.io.FileUtils;


/**
 * Loader.java does the following thangs...
 */
public class Loader {

    public static void main(String[] args) {

        System.out.println("connecting to db");

        Connection connection = null;
        try {
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/enron", "root", "password");
        } catch (SQLException ex) {}

        PreparedStatement preparedStatement = null;
        try {
            preparedStatement = connection.prepareStatement("INSERT INTO tbl_enron (itemId, data) VALUES (?, ?);");
        } catch (SQLException ex) {}

        System.out.println("connected.  now procressing");

        Iterator iterator = FileUtils.iterateFiles(new File("C:\\Documents and Settings\\Kevin\\Desktop\\enron_mail_030204\\maildir"), null, true);
        while (iterator.hasNext()) {

            String file = null;
            try {
                file = FileUtils.readFileToString((File) iterator.next());
            } catch (IOException ex) {
                System.err.println("Bullocks! " + ex);
            }

            String itemId = null;
            if (file.contains("Date"))
                itemId = file.substring(0, file.indexOf("Date") - 2);

            //find \ and pad \\ for insertion
            file = file.replaceAll("\\\\", "\\\\");
            //find ' and pad \' for insertion
            file = file.replaceAll("\'", "\\\'");

            try {
                preparedStatement.setString(1, itemId);
                preparedStatement.setString(2, file);
                preparedStatement.executeUpdate();
            } catch (SQLException ex) {
                System.out.println("SHIT! " + itemId + " " + ex.getErrorCode());
                System.err.println(ex);
            } catch (Exception ex) {
                System.out.println("SHIT! " + itemId);
                System.err.println(ex);
            }
        }

        try {
            preparedStatement.close();
        } catch (SQLException ex) {}

        System.out.println("WOOT DONE PROCESSING!  closing connection");

        if (connection != null) {
            try {
                connection.close();
            } catch (Exception e) { }
        }

        System.out.println("le finished");
    }
}

No comments:

Post a Comment