CMSC 828o Programming Assignments and Project Options

Assignment 1 (week 5): Install Java 2 SE, Sun Java Mail, GNU Java Mail, the Java Activation Framework (JAF), Lucene, and the email indexing and search demo. Index your own email archives (anything in unix mbox format should work). Turn in a screen shot of the interface showing an email that results from running the program on any collection of email that you have available.

Here is where you can get the parts you need:

Java 2 SE SDK: http://java.sun.com/j2se/1.4.2/index.jsp
Lucene: http://jakarta.apache.org/lucene/docs/index.html
JAF: http://java.sun.com/products/javabeans/glasgow/jaf.html
Java Mail: http://java.sun.com/products/javamail/
GNU Mail: http://savannah.gnu.org/cgi-bin/viewcvs/classpathx/mail/
Everything else: http://www.glue.umd.edu/~oard/email/

A simple approach that should get this working:

Install Java 2 SE, then install Lucene and run the demo that comes with it. Use that to get a feel for how Lucene works. The trickiest part here is getting your classpath right so that Java will recognize Lucene. On a PC running Windows XP, the classpath is set using Control Panel->System Properties->Advanced->Environment Variables.

Grab everything from the "Everything else" directory and read the three README files. These capture the discussion that ensued when I tried to get this to work for the first time.

Install JAF, Java Mail, GNU Mail, and my demo application (not the original one that Anton sent, which did not use Lucene), and then run the demo application on a unix mbox file (for example, a file of saved email created using Pine). The trickiest part here is getting the right mailcap file in the right place. Where the right place is seems to depend on how you order things in your classpath.
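To see the order in which classpath entries are searched, and whether each library is visible at all, a small check along the following lines may help. This is only a sketch: the class name ClasspathCheck and its methods are made up for illustration and are not part of the demo. The three probed class names are the standard entry points of Lucene, JAF, and Java Mail.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: lists classpath entries in order and probes for key classes. */
class ClasspathCheck {

    /** Splits a classpath string on the given separator, skipping empty entries. */
    static List<String> entries(String classpath, String separator) {
        List<String> result = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // "false" means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path", "");
        int position = 1;
        for (String entry : entries(classpath, File.pathSeparator)) {
            System.out.println(position++ + ": " + entry);
        }
        String[] probes = {
            "org.apache.lucene.index.IndexWriter", // Lucene
            "javax.activation.DataHandler",        // JAF
            "javax.mail.Session"                   // Java Mail
        };
        for (String name : probes) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Running it with the same classpath you use for the demo shows which entries come first, which is what appears to matter for the mailcap file.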
My classpath is:

.;.;C:\PROGRA~1\JMF21~1.1\lib\sound.jar;C:\PROGRA~1\JMF21~1.1\lib\jmf.jar;C:\WINDOWS\java\classes;c:\temp\lucene\lucene\lucene-1.3-rc1.jar;c:\temp\lucene\lucene\lucene-demos-1.3-rc1.jar;c:\email\src;c:\j2sdk1.4.2_02\jaf-1.0.2;c:\j2sdk1.4.2_02\javamail-1.3.1;.

and it started working after I put the mailcap file in C:\j2sdk1.4.2_02\javamail-1.3.1\META-INF. If you put javamail before JAF in your classpath, you would probably need to put the mailcap file in the META-INF there instead.

The mailcap file that I am using is available, along with a tarball for the version of GNU mail that I am using, at the "Everything else" URL - you'll need to change a few paths in that and perhaps in the demo source, but everything else should be pretty painless. If it is not, let me know - I have a working reference implementation, so it should be easy to troubleshoot.

The demo has a known inability to handle certain cases that violate RFC requirements for email or MIME, but these result in graceful rejection of the email being processed. The demo does, however, choke if the file (or files) it is trying to index gets really big, apparently because of a memory leak.

Assignment 2 (week 6): Lucene's query language and API do not appear to provide the ability to rank the results of a Boolean query for keywords by similarity to a given set of free text terms or by date. Add these capabilities to Lucene in a manner that is consistent with the existing API. Demonstrate this capability by extending the search demo so that it has only a single search button that will rank emails from a sender (if specified) or from anyone (if no sender is specified) by their similarity to the entered body text. Send me a jar file with your new classes. Include a README that I can use to easily get your new demo running on my machine and a screen shot of your new demo in action.
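To make the intended behavior of Assignment 2 concrete, here is a minimal sketch of "filter by sender, then rank by similarity to the body text" written in plain Java. It deliberately does not use Lucene: the class SenderFilteredRanker, the Email stand-in, and the choice of cosine similarity over raw term counts are all assumptions made for illustration, not the required design and not Lucene's own scoring.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: Boolean filter on sender, then similarity ranking. */
class SenderFilteredRanker {

    /** Minimal stand-in for an indexed email; not a Lucene class. */
    static final class Email {
        final String sender;
        final String body;
        Email(String sender, String body) {
            this.sender = sender;
            this.body = body;
        }
    }

    /** Lower-cases the text and counts each alphanumeric token. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two term-count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** A null or empty sender means "from anyone". Most similar emails come first. */
    static List<Email> rank(List<Email> emails, String sender, String queryText) {
        Map<String, Integer> query = termCounts(queryText);
        List<Email> matches = new ArrayList<>();
        for (Email e : emails) {
            if (sender == null || sender.isEmpty() || e.sender.equalsIgnoreCase(sender)) {
                matches.add(e);
            }
        }
        matches.sort(Comparator.comparingDouble((Email e) -> -cosine(query, termCounts(e.body))));
        return matches;
    }
}
```

Ranking by date instead would only change the comparator, which is one reason to keep the filter step and the ordering step separate.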
Assignment 3 (week 8): Build a simple thread reconstructor based on chronological order and subject line conventions, and modify the search demo to support thread-based searching as an option. Both attached and included text should be handled in some reasonable way.

Project Option 1 (week 14): Extend the thread reconstructor to exploit included text, to break threads at apparent topic shifts, and to recognize and separately handle signature blocks.

Project Option 2 (week 14): Extend name searching to include named entity references within the subject and body, and to accommodate efficient searches based on substring matching or similar pronunciation.

Project Option 3 (week 14): Extend attachment handling to index most common forms of encoded text (PDF, PostScript, Word, PowerPoint, ...) and to resolve references to URLs using the Internet Archive.

Project Option 4 (week 14): Implement active learning for text classification. Use it to support redaction and for spam identification.

Project Option 5 (week 14): You can design your own project if you like, but please discuss it with me first.
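As a starting point for the thread reconstructor in Assignment 3, the subject line convention can be sketched as follows: strip leading reply and forward markers, group messages whose remaining subjects match, and keep each group in chronological order. The names SubjectThreader, Message, normalizeSubject, and buildThreads are hypothetical, and the set of markers handled ("Re:", "Fw:", "Fwd:") is an assumption; real mail shows more variants.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: thread messages by normalized subject and send time. */
class SubjectThreader {

    /** Minimal stand-in for a parsed email header; not a Java Mail class. */
    static final class Message {
        final String subject;
        final long sentMillis;
        Message(String subject, long sentMillis) {
            this.subject = subject;
            this.sentMillis = sentMillis;
        }
    }

    /** Removes repeated leading "Re:", "Fw:", "Fwd:" markers, then lower-cases. */
    static String normalizeSubject(String subject) {
        String s = subject.trim();
        String previous;
        do {
            previous = s;
            s = s.replaceFirst("(?i)^(re|fwd?)\\s*:\\s*", "");
        } while (!s.equals(previous));
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Groups messages by normalized subject; each thread is in chronological order. */
    static Map<String, List<Message>> buildThreads(List<Message> messages) {
        List<Message> sorted = new ArrayList<>(messages);
        sorted.sort(Comparator.comparingLong((Message m) -> m.sentMillis));
        // LinkedHashMap keeps threads in order of their earliest message.
        Map<String, List<Message>> threads = new LinkedHashMap<>();
        for (Message m : sorted) {
            threads.computeIfAbsent(normalizeSubject(m.subject), k -> new ArrayList<>()).add(m);
        }
        return threads;
    }
}
```

This sketch ignores included and attached text entirely, so it covers only the first half of the assignment.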