The N Commandments for using the Internet Archive

Our group has researcher access to the Internet Archive, which permits us to work on the Archive's cluster.

Over the past several years, we have developed some rules of good citizenship that help us avoid problems such as accidentally taking over a large chunk of the Archive's internal network bandwidth. This Web page records some of that knowledge. It's titled "The N Commandments" because, like the Archive itself, the list is likely to need updating.

(List initially contributed by Michael Subotin.)

  1. Be sure not to run a big job before you've learned how to monitor and kill it
  2. Avoid running processes on homeserver (an automatic check in the scripts might be useful)
  3. Avoid jobs in which multiple hosts read large amounts of data from /home or write large amounts to it (use /tmp/your_username directories instead)
  4. Avoid heavy I/O activity on /home in general
  5. Be careful not to run one p2 job inside another (an automatic check in the scripts might be useful)
  6. When you use ctrl-C to kill a p2 job, it leaves pipe files (extension .p2tmp) in the /tmp (or /tmp/your_username?) directory, and a killed sort process may leave a temporary file behind there. Please check for these and clean them up once in a while.
  7. Be careful about exceeding disk quota on any of the disks at run time
  8. Be sure to nice your processes
  9. Be sure not to leave any processes running inadvertently (check the hosts you've been working with before logging off)
  10. Be sure to clean up the files you've placed in /tmp/your_username directories on all hosts (particularly large files)
  11. For the sake of your own sanity, if a job seems to be running forever, check whether an I/O or ssh breakdown on some host is holding it back
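
Several of the rules above (2, 6, 8 and 10) lend themselves to automatic checks. The following is a minimal Python sketch of what such checks might look like; it is not taken from our actual scripts. The function names are hypothetical, and it assumes that "homeserver" is the short hostname of the machine to avoid and that leftover pipe files can be recognized by the .p2tmp extension alone. It only reports leftover files and does not delete anything.

```python
import os
import socket
from pathlib import Path

# Assumption: the short hostname of the machine that should not run jobs.
FORBIDDEN_HOSTS = {"homeserver"}


def on_forbidden_host(hostname=None):
    """Rule 2: return True if this (or the given) host should not run jobs."""
    name = hostname if hostname is not None else socket.gethostname()
    # Compare only the short name, so "homeserver.example.org" also matches.
    return name.split(".")[0] in FORBIDDEN_HOSTS


def find_leftovers(tmp_dir, suffix=".p2tmp"):
    """Rules 6 and 10: list files under tmp_dir whose names end in suffix.

    Returns a sorted list of paths; nothing is removed.
    """
    root = Path(tmp_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*" + suffix))


def be_nice(increment=10):
    """Rule 8: lower this process's priority; returns the new niceness (Unix only)."""
    return os.nice(increment)
```

A job script could call on_forbidden_host() first and exit if it returns True, call be_nice() before starting heavy work, and print the result of find_leftovers("/tmp/your_username") on each host before you log off.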

Questions? Contact Philip Resnik at lastname _AT_