A little over a year ago I discovered that Berkeley publishes on the Internet (at http://directory.berkeley.edu) contact information, including email addresses, for everyone associated with the university. This information is available to anyone on the Internet, not just to those inside the UC system or accessing it from outside via a proxy. The email addresses are shown as text rather than graphics and, as I discovered, the system isn’t smart enough to monitor traffic and shut off computers that attempt to collect too many addresses. (“Too many” as in “all.”) So, if you want a list of 23,864 email addresses to spam – help yourself. Yes, they let you remove your address, but something tells me that the majority of the people who have their addresses on the site don’t even know that it exists.

I tried raising those concerns with the administration, but they didn’t seem to see a problem. They pointed out that Berkeley provides students with a notice about the directory on page 21 of the printed course schedule. Raise your hand if you've ever seen a printed course schedule at Berkeley.

I then thought that maybe I would get the administration’s attention if I actually retrieved the addresses. After writing a Python script and letting it run for a while, I came into possession of 23,864 Berkeley email addresses. They look something like the following, but without the hash marks:

AAB##, ANN# ######@socrates.berkeley.edu
AAH###, ERI#### #######@berkeley.edu
AAK##, DAV##### #####@haas.berkeley.edu
AAK##, JOH############# ######@uclink.berkeley.edu
AAL###, JES############### ####@uclink.berkeley.edu
AAL##, ROL# not available
AAR#, JOH### not available
AAR##, HOL##### ######@socrates.berkeley.edu
AAR##, MAR####### not available
AAR#####, SCO####### ########@cs.berkeley.edu
AAR##, ASH### not available
ABA#, CHR######## #####@berkeley.edu
ABA#, JOS##### not available
ABA#, RON############ #########@berkeley.edu
ABA#, STE######### ######@uclink.berkeley.edu
ABA#, MEB######## not available
ABA###, IMA## #######@library.berkeley.edu
ABA###, PAT######### ################@#######.com
ABA####, REA############### ########@uclink.berkeley.edu
ABA####, TAW######### ########@berkeley.edu
ABA#####, KAT######### #######@berkeley.edu
ABA####, JOY######### not available
ABA####, VIR##### ################@###.com

It continues like this for another 39,479 lines.

I am not going to include the code here out of concern for those 23,864 inboxes, but the basic idea is simple. You request the following URL once for each two-letter combination, substituting the combination for ‘%s’:

https://directory.berkeley.edu/cgi-bin/search.cgi?
display_type=textonly&search-type=lastfirst&search-base=all
&search-term=%s&search.x=14&search.y=12&search=Search

Each query returns a list of names with links to more information. Some queries (e.g., “ad”) give you an error, “exceeded the maximum number of results,” but it’s pretty trivial to overcome this by querying certain three-letter combinations. (You need to search only for those triples “xyz” where both “xy” and “yz” resulted in the error. There are actually very few of those.) Thus, it takes a total of about 800 queries to retrieve the full list of people associated with UC Berkeley (about 40,000 people) and then one query per person to get all of their details.
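
The query plan above is easy to sketch without touching the network. The snippet below only enumerates search terms; it makes no requests and is not the script mentioned earlier. The function names are hypothetical, and the `overflowed` set is a made-up stand-in for the pairs that returned the “exceeded the maximum number of results” error.

```python
from itertools import product
from string import ascii_lowercase


def two_letter_terms():
    """All 26 * 26 two-letter search terms, 'aa' through 'zz'."""
    return ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]


def three_letter_terms(overflowed):
    """Triples 'xyz' for which both 'xy' and 'yz' overflowed."""
    overflowed = set(overflowed)
    return sorted(
        xy + z
        for xy in overflowed
        for z in ascii_lowercase
        if xy[1] + z in overflowed
    )


pairs = two_letter_terms()

# Hypothetical example: pretend these three pairs hit the result limit.
overflowed = {"ad", "de", "an"}
triples = three_letter_terms(overflowed)

print(len(pairs), "two-letter terms")
print("follow-up triples:", triples)
```

The 676 pairs account for most of the roughly 800 queries; the rest would be the follow-up triples.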

Having collected the addresses, I sent an email to University Registrar Ms. Castillo-Robson, who assured me that there really is nothing to worry about. A year later, the directory still functions just as it did then.

On a lighter note, now that I've got 23,864 addresses, I thought I might as well get some statistics on them. First, here are the most popular domain names for the email addresses:

berkeley.edu 8600
uclink.berkeley.edu 7348
uclink4.berkeley.edu 1806
socrates.berkeley.edu 920
haas.berkeley.edu 586
yahoo.com 367
hotmail.com 342
nature.berkeley.edu 295
eecs.berkeley.edu 272
library.berkeley.edu 238
boalthall.berkeley.edu 187
lbl.gov 130
law.berkeley.edu 129
math.berkeley.edu 125
cs.berkeley.edu 121
aol.com 114
me.berkeley.edu 103
econ.berkeley.edu 94
ssl.berkeley.edu 91
ce.berkeley.edu 80
dev.urel.berkeley.edu 75
cchem.berkeley.edu 73
mba.berkeley.edu 65
unx.berkeley.edu 59
cp.berkeley.edu 58
cal.berkeley.edu 57
uhs.berkeley.edu 56
stat.berkeley.edu 56
calmail.berkeley.edu 51
newton.berkeley.edu 46
sims.berkeley.edu 44
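
A tally like the one above takes only a few lines. This is a minimal sketch rather than the code that produced these numbers; the sample addresses are invented, and entries without an “@” (such as “not available”) are simply skipped.

```python
from collections import Counter


def domain_counts(addresses):
    """Count addresses per domain, ignoring entries that are not addresses."""
    counts = Counter()
    for addr in addresses:
        local, sep, domain = addr.strip().rpartition("@")
        if sep and local and domain:
            counts[domain.lower()] += 1
    return counts


# Invented sample data, for illustration only.
sample = [
    "jdoe@berkeley.edu",
    "doe@uclink.berkeley.edu",
    "john@Berkeley.edu",
    "not available",
]

counts = domain_counts(sample)
for domain, n in counts.most_common():
    print(domain, n)
```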

Or, put differently:

berkeley.edu          8600
uclink.berkeley.edu   7361
{etc}.berkeley.edu    2130
uclink4.berkeley.edu  1823
haas.berkeley.edu      587
yahoo.com              367
hotmail.com            342
nature.berkeley.edu    295
eecs.berkeley.edu      273
library.berkeley.edu   238
{etc}.com              232
boalthall.berkeley.edu 187
lbl.gov                130
aol.com                114
{etc}.net              112
{etc}.edu               88
{etc}.org               31
{something}.{etc}       16
{etc}.gov               11
--------------------------
total                23864
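
One way to produce this kind of grouped view is to keep a hand-picked set of domains as they are and fold everything else into catch-all buckets. The sketch below assumes that approach; the `keep` set, the list of top-level domains, and the sample counts are all made up, and the rule actually used to decide which domains stay separate is not stated here.

```python
from collections import Counter

# Top-level domains that get their own "{etc}.tld" bucket (an assumption).
KNOWN_TLDS = {"com", "net", "edu", "org", "gov"}


def bucket(domain, keep):
    """Map a domain to itself if kept, otherwise to a catch-all bucket."""
    domain = domain.lower()
    if domain in keep:
        return domain
    if domain.endswith(".berkeley.edu"):
        return "{etc}.berkeley.edu"
    tld = domain.rsplit(".", 1)[-1]
    if tld in KNOWN_TLDS:
        return "{etc}." + tld
    return "{something}.{etc}"


def grouped_counts(raw_counts, keep):
    """Sum per-domain counts into their buckets."""
    grouped = Counter()
    for domain, n in raw_counts.items():
        grouped[bucket(domain, keep)] += n
    return grouped


# Invented counts, for illustration only.
keep = {"berkeley.edu", "uclink.berkeley.edu", "aol.com", "lbl.gov"}
raw = {
    "berkeley.edu": 5,
    "math.berkeley.edu": 2,
    "stat.berkeley.edu": 1,
    "aol.com": 3,
    "example.com": 4,
    "example.de": 1,
}

grouped = grouped_counts(raw, keep)
for name, n in grouped.most_common():
    print(f"{name:<24}{n:>5}")
```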

Now for the ways user names are chosen. The table shows, for each domain pattern, what percentage of user names fit each naming pattern. (The patterns are illustrated by hypothetical user names for “john marvin doe”: “doe”, “jdoe”, etc.) If the user name matched only the beginning of the pattern string (e.g. “jlong” instead of “jlonglastname”), I counted it as a match.

[Table: percentage of user names matching each pattern, per domain. The cell values were lost when the page's HTML was stripped.
Columns: jdoe doe doej johnd john john doe jmdoe jmd john .doe nums _ misc
Rows: all, berkeley.edu, uclink.berkeley.edu, {etc}.berkeley.edu, uclink4.berkeley.edu, socrates.berkeley.edu, haas.berkeley.edu, yahoo.com, hotmail.com, nature.berkeley.edu, eecs.berkeley.edu, library.berkeley.edu, {etc}.com, boalthall.berkeley.edu, lbl.gov, aol.com, {etc}.net, {etc}.edu, {etc}.org, {something}.{etc}, {etc}.gov]
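
The pattern matching described above can be sketched roughly as follows. This is an illustration under several assumptions, not the code behind the table: the labels “johndoe” and “john.doe” are my reading of two garbled column names, “nums” is taken to mean “contains a digit” and “_” to mean “contains an underscore”, exact matches are tried before truncated ones, and the minimum prefix length of four is arbitrary.

```python
def candidate_patterns(first, middle, last):
    """(label, user name) pairs for one person, e.g. john marvin doe."""
    f, l = first[0], last[0]
    patterns = [
        ("jdoe", f + last),
        ("doe", last),
        ("doej", last + f),
        ("johnd", first + l),
        ("john", first),
        ("johndoe", first + last),
        ("john.doe", first + "." + last),
    ]
    if middle:  # middle-initial patterns only make sense with a middle name
        m = middle[0]
        patterns += [("jmdoe", f + m + last), ("jmd", f + m + l)]
    return patterns


def classify(username, first, middle, last, min_prefix=4):
    """Return the label of the first pattern the user name fits."""
    user = username.lower()
    patterns = candidate_patterns(first.lower(), middle.lower(), last.lower())
    # Exact matches take priority over truncated ones.
    for label, full in patterns:
        if user == full:
            return label
    # A truncated user name counts as a match ("jlong" for "jlonglastname").
    for label, full in patterns:
        if len(user) >= min_prefix and full.startswith(user):
            return label
    if any(ch.isdigit() for ch in user):
        return "nums"
    if "_" in user:
        return "_"
    return "misc"


for name in ["jdoe", "john.doe", "jmdo", "doe42", "bear_fan", "oski"]:
    print(name, "->", classify(name, "John", "Marvin", "Doe"))
```

Running this over every (name, address) pair and tallying the labels per domain bucket would give percentages of the kind the table reports.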

A few random observations:

  • “john@*.berkeley.edu” is more common than “john@berkeley” – that’s obvious. However, “doe@*.berkeley.edu” is also much more common than “doe@berkeley” – is Berkeley large enough that last-name collisions are common?
  • MBAs and .gov people really like their last names. Lawyers, on the other hand, are like the rest of us.
  • library.berkeley.edu and lbl.gov must have an explicit policy. What’s interesting is that lbl.gov must make exceptions for people without initials and for people without first names. :)