A lot of people don’t like to put their e-mail address directly on a web page. You may have seen e-mail addresses that look like “jondoeATgmailDOTcom” or “j o h n do e @ gma il . c o m!”. Oddities such as spelling out “AT”, inserting stray spaces, or tacking on extra characters such as “!” are all tricks to avoid having your e-mail address scooped up by someone looking to spam you.
The way I approach the problem is to have JavaScript write my e-mail address for me. I get a well-formed, clickable e-mail link, and I avoid having my e-mail address scooped up by a web crawler. Now, I know that JavaScript’s document.write() function is shunned in most circles, but here is some example code that uses it.
[cc lang="javascript"]
// File is contactInfo.js
// Write the contact line into the page at the point where this script is included.
document.write("David Kennedy: " +
    "Web http://davidwkennedy.com | " +
    "E-Mail dave@orderinchaos.org | " +
    "Phone: 435.770.6865");
[/cc]
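If document.write() feels too blunt, the same idea can be sketched with the address assembled from parts, so the complete address never appears as a single string in the script file either. This is only a sketch: the helper names buildAddress and mailLinkHtml and the element id "contact" are assumptions made for illustration and are not part of contactInfo.js above.

```javascript
// Hypothetical helper: join the pieces so the full address is never
// written out as one literal string.
function buildAddress(user, domain) {
  return user + "@" + domain;
}

// Hypothetical helper: return the markup for a clickable mailto link.
function mailLinkHtml(user, domain) {
  var address = buildAddress(user, domain);
  return '<a href="mailto:' + address + '">' + address + "</a>";
}

// Browser only: place the link inside an element with the assumed id "contact".
// The guard keeps the file loadable outside a browser.
if (typeof document !== "undefined") {
  var target = document.getElementById("contact");
  if (target) {
    target.innerHTML = mailLinkHtml("dave", "orderinchaos.org");
  }
}
```

With this variation, the page would need an empty element with the id "contact" for the link to land in, and the script would have to run after that element exists.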
The script is then included in the HTML file as shown below.
[cc lang="html"]
<script type="text/javascript" src="contactInfo.js"></script>
[/cc]
I’m sure it’s quite possible to write a web crawler smart enough to pick it up, but I think most web crawlers won’t.
For example, when I run the web crawler below on my personal site, davidwkennedy.com, you can see that the results do not contain my e-mail address!
The crawler is written in Python and borrowed from IBM’s site on web spiders.
[cc lang="python"]
#!/usr/local/bin/python
# Python 2 script: httplib and the HTMLParser module were renamed in Python 3.
import httplib
import sys
import re
from HTMLParser import HTMLParser


class miniHTMLParser( HTMLParser ):

    viewedQueue = []   # links that have already been seen
    instQueue = []     # links still waiting to be crawled

    def get_next_link( self ):
        # Return the next queued link, or '' when the queue is empty.
        if self.instQueue == []:
            return ''
        else:
            return self.instQueue.pop(0)

    def gethtmlfile( self, site, page ):
        # Fetch a page over HTTP; return an empty string on any failure.
        try:
            httpconn = httplib.HTTPConnection(site)
            httpconn.request("GET", page)
            resp = httpconn.getresponse()
            resppage = resp.read()
        except Exception:
            resppage = ""
        return resppage

    def handle_starttag( self, tag, attrs ):
        # Queue local .htm links; report every other link as ignored.
        # Only the first attribute of each <a> tag is examined.
        if tag == 'a' and attrs:
            newstr = str(attrs[0][1])
            if re.search('http', newstr) == None:
                if re.search('mailto', newstr) == None:
                    if re.search('htm', newstr) != None:
                        if (newstr in self.viewedQueue) == False:
                            print "  adding", newstr
                            self.instQueue.append( newstr )
                            self.viewedQueue.append( newstr )
                    else:
                        print "  ignoring", newstr
                else:
                    print "  ignoring", newstr
            else:
                print "  ignoring", newstr


def main():
    # Both the site and the starting link are required.
    if len(sys.argv) < 3:
        print "usage is ./minispider.py site link"
        sys.exit(2)

    mySpider = miniHTMLParser()
    link = sys.argv[2]

    # Crawl until no queued links remain.
    while link != '':
        print "\nChecking link ", link
        retfile = mySpider.gethtmlfile( sys.argv[1], link )
        mySpider.feed(retfile)
        link = mySpider.get_next_link()

    mySpider.close()
    print "\ndone\n"


if __name__ == "__main__":
    main()
[/cc]
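Below are the results of running the crawler above on davidwkennedy.com. A mailto: link in the static HTML would have been reported as an ignored link, and none shows up.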
[cc lang="bash"]
dave@dave-sparta:~$ python miniCrawler.py davidwkennedy.com /
Checking link /
adding index.htm
adding photos.htm
adding videos.htm
adding projects.htm
ignoring http://orderinchaos.davidwkennedy.com
adding funstuff.htm
adding about.htm
ignoring http://twitter.com/davidwkennedy
ignoring statcounter
Checking link index.htm
Checking link photos.htm
Checking link videos.htm
Checking link projects.htm
Checking link funstuff.htm
Checking link about.htm
done
dave@dave-sparta:~$
[/cc]
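The crawler above only looks at links. A harvester that scans the raw page text for address-shaped strings would likely come up empty on a page that only contains the script include, as the rough sketch below suggests. The pattern, the function name findAddresses, and the two sample page strings are assumptions made for illustration; real harvesters vary.

```javascript
// Simplified address pattern, assumed for illustration only.
var EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical harvester step: return every address-shaped string in the HTML.
function findAddresses(html) {
  return html.match(EMAIL_PATTERN) || [];
}

// Sample page with the address written directly into the markup.
var staticPage = '<p><a href="mailto:dave@orderinchaos.org">E-Mail</a></p>';

// Sample page that only pulls in the script, as described above.
var scriptedPage = '<p><script src="contactInfo.js"></script></p>';

console.log(findAddresses(staticPage));   // one address found
console.log(findAddresses(scriptedPage)); // empty list
```

A harvester that also downloads contactInfo.js and scans it would still find the address, since it sits there as plain text, which is why this only deters the simpler crawlers.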

