• 02 Oct 2009 /  Code, Technology

    A lot of people don’t like to put their e-mail address directly on a web page. You may have seen e-mail addresses that look like “johndoeATgmailDOTcom”, or “j o h n do e @ gma il . c o m!”. Oddities such as spelling out “AT”, inserting stray spaces, or adding characters that don’t belong in an e-mail address, such as “!”, are all tricks to keep the address from being scooped up by someone looking to spam you.

    The way I approach the problem is to have JavaScript write my e-mail address for me. I get a well-formed, clickable e-mail link, and I avoid having my address scooped up by a web crawler. I know that JavaScript’s document.write() function is shunned in most circles, but below is some example code.
    [cc lang="javascript"]
    // File is contactInfo.js
    document.write('David Kennedy: ' +
        'Web <a href="http://davidwkennedy.com">http://davidwkennedy.com</a> | ' +
        'E-Mail <a href="mailto:dave@orderinchaos.org">dave@orderinchaos.org</a> | ' +
        'Phone: 435.770.6865');
    [/cc]
    It is then included in the HTML file as shown below:
    [cc lang="html"]
    <script type="text/javascript" src="contactInfo.js"></script>
    [/cc]
    I’m sure it’s quite possible to write a web crawler smart enough to pick it up, but I think most web crawlers won’t.
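To make that claim concrete, here is a minimal Python 3 sketch of a naive harvester, assuming it only runs a regular expression over the raw HTML it downloads and never executes or fetches scripts. The names `EMAIL_RE` and `find_emails` and both sample pages are hypothetical, made up for illustration.

```python
import re

# Deliberately simple pattern (an assumption); real harvesters may be more elaborate.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html):
    """Return the distinct e-mail-like strings found in raw HTML, sorted."""
    return sorted(set(EMAIL_RE.findall(html)))


# Hypothetical page that puts the address directly in the markup.
inline_page = (
    '<p>E-Mail <a href="mailto:dave@orderinchaos.org">'
    'dave@orderinchaos.org</a></p>'
)

# Hypothetical page that only references the external script.
script_page = '<p><script type="text/javascript" src="contactInfo.js"></script></p>'

print(find_emails(inline_page))  # the address is found in the static markup
print(find_emails(script_page))  # nothing to find without running the script
```

A harvester that also downloads contactInfo.js and scans it would still find the address, which is the limitation mentioned above.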

    For example, when I run the web crawler below against my personal site, davidwkennedy.com, the results do not contain my e-mail address!

    Below is a web crawler, written in Python 2 and borrowed from IBM’s site on web spiders.
    [cc lang="python"]
    #!/usr/local/bin/python

    import httplib
    import sys
    import re
    from HTMLParser import HTMLParser

    class miniHTMLParser( HTMLParser ):

        viewedQueue = []   # links already seen
        instQueue = []     # links still to visit

        def get_next_link( self ):
            if self.instQueue == []:
                return ''
            else:
                return self.instQueue.pop(0)

        def gethtmlfile( self, site, page ):
            try:
                httpconn = httplib.HTTPConnection(site)
                httpconn.request("GET", page)
                resp = httpconn.getresponse()
                resppage = resp.read()
            except:
                resppage = ""

            return resppage

        def handle_starttag( self, tag, attrs ):
            if tag == 'a':
                # Assumes the first attribute of the anchor is its href
                newstr = str(attrs[0][1])
                if re.search('http', newstr) == None:
                    if re.search('mailto', newstr) == None:
                        if re.search('htm', newstr) != None:
                            if (newstr in self.viewedQueue) == False:
                                print " adding", newstr
                                self.instQueue.append( newstr )
                                self.viewedQueue.append( newstr )
                        else:
                            print " ignoring", newstr
                    else:
                        print " ignoring", newstr
                else:
                    print " ignoring", newstr

    def main():

        # Both the site and the starting link are required
        if len(sys.argv) < 3:
            print "usage is ./minispider.py site link"
            sys.exit(2)

        mySpider = miniHTMLParser()

        link = sys.argv[2]

        while link != '':

            print "\nChecking link ", link

            retfile = mySpider.gethtmlfile( sys.argv[1], link )
            mySpider.feed(retfile)
            link = mySpider.get_next_link()

        mySpider.close()

        print "\ndone\n"

    if __name__ == "__main__":
        main()
    [/cc]
    Below are the results of running the above crawler on davidwkennedy.com:
    [cc lang="bash"]
    dave@dave-sparta:~$ python miniCrawler.py davidwkennedy.com /

    Checking link /
    adding index.htm
    adding photos.htm
    adding videos.htm
    adding projects.htm
    ignoring http://orderinchaos.davidwkennedy.com
    adding funstuff.htm
    adding about.htm
    ignoring http://twitter.com/davidwkennedy
    ignoring statcounter

    Checking link index.htm
    Checking link photos.htm
    Checking link videos.htm
    Checking link projects.htm
    Checking link funstuff.htm
    Checking link about.htm
    done

    dave@dave-sparta:~$

    [/cc]
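For readers on Python 3, where `httplib` and `HTMLParser` became `http.client` and `html.parser`, the crawler's link-filtering rule can be sketched roughly as below. This is an illustrative reduction, not the IBM code: it parses a hypothetical HTML string instead of fetching pages, it looks up `href` by name rather than taking the first attribute, and the names `LinkFilter` and `sample` are made up for the example.

```python
from html.parser import HTMLParser


class LinkFilter(HTMLParser):
    """Collect relative .htm links; skip absolute URLs and mailto: links."""

    def __init__(self):
        super().__init__()
        self.queued = []   # links a crawler would visit next
        self.ignored = []  # links the rule rejects

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # Same rule as the crawler above: no "http", no "mailto", must contain "htm".
        if "http" in href or "mailto" in href or "htm" not in href:
            self.ignored.append(href)
        elif href not in self.queued:
            self.queued.append(href)


# Hypothetical markup, for illustration only.
sample = (
    '<a href="about.htm">About</a>'
    '<a href="http://twitter.com/davidwkennedy">Twitter</a>'
    '<a href="mailto:dave@orderinchaos.org">E-Mail</a>'
)

parser = LinkFilter()
parser.feed(sample)
parser.close()
print(parser.queued)   # relative .htm links that would be crawled
print(parser.ignored)  # links rejected by the rule
```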
