Script to automate building an adblocking hosts file

  • Feathers McGraw
    replied
    Hehe


  • SteveRiley
    replied
    Kum-by-yah, Jerr-ee, kum-by-yah...

    LOLOLOLOLOL


  • Feathers McGraw
    replied
    I don't want to take over Steve's thread, so I created a new one here for the router project, so I don't clutter this one with stuff not specific to Kubuntu.


  • Feathers McGraw
    replied
    Google Analytics for WordPress works by inserting the analytics code into the header of each page. This is the code:

    Code:
                <script type="text/javascript">//<![CDATA[
                // Google Analytics for WordPress by Yoast v4.3.3 | http://yoast.com/wordpress/google-analytics/
                var _gaq = _gaq || [];
                _gaq.push(['_setAccount', 'XXXXXXXXX']);
                _gaq.push(['_setCustomVar',2,'post_type','page',3],['_setCustomVar',4,'year','2013',3],['_trackPageview']);
                (function () {
                    var ga = document.createElement('script');
                    ga.type = 'text/javascript';
                    ga.async = true;
                    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    
                    var s = document.getElementsByTagName('script')[0];
                    s.parentNode.insertBefore(ga, s);
                })();
                //]]></script>
    So I'm going to try putting "google-analytics" in that whitelist file and see what happens. Thought I might need a few more phrases!

    Feathers
    Last edited by Feathers McGraw; Oct 28, 2013, 12:11 PM.


  • Feathers McGraw
    replied
    Plus, if either of us has done something stupid (even if it does work), people can suggest an alternative, and we all learn. Now all I need to do is work out which Google things to enable, which may turn out to be the most difficult bit!


  • GreyGeek
    replied
    Wow! This is a PRIME example of how Open Source is supposed to work. Steve creates a very neat Bash script, his first, to build a specialized /etc/hosts file, and folks jump in and add mods for their own purposes. Everyone benefits! Now, suppose he had created a binary to sell as shareware? Only he could have made changes, depriving himself and others of the improvements and bug fixes that more experienced Bash script writers could have contributed. Everyone benefits from Steve, Feathers and the other contributors.
    Last edited by GreyGeek; Oct 28, 2013, 05:22 AM.


  • Feathers McGraw
    replied
    Didn't fancy trawling through 33,000 lines for certain things so I thought I'd automate it. Was a good learning experience.

    Code:
    #!/bin/bash
    
    # Before calling this script, create a whitelist file containing phrases to allow, one phrase per line.
    
    if [ $# -ne 1 ]; then
        echo "Usage: $0 whitelist_file_location" >&2
        exit 1
    fi
    
    WHITELIST_FILE=$1
    INPUT_FILE=~/hosts-block
    OUTPUT_FILE=~/hosts-block-less-whitelist
    
    # First, remove empty lines from the whitelist file (or the next step will throw an error).
    # -i edits the file in place (GNU sed), so no temporary file is needed.
    sed -i '/^$/d' "$WHITELIST_FILE"
    echo 'Removed empty lines from whitelist_file'
    
    cp "$INPUT_FILE" "$OUTPUT_FILE"
    
    # Now read each phrase from the whitelist file and remove matching entries from OUTPUT_FILE.
    while IFS= read -r line; do
        # Use | as the sed delimiter so phrases containing / do not break the expression.
        # The phrase is still treated as a regular expression and must not contain |.
        sed -i "\|$line|d" "$OUTPUT_FILE"
        echo "Removed any lines containing $line"
    done < "$WHITELIST_FILE"
    If I've done anything embarrassingly inefficient, let me know lol.
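
    One possible shortcut, offered as a sketch rather than something tested against the real 33,000-line file: grep can take every whitelist phrase in a single pass, so the loop is not strictly needed. The function name filter_hosts is made up for illustration, and -F treats each phrase as a literal string rather than a regular expression, which differs slightly from the sed version.

```shell
#!/bin/bash

# filter_hosts is a hypothetical helper name, not part of the original script.
# Usage: filter_hosts whitelist_file input_file output_file
filter_hosts() {
    whitelist=$1
    input=$2
    output=$3

    # Copy the whitelist without empty lines; an empty pattern would match
    # every line and wipe out the whole hosts file.
    patterns=$(mktemp) || return 1
    grep -v '^$' "$whitelist" > "$patterns"

    # -v  keep only lines that do NOT match
    # -F  treat each phrase as a fixed string, not a regex
    # -f  read the phrases from a file, one per line
    grep -v -F -f "$patterns" "$input" > "$output"

    rm -f "$patterns"
}
```

    Called as filter_hosts ~/whitelist ~/hosts-block ~/hosts-block-less-whitelist (the whitelist path is only an example), it should leave the input file untouched and write the filtered copy in one go.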


  • SteveRiley
    replied
    You can edit the output of my script and remove any references to Google Analytics before you copy the file to your router.


  • Feathers McGraw
    replied
    Thanks, that's really interesting!

    Originally posted by SteveRiley
    You will see that it contains a number of links to sites my script does block (google-analytics.com, quantcast.com)
    Ahh, then I'm in a bind. I use Google Analytics on the site because it gives such an insight into which bits people are finding interesting/useful etc.

    Blocking google-analytics at the router would break the connection from the Pi to the Google server. Finding a local equivalent would be ideal; I've tried a couple of WordPress plugins, but unfortunately the counts were pretty wild.

    Feathers


  • SteveRiley
    replied
    Originally posted by Feathers McGraw
    Tested using the site below, and some adverts still showed, but they're not real adverts, so I'm not sure what that means! Haven't had any real ones get through. Browsing seems snappier with AdBlock turned off.
    Here's the HTML from the portion of the page that delivers the images:
    Code:
    <img src="http://img236.echo.cx/img236/5108/adbannersportedtop9tr.gif" alt="Ad banner should be blocked" title=" Ad banner should be blocked"> 
    <h3>[^ You should NOT be seeing this image above Ad banner was here ^]</h3>
    <br>
    <img src="http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif" alt="Ad should be blocked" title="Ad should be blocked"> 
    <h3>[^ You should NOT be seeing this image above Ad image was here ^]</h3>
    <br>
    <img src="http://img207.echo.cx/img207/1241/realmedia6iw.gif" alt="Ad should be blocked" title="Ad should be blocked"> 
    <h3>[^ You should NOT be seeing this image above Ad image was here ^]</h3>
    <br>
    <img src="http://img61.echo.cx/img61/2681/adtrackingpromo1gl.gif" alt=" Ad should be blocked" title="Ad should be blocked"> 
    <h3>[^ You should NOT be seeing this image above Ad image was here ^]</h3>
    <br>
    <img src="http://img104.echo.cx/img104/9528/friendsaffiliatessported8zx.gif" alt="Ad should be blocked" title="This is NOT an AD"> 
    <h3>[^ You SHOULD be seeing this image above ^]</h3>
    <br>
    <img src="http://img64.echo.cx/img64/6751/doubleclickaffsportedbottom6cf.gif" alt="Ad should be blocked" title="Ad should be blocked"> 
    <h3>[^ You should NOT be seeing this image above Affilates was here ^]</h3>
    You'll note that they're served from various hosts in the echo.cx domain. Browser-based ad blockers maintain lists of ad sites and also URL matching strings for common ad image names, and would thus block all those images based on the file names. My script can only block known ad hosts because it's DNS-based: it works on host names alone and never sees the path or file name in a URL.
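
    To make the host-versus-URL distinction concrete, here is a rough sketch that pulls the host names out of a saved copy of the page and checks each one against a hosts file. Both function names are invented for illustration, and the matching is deliberately simple: it only looks at src="http(s)://..." attributes.

```shell
#!/bin/bash

# list_image_hosts is a hypothetical helper: print the unique host names
# referenced by src="http(s)://..." attributes in an HTML file.
list_image_hosts() {
    grep -oE 'src="https?://[^/"]+' "$1" | sed -E 's|^src="https?://||' | sort -u
}

# host_is_blocked is also hypothetical: succeed if the hosts file ($1) has a
# non-comment entry that lists exactly the host name $2.
host_is_blocked() {
    awk -v h="$2" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit !found }
    ' "$1"
}
```

    Looping over the output of list_image_hosts and calling host_is_blocked on each name shows which images a hosts file could ever stop; anything served from an unlisted host gets through no matter what the file is called.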

    Upon first glance, then, my script shouldn't block any of the images, because it has no entries for hosts in the echo.cx domain. However, the snippet of HTML above deserves a bit more investigation. The first, fourth, fifth, and sixth image links point to true images. But the second and third do not: instead, they point to HTML files! Let's download the second:
    Code:
    steve@t520:~/junk$ wget -S http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif
    --2013-10-27 14:32:06--  http://img145.echo.cx/img145/3690/atribalfushionsported3ti.gif
    Resolving img145.echo.cx (img145.echo.cx)... 208.94.1.239
    Connecting to img145.echo.cx (img145.echo.cx)|208.94.1.239|:80... connected.
    HTTP request sent, awaiting response... 
      HTTP/1.1 200 OK
      Server: nginx/1.0.4
      Date: Sun, 27 Oct 2013 21:32:07 GMT
      Content-Type: text/html
      Transfer-Encoding: chunked
      Connection: close
      X-Powered-By: PHP/5.2.9
      X-Server-Name-And-Port: _:14000
      Expires: Sun, 27 Oct 2013 21:32:06 GMT
      Cache-Control: no-cache
      X-Server-Name-And-Port: _:14000
    Length: unspecified [text/html]
    Saving to: ‘atribalfushionsported3ti.gif’
    
        [ <=>                                                                               ] 19,145      --.-K/s   in 0.09s   
    
    2013-10-27 14:32:07 (207 KB/s) - ‘atribalfushionsported3ti.gif’ saved [19145]
    Now, let's take a look at what this supposed "image" really is: http://paste.ubuntu.com/6314839/

    You will see that it contains a number of links to sites my script does block (google-analytics.com, quantcast.com) and also tries to open a popup. My script plus your browser's pop-up blocker prevent the second "image" from loading. (The third "image" grabs exactly the same HTML as the second.)
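
    A quick way to spot this kind of impostor without reading the headers is to look at the first few bytes of the saved file, since a real GIF starts with the signature GIF87a or GIF89a. This is only a sketch, and looks_like_gif is a name I made up for it.

```shell
#!/bin/bash

# looks_like_gif is a hypothetical helper: succeed only if the file begins
# with one of the two GIF signatures.
looks_like_gif() {
    # Read the first six bytes; strip NUL bytes so the shell does not
    # complain when the file is binary.
    case "$(head -c 6 "$1" | tr -d '\000')" in
        GIF87a|GIF89a) return 0 ;;
        *) return 1 ;;
    esac
}
```

    Running looks_like_gif atribalfushionsported3ti.gif on the file saved above would be expected to fail, because wget reported its content type as text/html.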

    Browser-based ad blockers will likely catch more ads, but they are slower and they work only in browsers. DNS blocking catches fewer ads but is faster and will work for every application that makes an Internet connection, including email clients, RSS readers, and more. It's up to each individual to determine which set of tradeoffs matters most.

    Originally posted by Feathers McGraw
    Is there any reason why it would be a bad idea to do this on a router? Would it filter ads for every device connected?
    Absolutely you can place it on your router, and the outcome will be exactly as you expect -- so long as each node on your network is using your router as the DNS server.
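
    To check which DNS servers a given Linux node is actually configured to use, something like the sketch below may help. list_nameservers is a hypothetical name, and the default path assumes a conventional /etc/resolv.conf. On some desktops a local caching resolver is listed there instead of the router, in which case this file alone won't tell the whole story.

```shell
#!/bin/bash

# list_nameservers is a hypothetical helper: print the address on every
# "nameserver" line of a resolv.conf-style file (default: /etc/resolv.conf).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

    If the router's LAN address appears in the output, lookups from that node should pass through the router and pick up the block list.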


  • Feathers McGraw
    replied
    Trying this now, so far so good!

    Tested using the site below, and some adverts still showed, but they're not real adverts, so I'm not sure what that means! Haven't had any real ones get through. Browsing seems snappier with AdBlock turned off.

    http://www.angelfire.com/alt2/entert...lock_test.html

    Is there any reason why it would be a bad idea to do this on a router? Would it filter ads for every device connected?

    Feathers


  • kubicle
    replied
    Originally posted by jlittle
    Kubicle, we've gone OT
    IIRC, this isn't the first time ...and likely not the last, I've got a tendency to do that.
    Originally posted by jlittle
    I'll start a new thread.
    Sounds like a plan, meet you on the other side.


  • jlittle
    replied
    Kubicle, we've gone OT, I'll start a new thread.


  • kubicle
    replied
    Originally posted by SteveRiley
    "Regular" Linux always consults /etc/hosts before DNS every time an application performs a host name lookup.
    This actually depends on configuration, although checking local host files before DNS is the default on most distributions.

    The lookup order is configured in /etc/nsswitch.conf (and, on older systems, /etc/host.conf); the man pages give the details.
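
    As a rough sketch of how to check the order on a given machine, the function below reads the "hosts:" line and reports whether "files" comes before "dns". The name files_before_dns is invented here, and it ignores the finer points of nsswitch.conf such as [NOTFOUND=return] actions.

```shell
#!/bin/bash

# files_before_dns is a hypothetical helper: succeed if the "hosts:" line of
# an nsswitch.conf-style file lists "files", and lists it before "dns"
# (or does not list "dns" at all). Default path: /etc/nsswitch.conf.
files_before_dns() {
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f) f = i   # position of first "files"
                if ($i == "dns"   && !d) d = i   # position of first "dns"
            }
        }
        END { exit !(f && (!d || f < d)) }
    ' "${1:-/etc/nsswitch.conf}"
}
```

    If files_before_dns fails on a system, entries in /etc/hosts may be consulted too late, or not at all, for the blocking to work.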


  • kubicle
    replied
    Originally posted by jlittle
    If you have a blank entry in your $PATH, including starting or ending with the separator colon, that means the cwd. I've done that for three decades. It's only an issue if the cwd is writable by people (or bots or software) you don't trust; we don't do that in a typical linux install.
    I'd still prefer a literal "." for clarity, a blank entry is easier to miss.

    I'm not terribly fond of relative path elements, especially if those are before the absolute path elements...as one might run something malicious by accident (like a browser plugin that places a modified sudo executable in your $HOME).

    It is usually much more convenient to place your executables in $PATH, but I do understand the reason why someone might wish to add cwd as a fallback.
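
    Since blank entries are easy to miss by eye, here is a small sketch that lists them along with any other relative entries. risky_path_entries is a made-up name, and "risky" just means "not an absolute path", which may be stricter than some people want.

```shell
#!/bin/bash

# risky_path_entries is a hypothetical helper: print every entry of a
# PATH-style string that is empty, ".", or otherwise not an absolute path.
# Empty entries are shown as "(empty)".
risky_path_entries() {
    # Append a ":" so a trailing empty entry survives the split into lines.
    printf '%s:' "$1" | tr ':' '\n' | awk '
        $0 == ""    { print "(empty)"; next }
        $0 !~ /^\// { print $0 }
    '
}
```

    Running risky_path_entries "$PATH" prints nothing when every entry is an absolute path.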
    Last edited by kubicle; Oct 27, 2013, 12:29 AM.
