[SOLVED] On the trail of the resolv.conf killer -- a murder mystery



    I'd be browsing dumb, fat and happy when all of a sudden the next URL I clicked on failed to display. "Server not found", or some such message.

    I opened a Konsole and pinged Google.com. Nothing.

    I did "cat /etc/resolv.conf" and noticed that my ISP DNS server information had been replaced with "192.168.1.1", as if it was being reconfigured as a static IP. My wicd icon in the system tray still showed a vertical green bar, indicating my connection was good. I opened up the wicd dialog and disconnected and reconnected to my wireless router. I got another green bar and fired up my browser. It still wouldn't connect. I checked resolv.conf and noticed that redoing my connection did NOT reload resolv.conf.

    I disconnected my wireless connection and went to my cable modem/wireless power strip and recycled the power to both of them. When my wireless was back up I used wicd to reconnect. I could browse again. I didn't want to be doing that every time my connection got dropped.

    A similar bug, reported here, didn't offer helpful information. I do not want to disable the automatic updates to /etc/resolv.conf because I may want to connect to another AP.

    Some bug reports say that installing resolvconf prevents the resolv.conf file from being overwritten, but I didn't want to install it, for the reasons given below.

    This was the beginning of a month of involuntary wireless disconnects that occurred with increasing frequency. When a drop occurred just minutes after turning on my equipment last week, and about once an hour thereafter, I began to suspect my Linksys WRT54GL router. I purchased a new TP-Link TL-WR1043ND wireless router (thanks for the tip, SithLord48!), expecting the problem to go away. Then, last Friday, while composing a response to a problem on this forum, I happened to be looking at the cable modem while deep in thought when the lights on the front of it all turned off, then started blinking just as if it had been power cycled! "Ah ha! I got you!", I thought. I called my ISP's tech support staff and their logs verified several incidents of involuntary power cycling on their side. They told me to bring in my modem and exchange it for a new one. I did.

    Fully expecting the issue to be resolved, I began surfing. Nothing happened that night, but on Saturday, the next day, I got another disconnect. resolv.conf contained only 192.168.1.1. I had created a copy of that file containing my DNS values, so I copied it over resolv.conf. I pinged google.com and got a response! Browsing resumed, but later that day I got several more disconnects, and on a couple of occasions, even though wicd's icon had a green bar and I had restored resolv.conf, pinging google.com gave no response.

    New wireless router. New modem. What could it be? Firefox? This morning I began surfing with Konqueror. Five minutes after I began surfing I got a disconnect. I immediately opened a console and restored resolv.conf from my backup. Surfing resumed normally.

    It's not Firefox, and it's not Konqueror. It's not my wireless router. It's not my modem.

    So, I will focus on either the wireless hardware in my laptop (Link 5100) or the wicd or dhcp application.

    If it's wicd, an update will fix it sooner or later.

    I wondered if the default lease time had somehow been reconfigured in a recent update. I checked my wireless router but found the lease time is still 1440 minutes (24 hours). While I had the admin section of my wireless router open, I added my ISP's DNS server addresses and domain to the router's DHCP settings and rebooted it, then continued to research the problem.

    dhclient.conf has no active settings for the lease time. That which it does have is commented out.
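
Besides lease settings, dhclient.conf also accepts a "supersede" statement that replaces whatever DNS servers the DHCP server hands out. The following is only a minimal sketch of adding one: the function name is made up for illustration, and the file location (for example /etc/dhcp3/dhclient.conf) is an assumption that varies by distribution.

```shell
#!/bin/sh
# Sketch: append a "supersede domain-name-servers" statement to a
# dhclient.conf file unless an active one is already present.
# Usage: add_supersede_dns CONF_FILE "SERVER1, SERVER2"
# (add_supersede_dns is a hypothetical helper, not part of dhclient.)
add_supersede_dns() {
    conf=$1
    servers=$2
    # Leave the file alone if an uncommented supersede line already exists.
    if grep -Eq '^[[:space:]]*supersede[[:space:]]+domain-name-servers' "$conf" 2>/dev/null; then
        return 0
    fi
    printf 'supersede domain-name-servers %s;\n' "$servers" >> "$conf"
}

# Example, run as root (the path is an assumption):
#   add_supersede_dns /etc/dhcp3/dhclient.conf "8.8.8.8, 8.8.4.4"
```

The trade-off is that a superseded DNS list follows the laptop to every network, which may not be what you want on an AP whose resolver is the only one reachable.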

    Manually setting the DNS, search and domain entries in resolv.conf doesn't work, even if you mark the file read-only, because /sbin/dhclient-script gives its newly generated file the same owner and permissions as the existing /etc/resolv.conf and then moves it over the original. The contents of "$new_resolv_conf" are, I believe, the source of the problem:
    Code:
        chown --reference=/etc/resolv.conf $new_resolv_conf
        chmod --reference=/etc/resolv.conf $new_resolv_conf
        mv -f $new_resolv_conf /etc/resolv.conf
    That's why people who try to fix this problem report that their "fixed" changes to resolv.conf do not stick. I could modify that code with:
    Code:
     
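    # Skip the update when the new file lists the gateway as a nameserver.
    # -F matches the dots literally; -w avoids matching e.g. 192.168.1.10.
    if ! grep -qFw "192.168.1.1" "$new_resolv_conf"; then
        chown --reference=/etc/resolv.conf "$new_resolv_conf"
        chmod --reference=/etc/resolv.conf "$new_resolv_conf"
        mv -f "$new_resolv_conf" /etc/resolv.conf
    fi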
    but I'd have no guarantee that my changes wouldn't be overwritten with a future update.
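
The reason a read-only resolv.conf does not survive can be shown with a small experiment in a scratch directory. This is a sketch only: the function name and the 10.0.0.53 address are made up, and it mimics just the final "mv -f" step of the script quoted above.

```shell
#!/bin/sh
# Sketch: show that a read-only file is still replaced by "mv -f",
# because a rename only needs write permission on the directory.
demo_readonly_replace() {
    dir=$(mktemp -d) || return 1
    echo "nameserver 10.0.0.53" > "$dir/resolv.conf"    # stand-in for the hand-edited file
    chmod 444 "$dir/resolv.conf"                        # mark it read-only
    echo "nameserver 192.168.1.1" > "$dir/resolv.conf.new"
    mv -f "$dir/resolv.conf.new" "$dir/resolv.conf"     # same kind of final step as dhclient-script
    cat "$dir/resolv.conf"                              # shows which content survived
    rm -rf "$dir"
}

demo_readonly_replace
```

The file that comes out is the new one, so the permission bits on the old file never get a chance to block anything.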

    Putting my ISP's DNS and domain name in my TP-Link wireless router's DHCP settings appears to be working. If I use my laptop on public wireless, at Starbucks or at a friend's house, I haven't changed dhcp3, so it should still get me connected. To stay connected I may have to save a copy of the working resolv.conf as resolv.conf_wherever and cp it back over resolv.conf if resolv.conf gets reset to the gateway.
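
The copy-it-back step could be wrapped in a small helper so it only acts when resolv.conf really has been reset to the gateway. This is a sketch under assumptions: the function name, the backup file name and the gateway address are all supplied by the caller and are illustrative only.

```shell
#!/bin/sh
# Sketch: if resolv.conf lists the gateway as a nameserver, copy a saved
# per-location backup over it. All three arguments are caller-supplied.
# Usage: restore_resolv RESOLV_FILE BACKUP_FILE GATEWAY_IP
restore_resolv() {
    resolv=$1
    backup=$2
    gateway=$3
    # -F: treat the dots literally; -x: the whole line must match.
    if grep -qxF "nameserver $gateway" "$resolv"; then
        cp "$backup" "$resolv" && echo "restored"
    else
        echo "unchanged"
    fi
}

# Example, run as root (file names are assumptions):
#   restore_resolv /etc/resolv.conf /etc/resolv.conf_home 192.168.1.1
```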

    Someone suggested adding
    Code:
    make_resolv_conf() {
        echo "doing nothing to resolv.conf"
    }
    to /etc/dhclient-enter-hooks but that would require you to manually edit resolv.conf and insert the DNS IP and domain and search names, IF you have that information. If you are at Starbucks and you don't know their DNS you'll have to use OpenDNS or Google's DNS, IF you've kept those settings handy in a file somewhere.
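
A variation on that hook, instead of doing nothing, could write a fixed set of public resolvers so the file is never left empty or pointing at the gateway. This is only a sketch: RESOLV_CONF_TARGET is a hypothetical variable added here so the function can be tried on a scratch file, and the addresses are the published OpenDNS and Google Public DNS resolvers.

```shell
# Sketch of a dhclient enter hook: override make_resolv_conf so that it
# writes fixed public resolvers instead of the DHCP-supplied ones.
# RESOLV_CONF_TARGET is a hypothetical override used for testing only.
make_resolv_conf() {
    target=${RESOLV_CONF_TARGET:-/etc/resolv.conf}
    {
        echo "# written by dhclient enter hook"
        echo "nameserver 208.67.222.222"   # OpenDNS
        echo "nameserver 8.8.8.8"          # Google Public DNS
    } > "$target"
}
```

With this in place no per-location backup is needed, at the cost of ignoring any local resolver an AP expects you to use.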

    Meanwhile, I am going to keep an eye on this. If putting the DNS and domain name in the router configuration works I'll just leave it this way for now.
    "A nation that is afraid to let its people judge the truth and falsehood in an open market is a nation that is afraid of its people."
    – John F. Kennedy, February 26, 1962.

    #2
    Re: On the trail of the resolv.conf killer -- a murder mystery

    I am not sure if this helps or muddies things, but don't wicd and networkmanager bypass or use their own resolv.conf and other files, if that makes any sense...


      #3
      Re: On the trail of the resolv.conf killer -- a murder mystery

      Originally posted by claydoh
      I am not sure if this helps or muddies things, but don't wicd and networkmanager bypass or use their own resolv.conf and other files, if that makes any sense...
      Don't know about wicd but networkmanager (at least KNetworkManager) uses resolv.conf.

      GreyGeek, I am not sure if it helps, but I had some problems with resolv.conf recently when I figured out that whenever KNetworkManager connects to or disconnects from any network, it updates resolv.conf, which was resetting my manually entered DNS info.


        #4
        Re: On the trail of the resolv.conf killer -- a murder mystery

        Don't think so.

        Wicd has: /var/lib/wicd/resolv.conf.orig
        but it contains nothing and has a timestamp of the day I installed wicd.

        So, putting my ISP's DNS and domain in the dhcp configuration section of my TP-Link wireless router appears to be working.

        I have yet to determine what is recycling the dhcp daemon and why, when that happens, resolv.conf is populated with my gateway address.


          #5
          Re: On the trail of the resolv.conf killer -- a murder mystery

          Originally posted by aqeeliz
          ......
          Don't know about wicd but networkmanager (at least KNetworkManager) uses resolv.conf.

          GreyGeek, I am not sure if it helps, but I had some problems with resolv.conf recently when I figured out that whenever KNetworkManager connects to or disconnects from any network, it updates resolv.conf, which was resetting my manually entered DNS info.
          I believe that both knm and wicd use /etc/resolv.conf or, more specifically, they both use the dhcp3 daemon, which updates resolv.conf every time it is called, as illustrated by the code snippet I posted from /sbin/dhclient-script. When fired from a cold boot, dhcp3 takes the DNS and domain name from the ISP and populates resolv.conf correctly. When "something" triggers dhcp3 while the wireless interface (wlan0 in my case) is up, it seems that dhcp3 often does not populate the file named by "$new_resolv_conf" correctly, and that is what gets moved over resolv.conf.

          But, so far, if the ISP DNS and domain name are put into the dhcp configuration section of the wireless router, resolv.conf is populated correctly.


            #6
            Re: On the trail of the resolv.conf killer -- a murder mystery

            This issue also affects using custom name servers with dialup or, in my case, mobile broadband. resolv.conf is always populated with the ISP's name servers. I can echo in my own, but it's really insane to have to do that every time the connection is interrupted, for example if I wanted to use OpenDNS or Google.


              #7
              Re: On the trail of the resolv.conf killer -- a murder mystery

              The mystery has been solved!

              I fired up this laptop and twenty minutes afterward, while I was browsing a website, I clicked on a URL and a new tab opened up ... and there it sat. I had been hit again...

              I opened a console and "cat /etc/resolv.conf". All was well.
              I tried to ping google.com. Nothing.
              I checked the wicd icon in the system tray. It gave me a green bar and the dialog said I was still connected.

              I was getting ready to run ifconfig and "route -n" when, after a couple of minutes, I could ping google.com again and the URL displayed.

              But, this occurrence was different, or at least I noticed something about it that I didn't see in the other occurrences: "Microcode SW error detected." I had something specific to google about. 8)
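
To check whether a given drop coincides with that message, the kernel log can be searched for it. A minimal sketch, assuming the log text is piped in on stdin (for example from dmesg); the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: count firmware-restart messages in kernel log text read from stdin.
count_microcode_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status clean.
    grep -c "Microcode SW error detected" || true
}

# Example:
#   dmesg | count_microcode_errors
```

Running it after each disconnect and comparing the count with the previous one gives a rough tally of how many drops the driver itself admits to.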

              I found that this bug has existed in various distros and versions for over two years, and I posted to an existing bug report:

              Comment 118 for bug 200509
              Code:
              I've been treated to a frequent but erratic occurrence of this bug:
              ************************************************************
              iwlagn 0000:05:00.0: Microcode SW error detected. Restarting 0x2000000
              Registered led device: iwl-phy0::radio
              Registered led device: iwl-phy0::assoc
              Registered led device: iwl-phy0::RX
              Registered led device: iwl-phy0::TX
              iwlagn 0000:05:00.0: Stopping AGG while state not ON or starting
              iwlagn 0000:05:00.0: queue number out of range: 0, must be 10 to 19
              WARNING: at /build/buildd/linux-2.6.32/net/mac80211/agg-tx.c:150 ___ieee80211_stop_tx_ba_session+0x82/0x90 [mac80211]()
              Hardware name: VGN-FW140E
              Modules linked in: cryptd aes_x86_64 aes_generic ppdev vboxnetadp vboxnetflt vboxdrv 
              snd_hda_codec_intelhdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep 
              fbcon snd_pcm_oss snd_mixer_oss tileblit snd_pcm font snd_seq_dummy snd_seq_oss 
              bitblit snd_seq_midi softcursor snd_rawmidi vga16fb snd_seq_midi_event vgastate arc4 
              snd_seq snd_timer uvcvideo videodev iwlagn snd_seq_device v4l1_compat i915 iwlcore 
              btusb usbhid drm_kms_helper snd v4l2_compat_ioctl32 bluetooth joydev sdhci_pci hid 
              mac80211 soundcore sdhci drm psmouse cfg80211 snd_page_alloc serio_raw led_class 
              i2c_algo_bit sony_laptop intel_agp wacom lp parport coretemp video output usb_storage 
              ohci1394 ahci sky2 ieee1394 [last unloaded: kvm]
              Pid: 9, comm: events/0 Not tainted 2.6.32-24-generic #42-Ubuntu
              Call Trace:
              [<ffffffff81066dbb>] warn_slowpath_common+0x7b/0xc0
              [<ffffffff81066e14>] warn_slowpath_null+0x14/0x20
              [<ffffffffa015b922>] ___ieee80211_stop_tx_ba_session+0x82/0x90 [mac80211]
              [<ffffffffa015babf>] __ieee80211_stop_tx_ba_session+0x7f/0x90 [mac80211]
              [<ffffffffa015b267>] ieee80211_sta_tear_down_BA_sessions+0x27/0x50 [mac80211]
              [<ffffffffa01701a8>] ieee80211_reconfig+0x378/0x460 [mac80211]
              [<ffffffffa01564c0>] ? ieee80211_restart_work+0x0/0x30 [mac80211]
              [<ffffffffa01564e2>] ieee80211_restart_work+0x22/0x30 [mac80211]
              [<ffffffff81080867>] run_workqueue+0xc7/0x1a0
              [<ffffffff810809e3>] worker_thread+0xa3/0x110
              [<ffffffff81085430>] ? autoremove_wake_function+0x0/0x40
              [<ffffffff81080940>] ? worker_thread+0x0/0x110
              [<ffffffff810850b6>] kthread+0x96/0xa0
              [<ffffffff810141ea>] child_rip+0xa/0x20
              [<ffffffff81085020>] ? kthread+0x0/0xa0
              [<ffffffff810141e0>] ? child_rip+0x0/0x20
               ---[ end trace a69e8e33c007fa40 ]---
              
              then the wireless reconnects on its own ...
              
              wlan0: deauthenticated from d8:5d:4c:b9:f4:ba (Reason: 2)
              wlan0: direct probe to AP d8:5d:4c:b9:f4:ba (try 1)
              wlan0: direct probe responded
              wlan0: authenticate with AP d8:5d:4c:b9:f4:ba (try 1)
              wlan0: authenticated
              wlan0: associate with AP d8:5d:4c:b9:f4:ba (try 1)
              wlan0: RX AssocResp from d8:5d:4c:b9:f4:ba (capab=0x431 status=0 aid=1)
              wlan0: associated
              CE: hpet increasing min_delta_ns to 15000 nsec
              ******************************************************
              
              I am running the 64bit Kubuntu 10.04 with KDE 4.5 and with this kernel:
              2.6.32-24-generic #42-Ubuntu SMP Fri Aug 20 14:21:58 UTC 2010 x86_64 GNU/Linux
              on my Sony VAIO VGN-FW140E, which has the 
              "Intel Corporation Wireless WiFi Link 5100" wireless controller.
              
              It characterizes its appearance by 
              1) blinking my cable modem lights as if it was going through a power cycle, and 
              2) overwriting my ISP's DNS and domain name in resolv.conf with my wireless router gateway address: 192.168.1.1. 
              
              Pinging google.com returns nothing.
              
              I used to power cycle my wireless and modem but I found that IF I manually 
              restore /etc/resolv.conf from my backup copy the wireless connection will 
              automatically resume and after a few seconds I can get a response from 
              pinging google.com, and I can then continue browsing. That led me to 
              adding my ISP's DNS and domain name in the DHCP configuration section
              of my wireless router. 
              Now, when my connection drops and automatically restores, my wireless 
              router restores /etc/resolv.conf and after a few seconds I can continue browsing.
              I read there that some folks restored their system by doing:
              sudo rmmod iwlagn && sudo modprobe iwlagn
              Reading the bug reports I also found something interesting ...
              a way to read the status of my wireless hardware:

              jerry@sonyvgnfw140e:~/$ vdir /sys/bus/pci/drivers/iwlagn/0000:05:00.0/
              total 0
              -rw-r--r-- 1 root root 4096 2010-08-31 15:51 broken_parity_status
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 class
              -rw-r--r-- 1 root root 4096 2010-08-31 13:25 config
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 device
              lrwxrwxrwx 1 root root 0 2010-08-31 13:25 driver -> ../../../../bus/pci/drivers/iwlagn
              -rw------- 1 root root 4096 2010-08-31 15:51 enable
              -rw-r--r-- 1 root root 4096 2010-08-31 15:51 filter_flags
              lrwxrwxrwx 1 root root 0 2010-08-31 15:51 firmware_node -> ../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1e/device:1f
              -rw-r--r-- 1 root root 4096 2010-08-31 15:51 flags
              drwxr-xr-x 3 root root 0 2010-08-31 13:25 ieee80211
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 irq
              drwxr-xr-x 6 root root 0 2010-08-31 13:47 leds
              -r--r--r-- 1 root root 4096 2010-08-31 15:51 local_cpulist
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 local_cpus
              -r--r--r-- 1 root root 4096 2010-08-31 15:51 modalias
              -rw-r--r-- 1 root root 4096 2010-08-31 15:51 msi_bus
              drwxr-xr-x 3 root root 0 2010-08-31 13:25 net
              -r--r--r-- 1 root root 4096 2010-08-31 15:51 numa_node
              drwxr-xr-x 2 root root 0 2010-08-31 15:51 power
              --w--w---- 1 root root 4096 2010-08-31 15:51 remove
              --w--w---- 1 root root 4096 2010-08-31 15:51 rescan
              --w------- 1 root root 4096 2010-08-31 15:51 reset
              -r--r--r-- 1 root root 4096 2010-08-31 15:35 resource
              -rw------- 1 root root 8192 2010-08-31 15:51 resource0
              -r--r--r-- 1 root root 4096 2010-08-31 15:51 statistics
              lrwxrwxrwx 1 root root 0 2010-08-31 13:25 subsystem -> ../../../../bus/pci
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 subsystem_device
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 subsystem_vendor
              -r--r--r-- 1 root root 4096 2010-08-31 15:51 temperature
              -rw-r--r-- 1 root root 4096 2010-08-31 15:51 tx_power
              -rw-r--r-- 1 root root 4096 2010-08-31 13:24 uevent
              -r--r--r-- 1 root root 4096 2010-08-31 13:25 vendor
              Here is the temperature of my wireless adapter:
              cat /sys/bus/pci/drivers/iwlagn/0000:05:00.0/temperature
              67
              I suspect that 67 is the Celsius temperature, which is 152 F. IF it is, that's HOT!
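
The conversion can be scripted. This is a sketch that ASSUMES the attribute holds whole degrees Celsius, which the post only suspects; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: read an integer temperature from a file and print it in both
# units, ASSUMING the stored value is degrees Celsius.
show_temp() {
    c=$(cat "$1") || return 1
    f=$(( $c * 9 / 5 + 32 ))     # integer arithmetic, so the result is truncated
    echo "${c} C = ${f} F"
}

# Example:
#   show_temp /sys/bus/pci/drivers/iwlagn/0000:05:00.0/temperature
```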

              I've been exploring the other parameters in that sub-directory.
              It's giving me an idea for a wireless monitor application which would give a graphical presentation of the wireless adapter and display the values of those parameters live.
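
A text-only starting point for such a monitor might simply print every readable attribute in that directory. This is a rough sketch: the function name is made up, and binary attributes such as config will not print anything useful.

```shell
#!/bin/sh
# Sketch: print "name: value" for every readable regular file directly
# inside a directory, e.g. a driver's sysfs device directory.
dump_attrs() {
    for f in "$1"/*; do
        # Skip directories, symlinks to directories and unreadable entries.
        [ -f "$f" ] && [ -r "$f" ] || continue
        # Keep only the first line; read errors are silenced.
        printf '%s: %s\n' "$(basename "$f")" "$(head -n 1 "$f" 2>/dev/null)"
    done
}

# Example, refreshing every 5 seconds:
#   while sleep 5; do clear; dump_attrs /sys/bus/pci/drivers/iwlagn/0000:05:00.0; done
```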

              EDIT:
              I forgot to add that some people report that
              echo options iwlagn swcrypto=1 >> /etc/modprobe.d/options
              solves their disconnect problem. I may try that.


                #8
                Re: On the trail of the resolv.conf killer -- a murder mystery

                Originally posted by claydoh
                I am not sure if this helps or muddies things, but don't wicd and networkmanager bypass or use their own resolv.conf, and other files?
                After a quick browse through the wicd code (just a glance, so I may have missed something), it looks like when wicd starts, it backs up (moves) the original resolv.conf to /var/lib/wicd/resolv.conf.orig, and takes over handling of /etc/resolv.conf through its own settings.

                You can configure it to use static nameservers or acquire DNS through dhcp with your chosen dhcp client (dhclient, dhcpcd...).

                And when wicd exits, it'll restore the original resolv.conf back.

                So it does use /etc/resolv.conf (the file), but still uses its own settings in modifying the file (and doesn't use the original settings in the file).


                  #9
                  Re: [SOLVED] On the trail of the resolv.conf killer -- a murder mystery

                  This has certainly been an interesting thread. I had nothing to contribute but I learned a lot about how resolv.conf is handled by the network managers. Great research guys.


                    #10
                    Re: On the trail of the resolv.conf killer -- a murder mystery

                    Originally posted by kubicle
                    Originally posted by claydoh
                    I am not sure if this helps or muddies things, but don't wicd and networkmanager bypass or use their own resolv.conf, and other files?
                    After a quick browse through the wicd code (just a glance, so I may have missed something), it looks like when wicd starts, it backs up (moves) the original resolv.conf to /var/lib/wicd/resolv.conf.orig, and takes over handling of /etc/resolv.conf through its own settings.
                    ....
                    And when wicd exits, it'll restore the original resolv.conf back.

                    So it does use /etc/resolv.conf (the file), but still uses its own settings in modifying the file (and doesn't use the original settings in the file).
                    I saw that too, but when I checked out the resolv.conf.orig file it contained only "# Generated by NetworkManager". Nothing else. And it has a time stamp of 2010-5-14. I have been running wicd for months, probably since around May 14th.

                    I also read somewhere that if the time stamp on /etc/resolv.conf doesn't match that on resolv.conf.orig then neither wicd nor networkmanager will touch /etc/resolv.conf. My memory is fuzzy on the details.

                    There are bug reports on this "Microcode SW error detected" going back more than two years and on a wide variety of wireless modules, not just mine. While the mystery is solved for me, apparently it is not solved for the developers. All they have managed to do is find workarounds, like what I am doing, or reloading the driver module, or the swcrypto settings. Some think it is in specific kernels because they can downgrade to the previous kernel and the problem stops. Others report the problem with the downgraded kernel. No reports in the bugzillas (Ubuntu, RedHat, etc...) have definitively pointed to a piece of code and claimed it to be the problem.

                    But, so far, the swcrypto thing appears to be working for me, although it doesn't work for others. Weird bug.


                      #11
                      Re: On the trail of the resolv.conf killer -- a murder mystery

                      Originally posted by GreyGeek
                      Originally posted by kubicle
                      Originally posted by claydoh
                      I am not sure if this helps or muddies things, but don't wicd and networkmanager bypass or use their own resolv.conf, and other files?
                      After a quick browse through the wicd code (just a glance, so I may have missed something), it looks like when wicd starts, it backs up (moves) the original resolv.conf to /var/lib/wicd/resolv.conf.orig, and takes over handling of /etc/resolv.conf through its own settings.
                      ....
                      And when wicd exits, it'll restore the original resolv.conf back.

                      So it does use /etc/resolv.conf (the file), but still uses its own settings in modifying the file (and doesn't use the original settings in the file).
                      I saw that too, but when I checked out the resolv.conf.orig file it contained only "# Generated by NetworkManager". Nothing else. And it has a time stamp of 2010-5-14. I have been running wicd for months, probably since around May 14th.
                      The timestamp doesn't say much: as long as the file is only moved (from one place to another), the timestamp remains the same.

                      Wicd can move the file back and forth numerous times with the timestamp staying the same as the last edit (in your case, networkmanager was probably the last program to edit the file, judging by the contents, and wicd has just tossed that original config back and forth in a series of back-ups and restores ever since).

                      There is an easy way to test that: stop the wicd daemon (sudo service wicd stop) and see what your /etc/resolv.conf contains. Is it the backed-up file or something else?
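
That comparison can be done with cmp rather than by eye. A minimal sketch with a made-up function name; a missing backup is reported separately because, if wicd really moves the backup into place when it restores, the .orig file would no longer exist afterwards.

```shell
#!/bin/sh
# Sketch: report whether the current resolv.conf matches the wicd backup.
# Usage: compare_resolv CURRENT_FILE BACKUP_FILE
compare_resolv() {
    if [ ! -e "$2" ]; then
        echo "backup missing"
    elif cmp -s "$1" "$2"; then      # -s: silent, only the exit status matters
        echo "identical"
    else
        echo "different"
    fi
}

# Example, after stopping wicd:
#   compare_resolv /etc/resolv.conf /var/lib/wicd/resolv.conf.orig
```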


                        #12
                        Re: [SOLVED] On the trail of the resolv.conf killer -- a murder mystery

                        Just did that.

                        Neither file changed their contents. The resolv.conf.orig file has only that "networkmanager" comment and /etc/resolv.conf has my ISP's DNS and domain name in it.


                          #13
                          Re: [SOLVED] On the trail of the resolv.conf killer -- a murder mystery

                          Originally posted by GreyGeek
                          Neither file changed their contents. The resolv.conf.orig file has only that "networkmanager" comment and /etc/resolv.conf has my ISP's DNS and domain name in it.
                          Hmm...I guess I'll have to go back looking at the code, then...to see what's missing from the equation :P.

                          Anyway, this is purely academic; it doesn't have an effect on the issue you were having (and tracked down).


                            #14
                            Re: [SOLVED] On the trail of the resolv.conf killer -- a murder mystery

                            Originally posted by kubicle
                            ....
                            Anyway, this is purely academic; it doesn't have an effect on the issue you were having (and tracked down).
                            True, but reading anything you write I find to be very informative and educational.


                              #15
                              Re: [SOLVED] On the trail of the resolv.conf killer -- a murder mystery

                              Hmm...It seems to go like this:

                              wicd will restore the backup if you stop the daemon with "sudo wicd --kill". Then it'll move the resolv.conf.orig file as /etc/resolv.conf.

                              And when wicd starts again, it'll back up /etc/resolv.conf as /var/lib/wicd/resolv.conf.orig again, but only if the back-up doesn't exist. (if wicd hasn't restored the back-up when exiting, the file still exists).

                              And "sudo service wicd stop" doesn't stop wicd with "sudo wicd --kill" (maybe it should, though?). /etc/init.d/wicd uses the generic "start-stop-daemon" command, so wicd doesn't restore resolv.conf.
