While NetReg manages what your network "should" look like, NetMon was designed to give more insight into how things actually look. We like to think every machine will be configured with DHCP so things will just "work". NetMon can tell you what machines aren't, and more. Using NetMon data we have begun identifying machines that are using a different MAC address than registered, as well as finding unused network registrations. The former allows us to contact users and get the registrations fixed, so that when we sweep for the latter, we catch true dead registrations and reclaim IP space.
Much of this assumes you have a number of static (aka "fixed") addresses and don't simply run lots of dynamic pools. However, NetMon has also been extended to keep track of DHCP lease information, and provide tools for searching and graphing this data. In short, NetMon is something of a moving target of features, but the basic premise is to enable you to have better control of your network.
A word of caution, though, before proceeding: we are a fairly Cisco-centric campus. Some of the SNMP MIBs used may not be compatible with other agents. We have, however, had fairly good success collecting from the Lucent/ORiNOCO AP-1000s and AP-500s. Long term we'd like to make the vendor-specific bits minimal and extracted for easy extensibility. One of many goals, though.
One of the central NetMon components is the master capture process. Invoked from cron (typically), the master capture process:
- Identifies work that needs to be done (from device_timings)
- Determines how many agents are collecting
- Starts additional agents as necessary
If there is no more work to be done, the process dies. It could easily be designed to stay around and thus save the expense of re-invoking from cron, but we like having definite cycles. This also gives you a chance to easily update NetMon code. In the lib/CMU/NetMon directory is a netmon.sh file. This is what we run from cron; it invokes the capture agent and, once done, invokes the processing agent. You can start a new capture agent with the command line:

  perl -I$NRHOME -MCMU::NetMon::Capture -e 'run_capture()'

(substituting your NRHOME appropriately.)
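For reference, the cron entry itself is nothing special. Something along these lines works; the install path here is an assumption, and the interval is a site choice:

  # invoke the NetMon capture/process cycle every 15 minutes
  0,15,30,45 * * * * /home/netreg/lib/CMU/NetMon/netmon.sh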
Most configuration of the capturing agent is done in the Config.pm file. (See the Configuration section above for more details.)
If /etc/nonetmon exists, no NetMon capturing or processing will take place (unless you are running an agent on a remote machine, for example, and that machine does not contain the nonetmon file.)
The master processing agent is the hub for launching processing routines as needed. Unlike the capture tasks, in which the master agent starts many subagents for parallel collection, the processing routines are fairly serial in nature -- generally they consume CPU time exchanging data with the SQL server as well as doing a fair bit of processing work themselves. Thus, the processing agent reviews the work to be done, starts each task, and waits for its completion before starting the next.
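In outline, the dispatch is fork-and-wait rather than fork-and-continue. A minimal sketch of the pattern (not the actual NetMon internals; @tasks is a hypothetical list of code references for the pending work):

  foreach my $task (@tasks) {
      defined(my $pid = fork()) or die "fork: $!";
      if ($pid == 0) {
          $task->();          # child: run the processing routine
          exit 0;
      }
      waitpid($pid, 0);       # parent: wait before starting the next task
  }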
Most configuration of the processing agent is done in the Config.pm file. (See the Configuration section above for more details.)
A large part of NetMon is the collection of data from network devices -- routers and switches on your network. Any device we collect from must be registered in the device table. The table also contains additional fields related to the device (a sketch of adding a device follows the list):
- read_comm: All collection is done via SNMP, and we need to know the SNMP "read" string used on your device. The default for most devices is 'public', but you may want to restrict access to your devices by setting a non-public read string.
- slow: The slow value can be 'Yes' or 'No', depending upon whether NetMon has had recent trouble contacting this device. If slow, all collection will be done by separate collection agents. This will allow our 'fast' agents to collect from responsive devices (and thus, collect from them faster.)
- netreg_id: NetMon attempts to identify the NetReg machine record matching this device. This is so that we can track changes in the hostname, and in general establish a more solid link to NetReg.
- alias_dev: If this device is just an alias for another device (ie, multiple interfaces on the same router), then NetMon will set the alias_dev field to the "real" device name.
- function: The function field can be 'Gateway' or 'Non-Gateway'. We use this designation in the topology generation (see below).
- system_id: This is a text field that will be updated to contain information about the device, as reported by itself through SNMP. This could be used, for example, to select different collection routines depending upon the device type, version, etc.
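For illustration, registering a switch for collection might look like the sketch below. This assumes a MySQL backend and just the columns described above; the real schema likely has more fields, and the database name, credentials, hostname and read string are all placeholders:

  use DBI;
  my $dbh = DBI->connect('DBI:mysql:netdb:localhost', 'netmon', 'secret',
                         { RaiseError => 1 });
  $dbh->do(q{INSERT INTO device (name, read_comm, slow, function)
             VALUES (?, ?, 'No', 'Non-Gateway')},
           undef, 'sw-example-1.net.example.org', 'not-public');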
As alluded to above, NetMon can dig through NetReg looking for new devices that you want to collect from. The parameters for this searching are in the NetMon configuration (see above).
The CAM tables of network switches contain information about which port a given MAC address is connected to. Fundamentally a CAM table is what makes a switch more than a hub: stateful information about where traffic should be forwarded. By gathering the information about where MAC addresses appear in CAM tables around the network, we have very useful information for tracking down machines.
The fundamental data element we deal with in capturing this information is a 4-element group:
- The network device we captured this MAC Address from
- The MAC Address
- The port that the MAC Address appeared on
- The VLAN that this MAC Address appeared on
Our capture of VLAN information is actually a bit more specific: we identify all the VLANs on a given switch, and then walk the CAM table for each separately. Thus we can elegantly deal with multi-access and/or trunked VLANs. This is extremely useful in generating the topology information, as VLANs are handled very nicely, being kept separate even if multiple VLANs use the same network devices.
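On Cisco devices, the per-VLAN walk is typically done with community string indexing: the VLAN number is appended to the read string. A minimal sketch using Net::SNMP; the hostname, read string and VLAN list are placeholders, and mapping bridge ports back to ifIndex (via dot1dBasePortIfIndex) is omitted:

  use Net::SNMP;
  my $fdb_port = '1.3.6.1.2.1.17.4.3.1.2';   # BRIDGE-MIB::dot1dTpFdbPort
  for my $vlan (2, 10, 120) {
      my ($session, $error) = Net::SNMP->session(
          -hostname  => 'sw-example-1.net.example.org',
          -community => "not-public\@$vlan",  # community string indexing
      );
      die $error unless defined $session;
      my $table = $session->get_table(-baseoid => $fdb_port) or next;
      while (my ($oid, $bridgeport) = each %$table) {
          # the last six sub-identifiers of the OID encode the MAC address
          my $mac = join ':', map { sprintf '%02x', $_ }
                              (split /\./, $oid)[-6 .. -1];
          print "vlan $vlan  $mac  bridge port $bridgeport\n";
      }
      $session->close;
  }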
The capture agents just suck all of the CAM table information from the device and put it into the cam_capture table. It is then the job of the CAM processing engine to reduce the size of the captured data and store it in the cam_tracking table.
The CAM processing engine makes the assumption that if a host appeared on a given switch/port/vlan an hour ago, and is still there, it makes no sense to add another line to the cam_tracking table. Instead, all we do is update a last_updated field to indicate that we saw the host again, and move on (removing the entry from cam_capture). We also tie into NetReg to identify if a host is unregistered at the time of processing. Thus, if a host was unregistered but then registered in NetReg, a new cam_tracking entry will be created.
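In SQL terms this step is an update-or-insert. A sketch, assuming cam_tracking columns along the lines described in this section (the real schema may differ; $dev, $mac, $port and $vlan would come from a cam_capture row, and $dbh is a connected DBI handle as in the earlier sketch):

  # refresh the active entry if one exists...
  my $rows = $dbh->do(
      q{UPDATE cam_tracking SET last_updated = now()
        WHERE device = ? AND mac = ? AND port = ? AND vlan = ?
          AND `end` = 0},
      undef, $dev, $mac, $port, $vlan);
  # ...otherwise the host has (re)appeared here: open a new entry
  if ($rows == 0) {
      $dbh->do(q{INSERT INTO cam_tracking
                   (device, mac, port, vlan, start, `end`, last_updated)
                 VALUES (?, ?, ?, ?, now(), 0, now())},
               undef, $dev, $mac, $port, $vlan);
  }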
Each cam_tracking entry has a start and end time. Any "active" entries are defined to have the end value set to '0'. NetMon is designed for maximum parallelism: we make no assumption that certain tasks will be serialized. Thus, if a processing engine is processing records from a given device at the same time as a capture agent is adding records, the behavior will be consistent and correct.
As entries disappear from CAM tables, the end fields are updated, and the entries are no longer considered active. If the host appears again on the same device/port/VLAN, we'll add a new line.
You can then query NetMon for the known information about a given MAC address and receive a list of everywhere that NetMon saw the host.
An additional feature that NetMon builds from cam_tracking entries is the so-called "assumed location" identification. In a typical network, if two hosts communicate with one another, the CAM tables of all switches between them will be updated to reflect this. The MAC addresses of the hosts will appear on the uplink ports of network infrastructure that the two do not share. Thus, NetMon will "see" a host on the uplink ports of a number of other network switches (other than those in the direct path to the host from your root).
The "assumed location" feature takes a look at all the currently active locations that we see a current host, and makes its best guess about the actual physical port the device is connected to. It does this by counting the number of CAM table entries on all the ports that we see the host on. The port with the fewest number of CAM table entries is then the "assumed location". This, in practice, is very accurate.
We also collect the ARP tables from all of our network routers. The ARP tables provide a layer-3 (IP) to layer-2 (MAC Address) translation. Thus, the ARP tables give us some idea what IP address various MAC Addresses are using. This is interesting data, because we can then compare the MAC Address registered for a given IP address with the actual MAC Address being used. Where different, we can contact the user and request they update the network registration. Additionally, while we are not currently doing this, we could compare the IP information provided by DHCP with the actual IP/MAC Address in use.
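Capturing this boils down to one SNMP walk per router. A sketch using Net::SNMP; the hostname and read string are placeholders:

  use Net::SNMP;
  my ($session, $error) = Net::SNMP->session(
      -hostname  => 'gw-example.net.example.org',
      -community => 'not-public',
      -translate => [-octetstring => 0],   # keep MACs as raw octets
  );
  die $error unless defined $session;
  my $base  = '1.3.6.1.2.1.4.22.1.2';      # ipNetToMediaPhysAddress
  my $table = $session->get_table(-baseoid => $base)
      or die $session->error;
  while (my ($oid, $mac) = each %$table) {
      # the index is <ifIndex>.<ip>; the last four parts are the IP
      my $ip = join '.', (split /\./, $oid)[-4 .. -1];
      printf "%-15s %s\n", $ip, join ':', unpack '(H2)6', $mac;
  }
  $session->close;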
All of this helps us provide good service for our users -- if we decide to renumber an IP subnet, for example, we can identify the users that won't automatically work because of not using DHCP and contact them prior to the move.
The ARP table capturing agent simply grabs the tables from all of our routers and stores them in the arp_capture table. Periodically the processing agent runs, eliminating duplicate IP and MAC Address mappings (which should not happen unless you capture from more than one layer-3 device on the same physical network). It updates the arp_tracking and arp_map tables with the mapping information. To reduce the number of entries in arp_tracking, we record the actual locations where mappings were seen in arp_map.
As with CAM information, the ARP information can be searched in NetMon and used to validate the location of hosts.
These two collection types are done mainly for housekeeping purposes, but they do play important roles for the topology generation. The ping "capture" agent does not actually capture any data, but simply sends out pings to all of your devices. The goal here is to make sure that the management interface of the network devices is completely known by all upstream devices (and by "known" we mean "in the CAM or ARP table of..").
The DevMAC collector solves an issue we discovered during testing. Namely, most network devices have several MAC addresses that technically belong to the device. For example, a 24-port switch might have one MAC address for each port, as well as an address for the management interface. During normal network activity, though, the exact MAC address used to answer ARPs and otherwise act as a network citizen varies from device to device. Thus, the DevMAC collector simply grabs ALL MAC addresses that belong to the device. Then, in generating the topology information, we understand that if we see any of those MAC addresses, we still know what the actual device is.
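A sketch of the collection, again with Net::SNMP: a single walk of IF-MIB::ifPhysAddress yields the MAC address of every interface on the device (hostname and read string are placeholders):

  use Net::SNMP;
  my ($session, $error) = Net::SNMP->session(
      -hostname  => 'sw-example-1.net.example.org',
      -community => 'not-public',
      -translate => [-octetstring => 0],
  );
  die $error unless defined $session;
  my $table = $session->get_table(-baseoid => '1.3.6.1.2.1.2.2.1.6')
      or die $session->error;
  my %devmac;
  for my $phys (grep { length } values %$table) {  # skip MAC-less interfaces
      $devmac{ join ':', unpack '(H2)6', $phys } = 1;
  }
  $session->close;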
Due to some relatively short CAM/ARP table timeouts, the Ping agent is generally run very frequently (on the order of every 5 minutes). However, most devices do not dynamically change their MAC addresses (at least, none that we have found), so the DevMAC collector typically runs once a day.
While tracking down abuse issues and other network problems, we frequently needed to identify the user of a particular IP address at a given time. Rather than spending the rest of our lives grepping DHCP logs (and thus, maintaining large unwieldy logfiles), we decided to write a quick NetMon agent that gathers DHCP lease information and loads it into our NetMon database. Then we can quickly identify the user of a given IP and easily generate some utilization graphs. Much more could be done with the gathered lease data.
To load your DHCP lease information into NetMon, you'll need to install some capturing agents on your DHCP servers. In practice we put a skeleton NetReg/NetMon install on our DHCP servers (though it includes the entire lib tree.) We then have a script that launches the agents. The script is designed to be put into cron, attempting to re-launch every hour or so. Upon startup it tries to determine if any other agents are running and, if not, continues running. This is basic failure protection; it's not elegant, but we like our DHCP lease information.
The agent reads /etc/dhcpd.leases for dynamic lease assignment information. However, fixed-addresses are not recorded in this file. Thus, we need to get information about static leases. We could use syslog information, but we'd definitely encounter duplicates and it's a fairly annoying format. Instead we've written a simple patch to the DHCP server to keep a separate logfile of static lease assignments. This solution is not particularly elegant (hrm, there is an elegance trend here), but works nicely. The patch is located in doc/dhcpd.patch. You should modify it to write your logfile where you like (we use /var/log/dhcpd.static). If someone feels like making the agent work using syslog, that would be useful for many, we're sure.
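For a feel of what the agent parses, here is a bare-bones sketch of reading the dynamic lease file. It pulls out only a few fields and omits the locking, tail-following and 'ends never;' handling a real agent needs:

  open my $fh, '<', '/etc/dhcpd.leases' or die "leases: $!";
  my ($ip, %lease);
  while (<$fh>) {
      if (/^lease\s+(\S+)\s*\{/)                 { $ip = $1; %lease = () }
      elsif (/^\s*starts\s+\d+\s+(.*);/)         { $lease{starts} = $1 }
      elsif (/^\s*ends\s+\d+\s+(.*);/)           { $lease{ends}   = $1 }
      elsif (/^\s*hardware\s+ethernet\s+(\S+);/) { $lease{mac}    = $1 }
      elsif (/^\}/ && defined $ip) {             # end of one lease block
          print "$ip $lease{mac} $lease{starts} -> $lease{ends}\n"
              if defined $lease{mac};
          undef $ip;
      }
  }
  close $fh;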
The script for starting the collection agent is in support/bin/start-netmon-dhcp-agent.sh. As noted, we run this from cron every hour or so.
Every hour (or whatever interval you specify in Config.pm), the DHCP lease processing engine fires off and updates your RRD graphs of DHCP utilization. It will add a data point every DHCPGraphInterval seconds (where DHCPGraphInterval is also a value from Config.pm.) These graphs will then be available from the standard NetMon web interface (in the "DHCP" section.)
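The RRD side is a simple periodic update. A sketch using the RRDs Perl bindings, assuming one RRD file per subnet named after its base address with a single leases-in-use data source (the directory and count are placeholders):

  use RRDs;
  my $rrd    = '/home/netreg/rrd/128.2.64.0.rrd';
  my $in_use = 117;                      # leases counted for this interval
  RRDs::update($rrd, time() . ":$in_use");
  if (my $err = RRDs::error) { warn "RRD update failed: $err\n" }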
The graph generation uses your DHCP configuration (also configured in Config.pm) to determine the subnets with active dynamic pools. If your configuration is not generated from NetReg, your results with the graphs may vary. The parser is fairly robust, but you definitely want to add one comment you probably don't otherwise have -- namely, a description of the subnet (ie subnet name) prior to every subnet declaration.
So the DHCP configuration files that NetReg generates look like (note the subnet-name comment preceding the declaration; the name shown here is just a placeholder):

  # Example-Subnet-Name
  subnet 128.2.64.0 netmask 255.255.240.0 { ... pool ... }
The utilization graphs are created and keyed on the base address of the subnet (the first IP address in the subnet declaration above). Thus, if you change the base address of a subnet and want to keep the utilization graphs going (with historical data), then you need to perform some non-trivial conversion. We have not done this, but the basic outline is:
- Lock the NetMon DHCP Processing from happening (set the last run time of the RUN_DHCP key in _sys_info to now()), and lock the DHCP configuration (or copy the config without the base address change to somewhere safe)
- As atomically as possible, kill your DHCP servers, force the NetMon collection agent to load all the lease information (you can just start a new one), kill all the NetMon collection agents, and restart your DHCP server (with the new base address configured). Do NOT restart the collection agents yet. It's okay, as they will lose no data as long as they aren't dead for longer than your shortest DHCP lease time.
- Run the NetMon DHCP processing agent (set the last run time of the RUN_DHCP key to at least an hour in the past). Make sure the DHCP config it has does NOT see the new base address.
- Move the RRD database file to the new base address (the filenames in your RRD directory are named after the base address of the subnet).
- Restart the NetMon agents on your DHCP servers.
- At some point the NetMon DHCP Processor will update all the graphs, etc.
Note, though, that if your base address moves and the old leases are no longer on the same network (ie if you are moving the entire network), your graphs will not include the old leases in the graphs. Your best bet in this case would be to shorten the lease time considerably to reduce the overlap as clients move from one network to the other.
You should never change the DHCP_GRAPH_START variable in _sys_info unless you really know what you are doing -- this can dramatically affect your RRD databases.
The DHCP lease search page and DHCP graphs are available under the "DHCP" section on the administrative access section of NetMon.
One of our later goals for NetMon was to be able to construct a dynamic graph of the network topology with as few assumptions about the design as possible. Using the collected ARP and CAM information, and an interesting topology algorithm, we have had some success in generating the topology. While we are not actively generating the topology currently (there have been some major changes in ARP and CAM collection since the initial design, requiring it to be fixed), the hope is to have this working in the near future.
The basic premise of the topology generation is that if the path to network device C is via A and B, then A will know about the MAC addresses of B and C on the appropriate downlink port, and B will know of C on a downlink port. The algorithm repeatedly iterates over a list of devices. For each device we keep an active list of the other devices it knows about. On each iteration, any device known in only one location is removed from the list, as we are able to place it in the network graph. Following this, that device is removed from every other device's list, thus making more devices "known" in only one location.
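A sketch of the elimination loop described above. The %seen_on structure (each device mapped to the set of other devices whose MACs appear in its CAM table) and build_seen_on() are hypothetical; constructing them from the CAM data is not shown:

  my %seen_on = build_seen_on();   # device => { device-it-knows => 1, ... }
  my %placed;                      # device => the device it hangs off of
  my $progress = 1;
  while ($progress) {
      $progress = 0;
      my %where;                   # device => [ devices that see it ]
      for my $dev (keys %seen_on) {
          push @{ $where{$_} }, $dev for keys %{ $seen_on{$dev} };
      }
      for my $target (keys %where) {
          next if $placed{$target} or @{ $where{$target} } != 1;
          $placed{$target} = $where{$target}[0];   # place it in the graph
          delete $seen_on{$_}{$target} for keys %seen_on;
          $progress = 1;
      }
  }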
If you are interested in the topology discovery algorithm and system, please let us know. There are several interesting issues related to the topology discovery that we can discuss.
The network debugger was designed to give end-users (and our front-line support staff) a quick way to diagnose common network problems. It provides a very limited interface to the NetMon data.
Given an IP address, it queries the ARP information to see how many unique mappings we identified (which can quickly identify if a machine is misconfigured and stealing an IP). It also queries the DHCP lease information to see how many leases were given (0 leases means no use of DHCP/no network connectivity..)
Given a MAC address, it queries the ARP information, as above. It also queries the CAM tables to see how many switches knew of the host (0 switches knowing about the host implies no network connectivity at very low levels).
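Both checks reduce to simple counting queries against the tracking tables. A sketch with assumed table and column names, using the 24-hour window described below ($ip and $mac would come from the web form, $dbh from a DBI connect):

  # given an IP: how many distinct MACs have claimed it recently?
  my ($macs) = $dbh->selectrow_array(
      q{SELECT COUNT(DISTINCT mac) FROM arp_tracking
        WHERE ip = ? AND last_updated > now() - INTERVAL 1 DAY},
      undef, $ip);
  # given a MAC: how many switches have seen it recently?
  my ($switches) = $dbh->selectrow_array(
      q{SELECT COUNT(DISTINCT device) FROM cam_tracking
        WHERE mac = ? AND last_updated > now() - INTERVAL 1 DAY},
      undef, $mac);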
Both queries also check NetReg for basic registration. We do not allow un-registered attainment of DHCP leases (except in a very few, protected locations.)
In all cases, we only report anything seen in the last 24 hours. Thus, historical access to the NetMon data is not permitted. We also do not show any location-specific information (ie, this machine was seen on switch X). The goal is to protect privacy while still eliminating some load on the networking staff from answering simple questions.
When the authenticated user has no access to NetMon, or no authentication was supplied (ie, Apache didn't request any authentication), netmon.pl will default to providing users the Network Debugger form. You should edit the text in CMU/NetMon/Web.pm (the netdebug_info function) to suit your individual support system.
As users come and go, the machine registrations are not always kept up-to-date. When we ran a completely flat (ie non-subnetted) network, it was easy to just start assigning addresses from another pool to additional machines. However, as subnetting has taken hold and additional routers have been added around the network, we have problems to deal with when subnets fill up. In some cases, doubling the subnet size is reasonable and done. In many cases, though, we started to wonder how many of the registrations were actually in use.
The results were better (or worse) than we expected: some of our subnets had as much as 30% of their space wasted on inactive registrations. By removing these outdated registrations, we regained valuable IP space and did not need to waste additional space by doubling already-large subnets. We have currently run the complete purge operation on a few subnets, and are planning on expanding this to encompass most subnets in the near future.
Purging is done on a per-subnet basis. The purge parameters are fields in the NetReg subnet table. There are four parameters (described in the NetReg design section). They are: purge interval, purge days, grace time for NetReg updates, and grace time for network usage.
The purge is actually run as a NetMon component, but long term we will run the agent on NetReg out of scheduled.pl, and it will remotely access the NetMon database (via the NetMon_Rem_RW_DB variable).
We expect to run the purge no more than once per day. It will identify any subnets that have not been purged in purge_interval days (we keep the purge_lastdone variable to remember the last purge day). On each of these subnets, we get all the registered MAC Addresses from NetReg (on that subnet) that have not been updated in purge_notupd days. We then query NetMon, and build a list of all MAC Addresses that have not been seen on the network (via the CAM tracking table) in at least purge_notseen days. All of these machine registrations are then set to expire in purge_explen days.
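A sketch of the selection step; netreg_stale_macs() and set_expiration() are hypothetical helpers standing in for the NetReg queries, $dbh is a connected DBI handle, and the purge_* values come from the subnet record:

  for my $mac (netreg_stale_macs($subnet, $purge_notupd)) {
      my ($seen) = $dbh->selectrow_array(
          q{SELECT COUNT(*) FROM cam_tracking
            WHERE mac = ? AND last_updated > now() - INTERVAL ? DAY},
          undef, $mac, $purge_notseen);
      set_expiration($mac, $purge_explen) if $seen == 0;   # expire it
  }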
For expiration purposes, the expires field on the machine registration is set to a non-zero date. Users are then mailed about all registrations they own that have just been set to expire (each user should receive no more than one piece of mail per day). On the appointed expiration day, unless the user has "Retain"ed the registration, the delete-expired.pl script deletes the registration.
We may also include a scan of the DHCP lease information in the determination of unused registrations. We specifically excluded matching IP addresses against the ARP tracking table, because we would like users to keep registrations up to date. To this end, we created the "Misregistered Devices" report (see above.)
NetMon supports a very basic user authorization system. It incorporates the concept of "levels" like NetReg, but has only a single per-user level. Thus, a user can be set as having full read, read-write, or administrative access. Administrative access allows the user to control other UserIDs and levels, while read-write allows the user to change the capture settings of a device. As with NetReg, these permissions are enforced by the web interface; thus, having access to the database password (via login access/permissions to the machine, for example) grants the user read-write access to the NetMon database.