Tracking Last Visitors
This article describes how to write a server-side script that tracks a website's recent visitors. It covers the theoretical background as well as some practical challenges in implementing the tool. An implementation is displayed on the left, below the navigation section.
2007-Feb-04; 2007-Jun-21 (integrated a screenshot)

Fig. 1: Sample of displaying last visitors
Scanning through my web statistics, it became apparent that people from a wide variety of nations visit these pages (predominantly the Length of Day article).
I was not really interested in an absolute ranking of nations, though, as in "which nation's inhabitants visit my pages how often". What I did find very interesting was seeing people and organizations around the world requesting my pages at widely varying hours of the day. So I decided to develop a tool that displays the recent visitors chronologically, in real time.
This article describes some considerations and techniques from the development of that tool, which determines the location of web visitors. You can see the statistics on the left-hand side, below the menu entries in the navigation section.
Although the tool is fully functional, at this point in time I do not want to publish the source code, simply because it was a quick shot and by my standards an ugly piece of code [1], not fit for publication yet (and maybe never).
And although the tool is designed to run on a multitude of websites and keep track of their visitors individually (which it in fact does, for example on the pages of my wife), my small web-server would not be able to handle high-volume traffic. For this reason, it will also not be available as a service for everybody.
[1] It is the second version now, running more stable than before. It is not that ugly any longer.
There are many ways to track a visitor's location, but fortunately none of them is accurate enough to track you down to your very spot at your desk (or, for that matter, to that spot at the beach where you are sitting with your notebook right now).
However, such accurate position information was never my intention: what I am interested in is the organization to which the visitor belongs. For example, about a quarter of this site's visitors are members of educational institutions, and I'd like to know to which institute a visitor belongs, not where the individual currently is.
To achieve this, the most suitable approach is to find the owner of the IP address with which the visitor is currently connected. Consequently, the tool may misplace your location by thousands of miles, but since it can be assumed that a student abroad usually connects to the Internet through one of his university's IP addresses, I will still get the information I am interested in.
So, how do I get your IP? All right, let's have a look at the data that is available to every webmaster simply because you look at one of the pages on the Internet.
A massive amount of information is available, down to the browser you are using; heck, in most cases it is even possible to determine your screen's resolution. Most servers log each and every piece of information submitted in a request's HTTP headers (which can often amount to really impressive piles of data, so don't expect them to store the raw data for years). The main reason for doing so is those contemporaries who attempt to break into every network they know of: after all, it is handy to be able to identify a culprit after some damage has been done. It is also from this data that webmasters compile the condensed statistics they publish.
One piece of information available to a webmaster is the visitor's IP address. There are two IP addresses, actually. One is the IP address you are using within your local network (if you are using Windows, open the still useful DOS command prompt, type in ipconfig, and there it is; it also shows you the IP of the router through which you connect to the Internet). This is your internal IP address, and it is of no importance for our purpose. The second is the IP address that your Internet Service Provider (ISP) assigns to you when you log in (you can see that IP in the tool's section labeled "Your ISP"). This is your external IP address, and it is the one we want.
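As a rough illustration of the two kinds of address, here is a minimal Python sketch. It assumes a WSGI-style request environment, which is not what the tool itself runs on; the function names `visitor_ip` and `is_internal` are made up for this example.

```python
import ipaddress


def visitor_ip(environ):
    """Return the address the request came from, as reported by the web server.

    Assumes a WSGI-style `environ` dict; behind a proxy this would be the
    proxy's address, which this sketch does not handle.
    """
    return environ.get("REMOTE_ADDR")


def is_internal(ip):
    """True for addresses from reserved ranges such as 192.168.x.x or 10.x.x.x."""
    return ipaddress.ip_address(ip).is_private
```

The web server only ever sees the external address, so `is_internal` is useful mainly to recognize requests coming from within one's own network.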
Because the quantity of IP addresses is somewhat limited (256^4 under IPv4, amounting to more than 4 billion, which sounds impressive but isn't really), most ISPs assign their users a dynamic IP address, meaning that your IP address with which you connect to the Internet changes every time you log into the net. Even if you happen to use broadband via cable or DSL the address will change regularly (at what intervals depends on actual usage, but a change will occur most likely at least once a day). However, it is possible to "buy" an IP address from your ISP, meaning that your ISP grants you the usage of a fixed IP address which won't change (this is most useful if you operate a web server, for example). Note, however, that such a fixed IP still "belongs" to your ISP; more specifically, to one of possibly several IP Ranges assigned to that particular ISP. This is important, as will be seen below when we talk about the Internet Registries.
So the fixed IP addresses are no problem: they don't change. But even the dynamic IP addresses are not really that much of a problem: unless you use a dial-up modem and connect to the Internet separately for each page you visit on a site (in which case you are most likely assigned a different IP each time), the visitor's IP address can be assumed to remain the same. One category of ISPs assigning dynamic IP addresses is problematic, though, because those ISPs make your IP change with each and every page you look at. Whilst this can be considered efficient resource usage (after all, no ISP has really huge amounts of IP addresses), it makes it impossible to tie a certain IP to a specific user: it could very well be that two users from the same IP range visit a site simultaneously, so it is not trivial to associate two different IPs with one and the same individual. (Well, there are means, via cookies for example, or via options offered by several scripting languages on the web server, e.g. Sessions in ASP, but they all have their drawbacks too: for instance, a user can disable cookies.)
The important point in all of this is: once a visitor requests a page of a site, his IP is unlikely to change for some time, and even if he visits multiple pages on that site, he can still be identified pretty accurately just by his IP. This is increasingly the case as broadband usage grows.
For our purpose, the IP address of a visitor does identify that visitor.
Now that we have an IP address, what are we going to do with it? After all, we want to know the location of the IP "owner". This is where the Internet Registries come into play.
The first information source is the Internet Assigned Numbers Authority (IANA). It operates as a kind of "central registry", roughly hinting at which Regional Internet Registry (RIR) is responsible for a certain IP address. This information is maintained in a very short list.
The list has a few shortcomings. Firstly, it does not tell you the ISP yet; it just tells you who might have the desired information. The whole IPv4 range of more than 4 billion IP addresses is segmented into 256 ranges (0.x.x.x to 255.x.x.x) of about 16.7 million IP addresses each, so it helps to narrow down the real source of information about an ISP, but the remaining number is still pretty high. Secondly, there are entries reading "Various Registries", which does not really help in narrowing down the search. And thirdly, a given entry is not necessarily accurate, i.e. it may simply point to the wrong information source (this happens when an IP range was originally assigned to one regional registry but was later reassigned to another).
Nevertheless, we use this list as the basis for a first lookup at the indicated Regional Internet Registry (if there is one). Should the information be satisfactory, we are in luck and the job is done. Should it not be satisfactory, we have no choice but to query all RIRs.
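The lookup order described here could be sketched as follows in Python. This is only an illustration, not the tool's actual code: the `IANA_HINTS` table is a tiny illustrative excerpt that would have to be filled from the real IANA list, and the names `first_octet` and `query_order` are invented for the example.

```python
import ipaddress

ALL_RIRS = ["AfriNIC", "APNIC", "ARIN", "LACNIC", "RIPE"]

# Illustrative excerpt only: the real IANA list has one entry for each of the
# 256 first-octet ranges and must be consulted for actual assignments.
IANA_HINTS = {24: "ARIN", 41: "AfriNIC", 193: "RIPE", 200: "LACNIC", 202: "APNIC"}


def first_octet(ip):
    """Return the leading number of an IPv4 address (the x in x.0.0.0)."""
    return int(ipaddress.IPv4Address(ip)) >> 24


def query_order(ip):
    """List the RIRs in the order they should be asked: hinted RIR first."""
    hint = IANA_HINTS.get(first_octet(ip))
    if hint is None:
        # No usable hint (e.g. "Various Registries"): ask all of them.
        return list(ALL_RIRS)
    return [hint] + [rir for rir in ALL_RIRS if rir != hint]
```

The numbers from the text fall out of the same arithmetic: `256 ** 4` is 4,294,967,296 addresses in total, and each of the 256 ranges holds `256 ** 3`, that is 16,777,216, of them.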
There are several RIRs. The word "Regional" might imply that they operate in smaller areas, which is not at all the case: they actually cover areas encompassing whole continents. This is well reflected in their names, e.g. ARIN (American Registry for Internet Numbers), RIPE (Réseaux IP Européens), LACNIC (Latin American and Caribbean Internet Addresses Registry) etc.
Now these RIRs subdivide such a range of 16.7 million IP numbers into smaller IP Ranges (not individual IP numbers) and allocate those more or less large blocks to ISPs. Your IP is part of such an IP Range. To get a feeling for what such information looks like, try querying each of the following RIRs with your current IP number (which is displayed on the left-hand side in the tool). Here are the links to the RIRs' web interfaces:
AfriNIC,
APNIC,
ARIN,
LACNIC, and
RIPE.
You might have noticed that the indicated IP Ranges vary greatly. It is even possible that the whole IPv4 space is returned (Range 0.0.0.0 - 255.255.255.255). The goal is to find the narrowest range: the more leading numbers are identical in the "From" and "To" portions of the range, the better. You might also want to compare with the IANA list mentioned above, just to see which RIR it hints at: oftentimes the hint is not really helpful.
Also, you will often encounter a RIR stating that an IP Range belongs to the country "EU" (Europe). Notice that this is not the final answer: there almost certainly exists a RIR with more detailed information (specifying a "real" country), and that country does not need to be in Europe at all, as "EU" in fact translates to "somewhere in the world".
Quite a bit of programming logic is invested in finding that "best RIR" (if the RIR hinted at by IANA did not provide satisfactory information). We can't just rely on "the narrowest IP Range", because it is quite possible that several RIRs report the same narrowest IP Range but only one of them provides useful information.
For the rest of this text, let's assume we have found that best RIR. What now?
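One possible way to express such a selection is sketched below in Python. It is a simplified heuristic under my own assumptions, not the tool's actual logic: the `RirRecord` type, the notion of "useful" (a country other than "EU"), and the tie-breaking rule are all made up for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class RirRecord:
    rir: str                 # which registry answered
    start: str               # "From" address of the reported range
    end: str                 # "To" address of the reported range
    country: Optional[str]   # two-letter code, if the record has one


def range_size(rec):
    """Number of addresses in the reported range."""
    first = int(ipaddress.IPv4Address(rec.start))
    last = int(ipaddress.IPv4Address(rec.end))
    return last - first + 1


def is_useful(rec):
    """Assumed criterion: a real country code, not the catch-all "EU"."""
    return rec.country is not None and rec.country.upper() != "EU"


def best_record(records):
    """Prefer the narrowest range; among equal ranges prefer a useful record."""
    if not records:
        return None
    return min(records, key=lambda r: (range_size(r), 0 if is_useful(r) else 1))
```

With one answer covering the whole IPv4 space and two answers covering the same 256 addresses, one saying "EU" and one naming a real country, `best_record` returns the latter.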
Check the best RIR for your individual case. You will find an entry "Country" and next to it a two-letter country code, which mostly follows the ISO standard (but not always).
This is the only part that is easy to extract, as it is present in every (best) record. For further handling in our tool, it suffices to create a database table that decodes these two-letter codes into the real names.
I did this by hand, using a "common" country name where there are multiple choices of name. As the common name I usually took the title of the corresponding English Wikipedia entry (but made exceptions when I found a name to be politically incorrect or potentially offensive to some).
As a byproduct, this verified that every country has an article in the English Wikipedia, which I exploited to link the output (the country's flag, actually) to the corresponding article (try it).
If you happen to have a US or Canadian ISP, you will notice two more entries labeled "City" and "StateProv". Unsurprisingly, they contain the ISP's location; "StateProv" refers to the US state or the Canadian province, respectively.
For most other countries, these fields are not used. Instead, the town hides somewhere in a multitude of "address" lines. These addresses follow the (local) convention for addressing postal letters, which means that the format varies greatly. An Indian address looks very different from a German address, which has nothing in common with a UK address. Furthermore, the address really is free-style: some ISPs simply use one single line to give all the information, whereas others might even use two lines per "address" entry. The upshot is: it is almost impossible to tell in which line the town information hides (if it is there at all, because some ISPs do not provide that information even though they should).
Does this mean that we can't go any further for those countries? Well, obviously not, as you can see in my tool. But the procedure is somewhat cumbersome.
What I did first was to collect town lists freely available on the Internet. Unfortunately, there does not seem to be a single good source in the public domain, but there is a vast number of smaller lists, often down to individual regions within a country. I collected as many as seemed appropriate to me (several million entries) and imported them into a database table. This is a process I am still working on as I come across more lists in the public domain.
Oftentimes, there are many spellings to consider. In German, for example, the umlaut "ü" can be rendered as "ue", but since English names are often preferred over German names in the RIRs' "address" information, it may also be rendered simply as "u" (e.g. "Zürich" may appear as "Zuerich" or "Zurich"; "Nürnberg" may appear as any of "Nuernberg", "Nurnberg", "Nuremberg"; etc.). Similar issues arise with the usage of apostrophes, hyphens and more.
This made it necessary to have a "search entry" in our table, i.e. the town's name in one of its possible spellings, and an "official entry", which translates the spelling used into the properly spelled location in order to have a somewhat appealing output.
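To make the idea of "search entry" and "official entry" concrete, here is a small Python sketch. It uses an in-memory dictionary instead of a database table, and the helper names `spelling_variants` and `build_search_table` are hypothetical; the real table was assembled from collected lists rather than generated like this.

```python
# Two common replacements for each German umlaut: a digraph and a plain vowel.
UMLAUT_FORMS = {"ä": ("ae", "a"), "ö": ("oe", "o"), "ü": ("ue", "u")}


def spelling_variants(official):
    """Return lower-cased search spellings derived from the official name."""
    base = official.casefold()
    digraph, plain = base, base
    for umlaut, (two_letters, one_letter) in UMLAUT_FORMS.items():
        digraph = digraph.replace(umlaut, two_letters)
        plain = plain.replace(umlaut, one_letter)
    return {base, digraph, plain}


def build_search_table(entries):
    """Map (country code, search spelling) to the official spelling.

    `entries` holds (country, official name, extra spellings) tuples; the
    extra spellings cover names that no rule can derive, such as exonyms.
    """
    table = {}
    for country, official, extras in entries:
        spellings = spelling_variants(official) | {e.casefold() for e in extras}
        for spelling in spellings:
            table[(country, spelling)] = official
    return table
```

For example, `build_search_table([("DE", "Nürnberg", ["Nuremberg"])])` makes "nuernberg", "nurnberg" and "nuremberg" all resolve to "Nürnberg" for the country "DE".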
The process of examining the "address" fields is somewhat tedious and once again requires a great deal of programming logic. What I basically did is to search through the address lines of the appropriate record, beginning with the last line (as chances are that the town is located closer to the bottom than to the top). Within each line, I start with a sequence of as many words as that particular line contains, i.e. the whole line, as a candidate for a town's name. This candidate is looked up in the table within the given country, and if there is a match, it is considered the best match so far. When there is no match, the number of words is reduced by one and the table is searched again for every possible sequence of that many consecutive words, and so on, until single words have been searched. When all lines are done, we use the matching sequence consisting of the most words. This way, we prioritize "New York" over "York". It is not fail-safe, but it seems to work in more than 99 out of 100 cases.
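The search just described can be sketched in Python roughly as follows. This is a reconstruction from the description, not the tool's code; the function name `find_town`, the dictionary keyed by (country, lower-cased spelling), and the simplistic handling of commas are assumptions made for the example.

```python
def find_town(address_lines, country, town_table):
    """Return the official name of the longest town match, or None.

    `town_table` maps (country code, lower-cased spelling) to the official
    spelling. Lines are scanned from the last to the first.
    """
    best = None  # (number of words, official name)
    for line in reversed(address_lines):
        words = line.replace(",", " ").split()
        match = None
        # Start with the whole line, then try ever shorter word sequences.
        for size in range(len(words), 0, -1):
            for start in range(len(words) - size + 1):
                candidate = " ".join(words[start:start + size]).casefold()
                official = town_table.get((country, candidate))
                if official is not None:
                    match = (size, official)
                    break
            if match is not None:
                break
        # Strict comparison: on a tie, the line closer to the bottom wins.
        if match is not None and (best is None or match[0] > best[0]):
            best = match
    return best[1] if best is not None else None
```

Given the lines "123 York Road" and "New York, NY 10001" and a table that knows both "york" and "new york" for the country "US", the function returns "New York", because the two-word match outranks the one-word match.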
For convenience, the town that was found (and, for the US or Canada, also the state or province, respectively) is eventually linked to the English Wikipedia again. If you are like me and encounter a place "I never heard of", then you will appreciate it; otherwise, providing that useless link at least does not consume too much bandwidth.
Once the cumbersome process of finding the best RIR and extracting the town for an ISP is done, it would be a tremendous waste to go through that process again each time a visitor from the same ISP returns (or the same visitor simply requests another page): a waste of CPU time and of our own bandwidth, and a source of unnecessary traffic to the RIRs.
Therefore, the once gathered ISP information is stored in an ISP table, with the IP Range's start and end addresses: this enables us to quickly do a lookup for a particular IP, and all the information can be re-used.
However, since IP Ranges can be re-allocated, the information is kept on the server for only 30 days. After that, the next visitor belonging to such an ISP causes a new lookup.
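A minimal Python sketch of such a cache is shown below. It keeps the rows in memory rather than in a database table, and the class name `IspCache`, its methods, and the explicit `now` parameter (used so the expiry can be demonstrated) are assumptions made for this example.

```python
import ipaddress
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)  # retention period named in the text


class IspCache:
    """Stores ISP information per IP Range and forgets it after 30 days."""

    def __init__(self):
        self._rows = []  # (range start, range end, info, time stored)

    def store(self, start, end, info, now):
        first = int(ipaddress.IPv4Address(start))
        last = int(ipaddress.IPv4Address(end))
        self._rows.append((first, last, info, now))

    def lookup(self, ip, now):
        """Return cached info for `ip`, or None if a fresh lookup is needed."""
        # Drop everything older than the retention period.
        self._rows = [row for row in self._rows if now - row[3] <= MAX_AGE]
        value = int(ipaddress.IPv4Address(ip))
        hits = [row for row in self._rows if row[0] <= value <= row[1]]
        if not hits:
            return None
        # Should ranges overlap, prefer the narrowest one.
        return min(hits, key=lambda row: row[1] - row[0])[2]
```

Storing the start and end addresses as integers is what makes the range check a simple comparison; in a database table the same comparison would go into the WHERE clause.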
Now all that's left is to keep track of the visitor's IP. This is done simply by storing the IP - ISP relation in a table. When a visitor requests the first page of his visit, and then each time he requests another page, the time is updated. This means that every IP has just one record in that table, namely with the time of its last visit.
Accessing this table in descending order of time (newest first) enables us to output the 10 most recent visits (or any number we choose to display, actually). Well, that's all there is to it. The outcome can be seen on these pages. If you got inspired to roll your own on your server, let me know :)
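If you do want to roll your own, the visitor table could look something like the following Python sketch using SQLite. The table and column names and the three helper functions are assumptions for illustration; the tool described here uses its own database and schema.

```python
import sqlite3


def open_visits(path=":memory:"):
    """Open the database and make sure the visits table exists."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS visits ("
        "ip TEXT PRIMARY KEY, isp TEXT NOT NULL, last_seen REAL NOT NULL)"
    )
    return db


def record_visit(db, ip, isp, when):
    """Insert the IP, or overwrite its single row with the newer time."""
    db.execute(
        "INSERT OR REPLACE INTO visits (ip, isp, last_seen) VALUES (?, ?, ?)",
        (ip, isp, when),
    )
    db.commit()


def recent_visits(db, limit=10):
    """Return the most recent visits, newest first."""
    cursor = db.execute(
        "SELECT ip, isp, last_seen FROM visits ORDER BY last_seen DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Making the IP the primary key is what guarantees one record per IP: a repeat visit replaces the existing row instead of adding a second one.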