On robots, URL design, and bad optimization

<< 2008-06-25 15:14 >>

Helsinki cathedral entrance

Over the last few weeks my photo collection application has been struggling seriously with its performance. The Tomcat server would sometimes crash, which isn't so serious, as my monitoring script would restart it at most 30 minutes later. What's worse is that often it would get stuck and also make Apache freeze, and this would kill the entire site (including this blog), and the monitor script doesn't detect that. Or, load on the server would soar into the double digits, and just stay there, basically making the server unusable until I did a manual restart.

The robots

So, of course I started digging into the problem to see what might be causing this. And pretty quickly I found that Yahoo's Slurp robot was hammering tmphoto with more requests than it was able to handle. At the same time msnbot and Googlebot were also pretty active, though not as bad as Slurp.

My stats made it abundantly clear that most of the traffic in the application was from robots. A quick check of the logs shows that out of 2.3 million requests, 93.8% were from robots. Of the total, 23% were from Slurp. However, Slurp tends to bunch its requests together, so at times of peak Slurp traffic a much higher proportion of the traffic would be Slurp.

I figured the easiest way to solve this would be to make the robots go a little easier on the site, and added "crawl-delay" statements for Slurp and msnbot to my robots.txt file, telling the robots to wait 45 seconds between requests. I could see both robots picking up the new robots.txt, and while they might have slowed down a bit over the days that followed, they didn't really slow down much.
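For reference, the relevant part of such a robots.txt looks roughly like this (45 being the delay mentioned above; this is a reconstruction, not the exact file):

```
User-agent: Slurp
Crawl-delay: 45

User-agent: msnbot
Crawl-delay: 45
```

Note that Crawl-delay is a non-standard extension to the robots.txt format, and as the rest of this entry shows, robots are free to interpret it loosely or ignore it.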

So I wrote to the Yahoo team complaining about their robot, and asking them to do something. Contrary to expectation, they wrote back the same day, asking me to set the "crawl-delay" to slow down the robot. Wonderful. So no help there. From what I read on the web other people are finding much the same.

A closer look at my access log also revealed gems like this one:

  - - [22/Jun/2008:04:41:38 -0400] "GET /tmphoto/photo.jsp?id=t61298 HTTP/1.0" 200 4314 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;"
  - - [22/Jun/2008:04:42:34 -0400] "GET /tmphoto/photo.jsp?id=t88323 HTTP/1.0" 200 4324 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0;"
  - - [22/Jun/2008:04:41:40 -0400] "GET /tmphoto/photo.jsp?id=t61298 HTTP/1.0" 200 4314 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;"
  - - [22/Jun/2008:04:41:36 -0400] "GET /tmphoto/photo.jsp?id=t61298 HTTP/1.0" 200 4314 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;"

What you're seeing here is four consecutive lines lifted straight out of the log. Ignore the second line, as that's not what we're interested in. The other three show Apache receiving three requests for the same page from the same IP within a span of four seconds. I think what's happening is that when Slurp doesn't get a response quickly enough, it tries again.

This is not exactly sympathetic behaviour. Imagine, for a moment, what might make it take a while to get a response from a server. Well, it might be load on the server, right? So what happens when you ask again? Well, you increase the load further, don't you? So basically, when my server first started struggling, Slurp would hammer it even more, thus making it even more overloaded. Reasonable behaviour would be to make a note that this URL is not responding quickly right now, and put it back in the queue to be tried again later, rather than trying again right away. This is much more efficient for the robot (it's likely to be slow on the next try as well, so the robot is likely to get through more pages if it changes to a different URL), and much friendlier to other people's web sites.
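The re-queueing behaviour suggested above can be sketched in a few lines. This is a hypothetical illustration (the class and method names are mine, and real crawlers track per-host politeness state as well), but the core idea is just: on timeout, push the URL to the back of the queue and move on.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Sketch of a crawl queue that does NOT retry a slow URL immediately.
// Instead it re-queues the URL and moves on to a different one, which is
// both friendlier to the struggling server and more productive for the robot.
public class PoliteCrawlQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    public void add(String url) {
        queue.addLast(url);
    }

    // fetcher returns true on success, false on timeout/slow response
    public void crawlOnce(Predicate<String> fetcher) {
        int remaining = queue.size();
        while (remaining-- > 0) {
            String url = queue.removeFirst();
            if (!fetcher.test(url)) {
                queue.addLast(url); // try again later, not right away
            }
        }
    }

    public int pending() {
        return queue.size();
    }
}
```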

URL design

Passing boat, Løvøysund, Norway

There was one thing about this application that had been worrying me for a long time, which was the way I'd implemented filtering and context-tracking. Initially, people had found it confusing that if, for example, you went to the list of beer photos and clicked on a photo, the next/previous buttons on the photo page would not necessarily take you to a beer photo. Instead, they would just go to the next/previous photo chronologically, regardless of whether that was also a beer photo.

This was because I'd done a naive implementation where every page was completely stateless, and so the next/previous buttons would not know where you came from, and would always behave the same. To get around this, I made some pages pass in an extra request parameter saying where you'd come from, if you came from a person, a place, or a category. This way the previous/next buttons would move within the list you'd come from.

Similarly, in lists like the list of beer photos you can use the filters on the right-hand side to select only beer photos from a specific place, event, or with a specific person in them, etc. The ID of the filtering topic is also passed as a request parameter back to the same page, telling it which topic to filter by.

This all worked fine, but unfortunately it increased the number of URLs in my web application quite a lot. Without this feature there is basically one URL per parameterless page (such as the start page, lists of places/events/etc) and one URL per topic (that is, one per event, person, etc). With 82 categories, 157 events, 324 places, 195 persons, and 8521 photos this comes to 9279 URLs plus the static pages. Not so bad.

But the list stepping parameters and the filtering parameters change this, because now there is one URL per combination. Photos can be filtered by categories, events, places, and persons, so we get 8521 * (82 + 157 + 324 + 195 + 1) = 6,467,439 URLs there. Lists of photos by categories, events, places, and persons can be filtered by the same topics, so we get roughly (82 + 157 + 324 + 195)² = 574,564 URLs there. That gives us a rough estimate of 7 million URLs. And while normal people will never try out more than a tiny fraction of those combinations, there's nothing to tell the robots not to do it, and so of course they'll just keep crawling and crawling the site endlessly.
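The arithmetic above is easy to check directly (the class name is mine, for illustration only):

```java
// Rough count of web-visible URLs once filtering parameters are added.
public class UrlCount {
    public static void main(String[] args) {
        int categories = 82, events = 157, places = 324, persons = 195;
        int photos = 8521;
        int topics = categories + events + places + persons; // 758 filter topics

        // Each photo page combined with each filter topic, plus the unfiltered case.
        long photoUrls = (long) photos * (topics + 1);
        // Each topic's list page filtered by each topic.
        long listUrls = (long) topics * topics;

        System.out.println(photoUrls);            // 6467439
        System.out.println(listUrls);             // 574564
        System.out.println(photoUrls + listUrls); // roughly 7 million
    }
}
```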

So I sat down to do a quick rewrite and within an hour or two I had changed the code to store the filters in the user's server-side session and not pass them as request parameters. This also had the nice side-effect of simplifying the code quite a lot, as I no longer had to remember to keep propagating these request parameters in every link. (There were actually a few bugs here, which are now gone.) The downside is that most likely people will find the filters on in places where they don't expect it (because of caching, or because they don't pass through a page which removes the filters from their session), but as few people use this anyway, and it's just one click to turn them off, that shouldn't be much of an issue.
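The essence of the change can be sketched like this. The session is modelled here as a plain map standing in for an HttpSession attribute store, and all names are hypothetical; the point is simply that the filter lives on the server, so links no longer need to propagate it.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of storing the active filter in the user's server-side session
// instead of passing it as a request parameter in every link.
public class FilterSession {
    private final Map<String, String> session = new HashMap<>();

    // Called when a list page receives a filter selection.
    public void setFilter(String topicId) {
        session.put("filter", topicId);
    }

    // Called from pages that clear filters (it's one click to turn them off).
    public void clearFilter() {
        session.remove("filter");
    }

    // Pages consult the session when rendering next/previous links,
    // so the links themselves stay parameterless.
    public String currentFilter() {
        return session.get("filter");
    }
}
```

The simplification mentioned above falls out of this: since no link carries a filter parameter, there is no parameter to forget to propagate, which is exactly where the old bugs lived.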

Of course, I also had to add code to handle the old URLs, which I did by adding Java code at the very top of the pages to check for filtering request parameters. If any are found, the code updates the session, then sends a permanent redirect (HTTP 301) to the URL without filtering. Thus, old URLs will continue to work, and the search engines are told not to keep these URLs in their indexes, but instead to replace them with the filterless URL.
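The core of that check can be written as a plain function, separated here from the JSP plumbing so the logic is visible. All names are illustrative; in the actual page this would be followed by setting the 301 status and Location header on the response.

```java
import java.util.Map;

// Sketch of the backwards-compatibility check at the top of each page:
// if the request still carries a filter parameter (an old-style URL),
// remember the filter in the session and redirect to the clean URL.
public class LegacyUrlHandler {
    // Returns the redirect target, or null if the URL is already filterless.
    public static String redirectTarget(String path, String id, String filter,
                                        Map<String, String> session) {
        if (filter == null) {
            return null; // no filter parameter: render the page normally
        }
        session.put("filter", filter); // preserve the old URL's behaviour
        return id != null ? path + "?id=" + id : path;
    }
}
```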

The redirect code is the first that executes on the page, so for these requests the server hardly does any work, and barely produces any output at all, thus dramatically reducing the server load from these requests.

The Four Courts, Dublin, Ireland

So, I deployed, and waited to see the results. After some hours it became clear that while the site would no longer go into a death-spiral immediately after being started, it was still struggling, and would still wind up barely crawling along once every two hours or so, and then never come back to health without a restart. This was disappointing, to put it mildly.

It was at this point that I started doing what I should have done immediately: some more precise analysis to see what the impact of the filtering URLs actually was. Some quick log analysis revealed that, actually, pages with filtering URLs were only about 55% of the traffic. So I'd reduced the load by about half, but no more, and apparently this was not enough.

From this I learned two things:

  1. Consider the design of your URLs with search engines in mind, and avoid designs that cause an explosion of possible web-visible URLs.
  2. Don't make optimizations without estimating the effect of your optimization first! You often find that the optimization has much less effect than you thought, making it a waste of effort (and, often, a needless source of bugs).

In this case, the optimization had other benefits, so I won't be going back to the old design, but my site was still near-unusable. So then what? Well, this blog entry is now more than long enough as it is, and the solution I eventually found takes us into very different territory, so I'll save that story for the next entry.

Update: next entry now published.



Paul - 2008-06-25 10:03:20

Optimization! Screen resolution! Have you tried to read your content at 800 or 1024?

Regards, Paul

joe - 2008-06-25 11:26:48

Bad advice, HTTP _is_ stateless.

What happens when someone bookmarks your page, and doesn't have javascript?

When filtering a page based on query arguments which can lead to huge ranges of urls, the best approach is just to hide those links from robots, not use a session based approach.

One way you can stop a crawler from following links without javascript is to use a form.

Each filter link is a form button that GETs the current page, but adds its own query string to it.

wd - 2008-06-25 11:39:46

What sized screen was that layout designed for? It still didn't fit inside Safari after I maximised it at 1680!

Eric - 2008-06-25 11:42:14

Liquid design. Heard of it?

Lars Marius - 2008-06-25 11:46:12

I'm no web designer, and I guess this shows why. :-) It turns out that while Firefox and Opera treat a "pre" element with no set width and style="overflow: scroll" sensibly (meaning, use the container width), Safari does not, and instead uses some seemingly randomly chosen width for the "pre".

Solved it now by adding a "max-width: 600". Not as good for people with better browsers, but keeps the content readable in Safari.

FoRo - 2008-06-25 13:07:09

As already noted, the layout of this site is hosed. I think you've got a little mote in your eye there...

russ - 2008-06-25 13:51:39

You might also consider using nofollow on your forward and back links. Make sure there's a way for the robots to get to your pages, but lessen the number of links they ( should* ) follow between the photos. This'll also have the effect of "concentrating" your search engine results on specific locations, like a category page or a latest page.

I'd also include a "link rel=self" permanent link (like blogs use) with the filtering urls, exactly because of joe's advice above; someone might bookmark a page, but because the filters are stored in session variables, their filtering is lost.

* yeah, nofollow's not perfect. Some evidence suggests some search engines ignore it :(

Jonas - 2008-06-26 03:42:27

With this design you break other behaviour. People will get completely different links on a page when they return from a saved session or bookmark. In the choice between two bad things, I would say stick to web principles and in this case statelessness. If the robots (and/or spammers) give you a problem that is what you should solve, starting with link attributes and ending with ip filters.

Lars Marius - 2008-06-26 04:57:36

Joe: Someone who bookmarks the page with filters set will most likely come back without them set. This is a weakness, but it's not clear that it's important in this particular case.

Joe: JavaScript is only needed for the dynamic HTML used to open the filter boxes on the list pages, nothing else. If you don't have JavaScript you can still use the site fine.

russ: I tried nofollow months ago, and it doesn't seem to have any effect at all. Others have suggested using JavaScript for the links, and that would probably be better.

Jonas: The bookmarking issue I replied to above. As for blocking robots: that would drop me from search engines completely. Not much fun.

Paul Houle - 2008-06-26 08:24:07

Or... You could just write your app in PHP and never worry about kicking Tomcat again.

Markus Ueberall - 2008-07-01 09:26:53

As long as the bots identify themselves properly (cf. user agent info), isn't it possible to send them modified/static/compressed pages w/o images (just the metadata), but with increased validity period where applicable?

Lars Marius - 2008-07-01 11:02:21

Markus: the HTML pages are already without images. They only contain the <img> referring to the images themselves, and the images are served from a different server. So it's not clear what I could do to simplify the view given to robots.

trond - 2008-07-01 13:19:15

I would generally *not* recommend doing so, but did you consider creating/outputting the links with javascript? That way, the tmphoto-links would be available to js-enabled browsers/users, while remaining "invisible" to bots (Note: I *assume* the bots don't make use of a js interpreter.)

I almost hate myself for suggesting such a thing, but I would actually prefer that over sessions. After all, you're not in a position where you have to make sure the site is WAI compliant and/or renders gracefully without js support ;)

And, out of curiosity: what are the server specs? Perhaps upgrading the hardware could help?

Lars Marius - 2008-07-01 13:26:31

Trond: Using JavaScript would be a possible way to hide the extra URL parameters from robots. In some ways it might be better than sessions, actually.

I don't know what the server specs are, but I have actually solved the problem. The next instalment will tell you how.

Robert Cerny - 2008-07-04 03:18:12

I think a server-side HTTP cache might have helped you a great deal, until you decided to make responses depend on some context which is unknown to the client and the caches. Since the context is unknown to the cache, it would have returned nonsense. I consider the breaking of HTTP caching one of the great drawbacks of introducing a session, to be more exact a context only known to the server. There is nothing wrong with an empty session :) Besides that it gets more difficult to test: you can't just fire a URL and expect a certain response, but you rather have to create a certain context by a sequence of requests, which is simply more cumbersome. There are many more drawbacks.

When i thought about your application, i had some difficulties in finding an alternate (more RESTful) design, mostly your filtering issue provided obstacles. It all got clearer, when i freed myself from the thought that you are simply providing photos, as the name of the jsp suggests. You are not :) By introducing the navigational features, you are providing a gallery which a user can walk through. By introducing filters you are providing a great number of galleries.

sitara - 2008-09-04 20:47:40

i hate robots don't even talk about it ooooooooooooooooo:( :0
