[IPOL discuss] Time out

Nicolas Limare nicolas.limare at cmla.ens-cachan.fr
Mon Feb 7 11:31:02 CET 2011


> We frequently have "time out" on our demos. I suppose that many will
> be fixed with the more powerful server; however, I think that
> users are not properly informed.

No, the new server will only improve the performance of parallel
algorithms. There will be more CPUs, but each CPU won't be faster
than a CPU on the current server, so a program using only 1 CPU won't
perform better on the new server.

And the situation won't improve. The processors in the next
generations of machines won't get faster; there will only be more
of them. This has been true for a few years now, and will probably
remain so for a while.

For the record, only ASIFT does real (OpenMP) parallel processing
for the moment. Some demos use "quick and dirty" parallel
processing by splitting the image, but I'd like to see these hacks
gradually disappear.

> Once a demo is launched, in addition to the small spinning top, there
> should be a message such as: "In case this execution takes more than
> 60 seconds, it will be interrupted"

I just added this sentence.

> And the window that appears in case of time out with "Error 504: Gateway
> Timeout. This part of IPOL Is Temporary unavailable ... " is
> inaccurate since this is generally interrupted by time out and not a
> server error. 

This page should never appear. The shortest answer is: «demos should
properly use the timeout system». The detailed answer starts with a
little explanation about the demo system network and timeout. See the
end of this message for an answer in 3 sentences.

There are 2 layers between the network users and the demo:

users --- www.ipol.im --- demo.ipol.im --- demo 
          proxy           proxy

These proxies are transparent service redirections:
* The proxy on www.ipol.im redirects all requests for
  http://www.ipol.im/pub/demo to the demo.ipol.im server. With this
  redirection, the demos and other pages seem to be on the same
  machine (i.e. on http://www.ipol.im/) even if they are hosted on
  different servers, in different places: one in a datacenter in
  Paris, one in the ENS server room.
* The proxy on demo.ipol.im connects the demo program to the external
  network environment. It handles all the network load (and possible
  abuses) while the demo programs run quietly.

Error 504 is a standard status code[1], sent by a proxy when it did
not receive a timely answer from the backend server. In our case, the
first proxy on www.ipol.im sends this page when it received no answer
from demo.ipol.im. It is only supposed to happen when something has
gone wrong on the demo server (server off, rebooting, network
disturbances, ...), and that's why the message says "temporarily
unavailable".

[1]http://tools.ietf.org/html/rfc2616#section-10.5.5

Of course, we don't want to see this message when the demos are running
but slow. Easy! The "Error 504" appears when the demo server didn't
answer within 90 seconds (the proxy timeout), so we must ensure the
demos reply before that, and this is possible with the current code.

Most of the processing time comes from running the algorithm programs
as subprocesses, so we control the execution time with the optional
timeout parameter of the wait_proc() function. If the process takes
longer than the timeout delay, it is interrupted and a TimeoutError
exception is raised:

    timeout = 60
    p = self.run_proc([foo, bar])
    try:
        self.wait_proc(p, timeout)
    except TimeoutError:
        return self.error(errcode='timeout',
                          errmsg="The program took too much time.")

A better way to do it is (see app/my_affine_sift/app.py):
* use the self.timeout attribute defined for each app class
* catch the timeout error from the self.run() function
* pass the timeout parameter to the self.run_algo() function


    def run(self):
        try:
            self.run_algo(timeout=self.timeout)
        except TimeoutError:
            return self.error(errcode='timeout',
                              errmsg="The program took too much time.")
        except RuntimeError:
            return self.error(errcode='runtime',
                              errmsg="The program ended in an execution error")
        return self.tmpl_out("run.html")

    def run_algo(self, timeout=False):
        p = self.run_proc([foo, bar])
        self.wait_proc(p, timeout)
        return
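The body of wait_proc() is not shown here, so the following is only a
minimal sketch of how such a wait-with-timeout helper could be written
with the standard subprocess module. The name wait_with_timeout() and
the "falsy timeout means wait forever" convention are assumptions made
for illustration, not the actual IPOL code:

```python
import subprocess
import sys


def wait_with_timeout(proc, timeout=None):
    """Wait for proc; kill it and raise TimeoutError after `timeout` seconds.

    Hypothetical stand-in for wait_proc(); the real helper may differ.
    A falsy timeout (None, False, 0) is assumed to mean "wait forever".
    """
    try:
        return proc.wait(timeout=timeout or None)
    except subprocess.TimeoutExpired:
        proc.kill()  # stop the runaway program
        proc.wait()  # reap it so that no zombie process is left behind
        raise TimeoutError("process exceeded %s seconds" % timeout)


# Usage: a child process that sleeps 5 seconds, with a 0.5 second limit.
slow = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
try:
    wait_with_timeout(slow, timeout=0.5)
    outcome = "finished"
except TimeoutError:
    outcome = "timeout"
```

The important detail is that the process is killed before the exception
propagates; otherwise the demo would answer in time but the program
would keep consuming a CPU in the background.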

So, to avoid the infamous "Error 504" page, demos should always
specify a non-null timeout when waiting for a process. Be careful:
if you run 3 processes sequentially, you should divide the timeout
among them so that their sum stays within the global timeout. The demo
editor's expertise is required to know which process is expected to
need more time.

        p1 = self.run_proc([foo, bar])
        self.wait_proc(p1, timeout/2)
        p2 = self.run_proc([foo, bar])
        self.wait_proc(p2, timeout/4)
        p3 = self.run_proc([foo, bar])
        self.wait_proc(p3, timeout/4)
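The same split can be computed from relative weights instead of
hand-written fractions. This is only a sketch; split_timeout() is a
hypothetical helper, not part of the demo framework:

```python
def split_timeout(total, weights):
    """Split a global timeout into per-step timeouts proportional to weights.

    Hypothetical helper: the per-step values always sum to `total`.
    """
    if total <= 0:
        raise ValueError("total timeout must be positive")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and positive")
    scale = float(total) / sum(weights)
    return [w * scale for w in weights]


# Same split as above: half for the first step, a quarter for each other.
budgets = split_timeout(60, [2, 1, 1])  # [30.0, 15.0, 15.0]
```

Each value would then be passed as the timeout of the corresponding
wait_proc() call. Note that with any fixed split, time left unused by
a fast step is lost; the weights should reflect which program is
expected to be the slow one.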

Finally, to avoid this process timeout for most of the images, you
should ensure the processed data size is within a reasonable limit,
either via the input restrictions (max_size attributes) or via a
custom control if you manipulate the input (crop, ...).
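As an illustration of such a custom control, here is a small sketch
that computes a centered crop box no larger than a given limit. The
function name and the 1024x1024 limit are assumptions chosen for the
example; the right limit depends on each algorithm:

```python
def centered_crop_box(width, height, max_width, max_height):
    """Return (left, top, right, bottom) of a centered crop within the limits.

    Hypothetical helper: an image already within the limits is returned
    whole, i.e. the box is (0, 0, width, height).
    """
    new_w = min(width, max_width)
    new_h = min(height, max_height)
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return left, top, left + new_w, top + new_h


# A 4000x3000 input limited to 1024x1024 pixels.
box = centered_crop_box(4000, 3000, 1024, 1024)  # (1488, 988, 2512, 2012)
```

The resulting box can be handed to whatever image library the demo
uses for the actual crop, before the algorithm is launched.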

Another timeout is configurable at the CherryPy level, but I thought
it would be overkill.

And as usual, this is just the way I spontaneously implemented it; it
can be modified, your contributions are welcome.

                                -o-0-o-

In brief:
* there is a hard external timeout of 90 seconds on all the demo pages,
  always active: Error 504
* we avoid the hard timeout with a soft internal timeout on the
  process executions; activated on demand in wait_proc(), the
  suggested timeout configuration is 60 seconds
* we avoid the soft timeout by controlling and reducing the input size

Todolist:
* notify the demo authors by email whenever a timeout is triggered
* gather statistics about these timeouts along with other operational
  information[2]

[2]http://tools.ipol.im/munin/ipol/fuchsia.html#Ipol

-- 
Nicolas LIMARE
http://nicolas.limare.net/                         pgp:0xFA423F4F