Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Obsolete
- Affects Version/s: Current Version
- Fix Version/s: None
- Component/s: Evaluation Engine / Workers
- Labels: None
Description
The worker cannot cancel a running task, because Python/Twisted threads cannot be interrupted or killed from the outside. By itself this is not a big problem, but the limitation is essentially ignored by sioworkersd, which can result in various strange errors.
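As background, here is a minimal standalone sketch (plain Twisted, not the worker's actual code; the job function is made up for illustration) of that limitation: cancelling the Deferred returned by deferToThread only fires its errback, while the thread running the job keeps going until it finishes on its own.

import time
from twisted.internet import reactor, threads

def long_job():
    # Stands in for a long judging task that cannot be aborted mid-way.
    time.sleep(3)
    print("thread finished anyway, despite the cancel")

d = threads.deferToThread(long_job)
d.addErrback(lambda f: print("Deferred cancelled:", f.type.__name__))
reactor.callLater(1, d.cancel)      # "cancels" only the Deferred, not the thread
reactor.callLater(5, reactor.stop)
reactor.run()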
What exactly happens:
* sioworkersd receives a task and schedules it on spr2g1
* (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
* worker receives task and starts executing
* (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
* TASK_TIMEOUT (15 minutes) passes and the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer.
Note: this could just as well be a network failure; the outcome would be pretty much the same.
* (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
* At this point sioworkersd considers the task failed, but finished. Since the task is finished, the worker can no longer be executing it, so the scheduler will schedule more tasks onto it.
In reality this is of course not true: the worker is still executing the first task, and all subsequent tasks sent to it will try to execute concurrently with it. If we are not using oitimetool, they will fail exclusivity checks and this worker becomes basically useless (see the sketch after this list).
* (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
* The worker finally finishes executing the first task. This causes an exception on the sioworkers side because of SIO-1682, which actually improves the situation - sioworkersd will drop the worker and close the connection.
* (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
* Worker reconnects, everything is back to normal.
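To make the race concrete, here is a toy model (hypothetical class and variable names, not sioworkersd's real code) of the state mismatch described above: after the RPC timeout the scheduler treats the worker as idle, so the next exclusive task collides with the one still running.

class ToyWorker:
    def __init__(self):
        self.current_exclusive_task = None

    def run(self, task_id, exclusive=True):
        # Mirrors the worker-side check that rejects concurrent exclusive tasks.
        if exclusive and self.current_exclusive_task is not None:
            raise RuntimeError("exclusivity check failed: still running %s"
                               % self.current_exclusive_task)
        self.current_exclusive_task = task_id

worker = ToyWorker()

worker.run("task-1")        # task starts; in reality it keeps running > 15 min

# TASK_TIMEOUT elapses: the RPC call errbacks, sioworkersd logs
# "Task ... finished." and marks the worker as free again.
scheduler_thinks_worker_is_free = True

# Believing the worker is idle, the scheduler sends it another task,
# which collides with the one that is still executing there.
if scheduler_thinks_worker_is_free:
    worker.run("task-2")    # raises RuntimeError: exclusivity check failed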
Issue Links
- blocks
SIO-1682 Worker RPC implementation doesn't remove call from pending after timeout
https://gerrit.sio2project.mimuw.edu.pl/2466
SIO-1681 (partial): sioworkersd doesn't handle timeouts or network errors properly (work in progress - not tested)
This is not a complete fix - while it should stop all exclusivity errors
from appearing, a runaway task (one executing for a very long time, or simply
frozen) will still block its worker until it finishes (a rough sketch of the
intended behaviour follows below).
Change-Id: Ifce26ab32dc4a06ed348885015a466f8ca5c0843
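For reference, one possible shape of the behaviour that change aims for (hypothetical names, not the actual SIO-1681 code): a timed-out task is reported as failed, but the worker stays reserved until it actually finishes or drops the connection, so no new tasks are scheduled onto it in the meantime.

class WorkerSlot:
    def __init__(self):
        self.busy = False
        self.task_id = None

    def assign(self, task_id):
        self.busy = True
        self.task_id = task_id

    def on_rpc_timeout(self):
        # Report the task as failed, but keep the slot reserved: this avoids
        # exclusivity errors, at the cost of a runaway task blocking the
        # worker until it finishes.
        return {"task": self.task_id, "status": "failed (timeout)"}

    def on_worker_done_or_disconnected(self):
        # Only an actual completion (or losing the connection) frees the slot.
        self.busy = False
        self.task_id = None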