Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Obsolete
- Affects Version/s: Current Version
- Fix Version/s: None
- Component/s: Evaluation Engine / Workers
- Labels: None
Description
The worker cannot cancel a running task, because the task runs in a plain Python/Twisted thread and such a thread cannot be interrupted from the outside. This by itself is not a big problem, but the limitation is basically ignored by sioworkersd, which can result in various strange errors.
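To see why, here is a minimal, self-contained sketch (illustrative only, not the actual sioworkers code) of the Twisted pattern involved: cancelling a Deferred obtained from deferToThread merely fires its errback with CancelledError, while the thread doing the work keeps running to completion and its late result is silently discarded.

import time

from twisted.internet import reactor
from twisted.internet.threads import deferToThread

def long_running_task():
    # Stands in for an exclusive judging run: plain Python code in a
    # thread cannot be interrupted from the outside.
    time.sleep(3)
    print('thread: task ran to completion despite the cancel')
    return 'done'

d = deferToThread(long_running_task)
d.addBoth(lambda result: print('deferred fired with:', result))

# cancel() only errbacks the Deferred with CancelledError; the thread
# is unaffected and its eventual result is silently dropped.
reactor.callLater(1, d.cancel)
reactor.callLater(5, reactor.stop)
reactor.run()

Once a judging run has been handed to a thread, there is no safe way to abort it; all the worker can do is wait for it to finish.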
What exactly happens:
* sioworkersd receives a task and schedules it on spr2g1
* (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
* worker receives task and starts executing
* (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
* TASK_TIMEOUT (15 minutes) passes and the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer (see the timeout sketch after this list).
Note: this could just as well be a network failure; the outcome would be pretty much the same.
* (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
* At this point sioworkersd considers the task failed, but finished. Since the task is "finished", the scheduler assumes the worker is no longer executing it and will schedule more tasks on it.
In reality the worker is still executing the first task, so all subsequent tasks sent to it try to execute concurrently with it. Unless we are using oitimetool, they fail the exclusivity check (see the second sketch after this list) and the worker becomes basically useless.
* (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
* The worker finally finishes executing the first task. This causes an exception on the sioworkersd side because of SIO-1682, which actually improves the situation: sioworkersd drops the worker and closes the connection.
* (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
* Worker reconnects, everything is back to normal.
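The scheduler-side half of the problem can be sketched the same way (hypothetical names and a shortened timeout; the real call lives around sioworkersd/manager.py:118): Deferred.addTimeout only errbacks the local Deferred when the timeout expires. Nothing is (or can be) sent to the worker to stop it, so treating the task as finished at that point leaves the scheduler believing the worker is idle while it is still busy.

from twisted.internet import defer, reactor

TASK_TIMEOUT = 5  # stands in for the real 15 minutes

def fake_remote_run():
    # Simulates the 'run' RPC: the worker-side execution (8s here)
    # outlives the scheduler's timeout (5s).
    d = defer.Deferred()
    reactor.callLater(8, d.callback, 'done')
    return d

d = fake_remote_run()
d.addTimeout(TASK_TIMEOUT, reactor)

def on_timeout(failure):
    failure.trap(defer.TimeoutError)
    # This is the buggy moment described above: the task is recorded
    # as finished and the worker as free, yet the worker is still
    # executing -- the next exclusive task sent there will collide.
    print('RPC timed out; worker wrongly assumed idle')

d.addCallbacks(lambda result: print('finished:', result), on_timeout)

reactor.callLater(10, reactor.stop)
reactor.run()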
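For completeness, a toy model (hypothetical, not the actual worker code) of the exclusivity bookkeeping that the follow-up tasks trip over: as long as the stale exclusive task is still recorded as running, every new task the scheduler sends is rejected.

class ExclusivityError(Exception):
    pass

class WorkerState:
    """Toy model of a worker's running-task bookkeeping."""

    def __init__(self):
        self.running = {}  # task id -> exclusive flag

    def start_task(self, task_id, exclusive):
        # An exclusive task must have the machine to itself, and
        # nothing may start while an exclusive task is running.
        if self.running and (exclusive or any(self.running.values())):
            raise ExclusivityError('%s rejected, %d task(s) running'
                                   % (task_id, len(self.running)))
        self.running[task_id] = exclusive

    def finish_task(self, task_id):
        del self.running[task_id]

w = WorkerState()
w.start_task('stale-task', exclusive=True)   # never reported back
try:
    w.start_task('next-task', exclusive=True)
except ExclusivityError as e:
    print('scheduler sees a failure:', e)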
Issue Links
- blocks SIO-1682: Worker RPC implementation doesn't remove call from pending after timeout