[SIO-1681] sioworkersd doesn't handle timeouts or network errors properly - SIO2

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Obsolete
Affects Version/s: Current Version
Fix Version/s: None
Component/s: Evaluation Engine / Workers
Labels: None

Description

The worker cannot cancel a running task, because a Python thread started by Twisted cannot be interrupted from outside. This in itself is not a big problem, but sioworkersd basically ignores the limitation, which can result in various strange errors.
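
The limitation can be reproduced without Twisted. The following is a minimal sketch using only the standard library's concurrent.futures as a stand-in for the worker's thread pool (it is not the sioworkers code, and `long_task` is a hypothetical placeholder): once a call is running in a thread, `Future.cancel()` returns False and the thread runs to completion anyway.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

started = threading.Event()
finished = threading.Event()


def long_task():
    # Hypothetical stand-in for a task executing on the worker.
    started.set()
    time.sleep(0.2)
    finished.set()
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(long_task)
    started.wait()               # make sure the task is really running
    cancelled = future.cancel()  # a call that is already running cannot be cancelled
    result = future.result()     # the thread still runs to completion

print(cancelled, result)
```

Any cancellation therefore has to be cooperative (the task checks a flag) or happen at process level; the caller giving up on the result does not stop the work.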

What exactly happens:
* sioworkersd receives a task and schedules it on spr2g1
* (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
* worker receives task and starts executing
* (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
* TASK_TIMEOUT (15 minutes) passes and the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer.
Note: this could just as well be a network failure; the outcome would be pretty much the same.
* (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
* At this point sioworkersd considers the task failed, but finished. It assumes that a finished task is no longer being executed by the worker, so the scheduler will schedule more tasks on that worker.
This is of course not true: the worker is still executing the first task, and all subsequent tasks sent to it will try to execute concurrently with it. If we are not using oitimetool, they will fail exclusivity checks, so this worker becomes basically useless.
* (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
* The worker finally finishes executing the first task. This causes an exception on the sioworkers side because of SIO-1682, which actually improves the situation: sioworkersd will drop the worker and close the connection.
* (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
* The worker reconnects and everything is back to normal.
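
The sequence above comes down to the scheduler treating "the RPC call timed out" the same as "the worker reported completion". A rough sketch of bookkeeping that keeps the two apart is shown below. All names here (`Scheduler`, `WorkerStatus`, `on_timeout` and so on) are hypothetical and do not come from sioworkersd; treating a reconnect as proof that the old task is gone is also an assumption, based on the last step of the log.

```python
from dataclasses import dataclass, field
from enum import Enum


class WorkerStatus(Enum):
    IDLE = "idle"
    BUSY = "busy"
    UNKNOWN = "unknown"  # RPC timed out; the worker may still be executing


@dataclass
class Scheduler:
    # Hypothetical bookkeeping, not the real sioworkersd manager.
    workers: dict = field(default_factory=dict)  # worker name -> WorkerStatus
    running: dict = field(default_factory=dict)  # worker name -> task id

    def add_worker(self, name):
        self.workers[name] = WorkerStatus.IDLE

    def schedule(self, task_id):
        """Assign the task to an idle worker; return None if there is none."""
        for name, status in self.workers.items():
            if status is WorkerStatus.IDLE:
                self.workers[name] = WorkerStatus.BUSY
                self.running[name] = task_id
                return name
        return None

    def on_result(self, name):
        # The worker itself reported completion, so it is safe to reuse.
        self.running.pop(name, None)
        self.workers[name] = WorkerStatus.IDLE

    def on_timeout(self, name):
        # The task has failed from the caller's point of view, but the worker
        # may still be running it, so the worker is not marked idle.
        task_id = self.running.pop(name, None)
        self.workers[name] = WorkerStatus.UNKNOWN
        return task_id

    def on_reconnect(self, name):
        # Assumption: a fresh connection means the old task is gone.
        self.workers[name] = WorkerStatus.IDLE


sched = Scheduler()
sched.add_worker("spr2g1")
first = sched.schedule("task-1")     # assigned to spr2g1
failed = sched.on_timeout("spr2g1")  # task-1 is reported as failed
second = sched.schedule("task-2")    # no idle worker, nothing is scheduled
sched.on_reconnect("spr2g1")
third = sched.schedule("task-2")     # spr2g1 is usable again
```

With this split, a timeout or network error still fails the task, but the worker receives no new tasks until it either reports back or reconnects.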

Issue Links

blocks

SIO-1682 Worker RPC implementation doesn't remove call from pending after timeout

Activity

Transition: New -> Resolved
Time In Source Status: 1663d 15h 47m
Last Executer: Szymon Acedański
Last Execution Date: 2020-04-27 16:27

People

Assignee: Szymon Acedański
Reporter: Bartosz Stebel
Votes: 0
Watchers: 3

Dates

Created: 2015-10-08 00:39
Updated: 2020-04-27 16:27
Resolved: 2020-04-27 16:27
