Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Obsolete
- Affects Version/s: Current Version
- Fix Version/s: None
- Component/s: Evaluation Engine / Workers
- Labels: None
Description
The worker cannot cancel a running task, because Python/Twisted threads cannot be interrupted or killed from the outside. By itself this is not a big problem, but the limitation is essentially ignored by sioworkersd, which can result in various strange errors.
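As background, here is a minimal standalone sketch (plain Twisted, not the worker's actual code; the job function is made up for illustration) of that limitation: cancelling the Deferred returned by deferToThread only fires its errback, while the thread running the job keeps going until it finishes on its own.

import time
from twisted.internet import reactor, threads

def long_job():
    # Stands in for a long judging task that cannot be aborted mid-way.
    time.sleep(3)
    print("thread finished anyway, despite the cancel")

d = threads.deferToThread(long_job)
d.addErrback(lambda f: print("Deferred cancelled:", f.type.__name__))
reactor.callLater(1, d.cancel)      # "cancels" only the Deferred, not the thread
reactor.callLater(5, reactor.stop)
reactor.run()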
What exactly happens:
* sioworkersd receives a task and schedules it on spr2g1
* (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
* worker receives task and starts executing
* (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
* TASK_TIMEOUT (15 minutes) passes and the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer.
Note: this could just as well be a network failure; the outcome would be pretty much the same.
* (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
* At this point sioworkersd considers the task failed, but finished. Since the task is finished, the worker can no longer be executing it, so the scheduler will schedule more tasks onto it.
In reality this is of course not true: the worker is still executing the first task, and all subsequent tasks sent to it will try to execute concurrently with it. If we are not using oitimetool, they will fail exclusivity checks and this worker becomes basically useless (see the sketch after this list).
* (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
* The worker finally finishes executing the first task. This causes an exception on the sioworkers side because of SIO-1682, which actually improves the situation - sioworkersd will drop the worker and close the connection.
* (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
* Worker reconnects, everything is back to normal.
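To make the race concrete, here is a toy model (hypothetical class and variable names, not sioworkersd's real code) of the state mismatch described above: after the RPC timeout the scheduler treats the worker as idle, so the next exclusive task collides with the one still running.

class ToyWorker:
    def __init__(self):
        self.current_exclusive_task = None

    def run(self, task_id, exclusive=True):
        # Mirrors the worker-side check that rejects concurrent exclusive tasks.
        if exclusive and self.current_exclusive_task is not None:
            raise RuntimeError("exclusivity check failed: still running %s"
                               % self.current_exclusive_task)
        self.current_exclusive_task = task_id

worker = ToyWorker()

worker.run("task-1")        # task starts; in reality it keeps running > 15 min

# TASK_TIMEOUT elapses: the RPC call errbacks, sioworkersd logs
# "Task ... finished." and marks the worker as free again.
scheduler_thinks_worker_is_free = True

# Believing the worker is idle, the scheduler sends it another task,
# which collides with the one that is still executing there.
if scheduler_thinks_worker_is_free:
    worker.run("task-2")    # raises RuntimeError: exclusivity check failed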
Issue Links
- blocks
SIO-1682 Worker RPC implementation doesn't remove call from pending after timeout
https://gerrit.sio2project.mimuw.edu.pl/2466
SIO-1681 (partial): sioworkersd doesn't handle timeouts or network errors properly (work in progress - not tested)
This is not a complete fix - while it should stop all exclusivity errors
from appearing, a runaway task (one executing for a very long time, or simply
frozen) will still block its worker until it finishes (a rough sketch of the
intended behaviour follows below).
Change-Id: Ifce26ab32dc4a06ed348885015a466f8ca5c0843
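For reference, one possible shape of the behaviour that change aims for (hypothetical names, not the actual SIO-1681 code): a timed-out task is reported as failed, but the worker stays reserved until it actually finishes or drops the connection, so no new tasks are scheduled onto it in the meantime.

class WorkerSlot:
    def __init__(self):
        self.busy = False
        self.task_id = None

    def assign(self, task_id):
        self.busy = True
        self.task_id = task_id

    def on_rpc_timeout(self):
        # Report the task as failed, but keep the slot reserved: this avoids
        # exclusivity errors, at the cost of a runaway task blocking the
        # worker until it finishes.
        return {"task": self.task_id, "status": "failed (timeout)"}

    def on_worker_done_or_disconnected(self):
        # Only an actual completion (or losing the connection) frees the slot.
        self.busy = False
        self.task_id = None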