The SIO2 project / SIO-1681

sioworkersd doesn't handle timeouts or network errors properly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Obsolete
    • Affects Version/s: Current Version
    • Fix Version/s: None
    • Labels:
      None

      Description

      The worker cannot cancel a running task, because Python/Twisted threads cannot be interrupted once they have started. This in itself is not a big problem, but sioworkersd basically ignores the limitation, which can result in various strange errors.
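
The limitation can be reproduced with the standard library alone. The sketch below uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for the worker's Twisted thread pool (an assumption, since the worker code is not shown here), and `slow_task` is a hypothetical placeholder for a real job. Once a function is running in a thread, `cancel()` has no effect, and a timeout on the waiting side only stops the wait, not the work.

```python
import concurrent.futures
import threading
import time


def demo():
    """Show that a running thread task can be neither cancelled nor timed out."""
    started = threading.Event()
    finished = threading.Event()

    def slow_task():
        # Hypothetical stand-in for a long-running worker job.
        started.set()
        time.sleep(0.5)
        finished.set()
        return "done"

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_task)
        started.wait()  # make sure the task is really executing

        # cancel() only works for tasks that have not started yet.
        cancelled = future.cancel()

        # A timeout here abandons the wait; the thread keeps running.
        try:
            future.result(timeout=0.1)
            timed_out = False
        except concurrent.futures.TimeoutError:
            timed_out = True

        still_running = not finished.is_set()
        result = future.result()  # the task completes regardless

    return cancelled, timed_out, still_running, result


if __name__ == "__main__":
    print(demo())
```

The caller that timed out has no way to tell the thread to stop, which is the situation sioworkersd has to account for.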

      What exactly happens:
      * sioworkersd receives a task and schedules it on spr2g1
      * (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
      * worker receives task and starts executing
      * (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
      * TASK_TIMEOUT (15 minutes) passes and the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer.
          Note: this could just as well be a network failure; the outcome would be pretty much the same.
      * (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
      * At this point sioworkersd considers the task failed, but finished. It assumes that a finished task is no longer being executed by the worker, so the scheduler will schedule more tasks to that worker.
      This is of course not true in reality: the worker is still executing the first task, and all subsequent tasks sent to it will try to execute concurrently with it. If we are not using oitimetool, they will fail exclusivity checks, so this worker becomes basically useless.
      * (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
      * The worker finally finishes executing the first task. This causes an exception on the sioworkers side because of SIO-1682, which actually improves the situation: sioworkersd will drop the worker and close the connection.
      * (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
      * The worker reconnects and everything is back to normal.
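
The sequence above boils down to a bookkeeping mismatch: the scheduler frees a worker when the remote call returns or times out, while the worker is only free when the task actually ends. The following is a toy model of that mismatch, not sioworkersd code; every name in it (`Worker`, `NaiveScheduler`, `CarefulScheduler`, `on_rpc_timeout`, `on_worker_done`, `ExclusivityError`) is hypothetical. The careful variant shows one way to avoid handing out a busy worker, by keeping it reserved until the worker itself reports completion; whether the actual change works this way is not stated here.

```python
class ExclusivityError(Exception):
    """Raised when a task lands on a worker that is still executing another."""


class Worker:
    def __init__(self, name):
        self.name = name
        self.executing = set()  # tasks really running on the worker


class NaiveScheduler:
    """Treats an RPC timeout as 'task finished' and frees the worker."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}
        self.busy = {}  # worker name -> task id, as seen by the scheduler

    def schedule(self, task_id):
        for name, worker in self.workers.items():
            if name in self.busy:
                continue
            # Worker-side exclusivity check.
            if worker.executing:
                raise ExclusivityError(
                    "%s is still running %s" % (name, sorted(worker.executing)))
            worker.executing.add(task_id)
            self.busy[name] = task_id
            return name
        return None  # no free worker

    def on_rpc_timeout(self, name):
        # The scheduler forgets the task; the worker keeps executing it.
        self.busy.pop(name, None)

    def on_worker_done(self, name, task_id):
        self.workers[name].executing.discard(task_id)
        if self.busy.get(name) == task_id:
            del self.busy[name]


class CarefulScheduler(NaiveScheduler):
    """Reports the timeout as a failure but keeps the worker reserved."""

    def __init__(self, workers):
        super().__init__(workers)
        self.failed = set()

    def on_rpc_timeout(self, name):
        # Record the failure; the worker stays busy until it reports back.
        self.failed.add(self.busy[name])


if __name__ == "__main__":
    naive = NaiveScheduler([Worker("spr2g1")])
    naive.schedule("task-1")
    naive.on_rpc_timeout("spr2g1")
    try:
        naive.schedule("task-2")
    except ExclusivityError as exc:
        print("naive:", exc)

    careful = CarefulScheduler([Worker("spr2g1")])
    careful.schedule("task-1")
    careful.on_rpc_timeout("spr2g1")
    print("careful:", careful.schedule("task-2"))
```

In the careful variant a runaway task still keeps its worker reserved until it finishes, so this only removes the concurrent-execution errors, not the lost capacity.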

          Activity

          Gerrit Gerrit added a comment -
          Change Ifce26ab32dc4a06ed348885015a466f8ca5c0843, patchset 1
          https://gerrit.sio2project.mimuw.edu.pl/2466

          SIO-1681 (partial) sioworkersd doesn't handle timeouts or network errors properly

          work in progress - not tested
          This is not a complete fix - while it should stop all exclusivity errors
          from appearing, a runaway (executing very long or just frozen somehow)
          task will still block its worker until it finishes.

          Change-Id: Ifce26ab32dc4a06ed348885015a466f8ca5c0843
          Gerrit Gerrit added a comment -
          Change Ifce26ab32dc4a06ed348885015a466f8ca5c0843, patchset 2
          https://gerrit.sio2project.mimuw.edu.pl/2466

          SIO-1681 (partial) sioworkersd doesn't handle timeouts or network errors properly

          This is not a complete fix - while it should stop all exclusivity errors
          from appearing, a runaway (executing very long or just frozen somehow)
          task will still block its worker until it finishes.

          Change-Id: Ifce26ab32dc4a06ed348885015a466f8ca5c0843
          Gerrit Gerrit added a comment -
          Change Ifce26ab32dc4a06ed348885015a466f8ca5c0843, patchset 3
          https://gerrit.sio2project.mimuw.edu.pl/2466

          SIO-1681 (partial) sioworkersd doesn't handle timeouts or network errors properly

          This is not a complete fix - while it should stop all exclusivity errors
          from appearing, a runaway (executing very long or just frozen somehow)
          task will still block its worker until it finishes.

          Change-Id: Ifce26ab32dc4a06ed348885015a466f8ca5c0843
          Gerrit Gerrit added a comment -
          Change Ifce26ab32dc4a06ed348885015a466f8ca5c0843, patchset 4
          https://gerrit.sio2project.mimuw.edu.pl/2466

          SIO-1681 (partial) sioworkersd doesn't handle timeouts or network errors properly

          This is not a complete fix - while it should stop all exclusivity errors
          from appearing, a runaway (executing very long or just frozen somehow)
          task will still block its worker until it finishes.

          Change-Id: Ifce26ab32dc4a06ed348885015a466f8ca5c0843
          Szymon Acedański added a comment -
          Should this be closed?
          Bartosz Stebel added a comment -
          Well, not really - back when I did this I planned to finish this properly...
          The problem is the one mentioned in the last commit message. For example, if a worker receives a job that will take five hours to execute (because somebody didn't set limits or something), that worker is effectively unavailable for those five hours unless somebody kills it manually. This does not affect accuracy in any way (that was fixed in the previous commit); it can only cause a slowdown or deadlock.
          Szymon Acedański added a comment -
          This issue has been automatically closed as Obsolete due to no activity for 365 days.

          Feel free to reopen it or create a new one if it's still relevant.

            People

            • Assignee:
              Szymon Acedański
              Reporter:
              Bartosz Stebel
            • Votes:
              0
              Watchers:
              3
