The SIO2 project
SIO-1681

sioworkersd doesn't handle timeouts or network errors properly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Obsolete
    • Affects Version/s: Current Version
    • Fix Version/s: None
    • Labels: None

      Description

      The worker cannot cancel a running task, because Python/Twisted threads suck. This in itself is not a big problem, but sioworkersd basically ignores this limitation, which can result in various strange errors.
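
The underlying limitation can be illustrated with the standard library alone. This is a minimal sketch, not sioworkers code: `long_task` and `run_with_timeout` are hypothetical names, and `concurrent.futures` stands in for whatever thread machinery the worker actually uses. It shows that a timeout only stops the caller from waiting, while the thread keeps running and cannot be cancelled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def long_task(started, release):
    # Hypothetical stand-in for a task that outlives the caller's patience.
    started.set()
    release.wait()
    return "done"


def run_with_timeout(timeout):
    started = threading.Event()
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(long_task, started, release)
        started.wait()  # make sure the task is really executing
        try:
            future.result(timeout=timeout)
            timed_out = False
        except TimeoutError:
            timed_out = True
        # cancel() only succeeds for calls that have not started yet.
        cancelled = future.cancel()
        still_running = future.running()
        release.set()  # let the task end so the pool can shut down
        result = future.result()
    return timed_out, cancelled, still_running, result
```

Calling `run_with_timeout(0.05)` reports a timeout, a failed cancellation, and a task that is still running and later completes normally. Any caller that times out therefore has to assume the work may still be in progress.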

      What exactly happens:
      * sioworkersd receives a task and schedules it on spr2g1
      * (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
      * worker receives task and starts executing
      * (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
      * TASK_TIMEOUT (15 minutes) passes; the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer.
          Note: this could just as well be a network failure; the outcome would be pretty much the same.
      * (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
      * At this point sioworkersd considers the task failed, but finished. Its reasoning is that the task has finished, so the worker cannot be executing it anymore, so the scheduler is free to schedule more tasks on it.
      This is of course not true in reality: the worker is still executing the first task, and all subsequent tasks sent to it will try to execute concurrently with it. If we are not using oitimetool, they will fail exclusivity checks, so this worker becomes basically useless.
      * (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
      * The worker finally finishes executing the first task. This causes an exception on the sioworkers side because of SIO-1682, which actually improves the situation: sioworkersd will drop the worker and close the connection.
      * (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
      * The worker reconnects and everything is back to normal.
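
The sequence above comes down to a bookkeeping mistake, which can be modelled in a few lines. This is a toy sketch under stated assumptions, not the sioworkersd implementation: `Manager`, `WorkerState` and every method name are hypothetical. With `quarantine_on_timeout=False` it reproduces the behaviour described in this report. The `True` variant shows one possible mitigation, and it is not necessarily what the project did, since the issue was resolved as Obsolete.

```python
from enum import Enum


class WorkerState(Enum):
    IDLE = "idle"
    BUSY = "busy"
    UNKNOWN = "unknown"  # call timed out; the worker may still be executing


class Manager:
    """Toy model of the scheduler's view of its workers (hypothetical)."""

    def __init__(self, workers, quarantine_on_timeout):
        self.state = {w: WorkerState.IDLE for w in workers}
        self.quarantine_on_timeout = quarantine_on_timeout

    def schedule(self, task):
        # Pick any worker the manager believes to be idle.
        for worker, state in self.state.items():
            if state is WorkerState.IDLE:
                self.state[worker] = WorkerState.BUSY
                return worker
        return None

    def on_result(self, worker):
        # The worker itself reported completion, so it really is free.
        self.state[worker] = WorkerState.IDLE

    def on_timeout(self, worker):
        if self.quarantine_on_timeout:
            # Do not trust the worker again until it reports or reconnects.
            self.state[worker] = WorkerState.UNKNOWN
        else:
            # Behaviour described in this report: a timeout counts as "finished".
            self.state[worker] = WorkerState.IDLE

    def on_reconnect(self, worker):
        self.state[worker] = WorkerState.IDLE
```

In the naive variant, scheduling a task, timing out, and scheduling again hands the second task to the same worker while it is still busy. In the quarantining variant the second `schedule` call returns `None` until `on_result` or `on_reconnect` is called for that worker.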

          Activity

          • Transition: New -> Resolved
          • Time In Source Status: 1663d 15h 47m
          • Execution Times: 1
          • Last Executer: Szymon Acedański
          • Last Execution Date: 2020-04-27 16:27

            People

            • Assignee: Szymon Acedański
            • Reporter: Bartosz Stebel
            • Votes: 0
            • Watchers: 3
