The SIO2 project
SIO-1681

sioworkersd doesn't handle timeouts or network errors properly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Obsolete
    • Affects Version/s: Current Version
    • Fix Version/s: None
    • Labels:
      None

      Description

      The worker cannot cancel a running task, because Python/Twisted threads cannot be interrupted once they have started. This is not a big problem in itself, but sioworkersd basically ignores the limitation, which can result in various strange errors.
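
The thread limitation can be seen with the standard library alone. The sketch below is not sioworkers code: `demo` and `long_task` are hypothetical names, and `concurrent.futures` stands in for Twisted's thread pool. It only shows that a caller can stop waiting for a running call, while the call itself keeps running.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def demo():
    """Show that timing out on a running call does not stop the call."""
    started = threading.Event()   # set once the task is actually executing
    release = threading.Event()   # lets the demo end the "task" cleanly

    def long_task():
        started.set()
        release.wait()            # stands in for a long-running job
        return "done"

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(long_task)
        started.wait()
        try:
            future.result(timeout=0.05)   # the caller gives up waiting...
            timed_out = False
        except TimeoutError:
            timed_out = True
        cancelled = future.cancel()       # ...but a running call cannot be cancelled
        still_running = future.running()
        release.set()                     # allow the task to finish
        result = future.result()
    return timed_out, cancelled, still_running, result
```

Here `demo()` reports that the wait timed out, that `cancel()` was refused, and that the task was still running afterwards, which is the same situation the worker is in when the RPC call times out.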

      What exactly happens:
      * sioworkersd receives a task and schedules it on spr2g1
      * (sioworkersd) 16:41:51+0200 [-] Running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 on spr2g1
      * worker receives task and starts executing
      * (worker) 16:41:51+0200 [-] running urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036, exclusive: True
      * TASK_TIMEOUT (15 minutes) passes, and the remote call to run in sioworkersd/manager.py:118 times out at the RPC layer.
          Note: this could just as well be a network failure; the outcome would be pretty much the same.
      * (sioworkersd) 16:56:51+0200 [-] Task urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 finished.
      * At this point sioworkersd considers the task failed, but finished. Since the task is finished, it assumes the worker is no longer executing it, so the scheduler will send more tasks to that worker.
      This is not true in reality: the worker is still executing the first task, and all subsequent tasks sent to it will try to execute concurrently with it. If we are not using oitimetool, they will fail exclusivity checks, so this worker becomes basically useless.
      * (worker) 17:09:10+0200 [-] urn:uuid:36f2612b-1334-4b7a-92c1-f69da96a3036 done.
      * Worker finally finishes executing the first task. This causes an exception on the sioworkers side because of SIO-1682, which actually improves the situation: sioworkersd will drop the worker and close the connection.
      * (worker) 17:09:13+0200 [-] Starting factory <sio.protocol.worker.WorkerFactory instance at 0xb75dd34c>
      * Worker reconnects, everything is back to normal.
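
The bookkeeping problem in the sequence above can be modelled in a few lines. This is a minimal sketch, not the actual sioworkersd scheduler: `WorkerState`, `on_rpc_timeout_naive`, `on_rpc_timeout_conservative` and `on_worker_confirmed_idle` are hypothetical names, and the "conservative" variant is just one possible way to handle the timeout, assuming the worker eventually reports completion or reconnects.

```python
class WorkerState:
    """Hypothetical scheduler-side record for one worker."""

    def __init__(self, name):
        self.name = name
        self.running = set()      # tasks the scheduler believes are executing
        self.unconfirmed = set()  # tasks whose RPC failed; worker may still run them

    def can_schedule(self):
        # A worker is only free when nothing is running or possibly running.
        return not self.running and not self.unconfirmed


def on_rpc_timeout_naive(worker, task_id):
    # Behaviour described in the report: a timeout is treated like completion,
    # so the worker immediately looks free to the scheduler.
    worker.running.discard(task_id)


def on_rpc_timeout_conservative(worker, task_id):
    # Possible alternative: the task is failed for the client, but the worker
    # stays blocked until it is known to be idle.
    worker.running.discard(task_id)
    worker.unconfirmed.add(task_id)


def on_worker_confirmed_idle(worker):
    # Called when the worker reports the task done, or reconnects.
    worker.unconfirmed.clear()
```

With the naive handler, `can_schedule()` returns True right after the timeout even though the worker is still busy; with the conservative one it stays False until `on_worker_confirmed_idle` is called.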


            People

            • Assignee:
              Szymon Acedański
              Reporter:
              Bartosz Stebel
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved: