Theoretical Framework

Some History…

The development of a framework for CCDS process scheduling is rooted in the job scheduling problems that emerged with industrialization and assembly-line work. It is not a coincidence that many of the classical scheduling problems have names like Job Shop Problem or Flow Shop Problem. Back then and now the targets are the same: execute complex jobs that might need treatment on different machines in a defined sequential orders with certain degrees of efficiency. Efficiency in this context can be defined in a number of ways. It might be the total time needed to execute a set of jobs, the compliance of given deadlines, the minimization of overall costs or a combination of all the above.

There are two ways to view the the Cognitive Continuum Digital Shadow¹:

¹ CCDS

Facility- or Machine-centered view.
Job-centered view.

The first is best modelled by a supply chain model. The second is principally a scheduling approach.

Figure 1: Two different viewponts of the CCDS: a supply chain model (left) and a scheduling model (right). DC = data center, CC = compute center, J = job.

We propose a combined model for the CCDS, composed of a Supply Chain Network Design² coupled with a Resource Constrained Process Scheduling³. In such a hierarchical system, the supply chain model allocates jobs to the best facilities, and the scheduling model then (optimally) sequences jobs among facilities and within a given facility. Supply chain optimization emphasizes “where” (facility choice, data transfer mode). Scheduling emphasizes “when” and “in what order.”

² SCND

³ RCPSP

Recall the basic problem: given a directed graph of $N$ data storage, transfer and processing jobs $\mathscr{J},$ find an optimal allocation $A$ and a feasible schedule $S.$ Further details below.

$\mathscr{J}=\{ J_1, J_2, \ldots J_N\},$

Supply Chains with Scheduling

Let us begin by defining supply chains and scheduling in the context of the CCDS.

Definition 1 (Supply Chain) A supply chain is a network (directed graph) of jobs, facilities, data storage, networks, processing and end-product delivery to users. The design poroblem is to optimally allocate an ordered list of data storage and data processing jobs to an optimal choice of storage and processing centres.

Definition 2 (Scheduling) Scheduling is a decision-making process that deals with the allocation of (limited) resources to tasks over given time periods. Its goal is to optimize one or more objectives. The resulting schedule is a job sequence determined for every machine (facility) of the processing system.

These two models are complementary:

the SCND can be used for global, over a fixed period, modelling—for example over an entire project duration, or annualized;
the RCPSP can be used for time-dependent models, especially when uncertainty is introduced, as in a two-stage, or multi-stage, model—see below Section Stochastic, multistage programming.

In Figure 2 and Figure 3 below, we show the two alternative viewpoints of a simple 4-job, 4-machine scheduling problem. The global mathematical formulation will effectively include and combine these two viewpoints, as depicted in Figure 4 and formulated below.

graph LR
    A[0] -->|"$$c_{0,1}$$"| B("$$J_1$$")
    B -->|"$$c_{1,2}$$"| C("$$J_2$$")
    C -->|"$$c_{2,3}$$"| D("$$J_3$$")
    C -->|"$$c_{2,4}$$"| E("$$J_4$$")
    D -->|"$$c_{3,5}$$"| F[5]
    E -->|"$$c_{4,5}$$"| F

Figure 2: Job-centered view: A simple directed graph (DAG) for a genomics workflow with 4 processing jobs $J_1,\cdots, J_4,$ and data transfers $c_{i,j}$ between them, including initial upload and final download. The tasks 0 and 5 are dummy tasks, representing the start and the end of the workflow.

flowchart LR
    A[(DC1)]:::otherclass --> C{{HPC1}}:::thirdclass
    B[(DC2)]:::otherclass 
    A --> D{{HPC2}}:::thirdclass
    %%A[(data1)]:::someclass --> E{{HPC3}}:::thirdclass
    C --> F[(DC1)]:::otherclass
    C --> G[(DC2)]:::otherclass
    F --> H{{HPC1}}:::thirdclass
    F --> I{{HPC2}}:::thirdclass  
    classDef someclass stroke:#f00
    classDef otherclass stroke:#0f0
    classDef thirdclass stroke:#00f

Figure 3: Machine-centered view: Supply chain model for the CCDS. For a given job, the dataset (stored in one of two Datacentres, $\mathrm{DC}_1$ or $\mathrm{DC}_2$) can be processed on one of two HPC centres ($\mathrm{HPC}_1$ or $\mathrm{HPC}_2$), and ouput data is sent to (and stored on) one of the two Datacentres, ready as input for the next job. This is repeated for each job in the ordered job list (DAG).

Figure 4: The CCDS model is based on RCPSP and SCND.

Supply Chain Network Design with Resource Constrained Scheduling

Problem Definition

The resource-constrained project scheduling problem is a classical, well-known problem in operations research, and started with the CPM⁴ that was developed in the 1950’s. A number of activities are to be scheduled. Each activity has a duration and cannot be interrupted. There are a set of precedence relations between pairs of activities which state that the second activity must start after the first has finished.

⁴ Critical Path Method

The set of precedence relations are usually given as a directed acyclic graph (DAG), where the edge $(i,j)$ represents a precedence relation where job $i$ must finish before job $j$ begins. The DAG contains two additional dummy activities with duration 0, the source and sink, where the source is the first activity and the sink is the last activity.

There are also sets of renewable resources and non-renewable resources. Each resource has a maximum capacity and at any given time slot no more than this amount can be in use. Each activity has a demand (possibly zero) on each resource. The dummy source and sink activities have zero demand on all resources.

Multi-Mode Variant The multi-mode resource-constrained project scheduling problem⁵ is an extension of the resource-constrained project scheduling problem where each task can be executed in one of a number of alternative modes. The problem aims to select a single task mode from a set of available modes in order to construct a precedence- and resource-feasible project schedule with a minimal makespan. This is similar to a supply chain.

⁵ MRCPSP

Non-Renewable Resources Another extension concerns non-renewable resources. Each non-renewable resource has a capacity for the entire schedule. An example would be a financial, or human resources budget that applies to the entire project. Modes of activities must be chosen to avoid exceeding the total capacity of each of the non-renewable resources.

Time Windows Instead of optimizing over the maximum value of the project timespan, we can use time windows. The benefit of time windows is twofold: first, they can be used in the mathematical programming formulation to reduce the number of variables substantially, i.e. they provide a tighter formulation. Second, they can be utilized in several enumeration procedures to speed up the convergence of the underlying optimization algorithms.

State Variables

The state of the system at a given time $t,$ is represented by a set of binary-valued state variables that designate the facilities that are used, at each stage, by each job to be performed in the workflow.

Objective Function

The overall objective is to combine location decisions—which centres to use—with allocation decisions—how to distribute the workloads and jobs among the chosen centres (data and compute). Various project performance metrics can be optimized, including task-based, resource-based, financial-based and user-based metrics. One can minimize makespan, tardiness, resource consumption, or maximize total NPV with respect to sustainability, for example.

There can be a single objective such as minimizing (any function of) the total of fixed and variable costs, or minimizing the completion time (makespan). Multiple objective optimization⁶ seeks a tradeoff between minimum cost and maximum sustainability (minimum environmental impact). Finally, stochastic optimization takes into account the uncertainties of resource availabilities and delays, maintenance and failures, resource allocations, variable energy costs, project costs (HR, budget).

⁶ MOO

⁷ Some terms can be ignored by setting the coefficients to zero, depending on the context.

The overall, deterministic cost function can be expressed either as a bottleneck objective, where we seek to minimize the longest or most expensive job, or as a weighted sum over all jobs of makespan, cost and any functions of these.⁷

For example, suppose we have $m$ machines $\{M_j\}_{j=1}^{m}$ to process $n$ jobs $\{J_i\}_{i=1}^{n}.$ A job $J_i$ can eventually be broken down into $n_i$ operations $O_{i1},\ldots,O_{in_i},$ each with its own processing requirement $p_{ij}.$

Objective Functions Denote the finishing time of job $J_i$ by $C_i$ and the associated cost $f_i(C_i).$ Then there are two types of total cost functions, the bottleneck objective \[ f_{\mathrm{max}}(C) \doteq \max \{f_i(C_i) \mid i=1,\ldots,n\} \] and the sum objective \[ \sum f_i(C) \doteq \sum_{i=1}^{n} f_i(C_i). \] The sum can also be weighted, if required.

Mathematical Formulation - Deterministic Model

Resource-Constrained Project Scheduling Problem (RCPSP)

The Resource-Constrained Project Scheduling Problem (RCPSP) is an NP-hard combinatorial optimization problem that consists of finding a feasible scheduling for a set of $n$ jobs subject to resource and precedence constraints. Each job has a processing time, a set of successor jobs and a required amount of different resources. Resources may be scarce but are renewable at each time period. Precedence constraints between jobs mean that no jobs may start before all its predecessors are completed. The jobs must be scheduled non-preemptively, i.e., once started, their processing cannot be interrupted.

The RCPSP has the following input data:

$\mathcal{J} = \{J_1, J_2, \ldots, J_n \}$ set of jobs.
$\mathcal{R}$ set of renewable resources.
$\mathcal{S}$ set of precedences⁸ between jobs $(i,j)\in\mathcal{J}\times\mathcal{J}.$
$\mathcal{T}$ planning horizon: set of possible processing times for jobs.
$p_{j}$ processing time of job $j.$
$u_{jr}$ amount of (renewable) resource $r$ required for processing job $j.$
$c_{r}$ capacity of renewable resource $r.$

⁸ These can be rigorously defined using an order relation, $i \prec j$.

⁹ An alternative formulation, based on disjunctions will be presenetd in the Examples—see Disjunctions.

There are many different Mixed Integer Linear Programming (MILP) formulations for the RCPSP. We choose the most suitable, discrete-time (DT) formulation⁹. This is a binary formulation where we have a binary variable $x_{i,t}$ for each activity $i$ and starting time/date $t,$
\[ x_{i,t} = \begin{cases} 1, \quad \text{if activity $i$ starts on day/time $t,$} \\ 0, \quad \text{otherwise.} \end{cases} \]

The binary programming formulation, proposed by Pritsker et al. in 1986 can be written as follows.

\[ \begin{align} \text{Minimize} \quad & \sum_{t\in \mathcal{T}} t\cdot x_{n+1,t} & (1)\\ \text{Subject to:} \quad & \sum_{t\in \mathcal{T}} x_{j,t} = 1 \,\,\, \forall j\in J & (2)\\ & \sum_{j\in J} \sum_{t_2=t-p_{j}+1}^{t} u_{jr}x_{j,t_2} \leq c_{r} \,\,\, \forall t\in \mathcal{T}, r \in R & (3) \\ & \sum_{t\in \mathcal{T}} t\cdot x_{s,t} - \sum_{t \in \mathcal{T}} t\cdot x_{j,t} \geq p_{j} \,\,\, \forall (j,s) \in S & (4)\\ & x_{j,t} \in \{0,1\} \,\,\, \forall j\in J, t \in \mathcal{T} & (5) \end{align}\]

The objective function (1) represents the sum of all possible start dates for the final sink job (dummy variable) $x_{n+1,t}.$ We know that only one of these variables will equal 1 for a specific $t.$ Therefore, by minimizing the sum of the product $t \cdot x_{n+1,t},$ we are effectively minimizing the total project duration. Constraint (2) ensures that each activity $i$ has exactly one start date, i.e. a single execution. Constraint (3) guarantees that for any time period $t,$ the schedule does not exceed the capacity $c_r$ for any renewable resource¹⁰ $r.$ Finally, constraint (4) ensures that if an activity $j$ follows another activity $i,$ then activity $j$ must start after the finish time of activity $i,$ which is equal to the start time of activity $i$ plus its duration $p_i.$ This is the precedence constraint.

¹⁰ Non-renewable resources will be introduced below in the multi-mode formulation.

Time-indexed vs. Disjunctive approaches

Most common models for the RCPSP rely on time discretization where the scheduling horizon is decomposed into discrete time intervals of unit length. The jobs get allocated to feasible starting times that respect the resource and precedence restrictions. Since the model size expands with increasing scheduling horizon, time-discrete models can become intractable for large time horizons. In this case, we can resort to disjunctive formulations, based on the big-M approach. See examples in the CCDS Implementation section.

Mutli-Project Multi-Mode RCPSP

Note that the above formulation is restricted to temporal scheduling on a single machine, and for a single project—a collection of jobs. This formulation can be generalized to deal with multiple machines and multiple projects. In this case, the formulation becomes very similar to that of a supply chain. It is referred to in the literature as the multi-mode resource-constrained multi-project scheduling problem, or MRCMPSP.

Multi-Mode The single-mode RCPSP assumed that each activity has only one way to be executed, whereas the multi-mode RCPSP considers multiple ways to execute an activity, which often have tradeoffs in duration, cost, or resource requirement.

Let the decision variable $x_{jm,t} \in \{0,1 \}$ denote the execution of job $j$ in mode $m$ completed in period $t.$ Then the MRCPSP with tight, time-indexed formulation¹¹ can be written as:

¹¹ A disjunctive formulation, based on XOR relations, is also possible—see Disjunctions

\[ \begin{align} \min \quad & \sum_{t=\mathrm{EF}_J}^{\mathrm{LF}_J} t\cdot x_{J1,t} & (1)\\ \text{s.t.} \quad & \sum_{m=1}^{M_j} \sum_{t=\mathrm{EF}_J}^{\mathrm{LF}_J} x_{jm,t} = 1, \,\,\, j=1,\ldots, J ,& (2)\\ & \sum_{m=1}^{M_j} \sum_{t=\mathrm{EF}_h}^{\mathrm{LF}_h} t \cdot x_{hm,t} \le \sum_{m=1}^{M_j}\sum_{t=\mathrm{EF}_j}^{\mathrm{LF}_j} (t - p_{jm}) \cdot x_{jm,t}, \,\,\, j=2,\ldots, J, \, h \in \mathcal{P}_j , & (3)\\ & \sum_{j=2}^{J-1} \sum_{m=1}^{M_j} k_{jmr}^{\rho} \sum_{q=\max\{t,\mathrm{EF}_j\} }^{\min \{ t+p_{jm}-1,\mathrm{LF}_j \}} x_{jm,q} \le K_r^{\rho}, \,\,\, r\in R^{\rho}, \, t=1,\ldots, \bar{T}, & (4) \\ & \sum_{j=2}^{J-1} \sum_{m=1}^{M_j} k_{jmr}^{\nu} \sum_{t=\mathrm{EF}_J}^{\mathrm{LF}_J} x_{jm,t} \le K_r^{\nu}, \,\,\, r\in R^{\nu}, & (5)\\ & x_{jm,t} \in \{0,1\}, \,\,\, j=1,\ldots,J, \, m=1,\ldots ,M_j, \, t =0,\ldots, \bar{T}, & (6) \end{align}\] where $\mathcal{P}_j$ is the set of predecessors of job $j,$ the earliest finish and latest finish times of job $j$ are denoted $\mathrm{EF}_j,$ $\mathrm{LF}_j,$ an upper bound on the project’s makespan is given by $\bar{T}$, and we have denoted renewable resources by the index $\rho$ and non-renewables by $\nu.$

Note that objective (1) is the minimization of the makespan¹², the constraints (2) indicate that each activity is assigned exactly one mode and exactly one finish time, (3) ensures that no activity is started until all its predecessors are finished, (4) ensures that the per-period levels of the renewable resources are met, and consumption of the nonrenewable resources is limited to their availabilities by (5). Finally, restricting the summations¹³ over time to the intervals $[\mathrm{EF}_j, \mathrm{LF}_j]$ reduces drastically the number of variables and gives better convergence. These time windows can be computed by simple forward and backward recursion loops.

¹² To which we can add any cost function of the duration.

¹³ This produces what is known as a tighter formulation.

¹⁴ Or terms, if other costs are to be taken into account, e.g. related to sustainability

Multi-Objective To include infrastructure costs in the objective function, we just add a term¹⁴ \[\begin{align} \min \quad & \sum_{t=\mathrm{EF}_J}^{\mathrm{LF}_J} t\cdot x_{J1,t} + \sum_{j,m} c_{jm} x_{jm} & (1')\\ \end{align}\] where $c_{jm}$ is the cost associated with the use of facility $m$ for task $j.$

Multi-Project We can also perform simultaneous scheduling of a set of multiple projects taking into account the availability of local and global resources under different time and resource constraints. This has practical importance, at national and European levels, when cross-facility implies exploitation of cyberinfrastructure resources across different countries, for example, as would be the case for EuroHPC¹⁵, at the European level, or GENCI¹⁶ for France. Imagine being able to plan and pilot multiple exascale projects, on multiple sites, over multiple countries…¹⁷

¹⁵ https://www.eurohpc-ju.europa.eu/

¹⁶ https://www.genci.fr//

¹⁷ These would be multiple states, in the USA context.

¹⁸ SCCP

Conclusion This general formulation of the RCPSP is then mathematically equivalent to the supply chain configuration problem¹⁸ with the addition of resource constraints. The loop is closed.

Dynamic Networks

We can generalize the supply chain by considering a multi-stage network with an added time dimension. This enables flow and carrying of data and jobs across time periods, in addition to flow among facilities. For example, if certain facilites become unavailable, or less available, as the project evolves over time. This is a kind of recourse, which will be considered below in Section Stochastic, multistage programming.

flowchart LR
    subgraph t1["t = 1"]
    direction LR
    a1((" "))-->a2((" "))
    a1-->a3((" "))
    a2-->a4((" "))
    a2-.->a5((" "))
    end
    subgraph t2["t = 2"]
    direction LR
    b1((" "))-->b2((" "))
    b1-->b3((" "))
    b2-.->b4((" "))
    b2-->b5((" "))
    end
    t1 --> t2
    style t1 fill: white
    style t2 fill: white

Figure 5: Multi-stage, dynamic network that evolves over time due to changing availability constraints.