## Abstract

This paper presents an implementation of a homotopy path tracking algorithm for polynomial numerical continuation on a graphical processing unit (GPU). The goal of this algorithm is to track homotopy curves from known roots to the unknown roots of a target polynomial system. The path tracker solves a set of ordinary differential equations to predict the next step and uses a Newton root finder to correct the prediction so the path stays on the homotopy solution curves. In order to benefit from the computational performance of a GPU, we organize the procedure so it is executed as a single instruction set, which means the path tracker has a fixed step size and the corrector has a fixed number iterations. This trade-off between accuracy and GPU computation speed is useful in numerical kinematic synthesis where a large number of solutions must be generated to find a few effective designs. In this paper, we show that our implementation of GPU-based numerical continuation yields 85 effective designs in 63 s, while an existing numerical continuation algorithm yields 455 effective designs in 2 h running on eight threads of a workstation.

## 1 Introduction

Freudenstein and Roth [1] described what they called a parameter-perturbation procedure to compute the solution of a set of non-linear equations starting from a known solution of another similar set of equations. They used this technique to solve the synthesis equations for a four-bar linkage that guides a coupler point through nine specified task positions [2]. Wampler et al. [3] returned to this nine point synthesis problem with what they called numerical polynomial continuation and identified Roth and Freudenstein’s approach as a type of numerical continuation.

Zangwill and Garcia [4] described a wide range of problems that can be solved by tracking the paths of a homotopy, such as nonlinear programming, economic equilibria, and game theory. Morgan [5] applied this approach to finding all of the solutions of a system of polynomial equations. Tsai and Morgan [6] used numerical polynomial continuation to solve for the inverse kinematics of a general 6R robot, which was an important outstanding problem at the time.

Since then numerical continuation has improved in capabilities, see Ref. [7]. Now a variety of software packages for numerical continuation are available, such as Bertini [8], PHCpack [9], HOM4PS [10], and POLSYS_GLP [11].

Homotopy solution of a system of polynomials follows the known roots of a starting polynomial system as its coefficients are smoothly changed into the coefficients of the target polynomial system [7]. The process of following the transformation of these known roots into the roots of the target system is known as path tracking and can be parallelized for distributed computation [11].

Graphics processing units (GPUs) were developed to accelerate rendering calculations in computer graphics [12]. These devices execute the same instruction set for each pixel in a display at very high speed. This capability has been deployed in other applications where identical sets of instructions are executed for a large number of cases. Examples are computational fluid mechanics [13], robot motion planning [14,15], and deformable body modeling for computer graphics [16].

In contrast to parallelization on a cluster of central processing units (CPUs), parallelization on a GPU requires the execution of identical instruction sets on a collection of threads known as a warp to ensure maximum performance. Verschelde and Yoffe [17] introduced the use of a GPU for polynomial homotopy, using it to evaluate the polynomials and their derivatives with extended precision mathematics. The result was speeds of almost 20 times the speed of computation on a single CPU.

In order to increase the performance for path tracking on a GPU in a numerical polynomial continuation solver, a strategy to manage changes in step size is needed, because the execution of a conditional statement in one thread can pause the computation in other threads of a warp until the conditional is completed. Verschelde and Yu [18,19] manage this by providing three levels of adaptive step sizes for tracking the tens of thousands solution paths in GPU-based polynomial homotopy solver as well as using higher precision to more accurately track paths.

This paper presents a different strategy for path tracking using a GPU in a polynomial continuation solver, which avoids changes in step size. Paths for which the desired accuracy has not yet been achieved are grouped and computed with a smaller step size. This approach is well-adapted to the need for high-speed path tracking of many thousands of paths needed to solve the design equations for the kinematic synthesis of mechanisms, see for example Ref. [20].

## 2 Path Tracking

*n*polynomials, $P(x)=(p1,p2,\u2026,pn)$ in

*n*unknowns $x\u2208Cnx1=(x1,x2,\u2026,xn)$ by starting with a system $S(x)$ that has the same total degree,

*M*. The start system $S(x)$ is constructed so that these roots have known values

**y**

_{k},

*k*= 1, …,

*M*, that is

*M*roots $x\xafk$ of $P(x)$ are obtained by smoothly transforming the roots of $S$ into those of $P$ and tracking the paths

**v**

_{k}(

*t*) = (

*v*

_{1k},

*v*

_{2k}, …,

*v*

_{nk}) as

*t*varies from 1 to 0; the initial positions of the roots are

**y**

_{k}=

**v**

_{k}(1) and their final positions are $x\xafk=vk(0)$. Note that while $S(x)$ starts with

*M*roots, the target system $P(x)$ may have fewer than

*M*roots. The number of roots is $P(x)$ is upper bounded by

*M*.

**v**

_{k}(

*t*) exists for each root of the start system and is a solution of the polynomial system $H(x,t)$, that is

**x**and

*t*, given by

**J**

_{x}is an

*n*×

*n*matrix and

**H**

_{t}an

*n*× 1 vector with elements

*J*

_{ij}and

*H*

_{i}, respectively, given by

**x**,

*t*) that is sufficiently close to a path

**v**(

*t*) of the homotopy, such that $H(x,t)\u22480$. Then, we can predict a new point

**p**=

**x**+ Δ

**x**at

*t*+ Δ

*t*by setting $H(x+\Delta x,t+\Delta t)=0$ and solving for the first-order terms of Eq. (4) to obtain

**c**=

**x**+ Δ

**x**for Δ

*t*= 0 such that $H(x+\Delta x,t)\u22480$. From Eq. (4), we obtain

## 3 Numerical Path Tracking

**f**(

**x**,

*t*) for the vector function in Eq. (7), so we have,

*t*, the next prediction point

**p**of a path can be calculated using the Runge–Kutta fourth-order formulas,

**p**involves four evaluations of

**f**for different arguments, each of which requires finding the inverse of the

*n*×

*n*Jacobian matrix

**J**

_{x}.

### 3.1 LU Decomposition.

**J**

_{x}is known as Lower-Upper (LU) decomposition [21]. This is achieved by permuting

**J**

_{x}, so that it can be factored into the product of a lower triangular matrix

**L**and an upper triangular matrix

**U**, that is

**P**is an

*n*×

*n*matrix that permutes the rows of

**J**. Write Eq. (7) in the form,

_{x}**z**=

**U**Δ

**x**and use sequential elimination by rows to solve

**z**. Then, back-substitution is used to solve

**x**.

This solution Δ**x** is used to calculate each of the four terms **K**_{i}, *i* = 1, 2, 3, 4 in the Runge–Kutta calculation for **p**.

### 3.2 Newton’s Correction.

**p**(

*t*+ Δ

*t*) along a path

**v**(

*t*) takes the point away from the homotopy hypersurface $H(x,t)=0$, so we use Newton’s method to find the nearby root

**c**(

*t*+ Δ

*t*). Write Eq. (8) in the form,

**x**, which yields the correction,

### 3.3 Graphical Processing Units Implementation of Path Tracking.

A GPU consists of an array of streaming multiprocessors (SMs), each of which is analogous to the core of a CPU. However, unlike a CPU that can execute different instruction sets concurrently, the SMs of a GPU execute the same instruction set concurrently. An important consequence of this aspect of SM computation is that conditional instructions degrade the performance on a GPU. This can occur wherever there is an IF statement or a WHILE statement, because when the instructions of some of the threads in a warp are different from the rest, the SM must execute the two different instruction sets one at a time, which results in stalled threads and under-utilization of the GPU. The amount of under-utilization depends on the size of the instruction sets to be used. IF statements are quite useful so they should not be avoided all together, rather the amount of divergence should be considered when implementing code on a GPU, particularly if maximum performance is desired.

In polynomial numerical continuation, branch divergence can occur when the path tracking step size Δ*t* is changed, and when the number of correction steps is changed to achieve convergence. The path tracking algorithms of numerical polynomial continuation solvers implement adaptive step size and convergence checking to ensure the accuracy of each path. Unfortunately, these features can have a large performance hit when implemented on a GPU, because when path tracking step is changed in one thread all the other threads in the warp are paused until the computation is completed. This happens when the number of correction steps is changed as well.

Our approach is to introduce a new path tracking algorithm that is a better match to the SIMT requirements of a GPU, Algorithm 1. Our algorithm is has a fixed number *N* of equal steps Δ*t* along the path, as well as a fixed number of iterations *C* of the Newton correction.

#### New path tracking algorithm with fixed step size Δ t.

It is possible to implement a path tracker with an adaptive step size on a GPU; however, this causes the all the threads of the GPU to run at the speed of the most difficult path. Rather than parallelize the path tracking, it is also useful to use the GPU to speed up the computation of the elements of both the polynomial system and its Jacobian matrix, see Refs. [18,19]. Our goal is to use the GPU to minimize computation time, so we fix the step size to minimize conditional branching.

Tracking paths with a fixed step size can mean that we cannot avoid the zones around areas singularities which would cause the Jacobian matrix, **J _{x}**, to become ill-conditioned. In these zones, the path will almost certainly not result in a solution to the target system of polynomials. Similarly, if the number of iterations of Newton’s correction is not enough to assure convergence, the path will not result in a solution.

Both of these issues can be mitigated by an evaluation kernel call that checks all of the roots computed for $P(x)$ and identifying those paths that do not yield accurate roots. Those that fail can be recalculated using a smaller step size and larger values for *N* and *C*. This strategy sacrifices individual path tracking accuracy for increased GPU computational performance associated with a single instruction set.

## 4 Four-Bar Linkage Synthesis

In this section, we formulate the synthesis equations for a four-bar linkage that we solve using our GPU implementation of numerical polynomial continuation. The goal is to compute the dimensions of a four-bar linkage that guides its coupler through five task positions. We formulate the design problem following Glabe and McCarthy [22] using the loop equations of the linkage. This is different from the usual approach known as Burmester theory that uses the constraint equations of a crank [23]. We use this approach because it can be generalized to design more complex linkage systems [20].

All code was implemented in the Compute Unified Device Architecture (cuda) language and executed on an Nvidia Quadro M2000 GPU. The sections are broken up into a series of individual functions known as kernel calls. A kernel call is a CUDA term for a function that is called from the CPU, but executed on the GPU.

### 4.1 Loop Equations.

The synthesis equations that we will solve are obtained from the loop equations of the four-bar linkage. Let the coordinates of the fixed pivots of the linkage be denoted *O* = *O*_{x} + *iO*_{y} and *C* = *C*_{x} + *iC*_{y} and let the moving pivot coordinates be *A* = *A*_{x} + *iA*_{y} and *B* = *B*_{x} + *iB*_{y}.

*j*= 1, …, 5, where

*P*

_{j}is the position of the origin and $\theta j$ is the orientation of a desired end effector pose with respect to the

*x*-axis. Then, evaluate the loop equations and their conjugates for each of these task positions. The result is

*Q*

_{j}and

*T*

_{j},

*O*,

*A*,

*B*, and

*C*and their conjugates that define the coordinates of the pivots of the linkage in the reference position and the relative angles $\varphi j$ and $\psi j$,

*j*= 1, …, 5, that define the movement of the linkage through the five task positions.

*Q*

_{j}, $Q\xafj$,

*S*

_{j}, and $S\xafj$ by solving Eq. (21) for

*Q*

_{j}and

*S*

_{j}and Eq. (22) for $Q\xafj$ and $S\xafj$ and then substitute the results into Eq. (23). This yields

*P*

_{1}= (0, 0) and $\theta 1=0$ and measuring the remaining four task positions relative to this frame. To do this, compute the five homogeneous transformation matrices

*H*

_{j}associated with the given task positions, $\Gamma j=(\theta j,Pj)$,

*j*= 1, …, 5,

_{1}

*P*

_{1},

^{8}= 256. We used the numerical continuation solver Bertini [8] to compute a start system for this polynomial system and found that it had a multi-homogeneous degree of 25.

Finding the solutions to these equations can be divided into three separate kernel calls: Path Tracker, Task Generator, and Solution Filter.

### 4.2 Path Tracker.

We use the numerical continuation software Bertini to solve start system $S$ that can be used for parameter continuation to solve a polynomial system $P(p,x)$ for different values of the parameters **p**. Bertini chooses a generic set of parameters **q** so that the start system is $S(q,x)$ has known roots **y**_{k}.

**q**computed by Bertini to construct the parameter homotopy,

**J**and the 8 × 1 vector

_{x}**H**

_{t}; see Ref. [5]. These symbolic equations are copied into the CUDA kernel call, which is labeled Path Tracker.

Path Tracker includes the algorithm for LU decomposition, Runge–Kutta prediction, and Newton correction. Organizing the calculations in this way uses the advantages of the GPU for rapid computation; however, it means that some of the paths may not converge to roots of our target system. Our formulation of the linkage synthesis problem reduces the importance of finding any particular root. This is discussed in Sec. 4.3.

### 4.3 Task Generator and Solution Filter.

It is the nature of this linkage design problem that for a given task $T:\Gamma j=(\theta j,Pj)$, *j* = 1, …, 5, the resulting four-bar linkage may have one or more of a set of various defects [25,26]. To address this Plecnik and McCarthy [27] introduced tolerance zones around the specified task positions and randomly selected small variations within these zones. The result is a successful set of linkages that reach task positions close to the originally specified positions, see also Ref. [28].

*j*= 1, …, 5, and a set of tolerance zones $(\Delta \theta ,\Delta x,\Delta y)i$,

*i*= 1, …, 5, and writes

*L*new tasks,

*L*, the number of task iterations, to be 200. Our Path Tracker computes up to 25 roots for each of these 200 paths for 5000 possible linkage designs.

We use a third kernel call Solution Filter to evaluate each of the 5000 roots to determine if (*O*, *A*, *B*, *C*) and $(O\xaf,A\xaf,B\xaf,C\xaf)$ are complex conjugates. This ensures that the root yields a physical linkage. It is known that this synthesis problem can have at most six solutions, which means of the 25 roots that exist for each task at most six define physical linkages [23]. It is important to mention that while Solution Filter does contain IF statements, the branch divergence caused by these statements is quite small and thus has minimal impact on performance.

### 4.4 Outline of Execution.

A block diagram of our GPU-based numerical continuation algorithm for four-bar linkage synthesis is shown in Fig. 3. The equations for the start system, its roots, the homotopy equations, and the derivatives **J _{x}** and

**H**

_{t}are programed for execution on the GPU.

Execution starts with user input of task positions, tolerance zones, and the number of task iterations *L*. The number of steps *N* for the path tracker and the number *C* of iterations of the Newton correction are also user-specified. We set the number of step *N* = 100, step size Δ*t* = 1/*N*, and set the number of Newton corrections to *C* = 2. It reads the coefficients of the 16 parameters for both the start system (**q**) and the target system (**p**), along with each of the roots **x**_{k}, *k* = 1, …, 25.

The Task Generator reads this data and writes *L* different tasks to the GPU memory. The Path Tracker computes the roots for the synthesis equations for each task in the GPU memory. The Solution Filter evaluates each of the roots obtained for all of the tasks to determine those that define physical linkages.

The physical linkages identified in the GPU must be evaluated to determine they are defect-free, which we call effective solutions. Linkages with branch and circuit defects are rejected, but those with order defects are allowed. Examples of this analysis can be found in references such as [29] or [23]. The output of this algorithm is a list of linkage designs that reach the specified task positions within the given tolerance zones.

## 5 Demonstration

In order to demonstrate this algorithm, we use a Lenovo workstation with an Intel Xeon 2.10 GHz CPU, running Windows 10 with an NVIDIA Quadro M2000 GPU. The five task positions together are listed in Table 1 and shown in Fig. 4. The tolerance zones chosen were $\Delta \theta =0.5deg$, Δ*x* = Δ*y* = 0.1. The original transformed task positions that are used with our synthesis equations are listed in Table 2. Because the randomization can produce two different sets of task positions with differing numbers of solutions to analyze, we generated one standard randomized set and ran both the CPU and GPU code on it. The time to calculate 200 iterations or 5000 threads, with the GPU is shown in Table 3 to be 63 s.

j | $\theta j$ (deg) | P_{j} |
---|---|---|

1 | 80 | (2.7, 4.9) |

2 | 55 | (2.8, 4.6) |

3 | 0 | (2.9, 4.4) |

4 | −35 | (3.0, 4.2) |

5 | −60 | (3.0, 4.0) |

j | $\theta j$ (deg) | P_{j} |
---|---|---|

1 | 80 | (2.7, 4.9) |

2 | 55 | (2.8, 4.6) |

3 | 0 | (2.9, 4.4) |

4 | −35 | (3.0, 4.2) |

5 | −60 | (3.0, 4.0) |

j | $\theta j$ (deg) | P_{j} |
---|---|---|

1 | 0 | (0, 0) |

2 | −25 | (−0.28, −0.15) |

3 | −80 | (−0.46, −0.28) |

4 | −115 | (−0.64, −0.42) |

5 | −140 | (−0.83, −0.45) |

j | $\theta j$ (deg) | P_{j} |
---|---|---|

1 | 0 | (0, 0) |

2 | −25 | (−0.28, −0.15) |

3 | −80 | (−0.46, −0.28) |

4 | −115 | (−0.64, −0.42) |

5 | −140 | (−0.83, −0.45) |

Hardware | Time (s) | Solutions | Physical sols. | Effective sols. |
---|---|---|---|---|

CPU | 7588 | 3774 | 1510 | 455 |

GPU | 63 | 4425 | 2382 | 85 |

Hardware | Time (s) | Solutions | Physical sols. | Effective sols. |
---|---|---|---|---|

CPU | 7588 | 3774 | 1510 | 455 |

GPU | 63 | 4425 | 2382 | 85 |

For comparison, we used mathematica 11.1 to define the synthesis equations for 200 randomized tasks for computation using Bertini v1.5.1 on the Lenovo workstation. The Lenovo workstation has multiple cores which allows Bertini to be ran in parallel using eight CPU threads. Bertini by default performs path tracking using an adaptive step size. The speed up using the GPU was 120 times compared with the eight CPU thread computation. One example solution computed by the GPU is given by the coordinates in Table 4 and shown in each of the task positions in Fig. 5.

Point | (x, y) coordinates |
---|---|

O | (2.37, 4.43) |

A | (2.54, 4.13) |

B | (2.46, 4.58) |

C | (2.73, 4.29) |

Point | (x, y) coordinates |
---|---|

O | (2.37, 4.43) |

A | (2.54, 4.13) |

B | (2.46, 4.58) |

C | (2.73, 4.29) |

A comparison of the results of the two calculations shows the impact of adaptive step size and convergence test for the Newton corrector in Bertini as opposed to the fixed step size and fixed number of Newton iterations. Bertini calculated fewer physical solutions as our GPU code, but more effective designs. This is likely due to the fact that Bertini checks for paths crossing, whereas the GPU does not. If two paths cross, it is possible for one path to jump to the other, resulting in a repeated solution. When calculating the number of effective solutions in the GPU we removed duplicate solutions. Thus, the GPU calculation provides 85 effective designs in 63 s compared to 455 effective designs in just over 2 h of computation.

While it might be tempting to assume using 5000 threads should be 625 times faster than eight threads, it is important to note that GPU and CPU threads are not the same. The base clock speed for an Nvidia M2000 GPU is 1126 MHz while the clock speed for an Intel Xeon CPU is 2.10 GHz. Additionally, CUDA requires threads to be grouped up into blocks, where each block is processed by an SM, one warp at a time. The current GPU algorithm implementation uses only 25 threads per block, whereas on the Maxwell architecture 1024 threads can be executed simultaneously per block. This results in under-utilization of the GPU. More research into this area should be performed to further increase GPU performance.

Additionally, it is important to note that Bertini is not optimized to be executed on instance of the same problem with different parameters. Bertini as a program has many file input/output operations that invariably slows the computation time down. It is important to mention that there is a program being developed called Paramotopy which is intended to be ran on the same problem with different parameters, but the program was unavailable at the time of this paper.

## 6 Conclusion

This paper presents an algorithm for numerical continuation on a GPU for the solution of the polynomial systems that arise in kinematic synthesis. In order to obtain the benefits of increased computational speed of a GPU the algorithm must run a single instruction set. This lead to a CUDA implementation of Runge–Kutta integration and LU decomposition corrector algorithms for the path tracker for execution on a GPU. In order to eliminate conditional statements that degrade the performance of the GPU, our algorithms use a constant step-size for prediction and a fixed number of iterations for correction. This trades accuracy of the path tracker for speed of computation. This is well-adapted to our application to numerical kinematic synthesis, where a large number of solutions must be generated to find a few effective designs.

A comparison of the performance of our algorithm on 5000 GPU threads with the execution of the software Bertini on eight threads of Lenovo workstation shows a speed up of 120 times. The impact of the trade-off between speed and accuracy can be seen in the fewer number of effective solutions found by the GPU 85 compared with 455 by Bertini. It seems further research is needed to increase speed using a GPU and maintain accuracy.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request. The authors attest that all data for this study are included in the paper. Data provided by a third party are listed in Acknowledgments.

## Acknowledgment

This material is based on work supported by the National Science Foundation under Grant No. 1636017.

## References

*CoRR,*.

**abs/1501.06625**