GPU acceleration of pattern matching algorithms

Szántó Péter
Department of Measurement and Information Systems

Pattern matching plays an important role in computer science. The rapidly increasing network bandwidth and the growing amount of information that must be evaluated require ever higher computational throughput in network intrusion detection systems, virus scanners, general-purpose search systems, and tools that locate nucleotide and amino acid sequence patterns in biological databases.

GPGPU (general-purpose computing on graphics processing units) provides a unique way of matching single and multiple patterns. This approach allows the enormous computational capacity of GPUs, which would otherwise sit idle in PCs, to be utilized while the overloaded CPU is relieved. Furthermore, GPU-based solutions outperform ASIC- and FPGA-based ones in terms of reconfigurability.

In the past decade, numerous implementations written in CUDA C and optimized for NVIDIA products have been published. Each of them is an embarrassingly parallel approach; that is, the scalar processors execute one of the existing sequential algorithms, or a slightly modified version of it, on distinct, independent substrings, eliminating the need for inter-thread communication.
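To illustrate this common scheme (a hypothetical Python sketch, not the thesis implementation): the text is partitioned into chunks, each loop iteration models one thread running a plain sequential search on its chunk, and chunks overlap by one less than the pattern length so that matches crossing a chunk boundary are not lost.

```python
def chunked_search(text, pattern, n_chunks=4):
    # Embarrassingly parallel pattern matching, simulated sequentially:
    # each "thread" searches an independent chunk with a sequential
    # algorithm (str.find stands in for e.g. KMP or Boyer-Moore).
    m = len(pattern)
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    hits = set()
    for t in range(n_chunks):                 # one iteration per "thread"
        lo = t * size
        hi = min(len(text), lo + size + m - 1)  # overlap covers boundaries
        i = text.find(pattern, lo, hi)
        while i != -1:
            hits.add(i)
            i = text.find(pattern, i + 1, hi)
    return sorted(hits)
```

Because the chunks are independent, no inter-thread communication is needed; the overlap of `m - 1` characters is the only coupling between neighbouring chunks.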

In 1986, Hillis and Steele proposed a scan-based lexical analysis in which function composition serves as the binary associative operator and the columns of the deterministic finite automaton's state table are the operands of the scan primitive. The work complexity of this solution is the string length multiplied by the number of states, which by far exceeds the work complexity of the efficient sequential algorithms; its direct implementation is therefore not recommended.
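The idea can be sketched as follows: each input character selects one column of the state table, i.e. a function from states to states, and an inclusive Hillis–Steele scan under function composition yields, at every position, the state reached from the start state. The Python below (my own illustration, not code from the thesis) simulates the log-step scan sequentially; the `O(n * |Q|)` cost of each composition is exactly the extra state-count factor in the work complexity.

```python
def compose(f, g):
    # Function composition as the scan operator: apply f first, then g.
    # f and g are columns of the state table, i.e. lists state -> state.
    return [g[q] for q in f]

def scan_match(delta, start, accepting, text):
    # delta: dict mapping each character to a state-table column.
    funcs = [delta[c] for c in text]
    n, step = len(funcs), 1
    while step < n:  # Hillis-Steele inclusive scan, log2(n) passes
        new = funcs[:]
        for i in range(step, n):  # on a GPU, this loop runs in parallel
            new[i] = compose(funcs[i - step], funcs[i])
        funcs, step = new, step * 2
    # funcs[i] now maps the start state to the state after text[:i+1];
    # report every position where that state is accepting.
    return [i for i, f in enumerate(funcs) if f[start] in accepting]
```

As a usage example (an assumed toy DFA, not from the thesis), the automaton for the pattern "ab" over the alphabet {a, b} has states 0–2 with 2 accepting: `delta = {'a': [1, 1, 1], 'b': [0, 2, 0]}`, and `scan_match(delta, 0, {2}, "abab")` reports the positions where an occurrence ends.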

In my thesis, I present a modification of the original Hillis–Steele algorithm [1] which is capable of finding all occurrences of one or more patterns in a text, but whose execution time is usually sublinear in the number of states. My kernels are optimized for the NVIDIA Fermi architecture.

Implementing the scan primitives optimized for the GPU's global memory proved to be a great challenge, since the existing publications (Horn [2], Sengupta et al. [3]) describe solutions using shared memory, which has a completely different structure. I measured the execution time of the kernels on strings with various parameters and evaluated the results from a practical point of view. Finally, I suggest some directions for future development at the end of the summary section.

[1] W. D. Hillis and G. L. Steele, “Data parallel algorithms,” Communications of the ACM, vol. 29, no. 12, pp. 1170–1183, Dec. 1986.

[2] D. Horn, “Stream reduction operations for GPGPU applications,” in GPU Gems 2, ch. 36, pp. 573–589, 2005.

[3] S. Sengupta, J. D. Owens, and A. E. Lefohn, “A work-efficient step-efficient prefix sum algorithm,” pp. 1–2, 2006.

