This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Provide a single, default, performant and stable regex implementation #615

Closed
ajnavarro opened this issue Feb 8, 2019 · 9 comments
Labels
enhancement New feature or request performance Performance improvements

Comments

@ajnavarro
Contributor

Find a good regex implementation to use as the default. Use of Go's standard regexp package should be configurable through session variables.
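One way to picture the "default engine, configurable per session" idea is a small registry keyed by engine name. This is only a sketch: the `Matcher` interface, the `engines` map, `DefaultEngine`, and `NewMatcher` are hypothetical names introduced for illustration, not part of this project, and only Go's standard `regexp` package is registered.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a hypothetical minimal interface every engine would satisfy.
type Matcher interface {
	MatchString(s string) bool
}

// Constructor compiles a pattern with one specific engine.
type Constructor func(pattern string) (Matcher, error)

// engines maps an engine name to its constructor. Only the standard library
// engine is registered here; a cgo-backed engine would be added the same way.
var engines = map[string]Constructor{
	"go": func(pattern string) (Matcher, error) {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return nil, err
		}
		return re, nil
	},
}

// DefaultEngine is the name used when the session does not pick one.
const DefaultEngine = "go"

// NewMatcher compiles pattern with the engine named by a (hypothetical)
// session variable value, falling back to DefaultEngine when it is empty.
func NewMatcher(engine, pattern string) (Matcher, error) {
	if engine == "" {
		engine = DefaultEngine
	}
	c, ok := engines[engine]
	if !ok {
		return nil, fmt.Errorf("unknown regex engine %q", engine)
	}
	return c(pattern)
}

func main() {
	m, err := NewMatcher("", `^foo.*bar$`)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.MatchString("foo-bar"))
}
```

With this shape, switching the default is a one-line change, and an unknown engine name surfaces as an error instead of a silent fallback.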

@ajnavarro ajnavarro added enhancement New feature or request performance Performance improvements labels Feb 8, 2019
@ajnavarro ajnavarro added this to the OKR-2019-Q1-P2 milestone Feb 8, 2019
@kuba-- kuba-- self-assigned this Feb 11, 2019
@kuba--
Contributor

kuba-- commented Feb 11, 2019

As far as I can tell from my investigation, if we want better performance than the standard regexp library, we cannot avoid cgo.
So, how about cgo bindings for https://github.com/intel/hyperscan?

| Dependency | Version | Notes |
|------------|---------|-------|
| CMake | >=2.8.11 | |
| Ragel | 6.9 | |
| Python | 2.7 | |
| Boost | >=1.57 | Boost headers required |
| Pcap | >=0.8 | Optional: needed for example code only |

Sorry, but with this list of build dependencies there is no way we can use hyperscan (IMO)!

@ajnavarro
Contributor Author

Maybe: https://github.com/linyows/go-onigmo ?

@kuba--
Contributor

kuba-- commented Feb 12, 2019

Actually, it turned out that installing hyperscan is not that hard. So far the only catch is that version 5 or later is required; version 4 doesn't work with the Go wrapper:

Debian

echo "deb http://ftp.de.debian.org/debian sid main" >> "/etc/apt/sources.list"
apt-get install libhyperscan5 libhyperscan-dev

OSX

brew install hyperscan

After that, the Go bindings can be fetched:

go get github.com/flier/gohs

@creachadair

creachadair commented Feb 12, 2019

If you care about regexp performance you should probably limit yourself to engines that do not support non-regular language features like backreferences, unbounded lookahead, and so on. If Go's native package isn't fast enough, I'd suggest considering RE2 (on which it's based). You would indeed need to use cgo or SWIG for that, but if performance counts that may be worth it.
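As a quick illustration of that restriction, Go's `regexp` package (which uses RE2 syntax) refuses to compile backreferences and lookahead. The helper name `Supported` is made up for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// Supported reports whether Go's regexp package accepts the pattern.
func Supported(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(a+)b`,  // plain capture group: regular
		`(a+)\1`, // backreference: not regular
		`a(?=b)`, // lookahead: not in RE2 syntax
	}
	for _, p := range patterns {
		fmt.Printf("%-10q supported=%v\n", p, Supported(p))
	}
}
```

Patterns rejected here would need a backtracking engine, which is the trade-off the comment above describes.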

Is there a particular benchmark you're trying to improve upon?

@kuba--
Contributor

kuba-- commented Feb 12, 2019

We considered many libraries. Recently we integrated oniguruma, but it's not maintainable anymore. So now, I want to play with Intel's hyperscan which looks even faster than Google's re2

@creachadair

> We considered many libraries. Recently we integrated oniguruma, but it's not maintainable anymore. So now, I want to play with Intel's hyperscan which looks even faster than Google's re2

That's why I was curious about what benchmarks you're using: If your queries have a very particular structure, or are constrained in some way, it's often possible to do better with one engine than another.

@kuba--
Contributor

kuba-- commented Feb 12, 2019

Regex is mainly used for LIKE queries, but basically our case is similar to enry (see: src-d/enry#167).
We need something that is thread-safe (for oniguruma we had to use a pool of matchers). From an API point of view, the ideal would be something that lets us match independent patterns (as simply as Go's standard library does) without caching and/or managing matchers.
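For reference, the "pool of matchers" workaround can be sketched with `sync.Pool`. The `statefulMatcher` type below is a hypothetical stand-in for an engine whose matcher objects must not be shared between goroutines; it is not the oniguruma binding, and `MatcherPool` and `NewMatcherPool` are names invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// statefulMatcher is a hypothetical stand-in for a matcher that keeps
// per-call scratch state and therefore must not be used concurrently.
type statefulMatcher struct {
	re      *regexp.Regexp // safe to share; only scratch is per-matcher
	scratch []byte
}

func (m *statefulMatcher) Match(s string) bool {
	m.scratch = append(m.scratch[:0], s...) // mutable state reused across calls
	return m.re.Match(m.scratch)
}

// MatcherPool hands out one matcher per caller for a single pattern.
type MatcherPool struct {
	pool sync.Pool
}

// NewMatcherPool compiles the pattern once and creates matchers on demand.
func NewMatcherPool(pattern string) (*MatcherPool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	p := &MatcherPool{}
	p.pool.New = func() interface{} { return &statefulMatcher{re: re} }
	return p, nil
}

// Match borrows a matcher, uses it, and returns it to the pool.
func (p *MatcherPool) Match(s string) bool {
	m := p.pool.Get().(*statefulMatcher)
	defer p.pool.Put(m)
	return m.Match(s)
}

func main() {
	p, err := NewMatcherPool(`^src-d/.+$`)
	if err != nil {
		panic(err)
	}
	inputs := []string{"src-d/enry", "enry", "src-d/gitbase", "src-d/"}
	results := make([]bool, len(inputs))
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		go func(i int, in string) {
			defer wg.Done()
			results[i] = p.Match(in) // each goroutine gets its own matcher
		}(i, in)
	}
	wg.Wait()
	fmt.Println(results)
}
```

By contrast, a compiled `*regexp.Regexp` from the standard library is documented as safe for concurrent use, so it can be shared directly without this bookkeeping.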

@creachadair

I think the LIKE grammar is pretty constrained, which probably helps quite a bit. No captures, no back-references, etc. I think the difficulty is likely to be that the best-performing matchers compile the input pattern down into a state machine (either explicit or bytecode), and that compilation step is relatively expensive. I don't think you're likely to find a performant in-place matcher, though if you know the patterns are restricted enough that compile-time is low you might be able to skip caching.
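To make the "constrained grammar plus cached compilation" point concrete, here is a minimal sketch that translates a LIKE pattern into an anchored Go regexp and memoizes the compiled result. `LikeToRegex` and `CompileLike` are hypothetical helpers; the sketch assumes case-sensitive matching and ignores LIKE escape characters, which a real implementation would have to handle.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

// LikeToRegex translates a SQL LIKE pattern into an anchored regexp:
// % matches any run of characters, _ matches exactly one character,
// and everything else is treated as a literal.
func LikeToRegex(like string) string {
	var b strings.Builder
	b.WriteString(`(?s)^`) // (?s) lets . match newlines too
	for _, r := range like {
		switch r {
		case '%':
			b.WriteString(`.*`)
		case '_':
			b.WriteString(`.`)
		default:
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	b.WriteString(`$`)
	return b.String()
}

// cache maps a LIKE pattern to its compiled *regexp.Regexp.
var cache sync.Map

// CompileLike returns a compiled matcher for the LIKE pattern, compiling it
// at most a few times under contention and reusing it afterwards.
func CompileLike(like string) (*regexp.Regexp, error) {
	if re, ok := cache.Load(like); ok {
		return re.(*regexp.Regexp), nil
	}
	re, err := regexp.Compile(LikeToRegex(like))
	if err != nil {
		return nil, err
	}
	actual, _ := cache.LoadOrStore(like, re)
	return actual.(*regexp.Regexp), nil
}

func main() {
	re, err := CompileLike("%.go")
	if err != nil {
		panic(err)
	}
	fmt.Println(LikeToRegex("%.go"), re.MatchString("main.go"))
}
```

Because the translated patterns contain no captures or backreferences, they stay within what an automaton-based engine handles, and the cache amortizes the compile cost across rows.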

That said: Having gone down this path before, I recommend being skeptical of general-purpose regexp benchmarks. Matchers are incredibly sensitive to workload, so if you have (or can produce) a workload that looks like your production use cases, that will give you a better evaluation.

@ajnavarro
Contributor Author

Keep in mind that we also have the REGEXP expression.


3 participants