RegExp.prototype.count

# kai zhu (6 years ago)

a common use-case i have is counting newlines in largish (> 200kb) embedded-js files, like this real-world example [1]. ultimately meant for line-number-preservation purposes in auto-lint/auto-prettify tasks (which have been getting slower due to complexity).

would a new RegExp count-method like (/\n/g).count(largeCode) be significantly more efficient than existing largeCode.split("\n").length - 1 or largeCode.replace((/[^\n]+/g), "").length?
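
For reference, a minimal sketch of the two existing approaches mentioned above, run side by side on a small input. The sample string is a hypothetical stand-in for the 200kb+ files in question, and the snippet checks only that both approaches agree, not how fast they are.

```javascript
// Hypothetical sample input; the real inputs in this thread are 200kb+ source files.
const largeCode = "line one\nline two\n\nline four\n";

// Approach 1: splitting on "\n" yields one more piece than there are newlines.
const countBySplit = largeCode.split("\n").length - 1;

// Approach 2: delete every run of non-newline characters, then measure what is left.
const countByReplace = largeCode.replace(/[^\n]+/g, "").length;

console.log(countBySplit, countByReplace);
```

Both approaches allocate an intermediate array or string just to read its length, which is the overhead a dedicated count method would presumably avoid.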

-kai

[1] calculating and reproducing line-number offsets when linting/autofixing files kaizhu256/node-utility2/blob/2018.12.30/lib.jslint.js#L7377

kaizhu256/node-utility2/blob/2018.12.30/lib.jslint.js#L7586

# Isiah Meadows (6 years ago)

If performance is an issue, regular expressions are likely to be too slow to begin with. But you could always do this to count the number of lines in a particular string:

var count = 0
// the g flag makes each test() call resume from re.lastIndex
var re = /\n|\r\n?/g
while (re.test(str)) count++
console.log(count)

Given it's already this easy to iterate something with a regexp, I'm not convinced it's necessary to add this property/method.
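
One way the loop above could be packaged for reuse is sketched below. `countMatches` is a hypothetical helper name, not an existing or proposed built-in. It assumes the caller may pass a regexp without the g flag, so it works on a global copy, and it steps past zero-length matches so the loop always terminates.

```javascript
// Hypothetical helper (not a built-in): counts the matches of `re` in `str`.
function countMatches(re, str) {
    // Work on a global copy so the caller's regexp and its lastIndex are untouched.
    const flags = re.flags.includes("g") ? re.flags : re.flags + "g";
    const globalRe = new RegExp(re.source, flags);
    let count = 0;
    let match;
    while ((match = globalRe.exec(str)) !== null) {
        count += 1;
        // A zero-length match does not advance lastIndex, so step forward manually.
        if (match[0] === "") {
            globalRe.lastIndex += 1;
        }
    }
    return count;
}

// Counts "\n", "\r\n" and a lone "\r" as one line break each.
console.log(countMatches(/\n|\r\n?/, "a\nb\r\nc\rd"));
```

Copying the regexp costs a little on each call, so for a hot loop it may be preferable to build the global regexp once and reset its lastIndex to 0 before each count.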

# kai zhu (6 years ago)

benchmarked @isiah’s while-loop test-case vs str.split vs str.replace for regexp counting on jsperf.com [1], and the results were surprising (for me).

benchmarks using 1mb random ascii-string from fastest to slowest.

  1. (fastest - 1,700 runs/sec) regexp-counting with largeCode.split(/\n/).length - 1
  2. (40% slower - 1,000 runs/sec) regexp-counting with while-loop (/\n/g)
  3. (60% slower - 700 runs/sec) regexp-counting with largeCode.replace((/[^\n]+/g), "").length

looks like the go-to design-pattern for counting-regexp is str.split(<regexp>).length - 1

[1] regexp counting 2 jsperf.com/regexp

# Isiah Meadows (6 years ago)

Nit: you should use .split(/\n/g) to get all parts.

I like the benchmarks here. That's much appreciated, and after further investigation, I found a giant WTF: jsperf.com/regexp-counting-2/8

TL;DR: for string character counting, prefer indexOf.
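
A minimal sketch of what indexOf-based counting might look like. `countSubstring` is a hypothetical name, and the sketch assumes non-overlapping occurrences are wanted; it is not the code from the linked benchmark.

```javascript
// Hypothetical helper: counts non-overlapping occurrences of `needle` in `str`.
function countSubstring(str, needle) {
    // An empty needle matches at every index; return 0 rather than loop forever.
    if (needle === "") {
        return 0;
    }
    let count = 0;
    let index = str.indexOf(needle);
    while (index !== -1) {
        count += 1;
        // Resume the search just past the occurrence that was found.
        index = str.indexOf(needle, index + needle.length);
    }
    return count;
}

console.log(countSubstring("a\nb\nc\n", "\n"));
```

Unlike the split and replace approaches, this allocates no intermediate array or string, which is one plausible reason for the difference seen in the benchmark.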

For similar reasons to that JSPerf thing, I'd like it to be on the String prototype rather than the RegExp prototype, as in str.count(/\n/).


Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

# kai zhu (6 years ago)

+1 for string.count

i don’t think the g-flag is necessary in str.split, so the original performance claims are still valid:

  • for counting regexp - use split + length
  • for counting substring - use while + indexOf
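
The point about the g-flag can be checked quickly: split with a regexp separator splits at every match whether or not the flag is set. The sample string below is a hypothetical input.

```javascript
// Hypothetical sample input with three newlines.
const sample = "one\ntwo\nthree\n";

// split() with a regexp separator splits at every match, with or without the g flag.
const withoutG = sample.split(/\n/).length - 1;
const withG = sample.split(/\n/g).length - 1;

console.log(withoutG, withG);
```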