The issue is that machines today are rather different from machines from 5, 10, or 15 years ago. The biggest cost on modern processors is cache misses (both primary and secondary), and mispredicted branches.
I don't know whether current versions of flex still generate the same inner loops they used to, or if they've been updated to be more modern-machine friendly, but the upshot of my experiments is that they're plenty fast.
I coded a lexer for C (ignoring processing of \-at-end-of-line, and without a preprocessor) using flex, using a flex-like runtime library from my stb.h library, and with a hand-coded state machine. Then I ran them on a semi-synthetic test: the 7500-line stb.h file repeated 1000 times, producing 7.5 MLOC, 250MB. The file was actually 125MB, and each scanner scanned it twice.
|method||time to scan 7.5MLOC||read method||notes|
|flex (default)||13.56 s||fread() 8K||de-escapes strings|
|stb_lex + hashing||4.84 s||fread() whole file||identifiers and keywords stored in a global symbol table|
|stb_lex||4.23 s||fread() whole file|
|flex -F (fast)||3.07 s||fread() 8K||de-escapes strings|
|flex -f (full)||2.92 s||fread() 8K||de-escapes strings|
|handcoded||2.45 s||fread() whole file||computes hash value for identifiers; de-escapes strings|
|handcoded mmap||2.14 s||mmap()||computes hash value for identifiers; de-escapes strings|
|wc||1.73 s||fread() 64K|
All of the scanners ran strtol() or strtod() on numeric constants to include the cost of converting those.
The stb_lex versions have a bug: they don't count newlines inside /* */ comments, so their line numbers are wrong. The two stb_lex versions different only in the hashing of identifiers, so the difference in times (0.6 seconds) is a reasonable estimates for the cost of hashing the identifiers with this hash table in these 7.5MLOC, regardless of the scanner, on this machine (a 2.8Ghz dual-core P4).
The flex versions do not hash identifiers (and keywords are parsed as identifiers). This was just because I didn't bother adding the code; as implied above, presumably the overhead would be similar. Note that the flex documents suggest that -F would be fastest if you're treating keywords as identifiers, and -f would be fastest if you're giving keywords their own lex regexps; since -f is actually fastest, we could probably put per-keyword regexps and avoid some hashing. Note that stb_lex basically uses the same inner loop as flex, it just computes the tables on the fly (and lazily). I'm not sure why it's so much slower.
The handcoded versions use a similar strategy to flex-with-equivalence; each input character is mapped into one of a set of classes, each of whose behavior is equivalent. Then a state machine is updated from this. There are ~20 live states and ~30 classes, so the transition tables are around 600 entries, or 1200 bytes. The state machine is a little more specialized than the flex one. The inner state machine loop updates the line count directly, indexing an increment from the current state; flex requires breaking out of the main loop on every newline. The character classes are premultiplied, avoiding a shift in the inner loop.
Additionally, the handcoded state machine scans any amount of whitespace and comments and then a token before breaking from the inner loop; flex always breaks out of the inner loop after each matched regex. This means the handcoded version will probably avoid around 2 mispredicts per piece of whitespace, per comment, and per line. To do this and still know where each token starts, the inner loop stops at the beginning of strings and identifiers; each of these is then parsed with custom code. The string parser simultaneously de-escapes the string while scanning it; the identifier parser computes a hash value while scanning it. (However, since I didn't make the flex code compute a hash value per identifier, it's kind of pointless. It's probably not much time anyway.)
Because the hardcoded scanner reads the whole file and doesn't change it, it was easy to switch it to use a memory mapped file. As can be seen, this gave better than a 10% performance gain. This isn't possible with flex, because flex writes 0s into the input stream to export tokens to the client in C string format (0-terminated). If flex passed tokens with explicit lengths, flex could probably be changed to work this way. [Reading the entire file (and memory-mapping it) should be reasonable these days, as modern machines should have plenty of memory compared to the size of source files; indeed, just opening this 125MB file in notepad took longer than the default flex scanner took, as did opening it in MSVC 6 and going to the end of the file; so the reality is that people are pretty unlikely to work with files this large.]
wc is included for comparison. It also uses a state machine with equivalence classes for the characters. However, it never exits the inner loop until the end of the file, and the inner loop is unrolled several times. Thus it offers a reasonable lower bound for the time any parser could ever possibly run. I tuned the buffer size to optimize for scanning this particular file (while in the disk cache).
Finally, all of the scanners are probably buggy. They all produced different counts for the number of various types of entries in the file; but since I was mainly interested in performance comparison, not actually parsing C, I didn't bother working very hard to fix it.
Once upon a time, lexical analysis was actually the slowest part of a compiler, simply due to a number-of-things issue; there are far more characters to process then there are tokens to process. These days, given any reasonable amount of optimization or analysis going on inside the compiler, this is probably no longer true. If you're writing your own little language, though, lexing could well be predominant, in which case you can get a big performance gain by making sure you run flex in fast/full mode (-f). The handcoded parser offers some speedup (1.5X in my test), but is probably not worth the engineering effort.
Finally, we can see that a compiler could reasonably do lexical analysis and some simple processing of 250MB of C source/headers in about 3 seconds. So I'm not sure why C++ compilers are so damn slow, even with the excessive-inclusion they tend to have. My current 20KLOC of C (plus headers) completes a full compile with VC6 in about 5 seconds debug, 6 seconds opt; including both source and headers, it's about 1MB of code (although that doesn't account for including the same files multiple times; I don't know of a good way to get an estimate of the total amount of code the compiler sees).