Micro-benchmarking PHP stream includes from a database vs standard includes

During the Drupal plugin/update manager discussions I had an aha moment. One of those weird and wonderful ideas came back to me: what if most of the code lived in the database? One would be able to arrange the co-habitation of several concurrent versions of the same website relatively easily. Backups would simply mean database backups.

Funnily enough, this can help two opposite (scale-wise) types of users: the bottom end on the cheapest or free hosting, and the load-balanced crowd.

Why "back"? Well... I had this idea ever since the user streams appeared in php, version 4.3 or there abouts, but it just nestled cosily in the back of my mind, waiting for love, the shy little thing.

The problem

OK, so what is this about? Since PHP allows you to write stream wrappers, and include* and require* can use arbitrary streams to load code, one should be able to put the code in a database, load it and execute it. The biggest obvious downside is that it is probably slow. How much slower?
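To make that concrete, here is a minimal sketch of the idea (the db:// scheme and the DbCodeStream class are made up for illustration, not part of the benchmark):

<?php
// Register a user-land class as the handler for the db:// scheme.
// DbCodeStream would fetch the file contents from a database.
stream_wrapper_register('db', 'DbCodeStream');

// From here on, include parses and executes whatever the wrapper
// returns, exactly as if it had come from disk.
include 'db://modules/foo.php';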

I decided to benchmark it. I've prepared a micro-benchmark to test the idea and to see how significant the difference in performance would be. One should note that, since this is mostly an I/O-bound task, the difference in performance will mostly show up as higher response times rather than CPU load. Bear in mind that the benchmarks were performed on a tiny Acer Aspire One netbook with 512MB RAM and its standard SSD drive.

The benchmark

I've prepared three small programs. The first just includes 20 PHP files. The second includes the same 20 PHP files, but also contains the streams code, to have a similar parsing-time profile. The third includes the same code from SQLite3 via streams. The files are attached to this post; if you want to run them yourself, just rename them and assign the appropriate permissions.

I've used the Criterion Haskell library to gather and process the statistics for me and to draw the nice plots below.

The Haskell program is simple. It just declares and executes the three benchmarks:


import Criterion.Main (defaultMain, bench, bgroup)
import System.Cmd (system)

-- Each benchmark just shells out to one of the three PHP scripts;
-- Criterion handles the sampling and the statistics.
-- (With recent Criterion versions, wrap each action in whnfIO.)
main = defaultMain
    [ bgroup "php includes"
        [ bench "standard/clean" $ system "./clean.php"
        , bench "standard/mixed" $ system "./non-stream1.php"
        , bench "streams"        $ system "./stream1.php"
        ]
    ]

To compile, use:


ghc --make bench

The streams

I've written a barebones TestStream class adhering to the streams API, registered it as a stream wrapper, and ran include_once 20 times. Each included file contains one print statement à la hello world.
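The attached stream1.php is the authoritative version; purely as illustration, a barebones class along these lines might look like the sketch below. The files(name, body) table in code.sqlite is an assumed layout.

<?php
// Barebones stream wrapper: just enough of the streams API for include to work.
class TestStream {
    private $body;     // code fetched from the database
    private $pos = 0;  // current read offset

    public function stream_open($path, $mode, $options, &$opened_path) {
        // test://file1 -> "file1"
        $name = parse_url($path, PHP_URL_HOST);
        $db = new SQLite3('code.sqlite');
        $stmt = $db->prepare('SELECT body FROM files WHERE name = :name');
        $stmt->bindValue(':name', $name, SQLITE3_TEXT);
        $row = $stmt->execute()->fetchArray(SQLITE3_ASSOC);
        if ($row === false) {
            return false;
        }
        $this->body = $row['body'];
        return true;
    }

    public function stream_read($count) {
        $chunk = (string) substr($this->body, $this->pos, $count);
        $this->pos += strlen($chunk);
        return $chunk;
    }

    public function stream_eof() {
        return $this->pos >= strlen($this->body);
    }

    public function stream_stat() {
        return array('size' => strlen($this->body));
    }

    // Some PHP versions also consult url_stat() for include_once.
    public function url_stat($path, $flags) {
        return array();
    }
}

stream_wrapper_register('test', 'TestStream');

// Include the 20 "hello world" files through the wrapper.
for ($i = 1; $i <= 20; $i++) {
    include_once "test://file$i";
}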

The non-stream versions

The base case "standard/clean" just includes the 20 files. The "standard/mixed" version includes the same 20 files, but also carries a useless copy of the TestStream class to bulk up the code, so we can judge the significance of the parsing overhead.
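For reference, at 91 bytes the attached clean.php presumably looks something like this sketch (the file names are assumptions):

#!/usr/bin/php
<?php
// Base case: include 20 ordinary files straight from disk.
for ($i = 1; $i <= 20; $i++) {
    include_once "file$i.php";
}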

The benchmark results

Standard clean


benchmarking php includes/standard/clean
collecting 100 samples, 2 iterations each, in estimated 12.13241 s
bootstrapping with 100000 resamples
mean: 58.12652 ms, lb 57.14786 ms, ub 60.15813 ms, ci 0.950
std dev: 6.912029 ms, lb 4.108045 ms, ub 13.29588 ms, ci 0.950
found 6 outliers among 100 samples (6.0%)
  2 (2.0%) high mild
  4 (4.0%) high severe
variance introduced by outliers: 1.000%
variance is unaffected by outliers


Standard mixed


benchmarking php includes/standard/mixed
collecting 100 samples, 2 iterations each, in estimated 11.08999 s
bootstrapping with 100000 resamples
mean: 58.86753 ms, lb 57.81748 ms, ub 60.82246 ms, ci 0.950
std dev: 7.118014 ms, lb 4.625828 ms, ub 12.58350 ms, ci 0.950
found 8 outliers among 100 samples (8.0%)
  5 (5.0%) high mild
  3 (3.0%) high severe
variance introduced by outliers: 1.000%
variance is unaffected by outliers


Streams


benchmarking php includes/streams
collecting 100 samples, 2 iterations each, in estimated 14.42270 s
bootstrapping with 100000 resamples
mean: 76.48482 ms, lb 74.66795 ms, ub 78.86988 ms, ci 0.950
std dev: 10.60164 ms, lb 8.515426 ms, ub 13.80536 ms, ci 0.950
found 8 outliers among 100 samples (8.0%)
  7 (7.0%) high mild
  1 (1.0%) high severe
variance introduced by outliers: 1.000%
variance is unaffected by outliers


Conclusions

As expected, the streams code is slower: it adds around 1 ms per included file ((76.5 ms - 58.1 ms) / 20 files ≈ 0.9 ms). If you compare the probability density estimates, you will see that there is a small, albeit probably insignificant, overlap between the standard and stream versions. The results suggest that in larger programs the effect will be far less significant. The results are encouraging. This technique definitely merits further investigation: run it with MySQL, the most widespread database deployed alongside PHP, and, if time permits, against a patched version of Drupal.

Attachments:

clean.php_.txt (91 bytes)
non-stream1.php_.txt (1.49 KB)
stream1.php_.txt (1.55 KB)

For some context

Vlado and I were discussing using stream handlers as a mechanism for the templating system in Drupal.

If you could include("phptemplate://template_name"), the templates could be stored inside the database, or anywhere really.

that is one of the possibilities

Yes, that is one of the possibilities - it will allow in-place template editing, which is quite cool really.

Although I was thinking in other terms. For load-balanced and highly-available systems you need to maintain the code in only one place. This eliminates the code phase difference between machines - everything is in the database. Upgrade in just one place and leave the replication to the database.

For low-end hosting, you can avoid doing the FTP round trip.

Using special protocols/streams, you can achieve a lot of funky stuff - like code generation, PHP sand-boxing, etc... Dream baby sweet dreams :)

clever

With a stream wrapper you avoid the previously existing problem that code in a database required eval(), which is terribly slow. Worth investigating.

The best thing about this is

The best thing about this is the proper analysis of statistical errors. :p

The method as such also deserves further investigation.

However, I am actually not that thrilled about mixing the DB with the code.

APC?

Did your webserver have an opcode cache like APC?

Because this method will not benefit from the APC caching, unless you would use the APC variable cache to put the code in memory.

Another benefit of this would be that you do not need to have write access to the file system (a big benefit for the plugin manager in Drupal 7).

oops, only read part 2 so

Oops, I only read part 2, so I didn't see that the reason actually was the plugin manager ;-)

APC et al

I think this method will not benefit from APC, although there might be workarounds. I haven't studied the apc_compile_file() function properly, nor whether it accepts streams as opposed to plain files. The PHP documentation just avoids mentioning it, since streams are fairly obscure.

Yes, you don't need write access to the file system. For that matter, this method allows you to have a 'code server', so that you only have to update your code there and reuse it on multiple machines or sites.
