Python: os.walk vs linux find

AKA: Why Python needs a find module

I found myself needing to scan and modify a bunch of files (300k) for a really large project. The modifications were too complex for a Perl or sed one-liner, so I figured Python would be good for quickly coding up all the logic. But as I iterated on my script, I found it very slow at traversing the project’s files. I had been using find on the command line to check which files I expected to see changes in, and I noticed it was pretty fast, more than 10x faster than Python, so of course I was like WTF? Long story short, find is very fast, and well, os.walk is not. Below are some performance comparisons of different ways of scanning the files in Python vs. find.

Test Setup

All my testing was done with the same directory of files, and I ran each test 4 times, discarding the first run. In other words, the first run warmed the filesystem caches and I recorded the next 3 runs. I used the following script to measure the elapsed time:

> cat timeit 
#!/bin/sh
START=`realtime`
"$@"
STOP=`realtime`
ELAPSED=`echo "$STOP - $START" | bc`
echo "Elapsed Time: $ELAPSED" >&2

And the realtime executable is simply this c program:

> gcc -o realtime realtime.c 
#include <stdio.h>
#include <sys/time.h>
int main(void) {
    struct timeval time_now;
    gettimeofday(&time_now, NULL);
    /* Zero-pad the microseconds so that 5 us prints as .000005, not .5 */
    printf("%ld.%06ld\n", (long)time_now.tv_sec, (long)time_now.tv_usec);
    return 0;
}
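
The same measurement can also be sketched in Python, without the helper executable. This is a minimal sketch, assuming Python 3 (the scripts in this post are Python 2); time_command is a hypothetical name, and it mirrors the setup above by running a command 4 times and discarding the first run.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=4, discard=1):
    """Run cmd `runs` times and return wall-clock seconds for the kept runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Throw away the command's stdout, like ">/dev/null" in the tables.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    # The first `discard` runs only serve to warm the filesystem caches.
    return timings[discard:]


if __name__ == "__main__":
    # Example: time a trivial Python invocation three times after one warm-up.
    print(time_command([sys.executable, "-c", "pass"]))
```

Unlike the shell wrapper, this includes the cost of spawning the process from Python, so the numbers are only comparable with each other, not with the tables in this post.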

Matching Files in Python

As I noted above, scanning files is pretty slow in Python. Here is a quick performance comparison of the different approaches I tried:

Using the re module:

> cat re_walk.py 
#!/usr/bin/python
import os
import re
pattern = re.compile(r'.*\.png')
for root, dirs, files in os.walk(os.getcwd()):
    for f in files:
        if pattern.match(f):
            print os.path.join(root,f)

Using the builtin string functions:

> cat endswith_walk.py
#!/usr/bin/python
import os
for root, dirs, files in os.walk(os.getcwd()):
    for f in files:
        if f.endswith(".png"):
            print os.path.join(root,f)

Using the slice operator:

> cat slice_walk.py
#!/usr/bin/python
import os
for root, dirs, files in os.walk(os.getcwd()):
    for f in files:
        if f[-4:] == ".png":
            print os.path.join(root,f)

Summary

 command                                     run1 (s)  run2 (s)  run3 (s)
 timeit find ./ -name "*.png" >/dev/null         1.58      1.48      1.48
 timeit re_walk.py >/dev/null                   21.33     22.10     21.41
 timeit endswith_walk.py >/dev/null             22.31     21.39     19.87
 timeit slice_walk.py >/dev/null                22.15     21.24     20.87

Traversing Files

The previous results don’t tell us much beyond the fact that Python is slow: all three matching approaches take about the same time. So this test simply measures how many seconds it takes to print every file in the project, with no matching at all.

 command                      run1 (s)  run2 (s)  run3 (s)
 timeit find ./ >/dev/null        1.61      0.80      1.26
 timeit walk.py >/dev/null       20.69     20.79     20.85

> cat walk.py
#!/usr/bin/python
import os
for root, dirs, files in os.walk(os.getcwd()):
    for f in files:
        print os.path.join(root,f)
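
A likely contributor to the gap is that the Python 2 os.walk used here lists each directory with os.listdir and then calls os.path.isdir on every entry, which costs an extra stat per file. As a minimal sketch, assuming Python 3.6 or later (so not the interpreter benchmarked above), a traversal built on os.scandir can avoid most of those calls. iter_files is a hypothetical helper name, and no timings are claimed for it here.

```python
import os


def iter_files(top):
    """Yield the path of every non-directory entry under top."""
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # is_dir() can usually be answered from the directory
                    # listing itself, without a separate stat() call.
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        yield entry.path
        except OSError:
            # Skip directories that cannot be read, as os.walk does by default.
            continue


if __name__ == "__main__":
    for path in iter_files(os.getcwd()):
        print(path)
```

Since Python 3.5, os.walk itself is built on os.scandir, so the plain os.walk loop gets the same benefit on a modern interpreter.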

Matching Strings in Python

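Let’s do the same comparison, but without os.walk in the mix. Each script below imports a files module whose files attribute holds the list of file names, so no directory traversal happens: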

Using the re module:

> cat re_match.py
#!/usr/bin/python
import re
import files
pattern = re.compile(r'.*\.png')
for f in files.files:
    if pattern.match(f):
        print f

Using the builtin string functions:

> cat endswith_match.py 
#!/usr/bin/python
import files
for f in files.files:
    if f.endswith(".png"):
        print f

Using the slice operator:

> cat slice_match.py 
#!/usr/bin/python
import files
for f in files.files:
    if f[-4:] == ".png":
        print f

Summary

 command                               run1 (s)  run2 (s)  run3 (s)
 timeit re_match.py >/dev/null            0.390     0.377     0.390
 timeit endswith_match.py >/dev/null      0.276     0.271     0.277
 timeit slice_match.py >/dev/null         0.234     0.231     0.242
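
The three matchers can also be compared in a single process with the standard timeit module, which leaves interpreter start-up and the import of the file list out of the measurement. This is a minimal sketch, assuming Python 3; the function names and the compare helper are hypothetical, and the name list is whatever you pass in.

```python
import re
import timeit

# Same pattern as re_match.py; note that it is not anchored at the end.
PNG_RE = re.compile(r'.*\.png')


def match_re(names):
    return [n for n in names if PNG_RE.match(n)]


def match_endswith(names):
    return [n for n in names if n.endswith(".png")]


def match_slice(names):
    return [n for n in names if n[-4:] == ".png"]


def compare(names, repeat=3, number=10):
    """Return {matcher name: best seconds per call} for each matcher."""
    results = {}
    for func in (match_re, match_endswith, match_slice):
        timings = timeit.repeat(lambda: func(names), repeat=repeat, number=number)
        results[func.__name__] = min(timings) / number
    return results


if __name__ == "__main__":
    sample = ["img/a.png", "notes.txt", "b.png.bak"] * 1000
    for name, seconds in compare(sample).items():
        print(name, seconds)
```

One behavioral difference worth knowing: because the regular expression is not anchored with $, it also accepts a name such as b.png.bak, which endswith and the slice comparison reject.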

Final Thoughts

We now have a real, measurable difference between Python’s string matching options, but they are all pretty close, and next to the cost of os.walk the difference is negligible. I’m not sure it matters too much which one you use. The real question I have is: when are we going to get a “find” module for Python? Something like the following would be great:

'''find module'''
def search(path, expression, callback=None):
    '''
    searches the given path for the regular expression passed in.
    @param path: path to search
    @param expression: regular expression to search with
    @param callback: (optional) callback to be executed every time a file is found.
    The callback should accept a filename as the only argument
    @return: iterator for all the matched files
    '''

# Usage:
import find
for f in find.search("~/Music", r".*\.mp3"):
    print "Mp3:", f
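
For illustration, the proposed interface can be sketched in pure Python on top of os.walk. This is only a sketch of the signature above, assuming Python 3; it does nothing for speed, since it still pays the full cost of the traversal. Two details are assumptions, because the docstring leaves them open: the expression is matched against the bare file name, and it must match the whole name.

```python
import os
import re


def search(path, expression, callback=None):
    """Yield files under path whose name matches the regular expression."""
    pattern = re.compile(expression)
    # Expand "~" so that a path such as "~/Music" works as in the usage above.
    top = os.path.expanduser(path)
    for root, _dirs, files in os.walk(top):
        for name in files:
            # Assumption: match the file name only, anchored at both ends.
            if pattern.fullmatch(name):
                full = os.path.join(root, name)
                if callback is not None:
                    callback(full)
                yield full


if __name__ == "__main__":
    for f in search(".", r".*\.py"):
        print("Python file:", f)
```

Because search is a generator, the callback only fires as the caller iterates over the results.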