Ian Rolfe's Journal.

Random Jibberings on Programming

Python: Removing blank lines from a string.
I've been using the Django template system to generate XML and CSV files for a project I'm working on, and all is fine. One "cosmetic" issue is all the blank lines that get produced, so I thought I'd just strip them out.
My first thought was actually to do a list comprehension, splitting the string into a list of lines and re-assembling it without the blank lines. That, in my opinion, is the most "Pythonic" way of doing it. If I were using a language like C or BASIC I'd just search-and-replace double '\n's until no more could be found, but in Python split() and join() is the way most people would do it, if my (admittedly hurried) Google search is anything to go by.
As a result of my search, I determined that there were basically 4 methods recommended by the peanut gallery:
import re

def method1(txt):
    # Build the result line by line, skipping whitespace-only lines
    ret = ''
    for l in txt.split("\n"):
        if l.strip()!='':
            ret += l + "\n"
    return ret

def method2(txt):
    return '\n'.join([x for x in txt.split("\n") if x.strip()!=''])

def method3(txt):
    # Collapse doubled newlines until none remain
    while '\n\n' in txt:
        txt = txt.replace('\n\n', '\n')
    return txt

def method4(txt):
    return re.sub(r"\n\s*\n*", "\n", txt)

Of these methods, 2 and 4 are the easiest to include inline in your code. Method 1 may well be just a matter of putting "if l.strip()=='': continue" in your existing program logic, but as functions, method 2 looks best to me. I then considered the performance: surely all that search-and-replace was going to be faster than the list comprehension? Then again, list comprehensions are often pushed by Pythonistas as more efficient than looping, so maybe that isn't the case. I therefore decided to use the timeit module to check this out:
text = """`Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe:
All mimsy were the borogoves,
  And the mome raths outgrabe.

"Beware the Jabberwock, my son!
  The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
  The frumious Bandersnatch!"

He took his vorpal sword in hand:
  Long time the manxome foe he sought --
So rested he by the Tumtum tree,
  And stood awhile in thought.

And, as in uffish thought he stood,
  The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
  And burbled as it came!

One, two! One, two! And through and through
  The vorpal blade went snicker-snack!
He left it dead, and with its head
  He went galumphing back.

"And, has thou slain the Jabberwock?
  Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!'
  He chortled in his joy.

`Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe;
All mimsy were the borogoves,
  And the mome raths outgrabe."""

if __name__=='__main__':
    from timeit import Timer

    n = 10000  # number of timed passes per method
    print "For",n,"iterations,"
    t = Timer("method1(text)", "from __main__ import method1, text")
    print "Method1 =",1e6*t.timeit(number=n)/n,"uSec/pass"
    t = Timer("method2(text)", "from __main__ import method2, text")
    print "Method2 =",1e6*t.timeit(number=n)/n,"uSec/pass"
    t = Timer("method3(text)", "from __main__ import method3, text")
    print "Method3 =",1e6*t.timeit(number=n)/n,"uSec/pass"
    t = Timer("method4(text)", "from __main__ import method4, text")
    print "Method4 =",1e6*t.timeit(number=n)/n,"uSec/pass"

    For 10000 iterations,
    Method1 = 19.5718990692 uSec/pass
    Method2 = 15.6529871731 uSec/pass
    Method3 = 8.10906783357 uSec/pass
    Method4 = 18.5611668997 uSec/pass
Wow! The old-fashioned replace-in-a-loop is substantially faster than the new-fangled list comprehensions and regular expressions!
I guess that's not entirely unexpected, especially in this rather restricted test, but it's still interesting nonetheless. That said, the actual time (8-20 us) does rather illustrate how fast modern PCs and languages are. It was only in the '80s that 10 us was about the time it took for a microprocessor to add two 16-bit numbers!


Python: Removing blank lines from a string

I did the same Google search. I was fooling around with Python. It's always good to start learning something new by trying to solve some practical problem. Seems like removing blank lines should be easy to do in a scripting language.

It's not. And your choices don't exactly work correctly.

What's a blank line? How about repeated newlines? Yes. But how about a line with some combination of blanks, tabs, and other whitespace and *only* whitespace?

That sure *looks* like a blank line, and that's what I want to remove, also. I call that a blank line, because it looks like one.
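
To make that definition concrete, here is a small sketch. The helper name `is_blank` and the sample string are made up for illustration; the idea is simply that a line counts as blank when nothing is left after stripping whitespace. It also shows why replacing doubled newlines alone misses the whitespace-only case.

```python
def is_blank(line):
    # Hypothetical helper: True for empty lines and for lines that
    # contain only spaces, tabs, or other whitespace.
    return line.strip() == ''

# Illustrative sample: one truly empty line, one whitespace-only line.
sample = "one\n\n \t \ntwo"

# Classify each line under the "looks blank" definition.
flags = [is_blank(line) for line in sample.split("\n")]

# Replacing doubled newlines only removes the truly empty line;
# the whitespace-only line is still there afterwards.
collapsed = sample.replace("\n\n", "\n")
```

Under this definition both middle lines are blank, but only the first of them disappears when doubled newlines are collapsed.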

However, what's NOT? How about *leading* spaces?

Take, for example,
return re.sub("\n\s*\n*", "\n", txt)

This doesn't do what you want, if you actually test it. It removes leading white space. It messes up the indentation of your Jabberwocky poem.
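
A short check, using two lines of the poem, shows the effect. The function below uses the same pattern as the post (written as a raw string); the variable names are only for illustration.

```python
import re

def method4(txt):
    # Same pattern as in the post: a newline, then any run of
    # whitespace (including the next line's indentation).
    return re.sub(r"\n\s*\n*", "\n", txt)

# Two lines of the poem, a blank line, then the start of the next verse.
verse = "All mimsy were the borogoves,\n  And the mome raths outgrabe.\n\nBeware"

flattened = method4(verse)
```

The blank line is gone, but so are the two spaces in front of "And": the `\s*` after the newline swallows the indentation of the following line.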

I'm quite sure that's not what you wanted. I sure don't.

And it's not trivial to fix.
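
One possible direction, offered as a sketch rather than a fully tested fix, is to anchor the pattern at the start of each line with `re.MULTILINE`, so that only lines made up entirely of spaces and tabs are removed. The name `strip_blank_lines` is hypothetical, and the sketch assumes "whitespace" means spaces and tabs and that lines end in `\n` or `\r\n`; a whitespace-only final line with no trailing newline would be left alone.

```python
import re

# With re.MULTILINE, '^' matches at the start of every line, so this
# pattern matches a whole line of spaces/tabs plus its line ending.
_BLANK_LINE = re.compile(r'^[ \t]*\r?\n', re.MULTILINE)

def strip_blank_lines(txt):
    # Hypothetical helper: delete whitespace-only lines and leave the
    # indentation of every other line untouched.
    return _BLANK_LINE.sub('', txt)

sample = "first\n\n  indented\n \t \nlast\n"
result = strip_blank_lines(sample)
```

Because the pattern has to start at a line boundary and consume the line ending, it never reaches into the leading spaces of a line that has real content.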

Also, people write scripts and publish them as answers, but I've yet to find anything that actually works.

The other methods don't work, either. No one tested their stuff properly. Some people even write stuff on these forums and say... "this should work".

But thanks for your blurb. I still learned something.

take care
