Original source: Python Morsels – https://treyhunner.com/
Hey there,
I use Python for processing text-based data on a weekly basis (nearly daily actually). One of the more common things I need to do with text is split it up in various ways.
So let’s talk about various ways to split strings in Python!
Splitting by a character or a substring
If you’d like to split a string by a specific character, you can call the split method on your string and specify the character you’d like to split by:
>>> duration = "04:30"
>>> minutes_and_seconds = duration.split(":")
>>> minutes_and_seconds
['04', '30']
The string split method returns a list of substrings (which is just a fancy word meaning “a string that came from a bigger string”).
You can even split on a substring of multiple characters.
Here we’re splitting on " - " (space, hyphen, space):
>>> line = "Bill Withers - Just As I Am - 01 - Harlem"
>>> fields = line.split(" - ")
>>> fields
['Bill Withers', 'Just As I Am', '01', 'Harlem']
String splitting also pairs nicely with tuple unpacking:
>>> line = "Dolly Parton - Jolene - 04 - Early Morning Breeze"
>>> artist, album, n, title = line.split(" - ")
>>> title
'Early Morning Breeze'
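One thing to watch for: tuple unpacking raises a ValueError when the number of pieces doesn’t match the number of names. Here’s a minimal sketch of guarding against that. The parse_track helper is a hypothetical name used only for illustration, and the second sample line is made up:

```python
def parse_track(line):
    """Split an "artist - album - number - title" line into four fields.

    Hypothetical helper for illustration; returns None for lines that
    don't contain exactly four fields.
    """
    try:
        artist, album, n, title = line.split(" - ")
    except ValueError:
        # Raised by the unpacking when there are too few or too many fields.
        return None
    return artist, album, n, title


print(parse_track("Dolly Parton - Jolene - 04 - Early Morning Breeze"))
print(parse_track("Dolly Parton - Jolene"))
```

Whether to return None, raise a friendlier error, or skip the line depends on where your data comes from.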
Splitting on whitespace
Splitting on spaces? You could split on the space character:
>>> quote = "I cup my hands to catch a multi-colored butterfly"
>>> words = quote.split(" ")
>>> words
['I', 'cup', 'my', 'hands', 'to', 'catch', 'a', 'multi-colored', 'butterfly']
But you’re probably better off not passing a separator at all:
>>> words = quote.split()
>>> words
['I', 'cup', 'my', 'hands', 'to', 'catch', 'a', 'multi-colored', 'butterfly']
The default separator is “any amount of consecutive whitespace”. 😮
>>> quote = "Four in the morning\nI woke up from out of my dreams\n"
>>> quote.split()
['Four', 'in', 'the', 'morning', 'I', 'woke', 'up', 'from', 'out', 'of', 'my', 'dreams']
In fact, without a separator, whitespace is even trimmed from the ends of the string (notice that final newline character above was trimmed).
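To see why leaving out the separator tends to be the safer choice, here’s a small sketch comparing the two calls on a string with a doubled space and a trailing space (the sample string is made up for illustration):

```python
# A made-up string with two spaces after "cup" and one trailing space.
text = "I cup  my hands "

# Splitting on a single space yields empty strings wherever spaces
# repeat or sit at the end of the string.
by_space = text.split(" ")

# Splitting with no separator collapses runs of whitespace and ignores
# leading and trailing whitespace.
by_whitespace = text.split()

print(by_space)       # ['I', 'cup', '', 'my', 'hands', '']
print(by_whitespace)  # ['I', 'cup', 'my', 'hands']
```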
Splitting on newlines
What if you’re splitting by newlines?
You could pass a newline character (“\n”) into the split method:
>>> quote = "Nowhere to go but back to sleep\nBut I'm reconciled\n"
>>> quote.split("\n")
['Nowhere to go but back to sleep', "But I'm reconciled", '']
But if your string might end in a newline character (as ours does above and as many multi-line strings do) then you’ll end up with an empty string at the end of your list of strings. 🤔
If you’re splitting by newlines you should use the splitlines method instead:
>>> quote = "Nowhere to go but back to sleep\nBut I'm reconciled\n"
>>> quote.splitlines()
['Nowhere to go but back to sleep', "But I'm reconciled"]
Calling the split method with “\n” will simply split on newlines, but calling the splitlines method will actually trim a final newline (if there is one) rather than splitting on it.
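Another difference worth knowing about: splitlines also recognizes other line endings, such as the "\r\n" pairs found in files written on Windows. A quick sketch, using a made-up string:

```python
# Hypothetical text with Windows-style line endings.
text = "first line\r\nsecond line\r\n"

# split("\n") leaves the carriage returns behind and adds an empty
# string for the final newline.
with_split = text.split("\n")

# splitlines treats "\r\n" as a single line ending and doesn't produce
# a trailing empty string.
with_splitlines = text.splitlines()

print(with_split)       # ['first line\r', 'second line\r', '']
print(with_splitlines)  # ['first line', 'second line']
```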
Splitting at most N times
If you need to make sure you only split N times, you can specify maxsplit:
>>> duration = "04:30"
>>> minutes, seconds = duration.split(":", maxsplit=1)
>>> seconds
'30'
Both split and rsplit (which splits starting from the right) accept maxsplit. The splitlines method does not.
But keep in mind that using maxsplit might not solve your problems.
If you’re using maxsplit because you don’t trust the incoming string format, you’ll likely still need to handle malformed data later on in your code. For example, take a look at these malformed “seconds”:
>>> duration = "01:15:25"
>>> minutes, seconds = duration.split(":", maxsplit=1)
>>> seconds
'15:25'
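Here’s one way that later handling might look. This is just a sketch: parse_duration is a hypothetical helper, and it assumes that anything other than digits on both sides of a single colon counts as malformed:

```python
def parse_duration(duration):
    """Return the total seconds for an "MM:SS" string (hypothetical helper)."""
    # Unpacking raises ValueError if there is no ":" in the string at all.
    minutes, seconds = duration.split(":", maxsplit=1)
    # maxsplit=1 guarantees at most two parts, but "seconds" may still
    # hold something like "15:25", so validate before converting.
    if not (minutes.isdecimal() and seconds.isdecimal()):
        raise ValueError(f"Expected MM:SS, got {duration!r}")
    return int(minutes) * 60 + int(seconds)


print(parse_duration("04:30"))  # 270

try:
    parse_duration("01:15:25")
except ValueError as error:
    print(error)  # Expected MM:SS, got '01:15:25'
```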
Splitting lines without removing newlines
Something to keep in mind if you’re splitting up lines: splitlines also accepts a keepends argument, in case you want to split by newline but not actually remove the newline characters.
>>> quote = "Just try to do your very best\nStand up be counted with all the rest\n"
>>> quote.splitlines(keepends=True)
['Just try to do your very best\n', 'Stand up be counted with all the rest\n']
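Keeping the line endings is handy when you want to tweak each line and then put the text back together exactly as it was. A minimal sketch (the line-numbering step is just an illustrative use case):

```python
quote = "Just try to do your very best\nStand up be counted with all the rest\n"

lines = quote.splitlines(keepends=True)

# Because each line kept its newline, joining with an empty string
# rebuilds the original text exactly.
rebuilt = "".join(lines)

# Prefix each line with its number; the kept newlines still end each line.
numbered = "".join(f"{n}: {line}" for n, line in enumerate(lines, start=1))

print(rebuilt == quote)  # True
print(numbered, end="")
```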
Splitting on something complex
Is your splitting a bit more sophisticated?
For example, let’s say you’d like to split on hyphens or underscores. The string split method can’t do that. But the re.split function can!
>>> messy_id = "python_looping-techniques_explained"
>>> import re
>>> re.split(r"[_-]", messy_id)
['python', 'looping', 'techniques', 'explained']
If your split operation involves something more complex than a single substring, regular expressions are worth looking into.
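As one more illustration of what a regular expression buys you, here’s a sketch that splits on commas or semicolons while also swallowing any whitespace around them (the sample string is made up):

```python
import re

# A made-up string with inconsistent separators and spacing.
messy_list = "apples, pears;plums ,  figs"

# \s* matches optional whitespace on either side of a comma or semicolon,
# so the surrounding spaces are removed along with the separator.
items = re.split(r"\s*[,;]\s*", messy_list)

print(items)  # ['apples', 'pears', 'plums', 'figs']
```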
Your choices: split method, splitlines method, and re.split
The most important takeaways are:
1. If you’re splitting on whitespace characters, you’ll usually want to call the string split method without any separator
2. If you’re splitting on newlines, you’ll likely want the string splitlines method instead of split
3. For complex splits, you may want to use regular expressions (via re.split)
4. Splitting often pairs nicely with tuple unpacking
I hope you learned something new. Happy splitting! – Trey Hunner