Over the years, Beautiful Soup has probably saved us more hours on scraping, data collection, and other projects than we can count. Crummy's landing page for the library even says:
Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
You can plot me directly in the "days" column. When I'm starting a Python project that requires me to parse HTML, the first dependency I'll pull is BeautifulSoup. It turns what would normally be a nasty Perl-esque mess into something nice and Pythonic, keeping sanity intact. But what about structured data other than HTML? I've also learned that BS can be a huge boon for XML data, but not without a couple of speed bumps.
Enter data from the client (not real data, but play along for a moment):
Seems well-structured! Our customer just needs all the links that we have in the data. We fire up our editor of choice and roll up our sleeves.
We get in return:
Wait, what? What happened to our content? This is a pretty basic BS use case, but something strange is happening. Well, I'll come back to this and start working with their other very hypothetical data sets, where <link> tags become <resource> tags but the rest of the data is structured exactly the same. This time around:
...and corresponding result...
Interesting! To compound our problem, we're on a customer site where we don't have internet access to grab that sweet, sweet documentation we crave. After toiling on a Stack Overflow dump in all the wrong places, I was reminded of one of my favorite blog posts by SO's founder, Jeff Atwood: Read the Source. But what was I looking for? Well, let's dig around for <link> tags and see what turns up.
Sure enough, after some quick searches, we find what I believe to be the smoking gun (for those following along at home: bs4/builder/__init__.py, lines 228/229 in v4.3.2).
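If you'd rather not count line numbers, you can ask an installed copy of the library directly. This is a minimal sketch, assuming your bs4 version still exposes the set as `HTMLTreeBuilder.empty_element_tags` (the attribute name used in v4.3.2); `is_html_empty_element` is a helper name made up for illustration.

```python
from bs4.builder import HTMLTreeBuilder


def is_html_empty_element(tag_name):
    """Return True if bs4's HTML builders list tag_name as an empty-element tag.

    Assumption: the installed bs4 exposes HTMLTreeBuilder.empty_element_tags.
    """
    return tag_name.lower() in HTMLTreeBuilder.empty_element_tags


if __name__ == "__main__":
    # "resource" is the stand-in tag name from the second data set.
    for name in ("link", "base", "resource"):
        print(name, is_html_empty_element(name))
```

If the assumption holds, link and base should come back True and resource False, which lines up with which tags lose their content.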
We have a seemingly harmless tag named "link" in our XML, but it means something very different in HTML and, more specifically, to the TreeBuilder implementation that LXML is using. As a test, if I change our <link>-turned-<resource> tags into <base> tags, we get the same result: no content. It also turns out that if you have LXML installed, BeautifulSoup4 will default to it for parsing. Uninstalling it grants us the results we want: tags with content. The stricter (but faster) TreeBuilder implementations from LXML take precedence over html5lib (if you have it installed) and the built-in HTMLParser. How do we know that? Back to the source code!
bs4/builder/__init__.py, lines 304:321
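The practical upshot of that registration order can be checked without opening the file. A minimal sketch, assuming `builder_registry.lookup()` in your installed bs4 returns a builder class for a feature string, or None when nothing matches; `builder_for` is a hypothetical helper name.

```python
from bs4.builder import builder_registry


def builder_for(*features):
    """Return the name of the builder class bs4 would pick, or None."""
    builder = builder_registry.lookup(*features)
    return builder.__name__ if builder is not None else None


if __name__ == "__main__":
    # "xml" only resolves when an XML-capable builder (lxml) is installed.
    for feature in ("html", "lxml", "xml"):
        print(feature, "->", builder_for(feature))
```

With lxml installed, the "lxml" and "xml" lines should name two different builder classes, which is the whole story in miniature. Without it, both print None and "html" resolves to whatever HTML builder is left.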
As it turns out, when creating your soup, 'lxml' != 'xml'. Changing the parser argument from 'lxml' to 'xml' when creating the soup gets us the results we're looking for (UPDATE: corresponding doc "helpfully" pointed out by a Reddit commenter here). With 'lxml', BeautifulSoup was still picking an HTML builder, which is why our <link> tags kept coming back empty.
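To make the difference concrete, here is a minimal sketch that parses the same markup both ways. It assumes bs4 and lxml are both installed. `SAMPLE` is stand-in data invented for illustration (not the client's), and `link_texts` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Stand-in data: a tiny XML document that uses <link> as an ordinary element.
SAMPLE = (
    "<resources>"
    "<link>http://example.com/a</link>"
    "<link>http://example.com/b</link>"
    "</resources>"
)


def link_texts(markup, features):
    """Parse markup with the given bs4 features and return each <link>'s text."""
    soup = BeautifulSoup(markup, features)
    return [tag.get_text(strip=True) for tag in soup.find_all("link")]


if __name__ == "__main__":
    # "lxml" selects the HTML builder; "xml" selects lxml's XML builder.
    for features in ("lxml", "xml"):
        print(features, link_texts(SAMPLE, features))
```

The 'xml' run should return both URLs. The 'lxml' run is the one to watch: if your install behaves like ours did, the tags are found but their text is gone.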
While I didn't find that magic code snippet to fix everything (UPDATE: thanks, Reddit), we found our problem, even if we took a really roundabout route to get there. Understanding why it was happening made me feel a lot better in the end. It's easy to get frustrated when coding, but always remember, read the docs and - Read the Source, Luke. It might help you understand the problem.