Rain Blog

If you found this page while trying to figure this out perhaps you’ve made the same mistake that I did. If you want to scrape tables from websites, I’d recommend using excel’s “Import data from the web” feature. Yes, I know it’s proprietary software, but if you’re trying to do a lot of things really quickly this would probably save the most amount of time.

If you’re comfortable enough with Pandas, you could try the following task for yourself. But be warned, explode() is very confusing. I malded at it for like three hours and I still don’t understand what went wrong.

Anyways, let me show you this dirty trick I discovered during the middle of a hackathon in august.

This is a CSV I generated from a page on Statistics Canada which I needed for my project. I did not have much experience scraping data from the internet so my first instinct was to copy the <table> section of the source code into some random website that claims to convert html tables to CSVs. I ended up with a spreadsheet that looked like this:

Spreadsheet Image

There are newline characters where cell separators are supposed to be. Should be simple to fix, right?

Unfortunately, no. Splitting cells like this is supported neither in Excel nor in its derivatives like LibreOffice Calc, and figuring out how to do it programmatically (without using pandas) seemed like another hackathon project within itself.

What I decided on in the end was to use a find-and-replace operation within a text editor. However, that could not be done in this spreadsheet’s current form. CSVs store data in rows, which means that I would need to move multiple values over multiple lines while those line numbers are constantly changing.

Values move long distances

So I had to transpose it, and the following is a list of instructions for fixing the all cells: