Nov
2022
Python Options for Converting HTML to Text
Googling “extract text from webpage using python” will get you a huge number of articles explaining how to use Requests and BeautifulSoup to automate text extraction from webpages. Almost all of these approaches produce terrible output that requires a lot of cleaning. Some do elementary filtering on the DOM to exclude some text, but very few filter carefully enough to return only the main content of the page. Most will also return headers, sidebars and footer information.
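To make the problem concrete, here is a minimal sketch of that kind of naive extraction. It uses the standard library's `html.parser` as a stand-in for the usual Requests plus BeautifulSoup recipe so that it runs without a network connection or extra packages. The sample page and the names `NaiveTextExtractor`, `naive_text` and `SAMPLE_PAGE` are made up for illustration.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collect every text node, skipping only <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def naive_text(html):
    """Return all visible text in the page, one text node per line."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page: one useful article wrapped in navigation and a footer.
SAMPLE_PAGE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Real title</h1><p>The paragraph you actually wanted.</p></article>
<footer>Copyright 2022 Example Site</footer>
</body></html>
"""

if __name__ == "__main__":
    print(naive_text(SAMPLE_PAGE))
```

The navigation links and the copyright line come out alongside the article text, because nothing in this approach knows which parts of the page are the main content.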
For most purposes this is not text you want to scrape. I used to use jusText (GitHub fork and original) but have recently come across another, more complete solution: Trafilatura.
jusText is only an HTML to text converter. It extracts text from HTML you already have, which you can then save. To scrape a website you will also need something like Requests or Selenium to fetch the pages. Trafilatura is both a crawler and an extractor, with multiple output formats.
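A rough sketch of the difference in practice, assuming `justext`, `trafilatura` and `requests` are installed from PyPI. The wrapper names `extract_with_justext` and `extract_with_trafilatura` and the example URL are my own placeholders, not part of either library. The imports sit inside the functions so you only need the library you actually call.

```python
def extract_with_justext(html, language="English"):
    """jusText only classifies paragraphs; you supply the HTML yourself."""
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    # Keep only the paragraphs jusText did not flag as boilerplate.
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)


def extract_with_trafilatura(url):
    """Trafilatura can download the page and extract the main text itself."""
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # the download failed
        return None
    return trafilatura.extract(downloaded)


if __name__ == "__main__":
    import requests

    url = "https://example.com/"  # placeholder URL, swap in a real article

    # jusText needs a separate download step, here done with Requests.
    response = requests.get(url, timeout=10)
    print(extract_with_justext(response.content))

    # Trafilatura does the download and the extraction.
    print(extract_with_trafilatura(url))
```

`trafilatura.extract` also accepts an `output_format` argument if you want something other than plain text; check the Trafilatura documentation for the values your installed version supports.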
Here is a YouTube video introduction (there is no voice, so you can mute the annoying music):