BeautifulSoup is NOT the king of HTML Parsers (try this one)

Views: 25,236

1 year ago

Keep exploring at brilliant.org/JohnWatsonRooney/. Get started for free, and hurry: the first 200 people get 20% off an annual premium subscription.
I'm moving on from defaulting to BS4! For the web scraping work I do, I am after something fast and a bit more focused: minimal features but still easy to use. There are two options I consider in this video: Parsel from Scrapy, and Selectolax, a wrapper around a C library called Modest, written in Cython. Which one do you think I'm going to use...
This video was sponsored by Brilliant.
parsel.readthedocs.io/en/late...
github.com/rushter/selectolax
Scraper API www.scrapingbee.com/?fpr=jhnwr
Patreon: / johnwatsonrooney
Donations: www.paypal.com/donate/?hosted...
Proxies: iproyal.club/JWR50
Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
Gear I use: www.amazon.co.uk/shop/johnwat...
Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases

Comments: 72

@JohnWatsonRooney 1 year ago

Visit brilliant.org/JohnWatsonRooney/ to get started learning STEM for free, and the first 200 people will get 20% off their annual premium subscription.

@BrandonJacobson 1 year ago

Perfect timing. I’m going to create my own headline news scraper and this is perfect. Thank you!

@srikanthkoltur6911 1 year ago

Thanks for the introduction to the new parsing library. It is really worth a shot; I was using Scrapy for everything 😅

@xilllllix 1 year ago

thanks for introducing this to us, john!

@ianrickey208 1 year ago

Nice! We are about to redesign our crawlers and I was starting to review parsers.

@wanderingfool7136 1 year ago

Going to give this a try on a new script I'm writing for a client today! Thanks for everything you do 🙏🙏🙏

@tyricshuck3355 6 months ago

Thank you for this! Was looking for a lightweight HTML parser to get the JSON out of a script tag. This was perfect and fast! I also like their way of mimicking XPath functionality.

@geniusdavid 1 year ago

Usually skip over sponsors but this is actually interesting 🧐 will check it out indeed.

@seangibbons4713 1 year ago

As someone learning to code, your videos are a godsend. Keep up the great work. You're helping a lot of amateurs get their footing.

@JohnWatsonRooney 1 year ago

Thanks, that’s very kind!

@gitgosc7075 1 year ago

Can you make a series about Neovim configuration for web scraping? ;) Thanks for more great material!

@LuicMarin 1 year ago

Great video. It would be cool to see one on inspecting request/response headers without Selenium.

@nanjack5277 1 year ago

Hi sir, months ago I had a web scraping project where only an XPath selector could get the exact element. Which library should I use that is nearly as fast as selectolax?

@bakasenpaidesu 1 year ago

UPDATE: I tried selectolax and it's really fast... about 20x+ faster

@karim_ghibli 1 year ago

You said "pure css selector(s)" multiple times in this video. I may have missed where you explain it, but what do you mean by "pure css selector"? Selectolax does look pretty clean. For now I don't really care about scalability, but as long as it's as readable as BS4 (if not more), it definitely looks like something I wanna give a go next time I need to do some HTML parsing. Thanks for introducing this!

@JohnWatsonRooney 1 year ago

Hey, yeah, I realise I did. What I meant was selectolax works only with CSS selectors, meaning it’s more lightweight and potentially quicker. I can see how the words I used were a bit confusing, sorry!

@karim_ghibli 1 year ago

@JohnWatsonRooney oh, got it, thanks!

@hsider 1 year ago

Personally I don't mind BeautifulSoup latency; it serves as a request delay. If the parsing takes some time that's good, especially if I have a loop making multiple requests to the same website. Nice video of course 👍 Edit: I forgot to mention profiling: Python has the cProfile and pstats libs to profile and nicely display the time consumed by funcs and IO. It may help you compare these new libraries, instead of comparing syntax only. From what I've tested so far, the requests connection takes some time (often > 10s), so in my understanding it's the requests library which takes time, not parsing :) Hope this helps.
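
The profiling suggestion can be sketched with the standard library alone. `parse_titles` below is a hypothetical stand-in for whichever parsing step is being measured, and the repeated `<h2>` input is invented for illustration.

```python
import cProfile
import io
import pstats
import re


def parse_titles(html: str) -> list[str]:
    """Hypothetical parsing step; swap in the library under test."""
    return re.findall(r"<h2>(.*?)</h2>", html)


html = "<h2>item</h2>" * 10_000

# Record only the parsing call, so network time is excluded.
profiler = cProfile.Profile()
profiler.enable()
titles = parse_titles(html)
profiler.disable()

# Sort by cumulative time and capture the report as a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Profiling the request and the parse separately is one way to check whether the network or the parser dominates the run time.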

@JohnWatsonRooney 1 year ago

hey, cool thanks i will check it out!

@hsider 1 year ago

@JohnWatsonRooney There's something else I just remembered: requests-cache caches the downloaded HTML document, so you avoid waiting for the request to finish when testing new code.

@JohnWatsonRooney 1 year ago

@hsider thanks! I actually did a video on that just recently, caching API responses!

@hsider 1 year ago

@JohnWatsonRooney cool

@felipejardim2517 1 year ago

Awesome! I'll try! I really like BeautifulSoup because I can find elements in HTML using combinations, for example: class + attributes, or regex on an attribute value. I confess that I'm still not that good at finding elements by CSS selector. Do you have any content about it? :D

@bakasenpaidesu 1 year ago

Bro... BeautifulSoup also has CSS selectors, and BTW CSS selectors are super easy. Also, the one he suggested is really faster than BeautifulSoup, about 20x or more. For me it's the same work, because I just need to change a single line and do some replacing, since in BeautifulSoup I also use CSS selectors instead of the find method.

@felipejardim2517 1 year ago

@bakasenpaidesu Oh, I know that. I'm saying I prefer to find the elements in other ways, but I would like to know more about CSS selectors.

@SaMi-se2qs 1 year ago

Can we use it for dynamic websites?

@aaroncatolico7550 1 year ago

Hey John, which parser is quickest? I've been using Python 'Requests' library with the 'regex' library. Anything faster than this?

@JohnWatsonRooney 1 year ago

So far selectolax has been the quickest in my experience, I would recommend giving it a go

@aaroncatolico7550 1 year ago

@JohnWatsonRooney thanks! I'll check into it. 👍🏻👍🏻

@JohnWatsonRooney 1 year ago

@JamesQHolden I still use selectolax as my main HTML parser. If you have HTML then it's the best option. For interactive pages you'll need to render them first with an automated browser like Selenium or Playwright, then send the HTML to selectolax to parse.

@chillydoog 1 year ago

Awesome. I'm going to build a best chili dog scraper.

@AS-fj7ox 1 year ago

Thanks! That was so koool. Little correction: on line 14 in the selectolax .py file you need to add "()" to ".text" in order to call the method properly.

@JohnWatsonRooney 1 year ago

Ah yes I missed that thank you!

@bakasenpaidesu 1 year ago

BeautifulSoup does have CSS selectors: soup.select_one("h1.className")

@extropiantranshuman 8 months ago

it's weird that it's working off of css - but then again - having the option of having html + css is really helpful.

@RonWaller 10 months ago

Are you in Seattle? Seattle fan? just noticed your shirt.

@JohnWatsonRooney 10 months ago

from UK but watch the NFL and support the Seahawks!

@RonWaller 10 months ago

@JohnWatsonRooney OK, that is awesome. I am from Ohio. I do believe the Browns played there once.

@danlee1027 1 year ago

Great if speed is key to scale as you say.

@codified1 1 year ago

Please upload a video about how to solve a form based captcha.

@sulaimanahmed013 1 year ago

Updates on selectolax? How's it goin for you?

@JohnWatsonRooney 1 year ago

Still use it, it’s my go-to for HTML parsing

@sulaimanahmed013 1 year ago

@JohnWatsonRooney thank you for the reply. Have a nice day.

@s6yx 1 year ago

Can't use selectolax to scrape items based on div style attributes like I can with BeautifulSoup, unfortunately.

@pavelerokhin1512 1 year ago

Your videos are super helpful and you're also a handsome man :)

@drac.96 1 year ago

John, great video. I would like to know your thoughts on a few things. First, how would you approach crawling a website that uses GraphQL and requires scrolling down on a webpage to get more data? Is it possible to retrieve this data without using a huge library like Playwright or Selenium to crawl it? Can we still get the data we want with our authentication cookies?

@karim_ghibli 1 year ago

This is very much out of scope for the video and sounds more like something you should get John's consultation on (with consultation fee included), where you show him the specific website and he can give you a walkthrough on how he'd do it and answer any questions : )

@SlackOps 1 year ago

Please, I need an AliExpress web scraping tutorial

@MrSettler 1 year ago

bs4 has built-in support for CSS selectors using soup.select() or soup.select_one()
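
A small sketch of those two calls on made-up markup: `select_one` returns the first match (or `None`), while `select` returns a list of every match.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only to exercise the selectors.
html = '<div><h1 class="title">Parsers</h1><a href="/a">A</a><a href="/b">B</a></div>'
soup = BeautifulSoup(html, "html.parser")

# First element matching the CSS selector.
title = soup.select_one("h1.title").get_text()

# Every element matching the selector, in document order.
links = [a["href"] for a in soup.select("a[href]")]

print(title, links)
```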

@alisheik3076 9 months ago

Hello sir, when I try the same code as above, it's throwing an error. Please help me rectify this error. Thanks

@danielhangan 1 year ago

Can you do a LinkedIn company scraper video?

@marcossahade9369 1 year ago

What about requests-html? It supports CSS and XPath.

@JohnWatsonRooney 1 year ago

Yeah I’ve used it a lot before, I think that’s why I didn’t include it - it’s good though and if it works well for you then great

@marcossahade9369 1 year ago

@JohnWatsonRooney Your videos are great, I have learned a lot. Thanks for sharing your knowledge. From Argentina

@extropiantranshuman 8 months ago

why can't we have something where we can just import a pack, type in the websource to pull from, and just type in what we want to pull, and where it goes? Where's the template for that? Why so much extra stuff?

@AhmedThahir2002 1 year ago

Is selectolax faster than scrapy?

@JohnWatsonRooney 1 year ago

From a pure html parsing point of view it’s the fastest one I’ve seen. There’s lots more to scraping than just parsing html though so if you like using scrapy and it works for you I would keep using it

@philwebb59 1 year ago

Your videos are terrific at encouraging me to try new things, but latency isn't a problem. I've never been successful converting your scripts to run on "real" websites without getting blocked for life, even when adding a time.sleep(60) after each pull. I think the html-world just doesn't like me. &^) That said, I haven't found a good example of using selectolax to parse tables. Gonna take another look through your videos. Also, I see selectolax has Modest and Lexbor engines. Wonder what the pros and cons are?

@cpaandy3380 1 month ago

I was shocked how slow BeautifulSoup is compared to cheerio!!! I thought that because scraping is a big thing in Python everything would be optimized, but that's not the case!!!!

@00flydragon00 1 year ago

What is scraping used for in the industry? Most of the scraping videos I have seen focus on "home projects". Selectolax looks cool tho!

@JohnWatsonRooney 1 year ago

Mostly competitor product analysis, pulling their comparable deals etc

@00flydragon00 1 year ago

@JohnWatsonRooney ah! That makes a lot more sense now! So a market watcher is a permanent function in a company? Or do they hire 1 guy to make a script like this and run it themselves? Are you in the Python Discord by any chance?

@extropiantranshuman 8 months ago

I'm not convinced these are ideal, but do agree that beautifulsoup is honestly way too clunky for what it needs to do.

@loverboykimi 1 year ago

Gosh. It is really FAST.

@banuina_fps 5 months ago

Thank you. I was having a tough time finding a substitute for BeautifulSoup and you helped me with no chit-chatting.

@djmill8000 1 year ago

Just import pandas and do a pd.read_html

@extropiantranshuman 8 months ago

beautifulsoup really is clunky, because it doesn't auto-turn web script automatically into html. Having something already do that really helps!

@extropiantranshuman 8 months ago

truth be told - we really just need an interface - where people can click on the places on a website that would need to be copied, along with the direction of the copies and let it roll. Why code it when you can just get it to run?

@extropiantranshuman 8 months ago

actually if you'd like to take things up a step - adding in categories into your videos would speed this up.

@miguellopez7089 1 year ago

So cool! Will experiment with it one day 🤌🏽

@Frugtoy 1 year ago

I'm waiting for a tool with regular expressions built in (many sites create dynamic classes), and I don't like the existing workarounds. Hello .... .... .... World. Wanna do something like parser.find(p.*hello_text)