UiPath Queues For Beginners
14:04
8 months ago
Comments
@hemenths.k9009
@hemenths.k9009 6 hours ago
Hey, I'm getting an "Invoke Code: Exception has been thrown by the target of an invocation" error when I run this.
@Ashort12345
@Ashort12345 4 days ago
The AI agent is unable to bypass Cloudflare, even after trying Ollama.
@dungtrananh1522
@dungtrananh1522 5 days ago
Dear sir, can I use my local LLM models instead of the OpenAI API?
@smokedoutmotions_
@smokedoutmotions_ 6 days ago
Thanks bro
@redamarzouk
@redamarzouk 5 days ago
You’re welcome 😄
@LearnAvecAmeen
@LearnAvecAmeen 9 days ago
Hello Si Reda, all the best insh'Allah :)
@redamarzouk
@redamarzouk 6 days ago
Thank you so much and to you too 😄
@sharifulislam7441
@sharifulislam7441 9 days ago
Good technology to keep in the good books!
@jatinsongara4459
@jatinsongara4459 12 days ago
Can we use this for email and phone number extraction?
@redamarzouk
@redamarzouk 6 days ago
Absolutely, you just need to change the websites and the fields and you're good to go.
@JoaquinTorroba
@JoaquinTorroba 12 days ago
What other options are there besides Firecrawl? Thanks!
@JoaquinTorroba
@JoaquinTorroba 12 days ago
Just found it in the comments: "Firecrawl has 5K stars on GitHub, Jina ai has 4k and scrapegraph has 9k."
@redamarzouk
@redamarzouk 6 days ago
Exactly, Jina AI and ScrapeGraph AI are also options.
@FaithfulStreaming
@FaithfulStreaming 15 days ago
I like what you did, but for no-code people this is so hard because we don't know what we should install for Windows, etc. Really, really nice video.
@benom3
@benom3 17 days ago
Can you scrape multiple URLs at once? For example, if you wanted to scrape all the Zillow pages, not just the first page with a few houses. @redamarzouk
@avramgrossman6084
@avramgrossman6084 18 days ago
This is a nice video and very useful. In my applications I'm looking for the 'system' to have ALL the customer PDF invoices uploaded, or better yet, stored as a SalesOrder table in a database. This seems like a lot of work for just one customer and one email. Is there a way to create agents that could filter out which customer order is which, etc.?
@AmanShrivastava23
@AmanShrivastava23 23 days ago
I'm curious: what do you do after structuring the data? Do you store it in a vector DB? If so, do you store the JSON as-is or something else? And can it actually be completely universal? By that I mean, can it structure data without us providing the fields on which it should structure it? Can we make it so we give it a website and it understands the data and structures it accordingly?
@ilanlee3025
@ilanlee3025 26 days ago
I'm just getting "An error occurred: name 'phone_fields' is not defined".
@nkofr
@nkofr 26 days ago
Nice! Any idea how to self-host Firecrawl, e.g. with Docker? Also, can it be coupled with n8n? How?
@redamarzouk
@redamarzouk 26 days ago
I've got to be honest, I didn't even try. I tried to self-host an agentic software tool before and my PC was going crazy; it couldn't take the load of Llama3-8B running on LM Studio, plus Docker, plus filming at the same time. I simply don't have the hardware for it. If you want to self-host, here is the link: github.com/mendableai/firecrawl/blob/main/SELF_HOST.md. It uses Docker.
@nkofr
@nkofr 26 days ago
@@redamarzouk thanks. Does it make sense to use it with n8n? Or can n8n do the same without Firecrawl? (noob here)
@nkofr
@nkofr 26 days ago
@@redamarzouk or maybe with things like Flowise?
@zvickyhac
@zvickyhac 27 days ago
Can I use Llama 3 / Phi-3 on a local PC?
@redamarzouk
@redamarzouk 26 days ago
You theoretically can use it for data extraction, but you will need a large-context-window version of Llama 3 or Phi-3. I've seen a model where they extended the context length to 1M tokens for Llama3-8B. You need to keep in mind that your hardware has to match the requirements.
@karthickb1973
@karthickb1973 27 days ago
awesome bro
@redamarzouk
@redamarzouk 26 days ago
Glad you liked it
@kamalkamals
@kamalkamals 27 days ago
Nope, it's not better than GPT.
@redamarzouk
@redamarzouk 26 days ago
You're right, for now it's not; these models are beating each other like there's no tomorrow. To this date, GPT-4o is the one at the top.
@kamalkamals
@kamalkamals 26 days ago
@@redamarzouk before GPT-4 Omni, GPT-4 Turbo was still better; the only good point of Llama is that it's a free model :)
@titubhowmick9977
@titubhowmick9977 27 days ago
Nice video. Another helpful video on the same topic kzfaq.info/get/bejne/mrmIaMigqZqRpWg.htmlsi=8iKzgqHG97Ivf8wK
@titubhowmick9977
@titubhowmick9977 27 days ago
Very helpful. How do you work around the output limit of 4096 tokens?
@redamarzouk
@redamarzouk 26 days ago
Hello, if you're using the OpenAI API, you need to add the max_tokens parameter inside your OpenAI client call and set a number within the limits of the model you're using (gpt-4o has a 128,000-token context window, for example).
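A minimal sketch of where that parameter goes in the openai Python SDK (the model name and messages are illustrative; note that max_tokens caps the completion itself, while the 128K figure for gpt-4o is the total context window shared by prompt and completion):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=4096,  # upper bound on tokens generated for this completion
    messages=[
        {"role": "system", "content": "Return the requested fields as JSON."},
        {"role": "user", "content": "...scraped page markdown goes here..."},
    ],
)
print(response.choices[0].message.content)
```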
@YOGiiZA
@YOGiiZA 27 days ago
Helpful, Thank you
@redamarzouk
@redamarzouk 27 days ago
Glad it helped!
@IlllIlllIlllIlll
@IlllIlllIlllIlll 29 days ago
Does it work on a MacBook Pro?
@santhoshkumar995
@santhoshkumar995 29 days ago
I get "Error code: 429" when running the code: 'You exceeded your current quota,...'
@ilianos
@ilianos 28 days ago
In case you haven't used your OpenAI API key in a while: they changed the way it works; you need to pay in advance to refill your quota.
@ArisimoV
@ArisimoV 29 days ago
Can you use this for a self-operating PC? Thanks
@redamarzouk
@redamarzouk 29 days ago
Believe me, I tried, but my NVIDIA RTX 3050 4GB simply can't withstand filming and running LLaVA at the same time. Hopefully I'll upgrade my setup soon and be able to do it.
@ArisimoV
@ArisimoV 28 days ago
So it is possible; it's just a matter of programming and PC specs.
@PointlessMuffin
@PointlessMuffin 29 days ago
Does it handle JavaScript, infinite scroll, and button-click navigation?
@morespinach9832
@morespinach9832 27 days ago
Yes, you can ask LLMs to do all that like a human would.
@SJ-rp2bq
@SJ-rp2bq 29 days ago
In the US, a “bedroom” is a room with a closet, a window, and a door that can be closed.
@bls512
@bls512 29 days ago
Neat overview. Curious about the API costs associated with these demos. Try zooming into your code for viewers.
@morespinach9832
@morespinach9832 27 days ago
Watch on a big monitor, as most coders do.
@redamarzouk
@redamarzouk 26 days ago
For only the demo you've seen, I spent $0.50; for creating the code and launching it 60+ times, I spent $3. I will zoom in next time.
@shauntritton9541
@shauntritton9541 29 days ago
Wow! The AI was even clever enough to convert square meters into square feet, no need to write a conversion function!
@todordonev
@todordonev 29 days ago
Web scraping as it is right now is here to stay, and AI will not replace it (it can just enhance it in certain scenarios).

First of all, the term "scraping" is tossed around everywhere and used vaguely. When you "scrape", all you do is move information from one place to another, for example getting a website's HTML into your computer's memory. Then comes "parsing", which is extracting different entities from that information, for example extracting product price and title from the HTML we "scraped". These are separate actions; they are not interchangeable, one is not more important than the other, and one can't work without the other. Both actions come with their own challenges. What these kinds of videos promise to fix is the "parsing" part. It doesn't matter how advanced AI gets, there is only ONE way to "scrape" information, and that is to make a connection to the place the information is stored (whether it's an HTTP request, browser navigation, an RSS feed request, an FTP download, or a stream of data). It's just semi-automated in the background.

Now that we have the fundamentals, let me clearly state this: for the vast majority (99%) of cases, "web scraping with AI" is a waste of time, money, resources, and our environment.

Time: it's deceiving. AI promises to extract information with a "simple prompt", but you'll need to iterate over that prompt quite a few times to get a somewhat reliable data-parsing solution. In that time you could have built a simple Python script to extract the data required. More complicated scenarios will affect both the AI and the traditional route.

Money: you either use third-party services for LLM inference or you self-host an LLM. Both solutions will in the long term be orders of magnitude more expensive than a traditional Python script.

Resources: a lot of people don't realize this, but running an LLM for cases where an LLM is not needed is extremely wasteful. I've run scrapers on old computers, Raspberry Pis, and serverless functions; that is a speck of dust in hardware requirements compared to running an LLM on an industrial-grade computer with powerful GPU(s).

Environment: following from the resources needed, this affects our environment greatly, as new and more powerful hardware needs to be invented, manufactured, and run. For the people who don't know, AI inference machines (whether self-hosted or third-party) are powerhouses, so a lot of watt-hours get wasted, fossil fuels burnt, etc.

Reliability: "parsing" information with AI is quite unreliable, mainly because of the nature of how LLMs work, but also because a lot more points of failure are introduced (information has to travel multiple times between services, LLM models change, you hit usage and/or budget limits, LLMs experience high loads and inference speed suffers or fails altogether, etc.).

Finally: most "AI extraction" is just marketing BS letting you believe that you'll achieve something that requires a human brain and workforce with just "a simple prompt". I've been doing web automation and data extraction for more than a decade for a living. I've also started incorporating AI in some rare cases where traditional methods just don't cut it.

All that being said, for the last 1% of cases where it does make sense to use AI for data parsing, here's what I typically do (after the information is already scraped):

1. First I remove the vast majority of the HTML. If you need an article from a website, it's not going to be in the <script>, <style>, <head>, or <footer> tags (you get the idea), so using a Python library (I love lxml) I remove all these tags along with their content. Since we are just looking for an article, I also remove ALL of the HTML attributes, like classes (a big one), ids, and so on. After that I remove all the parent/sibling cases where it looks like a useless staircase of tags. I've tried converting to markdown and parsing, and I've tried parsing from a screenshot, but this method is vastly superior because the important HTML elements are still present and LLMs have good general knowledge of HTML. This step makes each request at least 10 times cheaper and lets us use models with smaller context sizes.

2. I then manually copy the article content that I need and put it, along with the resulting string from step 1, into a JSON object plus prompts to extract an article from the given HTML. I do this at least 15 times. This is the step where the training data is created.

3. Then I fine-tune a GPT-3.5 Turbo model with that JSON data. After about 10 minutes of fine-tuning and around $5-10, I have an "article extraction fine-tuned model" that will always outperform any agentic solution in all areas (price, speed, accuracy, reliability). Then I just feed the model a new (unseen) piece of HTML that has passed step 1, and it reliably spits out the article for a fraction of a cent in a single step (no agents needed).

I have a few of those running in production for clients (for different datapoints), and they do very well, but it's important that a human goes over the results every now and again. Also, if there is an edge case and the fine-tune did not perform well, you just iterate and feed it more training data, and it just works.
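A minimal sketch of the HTML-stripping step described in point 1 above (the tag list, function name, and input variable are illustrative, not the commenter's actual code):

```python
# Strip non-content tags and all attributes before sending HTML to an LLM.
from lxml import etree, html

def clean_html(raw_html: str) -> str:
    tree = html.fromstring(raw_html)
    # Remove tags whose content is never article text, content included.
    etree.strip_elements(tree, "script", "style", "head", "footer",
                         "nav", "iframe", etree.Comment, with_tail=False)
    # Drop every attribute: classes (the big one), ids, inline styles...
    for el in tree.iter():
        if isinstance(el.tag, str):  # skip comments/processing instructions
            el.attrib.clear()
    return etree.tostring(tree, encoding="unicode", pretty_print=True)
```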
@ilianos
@ilianos 28 days ago
Thanks for taking the time to explain this! Very useful to clarify!
@rafael_tg
@rafael_tg 28 days ago
Thanks man. I am specializing in web scraping in my career. Do you have a blog or similar where you share content about web scraping as a career?
@morespinach9832
@morespinach9832 27 days ago
Nonsense. Scraping has for 10 years included both fetching data and then structuring it in some format, XML or JSON. Then we can do whatever we want with that structured data. Introducing "parsing" as some distinct construct is inane. More importantly, the way scraping can work today is leagues better than what the likes of Apify used to do until two years ago, and yes, this uses LLMs. Expand your reading.
@morespinach9832
@morespinach9832 27 days ago
@@ilianos his "explanation" is stupid.
@morespinach9832
@morespinach9832 27 days ago
@@rafael_tg watch more sensible videos and comments.
@6lack5ushi
@6lack5ushi 29 days ago
Dumpling AI is a startup doing the same! I'm swapping to this; they are $50 a month for 10,000, and 6 a min.
@ajax0116
@ajax0116 1 month ago
It seems Zillow is blocking my access: "Press & Hold to confirm you are a human (and not a bot)." I was able to run it on Trulia, but only without my VPN.
@nabil-nc9sl
@nabil-nc9sl 1 month ago
God bless you, bro, mashallah.
@redamarzouk
@redamarzouk 1 month ago
May God protect you.
@tirthb
@tirthb 1 month ago
Thanks for the helpful content.
@redamarzouk
@redamarzouk 1 month ago
You're most welcome!
@user-se9qv5pi1q
@user-se9qv5pi1q 1 month ago
You said that sometimes the model returns the response with different key names, but if you pass a pydantic model to the OpenAI model as a function, you can expect an invariable object with the keys that you need.
@user-se9qv5pi1q
@user-se9qv5pi1q 1 month ago
Also, pydantic models can be written with nested structure, in contrast to plain JSON schemas.
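As a rough illustration of what these two comments describe (the field names, function name, and page variable are placeholders, not the video's code), passing a nested pydantic schema as a forced function call looks something like this:

```python
from openai import OpenAI
from pydantic import BaseModel

class Property(BaseModel):
    address: str
    price: str
    beds: int

class Listings(BaseModel):  # nested structure, as the second comment notes
    properties: list[Property]

client = OpenAI()
scraped_markdown = "...markdown from the scraper goes here..."

tools = [{
    "type": "function",
    "function": {
        "name": "record_listings",
        "description": "Record every listing found on the page.",
        "parameters": Listings.model_json_schema(),
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": scraped_markdown}],
    tools=tools,
    # Forcing the tool call is what keeps the key names invariable.
    tool_choice={"type": "function", "function": {"name": "record_listings"}},
)

raw_args = response.choices[0].message.tool_calls[0].function.arguments
listings = Listings.model_validate_json(raw_args)  # validated, fixed keys
```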
@redamarzouk
@redamarzouk 1 month ago
Correct. I actually used them while I was playing around with my code (alongside function calling). The issue I found is that I'd have to explain both the pydantic schema and how I made it dynamic, because I want a universal web scraper that can use different fields every time we're scraping a different website. That ultimately would've made the video 30+ minutes, so I opted for the easier, less performant way.
@EddieGillies
@EddieGillies 1 month ago
What about Angie's List? 😢
@egimessito
@egimessito 1 month ago
What about captchas?
@redamarzouk
@redamarzouk 1 month ago
Websites don't like scrapers in general, so extensive scraping will need a VPN (one that can handle the volume of your scraping).
@egimessito
@egimessito 1 month ago
@@redamarzouk also, a VPN would not defend against captchas. They are there for a good reason, but it would be interesting to find a way around them to build tools for customers.
@Chamati_ab
@Chamati_ab 1 month ago
Thank you Reda for sharing the knowledge! Much appreciated!
@redamarzouk
@redamarzouk 1 month ago
Really appreciate the kind words, my pleasure 🙏🙏
@Yassine-tm2tj
@Yassine-tm2tj 1 month ago
In my experience, function calling is way better at extracting consistent JSON than just prompting. Anyway, God bless a son of my country.
@Chillingworth
@Chillingworth 1 month ago
Good idea
@redamarzouk
@redamarzouk 1 month ago
You're on point with this; using function calling is always better for JSON consistency. I actually used it when I was creating my original code. The issue is that I have a parameter, "Fields", that can change depending on the type of website being scraped. To account for that in my code, I either make the schema inside the function call generic (not so great) or I make it dynamic (I really didn't want to go there; it would have made the tutorial much more complicated). I also tried using pydantic models, since Firecrawl has its own LLM extractor that can use them, but it didn't perform as well. But yeah, you're right, function calling is always better. May God protect you, man.
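For readers curious what the "dynamic" option could look like, here is one possible sketch (the helper name and field lists are invented for illustration; this is not the video's code): build the function-calling parameters from whatever field list is supplied at run time.

```python
# Build a function-calling JSON schema from a user-supplied field list,
# so one script can scrape sites with different fields each run.
def build_extraction_schema(fields: list[str]) -> dict:
    item = {
        "type": "object",
        "properties": {field: {"type": "string"} for field in fields},
        "required": fields,
    }
    return {
        "type": "object",
        "properties": {"listings": {"type": "array", "items": item}},
        "required": ["listings"],
    }

# Example: the same script handles real estate and e-commerce pages.
zillow_schema = build_extraction_schema(["Address", "Price", "Beds", "Baths"])
shop_schema = build_extraction_schema(["Product", "Price", "Rating"])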
@Yassine-tm2tj
@Yassine-tm2tj 1 month ago
@@redamarzouk You have a knack for this, bro. Keep up the good work. May God grant you success.
@Chillingworth
@Chillingworth 1 month ago
You could just ask GPT-4 once, per website, to generate the extraction code or the tags to look for, so that it doesn't always need to use AI for scraping; you might get better results. Then, if that code fails, you fall back to regenerating it and cache it again.
@redamarzouk
@redamarzouk 1 month ago
Creating a dedicated script for a website is the best way to get the exact data you want, you're right in that sense, and you can always fix it with GPT-4 as well. But let's say you're actively scraping 10 competitor websites where you only want their pricing updates and new offerings: does it make sense to maintain 10 different scripts rather than one script that can do the job and needs very minimal intervention? It depends on the use case, but there are times when customized scraping code isn't the best approach.
@Chillingworth
@Chillingworth 1 month ago
@@redamarzouk I didn't mean it like that. I meant you would basically do the same thing as your technique, but you would use the AI once for each domain, asking it what the CSS selectors are for the elements you're interested in. That way, when you're looking for updates, you don't need any calls to the LLM unless extraction fails because the structure is different. You don't even have to maintain multiple scripts; just make a dictionary with the domain name and the CSS paths and there you go. Of course, different pages may have different structures, but you could probably just feed in the HTML from a few different pages of the site, give GPT-4 the URLs and the markup in a prompt, and tell it to figure out the URL pattern that matches the specific stuff to look for. You could even still do this with GPT-3.5-Turbo. Basically, the only idea I'm throwing out there is to ask the AI to tell you the tag names and have your code simply extract the info using BeautifulSoup or something else that can grab info out of tags based on CSS query selectors. That way, you can cache that info and then scrape faster after you get it the first time. It would only be a little more work but might be a lot better for some use cases. Just thought it was a cool idea.
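A small sketch of that caching idea (the domain, selectors, and field names are invented for illustration):

```python
# Ask the LLM for CSS selectors once per domain, cache them, then scrape
# with BeautifulSoup alone; only re-ask the LLM when a selector breaks.
from bs4 import BeautifulSoup

SELECTOR_CACHE = {
    "example-homes.com": {"price": "span.list-price", "address": "a.property-link"},
}

def extract_fields(domain: str, page_html: str) -> dict[str, list[str]]:
    soup = BeautifulSoup(page_html, "html.parser")
    selectors = SELECTOR_CACHE[domain]
    results = {
        field: [el.get_text(strip=True) for el in soup.select(css)]
        for field, css in selectors.items()
    }
    if not any(results.values()):
        # Structure probably changed: regenerate the selectors with the LLM
        # and update SELECTOR_CACHE here (fallback omitted in this sketch).
        pass
    return results
```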
@d.d.z.
@d.d.z. 1 month ago
Thank you. I have a use case: can I use the tool to make queries to a database, save the results as your tutorial shows, and also print the result of every query to PDF?
@redamarzouk
@redamarzouk 1 month ago
If you already have a database you want to make queries against, you don't need any scraping (unless you need to scrape websites to create that database). But yeah, it sounds like you can do that without any AI in the loop.
@ika9
@ika9 1 month ago
While it is a practical solution, it still requires access to the OpenAI API or another LLM API, which can incur costs. BeautifulSoup and Selenium remain free alternatives. However, using LM Studio locally provides the advantage of utilizing your own LLM, offering greater flexibility and control.
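For what it's worth, a minimal sketch of that last point, assuming LM Studio's local server is running with a model loaded (1234 is its default port; the model identifier depends on what you load, and the API key can be any placeholder string):

```python
# Point the OpenAI client at LM Studio's OpenAI-compatible local server,
# so the same extraction code runs against a local model at no API cost.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the identifier LM Studio shows
    messages=[{"role": "user", "content": "Extract title and price as JSON: ..."}],
)
print(response.choices[0].message.content)
```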
@user-xq4yj6ni8v
@user-xq4yj6ni8v 1 month ago
Nice idea. Now wake me up when there are no credits involved (completely free).
@redamarzouk
@redamarzouk 1 month ago
It's open source; this is how you can run it locally and contribute to the project: github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md. But honestly, as IT folks we've got to stop going after each other for wanting to charge for an app we've created. Granted, I'm not recommending this to my clients yet and $50/month is high, but if that's what they want to charge, it's really up to them.
@bastabey2652
@bastabey2652 1 month ago
Just use an LLM: pass it the source code of the page and generate the scraping function à la carte. The LLM is the secret sauce.
@ridabrahim7604
@ridabrahim7604 1 month ago
I just want to understand what role Firecrawl plays in all of this. It seems to me there's nothing special about it at all!!!
@roblesrt
@roblesrt 1 month ago
Awesome! Thanks for sharing.
@redamarzouk
@redamarzouk 1 month ago
My pleasure!
@ginocote
@ginocote 1 month ago
It's easy to do this with free Python libraries: read the HTML, convert it to markdown, even convert it to vectors for free with a transformer, etc.
@actorjohanmatsfredkarlsson2293
@actorjohanmatsfredkarlsson2293 1 month ago
Exactly, I didn't really understand the point of Firecrawl in this solution. Does Firecrawl do anything better than a free Python library? Any suggestions for Python libraries, btw?
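One minimal way to do the free HTML-to-markdown step these two comments describe (requests and html2text are one possible pairing, not what the video used):

```python
# Fetch a page and convert its HTML to markdown with free libraries.
import requests
import html2text

page = requests.get("https://example.com", timeout=30)
converter = html2text.HTML2Text()
converter.ignore_links = False   # keep hyperlinks in the markdown output
markdown = converter.handle(page.text)
print(markdown[:500])
```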
@morespinach9832
@morespinach9832 27 days ago
Have you used it on complex websites with many ads, or logins, or progressive JS-based loads, or infinite scrolls? Clearly not.
@redamarzouk
@redamarzouk 26 days ago
Firecrawl has 5K stars on GitHub, Jina AI has 4K, and ScrapeGraph has 9K. Saying that you can just implement these tools easily is frankly disrespectful to the developers who created these libraries and made them open source for the rest of us.
In the example I covered, I didn't show the capability of filtering the markdown to keep only the main content of a page, nor did I show how to scrape using a search query. I've done scraping professionally for 7+ years now, and the number of problems you can encounter is immense, from websites blocking you, to table-looking elements that are in fact just a chaos of divs, to iframes...
About vectorizing your markdown: I once did that on my machine in a "chat with PDF" project, and with just 1,024 dimensions and 20 pages of PDF I had to wait long minutes to generate the vector store, which then has to be searched for every request, also locally (not everyone has the hardware for it).
@KarriemMuhammad-wq4lx
@KarriemMuhammad-wq4lx 1 day ago
@@redamarzouk Firecrawl doesn't offer much value when there are free Python resources and paid tools that let you scrape websites without needing your own API key. You still have to input your OpenAI API key with Firecrawl, making it less appealing. Why pay for something when there are free or cheaper options that are easier to use? Thanks for sharing, but I'll stick with the alternatives.
@stanpittner313
@stanpittner313 1 month ago
$50 monthly fee 🎉😂😅
@redamarzouk
@redamarzouk 1 month ago
I actually filmed an hour and wanted to go through the financials of this method and whether it makes sense, but I edited that part out so the video stays under 30 minutes. But I agree, $50 is high, and the markdown output needs to be high quality so the token count, and therefore the LLM cost, stays low. BTW, I'm not sponsored in any way by Firecrawl; I was going to talk about Jina AI or ScrapeGraphAI, which do the same thing, before deciding on Firecrawl.
@simonren4890
@simonren4890 1 month ago
Firecrawl is not open source!!!
@paulocacella
@paulocacella 1 month ago
You too nailed it. We need to reject these false open-source projects that are in reality commercial endeavours. I use only FREE and OPEN code.
@redamarzouk
@redamarzouk 1 month ago
Except it is. Refer to its repo; it shows how to run it locally: github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
@paulocacella
@paulocacella 1 month ago
@@redamarzouk I'll take a look. Thanks.
@javosch
@javosch 29 days ago
But you are not using the open-source version, you are using their API... perhaps next time you could run it locally.
@everbliss7955
@everbliss7955 14 days ago
@@redamarzouk the open-source repo is still not ready for self-hosting.
@Brodielegget
@Brodielegget 1 month ago
Why do it this way if you can do this without coding? With Make.com, for example.
@redamarzouk
@redamarzouk 1 month ago
Yeah, you can create the same process with no-code tools like Make or Zapier, or even with low-code tools like UiPath and Power Automate, but I just feel more in control of formatting my output and integrating my script with my other local processes when I use code. I still use no-code tools for other things.
@Byte-SizedTech-Trends
@Byte-SizedTech-Trends 1 month ago
Make and Zapier would get very pricey if this were automated at scale.
@iokinpardoitxaso8836
@iokinpardoitxaso8836 1 month ago
Amazing video and great explanations. Many thanks.
@redamarzouk
@redamarzouk 1 month ago
Appreciate it, thank you for the kind words!
@squiddymute
@squiddymute 1 month ago
Another API key to pay for? What's the point of this, really?
@paulocacella
@paulocacella 1 month ago
You nailed it. We need to reject these false open-source projects that are in reality commercial endeavours. I use only FREE and OPEN code.