Processing Large XML Wikipedia Dumps that won't fit in RAM in Python without Spark

14,203 views

Jeff Heaton


Python's ElementTree module lets you read XML of any size, as long as you have the time to process it. Unlike with a DOM, the entire XML document does not need to be loaded into memory. This video shows how all of Wikipedia can be processed in Python without a large amount of RAM.
My blog post for this video:
www.heatonresearch.com/2017/0...
The code for this video can be found here:
github.com/jeffheaton/present...
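The streaming approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the `SAMPLE` document, the `local_name` helper, and the `iter_titles` function are assumptions made up for the example, and a real dump would be opened from a file rather than from an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a Wikipedia dump (assumed layout); a real dump is far larger.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><ns>0</ns><id>1</id></page>
  <page><title>Beta</title><ns>0</ns><id>2</id></page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(source):
    """Yield page titles one at a time without building the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title = None
            for child in elem:
                if local_name(child.tag) == "title":
                    title = child.text
            yield title
            root.clear()  # discard finished pages so memory use stays flat


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

The `root.clear()` call is what keeps memory bounded: without it, every parsed page would stay attached to the root element for the whole run.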

Comments: 26
@opalkabert 4 years ago
I am not just liking this but want to thank you for your time to show this. It is awesome Jeff!
@biologyigcse 4 years ago
As a person who is just starting out in the research domain and has to work with wiki dumps, this was a godsend. THANKS a ton, you just saved me tons of time and mental stress. Did I say thanks yet? THANKS A TON. You, sir, get a like, a subscribe, and notifications enabled, and I am sharing your channel on my Twitter space.
@sadiko3000 4 years ago
I took a look at the content of your channel and it is very impressive. Please keep doing this!
@mariagraetsch3700 4 years ago
Thank you Jeff - your video provides a really structured example.
@DanielWeikert 4 years ago
Thanks a lot for your videos. I would love to see more on how to deal with big data in Python. Best regards
@BiancaAguglia 4 years ago
Thank you for another great video, Jeff. Not only is it useful but, as the zombie apocalypse **has** been on my mind lately, it is also very timely. 😁 As others have already commented, I also think it would be nice to see the same process in Spark. Keep up the great work.
@woetotheconquered3451 2 years ago
You're amazing. Just what I needed
@mariumbegum7325 1 year ago
Interesting video, keep it up!
@tonym5857 4 years ago
* stars video 👏👏👏. It would be nice to see the same process using big data tech like HDFS, Spark, etc.
@paulowiz 3 years ago
I'm a beginner at this. I will try this code after the file finishes downloading =). Thanks for it
@lisanoorarida4009 4 years ago
Thank you so much. I am working on this right now. For the output, I need to generate a new XML file after filtering the wiki. I tried to use the module, but it said "ElementTree is not a streaming writer". What do you recommend?
@HeatonResearch 4 years ago
I have seen lxml used for that before, but have not done it myself.
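For the filtered-output question above, one possible approach is sketched below. Note that it does not use lxml, which the reply mentions: it swaps in the standard library's `xml.sax.saxutils.XMLGenerator` as the streaming writer so that the example has no third-party dependencies. The `SAMPLE` data, the `filter_pages` function, and the `keep` predicate are all hypothetical names invented for this illustration, and the sample omits the XML namespace that real dumps carry.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator

# Simplified stand-in for a dump (assumed layout, no XML namespace).
SAMPLE = b"""<pages>
  <page><title>Alpha</title><ns>0</ns></page>
  <page><title>Talk:Alpha</title><ns>1</ns></page>
  <page><title>Beta</title><ns>0</ns></page>
</pages>"""


def filter_pages(source, out, keep):
    """Stream pages from source and write those accepted by keep() to out."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    writer.startElement("pages", {})
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # grab the root so finished pages can be cleared
    for event, elem in context:
        if event == "end" and elem.tag == "page":
            if keep(elem):
                writer.startElement("page", {})
                # Only direct children are copied; deeper nesting would
                # need a recursive walk.
                for child in elem:
                    writer.startElement(child.tag, dict(child.attrib))
                    writer.characters(child.text or "")
                    writer.endElement(child.tag)
                writer.endElement("page")
            root.clear()  # free the page whether or not it was written
    writer.endElement("pages")
    writer.endDocument()


out = io.StringIO()
# Keep only pages in namespace 0, which holds ordinary articles.
filter_pages(io.BytesIO(SAMPLE), out, lambda page: page.findtext("ns") == "0")
result = out.getvalue()
print(result)
```

Both the reader and the writer handle one page at a time, so neither side holds the full document. As far as I know, lxml also provides an incremental writer (`etree.xmlfile`) that serves the same purpose.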
@nonenogood 1 year ago
Hello Mr. Heaton. I wonder, can we get the 'text' data from the dataset into csv too?
@quackcharge 3 years ago
thanks so much!
@RollingcoleW 1 year ago
Helpful!
@saleem801 4 years ago
Has a Spark implementation been made since?
@rohitreddy3609 3 years ago
Thank you for this amazing tutorial. It's very informative. Can you please explain how to create a dataset of topics from a Wikipedia dump, say to retrieve 100 topics, for example? My question is: how can we crawl Wikipedia to get documents and images? Thanks in advance.
@victoriar8179 4 years ago
Thanks for the video! It would be awesome to be able to process this with Spark.
@HeatonResearch 4 years ago
Yes, that is coming. Once you start to add any NLP functions on that Wikipedia text the process can take weeks without Spark.
@tamastarisnyas1191 3 years ago
Hi there, thank you for the video. There's an issue, though: when I use your code, it doesn't fill the redirect column for some reason. Could you help me with this problem?
@HeatonResearch 3 years ago
Let me have a look at that!
@tamastarisnyas1191 3 years ago
@@HeatonResearch Another thing I wanted to do is grab the text of each article and add it to the table as a separate column next to each title. Could you give me some pointers or tips on how I can do this, please? It would help a lot. I've been trying to do it, but without success.
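As a rough sketch of what this thread asks about, the example below writes the title, redirect target, and article text of each page into one CSV row. It is not the video's code. It assumes the usual dump layout, in which the redirect target sits in the `title` attribute of an empty `<redirect>` element and the article body sits in `<revision><text>`. `SAMPLE`, `local_name`, and `pages_to_csv` are hypothetical names.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Small stand-in for a dump; the element layout is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <revision><text>Alpha is the first letter.</text></revision>
  </page>
  <page>
    <title>A</title>
    <redirect title="Alpha" />
    <revision><text>#REDIRECT [[Alpha]]</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pages_to_csv(source, out):
    """Write one CSV row (title, redirect, text) per page, streaming."""
    writer = csv.writer(out)
    writer.writerow(["title", "redirect", "text"])
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        row = {"title": "", "redirect": "", "text": ""}
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                row["title"] = node.text or ""
            elif name == "redirect":
                # The target is an attribute, not element text.
                row["redirect"] = node.attrib.get("title", "")
            elif name == "text":
                row["text"] = node.text or ""
        writer.writerow([row["title"], row["redirect"], row["text"]])
        root.clear()  # release the finished page


buffer = io.StringIO()
pages_to_csv(io.BytesIO(SAMPLE), buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

If a redirect column comes out empty, one thing worth checking (as a guess, not a diagnosis) is whether the code reads the element's text instead of its `title` attribute, or compares tag names without stripping the namespace. Note also that storing full article text makes the CSV roughly as large as the dump itself.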
@sarasmith1647 1 year ago
I get "FileNotFoundError: [Errno 2] No such file or directory", although it created the 2 CSV files in the directory.
@sarasmith1647 1 year ago
The 3 CSV files**
@Knightmare535 4 years ago
3:53 Funny you say that...
@623-x7b 4 years ago
You can also torrent it; it's much faster to download.