Refactoring A PDF And Web Scraper Part 1 // CODE ROAST

43,644 views

ArjanCodes

Visit bit.ly/ARJAN50 to get 50% off the Pro version of Tabnine for 6 months.
In this code roast episode, I analyze and do a refactoring of a PDF and web scraper script that has a surprising design twist - I've never seen this before! This is part 1; next week I'll post the finale (part 2)!
The code I worked on in this episode is available here: github.com/Arj....
💡 Here's my FREE 7-step guide to help you consistently design great software: arjancodes.com....
🚀If you want to take a quantum leap in your software development career, check out my course The Software Designer Mindset: www.arjancodes...
Courses:
The Software Designer Mindset: www.arjancodes...
The Software Designer Mindset Team Packages: www.arjancodes...
The Software Architect Mindset: Pre-register now! www.arjancodes...
Next Level Python: Become a Python Expert: www.arjancodes...
The 30-Day Design Challenge: www.arjancodes...
🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!
💬 Join my Discord server here: discord.arjan....
🐦Twitter: / arjancodes
🌍LinkedIn: / arjancodes
🕵Facebook: / arjancodes
👀 Channel code reviewer board:
- Yoriz
- Ryan Laursen
- Sybren A. Stüvel
- Dale Hagglund
🔖 Chapters:
0:00 Intro
1:40 Explaining the example
3:41 Code review and analysis
13:45 Refactoring the pdf scraper
21:16 Refactoring the file/folder request inheritance
24:34 Further cleanup of pdf scraper class
35:52 Example unit test
#arjancodes #softwaredesign #python
DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Comments: 105
@laminatedmoth8282 2 years ago
These types of videos are one of the most helpful resources for someone who's entirely self-taught. It's really hard to know what is/isn't "good practice" or how to improve existing work, but your code roast videos are perfect for figuring this out. Thank you!
@ArjanCodes 2 years ago
Thank you - glad you like them!
@sjmarel 2 years ago
I like the distillation-of-pure-functions part of the roast. Conforming strictly to object composition is limiting and confusing. This channel is getting better by the day!
@ArjanCodes 2 years ago
Thank you - glad you liked it!
@ravenecho2410 2 years ago
good point :)
@kiwidamien 2 years ago
I would agree with moving the thanks comment to a README, but the module string in the code means that you can access the information using help(module name). It is common for Python projects to start with a long string for this reason. As a developer, you should “collapse” the string in your editor so you don’t have to scroll past it - the advice in the video struck me as an anti-pattern. Keeping the __version__ is also important if you publish your package. There are arguments about where to put the version info (setup? Main file?) but just having the package in a repo doesn’t mean you can remove versioning from the code.
@ArjanCodes 2 years ago
Good to know, thanks Damien!
@pabloshi4863 2 years ago
I think for Python code the __init__ and __main__ serve the functions you mentioned? Maybe we can move them there? Either way, for me, it shouldn’t be written at the top of main.py.
@kiwidamien 2 years ago
@pabloshi4863 Module strings go at the top of the module. The __init__.py is for a package, and a string under the usual `if __name__ == "__main__"` check would not get run on import. If you want help(module) to give you a string, it must be at the top of the module (just like a docstring must be at the top of a function), otherwise it won’t be discovered. (Technically you can wrap it, just as functools.wraps assigns to a docstring, but this is far from a standard mechanism.)
@pabloshi4863 2 years ago
@kiwidamien Thank you for making this crystal clear for me!
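The point about module strings in the exchange above can be checked with a small sketch. The module name `demo_scraper`, its docstring text and the version number are made up for illustration; the source is loaded from a string only to keep the example self-contained.

```python
import types

# Hypothetical module source: a docstring at the very top, a version,
# and a string under the __main__ guard that is NOT a docstring.
SOURCE = '''"""Scrape PDFs and web pages (hypothetical module docstring).

Thanks to everyone who helped test this script.
"""
__version__ = "0.1.0"

if __name__ == "__main__":
    """A string placed here is not picked up as the module docstring."""
'''


def load_module(name: str, source: str) -> types.ModuleType:
    """Execute source text as if it were the body of a module called `name`."""
    module = types.ModuleType(name)
    exec(compile(source, f"<{name}>", "exec"), module.__dict__)
    return module


demo = load_module("demo_scraper", SOURCE)
print(demo.__doc__)      # the text that help(demo) would display
print(demo.__version__)  # version information kept next to the code
```

Only the string at the top of the module ends up in `__doc__`, which is what `help()` reads.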
@pabloshi4863 2 years ago
@ArjanCodes After the small talk with @Damien Martin I think it’s time for you to start a Discord server so that good Python programmers can bring up the next generation of good Python programmers.
@qwertyuiopsdfgh 2 years ago
I will rewatch as perhaps I'm misunderstanding, but I have a question about the most_common_words function introduced at around 31:20. I don't understand how passing a set as a parameter can be used to do a frequency count. A set will not contain duplicate values, so the frequency count should be exactly 1 for all words in word_set.
@ArjanCodes 2 years ago
Hm... you're right, well spotted. I was a bit overly optimistic changing everything to sets. I'll change this back to a list in the code in the repository.
@adamfarquhar1279 4 months ago
Glad you mentioned this way back then. I noticed this right away and was waiting for Arjan to circle back and fix it. The testing was not sufficient to check that the answers were correct, just that they didn't fall over.
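The bug discussed in this thread is easy to reproduce. The sketch below uses `collections.Counter` as a stand-in for the frequency counter in the video (an assumption made to avoid third-party dependencies); the signature of `most_common_words` and the sample words are hypothetical.

```python
from collections import Counter


def most_common_words(words, n=2):
    """Return the n most frequent words with their counts."""
    return Counter(words).most_common(n)


words = ["data", "model", "data", "test", "data", "model"]

# A list keeps duplicates, so the counts are meaningful.
from_list = most_common_words(words)

# A set drops duplicates first, so every word is counted exactly once.
from_set = most_common_words(set(words), n=3)

print(from_list)
print(from_set)
```

A test that asserts on the actual counts, not only on the call succeeding, would have caught the list-to-set change.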
@vbaclasses3553 2 years ago
10:14 is pure gold, your humor is off the charts. I'm assuming you did the sets thing on purpose, because you have code review and I can't imagine how a bug would get past you first, and then past them. I really enjoy your content.
@kevinlusignolo1236 2 years ago
Excellent roast! I absolutely love this series. Your eye for great design is always spot on. I believe you introduced somewhat of a bug by converting the all_words variable from a list to a set. It seems like all_words needs to remain a list so that duplicates can exist; without duplicates, the ScrapeResult.frequency would no longer have valid information (due to every word now having a frequency of 1). The same goes for the most_common_words function: by design, every word in the all_words set can occur only once.
@ArjanCodes 2 years ago
Thanks Kevin! And you're right - I was a bit too optimistic converting every list to a set :). I'll fix this in the code in the repository.
@albertomanfreda2125 2 years ago
I was about to write just that. I jumped on the chair when I saw him passing a set to a function for computing frequencies. I guess this shows the difference between the eye of a scientist and the eye of a programmer :)))
@JohnFallot 2 years ago
Thanks for diving into this! Currently at the 1:00 minute mark, but judging from the previews... oh dear this is going to be good 😅
@JohnFallot 2 years ago
Note for around 32:38 and more generally, I _definitely_ have a bad habit of putting methods into classes when they don't need to be; and I'm glad for this, shall we say, intervention! I suspect that this habit came about back when I was first learning Python via pygame tutorials. Putting code into classes was presented as good practice. It seemed as though not having methods in classes was a bit like leaving them exposed to the elements. It's reassuring to see that these functions can in fact 'go outside' as it were!
@ArjanCodes 2 years ago
Thank you so much, John for supplying the code! It was really fun working on this project (even though I introduced a bug near the end, haha). Indeed, in many cases classes are not needed, and a simple functional approach works really well, especially if those functions are relatively easy to test.
@JohnFallot 2 years ago
@ArjanCodes Happy to see it was taken up and that a more recent version was used after all, relative to what I had originally sent along! 😃 Also, for inquiring minds: the “__new__” class/subclass scheme around 8:00 was definitely a choice! 😅 I saw a tutorial for a pattern like it (not that it was meant for this context at all), and I felt that letting the code adapt based on the provided file input would be a good idea. I’ll definitely be sure to consult your factory pattern video with fresh eyes! I recall having tried that pattern over the summer, but I had found it a bit advanced for me at the time.
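As a rough illustration of the factory idea mentioned above, the sketch below picks a scraper from the shape of the input with a plain function instead of overriding `__new__`. All names (`Scraper`, `PdfScraper`, `FolderScraper`, `create_scraper`) are hypothetical and are not taken from the repository.

```python
from pathlib import Path
from typing import Protocol


class Scraper(Protocol):
    """Anything with a scrape method counts as a scraper."""

    def scrape(self, source: str) -> str: ...


class PdfScraper:
    def scrape(self, source: str) -> str:
        return f"scraping PDF file {source}"


class FolderScraper:
    def scrape(self, source: str) -> str:
        return f"scraping every PDF in folder {source}"


def create_scraper(source: str) -> Scraper:
    """Choose the scraper in one visible place rather than inside __new__."""
    if Path(source).suffix.lower() == ".pdf":
        return PdfScraper()
    return FolderScraper()


print(create_scraper("paper.pdf").scrape("paper.pdf"))
print(create_scraper("papers").scrape("papers"))
```

The classes stay ordinary classes, and the decision logic is a small function that can be tested on its own.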
@FlamencoDeniz 2 years ago
Arjan got really passionate about the class initialisation at 8:42 👍
@c32ax1 2 years ago
This is pretty fantastic. The video has the quality of an experienced teacher walking through material sharing insights that aspiring software engineers like myself may not consider at first glance.
@ArjanCodes 2 years ago
Thank you, glad to hear you like it!
@shietzakaupf 1 year ago
Arjan, I really enjoy the content you have put out, but I would like to suggest that when you share your screen, you fill it with only code or use a larger font to make it easier to read on mobile devices.
@williamduvall3167 1 year ago
I can only watch like 10 min at a time of these, but they are so good! Thank you for triggering my imposter syndrome in a good way!
@ArjanCodes 1 year ago
Thanks so much William, glad the content is helpful!
@pabloshi4863 2 years ago
This is such a nice video! For the first time I laughed out loud reading code. The original code was written by an amateur (or, impolitely, is a piece of shirt), yet it works, hence I appreciate the hard work of the original author. Watching the refactoring process is a joy. What I don’t understand is why there are no likes!
@manuelpineda9067 2 years ago
Great episode! Can't wait for part 2
@thefrator5275 2 years ago
Excellent video and I learned a lot. Plus it's entertaining. Thank you for your contributions to the Python community.
@johnvillalovos 2 years ago
I do this code cleanup sometimes. One of the first things I like to do is run 'black' and 'isort' on the code as a baseline. Surprising how much badly formatted code is out there. Well, badly formatted in my opinion, because I have gotten used to having almost all the code I work on be formatted by 'black' and 'isort'.
@TheBalmix 2 years ago
"... AND THEN DON'T DO IT!!!" Roger that!
@kristjanjonsson3843 2 years ago
Your best content is easily code roasts
@bensums 2 years ago
Refactoring is nice but you broke the most_common functionality by passing sets to FreqDist
@ArjanCodes 2 years ago
You're right, I was a bit too optimistic changing everything to sets, haha. I'll change this back to a list in the code in the repository.
@antonhajdu8464 2 years ago
We want more refactoring!
@SeamusHarper1234 2 years ago
This is soooo useful, I really love to see this advanced stuff, because you fix so many problems that my own code has.. And the scraper is a lot clearer, even after the first video.
@soaiside9555 2 years ago
This video is very good. I love watching it. It's like pair programming but better!
@ArjanCodes 2 years ago
Thanks, glad you liked it!
@firefouuu 2 years ago
Isn't it considered a bad practice to use a function defined outside of a class without passing it as an argument? You hear everywhere that functions shouldn't use variables defined outside of their local scope, shouldn't it be the same for classes?
@chilltake 2 years ago
Yes, for the most part - but this doesn't hold true for CONSTANTS and configurations:)
@jacksims8018 2 years ago
Brilliant. Thank you for this. Also I would 100% queue up to see a stream of this type of thing.
@BrunoReisVideo 2 years ago
When working with local modules like this, is all you have to do create a folder with a bunch of py files and the __init__ file? Or did you have to do anything else to be able to "import scrape.scraper"?
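The layout asked about above can be tried out in a throwaway directory. This is a minimal sketch: the package name `scrape` and module name `scraper` come from the question, while the `greet` function is a made-up placeholder. The key assumption is that the folder containing `scrape/` is on `sys.path`, which is normally the case when you run a script from the project root.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project_root:
    # project_root/
    #   scrape/
    #     __init__.py   (empty file that marks the folder as a regular package)
    #     scraper.py
    package_dir = Path(project_root) / "scrape"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "scraper.py").write_text(
        "def greet():\n    return 'hello from scrape.scraper'\n"
    )

    # The parent folder of the package must be importable.
    sys.path.insert(0, project_root)
    try:
        scraper = importlib.import_module("scrape.scraper")
        message = scraper.greet()
    finally:
        sys.path.remove(project_root)

print(message)
```

In a real project the `importlib.import_module` call is just `import scrape.scraper` at the top of the script.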
@samarbid13 2 years ago
More refactoring! ❤
@0xSLN 2 years ago
This content is so helpful in helping me learn python quickly, great format and please keep the series going! Learning theory from practice is 👑
@ArjanCodes 2 years ago
Good to hear it’s helpful to you, Oscar! Will definitely keep the series going!
@niconeumann2963 2 years ago
Great video, thanks a lot! What do you think about writing tests first to make sure that you don't break anything with the refactoring? Or writing tests while refactoring the code :)
@ArjanCodes 2 years ago
Thank you Nico! I’ve considered doing that. It would actually be my preferred way of doing the refactors, but I’m concerned that it’s going to make these videos too long. But perhaps I shouldn’t judge too quickly and try it at least once.
@niconeumann2963 2 years ago
@ArjanCodes Your videos have very high quality and I like how you refactor the code stepwise with great explanations. For me the length is totally okay because it is interesting to follow you and I learn something new :) On the topic of tests, I have found a lot of videos where easy unit tests for 1-2 functions are shown. But it would be quite interesting to see a real-world example on a more complex project like the one you are showing.
@ncmathsadist 1 year ago
At about 13:00, you see an example of the Blofeld Principle: Never let a variable outlive its usefulness.......
@binboy09 2 years ago
A great series thanks so much. It's great seeing how so much of this can be applied to my own code.
@sylvainprive1754 1 year ago
I’m not too sure, but I think the previous developer miscoded your "compute_filtered_token" function. In the list comprehension (that you changed to a set), he is checking if w is not in stop_words and name_words … but name_words is treated as a Boolean here, right? I might be wrong, but I would write set(w for w in … if w not in stop_words and w not in name_words), or something like this?
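The suspicion above is right about how Python parses that condition: `w not in stop_words and name_words` is read as `(w not in stop_words) and name_words`, so a non-empty `name_words` is simply truthy. A small sketch with made-up word lists shows the difference.

```python
# Hypothetical sample data, chosen only to expose the difference.
stop_words = {"the", "a"}
name_words = {"arjan"}
tokens = ["the", "arjan", "scraper", "a", "pdf"]

# Parsed as (w not in stop_words) and name_words: the second operand is
# just the truthiness of a non-empty set, so names are never filtered out.
buggy = [w for w in tokens if w not in stop_words and name_words]

# Each membership test has to be spelled out.
fixed = [w for w in tokens if w not in stop_words and w not in name_words]

print(buggy)
print(fixed)
```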
@bckzilla 2 years ago
So I like this a lot. Only it's a bit confusing with the red wavy lines and all the red areas in the right window of the IDE. I am used to that indicating errors. But super content!
@kevon217 1 year ago
Great content. Very helpful!
@ArjanCodes 1 year ago
Glad to hear it!
@klaasvaak2575 1 year ago
Hm, I always thought of trying to use classes in my Python code but was not yet able to make projects that required it. Now I noticed it is mostly that I create my code for single projects with one clear target. It looks like as soon as all functions need to be used multiple times without prior knowledge, a class is the next level to use to keep code clean. Of course I'm curious how you are going to clean this; it looks more like my level of code, so it's closer to what I am currently able to learn about.
@dinmammadennis 2 years ago
Looking forward to part 2!
@jankucera8505 2 years ago
these roasts are the best, nothing else helps me learn faster
@sthiag0 2 years ago
Really like this series. Can people submit? Also, one question: why does his vscode complain so much with the squiggles even before he starts refactoring? Keep up the good work!
@NiallOCallaghan 2 years ago
It's a setting you can turn on in Pylint (at least partially). If you search for linting, you can change the level of aggression of the highlighting and whether to apply it to just the file you're working on or the full workspace.
@Firegregor1 2 years ago
Good job, yet another case where a dataclass is useful. I see you are probably using the Vim plugin for VS Code. I'm not sure if you are new to Vim or are holding back its potential so as not to confuse viewers. Here are some shortcuts that, in my opinion, can make your on-screen editing faster without adding confusion:
- dap - remove a paragraph (usually a method is one)
- gUw / guw - make a word uppercase / lowercase
- . - repeat the last operation (like removing self)
@ArjanCodes 2 years ago
Thanks for the tips! I've been using Vim for about a month, still learning a lot every week :).
@Firegregor1 2 years ago
@ArjanCodes I have used Vim for over 3 years. I remember my biggest quality-of-life improvement in Vim was when I found out about text objects (i/a). As in the first example, which operates on an entire paragraph, this can be extended to words, brackets, quotes, tags and even sentences (if you are editing text, not code). Usage: operate on the object with the cursor inside it (it doesn't have to be at the beginning):
- i( - everything inside the brackets
- a( - like the above, including the brackets themselves
@romaindesparbes7251 2 years ago
Awesome video! I don't really get why you convert `self.all_words` to a set at 26:55 though. Doesn't it add some unnecessary overhead as the argument of the `intersection` method can be an `Iterable` and not necessarily a `set`?
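A quick sketch confirms the point about `intersection`: the method accepts any iterable, while the `&` operator requires a set on both sides. The variable names and sample words are made up for illustration.

```python
# Hypothetical sample data.
target_words = {"bias", "model"}
all_words = ["model", "data", "model", "bias"]  # a plain list with duplicates

# The method form accepts any iterable, so no conversion is needed.
found = target_words.intersection(all_words)

# The operator form insists on sets on both sides.
try:
    target_words & all_words
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

print(found)
print(operator_accepts_list)
```

Whether `all_words` should still be stored as a set depends on whether its duplicates are needed elsewhere, for example for frequency counts.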
@jacobwalters9660 2 years ago
This was fun to follow along with! Great content
@ArjanCodes 2 years ago
Glad you enjoyed it, Jacob!
@oscarsix4702 2 years ago
Does anyone know how @ArjanCodes has the option to see the line numbers from his selected line? example at 21:27 Is it a plugin or a VSC hotkey?
@marcotroster8247 2 years ago
Haha. Those messy data science code files seem so familiar 😂 Idk why it's always the data science projects that become such a time bomb 😅
@djhoese 2 years ago
Technically the type annotation for the "frequency" would be `list[tuple[Union[str, int], ...]]` to indicate that the tuple contains an arbitrary number of strings or integers. At least I think that's right.
@ArjanCodes 2 years ago
Good to know, David - thanks!
@NateROCKS112 2 years ago
Based on the type hint for the most_common method (which we can see for a single frame in the video at 16:19), and its examples, it looks like list[tuple[str, int]] actually is the correct type hint. tuple[str, int] means a tuple with first element "str" and second element "int," edit: and this matches up for a frequency table.
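To make the two readings of the type hint concrete: `tuple[str, int]` describes a pair whose first element is a `str` and whose second is an `int`, whereas `tuple[X, ...]` describes a tuple of any length with elements of type `X`. The sketch below uses `collections.Counter` as a stand-in for the frequency counter in the video (an assumption), and the function name `word_frequencies` is hypothetical.

```python
from collections import Counter


def word_frequencies(words: list[str], n: int) -> list[tuple[str, int]]:
    """Return (word, count) pairs: each tuple is exactly one str and one int."""
    return Counter(words).most_common(n)


pairs = word_frequencies(["pdf", "web", "pdf"], n=2)
for word, count in pairs:
    print(f"{word}: {count}")
```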
@kuriankattukaren 2 years ago
Very good example.
@ArjanCodes 2 years ago
Thank you, glad you liked it!
@dmytrokorbanytskyi1586 2 years ago
Nice work! What do you think about temp variables that are used immediately after creation, such as `doi = get_doi()` followed by `return {"doi": doi}`? Maybe it is better to avoid them and make the call inside the returned object: `return {"doi": get_doi()}`
@ArjanCodes 2 years ago
Thanks! What I do mostly is if the expression is simple, such as a basic function call with few arguments, directly use it in the object like in your second piece of code. If the expression is more complex, I use a separate variable for clarity.
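The rule of thumb in this reply might look as follows. This is only a sketch: `get_doi` is a placeholder that returns a dummy value, and the title parsing is invented to stand in for "a more complex expression".

```python
def get_doi() -> str:
    """Hypothetical stand-in for the real DOI lookup; returns a dummy value."""
    return "10.0000/placeholder"


def build_record_inline() -> dict[str, str]:
    # Simple call with no arguments: using it directly reads fine.
    return {"doi": get_doi()}


def build_record_with_variable(raw_text: str) -> dict[str, str]:
    # A longer expression is easier to follow when split into named steps.
    first_line = raw_text.strip().splitlines()[0]
    title = first_line.removeprefix("Title:").strip()
    return {"doi": get_doi(), "title": title}


print(build_record_inline())
print(build_record_with_variable("Title: A Paper\nbody text"))
```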
@alexanderzikal7244 7 months ago
The face at 10:15 😃😃😃
@aadithyavarma 2 years ago
Great video! For any file or path, isn't using Pathlib the better option?
@padreigh 2 years ago
Great roast :) learning something all the time. set.intersection(iterable) works on any iterable - not sure the conversion to set is needed or useful :)
@ArjanCodes 2 years ago
Good to know! I do think set actually still makes sense, because there's no need to have duplicates of words, and the order also doesn't really matter :).
@sagiziv927 2 years ago
Another great video, thank you Arjan 😃 In my current project, I need to create an instance of the sub-class based on a string from a configuration file. I have the same implementation as in the video, where I override the __init_subclass__ function and save the sub-classes into a dictionary based on a string. The problem is that all the sub-classes must be in the same file as the super class, otherwise they won't be added to the dictionary (because the files would never be loaded). I thought about using `importlib` to import the packages, but then how do I get the module's name? Do I hard-code it, or is it the same name as in the config? How would you separate the classes into their own files whilst maintaining the same functionality?
@theninjascientist689 2 years ago
Wouldn't you be able to do "from subclassfile import subclass" at the end of your file? Realised halfway through writing the above reply that that completely ruins the point of doing this in the first place. Yes, I'd use something like os.walk to find all of the python files in the same directory and importlib to import them. In fact, I have the same exact problem in my code, I'll try it myself and get back to you.
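One way to approach the question in this thread is sketched below, under stated assumptions: subclasses register themselves through `__init_subclass__`, and a helper imports every module in a plugin package so that registration actually runs. The names `Scraper`, `PdfScraper`, `load_plugins`, `create_scraper` and the `key` keyword are hypothetical; the helper uses `pkgutil.iter_modules` where the reply suggests `os.walk`.

```python
import importlib
import pkgutil


class Scraper:
    """Base class that records every subclass under a string key."""

    registry: dict[str, type["Scraper"]] = {}

    def __init_subclass__(cls, key: str = "", **kwargs):
        super().__init_subclass__(**kwargs)
        if key:
            Scraper.registry[key] = cls


def load_plugins(package_name: str) -> None:
    """Import every module in a package so its subclasses get registered.

    Assumes the subclasses live in modules directly inside `package_name`.
    """
    package = importlib.import_module(package_name)
    for module_info in pkgutil.iter_modules(package.__path__):
        importlib.import_module(f"{package_name}.{module_info.name}")


# Defined here only to keep the sketch self-contained; in the multi-file
# setup this class would sit in its own module inside the plugin package.
class PdfScraper(Scraper, key="pdf"):
    pass


def create_scraper(kind: str) -> Scraper:
    """Look up the class by the string read from the configuration file."""
    return Scraper.registry[kind]()


print(type(create_scraper("pdf")).__name__)
```

With this layout the module names do not have to match the config strings; only the `key` given in each class definition does.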
@pythonistaprogramm 2 years ago
To me it looks like a prototype for a startup when you don't have good specifications.
@1oglop1 2 years ago
I've seen people adding "self" to instance variables just because IDE said " this can be a static method because you are not using self".
@adjbutler 2 years ago
Great video!
@CoentraDZ 2 years ago
I'm learning a lot from you ❤️
@67Keldar 2 years ago
OMG... Awesome... Thanks
@ArjanCodes 2 years ago
Thanks, glad you liked it!
@Optimusjf 2 years ago
Excellent
@xavierpierre5586 2 years ago
Impressive wow
@HexenzirkelZuluhed 2 years ago
I'd wager your tests might fail. Unfortunately I'm already subscribed and liked the video, so I cannot comply.
@ashoomow 2 years ago
@ArjanCodes, do you have Patreon?
@ArjanCodes 2 years ago
Hi Anthony, I don’t have Patreon, but I do have a buymeacoffee link if you’d like to support me: www.buymeacoffee.com/arjancodes.
@ErikS- 1 year ago
Scihub is an illegal download site for getting free access to academic papers... Not sure if it's a good idea to include such things in the videos on this channel.
@Hamsters_Rage 2 years ago
Why are you doing all this refactoring manually instead of using the IDE's built-in refactoring tools? It makes beginner developers think that renaming functions and classes is as easy as you make it look, which is wrong for large projects.
@abdeljalilyahya6361 1 year ago
Oh I heard scihub, that's illegal in most countries, careful with the algorithm.
@sergeytsybenko2786 2 years ago
Lord have mercy, if I had written the whole program in one file, nightmares would haunt me
@Andrumen01 2 years ago
Man, academicians are so bad at writing code...it's embarrassing! I say it from personal experience, my colleagues are great computational physicists, but horrible software engineers. Kudos for helping.
@JohnFallot 2 years ago
Worse yet, I’m not even an academician! At best you could say that I’m a ‘citizen behavioral researcher’, who happened to pick up Python this past year 😅
@Andrumen01 2 years ago
@JohnFallot 😅, I am also guilty of it, but trying to improve. It is a challenge, I do have to say.