Binary data exercise: how to tell if a file is a jpeg?

13,752 views

Jacob Sorber

1 year ago

Patreon ➤ / jacobsorber
Courses ➤ jacobsorber.thinkific.com
Website ➤ www.jacobsorber.com
---
Binary data exercise: how to tell if a file is a JPEG? // Today I thought we would look at binary data, in the form of JPEG images, and specifically see if we can write a program that tests whether or not a file is a JPEG. We're looking at this in both C and Ruby, so you can see some of the pitfalls that you can run into with higher-level languages.
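A minimal C sketch of the kind of check described above, assuming the test is simply a comparison of the first three bytes of the file against the JPEG signature FF D8 FF. The function name looks_like_jpeg and its return convention are illustrative and not taken from the video.

```c
#include <stdio.h>
#include <string.h>

#define MAGIC_NUM_BYTES 3

/* Hypothetical helper (not from the video): returns 1 if the file starts
   with the JPEG signature FF D8 FF, 0 if it does not, and -1 if the file
   cannot be opened. */
int looks_like_jpeg(const char *path)
{
    static const unsigned char jpeg_magic[MAGIC_NUM_BYTES] = {0xFF, 0xD8, 0xFF};
    unsigned char header[MAGIC_NUM_BYTES];

    FILE *fp = fopen(path, "rb");   /* binary mode; the "b" matters on Windows */
    if (fp == NULL) {
        return -1;
    }

    size_t got = fread(header, 1, MAGIC_NUM_BYTES, fp);
    fclose(fp);

    /* A short read (tiny file or read error) is treated as "not a JPEG" here. */
    if (got != MAGIC_NUM_BYTES) {
        return 0;
    }
    return memcmp(header, jpeg_magic, MAGIC_NUM_BYTES) == 0;
}
```

A matching signature only says the file claims to be a JPEG; it does not validate the rest of the file.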
Related Videos:
Hexadecimal: • Why do programmers use...
Hex Editors: • Tame binary headaches ...
***
Welcome! I post videos that help you learn to program and become a more confident software developer. I cover beginner-to-advanced systems topics ranging from network programming, threads, processes, operating systems, embedded systems and others. My goal is to help you get under-the-hood and better understand how computers work and how you can use them to become stronger students and more capable professional developers.
About me: I'm a computer scientist, electrical engineer, researcher, and teacher. I specialize in embedded systems, mobile computing, sensor networks, and the Internet of Things. I teach systems and networking courses at Clemson University, where I also lead the PERSIST research lab.
More about me and what I do:
www.jacobsorber.com
people.cs.clemson.edu/~jsorber/
persist.cs.clemson.edu/
To Support the Channel:
+ like, subscribe, spread the word
+ contribute via Patreon --- [ / jacobsorber ]
Source code is also available to Patreon supporters. --- [jsorber-youtube-source.heroku...]

Comments: 76
@litlkaiser 1 year ago
The read method in ruby has an option for encoding, e.g. f.read(3, encoding="UTF-8")
@JacobSorber 1 year ago
Good point. Thanks!
@Raugturi 1 year ago
Maybe it's a little pedantic of me, but I'd just let Ruby read it in as binary and in my compare do "\xFF\xD8\xFF".encode('ASCII-8BIT'). We make the test bytes match what we expect rather than mutating what we're reading in to see if the mutated value matches something else. And I think there's an alias of 'BINARY' so "\xFF\xD8\xFF".encode('BINARY') should also work and is maybe more explicit about what we want.
@redcrafterlppa303 1 year ago
@Raugturi Can't Ruby just read the bytes as a number array and let you create a number array with 3 hex constants and compare them? It would be weird if any language couldn't do numbers and arrays.
@shivisuper91 1 year ago
@Raugturi was about to write the exact same comment 😅
@suncrafterspielt9479 1 year ago
Let's have a deep dive into the metadata
@NonTwinBrothers 1 year ago
Definitely would be interested if more file format videos are to come :)
@donaldmickunas8552 1 year ago
Interesting. I’m taking a Python course currently. This will make an interesting exercise in Python. Thanks! 😀
@CoolKoon 8 months ago
I'm pretty sure that this is not an issue in Python though as long as binary mode is being used.
@bolter841009 1 year ago
Thank you for the example! Really a good intro to magic bytes 🙂 Maybe another nice video would be a simple jfif/exif low-level parser - just the big stuff - display the block type and size, could be useful for integrity check 🙂 most out-of-the-box libraries “fix” minor errors or ignore erroneous information when possible.
@unperrier5998 1 year ago
Python 3 doesn't have this problem, as you can read binary data directly and get "bytes" objects.
@JacobSorber 1 year ago
Thanks. Good point. I've definitely still seen the bytes object cause some newcomer confusion when the programmer doesn't understand why a bytes object and a string object (with a string of bytes) are not the same thing. It feels like a different flavor of the same problem, but maybe/probably a more sensible solution.
@reverse_shell 1 year ago
Yes for metadata and more file disassembly please!
@valsk01 1 year ago
I love your vids... they helped me understand pointers :)
@JacobSorber 1 year ago
Yessss! Glad they helped.
@pseudopseudo3679 1 year ago
a video on reading/writing bitmap data would be cool :)
@samuelmartin7319 1 year ago
I would love more videos on this topic!
@rexjuggler19 1 year ago
❤ This is one of my favorites! A real-world example. Unix/Linux has "magic" built in, and I've edited /etc/magic by hand to add file types, but I do data migrations, so this in-depth how-to is very helpful for building custom tools myself regardless of platform. It may be a bit off the channel topic, but it might be useful for you to do a video on encoding (ASCII, UTF, ISO, EBCDIC) and maybe even add byte transmission like 7-bit even parity, 8-bit no parity... stuff like that. Working at the atomic level of data is very helpful to develop a better understanding of computers in general.
@justcurious1940 1 year ago
Thanks Jacob, great video. Let's play with sockets and threads, I think it will be more fun.
@robertstrickland9722 1 year ago
I would love to see some videos on binary file manipulation, especially something like writing your own encryption program.
@tommcboatface1908 1 year ago
Great video!
@DelgardAlven 11 months ago
It feels like home when somebody types things in C. Things are exactly what they are 99% of the time, and the remaining 1% is just rare machines’ unique memory conventions, and nothing else.
@k1defjoel397 1 year ago
I'm on a roll here binge watching your videos. Super impressed. Hoping you can help me connect the dots of my limited understanding. I was under the impression C++ has better string handling capabilities and figured C++ would be your go-to. In this case, you chose C for its simplicity and its lack of encoding confusion. Does that mean C++ adds complexity compared to C, such that you'd need to concern yourself with encoding choices?
@JacobSorber 1 year ago
Thanks! I'm glad they're helping out. C++ doesn't do any automatic string encoding stuff, and you could do essentially the same thing that I did here using C++ (fopen, fread are available in both). But, yes, C++ does provide some nice string-related tools (nowhere near what you get from Python or Ruby), but when I'm working with binary data and individual bytes, I often don't see an advantage to using C++, unless I need object-oriented stuff elsewhere in the code. In this case, I didn't.
@raul_ribeiro_bonifacio 1 year ago
Just found out about this channel. Nice content!
@JacobSorber 1 year ago
Thanks, and welcome!
@coolbrotherf127 1 year ago
That's pretty cool. I've never tried to do this before.
@andreisoceanu4320 1 year ago
I love this way of questioning everything. Next: How to tell if a JPEG is a file.
@JacobSorber 1 year ago
You would have to define what you mean by "JPEG". 🤔
@andreisoceanu4320 1 year ago
@JacobSorber that is easy, just #define JPEG
@gerdsfargen6687 1 year ago
Could you check if it is not a real jpg but may have some hidden data within the file?
@gerdsfargen6687 1 year ago
I know probably not to expect any reply from Jacob 😢
@JacobSorber 1 year ago
Not sure I understand the question. Are you asking if the magic numbers could check out and it not follow the JPG format - maybe holding other data? Yes, it could.
@gerdsfargen6687 1 year ago
@JacobSorber oh wow, hi Jacob! I suppose yes, I'm asking if those magic numbers could pose as a JPG yet maybe carry some hidden data within those very magic numbers. I want to thank you for your reply, and will take it on board when checking an example of this case out. Cheers!
@eddaly4160 1 year ago
Great video as usual, What is the proper way to exit a program?..."exit(EXIT_FAILURE)" or with "return EXIT_FAILURE"...or use EXIT_SUCCESS...not sure when to use "return" or "exit" to end the program.
@JacobSorber 1 year ago
With most C runtimes, main is called from some other libc startup routine, roughly like this: result = main(argc, argv); exit(result); So, returning from main and calling exit are essentially doing the same thing. I suppose that calling exit from main rather than returning might be slightly more efficient (avoiding a function call return), but it's not likely to make any difference (especially once compiler optimizations get involved).
@ercntreras 1 year ago
Nice!
@zrodger2296 1 year ago
"Strings are strings; bytes are bytes." That's the way it should always be! Good video.
@torrenttv7567 1 year ago
Please make a next video of socket server handle part 4 - event driven
@anon_y_mousse 1 year ago
That's completely bonkers to me that Ruby would have that issue. Especially considering that 7-bit ASCII maps exactly into UTF-8. At this point I should probably be keeping a list of reasons to never learn Ruby. As an aside, being a Linux user I would never bother to write such a program because `file` exists and I could just check with it, however, a good example you might want to make a future video for is serializing data structures. I prefer to use a text based method, such as TOML, for simple structures, but when it's complex I use the binary approach.
@redcrafterlppa303 1 year ago
I wrote an image grouping tool that groups PNG files into 1 file (PNG has its own set of magic numbers) and I thought it would be problematic if my separator identifiers randomly appeared in the binary file (unlikely but possible). My theory (not confirmed) was that the magic numbers of the file format likely won't appear in the binary. So I literally packed the images byte for byte after each other and separated them by splitting them at the magic start. It works and it's as efficient as possible.
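A rough C sketch of the splitting step described above, assuming the files are simply concatenated in memory and located again by scanning for the 8-byte PNG signature (89 50 4E 47 0D 0A 1A 0A). The helper name find_png_starts is hypothetical. As the comment itself says, the theory is unconfirmed: nothing prevents those 8 bytes from also occurring inside image data, so this only illustrates the scan.

```c
#include <stddef.h>
#include <string.h>

static const unsigned char PNG_SIGNATURE[8] =
    {0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

/* Hypothetical helper: stores up to max_offsets positions at which the PNG
   signature occurs in buf, and returns the total number of occurrences. */
size_t find_png_starts(const unsigned char *buf, size_t len,
                       size_t *offsets, size_t max_offsets)
{
    size_t found = 0;

    for (size_t i = 0; i + sizeof PNG_SIGNATURE <= len; i++) {
        if (memcmp(buf + i, PNG_SIGNATURE, sizeof PNG_SIGNATURE) == 0) {
            if (found < max_offsets) {
                offsets[found] = i;
            }
            found++;
            i += sizeof PNG_SIGNATURE - 1;   /* continue after this match */
        }
    }
    return found;
}
```

Storing each image's length in front of it would avoid relying on the signature never appearing by chance, at the cost of a few extra bytes per image.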
@flippert0 8 months ago
6:51 nice jab at Windows
@minhajsixbyte 10 months ago
basically an oddly specific version of "file" command/program
@beardlyinteresting 1 year ago
I usually do a typedef char byte, or more specifically, if I know I'm only targeting C99 and later, typedef uint8_t byte, just to be explicit when writing code. So when I saw a char array of size 3, my brain went "that's not big enough, what about the null terminator?" It took my brain a second to go, "no, it's just a byte array of fixed length, so we know how long it is" 😅 Also, only using printf for error messages instead of fprintf(stderr, ...); bugged me way more than it should, lol. Good explanation though, and I love the Ruby example because I've run into encoding issues before with scripting languages, but never had an encoding problem with C. So it's certainly something to keep in mind.
@redcrafterlppa303 1 year ago
Wait until you try to use Windows and C/C++ and you get char, signed char, unsigned char, wchar_t, char8_t, char16_t, and char32_t, all being different character types used in various functions in Windows. PS: And yes, I looked it up: char is defined as being neither unsigned char nor signed char in the Microsoft compiler.
@beardlyinteresting 1 year ago
@redcrafterlppa303 Yeah, that's why I use Linux 🤣
@MECHANISMUS 1 year ago
Helpful presentation! Why put env in the shebang line instead of the interpreter alone? It seems it would be prettier with a collapsed or squashed explorer.
@ramadhanafif 1 year ago
Yes, this encoding bs really frustrates me when I'm doing byte- or bit-level manipulation in Python. Things that are seemingly so easy in C can get tangled due to mismatching data types.
@redcrafterlppa303 1 year ago
The worst thing is you don't directly see the datatypes because Python is stupid (sorry not sorry)
@thomas_m3092 1 year ago
Why does the C version work? FF and D8 are outside the range of a char, which is normally signed. Shouldn't the compiler warn about it?
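One way to look at this question, as a sketch that assumes the buffers are compared byte-for-byte (for example with memcmp) rather than element by element against int constants: memcmp treats every byte as an unsigned char, so it does not matter whether plain char is signed. The signedness only bites when a single char is compared with a constant such as 0xFF. The function names below are illustrative.

```c
#include <string.h>

/* Compares a plain-char buffer against a plain-char string literal.
   memcmp interprets each byte as an unsigned char, so the signedness
   of char does not affect the result. */
int starts_with_jpeg_magic(const char *buf)
{
    return memcmp(buf, "\xFF\xD8\xFF", 3) == 0;
}

/* The pitfall: where char is signed, a char holding the byte 0xFF is
   promoted to the int -1, which is not equal to 255. */
int naive_is_ff(char c)
{
    int promoted = c;            /* the value an == comparison would see */
    return promoted == 0xFF;
}

/* Converting to unsigned char first gives the expected answer whether
   plain char is signed or unsigned. */
int safe_is_ff(char c)
{
    return (unsigned char)c == 0xFF;
}
```

Declaring the buffers as unsigned char (or uint8_t) sidesteps the question entirely.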
@JonnyRobbie 1 year ago
Why not #define the MAGIC_NUM_BYTES? I know the low level difference between defines and declarations, but what is the high level practical difference?
@JacobSorber 1 year ago
That would work, as well. I have a video about this somewhere in the list. Making it a variable allows the compiler to help you with type checking and some forms of error detection. #define might in some cases have performance advantages (I don't think it would in this case). For this example, both are viable options.
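One further practical difference, specific to C (C99 through C17) rather than C++, shown here as a sketch with illustrative names: a #define (or enum) constant is an integer constant expression, so it can size an array that also has an initializer. A const int is not a constant expression in C, so an array sized with it at block scope is a variable-length array, and a variable-length array may not have an initializer list.

```c
#include <stddef.h>

#define MAGIC_NUM_BYTES 3          /* integer constant expression */
static const int magic_len = 3;    /* typed object, but not a constant expression in C */

/* Accepted by standard C: the array size is a constant expression. */
static const unsigned char jpeg_magic[MAGIC_NUM_BYTES] = {0xFF, 0xD8, 0xFF};

int magic_sizes_agree(void)
{
    /* unsigned char scratch[magic_len] = {0};
       would be rejected by a C99..C17 compiler: scratch would be a
       variable-length array, which cannot be initialized this way. */
    unsigned char scratch[MAGIC_NUM_BYTES] = {0};
    (void)scratch;

    return sizeof jpeg_magic == (size_t)magic_len;
}
```

Some compilers accept the commented-out form as an extension, so code written that way may build with one compiler and fail with another.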
@unperrier5998 1 year ago
At 16:50, isn't it better to encode the UTF-8 string into 8-bit ASCII instead? How can you be sure that the 3 bytes read from the file form valid UTF-8?
@JacobSorber 1 year ago
Yeah, probably. I just picked one, since I just needed the encodings to match. But, yes, if I were doing anything else with the strings, forcing both to ASCII-8BIT would have been better.
@rdwells 1 year ago
@JacobSorber In this particular case, since you know that the string you're looking for is not a valid UTF-8 string, I think you'd definitely be better off using the ASCII-8BIT encoding. Otherwise, you're depending on the language to do the right thing when comparing invalid UTF-8 strings. If it is doing a byte-by-byte comparison you're safe, but if "equal" means "represents the same sequence of Unicode code points", I would think all bets are off if the strings being compared are not valid UTF-8 strings. (It is possible for two UTF-8 strings that are not identical byte-wise to be equal in terms of what they encode.)
@XESCoolX 1 year ago
10:17 I know it’s not super important, but I think it would make more sense to have this print “No!” instead of return an error. Because if it’s reading less, i.e. if the file is smaller than 3 bytes, then we know that it’s not a JPG. I’m not sure if it’s possible though for fread to fail, but the file may still be a JPG? Would it be safe to assume this doesn’t happen if fread does not return 3?
@redcrafterlppa303 1 year ago
You can check whether a file I/O error occurred or you hit end of file by calling and checking feof() (end of file) and ferror() (error). So the number returned really just serves to confirm the success case. To fully answer your question, you could do some retrying by rerunning fread or completely reopening the file in case fread did not fail with EOF (which would confirm it not being a JPG, as it is smaller than 3 bytes). But I'm not sure whether putting in this much effort would be worth it in most cases.
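The feof()/ferror() distinction above could be sketched in C roughly as follows. The enum and function names are hypothetical, and retrying is left out: a short read that is not an error is simply reported as "not a JPEG".

```c
#include <stdio.h>
#include <string.h>

enum magic_result { MAGIC_MATCH, MAGIC_NO_MATCH, MAGIC_IO_ERROR };

/* Hypothetical helper: reads the first three bytes from an open stream and
   separates "too short to be a JPEG" from "the read itself failed". */
enum magic_result check_jpeg_magic(FILE *fp)
{
    static const unsigned char jpeg_magic[3] = {0xFF, 0xD8, 0xFF};
    unsigned char header[3];

    size_t got = fread(header, 1, sizeof header, fp);
    if (got < sizeof header) {
        if (ferror(fp)) {
            return MAGIC_IO_ERROR;   /* the read failed */
        }
        return MAGIC_NO_MATCH;       /* end of file: fewer than 3 bytes */
    }

    return memcmp(header, jpeg_magic, sizeof header) == 0
               ? MAGIC_MATCH
               : MAGIC_NO_MATCH;
}
```

With this split, a caller can print "No!" for a two-byte file and reserve the error exit code for genuine read failures.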
@billmoo 1 year ago
Any reason as to why you never closed the FILE * ?
@hashi856 9 months ago
How are you using that not equal sign?
@andrewporter1868 1 year ago
wide char argv tutorial for Win32 wen (wmain and wWinMain)?
@atabac 10 months ago
What IDE is that? It looks like it's using some syntactic sugar coating: it shows a not-equal symbol instead of != . It's a small thing, but kind of annoying that it hides the real code, hehe.
@yooyo3d 1 year ago
You should do proper jpg chunks reading. Testing the first 3 bytes is not enough. People can learn more about formats, and common practices how to work with binary data and maybe how to save and load their own binary files
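As a rough illustration of what reading the chunks could look like, here is a C sketch that walks the marker segments of an in-memory JPEG, assuming the standard layout: a start-of-image marker FF D8, then segments that each begin with FF and a marker byte followed by a 2-byte big-endian length that counts the length field itself. It stops at start-of-scan (FF DA) or end-of-image (FF D9). The function name is hypothetical, and details such as fill bytes and the compressed scan data are deliberately ignored.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: prints the marker and length of each segment and
   returns the number of segments seen, or -1 if the data is malformed. */
int walk_jpeg_segments(const unsigned char *buf, size_t len)
{
    if (len < 2 || buf[0] != 0xFF || buf[1] != 0xD8) {
        return -1;                          /* no start-of-image marker */
    }

    size_t pos = 2;
    int segments = 0;

    while (pos + 2 <= len) {
        if (buf[pos] != 0xFF) {
            return -1;                      /* every marker starts with FF */
        }
        unsigned char marker = buf[pos + 1];
        if (marker == 0xD9) {
            return segments;                /* end of image, no length field */
        }
        if (pos + 4 > len) {
            return -1;                      /* truncated before the length */
        }

        /* Big-endian length, including the two length bytes themselves. */
        size_t seg_len = ((size_t)buf[pos + 2] << 8) | buf[pos + 3];
        if (seg_len < 2 || pos + 2 + seg_len > len) {
            return -1;                      /* length runs past the data */
        }

        printf("marker FF%02X, length %zu\n", (unsigned)marker, seg_len);
        segments++;

        if (marker == 0xDA) {
            return segments;                /* compressed scan data follows */
        }
        pos += 2 + seg_len;
    }
    return -1;                              /* ran out of data without EOI or SOS */
}
```

Unlike the three-byte check, this rejects files whose header claims to be a JPEG but whose segment lengths do not add up.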
@sajolsajol8393 1 year ago
Sir, Suggest me a book where I can learn about these things...
@brockdaniel8845 26 days ago
Pretty nice
@__hannibaal__ 1 year ago
The programming world has very scary words like CPU, DPU, GPU, JPEG, ZIP, MPI, LLVM…; they pushed me far away, but when I dove in deep to understand, I found they only exist to distinguish between principles. That reminds me of the very scary mathematical theorems that push people away from studying mathematics.
@greg4367 1 year ago
Greetings from San Francisco. Let's get to the important stuff: Where can I get one of those malloc() T-shirts? It is not in your Merch section.
@JacobSorber 1 year ago
I'm glad you have your priorities straight. It should be on there now.
@greg4367 1 year ago
@JacobSorber I'll get my order in now, thanks.
@ForeverNils 1 year ago
Did you forget to close the file?
@pcuser80 1 year ago
Yep, I see no fclose(fp);
@RobertFletcherOBE 1 year ago
When a process exits, its resources are released.
@pcuser80 1 year ago
@RobertFletcherOBE Yep, I know that. But it is better to close/free everything. For a short-lived program you don't have to use free. For programs that run all the time, you must use free.
@JacobSorber 1 year ago
Yes, I did. Sorry about that.
@ForeverNils 1 year ago
@RobertFletcherOBE OK, but it would be nice to manually release a resource when it's not needed any more
@soniablanche5672 7 months ago
I don't think reading 3 random bytes as UTF-8 is a good idea. Not all 3-byte sequences are valid UTF-8, so your program might crash, give an error, or return a garbage string. I think it's better to convert the string you are comparing with to ASCII / bytes / char / whatever it's called in your language
@DhruvTrivedi 1 year ago
For those using GCC, you have to initialize magicNumber using malloc: char *magicNumber = malloc(MAGIC_NUM_BYTES * sizeof(char));