Reverse Engineering Data Files

  Views: 38,326

Tsoding Daily


Chapters:
- 0:00:00 - Announcement
- 0:01:03 - Intro
- 0:09:16 - Bootstrapping the Project
- 0:15:05 - Should you handle result of malloc?
- 0:16:53 - Porting build system to nob
- 0:19:17 - First Naive Attempt
- 0:23:54 - Saving PNG
- 0:27:59 - Frequencies
- 0:32:55 - Analyzing a bunch of files
- 0:35:58 - Logarithmic Scale
- 0:38:42 - Command Line Arguments
- 0:42:20 - Output file path
- 0:43:26 - ASCII pattern
- 0:44:06 - Debugging
- 0:48:21 - Analyzing a bunch of files
- 0:49:55 - How image data pattern should look like
- 0:52:38 - Analyzing x86_64 executables
- 0:55:29 - Generating arm64 executables with Go
- 0:59:28 - Comparing x86_64 and arm64
- 1:02:12 - Planning
- 1:03:09 - I like tmux
- 1:03:29 - Why do we have so many languages
- 1:05:16 - img2raw
- 1:11:19 - Parallel builds with nob
- 1:12:45 - How run async is implemented
- 1:14:14 - Silly MSVC being MSVC
- 1:16:47 - Analyzing raw images
- 1:18:01 - binviz
- 1:18:52 - nob sub-commands
- 1:22:57 - "Incremental" builds with nob
- 1:28:02 - Harvesting image data from $HOME
- 1:44:36 - Harvesting executable data
- 1:49:20 - Harvesting sound data
- 1:54:35 - Analyzing wav files
- 1:56:30 - Why do patterns look like that?
- 1:59:00 - Outro
References:
- 4 2 1 Christopher Domas The future of RE Dynamic Binary Visualization - • 4 2 1 Christopher Doma...
- Tsoding - musializer - github.com/tsoding/musializer
- MSVC - C complex math support - learn.microsoft.com/en-us/cpp...
- Tsoding - binviz - github.com/tsoding/binviz
Socials:
- Twitch: / tsoding
- Twitter: / tsoding
Support:
- BTC: bc1qj820dmeazpeq5pjn89mlh9lhws7ghs9v34x9v9

Comments: 105
@DarthVader-xu8oo 8 months ago
I lose my love for programming, I watch this man, then I'm in love again (the cycle repeats)
@qwqeqrqtqz 8 months ago
It has been a few years, but I also programmed something like this after watching that video. I had to chuckle when you called it binviz, because I called it binvis.
@anon_y_mousse 8 months ago
Are you British or from a former British colony? I do find it interesting that he used Z instead of S, but maybe he learned English from watching US TV shows.
@nashiora 8 months ago
I'm already using nob as my primary build tool for all of my projects, and I am fully prepared to migrate to the final version when it comes out. I've really enjoyed using it so far.
@kuyajj68 8 months ago
Tsoding is the person I wish I were 😂😂
@jordixboy 8 months ago
then go grind and code a lot, don't just wish, take action lol
@anon-fz2bo 8 months ago
only thing we have in common is duckduckgo
@kuyajj68 8 months ago
@jordixboy I code a lot, and I use arch btw.
@ce5983 8 months ago
Just get on the command line and start experimenting and learning. It would be great to have another Tsoding making interesting content.
@kuyajj68 8 months ago
@anon-fz2bo I use vanilla emacs 😂
@CrossbowBeta 8 months ago
Such a cool topic, this is my favorite zozin session so far
@metaltyphoon 8 months ago
Is that what he says on his intros? I can't ever tell :D
@rodelias9378 7 months ago
Really interesting stream. Thanks and keep up the good work!
@cobbcoding 8 months ago
12:56 this is the best home folder I've ever seen
@para-be4bf 7 months ago
1:52:07 you would probably enjoy making yourself a "slugify" command for such scenarios. It basically turns any string into an acceptable filename; the only thing you'd have to take care of is somehow handling the / character.
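A minimal sketch of the "slugify" idea from the comment above, in C. The function name `slugify` and the exact rules (lowercase letters and digits are kept, every other run of bytes including `/` collapses into one `-`) are assumptions made for illustration, not something shown on the stream.

```c
#include <ctype.h>
#include <stddef.h>

// Hypothetical helper: writes a filename-safe version of `src` into `dst`.
// ASCII letters and digits are lowercased and kept. Every other run of
// bytes (spaces, dots, '/', non-ASCII bytes in the "C" locale) collapses
// into a single '-'. Leading and trailing separators are dropped.
// Returns the number of bytes written, not counting the terminating NUL.
size_t slugify(const char *src, char *dst, size_t dst_cap)
{
    size_t n = 0;
    int pending_dash = 0;

    if (dst_cap == 0) return 0;

    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;
        if (isalnum(c)) {
            // Emit the separator lazily, so the result never starts
            // with '-' and never ends with one (unless truncated).
            if (pending_dash && n > 0 && n + 1 < dst_cap) dst[n++] = '-';
            pending_dash = 0;
            if (n + 1 < dst_cap) dst[n++] = (char)tolower(c);
        } else {
            pending_dash = 1;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Under these rules `"AC/DC - Back In Black.wav"` becomes `ac-dc-back-in-black-wav`. The extension is slugified too, so a real tool would probably split it off first.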
@tiberiumihairezus417 7 months ago
Thank you so much, the value from your videos is astonishing
@ce5983 8 months ago
Such an interesting topic Zozi, hadn't heard of this at all
@TsodingDaily 8 months ago
Same!
@olekbeluga314 8 months ago
You guys are basically building a spectrum analyzer for files, awesome :)
@arcxm 8 months ago
Very interesting stuff as always! Would love a follow up on this topic
@frechjo 7 months ago
Those patterns make sense for the most part.
In sounds, each sample will probably be more or less normally distributed, with the highest probability near the center and smaller probabilities near the extremes. Sound is usually normalized, and it's a wave that crosses zero periodically (127 likely represents 0 in this case). And what we see is that pairs involving two samples in the middle are more frequent than those involving one or two samples at the extremes.
For images, the lines parallel to the main diagonal are likely due to pixels having the same proportions between color components (the main diagonal being black to white), so monochromatic gradients.
For x86 code, it's really interesting. Vertical and horizontal lines should form when there's some particular value associated with a set of bytes, and when the values in that set only change in the lower bits. As pairs interleave (probably instructions and operands), verticals and horizontals switch for each pair. I think the x86 instruction set encodes registers in the lowest 3 or 4 bits? I don't really know anything about it, but that sounds like something I've seen somewhere.
Ascii is the most obvious one.
Now, ogg is amazing. My guess is that it has some very frequent bit width in its encoding, which ends up aligning a lot with byte boundaries, and that's what's causing those squares. But that's not a full explanation, just a general direction in which it could make sense.
The compressed files (other than ogg) could be explained by entropy (a good compression should eliminate patterns as much as possible), but I think that's not all of it. I think a bigger reason for not displaying discernible patterns is that they won't align with byte boundaries: compression schemes typically encode variable-width bit strings. PNG in particular uses some LZ scheme with Huffman encoding iirc. If there are patterns, they'll happen across bytes, each time appearing in different value positions.
Very interesting for sure :)
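For readers following the pattern discussion, here is a minimal sketch in C of the statistic being plotted: every pair of adjacent bytes is treated as an (x, y) coordinate in a 256x256 grid and counted. The `map[y][x]` indexing follows the comments in this thread; the function name `count_byte_pairs` is an assumption and not necessarily what binviz calls it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Counts how often each pair of adjacent bytes occurs in `data`.
// After the call, map[y][x] is the number of positions i such that
// data[i] == x and data[i + 1] == y.
void count_byte_pairs(const uint8_t *data, size_t size, uint32_t map[256][256])
{
    memset(map, 0, 256 * 256 * sizeof(uint32_t));
    // `i + 1 < size` also handles size == 0 without underflow.
    for (size_t i = 0; i + 1 < size; ++i) {
        uint8_t x = data[i];
        uint8_t y = data[i + 1];
        map[y][x] += 1;
    }
}
```

Smooth data such as a gradient `10, 11, 12, ...` only ever hits cells next to the main diagonal, which is one way to read the diagonal lines seen for raw images.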
@ratchet1freak 7 months ago
For x86 I believe the main culprit would be prefix bytes, which are common for instructions operating on 32 and 64 bit registers.
@frechjo 7 months ago
@ratchet1freak Lots of different bytes appearing immediately after and before a few repeating ones. Makes sense, you are probably right :)
@SerBallister 6 months ago
@ratchet1freak ARM32 had something similar: each instruction had a condition field like LE or GT, but in the general case it was AL (always), so every 32 bits you would see E at the start of the opcode, making it really easy to eyeball ARM code in a hex dump.
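The ARM32 observation above can be checked mechanically. The sketch below assumes little-endian A32 code, where the condition field is the top four bits of each 32-bit instruction word (so the high nibble of the fourth byte) and `0xE` means AL. The function name `count_cond_always` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

// Counts the 4-byte words in `code` whose condition field is AL (0xE).
// Assumes little-endian A32 encoding: bits 31..28 of the instruction
// are the high nibble of the byte at offset 3 within each word.
// Trailing bytes that do not form a full word are ignored.
size_t count_cond_always(const uint8_t *code, size_t size)
{
    size_t hits = 0;
    for (size_t i = 0; i + 4 <= size; i += 4) {
        if ((code[i + 3] >> 4) == 0xE) hits += 1;
    }
    return hits;
}
```

A ratio of hits to words close to 1 is a hint that a blob might be ARM32 code. It is only a heuristic, since arbitrary data can also contain bytes with a high nibble of `E`.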
@flleaf 5 months ago
Interestingly, wav files can also be stored as floats, so I wonder what that would look like. Also, he ran ffprobe on his wav files and it said signed 16 (on the stream).
@frechjo 5 months ago
@flleaf Ah, I probably missed that. 16-bit signed, that's a bit puzzling. IIRC, the wav files were symmetrical, with the highest frequencies towards the center. 2-byte two's complement numbers, that's interesting. I haven't thought about that one, or why it could look like that at all. For floats, I guess some pattern should emerge from a few repeating or similar exponents? (Depending on how they decompose into bytes.) That would be a nice one to look at :D
@johanngambolputty5351 8 months ago
Apart from where you transition from the end of one row to the beginning of another, it makes a lot of sense that the pattern is diagonal, because images are normally smooth, so the value of one pixel and a neighbouring one is usually not too different (for just an array of pixels anyway, although I guess rgb triplets might be packed next to one another, which could jump?).
@mire6134 8 months ago
Right, and it also makes sense that ASCII characters always form the same pattern, as certain characters are a lot more frequent than others and the frequency of each character is, to an extent, predictable given a long enough piece of text.
@ratchet1freak 7 months ago
It's more the values of the color channels, though. If there is a certain hue used often at a range of brightnesses over the image, then that's gonna be 2 diagonals, one for the RG pair and one for the GB pair.
@ivanjermakov 7 months ago
Regarding why data looks the way it does:
- Executable format lines represent instructions and registers; it is a line because names are usually multiple bytes
- Images usually look smooth, thus adjacent bytes are similar and diagonal lines appear
- Soundwave data consists of byte-sized samples, low value following high value, thus it "sticks" to the sides of the view
@bbq1423 8 months ago
Maybe it's time to create a binary format that always looks like Rick Astley when viewed via binviz
@EliasOjeda-mv6cg 8 months ago
After watching your videos, I got back to programming in C for fun.
@jtucker87 7 months ago
Hey, that's Derbycon! I'm in Louisville! Wasn't expecting that.
@adolfocarrillo248 8 months ago
Man, you've a keen mind 😂😂. Amazing what you can do with a computer language!!
@chiefxtrc 7 months ago
I think you could have bumped up the base "brightness" of the pixels to make it easier to see the lower values
@brissance 7 months ago
Sir, thank you for the excellent videos.
@jabere-flow2186 8 months ago
Tsoding Daily, which I search daily for programming magic.
@omaramo190683 8 months ago
yet another fucking good recreational programming session
@TheMASTERshadows 7 months ago
map[y][x] + 1 is a better fit, to omit sub-1 values. Edit: also, I was thinking maybe normalizing the values by dividing by the squared deviation would make the result less contrasty and remove the overshadowing by the high-frequency values.
@bbq1423 8 months ago
Another way to visualize 3d stuff would be to map the 3rd dimension to a color. That way you don't need a fancy 3d viewer to look at the visualization.
@CD4017BE 8 months ago
That method is not possible in this case because the dataset to visualize is effectively `float[256][256][256]`. If you put that on a 256 x 256 image, then each pixel still needs to represent float[256], which doesn't work with only 3 (or 4) color channels (you would need 256 color channels). But what you could do is make a video with 256 frames of 256 x 256 gray-scale images.
@MACAYCZ 8 months ago
I love you and your awesome content!♥
@iTsBadboyJay 8 months ago
I use a JetBrains editor, and if you make the mistake of opening multiple large repos, you will DEFINITELY run out of memory. When you restart the IDE, it restores your previous workspaces as expected, but instead of indexing just the current window and queuing the rest, it immediately tries to index all the repos you had opened previously. Then I get the "running low on memory" popup on Mac and everything grinds to a halt.
@KitsuneAlex 7 months ago
PNG is an interesting one because afaik it's essentially a bitmap inside a GZip archive.
@forayer 7 months ago
Very cool topic! Ty
@0ne87 8 months ago
"I don't even see the code. All I see is blonde, brunette, red-head."
@ChaoticNeutral6 8 months ago
This is an amazing video, although I can't get the line pattern at all when I tried this out at home (the x64 pattern works fine). I wonder, if you used steganography to hide an executable inside an image, would this sort of visualisation technique be able to identify that? Maybe it would just see higher entropy in the picture but no other signs.
@vantadaga 8 months ago
Reverse engineering is a really interesting field
@rogo7330 8 months ago
My solution for dependencies: just steal the code and fix it yourself. If you can't fix it yourself: dismember it and write it again. God bless MIT licenses, with which you can forget that you stole someone's code and you will not be crucified by the GNU church or sold on the black market by big corp.
@ThatGuyexe 8 months ago
Love this guy ❤
@LeysTeamProsperity 8 months ago
❤ Nice topic
@SiddheshPardeshi-mp9cr 8 months ago
Tsoding poggers as usual
@GlobalYoung7 6 months ago
Thank you 😊
@skr-kute1677 7 months ago
I think you could just apply log AFTER you normalize and it would work just fine. However, normalize to a max value of like 10 so that the log curve is actually "used". And you can avoid the undefined log of 0 by adding 1 to all the values. It doesn't affect the pattern much.
@KalinRangelov 8 months ago
Very cool stuff. But kind of missing the point on images. Binviz should recognize the file type. In order to recognize an image, you need to know it's an image and read it raw.
@nkusters 7 months ago
As for "why are raw audio files like that?": it's a wave, so most values will be at the extremes, near 0 and 255, because the wave slows down at those points before moving to the other side again.
@volbla 7 months ago
Haha, I did this on a text file and only got single pixel lines at the top and left of the image. It turned out the file was encoded in UTF-16, so every other byte was just zero ^ -^ Another loss for UTF-16. Great stuff.
@danielleontiev7134 7 months ago
My assumption as to why images look the way they do is that any image element is a run of pixels in the x direction, spanning multiple rows of the same section downward in the y direction. In the output it looks like the byte patterns are shifted right and down, because all images take their origin (0, 0) at the top left corner 🤔
@mire6134 8 months ago
How about training a generative model on this kind of data, then having it generate raw bytes and seeing what they look/sound like, depending on whether the model was trained on images or sounds.
@user-ni2we7kl1j 8 months ago
Sadly, it probably won't look very interesting, because the structure of the resulting 256x256 image simply describes probabilities for each pair of bytes. It's just not enough to capture meaningful details of the real data, so the generative model won't significantly outperform an actual bigram model.
@Lofen 8 months ago
Why not put nob in its own repo and simply use it as a submodule for new projects?
@Ivan-qw4mn 7 months ago
I am really eager to understand THE MATH behind this stuff (to be precise, it is probably mathematical statistics). Any ideas where exactly to start?
@hoteny 8 months ago
8:58 I would love a NN that can guess structures from a multistructured file though. Idk how that would work, though. Like, how to guess when a structure ends and the next begins.
@belst_ 8 months ago
You can probably scan small slices of the file, then when you find a pattern, increase the slice until the pattern gets more blurry, and then mark the area as that file structure, then continue after the marked area with a smaller slice again.
@hoteny 8 months ago
@belst_ yeah, but how do you decide the slice size and the slice increment to minimize risk while not making this operation take hundreds of years? Also, could another approach be possible, or an AI approach?
@rogo7330 8 months ago
I feel like man pages will produce a very specific pattern when doing that shit. Also, thinking while watching the video, you probably want to make the most popular pixels brighter, while rare hits must have a low value, because they are just random hits.
@aemogie 8 months ago
Is it possible to use nob's GO_REBUILD_URSELF from the main file instead of a separate build script?
@Author-Bangladesh 8 months ago
Why don't you use doom emacs or spacemacs? How does raw emacs save you time?
@ecosta 7 months ago
1:14:14 - This weirdness in MSVC is true for a bunch of POSIX stuff. They have "stat" as "_stat" and other shite like that. Stupid vendor-locks are stupid.
@cobbcoding 8 months ago
1:58:10 super smash bros in executable confirmed?
@angelomarano8458 8 months ago
I'm trying to make it 3D and obviously I can't. Any suggestions?
@fickthissut 8 months ago
Is there any possibility that you'll create an Odysee channel?
@diegorocha2186 8 months ago
Who needs sha256 if we have file fingerprints like this!
@ShanyGolan 8 months ago
Let's ggoooooo
@0ne87 8 months ago
cheese viz
@mrcrafter_y 8 months ago
13:09 Epic scheiße
@ymathh3808 8 months ago
Unfortunately the video has no subtitles :[ I'm not 100% fluent in English 😢
@iamdozerq 8 months ago
His accent is VERY easy to understand.
@PeterJepson123 8 months ago
Here is a thought. With a powerful neural network it might be possible to reverse this process and produce an executable binary from the image. Lol. And then we could apply the gaussian diffusion process which is used by midjourney et al. to mix different images based on labels and produce entirely new binary files. Then we could skip the programming and compiling altogether and simply text-prompt the features we want and produce a binary application. I imagine that would be very difficult to program, but it certainly seems possible. Good video as always. Cheers.
@drdca8263 8 months ago
These images are only a statistical summary of the file. They are always 256 by 256 regardless of the size of the input file. It isn't like a spectrogram of an audio file, which includes enough info about the file to recover a large amount of the original sound (or possibly even all of it if you keep the phase information?). At most, you could see each of these visual representations as being like a simple statistical model (specifically a Markov chain) created from a single file, and you could sample bytes according to it, but there would be tons of possible files fitting the same bigram statistics, and most of them would be complete garbage.
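To make the "Markov chain" reading above concrete, here is a small sketch in C of sampling the next byte from a table of pair counts. It assumes the `map[y][x]` layout used elsewhere in this thread (x is the current byte, y is the byte that follows it). The name `sample_next_byte` is hypothetical, and the random number `r` is supplied by the caller so the function stays deterministic.

```c
#include <stdint.h>

// Picks a successor for byte `prev`, weighted by how often each byte y
// followed `prev` in the analyzed file (that count is map[y][prev]).
// `r` is any caller-supplied random number. Returns the chosen byte
// as 0..255, or -1 if `prev` was never followed by anything.
int sample_next_byte(uint32_t map[256][256], uint8_t prev, uint64_t r)
{
    uint64_t total = 0;
    for (int y = 0; y < 256; ++y) total += map[y][prev];
    if (total == 0) return -1;

    // Walk the cumulative counts until the target falls into a bucket.
    uint64_t target = r % total;
    for (int y = 0; y < 256; ++y) {
        if (target < map[y][prev]) return y;
        target -= map[y][prev];
    }
    return -1; // not reached, because target < total
}
```

Calling this in a loop generates a byte stream with roughly the same pair statistics as the original file. As the comment says, that stream is almost certainly not a valid file of the same format.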
@RandomGeometryDashStuff 8 months ago
10:25 why do most C hello worlds use printf("Hello, world\n") instead of puts("Hello, world")?
@TsodingDaily 8 months ago
Because if everyone was using puts("Hello, world") you would be asking why not printf("Hello, world\n")
@RandomGeometryDashStuff 8 months ago
@TsodingDaily no, because printf is more complicated because of the f, like %s, %d, %%, %c...
@Anubis10110 8 months ago
😅
@anon_y_mousse 8 months ago
@RandomGeometryDashStuff It's because printf is like the gateway drug into C.
@CD4017BE 8 months ago
Probably because the function name `printf` is more intuitive than `puts` for a person who is just starting to program for the first time.
@josephcbs6510 7 months ago
This MSVC complex numbers thing is so fucking annoying. Every time I need to do something physics related, I need to reimplement complex numbers because of this stupid MSVC. I remember my reaction the first time I tried to compile something with C99 complex numbers on MSVC and it did not compile. The sensation I felt the moment I opened the MSVC documentation and saw how complex numbers work on MSVC is burned into my mind. I still have nightmares about it from time to time.
@zahash1045 8 months ago
Yeah right, an elite Russian programmer that's "definitely not a hacker". Nice try.
@Maximxls 3 months ago
I'm pretty sure your application of log is VERY wrong. You basically set the maximum to the log of itself, while not changing the data in any way. I think the right way to do it would've been to just replace all the numbers with their logs (leave zeros as is).
@ilikegeorgiabutiveonlybeen6705 8 months ago
antiplagiat 0_o
@kawaikaede2269 8 months ago
💀
@noctavel 8 months ago
I start with a like and then every time I see something cool, I dislike and like it again. Avg like rate: 7 likes per video. You're welcome.
@satchelfrost6531 8 months ago
lol 1:24:58
@upbeatsarcastic8217 8 months ago
This guy nobs
@Joorin4711 8 months ago
To claim that there are many programming languages because some company wanted to own some market is, at best, naive and, at worst, stupid. Have some companies tried to own some market by trying to introduce their own version of a language? Yes. But to extrapolate from that is just not valid. Programming languages are tools that are often designed with a specific problem in mind, and that has been true all the way from ALGOL, PROLOG, FORTH and LISP up to Rust and Python and the rest of them all.
@anon_y_mousse 8 months ago
At worst it's an exaggeration. It can't be denied that nearly every product that Microsoft has put out was an attempt to dominate the market. While he may have exaggerated by saying programming language, had he restrained himself to just Microsoft products he'd be 100% spot-on.
@TsodingDaily 8 months ago
> Programming languages are tools that often are designed with a specific problem in mind
Speaking of stupid, how does that contradict what I've said? And thinking that Rust, for instance, is not an attempt at expanding someone's influence is speaking of naive. 🤡
@iivarimokelainen 7 months ago
As someone using modern tools like an IDE... this was so painful to watch. It's like coding on a C64 and calling it productive.