str vs bytes in Python

75,989 views

1 day ago

strings vs. bytes, what's the diff?
Strings and bytes are both fundamental types in Python. At a surface level they also appear to be very similar objects. From the similar notation, to the functions they offer, to their use cases like writing to a file, bytes and strings appear to do nearly the same thing. And yet, Python enforces a strict separation between them: they cannot be mixed, and translating between them requires an explicit conversion using str.encode() or bytes.decode(). So what makes them different at all? In this video we'll find out how str and bytes differ and talk about encodings, like utf-8, which are the missing link between them.
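
A minimal sketch of the separation described above, using only built-ins. The sample text and variable names are made up for illustration and are not taken from the video's source code.

```python
# str holds text (Unicode code points); bytes holds raw 8-bit values.
text = "h\u00e9llo"              # "héllo"
data = text.encode("utf-8")      # str -> bytes, with an explicit encoding

print(data)                      # b'h\xc3\xa9llo'
print(len(text), len(data))      # 5 6

# Mixing the two types raises instead of silently guessing an encoding.
try:
    text + data
except TypeError as exc:
    mixing_error = exc
    print("cannot mix:", exc)

# bytes -> str needs the same encoding that produced the bytes.
round_trip = data.decode("utf-8")
print(round_trip == text)        # True
```

The é takes two bytes in UTF-8, which is why the two lengths differ even though both objects print almost the same.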
― mCoding with James Murphy (mcoding.io)
Source code: github.com/mCodingLLC/VideosS...
bytes docs: docs.python.org/3/library/std...
Official unicode site: home.unicode.org/
UTF-8 wiki: en.wikipedia.org/wiki/UTF-8
PEP 686: peps.python.org/pep-0686/
SUPPORT ME ⭐
---------------------------------------------------
Sign up on Patreon to get your donor role and early access to videos!
/ mcoding
Feeling generous but don't have a Patreon? Donate via PayPal! (No sign up needed.)
www.paypal.com/donate/?hosted...
Want to donate crypto? Check out the rest of my supported donations on my website!
mcoding.io/donate
Top patrons and donors: Jameson, Laura M, Dragos C, Vahnekie, Neel R, Matt R, Johan A, Casey G, Mark M, Mutual Information
BE ACTIVE IN MY COMMUNITY 😄
---------------------------------------------------
Discord: / discord
Github: github.com/mCodingLLC/
Reddit: / mcoding
Facebook: / james.mcoding
CHAPTERS
---------------------------------------------------
0:00 Intro
0:20 str and bytes syntax
0:50 str and bytes functions
1:29 they don't mix
2:17 amazing sponsor
2:40 smiley
3:33 the meaning of bytes
4:53 encodings
6:07 dangers of not specifying encoding
7:21 warn default encoding
7:35 utf-8 mode
8:06 Outro and thanks

Comments: 194

@lawrencedoliveiro9104 1 year ago

1:34 Fun fact: separating bytes from strings was the most important major breaking change between Python 2 and Python 3. Trying to keep strings as byte-encoded led to all kinds of unfortunate trouble in Python 2, which could not be fixed without sacrificing backward compatibility. And they thought, while they were breaking things anyway, they might as well fix a few other things in a cleaner, non-backward-compatible way while they were at it.

@kadeemaustin1259 1 year ago

That’s actually super interesting 😮 thanks for the info

@BenjaminWheeler0510 1 year ago

When I started learning Rust, this was something that actually comes up quite a bit, since you can't iterate over a string object (you don't necessarily know its encoding at compile time). It was the first time I realized that the difference between ascii, utf, and others is actually really important!

@0LoneTech 1 year ago

This seems mistaken. Any decodable encoding is iterable, and Rust proponents keep bragging about compile time checks and zero overhead abstractions. A quick lookup shows that Rust's std::string::String is defined as UTF-8 encoded, and it inherits the iteration method chars() from str. So it appears you're talking about something else, perhaps an equivalent to Python's bytes type.

@mr.bulldops7692 1 year ago

@@0LoneTech Remember not everything is an object in Rust. Rust has two "strings" to be aware of. The first is a stack "str" which is a primitive type stored on (you guessed it) the stack as bytes. This is the default behavior when declaring a string in Rust. It might be "iterable" at this point, but I don't think the memory safety checks can hold true if you start manipulating values on the stack. Rather, Rust makes you create a slice of "str" on the heap as a "String" struct before doing manipulation.

@tomiesz 1 year ago

I think the comment is just weirdly stated. String itself does not implement the Iterator trait (aka there is no "default" way to iterate a String), instead it makes you choose between characters or bytes by using the chars() or bytes() method that return the appropriate iterator.

@0LoneTech 1 year ago

@@mr.bulldops7692 You're talking about mutability, a separate subject in both Rust and Python. This is distinct in turn from owned or borrowed, the prime difference between Rust's String and str (with lifetimes guarding stack frame ownership, iirc). &mut str does let you modify the data, irrespective of where it is stored, but one should take care not to violate UTF-8 encoding when doing so. The borrow checker should prevent breaking character boundaries while slices depend on them. I expect string literals to be in rodata, as 'static str, neither mutable nor in stack or heap.

@VivekYadav-ds8oz 1 year ago

The way you phrased it makes it seem like Rust doesn't know the encoding of a String/str at compile-time, which is bonkers. The type itself enforces the invariant that it holds UTF-8 data. It's not encoding that's unknown at runtime, it's the grapheme clusters. Indexing is supposed to be O(1) for example, but you don't know what character str[5] will be in O(1), since maybe an emoji is in between which will take more than one byte, and so there's no direct mapping between index and character position. If you just want to treat it as a byte array, just call .as_bytes() on it.

@japedr 1 year ago

Windows encodings are a real nightmare. There are the OEM/MS-DOS codepages used by the console, which make it almost impossible to consistently write non-English characters from a .bat script. Then there are the "ANSI" codepages which are used by the Win32 functions accepting strings as char pointers (e.g. MessageBoxA). It is usually Windows-1252 in western countries, which is a slightly incompatible variant of ISO 8859-1 (also known as "Latin1"). Then there are the "Unicode" strings/MBCS/wchar_t pointers which are actually UTF-16 (even MS documentation states wrongly that "Unicode is a 16-bit character encoding"), meaning that emojis will probably work in some places and not in others (try calling MessageBoxW with an emoji...). Except not really, because in some cases it is UCS-2 instead of UTF-16 (another slightly incompatible variant). BTW, at least until recently you needed to add the BOM character to make programs like Notepad recognize a UTF-16 file. Note that NONE of those encodings are UTF-8.

@lawrencedoliveiro9104 1 year ago

Remember that MS-DOS was originally created for the IBM PC, so it had to incorporate the whole IBM “code page” concept.

@timogden9681 1 year ago

Wow really informative. I wrote most of a project in windows, started using it in a Linux Google cloud VM, but I realized some of my data in a csv file was invalid. In the interest of getting a proof of concept out quick, I just quickly wrote a script in the VM that opens the file as a pandas dataframe, removes the invalid rows, and stores it as a csv file again. Except when I went to open this new file before giving it to my ML algorithm, it kept telling me the file was corrupted. I couldn't understand it, I was at a total loss, and I ended up just writing another hacky solution in which if I encountered an error loading one of the rows during the training process, I would just default to loading the first row instead. This makes total sense that this could have been the problem. Thanks James!

@mCoding 1 year ago

A true war story from the field! This is exactly what can happen with mixed encodings and I'm glad this helped figure out the problem!

@lawrencedoliveiro9104 1 year ago

I’m sure Pandas has ways to hook into the loading/saving process. Python has options in decoding as to how to treat invalid byte sequences: for example, you could ignore them, or replace them with some marker character.

@0LoneTech 1 year ago

@@lawrencedoliveiro9104 Yep, in this case pandas.read_csv has arguments encoding, encoding_errors and on_bad_lines. One guess at a cause of the corruption might be how Windows NT Notepad silently injects an invalid BOM into UTF-8 files.
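
A minimal sketch of the decoding options mentioned in this thread, using only the standard library rather than pandas. The sample bytes are invented for illustration; the `errors` values and the `utf-8-sig` codec are standard Python.

```python
raw = b"caf\xe9 ok"   # 0xE9 is 'é' in Latin-1, but not valid UTF-8 in this position

# errors="strict" is the default and raises on invalid input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")     # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")   # inserts U+FFFD as a marker

# A UTF-8 file that starts with a byte order mark (EF BB BF):
with_bom = b"\xef\xbb\xbfname,value"
plain = with_bom.decode("utf-8")          # keeps '\ufeff' as the first character
stripped = with_bom.decode("utf-8-sig")   # removes the BOM if one is present

print(strict_failed, repr(ignored), repr(replaced))
print(repr(plain), repr(stripped))
```

Dropping or replacing bytes loses information, so these handlers are a way to keep a pipeline running, not a way to recover the original text.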

@nitishvirtual4745 8 months ago

Yet another informative and well put video. Thanks!

@finnthirud 1 year ago

Decoded the mystery in a few minutes, thank you! ☺

@kyleaustin7768 1 year ago

It's crazy how one day I am wondering about something and a week later you have a great video on it. Thanks for another great one!

@mCoding 1 year ago

Great to have you watching!

@cleverclover7 1 year ago

It's crazy how much you come across decoding/encoding issues in the wild. I sometimes work with large text datasets with mixed encodings, sometimes even in the same line! The worst is that if you try and decode with the wrong encoding it can raise a runtime error, so I ended up writing a short program with a bunch of try/excepts for the different possibilities (utf-8 first of course). I did the same thing when I worked in C and Tcl. Gotta be a better way...
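
A rough sketch of the try/except approach described above. The function name `decode_with_fallback` and the candidate list are hypothetical, not the commenter's actual program. A successful decode only means the bytes were valid in that encoding, not that the guess was right.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> tuple:
    """Return (text, encoding_used) for the first candidate that decodes without error."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 is tried first; latin-1 goes last because it accepts every byte value.
print(decode_with_fallback("na\u00efve".encode("utf-8")))   # ('naïve', 'utf-8')
print(decode_with_fallback(b"na\xefve"))                    # ('naïve', 'cp1252')
print(decode_with_fallback(b"\x81"))                        # ('\x81', 'latin-1')
```

For a line that mixes encodings, this would have to be applied per field or per segment, which is where it stops being simple.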

@mCoding 1 year ago

*Mixed* encodings! That's nightmare fuel for developers if I ever heard it!

@Bobbias 1 year ago

Chardet is a helpful library for trying to guess the most likely encoding. However, if you've got a single string with mixed encodings, then that might not be helpful.

@0LoneTech 1 year ago

Nightmare indeed. One rare program I've seen handling it decently is mlterm with mixes like ISO 8859-1 and EUC-JP.

@AntonioZL 1 year ago

I have dealt with that just recently. Absolutely terrible.

@cleverclover7 1 year ago

@@Bobbias thank you I'll check it out!

@WalterVos 1 year ago

When you're completely unsure what the encoding of any file that you're processing is, the chardet package is really helpful.

@mCoding 1 year ago

Great tip!

@che_kavo 1 year ago

Thank you! I've always struggled to understand the diff between str, bytes and what is encoding. And now I finally understand! Thank you 😊

@SeanCrites 1 year ago

As I was searching the interwebs as to what a type 'byte' was and how to convert it to a string, my YT refreshed and there was this video at the top of my subscription, 4 minutes old. This timing was apropos.

@Dmittry 9 months ago

The best integration I've ever seen.

@mrtnsnp 1 year ago

Looking forward to "from __future__ import default_encoding".

@mCoding 1 year ago

A very likely possibility. Maarten called it first! My guess is `from __future__ import utf8`

@francescoferazza9341 1 year ago

One of the best explanations ever.

@tiagomacedo7068 1 year ago

That was the best message from a sponsor I've ever seen.

@AngryArmadillo 1 year ago

Hey James, I’d love to see a video showcasing how to use the Textual package. It’s really neat, and fits your style.

@lawrencedoliveiro9104 1 year ago

5:25 Not just the most popular, but some languages, including Python, have embraced Unicode to the point that identifiers can contain any Unicode characters that are classed as “letters”. So for example while “in” is a reserved word, “ın” is not, and can be used as an identifier.

@BenjaminWheeler0510 1 year ago

Does it warn you about doing this? This rings a bell... I think some language out there actually does warn you if you do silly stuff like this. Maybe it was Rust or c++? Not sure.

@lawrencedoliveiro9104 1 year ago

@@BenjaminWheeler0510 It’s a feature, not a bug.

@fltfathin 1 year ago

@@BenjaminWheeler0510 it only warns you if it's not a "letter" character (emoji,etc), katakana/hiragana/etc works as variable name without warning

@benhetland576 1 year ago

It can be fun, then, to mix otherwise identical characters from the Latin, Greek and Cyrillic alphabets. You can write dеf instead of def, for example.
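
A small sketch of how the lookalike identifiers in this thread can be inspected from Python itself. The strings are built with escapes so the code points are unambiguous; the variable names are made up for illustration.

```python
import keyword
import unicodedata

dotless = "\u0131n"     # dotless i followed by 'n', looks like the keyword "in"
lookalike = "d\u0435f"  # Cyrillic letter in the middle, looks like "def"

for name in (dotless, lookalike):
    print(
        repr(name),
        "identifier:", name.isidentifier(),
        "keyword:", keyword.iskeyword(name),
        [unicodedata.name(ch) for ch in name],
    )

# Both are valid identifiers, and neither is the keyword it resembles.
print(lookalike == "def")   # False
```

Printing the character names, as done here, is one quick way to spot a homoglyph that an editor renders identically to an ASCII letter.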

@TheAnonymmynona 1 year ago

@@BenjaminWheeler0510 Some IDEs warn you about it, for example VS Code has a warning about non-standard characters

@eternlyytc7300 1 year ago

New subscriber here. Just wanted to say that I love your videos. Very informative and fun to watch! Keep up the good work

@mCoding 1 year ago

Thanks for your kind words! Welcome to the channel and I hope you learn a lot!

@eternlyytc7300 1 year ago

@@mCoding thanks man! See you around

@lethalantidote 1 year ago

I absolutely love your videos. Regardless of my familiarity with a topic, every video seems to have some piece of information that I would not have discovered on my own. I never knew that files were encoded with the system encoding unless specified. It has never been an issue, but I know that one day it will be and without this knowledge, I would have really struggled to identify the issue. Future me really appreciates your hard work.
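
A minimal sketch of the pitfall mentioned above, assuming a temporary file and made-up sample text. It passes `encoding=` explicitly on every call, then reads the same file with a deliberately wrong encoding to show that the result is garbled text rather than an error.

```python
import tempfile
from pathlib import Path

text = "na\u00efve caf\u00e9 \u263a"   # "naïve café ☺"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"   # hypothetical file name

    # Name the encoding instead of relying on the platform default.
    path.write_text(text, encoding="utf-8")

    same = path.read_text(encoding="utf-8")
    # Wrong encoding: these particular bytes are all valid cp1252, so no exception.
    garbled = path.read_text(encoding="cp1252")

print(same == text)   # True
print(garbled)        # mojibake such as 'naÃ¯ve ...'
```

Per the Python docs, running with `-X warn_default_encoding` (Python 3.10+) emits an EncodingWarning where `encoding=` was omitted, and `-X utf8` turns on UTF-8 mode; these correspond to the last two chapters listed above.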

@mCoding 1 year ago

Great to hear! And I was also surprised when I ran the example and found out that my default encoding was not utf-8. Then I remembered I record my videos on Windows!

@Bobbias 1 year ago

@@mCoding yep, the "we do things different because.... (Usually) bad reasons." OS. As a longtime windows user, it's so damn frustrating sometimes.

@LiveType 1 year ago

This byte talk brought me back to the days of learning fixed point programming using only int8 adds and subtracts. Wowza. What a throwback. We're so spoiled these days with all of these high level languages. I even started to use micropython for small tasks despite bashing on it before.

@tusharsnn 1 year ago

When I came to know about the state of unicode support in c, c++, I immediately stopped learning it and switched to rust 🥲. Ascii, wide chars lol 😂

@nollix 1 year ago

Perhaps some of the confusion comes from this: Bytes are interpreted as bytes, but you type them as if they were string literals. So then, how does the string get transformed into bytes? Doesn't it have some sort of implicit encoding when you type it in your IDE?

@jacobgoldsmith7651 1 year ago

yes, ascii. a=1, b=2, etc

@cyrilsli 1 year ago

@@jacobgoldsmith7651 that’s… not the ascii table?

@lawrencedoliveiro9104 1 year ago

Everything is Unicode these days. When you convert between Unicode strings and bytes, the default decoding/encoding is “utf-8”.

@Howtheheckarehandleswit 1 year ago

@@jacobgoldsmith7651 In ASCII, a = 97, b = 98, etc, and A=65, B=66, etc.

@Howtheheckarehandleswit 1 year ago

I don't know for sure, but I'd imagine that it uses the literal bytes that whatever text editor you used decided to encode the bytes literal as

@broccoloodle 1 year ago

Really comprehensive explanation

@jasonhenson7948 1 year ago

Excellent video, thank you. I've had a couple of issues where I've had to use the IO and locale libraries to "fix" encoding shenanigans, but I think if I revisited those lines I'd now have an actual understanding of what was happening, how the changes worked and, most importantly, how /to do it better/.

@mCoding 1 year ago

Thank you for your kind words! I'm glad this was able to help you understand and you will know how to fix it when it crops up again.

@Unpug 1 year ago

Incredible explanation

@denyspisotskiy75 1 year ago

interesting theme. waiting for your next video :)

@mjdevlog 1 year ago

so insightful for me as a beginner!

@PouriyaJamshidi 1 year ago

Fantastic explanation!

@DK-eo9vj 1 year ago

always great vids. thanks a lot!

@mCoding 1 year ago

Glad you enjoyed!

@4647540 10 months ago

very good explanation, Cleared my head :)

@PetrSzturc 1 year ago

Thanks for this.

@tbonethechamp 1 year ago

wow you learn something new every day! had no idea that python has a built in way to do intersections 1:00

@gloweye 1 year ago

Huh, didn't know about that system encoding. Very much agree with PEP 686.

@lawrencedoliveiro9104 1 year ago

4:19 Just a note that, in Python, the len() function is counting “code points”, not characters. So strings are really being interpreted as sequences of code points, not of actual Unicode characters (which can be encoded in multiple code points).

@mCoding 1 year ago

This isn't particularly a video about the specifics of Unicode, but since you mention it, a "character" as defined by the Unicode standard is the same as a code point, and this is the same way that I use the term in this video. You may be confusing the term "character" with a "glyph", which is a shape that is rendered as a representation of one or more characters. Various relationships may exist between character and glyph: a single glyph may correspond to a single character or to a number of characters, or multiple glyphs may result from a single character. Python's len function counts both characters and code points because these are the same, but it does not count glyphs. I refer you to Section 2.2 "Characters, Not Glyphs" in the Unicode standard for further explanation.

@lawrencedoliveiro9104 1 year ago

@@mCoding No. Consider a character with diacritic marks, like for example “ä”. This has its own code point U+00E4, but it can also be represented as U+0061 (“a”) followed by the combining diacritic mark U+0308. The most common character-plus-diacritic combinations have their own assigned code points, but not every combination can be represented this way. Hence the need for multiple code points to represent a character.
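
The ä example from this exchange can be checked directly. This is a small sketch using the standard `unicodedata` module; the variable names are made up for illustration.

```python
import unicodedata

composed = "\u00e4"       # 'ä' as the single code point U+00E4
decomposed = "a\u0308"    # 'a' followed by U+0308 COMBINING DIAERESIS

print(len(composed), len(decomposed))   # 1 2  (len counts code points)
print(composed == decomposed)           # False, although both render as 'ä'

# Normalization converts between the two representations.
nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", composed)     # decompose
print(nfc == composed, nfd == decomposed)        # True True
```

Normalizing both sides to the same form before comparing is the usual way to treat the two spellings as equal.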

@anon-fz2bo 1 year ago

yea newer languages such as Go ([]byte) and Zig ([]const u8) use a slice of bytes to interchangeably represent strings, which makes sense from a C/C++ perspective considering that strings in C/C++ are essentially just an array of characters and characters are essentially uint8_t (bytes)

@mCoding 1 year ago

Yeah this is an important choice that lower level languages make. "Strings" in those languages are more like the bytes object in Python, a contiguous container of bytes with stringlike functions. If you want true unicode support you have to use some external lib, which makes sense in performance driven languages because parsing utf8 at runtime is a huge performance penalty.

@GabrielEdu 9 months ago

Thank you very much, you helped me a ton!!!

@mCoding 9 months ago

You're welcome!

@Zifox20 1 year ago

Always there to teach me life ahah, thanks!

@mCoding 1 year ago

Any time!

@mattholden5 1 year ago

Thanks, James. Very concise, well-informed and well-executed. I especially like the grounded references to Python 3.1x.x. My inbox needs a vaccine for "Python 4 .*" titles. I might take a look at the meta on this vid to see if I can spot such effusion for my personal ytube feed.

@albogdano 1 year ago

Very interesting, thank you

@nathanoy_ 1 year ago

YAY new video!

@jullien191 1 year ago

Wow, thank you. The best, haha

@Yotanido 1 year ago

This makes me glad I only ever work on Linux systems. Utf-8 everywhere. I would have never even considered python using anything other than utf-8 when opening a file in text mode. Although, I also didn't know encode and decode could be used without an argument. I always specified utf-8 and will continue to do so.

@Veptis 4 months ago

I have been using tree sitter for a language model dataset. I use the start_byte and end_byte to cut out a function and replace it with a generation for the benchmark. I spent a few hours hunting down some offset issues... and it was due to the difference in len for str vs bytes, and in indexing too. So I do a lot of encode, slice, decode, and it's awful. I would love to simply use the byte index to slice a str.
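
A sketch of the encode, slice, decode pattern described above, without tree-sitter itself. The source snippet is invented, the byte offsets are computed with `bytes.index` to stand in for parser-reported offsets, and `byte_to_char_offset` is a hypothetical helper.

```python
source = "x = '\U0001F600'\ndef f(): pass\n"   # the emoji takes 4 bytes but 1 str index
encoded = source.encode("utf-8")

# Stand-ins for the start_byte/end_byte a parser would report.
start_byte = encoded.index(b"def")
end_byte = encoded.index(b"pass") + len(b"pass")

# Slice the bytes, then decode the slice.
snippet = encoded[start_byte:end_byte].decode("utf-8")


def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Map a byte offset in UTF-8 data to a str index (offset must lie on a character boundary)."""
    return len(data[:byte_offset].decode("utf-8"))


start_char = byte_to_char_offset(encoded, start_byte)

print(start_byte, start_char)   # 11 8
print(snippet)                  # def f(): pass
print(source[start_char:start_char + len(snippet)] == snippet)   # True
```

Converting each offset this way costs a decode of everything before it, so for many offsets in one file it may be cheaper to keep working on the bytes and decode only the final slices.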

@lionkg81 1 year ago

Great video as always, thanks! But still not really clear when to use each of these types.

@volbla 1 year ago

Bytes are mostly just useful if you mean to interpret them as something very specific that's not text. For example encryption keys or raw image data. If you're writing or handling text you should go with strings.

@avinoamkugler2720 1 year ago

Great video😊

@petrskupa6292 1 year ago

Great! So thankful, it cleared up my confusion (I still haven’t gone to Stack Overflow for it 😅😂) ... May I ask one curious question to end? What might be the reason for a system not to have UTF-8 as its default? (why not?)

@lawrencedoliveiro9104 1 year ago

Legacy reasons. Before Unicode--indeed, for a long while after--there were these things called “national character sets”. In fact, there is likely still a large collection of text stored in these legacy encodings.

@pdmkdz 1 year ago

I needed this explanation 3y ago :/

@Lestibournes 1 year ago

Yesterday I wrote a self-extracting installer script. Today I see this. I found it easiest to write the installer file as a string and then write the files it contains as bytes that are encoded as utf-8, especially if they are binary files. Writing the whole installer to file as bytes caused me trouble with the python interpreter.

@Mr.Beauregarde 1 year ago

Thank God for UTF-8

@tusharsnn 1 year ago

Heads up: Unicode: It's like a dictionary of characters. Each character has a unique entry and a value which identifies it, aka its code point. A code point is a 4-byte value. An encoding (there are several) encodes this unique code point (4 bytes) to a sequence of bytes (variable sized), so as to save space. E.g. a utf8 encoded character can use 1/2/3/4 bytes depending on its code point. Similarly, a utf-16 encoded character can use 2/4 bytes. Why is utf8 so popular, you might ask? The reason is backward compatibility with ASCII: all the ASCII characters encoded to utf8 share the same "value" as when encoded to ASCII, e.g. 'A' is 65 in both encodings. All ASCII chars use only 1 byte in utf8. Why utf16? Well, utf8 cannot represent all the unicode chars, there are some chars that have code-points outside the range of utf8.

@benhetland576 1 year ago

News to me. Which Unicode code points do you claim cannot be encoded in utf8, then?

@tusharsnn 1 year ago

@@benhetland576 Just checked and it looks like utf8 does support all code points according to Wikipedia, but I'm not sure if it's correct. I saw this warning when I was working with a PowerShell script: it needed input encoded specifically in utf-16LE since it mentioned that this supports 'all' code points. Again, not sure why it might say that.

@mCoding 1 year ago

Maybe I should make a video not just on str vs bytes, but Unicode specifically. There are lots of interesting (and dark) corners in there!

@benhetland576 1 year ago

@@tusharsnn The encoding that utf-8 uses theoretically allows a max of 7 octets (or bytes if you like), in which case the first octet would start with 7 ones followed by a 0, i.e., 0xFE. The next 6 octets each encode 6 bits for a total of 36 bits, and only 21 bits are needed to cover all possible Unicode codepoints (17 "planes" of 65536 codepoints each). 4 octets encode (8 - 5) + 3 × 6 = 21 bits, so that is the longest octet sequence ever needed to encode a single Unicode codepoint. There are byte sequences that are not valid utf-8 for several reasons (even single octets like a 0xFF), but not vice versa.

@0LoneTech 1 year ago

There are also characters with multiple code points, such as latin V and roman numeral V, and characters with distinct glyphs such as Han unified ones, and combined characters that may be separable like ä, and the classification and ordering of characters is language dependent. Text is messy. Unicode does not provide the tools needed to compare a Chinese and Japanese phrase in one text, by design. History and politics (not necessarily national) are involved.

@aceae4210 1 year ago

so this solved a thing that I didn't think about
so you know the base64.b64decode function (import base64)
so when you decode a base 64 string the output is b'(decoded content)', which is, as I just found out, a byte formatted string
before, what I was doing was this (mind the naming schema)
base64_decoded = b'some|text|here'
str_base64_decoded = str(base64_decoded)
and then str_base64_decoded[2:-1] (which is the same as slice(), which is formatted as (start, stop, step))
so what that did was remove the b' and then also remove the ending ' to give some|text|here
so yeah, knowing byte formatted strings exist helps, as instead I can just do this
base64_decoded.decode()
which will get me the same output some|text|here
thanks for reading my weird experience, have a good day

@AssemblyWizard 1 year ago

Won't always give the same output, try decoding the base64 "4oKs" (that's a lowercase O not a zero), and then compare str with slicing vs decode

@aceae4210 1 year ago

@@AssemblyWizard so doing a test I see what you mean
the first row is with the byte to string (with str()) and the bottom one is .decode()
"\xe2\x82\xac" byte to string then cut
"€" using built in func
the main difference is that .decode() is able to properly represent characters that byte strings can't
thanks for letting me know
the code I used is down below
import base64
base_decode = base64.b64decode("4oKs")
str_decoded_cut = str(base_decode)[2:-1]
base_decode_builtin_func = base_decode.decode()
print(f'"{str_decoded_cut}" byte to string then cut "{base_decode_builtin_func}" using built in func')

@zachb1706 1 year ago

The encoding type changing depending on the system's configuration is nightmare fuel.

@yash1152 8 months ago

6:12 ohw, that was the thing the pylance/mypy was shouting at me "encoding not specified"

@jedpittman6739 1 year ago

mcoding != encoding. Amazing. 😂

@MithicSpirit 1 year ago

Discord gang

@mCoding 1 year ago

Best gang!

@swizice 1 year ago

> Stack Overflow?

@whannabi 1 year ago

@@swizice yuuuup'

@TanUv90 1 year ago

2:17 Most humble ad read ever lol

@silverKirilljedi 1 year ago

Great video! But isn't it weird that Python's string encode() and open() use different default encodings? I asked ChatGPT and it says Python 3.10 has 17 built-in functions and class methods that use an encoding parameter, and there are 5 default values: utf-8, None (system's default), latin-1, ascii, utf-16. This is bad, right?

@NathanHedglin 1 year ago

Sounds like an absolute mess

@hemerythrin 1 year ago

Why ask ChatGPT instead of reading the documentation?

@Plajerity 1 year ago

ChatGPT is the best storyteller humankind has ever seen. To distinguish the fake from the truth, possible it's not. Do not put your faith on it if your question might be less popular, it generalizes everything.

@mCoding 1 year ago

Yes it is very weird, and there were historical reasons for doing it that are somewhere between no longer very relevant and a mistake. That's why PEP 686 is finally switching the default to utf8, but since this is big change they have to wait until 3.15!

@lawrencedoliveiro9104 1 year ago

Checking help(open), it says the default encoding is taken from your locale. But I always set my locale to something UTF-8-based anyway, so no biggie.

@shukterhousejive 9 months ago

With all the stuff Python gladly broke in the 2to3 switch I'm shocked bytestrings stayed around, all they do is confuse people for minimal convenience. Shoulda swapped it out with a fixed-length bytearray implementation, that way nobody gets confused about the intended purpose.

@GameSmilexD 1 year ago

but how to convert for binary shellcode in python3? i have a chroot python2 version for that

@quintencabo 1 year ago

One thing that's missing, I feel, is that a bytes object is actually just a sequence of ints between 0 and 255, nothing more. It makes sense, but it was like an aha moment for me
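
A quick sketch of that observation, using only built-ins. The variable names are made up for illustration.

```python
data = b"abc"

print(list(data))    # [97, 98, 99]
print(data[0])       # 97    (indexing gives an int)
print(data[0:1])     # b'a'  (slicing gives bytes)

# Values outside 0..255 are rejected when building a bytes object.
try:
    bytes([256])
    rejected = False
except ValueError:
    rejected = True
print(rejected)      # True
```

The index-versus-slice difference is a common surprise when porting code that treated byte strings like text.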

@MattSpaul 1 year ago

Am I right in assuming the storage size is the same between string and bytes?

@mCoding 1 year ago

For the actual size in memory it's actually up to the implementation, they can differ due to things like small string optimization, ascii-only optimization, cached properties, and a few other things. When you write a string to disk, it is always converted to bytes first (it is done automatically in text mode) so in that sense the storage size is the same. However, the "length" of a string is the number of characters, which can differ from the number of bytes because some characters can take multiple bytes (like the smiley).
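
To make the length point concrete, here is a small sketch comparing the character count with the encoded size under a few encodings. The sample characters are arbitrary choices, and U+1F600 stands in for the smiley from the video.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, é, €, and an emoji

sizes = {}
for s in samples:
    sizes[s] = (
        len(s),                      # number of code points
        len(s.encode("utf-8")),      # bytes in UTF-8
        len(s.encode("utf-16-le")),  # bytes in UTF-16 (no BOM)
        len(s.encode("utf-32-le")),  # bytes in UTF-32 (no BOM)
    )
    print(repr(s), sizes[s])
```

Every sample has length 1 as a str, while its on-disk size depends entirely on the encoding chosen when it is written.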

@lawrencedoliveiro9104 1 year ago

Unicode “code points” can officially have any value in [0 .. 0x10FFFF]. That means a single code point could fit in 21 bits.

@lawrencedoliveiro9104 1 year ago

Let me amend that. The valid ranges for Unicode code points are [0 .. 0xD7FF] and [0xE000 .. 0x10FFFF]. The values in the gap are called the “High Surrogates” and “Low Surrogates”, and are reserved for representing UTF-16 encodings. Which nobody should be using any more.

@yash1152 8 months ago

1:21 which IDE?

@jeffkevin3 1 year ago

What a coincidence! I just tried to survey the difference between them and found this video that just came out! 😀 So... why isn't your computer in UTF-8? 🤣

@yomajo 1 year ago

Ask Billy Jeans

@sleeper789 1 year ago

5:24 "UTF-8 is by far the most common encoding across all programming languages." I don't think this is actually true. UTF-8 is the most common encoding for on disk and on the web text, but programming language implementations will often internally work in a different encoding than the wire/disk encoding. Both in Java and DotNet the String type is internally implemented using UTF-16, not UTF-8.

@lawrencedoliveiro9104 1 year ago

UTF-16 is an unfortunate hangover from the early days of Unicode. Nobody uses it voluntarily any more.

@benhetland576 1 year ago

@@lawrencedoliveiro9104 Voluntary or not, utf-16 is now deeply imbued into every NT-derived Windows computer out there. The ubiquitous windows-1252 et al is only what we see on "the surface" within some GUI apps and the command window. NT used to have "16-bit Unicode", but after Unicode expanded past the BMP they redefined it to be utf-16 instead. I wonder how many bugs are still hiding in there that don't actually handle the utf-16 "escapes" correctly and just assume every character is 16 bits...

@lawrencedoliveiro9104 1 year ago

@@benhetland576 And that’s why you wouldn’t choose to use it.

@b4ttlemast0r 1 year ago

This seems to be a pretty good way to handle it. Meanwhile in C++ I'm still trying to figure out how to work with unicode characters at all..

@creed404 10 months ago

What I know is that utf-8 also uses 8-bit units, so how does it know that it should interpret the 4 bytes as an emoji instead of four separate 8-bit characters? Shouldn’t we use utf-32?

@bp56789 1 year ago

A message from our sponsor: (quiet voice) me.

@zd2600 1 year ago

By default, we should be good to know why we use strings in Python. But is there a practical use case for bytes? That may help us differentiate the uses of string and bytes here.

@denisfrunza1040 1 year ago

Many low level libraries will make you use bytes. A good example: try to write a web server from scratch.

@0LoneTech 1 year ago

It's simply one level less of abstraction. Bytes hold arbitrary data and can be used in I/O, like storing or transmitting. String is for when data is text, while other formats could be handled with e.g. struct or ctypes. This video had a couple of examples decoding some bytes as little or big endian integers. It wouldn't make sense to pass a string to zlib.decompress() for instance.

@lawrencedoliveiro9104 1 year ago

For example, the struct module lets you convert between various Python numeric/string types and strings of bytes.
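
A short sketch of the non-text uses of bytes mentioned in this thread: interpreting the same bytes as little or big endian integers, and packing numbers with `struct`. The values are arbitrary examples.

```python
import struct

raw = bytes([0x01, 0x02, 0x03, 0x04])

# The same four bytes mean different numbers depending on byte order.
little = int.from_bytes(raw, "little")   # 0x04030201
big = int.from_bytes(raw, "big")         # 0x01020304
print(hex(little), hex(big))

# "<HI" = little-endian, unsigned 16-bit then unsigned 32-bit, no padding.
packed = struct.pack("<HI", 513, 7)
unpacked = struct.unpack("<HI", packed)
print(packed, unpacked)
```

None of these bytes are text, so decoding them with a character encoding would be meaningless; the format string plays the role that the encoding plays for str.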

@playerguy2 1 year ago

Instructions unclear: Slapped the like button an even number of times.

@maxwellsmart3156 1 year ago

Is there a command to tell you the system default encoding?

@briannormant3622 1 year ago

On Linux you would set the LOCALE to language.utf-8 but no clue if you can do that on windows

@nirvana8145 1 year ago

python3 -c "import sys; print(sys.getdefaultencoding())"
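
One caveat, sketched below under the assumption of a standard CPython 3 install: `sys.getdefaultencoding()` reports the default used by `str.encode()` and `bytes.decode()`, which is "utf-8" on Python 3 regardless of system settings. The locale-dependent default that `open()` falls back to in text mode is what `locale.getpreferredencoding(False)` reports.

```python
import locale
import sys

# Default for str.encode() / bytes.decode() when no argument is given.
codec_default = sys.getdefaultencoding()

# Default for open() in text mode when encoding= is omitted
# (depends on the locale, unless UTF-8 mode is enabled).
text_io_default = locale.getpreferredencoding(False)

# True when Python was started in UTF-8 mode (for example with -X utf8).
utf8_mode = bool(sys.flags.utf8_mode)

print(codec_default, text_io_default, utf8_mode)
```

On a typical Linux machine both values come out as UTF-8, while on Windows the second one is often a legacy code page such as cp1252, which matches what the video ran into.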

@ConstantlyDamaged 1 year ago

Instructions not specific, slapped like button 256 times.

@Kingofgnome 1 year ago

One question I still have: will x = b"Hello World 😉" then automatically convert my string into bytes using the system encoding as default?

@mCoding 1 year ago

The bytes literal syntax b"...." only accepts ASCII characters, so a non-ASCII character like "😉" is not allowed in the literal. If you want to include byte values outside the ASCII range (0-127) but within 0-255, you can write them as escapes like b"\xff" or feed bytes() an iterable of integers like bytes([255, 255, 255]).
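
A small sketch of the options for getting non-ASCII byte values into a bytes object. The `compile()` call is only there so the rejected literal can be demonstrated without breaking the example file itself.

```python
# \x escapes can express any value 0-255 inside a bytes literal.
high = b"\xff\xfe"
same = bytes([255, 254])
print(high == same)              # True

# For text, encode explicitly instead of putting the character in the literal.
wink = "\U0001F609".encode("utf-8")
print(wink)                      # b'\xf0\x9f\x98\x89'

# A raw non-ASCII character inside b"..." is a syntax error.
try:
    compile('b"\U0001F609"', "<example>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False
print(literal_ok)                # False
```

So no system encoding is ever consulted for a bytes literal: either the source is ASCII plus escapes, or it does not compile.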

@bartlomiejodachowski 1 year ago

5:50 if a utf-8 character has 4 bytes, shouldn't there have been padding bytes after/before every byte from 65 to 68? 0,0,0,65, 0,0,0,66, 0,0,0,67, 0,0,0,68, 201,184,240,159 ...

@lawrencedoliveiro9104 Жыл бұрын

No. UTF-8 is variable-length. In particular, all the values in the range 0 .. 127 fit in a single byte.

@bartlomiejodachowski Жыл бұрын

@@lawrencedoliveiro9104 Variable length explains it. I don't get how it can be variable length and still work, but I will google it. Thx

@mCoding Жыл бұрын

Great question, and indeed the answer is that utf-8 is a variable-length encoding. The way this works is by encoding the total number of bytes in the character within the first byte. If the first byte starts like:
0xxxxxxx -> 1 total byte
110xxxxx -> 2 total bytes
1110xxxx -> 3 total bytes
11110xxx -> 4 total bytes
In particular, since ascii values are 0-127, they all start out 0xxxxxxx and hence all ascii values are encoded in a single byte in utf-8. Clever! Read more here: en.wikipedia.org/wiki/UTF-8
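The leading-byte rule can be sketched in a few lines. utf8_sequence_length is a hypothetical helper written for this reply, not something from the standard library, and it only inspects the first byte (it does not validate the rest of the sequence).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, never the start of a character.
    raise ValueError(f"not a valid UTF-8 leading byte: {first_byte:#04x}")


for char in ["A", "é", "€", "😀"]:
    encoded = char.encode("utf-8")
    print(char, list(encoded), utf8_sequence_length(encoded[0]))
```

Each character reports a length equal to len(encoded), which is how a decoder knows where one character ends and the next begins without any padding.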

@quillaja 18 күн бұрын

@@bartlomiejodachowski Hopefully you found your answer, but if not, there was a very good Computerphile video featuring Tom Scott about UTF8

@bartlomiejodachowski 18 күн бұрын

@quillaja I didn't delete my comment in case someone had a similar question. I have already studied encodings, but thx for your response.

@norude Жыл бұрын

Can you make a video on sets? How are they implemented? What is the time/space complexity of adding elements, creating a set from a list, removing elements, and other operations? Why are they in the language at all? Is using lists faster in specific situations?

@leesweets4110 Жыл бұрын

So how does a sequence of four bytes get interpreted as a smiley face?

@mCoding Жыл бұрын

When you ask python to decode using utf8 (whether you specify the encoding specifically or whether python just does that by default), you are asking python to interpret the bytes according to the unicode standard, and the unicode standard specifies that that exact sequence of bytes means "smiling face" or some other similar description. It is then up to the author of the font you are using to create a glyph (the picture of an actual smiley) to draw whenever you try to display the smiley as a character. This allows for different sets of smileys, like the Apple smileys vs the Microsoft smileys; choosing between them is similar to choosing a different font.
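As a small illustration of that chain (bytes, then code point, then name), here is a sketch using the four utf-8 bytes of U+1F600. That particular emoji is my pick for the example and may not be the one shown in the video.

```python
import unicodedata

raw = b"\xf0\x9f\x98\x80"          # four bytes on disk or on the wire

char = raw.decode("utf-8")         # one str character
code_point = ord(char)             # the number Unicode assigns to it
name = unicodedata.name(char)      # the official character name

print(len(raw), len(char), hex(code_point), name)
```

How the character is actually drawn is not part of this at all; that is decided by whichever font renders it.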

@phantomzkarma7633 Жыл бұрын

Looks interesting

@volbla Жыл бұрын

Be warned! The standard library json.dump() function has an "ensure_ascii" parameter that for some reason defaults to True. If you want to save data that's not just standard latin text, you have to set it to False. I guess we have to wait for 3.15 for that to change...

@0LoneTech Жыл бұрын

ensure_ascii generates \u escape sequences. It makes the JSON ASCII compatible, it does not alter the data contained within strings.

@volbla Жыл бұрын

@@0LoneTech And that makes it unreadable in a text editor. I guess it's cool that all the data is still there, but is there a reason to not keep the data _and_ have it readable? What doesn't support UTF-8 these days?

@0LoneTech Жыл бұрын

@@volbla 6:08 is one example, likely Windows-1252. The default ensure_ascii format will function through that, utf-8 won't. JSON does not support indicating encoding like e.g. HTML or HTTP. So if it was stored in a file, the default system encoding is the only suggestion.
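For anyone following this thread, a quick sketch of what ensure_ascii does and does not change. The sample dictionary is made up for illustration.

```python
import json

data = {"greeting": "héllo 😉"}

# Default: non-ASCII characters are written as \u escape sequences.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as-is, so the output is readable
# in a text editor but must be stored with an encoding that can represent them.
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)

# Either way, loading gives back exactly the same data.
print(json.loads(escaped) == json.loads(readable) == data)
```

So the trade-off is readability of the file versus robustness when the file's encoding is unknown or mis-guessed.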

@qexat Жыл бұрын

strife horde 🤙🤙🤙

@lawrencedoliveiro9104 Жыл бұрын

Erisian expedition!?

@qexat Жыл бұрын

@@lawrencedoliveiro9104 discord gang but 🤵‍♀

@lawrencedoliveiro9104 Жыл бұрын

Out-of-tune ensemble!?

@eliseuantonio6652 Жыл бұрын

Why isn't your machine's default utf-8 if it's so popular?

@NostraDavid2 Жыл бұрын

Because Windows committed to 16-bit Unicode (UCS-2, which later became UTF-16) back in the day, before UTF-8 was widespread.

@NostraDavid2 Жыл бұрын

And Microsoft is VERY big on backwards compatibility, which means they won't replace UTF-16 with UTF-8, unless they find a way to stay compatible.

@lawrencedoliveiro9104 Жыл бұрын

UTF-16 is an unfortunate hangover from the early days of Unicode. Nobody uses it voluntarily any more.

@trag1czny Жыл бұрын

discord gang 🤙🤙🤙🤙

@mCoding Жыл бұрын

Always appreciated!

@AssemblyWizard Жыл бұрын

UTF8 ≠ Unicode. I was expecting you to explain the difference, since it's super related to bytes vs. str, but instead you said these terms commonly refer to the same thing 😓

@AssemblyWizard Жыл бұрын

FWIW here's the difference: Unicode usually refers to the conversion between characters and numbers (ord/chr in python), and UTF8 is the conversion between these numbers and bytes. (although Unicode is technically the name for the entire standard, including both UTF8 and the mapping between characters and numbers)
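In code, the two steps look roughly like this (a sketch with an arbitrary example character):

```python
char = "é"

# Step 1 (Unicode): character <-> number (code point).
code_point = ord(char)
same_char = chr(code_point)

# Step 2 (an encoding such as UTF-8): number <-> bytes.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-le")

print(code_point, same_char, utf8_bytes, utf16_bytes)
```

The code point is the same regardless of encoding; only the bytes differ between utf-8 and utf-16.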

@mCoding Жыл бұрын

I completely agree that UTF-8 != Unicode. However, it is widespread and commonplace for developers (and others) to colloquially use the word "Unicode" interchangeably with UTF-8 as most people don't bother with technical distinctions and it is usually clear from context which conversion is meant.

@RazeVX Жыл бұрын

I knew this already, since I literally had to google it a hundred times, because I found it funny to even ask the question ^^ Like, bytes are just a string of 1s and 0s, except that in a str every 1 and 0 takes at least 8 bits of its own, so it uses at least 8 times the memory (more with utf16 or utf32). Let me tell you: if you get the idea of just using a str ('10010011') to work with actual bits because you are more comfortable with that than with bytearrays, don't. It's painfully slow. I know because I tried it -.-
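If you do end up with bits written out as text, a minimal sketch of converting them to real bytes and back (assuming the string holds exactly 8 binary digits):

```python
bits = "10010011"                     # 8 characters of text

as_int = int(bits, 2)                 # parse the text as a base-2 number
as_bytes = as_int.to_bytes(1, byteorder="big")   # one actual byte

# And back to a zero-padded bit string for display.
back = format(as_bytes[0], "08b")

print(as_int, as_bytes, back)
```

The str holds eight characters while the bytes object holds a single byte, which is the memory difference described above.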

@jerrylu532 Жыл бұрын

Small hint: Instead of writing `encoding="utf-8"`, you can just write `encoding="u8"`, which saves you up to 3 keystrokes! Check the Python doc and you can see that `u8` is just another name for `utf-8`.
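You can check the alias yourself with the codecs module. A quick sketch:

```python
import codecs

# Both spellings resolve to the same codec.
short_name = codecs.lookup("u8").name
long_name = codecs.lookup("utf-8").name

encoded = "héllo".encode("u8")

print(short_name, long_name, encoded)
```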

@mCoding Жыл бұрын

Nice tip! I think I still prefer writing out the long form just for readability, but I didn't know about this shortcut before!

@AntonioZL Жыл бұрын

Just don't use u8 instead of utf-8 and then go on about your day writing for i in range(len(x)) 😁