@carlrossum 1 month ago
Thank you, dear Waymo, for making this available to the world. This is really helpful for both fundamental researchers and engineering specialists. Thanks. Keep up the great work.
@jaysrinivasan8205 7 months ago
Ashok is amazing
@karunald 10 months ago
I will never understand why Tesla continues to put intersection cameras 9 feet behind the nose of the car and behind our backs. When there's no room to creep forward to see, it doesn't work! Such an obvious, massive flaw. Maybe if I lived in CA with bike lanes and road buffers it would work. FIX IT
@xJRx7777 10 months ago
Keynote speech presented by Ashok Elluswamy at CVPR'23 WAD (Workshop on Autonomous Driving), where he discusses Tesla's self-driving technology, including the advancements in machine learning techniques, robotics, real-world implementations, and the underlying concepts of Tesla's technology. Let's break down the themes and deeper insights, including the esoteric aspects of this conversation.

### 1. **The Foundation of Autonomy and Robotics**:
- **Machine Learning at the Core**: Elluswamy emphasizes the use of machine learning in building the foundation for autonomy in vehicles. The explanation of neural networks, occupancy, voxel prediction, transformer models, and 3D scene construction reflects the cutting-edge practices of AI and machine learning.
- **Generalization and Robustness**: He emphasizes the stack's robustness to errors and its ability to adapt to different situations and environments. This ties to the broader theme of machines being adaptable, like humans, and being able to operate in the real world.
- **Integration with Robotics**: The technology is not limited to cars but is extendable to other robotic platforms, highlighting the unifying concepts within robotics and machine autonomy.

### 2. **Full Self-Driving (FSD) Technology**:
- **Real-World Implementation**: Elluswamy discusses the actual deployment of Tesla's FSD system in the U.S. and Canada. He doesn't merely focus on theoretical concepts; instead, he gives insights into the real-world challenges and accomplishments.
- **Camera and Sensing Technology**: The 360-degree coverage and the way cameras are used to interpret the world resonate with how human senses work. It reflects an attempt to create machines that perceive the world much like humans do.
- **Lane and Object Prediction**: The real-time prediction and analysis of lanes, objects, and traffic signals is a sophisticated task that mimics human cognition. The comparison of modeling lanes to modeling language (mentioning GPT) connects two complex realms of machine learning: natural language processing and computer vision.

### 3. **Simulation and Prediction of Different Futures**:
- **Dynamic World Modeling**: There's a profound concept of creating a "neural network simulator" that can simulate different futures based on different actions. It implies a move from static to dynamic models, mirroring the human ability to predict potential outcomes.
- **Generative Models and Action-Conditioning**: The ability to generate and condition models on the past to predict the future aligns with human imagination and intentionality, extending AI into realms previously exclusive to human consciousness.

### 4. **Hybrid Approaches and Configurability**:
- **Hybrid Neural Radiance Field (NeRF) Approach**: The discussion about 3D reconstruction with a hybrid NeRF approach represents the synthesis of different AI techniques to create something novel and effective.
- **Customizable System**: The configurable nature of the models, allowing for different applications and needs, reflects a move towards personalized and adaptable technology.

### 5. **Tesla as a Leader in Compute**:
- **Dojo and Scalability**: Mention of Dojo, Tesla's training hardware, and the company's aspiration to become a world leader in compute emphasizes the role of technology not only in driving but in broader societal change.

### 6. **Questions and Further Insights**:
- **Engaging with Complexity**: The subsequent questions and responses delve into complexities like auto-regressive models, voxel sizes, occupancy, inference times, and map components. These details allow a nuanced understanding of the technology.

### Conclusion
This transcript is more than just a description of Tesla's autonomous driving technology. It's a glimpse into the current and future state of AI and machine learning, offering insights into how these technologies are shaping our interaction with the world. It reflects the ongoing journey to bridge the gap between machines and human-like understanding, adaptability, and intuition. Moreover, the real-world application of these complex technologies represents a crucial step in the transition from theoretical research to tangible, everyday experiences, contributing to the evolution of our transportation systems and potentially our broader interaction with machines.
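A note on the action-conditioning idea in section 3 of the summary above: the loop below is a minimal sketch of what "same past, different actions, different futures" means mechanically. Everything in it is an assumption made for illustration. `State`, `toy_world_model`, the action names, and the constant-speed motion are hypothetical stand-ins, and the hand-written step function takes the place of the learned video model described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    x: float   # longitudinal position in metres
    lane: int  # lane index, 0 = rightmost lane


def toy_world_model(state: State, action: str,
                    dt: float = 0.5, speed: float = 10.0) -> State:
    """Hypothetical stand-in for a learned model: next state given an action."""
    lane = state.lane + (1 if action == "change_left" else 0)
    return State(x=state.x + speed * dt, lane=lane)


def rollout(state: State, actions: list[str]) -> list[State]:
    """Roll the model forward, feeding each prediction back in as context."""
    trajectory = [state]
    for action in actions:
        state = toy_world_model(state, action)
        trajectory.append(state)
    return trajectory


# Same starting context, two different action sequences.
start = State(x=0.0, lane=0)
future_straight = rollout(start, ["keep"] * 4)
future_change = rollout(start, ["keep", "change_left", "keep", "keep"])
```

Both rollouts cover the same distance but end in different lanes, which is the property a planner would use to compare candidate actions before committing to one.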
@simsonyee 11 months ago
How does the network account for different calibrations on different cars? The relative poses of the 8 cameras will be a little different from car to car, will drift over time, and will be influenced by heat and cold between day and night.
@simsonyee 11 months ago
Autonomous shipping? Autonomous trains?
@Nolan.Gurule 9 months ago
Trains would be awesome!
@fredharris929 11 months ago
They are hackable.
@ThomasButryn 11 months ago
I think the most interesting thing is that Tesla's approach is really based on experimenting with ideas.
@gregchristie2763 1 year ago
Imagine this applied to military robots with guns, or worse. Elon now has both vehicle robots and humanoid robots which, if this AI is applied to them, can learn by themselves with no restraints whatsoever. It really is quite scary, and this needs stopping now. Even Elon himself has said it needs legislating, and this is just the beginning. People joke about Skynet and the Terminator, but this is a real threat. Tesla also has the comms via Starlink. Very scary. 😢😢 Elon himself, in the Tucker Carlson interview available on KZfaq ("The dangers of hyper intelligent AI"), said this sort of AI unchecked could result in the annihilation of the human race, or "civilisational destruction" as he called it. Just think if a madman like Putin got hold of this tech. Very, very scary.
@meamzcs 1 year ago
Lol... Elon has access to literal ICBMs...
@joeysipos 1 year ago
12:05 bro that was totally the Tesla’s fault. It’s the one that blew through the stop sign…
@galileo3431 1 year ago
That's exactly what Ashok also says. In this case, FSD wasn't enabled; the human was driving and made the error. The vision-based AEB system was still able to perform emergency braking.
@SkradaczTENZNANY 1 year ago
Yes, that's what he said. "The ego driver" means the system controlling the Tesla.
@joeysipos 1 year ago
@galileo3431 Ah ok, yeah, now that I rewatched it. I thought he said the red car blew through the stop sign...
@meamzcs 1 year ago
@SkradaczTENZNANY LOL... The title of the slide literally says MANUAL DRIVING...
@doublesman0 1 year ago
For the love of God please add radar as a data point.
@tiro0oO5 11 months ago
Imaging radar, to be precise
@berlusmafia 1 year ago
Thanks to people like you and the engineers who work at Tesla, there is hope for a decent future for humanity.
@marcusaurelius6607 1 year ago
5 more lidars will be enough
@seojimjames 1 year ago
Brilliant all the way, thanks for the great, informative video. Tesla investors appreciate knowing just how great your work is. Also, consider how many drivers are visually challenged and still driving, or impaired, or distracted. FSD is consistent and a much better option than borderline bad drivers.
@pranjal86able 1 year ago
Here are the key points from the video titled "[CVPR'23 WAD] Keynote - Ashok Elluswamy, Tesla":
- The speaker, Ashok Elluswamy, is a member of the Autopilot team at Tesla.
- He presents their work on what they believe will be the foundation model for autonomy and robotics.
- Tesla has shipped the full self-driving beta software to all purchasers in the United States and Canada, with roughly 400,000 vehicles having driven up to 250 million miles on the full self-driving beta program.
- The self-driving stack is scalable and can navigate to any destination within the US, handling intersections, stopping at traffic lights, and interacting with other objects.
- The system is driven primarily by eight cameras on the car that provide full 360-degree coverage.
- The self-driving stack is based on modern machine learning, with many components folded into neural networks. This is different from the traditional approach to self-driving, which uses localization maps and various sensors.
- The system works primarily with cameras, and it performs quite well.
- The speaker discusses the importance of occupancy networks in their stack, which predict whether a voxel in 3D space is occupied or not. This modeling task is general and robust to ontology errors.
- The occupancy networks also predict the future flow of voxels, which captures arbitrary motion. Everything runs in real time.
- The architecture of the system may look complicated, but it's quite straightforward. Videos from multiple cameras stream in, and a large Transformer block builds up features and does temporal attention with some geometry thrown in.
- The same architecture can be used for other tasks needed for driving, such as predicting lanes and roads.
- Lanes are crucial for driving tasks but are challenging to predict due to their high-dimensional nature, graph structure, and large uncertainty. They can span the entire road, fork, merge, and sometimes even humans cannot agree on their structure.
- The team uses state-of-the-art generative modeling techniques, such as autoregressive transformers, to predict lanes. This approach is similar to GPT and predicts lanes one token at a time, considering the full graph structure.
- Moving objects like vehicles, trucks, and pedestrians need to be detected with their full kinematic state. The models used are multi-modal, taking in not just camera video streams but also other inputs like the vehicle's own kinematics and navigation instructions.
- The entire motion planning can also be done using a network, making the system a modern machine learning stack where everything is done end-to-end.
- The success of this system is attributed to the sophisticated auto-labeling pipeline that provides data from the entire fleet. This allows for multi-trip reconstruction, where multiple Tesla vehicles driving through the same location provide their video clips and kinematic data to construct the entire 3D scene.
- The team uses multi-trip reconstruction to gather data from the entire fleet, enabling them to reconstruct lanes, road lines, and other elements from anywhere on Earth.
- They use a hybrid approach to Neural Radiance Fields (NeRF) and general 3D reconstruction, which results in accurate and clear reconstructions of the scene, including vehicles, barriers, and trucks.
- Additional neural networks are run offline to produce labels for lanes, roads, and traffic lights, creating a vector representation that can be used as labels for the online stack.
- The system can auto-label traffic lights, predicting their shape, color, and relevancy, and these predictions are multi-view consistent.
- These predictions provide a superhuman understanding of the world from cameras, creating a foundation model that can be used in various places.
- The system helps with both autonomous and manual driving, providing emergency braking for crossing vehicles. This is a new feature, as crossing objects are harder to predict than vehicles in your own lane.
- The team is working on learning a more general world model that can represent arbitrary things, using recent advances in generative models like Transformers and diffusion.
- The neural network can predict future video sequences given past videos. It predicts for all eight cameras around the car jointly, understanding depth and motion on its own without any 3D priors.
- The model can be action-conditioned. For example, given the same past context, when asked for different futures (like keep driving straight or change lanes), the model can produce different outcomes.
- This creates a neural network simulator that can simulate different futures based on different actions, representing things that are hard to describe in an explicit system.
- Future prediction tasks can also be done in semantic segmentation or reprojected to 3D spaces, predicting future 3D scenes based on the past and action prompting.
- The team is working on solving various nuances of driving to build a general driving stack that can drive anywhere in the world and be human-like, fast, efficient, and safe.
- Training these models requires a lot of compute power. Tesla is aiming to become a world leader in compute with their custom-built training hardware, Dojo, which is starting production soon.
- The models are not just being built for the car but also for the robot, with several networks shared between the car and the robot.
- The foundational models for vision that the team is building are designed to understand everything and generalize across cars and robots. They can be trained on diverse data from the fleet and require a lot of compute power.
- The team is excited about the progress they expect to make in the next 12 to 18 months.
- In the Q&A session, the speaker explains that they can track moving objects in the 3D reconstruction with their hybrid NeRF approach, using various cues and signals in the data.
- The world model for future prediction tasks is a work in progress, but it's starting to work now, providing a simulator where they can roll out different outcomes and learn representations.
- The use of autoregressive models for predicting lanes is due to the graph structure of lanes and the need to model a distribution in high-dimensional space. This approach provides clear, non-blurry predictions that are useful downstream.
- The voxel size in the occupancy network output is a trade-off between memory and compute and can be configured based on the needs of the application.
- The same principles of the world model should apply to humanoid robots. The model should be able to imagine what actions like picking up a cup or walking to a door would look like.
- The occupancy network is used for collision avoidance in the full self-driving (FSD) system. It's particularly useful for dealing with unusual vehicles or objects that are hard to model using other methods.
- The general world model is still being optimized and hasn't been shipped to customers yet. It might be ready later in the year.
- The system doesn't use high-definition maps, so alignment isn't super critical. The maps used are low-definition, providing enough information to guide the network on which roads and lanes to take.

This concludes the summary of the video "[CVPR'23 WAD] Keynote - Ashok Elluswamy, Tesla". The speaker, Ashok Elluswamy, discusses the development of Tesla's self-driving technology, focusing on the use of machine learning and neural networks. He also answers questions about the technical details of the system.
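The voxel-size trade-off mentioned in the Q&A bullets above can be made concrete with some arithmetic. This is a sketch under stated assumptions: the 160 m x 160 m x 8 m volume, the three voxel sizes, and one byte per voxel are hypothetical numbers chosen for illustration, not figures from the talk. Lengths are kept in integer centimetres so the division is exact.

```python
def voxel_grid_stats(extent_cm, voxel_cm, bytes_per_voxel=1):
    """Return (voxel count, bytes) for a dense occupancy grid.

    extent_cm: (x, y, z) size of the covered volume in centimetres.
    voxel_cm:  edge length of one cubic voxel in centimetres.
    """
    # Ceiling division per axis so the grid always covers the full extent.
    counts = [-(-axis // voxel_cm) for axis in extent_cm]
    n_voxels = counts[0] * counts[1] * counts[2]
    return n_voxels, n_voxels * bytes_per_voxel


# Hypothetical volume around the car: 160 m x 160 m x 8 m.
extent = (16_000, 16_000, 800)

for size_cm in (80, 40, 20):
    n, mem = voxel_grid_stats(extent, size_cm)
    print(f"{size_cm} cm voxels: {n:,} voxels, {mem / 1e6:.1f} MB")
```

Halving the voxel edge multiplies the voxel count by eight, so memory and per-frame compute for a dense output grow cubically as resolution improves. That cubic growth is why a configurable voxel size is a useful knob.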
@pascalg.8772 1 year ago
Thanks for your work. Ashok is actually the head of Tesla's Autopilot team. He took over when Andrej Karpathy left.
@jacolantern1 1 year ago
If you’re going to use GPT to summarize the transcript, then at least say that’s what you’re doing. Neglecting to mention that is dishonest and misleading.
@nioncao 1 year ago
Too little progress compared to AI Day
@Jsmith32t 1 year ago
It’s pretty great progress actually. With the rate of change in the ML world you have to re-evaluate your approach every 3-6 months now. What they showed at AI Day is what they are shipping now, but they have already hit a wall with the rare corner cases. The world model will take advantage of their auto-labeling system and will provide clean data for the new approach. Probably two years from now the world model will be at a stable release, and metrics for regulators will start accumulating towards proper Level 5.
@SyntheticSpy 1 year ago
@Jsmith32t If their amount of compute scales like they are planning, it will likely be sooner than 2 years.
@falconxlc 1 year ago
12:10 The Tesla missed the stop sign, but Ashok says the red car blew a stop sign??
@f2yd 1 year ago
"In this case, on the left side the *ego* driver for some reason blew past the stop sign." The ego driver is the one driving the Tesla.
@falconxlc 1 year ago
@f2yd I stand corrected, he did say ego.
@FinanceNinja 1 year ago
@f2yd It was a human driving the Tesla who blew the stop sign, not FSD. He was explaining how FSD saw the path of the perpendicular car and intervened to stop the Tesla from hitting it.
@f2yd 1 year ago
@FinanceNinja I agree, that's what I was saying too. Ego driver = the human driving the car from which we see the video.
@pervezbhan1708 1 year ago
3D representations can provide valuable information for self-driving systems, but they are typically used in conjunction with other sensor modalities and data sources to create a comprehensive understanding of the environment. While 3D representations, such as point clouds or voxel grids, can capture geometric details of the surroundings, they have certain limitations that make additional information necessary. Here are a few reasons why 3D representations alone may not be sufficient for self-driving:

- Limited Field of View: 3D representations typically rely on sensors like lidar to capture the surrounding environment. However, lidar sensors have a limited field of view and can struggle with detecting objects beyond their range or behind obstacles. Therefore, additional sensor modalities like cameras and radar are used to provide complementary information.
- Perception Challenges: Understanding the environment goes beyond geometry. Perception tasks like object detection, classification, and semantic segmentation require information beyond 3D representations. Cameras, for example, provide valuable color and texture information that aids in object recognition and scene understanding.
- Dynamic Objects: Self-driving systems need to account for the motion and behavior of dynamic objects like pedestrians, cyclists, and other vehicles. 3D representations alone may not capture the temporal aspect of these objects. Additional components, such as object tracking and prediction algorithms, are necessary to anticipate and plan for the future movements of these dynamic entities.
- Environmental Context: Environmental context, such as traffic signs, road markings, and traffic signals, plays a critical role in self-driving. While some aspects of this information can be captured in 3D representations, other details may require explicit detection and recognition algorithms designed for specific objects or structures.

To build robust and reliable self-driving systems, it's common to integrate data from various sensors, including cameras, lidars, radars, and other sources, to create a holistic understanding of the environment. Combining multiple sensor modalities and data sources allows for redundancy, robustness in different environmental conditions, and a richer representation of the surroundings.

User: 3D occupancy prediction

ChatGPT: 3D occupancy prediction is a technique used in self-driving and robotics to estimate the occupancy or occupancy probability of the environment in three-dimensional space. It involves predicting which regions of the 3D space are occupied by obstacles or objects and which regions are free or unoccupied. This information is crucial for path planning, obstacle avoidance, and collision prevention. There are various approaches and algorithms used for 3D occupancy prediction, and here are a few common techniques:

- Voxel Grids: Voxel grids are a popular representation for 3D occupancy prediction. The 3D space is discretized into small volumetric cells called voxels. Each voxel can be marked as occupied, free, or unknown based on sensor measurements. Algorithms like occupancy grid mapping or probabilistic occupancy mapping can update and maintain the occupancy probabilities of the voxels over time.
- Point Cloud Processing: Point clouds obtained from lidar sensors are often used for 3D occupancy prediction. Various techniques, such as point cloud segmentation, clustering, and surface reconstruction, can be applied to identify and estimate the occupancy of objects in the point cloud data.
- Deep Learning: Deep learning methods, particularly convolutional neural networks (CNNs) and their variants, can be employed for 3D occupancy prediction. These networks can learn to directly predict occupancy probabilities or voxel-wise occupancy labels from 3D input data, such as point clouds or voxel grids. Training these models requires labeled data that indicates the occupancy status of each voxel or point.
- Sensor Fusion: 3D occupancy prediction can also benefit from sensor fusion, where data from multiple sensors, such as lidar, cameras, and radars, are combined to create a more comprehensive representation. Sensor fusion techniques, like Kalman filters or particle filters, can be used to integrate the data from different sensors and refine the occupancy predictions.

The accuracy and reliability of 3D occupancy prediction depend on factors such as sensor quality, sensor placement, calibration, and the algorithms used. It's common to combine 3D occupancy prediction with other perception tasks, such as object detection and semantic segmentation, to create a more detailed understanding of the environment for autonomous systems.
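The "probabilistic occupancy mapping" item in the list above is usually implemented as a per-voxel Bayesian update in log-odds form. The sketch below shows that classical technique, not the learned occupancy network from the keynote. The class name `OccupancyGrid` and the sensor-model values `p_hit=0.7` and `p_miss=0.4` are assumptions picked for illustration.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(log_odds: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


class OccupancyGrid:
    """Sparse voxel grid; voxels never observed are unknown (p = 0.5)."""

    def __init__(self, p_hit: float = 0.7, p_miss: float = 0.4):
        # Assumed inverse sensor model: evidence added per observation.
        self.l_hit = logit(p_hit)
        self.l_miss = logit(p_miss)
        self.log_odds: dict[tuple[int, int, int], float] = {}

    def update(self, voxel: tuple[int, int, int], hit: bool) -> None:
        """Bayesian update: add the log-odds of the new observation."""
        delta = self.l_hit if hit else self.l_miss
        self.log_odds[voxel] = self.log_odds.get(voxel, 0.0) + delta

    def probability(self, voxel: tuple[int, int, int]) -> float:
        return sigmoid(self.log_odds.get(voxel, 0.0))


grid = OccupancyGrid()
grid.update((1, 2, 0), hit=True)
grid.update((1, 2, 0), hit=True)   # second hit raises confidence
grid.update((5, 5, 0), hit=True)
grid.update((5, 5, 0), hit=False)  # conflicting evidence partly cancels
```

Working in log-odds turns the repeated Bayes update into an addition, which is why this form is popular for grids that are updated at sensor rate.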
@MrMolledm 1 year ago
Really enjoyed it. Can’t wait for autonomous mass transit buses.
@jascfdrac 1 year ago
Nice work!
@user-ry3zg5im3z 1 year ago
Is there a paper for this work? Occupancy networks.
@pavelt9391 1 year ago
Simple. Genius. Amazing.
@johanngerberding5956 1 year ago
Sad that this doesn't get the hype that GPTs get; this is much cooler to me.
@tenzinpassang4812 1 year ago
#RemindMeIn2030: How much is TSLA worth? 🤑🤑🤑🤑🤑🤑
@changtimwu 1 year ago
The dataset is great, but the reader lib strictly depends on TensorFlow 2.6.
@brewedicedcode1341 1 year ago
Great presentation, Ashok! So good to get an in-depth understanding of how FSD works. It gives you a sense of how complex the collision avoidance problem is. Thanks for the great work from you and your team.
@foodmaker5771 1 year ago
Wow, we need more lectures like this!
@foodmaker5771 1 year ago
Thanks for your valuable info
@Tommm73 1 year ago
Thank you, Ashok. Great work, great presentation, and as one of those 100,000 FSD Beta drivers it’s exciting to see what’s under the hood.
@explor794 1 year ago
Brilliant, I hope you got a lot of stock options for your work.
@frangalarza 1 year ago
I like the approach. It reminds me a lot of Tesla and Comma. I wonder how you compete with these two.
@gaussian3750 1 year ago
Very good keynote!
@RealRusty 1 year ago
Very insightful, thanks!
@zshn25 1 year ago
How is the occupancy network trained? Where is the supervision coming from? As was mentioned, NeRF could be an additional way to supervise it, but what is the primary supervision?
@yy-xv9vw 1 year ago
Fascinating! Regarding the fake vs. real pedestrian issue: I actually encountered it today myself! I made a video about it at kzfaq.info/get/bejne/etyibNST2Je0iXk.html
@suryatamilan576 1 year ago
Tamilian ❤️🔥
@frangalarza 1 year ago
An interesting exercise is to see how many views the videos on this channel get. They're all around ~1K except the Tesla ones :D
1 year ago
It would be interesting to find an intersection like kzfaq.info/get/bejne/oLZziZd4sL6ddKs.html where the left lane is the end of a ramp/bridge, so that looking flat and straight left would not actually indicate the correct road surface.
@aresdilin 1 year ago
Why 36 frames per second? 1/36 == 0.027777777777777777... How is elapsed time processed?
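On the elapsed-time question above: a frame interval of 1/36 s has no finite decimal or binary expansion, so one common approach (a general technique, not something the talk describes) is to derive each timestamp from an integer frame index instead of repeatedly adding a rounded interval. The sketch below assumes the 36 fps figure from the comment. The helper names `frame_time` and `frame_time_ns` are hypothetical.

```python
from fractions import Fraction

FPS = 36  # frame rate quoted in the comment above


def frame_time(index: int) -> Fraction:
    """Exact timestamp of a frame in seconds, as a rational number."""
    return Fraction(index, FPS)


def frame_time_ns(index: int) -> int:
    """Timestamp in whole nanoseconds, computed from the index each time."""
    return index * 1_000_000_000 // FPS


# Naive alternative: accumulate a rounded interval frame by frame.
# Each addition can round, so error may build up over long recordings.
accumulated = 0.0
for _ in range(FPS * 3600):  # one hour of frames
    accumulated += 1.0 / FPS
drift_seconds = abs(accumulated - 3600.0)
```

Because `frame_time_ns` recomputes from the index, its rounding error stays below one nanosecond per frame and never accumulates. In practice a system may also rely on hardware capture timestamps rather than an assumed constant rate.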
@bennabulsi6756 1 year ago
I really hope they have a better documentation method internally.