Comments
@VikashKumar-jx4qx 7 hours ago
Hey, I want to know: is there any platform like HackerRank or LeetCode for PySpark, where I can practice PySpark?
@vivekdutta7131 23 hours ago
with credit AS (
    select user_id, user_name,
           isnull((credit + amount), credit) as tot_credit
    from users u
    left join Transactions t on u.user_id = t.paid_to
)
select user_id, user_name,
       isnull((tot_credit - amount), tot_credit) as final_credit,
       case when isnull((tot_credit - amount), tot_credit) > 0 then 'NO' else 'Yes' end as credit_limit_breached
from credit u
left join Transactions t on u.user_id = t.paid_by
@satirthahalder9528 1 day ago
with cte as (
    select *,
           lag(log_date, 1, log_date) over(partition by emp_id order by log_date) prev_date
    from t
    where flag = 'Y'
),
cte2 as (
    select *, case when log_date - prev_date < 2 then 0 else 1 end gp from cte
),
cte3 as (
    select *, sum(gp) over(partition by emp_id order by log_date) as gps from cte2
)
select emp_id, min(log_date) start_date, max(log_date) end_date, count(*) consecutive_days
from cte3
group by emp_id, gps
having count(*) >= 2
@user-gq6cg3ls7f 2 days ago
Another approach:
with success as (
    select *,
           DATEADD(day, -1 * ROW_NUMBER() over (order by success_date), success_date) s_bucket,
           ROW_NUMBER() over (order by success_date) S_RN
    from succeeded
    where year(success_date) != '2018'
),
s_final as (
    select min(success_date) start_date, max(success_date) end_date,
           ROW_NUMBER() over (order by (select null)) RN
    from success
    group by s_bucket
),
fail as (
    select *,
           DATEADD(day, -1 * ROW_NUMBER() over (order by fail_date), fail_date) f_bucket,
           ROW_NUMBER() over (order by fail_date) F_RN
    from failed
    where year(fail_date) != '2018'
),
f_final as (
    select min(fail_date) start_date, max(fail_date) end_date,
           ROW_NUMBER() over (order by (select null)) RN
    from fail
    group by f_bucket
)
select 'Succeeded' as period_date, format(start_date, 'dd-MM-yyyy') as start_date, format(end_date, 'dd-MM-yyyy') as end_date from s_final
union all
select 'failed' as period_date, format(start_date, 'dd-MM-yyyy') as start_date, format(end_date, 'dd-MM-yyyy') as end_date from f_final
order by start_date
@user-gq6cg3ls7f 3 days ago
Another approach using SQL Server:
with U1 as (
    select User1, count(*) cnt from facebook group by User1
),
U2 as (
    select User2, count(*) cnt from facebook group by User2
),
final as (
    select coalesce(User1, User2) as Users,
           (isnull(U1.cnt, 0) + isnull(U2.cnt, 0)) as Popularity_Percentage
    from U1
    full join U2 on U1.User1 = U2.User2
)
select Users,
       round(cast(round(Popularity_Percentage, 2) as decimal) * 100 / count(user) over (order by user rows between unbounded preceding and unbounded following), 3) total_users
from final
@nagumallasravansai249 4 days ago
you are awesome buddy!
@Tech_world-bq3mw 4 days ago
My solution (df = original, df1 = the one with the range of numbers):
df_missing = df1.join(df, df1.id == df.id, how='left_anti')
df_missing.show()
@vivekdutta7131 9 days ago
with cte AS (
    select question_id,
           isnull(sum(case when action = 'show' then 1 end), 0) as show_count,
           isnull(sum(case when action = 'answer' then 1 end), 0) as answer_count
    from survey_log
    group by question_id
),
ctf as (
    select question_id,
           (answer_count / show_count) as answer_rate,
           row_number() over(order by (answer_count / show_count) desc, question_id asc) as rn
    from cte
)
select question_id from ctf where rn = 1
@user-ju4ih5xr8e 10 days ago
Hello, I think you might have missed this scenario: what if the ids are some random integers or strings? Please review my solution and let me know if my understanding is correct. Thanks.
with cte as (
    select *, row_number() over() rn from swap_id
)
select name, id,
       case when rn % 2 = 0 then lag(id) over()
            when rn % 2 <> 0 then coalesce(lead(id) over(), id)
            else id
       end swapped
from cte
@Hope-xb5jv 10 days ago
with cte as (
    select question_id,
           sum(case when action = 'answer' then 1 else 0 end) / sum(case when action = 'show' then 1 else 0 end) as answer_rate
    from survey_log
    group by question_id
),
cte2 as (
    select question_id,
           row_number() over(order by question_id) as id,
           dense_rank() over(order by answer_rate desc) as rate
    from cte
)
select question_id from cte2 where id = 1 and rate = 1
@Aman-lv2ee 11 days ago
select name,
       case when id % 2 = 0 then lag(id, 1, id) over(order by id)
            when id % 2 != 0 then lead(id, 1, id) over(order by id)
       end as swap_ids
from swap_id;
@user-oz7ol1bg3n 11 days ago
Thankyouuuuuuuuuu
@DEwithDhairy 11 days ago
Glad you found it useful.
@rahulmittal116 12 days ago
Nice🤙
@DEwithDhairy 11 days ago
Thanks Rahul
@AnilKumar-qe6er 12 days ago
with cte as (
    select t2name, STRING_AGG(topping_name, ',') as pizza, sum(cost) as total_cost
    from (
        select t1.*, t2.topping_name as t2name, t2.cost as t2cost
        from toppings t1
        cross join toppings t2
        where t1.topping_name <> t2.topping_name
    ) a
    group by t2name
)
select pizza, total_cost from cte order by total_cost;
@DEwithDhairy 11 days ago
Great approach.
@AnilKumar-qe6er 11 days ago
@@DEwithDhairy Thank you so much 🥰
@Aman-lv2ee 15 days ago
Nice video; please share more Python questions for DE. Thanks!
@DEwithDhairy 13 days ago
Thanks, it's in the pipeline.
@vivekdutta7131 16 days ago
with cte AS (
    select *, row_number() over(order by topping_name) as rn from toppings
),
ctf AS (
    select t.topping_name as topping_name1, g.topping_name as topping_name2, t.cost as cost
    from cte t, cte g
    where 1 = 1 and t.rn <> g.rn
)
select string_agg(topping_name1, ',') as pizza, sum(cost) as total_cost
from ctf
group by topping_name2
@user-gq6cg3ls7f 16 days ago
select concat(z.topping_name, ',', y.topping_name, ',', x.topping_name) as pizza,
       x.cost + y.cost + z.cost as total_cost
from toppings x
inner join toppings y on x.topping_name > y.topping_name
inner join toppings z on y.topping_name > z.topping_name
@DEwithDhairy 17 days ago
Correction at 4:16: Self Join*
@nikunjmistry373 17 days ago
Thank you, Dhiraj, for using my website and giving a shoutout! I'm really happy to see my website being used by KZfaq content creators.😊
@vrishabhbhonde6899 9 days ago
@nikunjmistry373 Sir, have you solved the same kind of Python questions from LeetCode or StrataScratch? Is there any website like this for Python questions that you've made? Thank you for the SQL website, it's helpful.
@cretive549 17 days ago
Sir, I am a college student and I want to become a data engineer, but I can't figure out what to study and where to study it from. I only know Python and SQL.
@DEwithDhairy 17 days ago
Master SQL and Python, and pick up some cloud platform.
@cretive549 17 days ago
@@DEwithDhairy Sir, some people say I will also have to learn PySpark and other things. And what exactly should I learn in the cloud? Please guide me, sir.
@nikunjmistry373 17 days ago
@@cretive549 I would suggest something: don't just follow what everyone else is doing (MERN and full stack). For DE, write ETL scripts, play with Docker and Apache Spark, and handle data in Apache Superset... step by step. For any fresher, the first things are SQL and Python; the rest as needed.
@CharanSaiAnnam 19 days ago
My solution:
@udf
def removeSpaceandVowels(r):
    s = ""
    for i in r:
        if i not in "aeiou ":
            s += i
    return s

dfr = df.withColumn("consonants", removeSpaceandVowels(col("message"))) \
        .drop("message") \
        .withColumn("consonantCount", length(col("consonants")))
@satishgs5355 21 days ago
from pyspark.sql.window import Window

df1 = df.withColumn("rnk", row_number().over(Window.partitionBy(df.emp_id).orderBy(col("emp_id")))).filter(col("rnk") > 1)
df1.show()

df2 = df1.dropDuplicates(df.columns)
df2.show()
@DEwithDhairy 21 days ago
👏
@KeshannaKummari-y6h 22 days ago
l = [1, 2, 3, [4, 5], [6, 7], [7, 8, 9]]
m = []
for i in l:
    if type(i) == int:
        m.append(i)
    elif type(i) == list:
        for j in range(len(i)):
            m.append(i[j])
    else:
        pass
print(m)
@DEwithDhairy 22 days ago
Hi, this solution will not work if we have nesting in the list, like [[[2, 3]], 10]. A recursive version handles that; see the sketch below.
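For illustration, a minimal recursive sketch in plain Python that flattens arbitrary nesting (the function and variable names here are illustrative, not from the video):

def flatten(items):
    # Walk the list; recurse into nested lists, collect everything else.
    out = []
    for item in items:
        if isinstance(item, list):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out

print(flatten([[[2, 3]], 10]))  # [2, 3, 10]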
@gauravpathak9625 23 days ago
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

w1 = Window.partitionBy(col("cust_id")).orderBy(col("flight_id"))
w2 = Window.partitionBy(col("cust_id")).orderBy(col("flight_id").desc())

df_origin = df_flight.withColumn("rank_asc", row_number().over(w1)).filter(col("rank_asc") == 1).withColumnRenamed("origin", "origin_1").withColumnRenamed("cust_id", "cust_id_o")
df_destination = df_flight.withColumn("rank_desc", row_number().over(w2)).filter(col("rank_desc") == 1).withColumnRenamed("destination", "destination_1")

df_final = df_origin.join(df_destination, df_origin.cust_id_o == df_destination.cust_id, "inner").select(col("cust_id"), col("origin_1").alias("origin"), col("destination_1").alias("destination"))
display(df_final)
@KapilKumar-hk9xk 23 days ago
Amazing <3
@DEwithDhairy 22 days ago
Thanks 😄
@nupoornawathey100 25 days ago
Easier solution - Scala Spark using the DataFrame API:

val common_likes_df = friends_df.as("f")
  .join(likes_df.alias("l"), $"f.friend_id" === $"l.user_id", "inner")
  .select("f.user_id", "l.page_id")

common_likes_df.as("cl")
  .join(likes_df.as("l"), $"cl.user_id" === $"l.user_id" && $"cl.page_id" === $"l.page_id", "left_anti")
  .orderBy($"user_id")
  .show(false)

Using Spark SQL:

spark.sql("""
  with common_likes as (
    select f.user_id, l.page_id
    from friends f
    inner join likes l on f.friend_id = l.user_id
  )
  select user_id, page_id
  from common_likes
  where (user_id, page_id) not in (select user_id, page_id from likes)
  order by user_id
""").show(false)

Output:
+-------+-------+
|user_id|page_id|
+-------+-------+
|2      |B      |
|2      |C      |
|3      |A      |
|4      |C      |
|4      |C      |
|4      |A      |
+-------+-------+
@Chathur732 26 days ago
with cte as (
    select *,
           extract(day from P1_date) - row_number() over(order by (select null)) as diff2
    from (
        select *,
               case when extract(day from ld) - extract(day from P1_date) = 1 then 1 end as diff
        from (
            select P1.emp_id as P1_emp_id,
                   P1.log_date as P1_date,
                   P1.flag as P1_flag,
                   lead(P1.log_date) over(partition by P1.emp_id order by P1.log_date) as ld
            from pwc_attandance_log P1
            where P1.flag = 'Y'
        ) a
        where case when extract(day from ld) - extract(day from P1_date) = 1 then 1 end is not null
    ) b
)
select p1_emp_id, min(p1_date), max(ld), count(diff2) + 1 as no_of_days
from cte
group by p1_emp_id, diff2
@kalaivanik8872 27 days ago
MySQL:
with per_month as (
    select year(created_at) as yr, month(created_at) as mon, sum(value) total_per_mon
    from amazon_monthly_rev
    group by year(created_at), month(created_at)
),
prev_revenue as (
    select *, lag(total_per_mon, 1, 0) over() pre from per_month
)
select concat(yr, '-', mon) as yearMonth,
       round((total_per_mon - pre) / pre * 100, 2) as rev_diff
from prev_revenue;
@kalaivanik8872 27 days ago
with percentage as (
    select student_id, round(sum(marks) / count(1), 0) as percentage
    from marks
    group by student_id
)
select s.student_id, s.name, p.percentage,
       case when p.percentage >= 70 then 'Distinction'
            when p.percentage between 60 and 69 then 'First Class'
            when p.percentage between 50 and 59 then 'Second Class'
            when p.percentage between 40 and 49 then 'Third Class'
            when p.percentage <= 39 then 'Fail'
       end as Result
from student s
join percentage p on s.student_id = p.student_id;
@kalaivanik8872 27 days ago
with first_joindate as (
    select *, min(join_date) over(partition by user_id) as first from user_data
),
new_user_count as (
    select join_date,
           sum(case when join_date = first then 1 else 0 end) new_user,
           count(1) total_user
    from first_joindate
    group by join_date
)
select join_date, new_user,
       case when new_user > 0 then round((new_user / total_user) * 100, 0) else 0 end as percentage_new_user
from new_user_count;
@kalaivanik8872 28 days ago
with cte as (
    select *,
           (total_sales_revenue - lag(total_sales_revenue, 1, 0) over(partition by product_id)) diff
    from salesq
)
select * from products
where product_id = (
    select product_id from cte group by product_id having min(diff) > 0
)
@kalaivanik8872 28 days ago
create table pop(user1 int, user2 int);
insert into pop values (1,5), (1,3), (1,6), (2,1), (2,6), (3,9), (4,1), (7,2), (8,3);

with all_pairs as (
    select user1, user2 from pop
    union
    select user2, user1 from pop
)
select user1 as user,
       round(count(user1) / (select count(distinct user1) from all_pairs) * 100, 2) as per
from all_pairs
group by user1;
@kalaivanik8872 28 days ago
select a.source, a.destination
from flight a
join flight b on a.source < b.source and a.destination = b.source;
@kalaivanik8872 29 days ago
with report as (
    select success_date as date, 'succeeded' as status from succeeded
    union
    select fail_date as date, 'fail' as status from failed
),
cte2 as (
    select *,
           row_number() over(partition by status order by date) rn,
           (day(date) - row_number() over(partition by status order by date)) diff
    from report
    where date >= '2019-01-01'
)
select case when status = 'succeeded' then 'succeeded' else 'failed' end as period_state,
       min(date) as start_date,
       max(date) as end_date
from cte2
group by diff, status
@DEwithDhairy 28 days ago
Great approach, thanks for sharing.
@koushiksinha3007 29 days ago
Hey man, thank you for your videos; I have been following you for a long time. I have a question: in coding interviews for Data Engineers, do they provide LeetCode-style predefined code, or do we have to write the entire code, including the input handling?
@DEwithDhairy 29 days ago
Interview questions are not in LeetCode format. The interviewer just gives you the input and the output; you need to write the core logic and then dry-run it on the given input to check that it produces the desired output.
@sureshrecinp 29 days ago
Thank you for sharing, very useful info.
@DEwithDhairy 29 days ago
Thanks, do share it in your network.
@BiswajitSibun-n4b 1 month ago
The best video about this topic I've found on YT.
@DEwithDhairy 1 month ago
Glad you found it helpful.
@vivekdutta7131 1 month ago
a = [1, 2, 0, 4, -1, 5, 6, 0, 0, 7, 0]
b = []
c = []
for i in a:
    if i != 0:
        b.append(i)
    else:
        c.append(i)
b.extend(c)
print(b)
@vivekdutta7131 1 month ago
def operation(st):
    st_new = ""
    cnt = 0
    dr = st[0]
    print(dr)
    for i in st:
        if i == dr:
            cnt += 1
        else:
            st_new = st_new + dr + str(cnt)
            cnt = 1
            dr = i
    st_new = st_new + dr + str(cnt)
    print(st_new)

if __name__ == "__main__":
    st = "abcabbbccaabd"
    print(st)
    operation(st)
@vivekdutta7131 1 month ago
a = [2, 0, 2, 1, 1, 0]
for i in range(len(a)):
    for j in range(0, len(a) - 1):
        if a[j] > a[j + 1]:
            a[j], a[j + 1] = a[j + 1], a[j]
print(a)
@vivekdutta7131 1 month ago
a = [2, 3, [10, 20, [100, 200], [2, 5]], 50]
b = []

def rec(n):
    for i in n:
        if isinstance(i, list):
            rec(i)
        else:
            b.append(i)
    return b

print(rec(a))
@DEwithDhairy 29 days ago
Great approach
@sushanthsai2078 1 month ago
Tried a different approach:
from pyspark.sql.functions import *

frnds_like_df = friends_df.alias("fdf").join(likes_df.alias("ldf"), friends_df['friend_id'] == likes_df['user_id'], 'left') \
    .select('fdf.user_id', 'ldf.page_id') \
    .groupBy("user_id").agg(collect_set(col('page_id')).alias('liked_array'))

user_like_df = friends_df.alias("fdf").join(likes_df.alias("ldf"), friends_df['user_id'] == likes_df['user_id'], 'left') \
    .select('fdf.user_id', 'ldf.page_id') \
    .groupBy("user_id").agg(collect_set(col('page_id')).alias('user_likes_array'))

final_df = frnds_like_df.join(user_like_df, ['user_id']) \
    .withColumn("uncommon", array_except(col('liked_array').cast('array<string>'), col('user_likes_array').cast('array<string>'))) \
    .filter(size(col("uncommon")) > 0) \
    .withColumn("values", explode(col("uncommon"))) \
    .drop(*['liked_array', 'user_likes_array', 'uncommon'])

final_df.show()
@ithisrinu9593 1 month ago
I really appreciate you, brother. I was running into many issues I could not figure out, but this video resolved all the errors. Thank you.
@DEwithDhairy 1 month ago
Glad you found it useful.
@SureshK-gr4vc 1 month ago
Please provide a code repo for this.
@DEwithDhairy 1 month ago
I haven't created a repo. You can write the code along as you go through the video.
@DEwithDhairy 1 month ago
We also need this piece after the filter condition:

# Finding the unique records.
answer_df = friend_page_concat_df.select(col("friend_id").alias("user_id"), col("page_id")).distinct()
answer_df.show()
@srinivasn2646 1 month ago
Thanks Man
@DEwithDhairy 1 month ago
Glad you liked it.
@neelbanerjee7875 1 month ago
I have a simpler solution with the first_value and last_value window functions, as below (written in Spark SQL; it can be adapted to a PySpark script accordingly):

%sql
with cte as (
    select cust_id,
           first_value(origin) over(
               partition by cust_id
               order by flight_id
               range between unbounded preceding and unbounded following
           ) as origin,
           last_value(destination) over(
               partition by cust_id
               order by flight_id
               range between unbounded preceding and unbounded following
           ) as destination
    from flight
)
select distinct * from cte
@Sudeep-ow4pe 1 month ago
Thank you so much for making these different types of PySpark questions. This is really helpful.
@DEwithDhairy 1 month ago
Glad you found it useful.
@DataEngineerPratik 1 month ago
Could you please make an alternate solution in MySQL?
@Arnob_111 1 month ago
This solution only works for a specific month. This will fail if the data is scaled over several months.
@DEwithDhairy 1 month ago
Yes, correct. To cover that scenario, take the difference between the date and the row number to build the group; that handles all the cases (see the sketch below). I have covered this approach in my videos.
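For illustration only, a minimal PySpark sketch of that date-minus-row_number grouping (the DataFrame and column names — df, emp_id, log_date — are assumptions, not taken from the video):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed input: an attendance-style DataFrame df with columns emp_id and log_date (date type).
w = Window.partitionBy("emp_id").orderBy("log_date")

grouped = (
    df.withColumn("rn", F.row_number().over(w))
      # Consecutive dates share the same (log_date - rn) value, even across month boundaries.
      .withColumn("grp", F.expr("date_sub(log_date, rn)"))
)

streaks = (
    grouped.groupBy("emp_id", "grp")
           .agg(F.min("log_date").alias("start_date"),
                F.max("log_date").alias("end_date"),
                F.count("*").alias("consecutive_days"))
)
streaks.show()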