#Data Wrangling
ibarrau · 11 hours
[Python] PySpark to M, SQL or Pandas
A while ago I wrote an article on how to write some reference SQL and M (Power Query) snippets in pandas. While it was very useful at the time, the truth is that today there is another language with a strong foothold in data analysis.
Spark has become the main player for reading data in lakes. While it's true that SparkSQL exists, I didn't want to pass up these code analogies between PySpark, M, SQL and Pandas, so that those familiar with one language can see how to perform the same action in another.
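Since we'll stick to the DataFrame API here, a minimal sketch of that SparkSQL route for the SQL-minded, assuming an active SparkSession named spark and a loaded PySpark DataFrame spark_frame (the view name table1 is arbitrary):
spark_frame.createOrReplaceTempView("table1")  # expose the frame to SQL
spark.sql("SELECT column1, COUNT(*) FROM table1 GROUP BY column1").show()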
First, let's agree on how to read this post.
Power Query runs in layers: each line calls the previous one (which returns a table), producing this layered perspective. So whenever you read #"Paso Anterior" ("previous step") in the code, we are talking about a table.
In Python, we'll assume "df" is an already loaded pandas DataFrame (pandas.DataFrame) and "spark_frame" an already loaded PySpark DataFrame (via spark.read).
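For reference, a minimal sketch of that assumed setup (the file name sales.csv is hypothetical):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.read_csv("sales.csv")  # pandas DataFrame
spark_frame = spark.read.csv("sales.csv", header=True, inferSchema=True)  # PySpark DataFrame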
Let's look at the examples, which will be listed in the following order: SQL, PySpark, Pandas, Power Query.
Top N rows
In SQL:
SELECT TOP 5 * FROM table
In PySpark:
spark_frame.limit(5)
In Pandas:
df.head()
In Power Query:
Table.FirstN(#"Paso Anterior",5)
Count rows
SELECT COUNT(*) FROM table1
spark_frame.count()
df.shape[0]  # shape is an attribute, not a method; or len(df)
Table.RowCount(#"Paso Anterior")
Select columns
SELECT column1, column2 FROM table1
spark_frame.select("column1", "column2")
df[["column1", "column2"]]
#"Paso Anterior"[[column1],[column2]]
Or it could be: Table.SelectColumns(#"Paso Anterior", {"column1", "column2"})
Filter rows
SELECT column1, column2 FROM table1 WHERE column1 = 2
spark_frame.filter("column1 = 2") # OR spark_frame.filter(spark_frame['column1'] == 2)
df[['column1', 'column2']].loc[df['column1'] == 2]
Table.SelectRows(#"Paso Anterior", each [column1] = 2)
Multiple row filters
SELECT * FROM table1 WHERE column1 > 1 AND column2 < 25
spark_frame.filter((spark_frame['column1'] > 1) & (spark_frame['column2'] < 25))
Or with OR and NOT operators:
spark_frame.filter((spark_frame['column1'] > 1) | ~(spark_frame['column2'] < 25))
df.loc[(df['column1'] > 1) & (df['column2'] < 25)]
Or with OR and NOT operators:
df.loc[(df['column1'] > 1) | ~(df['column2'] < 25)]
Table.SelectRows(#"Paso Anterior", each [column1] > 1 and [column2] < 25)
Or with OR and NOT operators:
Table.SelectRows(#"Paso Anterior", each [column1] > 1 or not ([column2] < 25))
Filters with complex operators
SELECT * FROM table1 WHERE column1 BETWEEN 1 and 5 AND column2 IN (20,30,40,50) AND column3 LIKE '%arcelona%'
from pyspark.sql.functions import col
spark_frame.filter(
    (col('column1').between(1, 5)) &
    (col('column2').isin(20, 30, 40, 50)) &
    (col('column3').like('%arcelona%'))
)
# Or:
spark_frame.where(
    (col('column1').between(1, 5)) &
    (col('column2').isin(20, 30, 40, 50)) &
    (col('column3').contains('arcelona'))
)
df.loc[(df['column1'].between(1,5)) & (df['column2'].isin([20,30,40,50])) & (df['column3'].str.contains('arcelona'))]
Table.SelectRows(#"Paso Anterior", each ([column1] >= 1 and [column1] <= 5) and List.Contains({20,30,40,50}, [column2]) and Text.Contains([column3], "arcelona"))
Join tables
SELECT t1.column1, t2.column1 FROM table1 t1 LEFT JOIN table2 t2 ON t1.column_id = t2.column_id
When columns share the same name, it's best to rename them with aliases, like this:
spark_frame1.join(spark_frame2, spark_frame1["column_id"] == spark_frame2["column_id"], "left").select(spark_frame1["column1"].alias("column1_df1"), spark_frame2["column1"].alias("column1_df2"))
In pandas there are two functions that can help us in this process: merge and join.
df_joined = df1.merge(df2, on='column_id', how='left', suffixes=('_df1', '_df2'))
df_joined = df1.join(df2.set_index('column_id'), on='column_id', how='left', lsuffix='_df1', rsuffix='_df2')
Then we select the two columns:
df_joined[['column1_df1', 'column1_df2']]
In Power Query we pick the columns we need up front (keeping the join key) and then add the second column:
#"Origen" = #"Paso Anterior"[[column1_t1],[column_t1_id]]
#"Paso Join" = Table.NestedJoin(#"Origen", {"column_t1_id"}, table2, {"column_t2_id"}, "Prefijo", JoinKind.LeftOuter)
#"Expansion" = Table.ExpandTableColumn(#"Paso Join", "Prefijo", {"column1_t2"}, {"Prefijo_column1_t2"})
Group By
SELECT column1, count(*) FROM table1 GROUP BY column1
from pyspark.sql.functions import count
spark_frame.groupBy("column1").agg(count("*").alias("count"))
df.groupby('column1')['column1'].count()
Table.Group(#"Paso Anterior", {"column1"}, {{"Alias de count", each Table.RowCount(_), type number}})
Filter an aggregation (HAVING)
SELECT store, sum(sales) FROM table1 GROUP BY store HAVING sum(sales) > 1000
from pyspark.sql.functions import sum as spark_sum
spark_frame.groupBy("store").agg(spark_sum("sales").alias("total_sales")).filter("total_sales > 1000")
df_grouped = df.groupby('store')['sales'].sum()
df_grouped.loc[df_grouped > 1000]
#"Grouping" = Table.Group(#"Paso Anterior", {"store"}, {{"Alias de sum", each List.Sum([sales]), type number}})
#"Final" = Table.SelectRows(#"Grouping", each [Alias de sum] > 1000)
Sort a column in descending order
SELECT * FROM table1 ORDER BY column1 DESC
spark_frame.orderBy("column1", ascending=False)
df.sort_values(by=['column1'], ascending=False)
Table.Sort(#"Paso Anterior",{{"column1", Order.Descending}})
Append one table to another with the same structure
SELECT * FROM table1 UNION SELECT * FROM table2
spark_frame1.union(spark_frame2)
In Pandas we have two well-known options, the append function and concat; note that append was deprecated in pandas 1.4 and removed in 2.0, so concat is the safer choice.
df1.append(df2)  # removed in pandas 2.0; prefer concat
pd.concat([df1, df2])
Table.Combine({table1, table2})
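One caveat worth a sketch: SQL's UNION removes duplicate rows, while union() in PySpark and concat() in pandas behave like UNION ALL. Assuming both frames share the same schema, matching SQL's deduplicating semantics could look like this:
spark_frame1.union(spark_frame2).distinct()  # drop duplicates to match SQL UNION
pd.concat([df1, df2], ignore_index=True).drop_duplicates()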
Transformations
The following transformations map directly between PySpark, Pandas and Power Query, since they are not as common in a query language like SQL. The results may not be identical, but they are similar enough for the case at hand.
Profile the contents of a table
spark_frame.summary()
df.describe()
Table.Profile(#"Paso Anterior")
Check the distinct values of the columns
spark_frame.groupBy("column1").count().show()
df['column1'].value_counts()
Table.Profile(#"Paso Anterior")[[Column],[DistinctCount]]
Create a test table with hand-loaded data
spark_frame = spark.createDataFrame([(1, "Boris Yeltsin"), (2, "Mikhail Gorbachev")], ["CustomerID", "Name"])
df = pd.DataFrame([[1, "Boris Yeltsin"], [2, "Mikhail Gorbachev"]], columns=["CustomerID", "Name"])
Table.FromRecords({[CustomerID = 1, Name = "Bob", Phone = "123-4567"]})
Remove a column
spark_frame.drop("column1")
df.drop(columns=['column1'])
df.drop(['column1'], axis=1)
Table.RemoveColumns(#"Paso Anterior",{"column1"})
Apply a transformation to a column
from pyspark.sql.functions import col
spark_frame.withColumn("column1", col("column1") + 1)
df['column1'] = df.apply(lambda x: x['column1'] + 1, axis=1)
Table.TransformColumns(#"Paso Anterior", {{"column1", each _ + 1, type number}})
That's the end of this long journey through queries and transformations, which should make for a better time working in pure code with PySpark, SQL, Pandas and Power Query, so that knowing one of them we can use the others.
techinfotrends · 1 month
Discover how automation streamlines the data wrangling process, saving time and resources while ensuring accuracy and consistency. Explore the latest tools and techniques revolutionizing data preparation.
rajaniesh · 3 months
Microsoft Fabric: Pioneering the AI and Machine Learning Frontier
🌟 Explore Microsoft Fabric: Revolutionizing Data Science & Machine Learning! From data analysis to model training, uncover insights & drive innovation. Read now! #DataScience #MachineLearning #MicrosoftFabric #AI #Innovation #Tech
In today’s dynamic business landscape, organizations are increasingly turning to data science and machine learning to gain insights, make informed decisions, and drive innovation. Imagine you’re working for a supermarket chain. You want to optimize your inventory management to meet customer demands efficiently while minimizing food waste. Or perhaps you aim to personalize your marketing…
garymdm · 10 months
Customer Experience means business wins control of BI
Customer Experience means business wins control of BI via @bevelson #BigData
Time for IT to let business users grow up? Earlier this month, Forrester analyst Boris Evelson (@bevelson) commented that business has finally won the battle with IT over who will control Business Intelligence and Analytics.
The customer is the priority
The reason: businesses agree that the customer is the priority, and that BI must support the priorities and needs of customer-facing…
A Data Analytics Company in Uganda, East Africa
As an analytics solutions organization, we help our clients and partners make sense of their data and extract the hidden insights within the organization. We take your data through the data analytics process and produce actionable insights and recommendations for your organization to apply.
We explore what happened within the data.
We understand why it happened.
We predict what will happen.
For every project, you can expect our highest performance as we produce analysis to answer your business questions. Each project will follow three core steps.
Project Scope and Plan: In this step we will provide an overview and project plan document covering a definition of your project roles, project roadmap, objectives, communication plans, and work agreements, as well as a refined problem statement and draft project scope.
Low-Fi Recommendation Pitch: After conducting some exploratory data analysis (EDA) and initial visualizations, we will prepare a presentation of the findings and insights discovered and solicit feedback from you as the client on areas to expand further.
Final Report and Presentation: To finalize, we will provide the report and presentation based on the project's business questions or problem statements. We will summarize the project development process, give a technical overview of the code, and present insights, trends, and patterns.
Learn more https://analyticalkeysolutions.com
fabiodisconzi · 2 years
Data Wrangling (from IBM Data Analyst course)
Data Wrangling is an iterative process that involves data exploration, transformation, and validation.  
Transformation of raw data includes the tasks you undertake to (a short pandas sketch follows the list):
Structurally manipulate and combine the data using Joins and Unions.
Normalize data, that is, clean the database of unused and redundant data.
Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
Clean data, which involves profiling data to uncover quality issues, visualizing data to spot outliers, and fixing issues such as missing values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
Enrich data, which involves considering additional data points that could add value to the existing data set and lead to a more meaningful analysis.
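A minimal, hypothetical pandas sketch of a few of these tasks; the file names and columns are invented for illustration:
import pandas as pd

# Hypothetical inputs (invented names and columns, for illustration only)
orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # customer_id, name

# Combine / denormalize: join the two tables into one wide table
combined = orders.merge(customers, on="customer_id", how="left")

# Clean: remove duplicate rows and fix missing values
combined = combined.drop_duplicates()
combined["amount"] = combined["amount"].fillna(0)

# Enrich: derive an extra data point from the existing columns
combined["amount_band"] = pd.cut(
    combined["amount"],
    bins=[0, 50, 200, float("inf")],
    labels=["low", "mid", "high"],
    include_lowest=True,
)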
theestuaryandthesea · 2 years
~*sad covid noises~*
y'all, covid is no joke. even after being super careful (N95 mask indoors always, no crowded outdoor spaces) N and I still have the new variant and it feels like being hit by a bus. Unfortunately, work continues, so I'm trying to do one thing each day to move my diss project forward and meet deadlines before my advisor emails me in August ahahaha This week I:
Modmailed the reddit people to see if a) they wanna do interviews and b) if I can ask other people in the sub to do interviews
Set up all the prep material introducing folks to the study (already approved by IRB/my committee)
Made a Discord to host interviews + materials
Figured out that I can probably use vlookup instead of futzing with python to match and extract terms across excel books
The last one has been a weird problem to have. Basically, I used Pushshift to scrape a bunch of posts and comments from a subreddit, but the scripts are different for each kind of content, so posts + comments are returned in separate CSVs. The topic modeling method I'm using doesn't care about word order, so for month-based analysis, I can just dump all the text posts and all the comment text from a particular month into a new CSV and call it good. The problem has been recreating data sets by flair, or the user-generated label for a particular post. Flair is a returnable aspect of the metadata, but Pushshift isn't able to scrape posts by flair yet. So I've got a giant spreadsheet full of posts that I can filter by flair in Excel, and from there, I can identify the unique post ID for every text post in a given flair (300-700 posts). Then, I need to look for those unique post IDs in the comments spreadsheet (which contains both a unique ID for each comment and the parent post ID to which the comment replies), extract the comment, and place it in a new spreadsheet. The problem is that each post has between 0 and several thousand comments, so it's way too much labor to do manually. I think I can use vlookup to match batches of post IDs to their corresponding comments, but we'll see. If anyone is good at this kind of stuff and has a better idea, please let me know!!
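One hedged idea, assuming the Pushshift CSVs look roughly like this (the column names are guesses: posts with an id and link_flair_text, comments with a link_id pointing back at the parent post): a short pandas merge/filter can batch the matching instead of vlookup:
import pandas as pd

# Column names are guesses at a typical Pushshift export; adjust to the real CSVs
posts = pd.read_csv("posts.csv")        # assumed: id, link_flair_text, selftext
comments = pd.read_csv("comments.csv")  # assumed: id, link_id, body

# Keep only posts carrying the flair of interest
flair_posts = posts[posts["link_flair_text"] == "Some Flair"]

# Reddit link_ids are usually prefixed with "t3_"; strip it so the IDs line up
comments["post_id"] = comments["link_id"].str.replace("t3_", "", regex=False)

# Extract every comment whose parent post has that flair
matched = comments[comments["post_id"].isin(flair_posts["id"])]
matched.to_csv("flair_comments.csv", index=False)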
soldier-poet-king · 6 months
Unfortunately I AM a fussy little office worker in love with standards and procedures AND I'm easily annoyed by everyone else's incompetency AND I'm full of hubris and ambition and pride in my own work
Do u see why fantasy secretary & right hand man is the best trope to ever matter to me
greghatecrimes · 4 months
also yes i'm still working on the census survey. i haven't had vyvanse for weeks so spreadsheets are a nightmare rn
mademoisellesarcasme · 5 months
i have been intending, for well over a week, to make more peanut butter cookies so I can have them as an easy snack.
I have still not done it.
if I have not done it by bedtime tonight someone needs to yell at me tomorrow.
unproduciblesmackdown · 9 months
okay one more summer stogging post (summer stock blogging) via also the one other review that does a wiggly hand gesture about it but was like "this one guy though" and highlights that [will roland back at it again finding a very human Performance in the writing even if you didn't like that writing] phenomenon:
"The only comic in the cast who never pushes is Will Roland as Cox’s henpecked son Orville (Eddie Bracken in the film), who grew up with Jane and whom everyone expects to marry her. The book doesn’t do much for him either, but Roland does it for himself – he makes Orville into a flesh-and-blood creation. He’s the most likable performer on the stage."
dancingplague · 7 months
Gotta say that my favorite subtitle malfunction so far has been the surname Simos getting rendered as CMOs, ironic for one of the few recurring antagonists in the setting who is as far as I can tell zero percent finance themed.
badolmen · 10 months
Power outage might last longer than the promised 12 hours 😭
garymdm · 10 months
Are you Hadooping?
Are you Hadooping? You should be! http://wp.me/pSc3Y-qSE #Hadoop #BigData
“Are you Hadooping?” [Tweet this] This was the key question asked by Gartner analysts, Merv Adrian and Nick Heudecker, during their insightful Hadoop 2015: The Road Ahead webinar, which you can register to watch.
40% have deployed Hadoop
Their research shows that nearly 40% of all respondents have either deployed Hadoop into production or are a long way into deployment. This is significantly…
slicedblackolives · 2 years
you have a follower who works for the united states federal reserve! technically im more of an economist than an accountant and also technically the reserve is a separate institution from the us government but im basically government accountant adjacent.
ARE YOU THE GUYS ENGINEERING A GLOBAL RECESSION 😭😭😭
also gonna shoot my shot are any of yall in academia looking for RAs a bitch needs research experience 😔