#Data Wrangling
ibarrau · 11 hours
[Python] PySpark to M, SQL or Pandas
A while ago I wrote an article on how to write some reference SQL and M (Power Query) snippets in pandas. While it was very useful at the time, the truth is that today there is another language with a strong foothold in data analysis.
Spark has become the main player for reading data in lakes. While it's true that SparkSQL exists, I didn't want to pass up these code analogies between PySpark, M, SQL and Pandas, so that those familiar with one language can see how to perform the same action in another.
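Since we'll stick to the DataFrame API here, a minimal sketch of that SparkSQL route for the SQL-minded, assuming an active SparkSession named spark and a loaded PySpark DataFrame spark_frame (the view name table1 is arbitrary):
spark_frame.createOrReplaceTempView("table1")  # expose the frame to SQL
spark.sql("SELECT column1, COUNT(*) FROM table1 GROUP BY column1").show()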
First, let's agree on how to read this post.
Power Query runs in layers: each line calls the previous one (which returns a table), producing this layered perspective. So whenever you read #"Paso Anterior" ("previous step") in the code, we are talking about a table.
In Python, we'll assume "df" is an already loaded pandas DataFrame (pandas.DataFrame) and "spark_frame" an already loaded PySpark DataFrame (via spark.read).
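For reference, a minimal sketch of that assumed setup (the file name sales.csv is hypothetical):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.read_csv("sales.csv")  # pandas DataFrame
spark_frame = spark.read.csv("sales.csv", header=True, inferSchema=True)  # PySpark DataFrame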
Let's look at the examples, which will be listed in the following order: SQL, PySpark, Pandas, Power Query.
Top N rows
In SQL:
SELECT TOP 5 * FROM table
In PySpark:
spark_frame.limit(5)
In Pandas:
df.head()
In Power Query:
Table.FirstN(#"Paso Anterior",5)
Count rows
SELECT COUNT(*) FROM table1
spark_frame.count()
df.shape[0]  # shape is an attribute, not a method; or len(df)
Table.RowCount(#"Paso Anterior")
Select columns
SELECT column1, column2 FROM table1
spark_frame.select("column1", "column2")
df[["column1", "column2"]]
#"Paso Anterior"[[column1],[column2]]
Or it could be: Table.SelectColumns(#"Paso Anterior", {"column1", "column2"})
Filter rows
SELECT column1, column2 FROM table1 WHERE column1 = 2
spark_frame.filter("column1 = 2") # OR spark_frame.filter(spark_frame['column1'] == 2)
df[['column1', 'column2']].loc[df['column1'] == 2]
Table.SelectRows(#"Paso Anterior", each [column1] = 2)
Multiple row filters
SELECT * FROM table1 WHERE column1 > 1 AND column2 < 25
spark_frame.filter((spark_frame['column1'] > 1) & (spark_frame['column2'] < 25))
Or with OR and NOT operators:
spark_frame.filter((spark_frame['column1'] > 1) | ~(spark_frame['column2'] < 25))
df.loc[(df['column1'] > 1) & (df['column2'] < 25)]
Or with OR and NOT operators:
df.loc[(df['column1'] > 1) | ~(df['column2'] < 25)]
Table.SelectRows(#"Paso Anterior", each [column1] > 1 and [column2] < 25)
Or with OR and NOT operators:
Table.SelectRows(#"Paso Anterior", each [column1] > 1 or not ([column2] < 25))
Filters with complex operators
SELECT * FROM table1 WHERE column1 BETWEEN 1 and 5 AND column2 IN (20,30,40,50) AND column3 LIKE '%arcelona%'
from pyspark.sql.functions import col
spark_frame.filter(
    (col('column1').between(1, 5)) &
    (col('column2').isin(20, 30, 40, 50)) &
    (col('column3').like('%arcelona%'))
)
# Or:
spark_frame.where(
    (col('column1').between(1, 5)) &
    (col('column2').isin(20, 30, 40, 50)) &
    (col('column3').contains('arcelona'))
)
df.loc[(df['column1'].between(1,5)) & (df['column2'].isin([20,30,40,50])) & (df['column3'].str.contains('arcelona'))]
Table.SelectRows(#"Paso Anterior", each ([column1] >= 1 and [column1] <= 5) and List.Contains({20,30,40,50}, [column2]) and Text.Contains([column3], "arcelona"))
Join tables
SELECT t1.column1, t2.column1 FROM table1 t1 LEFT JOIN table2 t2 ON t1.column_id = t2.column_id
When columns share the same name, it's best to rename them with aliases, like this:
spark_frame1.join(spark_frame2, spark_frame1["column_id"] == spark_frame2["column_id"], "left").select(spark_frame1["column1"].alias("column1_df1"), spark_frame2["column1"].alias("column1_df2"))
In pandas there are two functions that can help us in this process: merge and join.
df_joined = df1.merge(df2, on='column_id', how='left', suffixes=('_df1', '_df2'))
df_joined = df1.join(df2.set_index('column_id'), on='column_id', how='left', lsuffix='_df1', rsuffix='_df2')
Then we select the two columns:
df_joined[['column1_df1', 'column1_df2']]
In Power Query we pick the columns we need up front (keeping the join key) and then add the second column:
#"Origen" = #"Paso Anterior"[[column1_t1],[column_t1_id]]
#"Paso Join" = Table.NestedJoin(#"Origen", {"column_t1_id"}, table2, {"column_t2_id"}, "Prefijo", JoinKind.LeftOuter)
#"Expansion" = Table.ExpandTableColumn(#"Paso Join", "Prefijo", {"column1_t2"}, {"Prefijo_column1_t2"})
Group By
SELECT column1, count(*) FROM table1 GROUP BY column1
from pyspark.sql.functions import count
spark_frame.groupBy("column1").agg(count("*").alias("count"))
df.groupby('column1')['column1'].count()
Table.Group(#"Paso Anterior", {"column1"}, {{"Alias de count", each Table.RowCount(_), type number}})
Filter an aggregation (HAVING)
SELECT store, sum(sales) FROM table1 GROUP BY store HAVING sum(sales) > 1000
from pyspark.sql.functions import sum as spark_sum
spark_frame.groupBy("store").agg(spark_sum("sales").alias("total_sales")).filter("total_sales > 1000")
df_grouped = df.groupby('store')['sales'].sum()
df_grouped.loc[df_grouped > 1000]
#"Grouping" = Table.Group(#"Paso Anterior", {"store"}, {{"Alias de sum", each List.Sum([sales]), type number}})
#"Final" = Table.SelectRows(#"Grouping", each [Alias de sum] > 1000)
Sort a column in descending order
SELECT * FROM table1 ORDER BY column1 DESC
spark_frame.orderBy("column1", ascending=False)
df.sort_values(by=['column1'], ascending=False)
Table.Sort(#"Paso Anterior",{{"column1", Order.Descending}})
Append one table to another with the same structure
SELECT * FROM table1 UNION SELECT * FROM table2
spark_frame1.union(spark_frame2)
In Pandas we have two well-known options, the append function and concat; note that append was deprecated in pandas 1.4 and removed in 2.0, so concat is the safer choice.
df1.append(df2)  # removed in pandas 2.0; prefer concat
pd.concat([df1, df2])
Table.Combine({table1, table2})
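One caveat worth a sketch: SQL's UNION removes duplicate rows, while union() in PySpark and concat() in pandas behave like UNION ALL. Assuming both frames share the same schema, matching SQL's deduplicating semantics could look like this:
spark_frame1.union(spark_frame2).distinct()  # drop duplicates to match SQL UNION
pd.concat([df1, df2], ignore_index=True).drop_duplicates()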
Transformations
The following transformations map directly between PySpark, Pandas and Power Query, since they are not as common in a query language like SQL. The results may not be identical, but they are similar enough for the case at hand.
Profile the contents of a table
spark_frame.summary()
df.describe()
Table.Profile(#"Paso Anterior")
Check the distinct values of the columns
spark_frame.groupBy("column1").count().show()
df['column1'].value_counts()
Table.Profile(#"Paso Anterior")[[Column],[DistinctCount]]
Create a test table with hand-loaded data
spark_frame = spark.createDataFrame([(1, "Boris Yeltsin"), (2, "Mikhail Gorbachev")], ["CustomerID", "Name"])
df = pd.DataFrame([[1, "Boris Yeltsin"], [2, "Mikhail Gorbachev"]], columns=["CustomerID", "Name"])
Table.FromRecords({[CustomerID = 1, Name = "Bob", Phone = "123-4567"]})
Remove a column
spark_frame.drop("column1")
df.drop(columns=['column1'])
df.drop(['column1'], axis=1)
Table.RemoveColumns(#"Paso Anterior",{"column1"})
Apply a transformation to a column
from pyspark.sql.functions import col
spark_frame.withColumn("column1", col("column1") + 1)
df['column1'] = df.apply(lambda x: x['column1'] + 1, axis=1)
Table.TransformColumns(#"Paso Anterior", {{"column1", each _ + 1, type number}})
That's the end of this long journey through queries and transformations, which should make for a better time working in pure code with PySpark, SQL, Pandas and Power Query, so that knowing one of them we can use the others.
techinfotrends · 1 month
Discover how automation streamlines the data wrangling process, saving time and resources while ensuring accuracy and consistency. Explore the latest tools and techniques revolutionizing data preparation.
rajaniesh · 3 months
Microsoft Fabric: Pioneering the AI and Machine Learning Frontier
🌟 Explore Microsoft Fabric: Revolutionizing Data Science & Machine Learning! From data analysis to model training, uncover insights & drive innovation. Read now! #DataScience #MachineLearning #MicrosoftFabric #AI #Innovation #Tech
In today’s dynamic business landscape, organizations are increasingly turning to data science and machine learning to gain insights, make informed decisions, and drive innovation. Imagine you’re working for a supermarket chain. You want to optimize your inventory management to meet customer demands efficiently while minimizing food waste. Or perhaps you aim to personalize your marketing…
garymdm · 10 months
Customer Experience means business wins control of BI
Customer Experience means business wins control of BI via @bevelson #BigData
Time for IT to let business users grow up? Earlier this month, Forrester analyst Boris Evelson (@bevelson) commented that business has finally won the battle with IT over who will control Business Intelligence and Analytics.
The customer is the priority
The reason: businesses agree that the customer is the priority, and that BI must support the priorities and needs of customer-facing…
A Data Analytics Company in Uganda, East Africa
As an analytics solutions organization, we help our clients and partners make sense of their data and extract the hidden insights within the organization. We take your data through the data analytics process and produce actionable insights and recommendations for your organization to apply.
We explore what happened within the data.
We understand why it happened.
We predict what will happen.
For every project, you can expect our highest performance as we produce analysis to answer your business questions. Each project will follow three core steps.
Project Scope and Plan: In this step we will provide an overview and project plan document covering a definition of your project roles, project roadmap, objectives, communication plans, and work agreements, as well as a refined problem statement and draft project scope.
Low-Fi Recommendation Pitch: After conducting some exploratory data analysis (EDA) and initial visualizations, we will prepare a presentation of the findings and insights discovered and solicit feedback from you as the client on areas to expand further.
Final Report and Presentation: To finalize, we will provide the report and presentation based on the project's business questions or problem statements. We will summarize the project development process, give a technical overview of the code, and present insights, trends, and patterns.
Learn more https://analyticalkeysolutions.com
fabiodisconzi · 2 years
Data Wrangling (from IBM Data Analyst course)
Data Wrangling is an iterative process that involves data exploration, transformation, and validation.  
Transformation of raw data includes the tasks you undertake to (a short pandas sketch follows the list):
Structurally manipulate and combine the data using Joins and Unions.
Normalize data, that is, clean the database of unused and redundant data.
Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
Clean data, which involves profiling data to uncover quality issues, visualizing data to spot outliers, and fixing issues such as missing values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
Enrich data, which involves considering additional data points that could add value to the existing data set and lead to a more meaningful analysis.
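A minimal, hypothetical pandas sketch of a few of these tasks; the file names and columns are invented for illustration:
import pandas as pd

# Hypothetical inputs (invented names and columns, for illustration only)
orders = pd.read_csv("orders.csv")        # order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # customer_id, name

# Combine / denormalize: join the two tables into one wide table
combined = orders.merge(customers, on="customer_id", how="left")

# Clean: remove duplicate rows and fix missing values
combined = combined.drop_duplicates()
combined["amount"] = combined["amount"].fillna(0)

# Enrich: derive an extra data point from the existing columns
combined["amount_band"] = pd.cut(
    combined["amount"],
    bins=[0, 50, 200, float("inf")],
    labels=["low", "mid", "high"],
    include_lowest=True,
)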
theestuaryandthesea · 2 years
~*sad covid noises~*
y'all, covid is no joke. even after being super careful (N95 mask indoors always, no crowded outdoor spaces) N and I still have the new variant and it feels like being hit by a bus. Unfortunately, work continues, so I'm trying to do one thing each day to move my diss project forward and meet deadlines before my advisor emails me in August ahahaha This week I:
Modmailed the reddit people to see if a) they wanna do interviews and b) if I can ask other people in the sub to do interviews
Set up all the prep material introducing folks to the study (already approved by IRB/my committee)
Made a Discord to host interviews + materials
Figured out that I can probably use vlookup instead of futzing with python to match and extract terms across excel books
The last one has been a weird problem to have. Basically, I used Pushshift to scrape a bunch of posts and comments from a subreddit, but the scripts are different for each kind of content, so posts + comments are returned in separate CSVs. The topic modeling method I'm using doesn't care about word order, so for month-based analysis, I can just dump all the text posts and all the comment text from a particular month into a new CSV and call it good. The problem has been recreating data sets by flair, or the user-generated label for a particular post. Flair is a returnable aspect of the metadata, but Pushshift isn't able to scrape posts by flair yet. So I've got a giant spreadsheet full of posts that I can filter by flair in Excel, and from there, I can identify the unique post ID for every text post in a given flair (300-700 posts). Then, I need to look for those unique post IDs in the comments spreadsheet (which contains both a unique ID for each comment and the parent post ID to which the comment replies), extract the comment, and place it in a new spreadsheet. The problem is that each post has between 0 and several thousand comments, so it's way too much labor to do manually. I think I can use vlookup to match batches of post IDs to their corresponding comments, but we'll see. If anyone is good at this kind of stuff and has a better idea, please let me know!!
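One hedged idea, assuming the Pushshift CSVs look roughly like this (the column names are guesses: posts with an id and link_flair_text, comments with a link_id pointing back at the parent post): a short pandas merge/filter can batch the matching instead of vlookup:
import pandas as pd

# Column names are guesses at a typical Pushshift export; adjust to the real CSVs
posts = pd.read_csv("posts.csv")        # assumed: id, link_flair_text, selftext
comments = pd.read_csv("comments.csv")  # assumed: id, link_id, body

# Keep only posts carrying the flair of interest
flair_posts = posts[posts["link_flair_text"] == "Some Flair"]

# Reddit link_ids are usually prefixed with "t3_"; strip it so the IDs line up
comments["post_id"] = comments["link_id"].str.replace("t3_", "", regex=False)

# Extract every comment whose parent post has that flair
matched = comments[comments["post_id"].isin(flair_posts["id"])]
matched.to_csv("flair_comments.csv", index=False)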
soldier-poet-king · 6 months
Unfortunately I AM a fussy little office worker in love with standards and procedures AND I'm easily annoyed by everyone else's incompetency AND I'm full of hubris and ambition and pride in my own work
Do u see why fantasy secretary & right hand man is the best trope to ever matter to me
greghatecrimes · 4 months
also yes i'm still working on the census survey. i haven't had vyvanse for weeks so spreadsheets are a nightmare rn
mademoisellesarcasme · 5 months
i have been intending, for well over a week, to make more peanut butter cookies so I can have them as an easy snack.
I have still not done it.
if I have not done it by bedtime tonight someone needs to yell at me tomorrow.
unproduciblesmackdown · 9 months
okay one more summer stogging post (summer stock blogging) via also the one other review that does a wiggly hand gesture about it but was like "this one guy though" and highlights that [will roland back at it again finding a very human Performance in the writing even if you didn't like that writing] phenomenon:
"The only comic in the cast who never pushes is Will Roland as Cox’s henpecked son Orville (Eddie Bracken in the film), who grew up with Jane and whom everyone expects to marry her. The book doesn’t do much for him either, but Roland does it for himself – he makes Orville into a flesh-and-blood creation. He’s the most likable performer on the stage."
dancingplague · 7 months
Gotta say that my favorite subtitle malfunction so far has been the surname Simos getting rendered as CMOs, ironic for one of the few recurring antagonists in the setting who is as far as I can tell zero percent finance themed.
badolmen · 10 months
Power outage might last longer than the promised 12 hours 😭
garymdm · 10 months
Are you Hadooping?
Are you Hadooping? You should be! http://wp.me/pSc3Y-qSE #Hadoop #BigData
“Are you Hadooping?” [Tweet this] This was the key question asked by Gartner analysts, Merv Adrian and Nick Heudecker, during their insightful Hadoop 2015: The Road Ahead webinar, which you can register to watch.
40% have deployed Hadoop
Their research shows that nearly 40% of all respondents have either deployed Hadoop into production or are a long way into deployment. This is significantly…
slicedblackolives · 2 years
you have a follower who works for the united states federal reserve! technically im more of an economist than an accountant and also technically the reserve is a separate institution from the us government but im basically government accountant adjacent.
ARE YOU THE GUYS ENGINEERING A GLOBAL RECESSION 😭😭😭
also gonna shoot my shot are any of yall in academia looking for RAs a bitch needs research experience 😔