In this section, we are going to look at how data is imported and transformed using the Power Query M formula language.
In Power BI Desktop, we define datasets by using one query to specify the data to be included and the way the data should be transformed. The query consists of several related steps, each building on the last, and resulting in the final dataset. Once the dataset has been defined, it can be used for the visualizations you might want to add to your reports, and these can then be published to Power BI Service.
At the core of every query is the Power Query M language. This is a formula and mashup language used by Power BI Desktop, Excel 2016's Get & Transform import feature, and the Power Query add-in. It is a case-sensitive language and, like many languages, its statements combine variables, functions, expressions, values (structured and primitive), and other language elements that all come together to define the logic needed for shaping the data.
Every Power BI Desktop query is a single Power Query expression called a let expression; this contains all the code needed to define the dataset. The expression consists of two statements – let and in – as you can see from the syntax below:
let
    variable = expression [, ...]
in
    variable
Each let statement contains at least one procedural step that helps define the query. Each procedural step is a variable assignment, consisting of the variable's name and the variable's value, which is provided by an expression. The expression defines the logic required to add, remove, or transform the data in a particular way, and this is where the majority of your work is done, as you will see throughout this chapter.
The variables may be of any type supported by Power Query and must be given a unique name; however, Power Query does offer a certain amount of flexibility in naming, even allowing spaces. If a name does contain spaces, you must enclose it in double quotes and precede it with a hash sign (#) – a slightly convoluted convention, but an unavoidable one. The name you give to a variable is also the name given to the relevant step in the Query Editor's Applied Steps section, so it pays to use names that make some kind of sense.
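For example, here is a minimal sketch (the step names and values are hypothetical) showing the hash-and-quotes syntax needed when a name contains spaces:

let
    #"Get Source Data" = 10,
    #"Add Tax" = #"Get Source Data" * 1.2
in
    #"Add Tax"

A name without spaces, such as GetPassengers, needs neither the hash sign nor the quotes.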
The let statement does not limit the number of procedural steps you can use, so long as they are practical and necessary. If you do have several steps, they must be comma-separated, and each step must build on the previous one – the variable in a step is used to define the logic in the next step.
Your procedural steps do not need to be defined in the physical order that matches their logical order. You could, for example, refer to a variable in step one that doesn’t get defined until the final step. However, this isn’t the best route to take as it could lead to your code being tough to debug, and it could cause some confusion. As such, it is best to keep the logical and physical order in synchronization.
The in statement returns a variable value, and this value defines the final shape of the dataset. Most of the time, this is the last variable you define in the let statement. You can specify a different variable from the final one, but, in all honesty, there is rarely a reason to do so. If you want to see the value of any variable at any point, simply choose the associated step in the Applied Steps section; when you select a step, you can see what is in the related variable.
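As a minimal sketch of that rarely used option (the step names and values are hypothetical), the in statement here returns the first variable, so the transformation in the second step is simply discarded:

let
    GetData = {1, 2, 3},
    DoubleData = List.Transform(GetData, each _ * 2)
in
    GetData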
We are going to look at several examples over the course of the chapter, demonstrating how to define the procedural steps that make up a query. We will base our examples on the Titanic dataset, which you can download from https://vincentarelbundock.github.io/Rdatasets/datasets.html. This dataset contains a list of the passengers on the ill-fated ship, showing who died and who survived. If you want to follow along with these examples, you must first create a titanic.csv file from the dataset, saving it where it can be accessed from Power BI Desktop. On my system, I saved the file to C:\DataFiles\titanic.csv. When you go to the page to download the dataset, make sure it is the one called Stat2Data Titanic – there are a few Titanic datasets on the site.
The first step in building a Power Query script is to add a blank query to Power BI Desktop. In the main Power BI window, go to the Home ribbon and click Get Data. Go to Other and then double-click Blank Query. Query Editor opens, and you will see a brand-new query in the Queries pane. It will be called Query1 or something similar, depending on how many queries you have already created.
Go to the Queries pane and right-click the new query. Click on Rename and type in SurvivalLog, then press Enter. Go to the View menu and click on Advanced Editor. When the editor opens, it will already have a newly defined let expression in it.
This expression has one procedural step; it has a variable called Source and an empty string for the expression – indicated by the pair of double quotes. Note that the same variable appears in the statement and in the Applied Steps section as a new step. We want to change the name of this variable to GetPassengers.
The first procedural step we add retrieves the data from the titanic.csv file and saves it to the GetPassengers variable. Adding this step requires replacing the current let expression with the expression below:
let
    GetPassengers = Csv.Document(File.Contents("C:\DataFiles\titanic.csv"),
        [Delimiter=",", Encoding=1252])
in
    GetPassengers
The variable is specified in both the let and the in statements. Doing this ensures that the value of that variable is returned when the let expression runs.
In the procedural step, the expression is everything that follows the = sign. The Csv.Document function retrieves the contents of the file in the form of a Table object. The function supports several parameters, but the source data is identified using just the first one. As part of that parameter, the File.Contents function is required to return the document's raw content.
The second parameter is a record containing optional settings. A Power Query record is a set of fields – comma-separated name and value pairs enclosed in square brackets. In our example, the record contains the Delimiter option and the Encoding option, together with their respective values. Delimiter specifies that the CSV document's delimiter is a comma, while Encoding specifies 1252 as the type of text encoding, based on the code page for Windows Western European.
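You can experiment with records on their own; this sketch (with hypothetical names and values) builds a record and then reads one field back out of it:

SampleOptions = [Delimiter = ",", Encoding = 1252],
DelimiterUsed = SampleOptions[Delimiter]  // field access returns ","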
That is all you need to do to set a let expression up. Once the code has been entered, click on Done; the Advanced Editor will close, and the procedural step runs. If you go to the Applied Steps section, you will see that GetPassengers is the first step, and this matches the variable name. The expression can also be seen in the window at the top of the dataset, and that is where you can edit the code if you need to.
Whenever you want to add a step, modify an existing one, or delete one, make sure your changes are applied and saved. To do this, just click Save in the top-left of the menu bar and, when asked if you want the changes applied, click on Apply.
The second procedural step in the let statement will take Column7 out of the dataset. Because the dataset is stored as a Table object, there are a number of Table functions in Power Query you can choose from for updating the data. To remove the column, we'll use a function called Table.RemoveColumns.
To go back to the query, click Advanced Editor. Insert a comma after the first procedural step; then, on a new line, use the following code to define another variable and expression pair – this will remove the column:
let
    GetPassengers = Csv.Document(File.Contents("C:\DataFiles\titanic.csv"),
        [Delimiter=",", Encoding=1252]),
    RemoveCols = Table.RemoveColumns(GetPassengers, "Column7")
in
    RemoveCols
Two parameters are required in the Table.RemoveColumns function. The first is used for specifying the target table you want to be updated – in this case, it is the previous step’s GetPassengers variable. The second parameter is used to specify the column or columns that are being removed. We only want to remove Column7 from our GetPassengers table.
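Had we needed to drop several columns at once, the second parameter would take a list of names instead of a single name. A sketch, assuming a hypothetical Column6 also existed in the dataset:

RemoveCols = Table.RemoveColumns(GetPassengers, {"Column6", "Column7"})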
What is important here is that the second step builds on the first one and returns a new result. That result is assigned to the RemoveCols variable, which holds the same dataset as the GetPassengers variable with one difference – the Column7 data is removed.
Once the procedural step has been added, the GetPassengers variable in the in statement should be replaced with the RemoveCols variable. Then click Done, and the Advanced Editor will close. Save your changes and apply them.
The Applied Steps section now has a new step, RemoveCols, and Column7 has been deleted from the dataset. You can continue to build your steps as you require, using the same approach and logic from here.
Next, we want to promote the values from the first row of the dataset to the header row, so that those values become the column names. To do this, go to the last defined step and insert a comma after it. On a new line, add the variable/expression pair below:
PromoteNames = Table.PromoteHeaders(RemoveCols, [PromoteAllScalars=true])
This expression uses a function called Table.PromoteHeaders to promote the first row to the column headers. The first parameter is required, specifying which table to use as the source data – the RemoveCols variable. The second parameter, PromoteAllScalars, is optional but a good one to learn about: by default, Power Query promotes only number and text values, but if you set PromoteAllScalars to true, all the scalar values in the first row are promoted to headers.
The results of the expression are assigned to the variable called PromoteNames, so make sure the in statement is updated with this name.
Now we will rename some of the dataset's columns using the Table.RenameColumns function. The function returns a new Table object containing the specified updates to the column names. Rename the columns by adding the procedural step below to the let statement, making sure a comma is inserted after the previous step:
RenameCols = Table.RenameColumns(PromoteNames, {{"", "PsgrID"},
{"PClass", "PsgrClass"}, {"Sex", "Gender"}})
Two parameters are required by the function; the first is the target table name – the PromoteNames variable from the last step – while the second contains a list of column names, both old and new. A Power Query list is an ordered sequence of comma-separated values enclosed in a set of curly braces. In our example, each value is itself a list with two values – the old name and the new name for the column.
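To see lists in isolation, here is a small sketch with hypothetical values – a simple list, followed by a list of lists shaped like the one passed to Table.RenameColumns:

SimpleList = {1, 2, 3},
NamePairs = {{"PClass", "PsgrClass"}, {"Sex", "Gender"}}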
Next, we want to filter out any row that has NA in the Age column, followed by any row with an Age value of 5 or lower. When the dataset was imported, the Age column was automatically typed as Text, which means filtering the data takes three steps, beginning with removing the NA values. Add the variable/expression pair below to your code to filter out the NA values:
FilterNA = Table.SelectRows(RenameCols, each [Age] <> "NA")
The Table.SelectRows function returns a table containing just the rows that match the condition we defined – Age not equal to NA. As we saw previously, the first argument to the function is the variable from the previous step, RenameCols.
The next argument is an each expression, which specifies that the Age value must not equal the string NA. The each keyword indicates that the expression is applied to every row of the target table. The result is a new table, assigned to the FilterNA variable, that contains no rows with an NA value for Age.
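It may help to know that each is simply shorthand for a one-parameter function. The two steps below are equivalent; the FilterNAExplicit name is hypothetical and not part of our query:

FilterNA = Table.SelectRows(RenameCols, each [Age] <> "NA"),
FilterNAExplicit = Table.SelectRows(RenameCols, (row) => row[Age] <> "NA")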
Once the NA values are removed, the Age column needs to be converted to the Number data type; this ensures the data can be worked with more efficiently and effectively – for example, using numerical ages to filter the data. While changing the Age column's type, we can also change the PsgrID column's type to Int64, so the IDs are referenced as integers rather than text. Add the variable/expression pair below to your let statement to do the type conversion:
ChangeTypes = Table.TransformColumnTypes(FilterNA,
{{"PsgrID", Int64.Type}, {"Age", Number.Type}})
The expression uses the Table.TransformColumnTypes function to change the types. There are two parameters, the first being the target table name – FilterNA – and the second being a list of the columns you want updated, together with their new types. Each value is itself a list containing the column name and type.
Once you have updated the Age column with the Number type, the ages can be filtered out based on numerical values, as you can see in the procedural step below:
FilterKids = Table.SelectRows(ChangeTypes, each [Age] > 5)
Your dataset should now contain only the data required, held in the FilterKids variable.
Sometimes, you may need to replace the values in a column. We are going to replace two values in the Survived column: 0 is replaced with No, and 1 is replaced with Yes. To do this, we will use a function called Table.ReplaceValue. Add the procedural step below to the code to replace the 0 values:
Replace0 = Table.ReplaceValue(FilterKids, "0", "No",
Replacer.ReplaceText, {"Survived"})
Table.ReplaceValue needs five parameters: the target table, the value to find, the replacement value, the replacer function to use, and a list of the columns to update.
Most are pretty self-explanatory; the one that may not be is the replacer parameter. Several functions can be passed to Table.ReplaceValue to carry out the update; we used Replacer.ReplaceText because we wanted to replace text values.
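For comparison, Replacer.ReplaceValue matches the entire cell value rather than text within it. A sketch of the same step written with that replacer instead (an alternative for illustration, not part of our query):

Replace0 = Table.ReplaceValue(FilterKids, "0", "No",
    Replacer.ReplaceValue, {"Survived"})

Because the Survived column holds only the single-character values 0 and 1, either replacer produces the same result here.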
Once the 0 values have been replaced, you can go ahead and do the 1 values, replacing them with Yes, in much the same way:
Replace1 = Table.ReplaceValue(Replace0, "1", "Yes", Replacer.ReplaceText, {"Survived"})
Now the Survived column values are easier to read for anyone who doesn't know what the 0 and 1 values mean, which also makes confusion over the data less likely.
You can also change values by changing their capitalization. We will take the Gender column values and capitalize the first letter instead of leaving them completely lower-case. Add the procedural step below to the code in the let statement to make that change:
ChangeCase = Table.TransformColumns(Replace1, {"Gender", Text.Proper})
The expression uses the Table.TransformColumns function to update the values, and two parameters are required. The first parameter is the table to be updated, and the second lists the operations to carry out. Each operation needs the target column and an expression to perform the operation. We have only one operation in our example, so all we need is the Gender column and the Text.Proper function, wrapped in a set of curly braces. Text.Proper converts the first letter of every word in the Gender column to a capital letter.
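If you needed more than one operation, each column/function pair becomes its own inner list. A sketch (for illustration only, not part of our query) assuming we also wanted to trim whitespace from a hypothetical Name column:

ChangeCase = Table.TransformColumns(Replace1,
    {{"Gender", Text.Proper}, {"Name", Text.Trim}})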
Now, your let expression should look something like this:
let
    GetPassengers = Csv.Document(File.Contents("C:\DataFiles\titanic.csv"),
        [Delimiter=",", Encoding=1252]),
    RemoveCols = Table.RemoveColumns(GetPassengers, "Column7"),
    PromoteNames = Table.PromoteHeaders(RemoveCols, [PromoteAllScalars=true]),
    RenameCols = Table.RenameColumns(PromoteNames, {{"", "PsgrID"},
        {"PClass", "PsgrClass"}, {"Sex", "Gender"}}),
    FilterNA = Table.SelectRows(RenameCols, each [Age] <> "NA"),
    ChangeTypes = Table.TransformColumnTypes(FilterNA,
        {{"PsgrID", Int64.Type}, {"Age", Number.Type}}),
    FilterKids = Table.SelectRows(ChangeTypes, each [Age] > 5),
    Replace0 = Table.ReplaceValue(FilterKids, "0", "No",
        Replacer.ReplaceText, {"Survived"}),
    Replace1 = Table.ReplaceValue(Replace0, "1", "Yes",
        Replacer.ReplaceText, {"Survived"}),
    ChangeCase = Table.TransformColumns(Replace1, {"Gender", Text.Proper})
in
    ChangeCase
Included in the let statement are all the steps in your query, each one building on the last one. Note that, in the Applied Steps section, there is one step per variable, in the same order they are in the statement. Choose any step, and you will see the data in the main window showing you what is in the variable.
Up to now, each step in your let statement has represented one discrete action built on the previous step. Sometimes, though, you may need to structure your steps in a not-so-linear way. For example, you might want a calculated column showing the difference between a passenger's age and the average age for that passenger's gender. These are the steps you can take:
// add a calculated column based on the average ages
Female = Table.SelectRows(ChangeCase, each [Gender] = "Female"),
AvgFemale = List.Average(Table.Column(Female, "Age")),
Male = Table.SelectRows(ChangeCase, each [Gender] = "Male"),
AvgMale = List.Average(Table.Column(Male, "Age")),
AddCol = Table.AddColumn(ChangeCase, "AgeDiff", each if [Gender] = "Female"
    then [Age] - AvgFemale else [Age] - AvgMale),
RoundDiff = Table.TransformColumns(AddCol, {"AgeDiff", each Number.Round(_, 2)})
Let’s break this down a bit so you can understand it better:
The first line, starting with the two forward slashes (//), is a single-line comment. Anything that follows the // is ignored and not processed; it is there for information only. You can also use multi-line comments in Power Query, starting them with /* and ending them with */. You don't have to include any comments; it's entirely your choice.
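A minimal sketch of both comment styles (the step name and values are hypothetical):

// a single-line comment
/* a multi-line comment
   that spans several lines */
GetData = {1, 2, 3},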
The next task is to calculate the average female age – the two steps below provide that value:
Female = Table.SelectRows(ChangeCase, each [Gender] = "Female"),
AvgFemale = List.Average(Table.Column(Female, "Age")),
The first one generates a table using Table.SelectRows. This table contains only the rows whose Gender value is Female. The function is used in much the same way as we saw earlier, but this time we are filtering out different data and saving the results to a variable called Female. Note that the source table passed to the function is the variable from the previous step, ChangeCase.
Step two uses the List.Average function to calculate the average of the Age values in the Female table. A scalar value is returned, and this is saved in the AvgFemale variable. Only one parameter is required, and it is supplied here by another function, Table.Column, which passes the Age column's values to List.Average.
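Incidentally, Power Query also offers a field-access shorthand that returns a table column as a list, so the same average could be written without Table.Column. A sketch (the AvgFemale2 name is hypothetical):

AvgFemale2 = List.Average(Female[Age]),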
Next, we want the average male passenger age, and we can do this in a similar way, with just a couple of changes:
Male = Table.SelectRows(ChangeCase, each [Gender] = "Male"),
AvgMale = List.Average(Table.Column(Male, "Age")),
Note that the ChangeCase variable must again be used as the source table when Table.SelectRows is called, even though it is no longer the previous step. You can use any of the variables that came before in an expression, as long as it makes sense to do so.
Now that you have your AvgMale and AvgFemale variables, the column can be added, using the Table.AddColumn function:
AddCol = Table.AddColumn(ChangeCase, "AgeDiff", each if [Gender] = "Female"
    then [Age] - AvgFemale else [Age] - AvgMale)
Three parameters are required by this function; the first is ChangeCase, the target table, and the second is AgeDiff, the new column's name.
The third is the expression that generates the column's values. The expression starts with the each keyword, which iterates over every row in the target table. An if…then…else expression follows, calculating each row's value depending on the passenger's gender. If the Gender value equals Female, AgeDiff is set to Age minus AvgFemale; otherwise, AgeDiff is set to Age minus AvgMale.
Once the new column has been defined, the last step is to round the values for AgeDiff to two decimal points:
RoundDiff = Table.TransformColumns(AddCol, {"AgeDiff", each Number.Round(_, 2)})
The expression uses the Table.TransformColumns function again; this time, instead of altering the case, it applies the Number.Round function to round the values. Number.Round takes two parameters here: the first is an underscore, representing the current value of the column, while the second, 2, states that the value should be rounded to two decimal places.
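As with the filtering steps earlier, the each/underscore form is shorthand for an explicit one-parameter function; this hypothetical equivalent (the RoundDiffExplicit name is not part of our query) makes the relationship clear:

RoundDiffExplicit = Table.TransformColumns(AddCol,
    {"AgeDiff", (value) => Number.Round(value, 2)})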
Those are the steps required for creating calculated columns; now, the let statement should look like this:
let
    GetPassengers = Csv.Document(File.Contents("C:\DataFiles\titanic.csv"),
        [Delimiter=",", Encoding=1252]),
    RemoveCols = Table.RemoveColumns(GetPassengers, "Column7"),
    PromoteNames = Table.PromoteHeaders(RemoveCols, [PromoteAllScalars=true]),
    RenameCols = Table.RenameColumns(PromoteNames, {{"", "PsgrID"},
        {"PClass", "PsgrClass"}, {"Sex", "Gender"}}),
    FilterNA = Table.SelectRows(RenameCols, each [Age] <> "NA"),
    ChangeTypes = Table.TransformColumnTypes(FilterNA,
        {{"PsgrID", Int64.Type}, {"Age", Number.Type}}),
    FilterKids = Table.SelectRows(ChangeTypes, each [Age] > 5),
    Replace0 = Table.ReplaceValue(FilterKids, "0", "No",
        Replacer.ReplaceText, {"Survived"}),
    Replace1 = Table.ReplaceValue(Replace0, "1", "Yes",
        Replacer.ReplaceText, {"Survived"}),
    ChangeCase = Table.TransformColumns(Replace1, {"Gender", Text.Proper}),
    // add calculated column based on average ages
    Female = Table.SelectRows(ChangeCase, each [Gender] = "Female"),
    AvgFemale = List.Average(Table.Column(Female, "Age")),
    Male = Table.SelectRows(ChangeCase, each [Gender] = "Male"),
    AvgMale = List.Average(Table.Column(Male, "Age")),
    AddCol = Table.AddColumn(ChangeCase, "AgeDiff", each if [Gender] = "Female"
        then [Age] - AvgFemale else [Age] - AvgMale),
    RoundDiff = Table.TransformColumns(AddCol, {"AgeDiff", each Number.Round(_, 2)})
in
    RoundDiff
Again, go to Applied Steps and you should see all the variables we used for finding the average ages, even though the AddCol step builds on the earlier ChangeCase variable. Select any step to see the variable's contents, but be aware that, if you choose a variable that stores an average, you will see only a scalar value in Query Editor.
It shouldn't surprise you to learn that we have covered only a fraction of what Power Query offers, but you now have a decent base from which to build your own queries in Power BI Desktop. You should also better understand the way a query is constructed when you use the built-in point-and-click features to import and transform data. This helps you examine the code and understand why the results may not be what you expect. You can also build the query with the point-and-click operations and then use Advanced Editor to refine the dataset or bring in logic that can't easily be achieved through the interface.
To get the best out of Power Query, you will need to go much deeper into the elements of the language, in particular the built-in functions. Do be aware that Power Query was originally built for Excel, so you may come up against one or two elements that simply won't transfer to Power BI Desktop easily. That said, the language and its syntax are defined by some basic principles and, the better your understanding of them, the more power you have at your disposal when working in Power BI Desktop.