Using * in OPENROWSET to use Data Lake metadata
Use the values of wildcards in your SQL queryโVery powerful way to convert your file path or filename metadata into actual data!
Letโs say you stored your files like this:
yellow_tripdata_<year>-<month>.csv
For example, yellow_tripdata_ ๐ฎ๐ฌ๐ฎ๐ฎ-๐ฌ๐ญ.csv
Letโs fetch those year and month data and turn them into SQL columns, with two simple steps (and a bonus tip)
๐ญ) ๐๐ฑ๐ฑ ๐๐ถ๐น๐ฑ๐ฐ๐ฎ๐ฟ๐ฑ๐ ๐ถ๐ป ๐ข๐ฃ๐๐ก๐ฅ๐ข๐ช๐ฆ๐๐ง()
By replacing those dynamic values with wildcards we can ensure we read all the data, but also subsequently that we can use the wildcards as data.
Code example:
FROM OPENROWSET(
BULK 'csv/taxi/yellow_tripdata_*-*.csv'
๐ฎ) ๐จ๐๐ฒ ๐๐ถ๐น๐ฑ๐ฐ๐ฎ๐ฟ๐ฑ๐ ๐ถ๐ป ๐ฆ๐๐๐๐๐ง ๐๐ถ๐๐ต ๐ณ๐ถ๐น๐ฒ๐ฝ๐ฎ๐๐ต()
The function filepath() allows you to return the entire filepath of your files, but if you add a number inside the () you can refer to any wildcard from step 1.
Filepath(1) will represent the value of the first wildcard, (2) the second, and so on.
Code example:
SELECT
r.filepath() AS filepath
,r.filepath(1) AS [year]
,r.filepath(2) AS [month]
๐ฏ) ๐๐ผ๐ป๐๐ ๐๐ถ๐ฝ - ๐ช๐ผ๐ฟ๐ธ๐ ๐ถ๐ป ๐ช๐๐๐ฅ๐ ๐๐๐ฎ๐๐ฒ๐บ๐ฒ๐ป๐
Using the wildcard references is not limited to SELECT, but also works in WHERE statements. This means you can use the metadata of your filepath or file name to filter your query.
Quite powerful!
Code example:
WHERE
r.filepath(1) IN ('2017')
AND r.filepath(2) IN ('10', '11', '12')
It even works if you use the alias instead, even though the value has been casted.
And thatโs it! A little neat trick with lots of utility in the right hands.