One thing I am always looking for in a new "reusable", "composable" SQL tool is the ability to reuse the same analytical queries across different source tables.
My litmus test:
I have a table "people" with the columns "people.firstname", "people.lastname", and a table "persons" with the columns "persons.firstname", "persons.lastname". I now want to create a query that gives me the "fullname" (".firstname" + " " + ".lastname") of all rows of both tables. If I have to spell out the logic for how to calculate the fullname in the query twice, the test is failed.
(Taking the shortcut of creating the union of both tables first is not allowed, but I can't think of a simple example that enforces that restriction).
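For illustration, the naive SQL that fails the test spells the logic out twice (a minimal sketch against the toy schema above):

```sql
-- Fails the litmus test: the fullname expression appears once per table.
SELECT firstname || ' ' || lastname AS fullname FROM people
UNION ALL
SELECT firstname || ' ' || lastname AS fullname FROM persons;
```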
For some reason, none of the solutions (PRQL, Malloy, dbt) that try to make SQL more reusable really consider this kind of reuse, and so they ultimately fall flat for the use cases I would typically have for them. Sadly, Trilogy doesn't seem to be any better on that front.
hn_throwaway_99 14 hours ago [-]
TBH, I don't think your test is very useful in real-world environments. That is, you have two independent tables, and you want the solution to depend on the fact that columns are named the same across both tables.
IMO these kinds of "shortcuts based on column naming across tables" usually end in disaster down the road. For example, I've been bitten in the past by "natural joins" when we've wanted to refactor something later.
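For instance (a hypothetical sketch, table names invented):

```sql
-- NATURAL JOIN silently joins on every same-named column; add a
-- created_at column to either table later and this query quietly
-- changes meaning.
SELECT * FROM orders NATURAL JOIN customers;
```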
I definitely agree that I don't want to have to repeat logic within a single table, but the kind of syntactic sugar that is your litmus test is a big foot gun IMO.
solidsnack9000 56 minutes ago [-]
As far as I can tell, the parent is describing something like a trait or interface: both tables have `.firstname: text` and `.lastname: text`. This may not really be relational -- I guess there should be a third table and they should both reference it -- but it does show up in real world schemas.
hobofan 13 hours ago [-]
Oh, I 100% agree with you.
> IMO these kinds of "shortcuts based on column naming across tables" usually end in disaster down the road.
I can see that point, and that was not what I wanted to express with my litmus test. It's only supposed to be a litmus test after all. In a proper solution there would be additional things I would be looking for, but so far everything I've seen already fails that "trivial" test.
One could easily reformulate it so that in one table the column is named ".firstname", and in the other it is named ".first_part_of_the_name".
The core point is more that no matter the relational logic you layer on top of a table/view, that logic should be parameterizable by table/view/column names, to be properly relocatable. I'd be happy about suggestions for better examples! Some solutions (I think dbt) do have some relocatability across schemas, but usually in a more singleton-like manner rather than being able to instantiate the logic multiple times.
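To make "instantiate the logic multiple times" concrete, the closest real mechanism I can point at is something like a DuckDB table macro (a rough sketch; assumes a recent DuckDB where query_table() is available):

```sql
-- Define the logic once, parameterized by table name...
CREATE MACRO fullnames(tbl) AS TABLE
  SELECT firstname || ' ' || lastname AS fullname
  FROM query_table(tbl);

-- ...and instantiate it per table.
SELECT * FROM fullnames('people')
UNION ALL
SELECT * FROM fullnames('persons');
-- The column names are still hard-coded here; parameterizing those as
-- well is exactly the part most tools don't support.
```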
I can just tell you that I interact with queries that would benefit from this kind of reuse on a daily basis. One common case is mechanisms that you want to reuse across many different tables in your schema, e.g. soft-deletes or historic/snapshot tables. Nowadays those kinds of solutions usually end up being expressed in the ORM/query builder of a programming language (and are thus highly fragmented across programming language ecosystems), instead of living at an SQL-like level where they could mature better.
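For instance, today the soft-delete filter gets re-typed for every table (names illustrative):

```sql
SELECT * FROM orders   WHERE deleted_at IS NULL;
SELECT * FROM invoices WHERE deleted_at IS NULL;
-- What I'd want: declare "not soft-deleted" once and instantiate it per table.
```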
fooododododo 12 hours ago [-]
Curious what your opinion is on PL/pgSQL functions? They could easily solve your initial problem (if I follow). They don't seem to come up much, though.
hobofan 12 hours ago [-]
I do like PL/pgSQL functions, and I think they can to some extent be used to solve this problem, though I think they are limited in how parameterizable their internal structure is.
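For concreteness, a rough dynamic-SQL sketch of that approach (PostgreSQL, using the toy schema from upthread):

```sql
-- Parameterize the source table via dynamic SQL; assumes both tables
-- expose text columns firstname and lastname.
CREATE FUNCTION fullnames(tbl regclass)
RETURNS TABLE (fullname text)
LANGUAGE plpgsql AS $$
BEGIN
  RETURN QUERY EXECUTE format(
    'SELECT firstname || '' '' || lastname FROM %s', tbl
  );
END;
$$;

-- SELECT * FROM fullnames('people') UNION ALL SELECT * FROM fullnames('persons');
```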
I am rarely in a position at my client projects where I can employ PL/pgSQL though, so I opt for out-of-database solutions for composing my queries, as those are usually easier to debug.
default-kramer 14 hours ago [-]
> IMO these kinds of "shortcuts based on column naming across tables" usually end in disaster down the road.
But sometimes this decision has been made years ago and it's not realistic to change it now. I've wanted to do this many times, and I've never been the person who created said tables.
Also, certain use cases perform much better if you create temp tables with small subsets of data from the main tables. It sure would be nice to be able to reuse fragments of SQL written against the main tables... if only SQL were better.
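The kind of thing I mean, sketched with invented names:

```sql
-- Load a small subset into a temp table for performance...
CREATE TEMP TABLE orders_subset AS
SELECT * FROM orders WHERE region = 'EU';
-- ...but every query fragment written against `orders` now has to be
-- rewritten by hand to point at orders_subset.
```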
carlineng 13 hours ago [-]
I agree 100% that this needs to be more of a thing. For data engineers building data pipelines, queries are like functions, and table schemas are like types. There needs to be a way to write a query that runs against an abstract interface rather than an actual table. To do this, most folks rely on string templating in Python or Jinja, which makes the development process really cumbersome. As a result, most teams end up in one of two scenarios: data pipelines that are a big mess of spaghetti SQL, or complex frameworks that abstract away common logic but are inscrutable to the average user.
I wrote a longer blog post about this recently: https://carlineng.com/?postid=holy-grail-data-engineering#bl...
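Concretely, the templating status quo looks something like this (identifiers hypothetical):

```sql
-- Jinja-templated SQL, rendered once per source table by Python:
SELECT
  {{ first_col }} || ' ' || {{ last_col }} AS full_name
FROM {{ source_table }}
```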
I think your blog post frames the problem very well!
Seeing that people working on both PRQL and Malloy replied, and that for both of you it's an understood pain, makes me feel a lot better about the future of these tools! When talking about this with people who are not that deep into the problem, it is often hard to convey the difference between this kind of composability and the composability the tools offer today, and the implications that come with that.
At a past startup I had the fortune of working on a system similar to what I am looking for: packageable, reusable relational algebra inspired by Substrait. It had the downside that it was quite tied to RDF and SPARQL in its implementation, and now I'm chasing something similar in the SQL world :D
munk-a 11 hours ago [-]
I have used CTEs with dynamic query stitching to solve this problem (specifically, my business operates over two very similar but distinct domains which we keep in separate buckets). If you build the majority of your logic into a CTE that processes named columns coming out of a prior chunk, you can swap which actual DB columns are mapped into that earlier CTE's output columns just by changing its definition. It may be possible to make this more magical using PL/pgSQL, but I've found that dynamic query stitching at the CTE level is a degree of fiddliness I'm comfortable building into resilient products.
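Roughly the shape of the pattern, sketched against the toy schema from upthread:

```sql
-- Only the first CTE is swapped at stitch time; the logic below it is shared.
WITH src AS (
  SELECT firstname AS first_name, lastname AS last_name FROM people
  -- swapped for: SELECT firstname AS first_name,
  --              first_part_of_the_name AS last_name FROM persons
)
SELECT first_name || ' ' || last_name AS full_name FROM src;
```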
I work with complex data models and keeping all that structure in my brain takes enough effort that I want to keep my queries as simple as possible because when it's time to debug one there's no way I'm carrying over _any_ memory from when I originally wrote it.
thesz 12 hours ago [-]
> There needs to be a way to write a query that runs on an abstract interface, rather than an actual table.
Proper use of SQL inverts control. Instead of parameterizing a query by table, you write the query once, and at the actual use site you join it onto the table you need by the fields your query provides. VIEWs let you avoid repeating yourself too often.
The best thing is that you do not even need to mention that "abstract interface table" as a parameter at all.
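Something like this (a toy sketch, table names invented):

```sql
-- Write the logic once as a view over the fields it needs...
CREATE VIEW people_fullnames AS
SELECT id, firstname || ' ' || lastname AS fullname FROM people;

-- ...then each use site just joins it by the fields it provides.
SELECT o.order_id, f.fullname
FROM orders o
JOIN people_fullnames f ON f.id = o.person_id;
```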
hobofan 12 hours ago [-]
> VIEWs let you avoid repeating yourself too often.
No, they don't. They only offer a solution to the problem of "many different predicates for a few tables", not to the problem of "a few similar predicates for many different tables", since a view is, by its declaration, already tied to a single table.
fooododododo 12 hours ago [-]
Do they ever write PL/pgSQL? If not, why not?
snthpy 13 hours ago [-]
Hi,
Having this sort of "table polymorphism" is something we've thought a lot about for PRQL, and it is definitely something we want to get right. That said, it's not straightforward, but you can already do a lot of it. You can try the following examples for yourself in the PRQL Playground (https://prql-lang.org/playground/).
First a simple example using functions as they are documented:
```prql
let fullname = func firstname lastname -> f"{firstname} {lastname}"
from customers
select full_name=(fullname first_name last_name)
```
Now the example above isn't quite what you're looking for because you still have to specify the columns as function arguments and there really isn't much gained here. It serves to illustrate the principle though as the `fullname` function could be doing something more complicated.
What you want is:
```prql
let add_full_name = func tbl<relation> -> (
  from tbl
  derive full_name = f"{first_name} {last_name}"
)
from customers
add_full_name
select full_name
```
Now this requires the `<relation>` type annotation, which hasn't been documented because it's still quite experimental. However, it works right now and can be applied to different tables or relations; for example, you could use the same function in the following:
```prql
from i=invoices
join c=customers (==customer_id)
select {c.first_name, c.last_name, i.total}
sort {-total}
add_full_name
select {full_name, total}
```
I'll add some more examples in child comments.
Disclaimer: I'm a PRQL contributor.
snthpy 13 hours ago [-]
Here's a function to normalize values relative to the column range:
```prql
let normalize = func x -> ((x - min x)/((max x) - (min x)) | math.round 2)
from tracks
take 5
derive {ms_norm=(normalize milliseconds), bytes_norm=(normalize bytes)}
select {track_id, ms=milliseconds, ms_norm, bytes, bytes_norm}
```
which produces the following SQL:
```sql
WITH table_0 AS (
  SELECT
    track_id,
    bytes,
    milliseconds
  FROM
    tracks
  LIMIT
    5
)
SELECT
  track_id,
  milliseconds AS ms,
  ROUND(
    (milliseconds - MIN(milliseconds) OVER ()) / (
      MAX(milliseconds) OVER () - MIN(milliseconds) OVER ()
    ),
    2
  ) AS ms_norm,
  bytes,
  ROUND(
    (bytes - MIN(bytes) OVER ()) / (MAX(bytes) OVER () - MIN(bytes) OVER ()),
    2
  ) AS bytes_norm
FROM
  table_0
-- Generated by PRQL compiler version:0.13.2 (https://prql-lang.org)
```
Thanks for the thorough reply! Really glad to see at least one of the solutions working towards it, and I'll definitely check PRQL out again in more depth!
efromvt 10 hours ago [-]
To expand on the previous answer, right now this would be represented as:
key firstname string;
key lastname string;
auto full_name <- concat(firstname, ' ', lastname);
datasource people (
    firstname: firstname,
    lastname: lastname
)
address people;
datasource persons (
    firstname: firstname,
    last_name: lastname
)
address persons;
And a `select full_name;` could resolve from either table.
The missing bit, if you're trying to define a universe across both, is actually the union construct; right now a concept is assumed to have one cardinality space. Something like:
auto all_first_names <- union(first_name1, first_name2);
There's a coupling between the concept definition both as a function input and as a semantic value. They could be decomposed, but you'd still need to recompose them at some point before running a query.
This is something I've never thought about before, and haven't had a use case for, so I'm genuinely interested in learning a little more about your use cases if you can elaborate a little further.
default-kramer 14 hours ago [-]
Sometimes you want to be able to do something like "run some SQL, but instead of using the normal tables use these temp tables I just created."
In particular, I wanted to do this in SQLite recently. I wanted to have one write process which would always remain unblocked. And I also wanted to be able to run certain tasks which would do some temporary/discardable DB manipulations as part of producing an output file. These tasks could open the SQLite DB in read-only mode, load relevant data into temp tables, manipulate that data, and write the output file. Everything would have worked great if only SQL were a more composable language.
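Sketch of the task flow I had in mind (SQLite; names invented):

```sql
-- The main DB is opened read-only; temp tables live in SQLite's separate
-- temp store, so none of this ever blocks the single writer.
CREATE TEMP TABLE work AS
SELECT * FROM events WHERE day = '2024-01-01';
UPDATE work SET amount = amount * 100;  -- discardable manipulation
SELECT * FROM work;                     -- feeds the output file
```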
totalhack 12 hours ago [-]
If I'm understanding your intent correctly, I think you could do that with Zillion: https://github.com/totalhack/zillion
Disclaimer: been sidetracked by an acquisition at my day job this year, intend to put more time into this project soon, but I use it in production to great effect.
hobofan 12 hours ago [-]
Sure, I could do that in Zillion the same way I'm currently composing queries in Python at runtime with PyPika.
I'm looking for a more programming-language-agnostic solution of the kind that tools like PRQL and Malloy usually offer.
efromvt 10 hours ago [-]
Hmm - in Trilogy, if both tables had firstname and lastname as concepts bound to them, and you created full_name as the concat of those concepts, you'd only need to define full_name once and the calculation would work against both without any direct binding to either. The unioning is actually the unsupported part right now, though it's planned!
jasonpbecker 14 hours ago [-]
This feels like a case for a function.
hobofan 13 hours ago [-]
See my reply to hn_throwaway_99. The example is overly trivialized and should expand to more complex relational logic as well.
emmanueloga_ 6 hours ago [-]
I might be off here, but this seems like the right place to ask: don't most SQL replacements focus heavily on querying while largely overlooking insertion and updating?
I get why querying gets more attention: insertions are usually straightforward and don't need much simplification. Updates, on the other hand, can be a bit trickier since they often involve modifying data derived from complex queries. These tools seem geared toward data analysis rather than data generation, which is OK: it's nice to focus on a single problem and solve it "right".
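E.g. the tricky kind of update I mean, where the new values come out of a complex query (PostgreSQL-style syntax, names hypothetical):

```sql
UPDATE accounts a
SET balance = s.total
FROM (
  SELECT account_id, SUM(amount) AS total
  FROM ledger
  GROUP BY account_id
) s
WHERE a.id = s.account_id;
```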
But! For projects where a single person handles data creation, analysis, and management, it feels cumbersome to use one set of tools for querying (the "R" in CRUD) and another for creation, updates, and deletions ("C", "U", and "D"). I think a "SQL replacement" or approach covering all of CRUD could be interesting for projects of any scale - something I could pick instead of shopping for ORMs and/or lightweight query generators.
tluyben2 18 hours ago [-]
This is great; I have been thinking about this for a long time. I like reading about past and current implementations that try to improve SQL, from a programming, data science, and performance perspective. I am aware of the ones you linked and some others, like 'Real' (shakti.com) SQL and some enhancements from papers.
Anyway; nice one! Will try.
totalhack 12 hours ago [-]
Congrats on the launch. I made a tool that has some similar objectives but doesn't present as SQL itself like Trilogy seems to: https://github.com/totalhack/zillion. I'll take a deeper look at Trilogy soon; always interested to see the variety of approaches to this.
Oh wow yeah, a lot of parallels - thanks for sharing, I'll take a deeper dive in a bit. I think there's a lot of demand and a lot of space for different solutions; Trilogy definitely aspires to hew closer to standard SQL. (I actually really like SQL for the most part!)
snthpy 13 hours ago [-]
Cool project!
Will take a proper look when I get a chance.
In the meantime, I just wanted to say: nice name! ;-)
efromvt 8 hours ago [-]
Thanks on the name, hah. Been fun to see the progress PRQL has made going mainstream!
mritchie712 18 hours ago [-]
Not a SQL replacement, but if you're looking for an open source semantic layer, Cube is the way to go [0]
0 - https://github.com/cube-js/cube
I was reading this comment and I have no expertise in SQL; would you mind explaining why the examples "look like junior-level bloat" to you?
hipadev23 16 hours ago [-]
Their examples showing why Trilogy is so good compare it against poorly written SQL with poorly designed schemas. It reduces my confidence that their tool is actually solving real problems; it instead seems borne out of frustration with learning SQL and databases.
efromvt 7 hours ago [-]
Any particular examples you have in mind? The demo is just referencing https://github.com/duckdb/duckdb/tree/main/extension/tpcds/d... which I wouldn't regard as a standard of good SQL (implicit joins, yikes!), but it is a useful capability reference (as is TPC-DS in general).
As I tried to convey, I like SQL a lot - my frustration is more around the lifecycle and maintainability.
Happy to add more ergonomic references in other places, if you have some good examples to reference against?