dgrsmith t1_j6wpjae wrote

This was discussed over on r/datascience too. We’d love it if it worked out of the box, but telling the tool what each table does and what its columns mean requires a level of documentation that most companies don’t have reliably, nor would it be standardized enough to let a model like GPT generalize. In a perfect world, metadata is available and data governance is a significant focus. In practice, companies often don’t make time for these tasks because they require considerable work. Additionally, even where there are efforts to standardize, the underlying concepts often need a lot of human intervention before being pushed into models.

With this in mind, the title should read, “GPT tool that lets you connect to unrealistically well-documented databases and ask questions in text.”

This may be a factor in convincing a company’s CTO that they need to let us focus on documentation, but right now, governance and metadata are far from priorities for most analytics teams.

8

futebollounge t1_j6x36l4 wrote

Not sure this is actually a huge bottleneck. You’d just dedicate a few people to ensuring that the metadata and documentation stay updated. That’s how data roles start to shift in an AI world: you might not need a team of 20 data people, but could get away with 10.

5

dgrsmith t1_j6xe2bu wrote

Totally agree. The company's CTO or business users need to buy into this in order to allow resource allocation. It's promising; it just requires a hell of a lot of "human in the loop" at the moment to finesse the data to a point where the AI can produce reliable results from hidden concepts and constructs in raw tables. I think the current assumption is that your data is already finessed for GPT to take over and produce reliable, clean results. Those 10 data people will certainly be supported by data cleaning staff. That's where the effort should go anyhow. No data scientist likes spending 80% of their time cleaning and prepping data, but that's where we are now.

7

nutidizen t1_j6wtwb8 wrote

> the knowledge requirements needed to tell the tool what tables do and what each of their columns mean requires a level of documentation that most companies don’t have reliably

See the potential. This is where this tool is now. Where is it gonna be in a year? :)

3

dgrsmith t1_j6xe8fx wrote

I do too! It just needs buy-in from stakeholders, and support, as stated in my other comments.

1

HighTechPipefitter t1_j6x0ki5 wrote

>This was discussed over on r/datascience too. We’d love it if it worked out of the box, but the knowledge requirements needed to tell the tool what tables do and what each of their columns mean requires a level of documentation that most companies don’t have reliably

If your tables and columns are named explicitly, you can get away with just feeding it your database schema, and the AI will figure out what you are talking about.

If not, you can create views that make it clearer what each table and column means, and feed it those instead.

You can also give it special rules to keep in mind. For example, if your DB stores a man as "1" and a woman as "2", you can add that instruction to your prompt and the AI will understand that whenever you ask about men it needs to check for the value "1".
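To make that concrete, here's a minimal sketch of the prompt-building step: combine the schema, the special rules, and the user's question into one text-to-SQL prompt. The table, column coding, and wording are made-up illustrations, not from any particular product.

```python
# Illustrative schema with an ambiguously coded column.
SCHEMA = """\
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    sex        INTEGER,  -- coded value, see rules in the prompt
    birth_date TEXT
);"""

# Special rules that document what the raw values actually mean.
RULES = [
    'In the "sex" column, a man is stored as 1 and a woman is stored as 2.',
]

def build_prompt(schema: str, rules: list[str], question: str) -> str:
    """Assemble schema, rules, and the user's question into one prompt."""
    rule_text = "\n".join(f"- {r}" for r in rules)
    return (
        "Given this SQLite schema:\n\n"
        f"{schema}\n\n"
        "Rules:\n"
        f"{rule_text}\n\n"
        f"Write a SQL query that answers: {question}"
    )

prompt = build_prompt(SCHEMA, RULES, "How many men are in the table?")
print(prompt)
```

The point is that the value-mapping rule rides along with every request, so the model doesn't have to guess what "1" means.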

I expect text-to-SQL will become a standard pretty soon. It's just way too strong.

1

dgrsmith t1_j6xa0mw wrote

That’s the thing though: you need to know what you’re looking for in the database in order for it to provide you with data. AI can guess, sure, but you won’t be able to trust the results unless you’re familiar with the database yourself, and you make sure the AI is too. I agree it’s not a deal-breaker; it’s just a case of considerable resource reallocation.

In your example as well, even though there is an implicit assumption that gender is an easy construct to define, that may not be the case. Are we talking sex at birth? Sex at point of observation? Gender identity? Constructs require a lot of data understanding and finessing that end users won’t be able to pull off without a human directing the AI somehow, by providing data availability and documentation. Once those human data-prep processes are done, yes, you want your end users to be able to ask questions of the data readily. But it takes a fair bit of human anticipation of what should be available to the AI given end-user business needs.

1

HighTechPipefitter t1_j6xe1ge wrote

There's definitely a learning curve for users to express themselves properly. But there are also different strategies you can use to help them.

A library of common prompt examples is a first one.

A UI that lets you assemble predefined query fragments is another.

You could also use embeddings to detect ambiguity and ask the user for clarification.

You also don't need to expose your whole schema right away; this can be done gradually. You start with the most common requests and build from there. That way you don't need to invest a huge amount of resources from the beginning.
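As a sketch of the ambiguity-detection idea: a real system would compare embedding vectors from an embeddings API, but here a toy word-overlap score stands in for similarity so the example is self-contained. The column names, descriptions, and threshold are all made up.

```python
def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Hypothetical catalog mapping columns to natural-language descriptions.
COLUMN_DESCRIPTIONS = {
    "sex_at_birth": "sex recorded at birth",
    "gender_identity": "self-reported gender identity",
}

def disambiguate(question: str, margin: float = 0.1):
    """Return the best-matching column, or a clarifying question when the
    top two candidates score too close together."""
    scored = sorted(
        ((similarity(question, desc), col)
         for col, desc in COLUMN_DESCRIPTIONS.items()),
        reverse=True,
    )
    (s1, best), (s2, runner_up) = scored[0], scored[1]
    if s1 - s2 < margin:
        return f"Did you mean {best} or {runner_up}?"
    return best

print(disambiguate("how many women by gender identity"))  # → gender_identity
print(disambiguate("count of women"))  # ambiguous → asks for clarification
```

With real embeddings the shape is identical: score the question against each column description, and only pass the query through when there's a clear winner.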

We are barely scratching the surface on how to use it. This will be common practice pretty soon.

If you are in a position that you have access to a database at work, I strongly suggest that you give it a try. It's surprisingly good.

1

dgrsmith t1_j6xffed wrote

>If you are in a position that you have access to a database at work, I strongly suggest that you give it a try. It's surprisingly good.

I'll give it a try with synthetic data! Maybe I'll be surprised at the amount of finessing it doesn't take. I assume it's gonna take quite a bit to make it work, but I'll give it a shot!

1

HighTechPipefitter t1_j6xjlgw wrote

Fun starts here: https://platform.openai.com/examples/default-sql-translate

Then "all you need" is a bit of Python: one API call to get the query from OpenAI, and another call to send that query to your database.
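A rough end-to-end sketch of that pipeline: ask a model for SQL, then run the SQL against a database. The model call is injected as a plain callable so the sketch runs without network access; in practice that callable would wrap the OpenAI API with a prompt containing your schema. SQLite stands in for the real database, and the table and data are invented for the demo.

```python
import sqlite3
from typing import Callable

def answer_question(
    ask_llm: Callable[[str], str],  # maps a prompt to a SQL string
    conn: sqlite3.Connection,
    schema: str,
    question: str,
):
    """Build a prompt, get SQL from the model, and execute it."""
    prompt = f"Schema:\n{schema}\n\nWrite SQL to answer: {question}"
    sql = ask_llm(prompt)
    return conn.execute(sql).fetchall()

# Demo with an in-memory database and a canned "model" response.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

fake_llm = lambda prompt: "SELECT COUNT(*) FROM users"
result = answer_question(fake_llm, conn, "users(id, name)", "How many users?")
print(result)  # → [(2,)]
```

In a real deployment you'd also want to validate or sandbox the generated SQL before executing it, since the model's output goes straight to your database.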

Start small; there are a lot of little quirks, but the potential is definitely there.

I expect that in the coming years you will start to see a bunch of articles on best practices for integrating an AI with a database.

Good luck.

2