How to select a random row from a table?
Hi community,
I need to write an SQL query to fetch a random record from a table, that table has millions of data.
In postgresql, for example, there is a RANDOM() function to do something like that:
SELECT column FROM table ORDER BY RANDOM() LIMIT 1
Is it possible to do something like that in Caché?
Thanks in advance
One thing I've done to split machine learning datasets is to use an auxiliary table that maps IDs to a random number. I write a stored procedure that returns a random number. I create a table with two columns, one for the ID of the source table, and one to hold a random number. I populate the column for the source IDs:
insert into random_table (source_id) select id from source_table
I then populate the column for the random number:
update random_table set random_number = MySchema.MySP_Random(1E9)
Then I can select with an ORDER BY clause on the random number:
select top 10 source_id from random_table order by random_number, source_id
It depends on your use case whether this will be appropriate for a source table with millions of rows. It's an expensive way to select just one row.
Jonathan, where do you perform the training/fit part of the AI/ML pipeline?
This is for a simple train-test split as part of testing for IntegratedML, currently in beta.
As there is no default RANDOM() in Caché/IRIS SQL you have to write it yourself.
Create this class:
{ ClassMethod Random(maxval As %Integer = 2) As %Integer [ SqlName = RANDOM, SqlProc ]
{
quit $random(maxval)
} }
Assuming your table has defautl ID (%Integer, MINVAL=1)
HAVING ID >= RANDOM(MAX(ID))
>= ensures that you just get an existing ID and don't drop on a deleted one
Of course, if you know in advance the highest existing ID (e.g. maxid) then RANDOM(maxid) will do it.
Thank you very much guys
I'm curious to know what you're doing that you are looking for a random record in such a large dataset? Sincerely interested in what you're doing so I can learn and not challenging the validity of your task!
Hi @Mike.Davidovich is to audit the data with impartiality
Well that makes perfect sense!