Tuesday, March 20, 2012

Calculate median of difference in days between records

I have a table of sample data

Samples(sample_no, sample_date..)

I have no idea how to do the following in sql server or if its even possible:

1. Calculate the difference in days between all samples.
2. Select the median result

Any trick to get this done would be really helpful

thanks,

DB1. you can start by reading about datediff.

2. median. that is the value in the physical middle, right?. I would create a temp tale using the IDENTITY function and a order by on the column in question and then do the math on the new identity value.|||2. Create a query. Cross join table to itself. Deduct your field from itself (T1.Field1 - T2.Field2). Apply the Sign function to the result and sum it. The median is the one that comes up 0. If there are an even numbers of values then there will be two medians (1 and -1) so you need to decide if you want the higher or lower one.

DATEDIFF will help you with the above too.

HTH

EDIT - the above allows you to use Groupings so you can get medians by group too.|||Is there any chance I could get an example in sql code of what you are describing... IM a little confused...|||Well, first, do you mean the difference in days between all samples, or just the difference between a sample and the immediately preceding sample? One, requires a cross-join, and the other does not, so the devil is in the details.|||EDIT - the above allows you to use Groupings so you can get medians by group too.Oh yeah - but that would require a join by the grouping column(s). Forgot that little detail :o|||here is some sample data:

Sample_date sample_no
4/29/1962 0:00 1962706086
5/1/1962 0:00 1962704968
5/3/1962 0:00 1962704891
5/4/1962 0:00 1962722248
5/15/1962 0:00 1962706641
5/18/1962 0:00 1962722249
5/18/1962 0:00 1962704969
5/22/1962 0:00 1962706344
6/5/1962 0:00 1962722690
6/7/1962 0:00 1962722250
6/8/1962 0:00 1962708068
6/9/1962 0:00 1962722842
6/12/1962 0:00 1962708069
6/12/1962 0:00 1962722928
6/13/1962 0:00 1962722691
6/13/1962 0:00 1962706345
6/13/1962 0:00 1962706427
6/16/1962 0:00 1962706561
6/16/1962 0:00 1962704970
6/16/1962 0:00 1962704971
6/16/1962 0:00 1962705794
6/17/1962 0:00 1962708070
6/17/1962 0:00 1962708071
6/17/1962 0:00 1962722929
6/17/1962 0:00 1962722930
6/18/1962 0:00 1962722692
6/18/1962 0:00 1962722693
6/18/1962 0:00 1962722694
6/18/1962 0:00 1962722695
6/18/1962 0:00 1962722696
6/18/1962 0:00 1962722697
6/18/1962 0:00 1962722698
6/18/1962 0:00 1962722843
6/23/1962 0:00 1962706562
6/23/1962 0:00 1962722251
6/23/1962 0:00 1962722252
6/29/1962 0:00 1962704892
6/29/1962 0:00 1962704893
7/1/1962 0:00 1962722931
7/2/1962 0:00 1962708072
7/2/1962 0:00 1962722844
7/4/1962 0:00 1962708073
7/6/1962 0:00 1962722932
7/7/1962 0:00 1962705881
7/7/1962 0:00 1962722253
7/11/1962 0:00 1962722933
7/15/1962 0:00 1962722254
7/18/1962 0:00 1962708074
7/18/1962 0:00 1962708075
7/18/1962 0:00 1962722934
7/22/1962 0:00 1962722255
7/22/1962 0:00 1962704894
7/25/1962 0:00 1962722935
7/27/1962 0:00 1962708076
7/27/1962 0:00 1962708077
7/28/1962 0:00 1962704895
7/28/1962 0:00 1962722256
8/1/1962 0:00 1962708078
8/1/1962 0:00 1962708079
8/1/1962 0:00 1962722699
8/3/1962 0:00 1962722936
8/3/1962 0:00 1962722937

When you look at the first 2 rows:
4/29/1962 0:00 1962706086
5/1/1962 0:00 1962704968

you see there is a difference of 3 Days.

When you look at row 2 and 3
5/1/1962 0:00 1962704968
5/3/1962 0:00 1962704891

you see there is a difference of 2 Days.

This difference can be from 1 - infinate days.

calculate all the differences between all the consecutive days in this pattern then get the median value (the number in the MIDDLE of the set of given numbers)
ie if you have 9 values
1. 12
2. 25
3. 44
4. 55
5. 89
6. 122
7. 155
8. 325
9. 8000

the median is 89. I dont want to have to calculate median manually like this, its just way to much processing although im not sure if its built into sql server.

so for each station there should be a single value comming out and:
if num between 1 - 7 then "Monthly"
if num between 7 - 14 then "Bi-Weekly"
if num between 14 - 30 then "Monthly"
if num between 30 - 360 then "Yearly"

... or something similar

To make matters worse this is only a small fraction of a larger "Station Level" query that will be including this query somehow. Groups of samples belong to different "Stations" which I have to seperate by. For example, all the data you see here is for 1 station. There are about 300 stations being returned in the higher level view that this query will belong to. For kicks I will include it:

you can see that Sampling_Frequency is there as "UNKNOWN" cause i dont know how to do it. also if anyone wants to suggest how i optimize this query feel free to add suggestions:

SELECT DISTINCT stat.station_number AS Site_ID, stat.station_number AS Site_Number, stat.station_name AS Site_Name, stat.station_description AS Site_Description,
stat.creation_date AS Site_Activation_Date, dbo.station_coordinates.lat_dec_deg AS Latitude,
dbo.station_coordinates.lat_dec_deg AS North_bounding_Latitude, dbo.station_coordinates.lat_dec_deg AS South_bounding_Latitude,
dbo.station_coordinates.lon_dec_deg AS Longitude, dbo.station_coordinates.lon_dec_deg AS West_bounding_Longitude,
dbo.station_coordinates.lon_dec_deg AS East_bounding_Longitude, 'N/A' AS Site_Photograph_link, 'N/A' AS Local_site_influences_link,
'http://map.ns.ec.gc.ca/NA' AS Collection_level_metadata_link, 'http://map.ns.ec.gc.ca/envirodat/' AS Other_info_link,
'Environment Canada, Atlantic Region, Environmental Conservation Branch' AS Organization_name, '5068512398' AS Contact_Phone_number,
'5068516608' AS Contact_Fax_number, 'PO BOX 23005' AS Contact_street_address, 'Moncton' AS Contact_city, 'NB' AS Contact_province,
'E1A 6S8' AS Contact_postal_code, 'cathy.cormier@.ec.gc.ca' AS Contact_e_mail, 'http://map.ec.gc.ca/envirodat' AS Organization_web_site,
'0800-1600' AS Contact_hours_of_service, 'N/A' AS Contact_other_instructions, dbo.projects.project_no, dbo.projects.project_name,

(SELECT DISTINCT wshed
FROM envguest.stations_watersheds
WHERE station = stat.station_number) AS watershed,


case active_indicator WHEN 'Y' Then NULL else (select max(sample_date) from samples where station_no = stat.Station_number) END as Site_Final_Sampling_Date,

CASE active_indicator WHEN 'Y' THEN 'Active' ELSE 'Inactive' END AS Current_Status,

'UNKNOWN' as Jurisdiction, 'UNKNOWN' as Sampling_Frequency


FROM dbo.station_coordinates INNER JOIN
dbo.station stat ON dbo.station_coordinates.station_number = stat.station_number INNER JOIN
dbo.samples ON stat.station_number = dbo.samples.station_no INNER JOIN
dbo.projects ON dbo.samples.project_no = dbo.projects.project_no|||Hi

A few quickies:

Seen this posted a few times and now it is my turn (ahem):

Order has no meaning in a relational database.

This means that the apparent order of any rows in a table is arbitary - they are in fact descrete elements.

I presume that you want to order the dates in ascending order and get the difference in days between successive dates?

Also (not being picky - just to understand):
calculate all the differences between all the consecutive days in this pattern then get the median value (the number in the MIDDLE of the set of given numbers)
ie if you have 9 values
1. 12
2. 25
3. 44
4. 55
5. 89
6. 122
7. 155
8. 325
9. 8000
If there were nine dates there should only be eight numbers returned right?
Finally - I notice the above return is ordered ascending. Is this a requirement? You aren't looking for the difference in days between the two median dates but the median of the differences?

BTW - are you a surfing French man?|||Lost the server for a bit so had a play. Below is based on assumption that my assumptions were correct :D

Decided not to use Thrasys Identity column - however you would be forgiven for thinking I have coded this with obsucation in mind. Not true - I just wanted to create a single query solution

IF EXISTS(SELECT * FROM sysobjects O WHERE O.name = 'MedianT' AND O.xtype = 'U') BEGIN
DROP TABLE MedianT
End

CREATE TABLE MedianT (TheDate DATETIME)

INSERT INTO MedianT (TheDate)
SELECT '4/29/1962' UNION
SELECT '5/1/1962' UNION
SELECT '5/3/1962' UNION
SELECT '5/4/1962' UNION
SELECT '5/8/1962' UNION
SELECT '5/10/1962' UNION
SELECT '5/13/1962' UNION
SELECT '5/14/1962' UNION
SELECT '5/29/1962' UNION
SELECT '5/29/1963' UNION
SELECT '5/29/1967' UNION
SELECT '5/29/1969' UNION
SELECT '5/29/1974' UNION
SELECT '5/29/1980' UNION
SELECT '5/29/2000'

SELECT Da.TheDifference AS TheMedian
FROM (SELECT DISTINCT TheDifference FROM (SELECT DATEDIFF(D, A.TheDate, MIN(B.TheDate)) AS TheDifference
FROM MedianT A, MedianT B
WHERE A.TheDate < B.TheDate
GROUP BY A.TheDate) Daa) Da,
(SELECT DATEDIFF(D, A.TheDate, MIN(B.TheDate)) AS TheDifference
FROM MedianT A, MedianT B
WHERE A.TheDate < B.TheDate
GROUP BY A.TheDate) Db
GROUP BY Da.TheDifference
HAVING sum(SIGN(Da.TheDifference - Db.TheDifference)) IN (0, -1)|||BTW - the IN(0, -1) bit means:
If there are an odd number of vlaues then there is a natural median -> the result being 0. If there are an even number of values then there are two candidates for median. The -1 selects the lower of the two. Change to 1 if you want the higher.|||That looks really good. Hopefully it will work on a dataset of 10,000's of records. Havent had a chance to implement it yet but will today.
BTW. Im not french, but a decendant of French west coast Newfoundlanders who jumped ship from france in the 1800's and I dont surf. hah

No comments:

Post a Comment