一个有意思的SQL问题

前阵子，在工作中收到一个业务部门发起的取数工单，取数口径比较有意思。前阵子，因为其他事情比较忙，现在抽空记录一下问题。

部分业务背景已省略，简化问题如下：

…筛选连续三个月及以上有交易流水的用户…

这个问题看起来很简单，最笨蛋的方法无非就是筛选1,2,3,4,5,6,7,8,9,10,11,12月每个月有交易的用户，形成12个临时表，然后再暴力遍历。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


with tmp_1 as (
  select cust_id 
  from trx_act
  where trx_date between '2022-01-01' and '2022-01-31'
  group by cust_id
),tmp_2 as (
  select cust_id 
  from trx_act
  where trx_date between '2022-02-01' and '2022-02-29'
  group by cust_id
),
.....
,tmp_12 as (
  select cust_id 
  from trx_act
  where trx_date between '2022-12-01' and '2022-12-31'
  group by cust_id
)





select a.cust_id 
from tmp_1 a 
left join tmp_2 b 
on a.cust_id = b.cust_id 
left join tmp_3 c 
on a.cust_id = c.cust_id 
where b.cust_id is not null and c.cust_id is not null


.....

当然，我这里又想出了两种比较好的方法：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


-- 方法1
select cust_id,action,count(*) as times 
from (
      select cust_id,action,(month - rank_num) as flag
      from ( 
        select cust_id,action,month,rank() over(partion by cust_id,action order by month) as rank_num
        from tbl 
        group by month,action,cust_id a
      )b
)
group by cust_id,action,flag
having times >= 3;

-- 方法二
select distinct cust_id,action 
from (
     select cust_id 
            ,action
            ,month
            ,lag(month,2) over(partition by cust_id,action order by month) as lag_2_month
     from tbl 
     group by month,cust,action 
)a 
where month - lag_2_month = 2;  -- 某一行的月份 - 前两行的月份的差值等于2

方法一在取一些连续n次行为的数据提取的时候很常用，原理就是将特定行为去重排序，取日期/月份和序号的差值作为计数器

数学原理：

一般地，等差数列 {a_n} = a₁,a₂,a₃…… ,{b_n} = b₁,b₂,b₃……

那么，{a_n} - {b_n} 就是一个常数数列。

方法二的数学原理和方法一的数学原理差不多，差别在于方法二使用了一个 lag()窗口函数。

1
2
3
4


LAG(<expression>[,offset[, default_value]]) OVER (
    PARTITION BY expr,...
    ORDER BY expr [ASC|DESC],...
) 

expression

LAG()函数返回expression当前行之前的行的值，其值为offset 其分区或结果集中的行数。

offset

offset是从当前行返回的行数，以获取值。offset必须是零或文字正整数。如果offset为零，则LAG()函数计算expression当前行的值。如果未指定offset，则LAG()默认情况下函数使用一个。

default_value

如果没有前一行，则LAG()函数返回default_value。例如，如果offset为2，则第一行的返回值为default_value。如果省略default_value，则默认LAG()返回函数NULL。

`PARTITION BY` 子句

PARTITION BY子句将结果集中的行划分LAG()为应用函数的分区。如果省略PARTITION BY子句，LAG()函数会将整个结果集视为单个分区。

`ORDER BY` 子句

ORDER BY子句指定在LAG()应用函数之前每个分区中的行的顺序。

LAG()函数可用于计算当前行和上一行之间的差异。

expression#

offset#

default_value#

PARTITION BY 子句#

ORDER BY 子句#

expression

offset

default_value

`PARTITION BY` 子句

`ORDER BY` 子句