SQL 是数据分析中极其核心且常用的取数工具。绝大多数数据分析工作都离不开 SQL 从数据库中提取所需数据。 - 如果你想从事数据分析工作,熟练使用 SQL 进行高效、准确的取数和数据准备是必须掌握的基本功。
SQL数据分析取数,搞定平时90%的取数工作。
一、基础查询
--查询所有字段
SELECT*FROM table_name;
--查询指定字段
SELECT columnl, column2 FROM table_name;
--别名(Alias)
SELECT columnl AS name, column2 AS value FROM table_name;
二、条件过滤
--基础过滤(WHERE)
SELECT * FROM table_name WHERE column1>100;
--多条件组合(AND/OR)
SELECT * FROM table_name WHERE column1 ='A' AND (column2>50 OR column3 IS NOT NULL);
--IN 操作符
SELECT * FROM table_name WHERE columnl IN ('A', 'B', 'C');
--LIKE 模糊匹配
SELECT * FROM table_name WHERE columnl LlKE '%keyword%';
--NULL值判断
SELECT * FROM table_name WHERE columnl IS NULL;
三、排序与分页
--排序(ORDER BY)
SELECT * FROM table_name ORDER BY column1 DESC, column2 ASC;--降序/升序
--分页(LIMIT&OFFSET)
SELECT * FROM table_name LIMIT 10 OFFSET 20;--跳过村20条,取10条(第3页)
四、聚合统计
--常用聚合函数
SELECT
COUNT(*) AS total_rows, --计数
SUM(sales) AS total_sales, --史和
AVG(price) AS avg_price, --平均值
MAX(date) AS latest_date,--最大值
MIN(date) AS earliest_date --最小值
FROM sales table;
--按分组聚合(GROUP BY)
SELECT category, SUM(sales) As total_sales FROM sales_table;
--HAVING 过滤分组结果
SELECT category, SUM(sales) AS total_sales
FROM sales table
GROUP BY category
HAVING SUM(sales)>1000;
五、子查询与临时表
--子查询(Subquery)
SELECT * FROM orders
WHERE user id IN (SELECT user_id FROM users WHERE is_vip=1);
--公共表达式(CTE,WTTH 语句)
WITH top users AS (
SELECT user id, SUM(sales) AS total_sales
FROM orders
GROUP BY user_id
ORDER BY total sales DESC
LIMIT 10
)
SELECT * FROM top_users;
六、表连接(J0IN)
--内连接(INNER JOIN)
SELECT a.*, b.extra_info
FROM table a a
INNER J0lN table b b ON a.id =b.a_id;
--左连接(LEFT JOIN)
SELECT a.*,b.extra info
FROM table a a
LEFT JOIN table b b ON a.id = b.a_id;
--多表连接
SELECT a.name, b.order_id, c.product_name
FROM users a
LEFT J0lN orders b ON a.user id= b.user_id
INNER JOIN products c ON b.product_id=c.product_id;
七、日期处理
--提取日期部分
SELECT
DATE(order_time) AS order_date,--提取日期(如2023-10-01)
EXTRACT(YEAR FROM order time) As year,-提取年份
EXTRACT(MONTH FROM order time) AS month--提取月份
FROM orders;
--日期计算SELECT
order time, order time+INTERVAL '7 DAY' AS next_week_date --加7天FROM orders;
--日期时间格式化
--date_format() 按指定格式返回时间(对日期时间格式化)
#将“2023-09-21 15:06:51”转化如下格式
select date_format('2023-09-21 15:06:51','%Y-%m-%d');
select date_format('2023-09-21 15:06:51','%Y-%M-%D');
select date format('2023-09-21 15:06:51','%M-%d-%y');
select date_format('2023-09-21 15:06:51','%m/%d/%y');
select date_format('2023-09-21 15:06:51','%m/%d/%Y %H:%i :%s');
select date format('2023-09-2115:06:51','%Y年%m月%d日 %H点%i分%s秒');
八、窗口函数(Window Functions)
--排名(RANK,ROWLNUMBER)
SELECT
user_id,
sales,
RANK() OVER (ORDER BY sales DESC) AS sales rank,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS row_num
FROM sales_table;
--累计求和(SUM OVER)
SELECT
date ,
daily_sales,
SUM(daily sales) OVER (ORDER BY date) AS cumulative_sales
FROM daily_sales_table;
九、条件判断
--适用场景:数据分类打标
SELECT
id
CASE
WHEN score >=90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END AS grade
FROM students;
十、去重计数
--适用场景:统计UV等去重指标
SELECT
COUNT (DISTINCT user_id) AS uv
FROM page_views
WHERE visit date='2023-10-01';
十一、排序+ToP N
--取销售额TOP10商品
SELECT product id, SUM(sales) AS total_sales
FROM sales_data
GROUP BY product_id
ORDER BY total_sales DESC
LIMIT 10;
--适用场景:排行榜、头部数据分析
十二、分类打标
--条件分支判断
SELECT
user_id,
CASE
WHEN age<18 THEN '未成年'
WHEN age BETWEEN 18 AND 6O THEN '成年'
ELSE '老年'
END AS age_group
FROM users;
十三、分析函数
分析函数row_number, rank, dense_rank, cume_dist, percent_rank, ntile
1.row number()
按顺序排序,排序的值不会重复,总数不变:
select uid,
dt,
pv,
row_number() over (partition by uid order by pv desc) as row_number_pv
from user_pv
order by uid, pv desc;
2. rank( )
大小一样排序的值一样,但会占用排名的位置,总数不变;下面对用户每天浏览量进行一个排名。
select uid,
dt,
pv,
rank() over (partition by uid order by pv desc) as rank_pv
from user_pv
order by uid, pv desc;
3. dense_rank()
排序值相同时重复,排名并列,排名依次增加,排序相同时总数会减少;例如,如果两行排名为3,则下一个排名为4,不同于RANK()函数返回5。下面对用户每天浏览量进行一个排名:
select uid,
dt,
pv,
dense_rank() over (partition by uid order by pv desc) dense_rank_pv
from user_pv
order by uid, pv desc;
对比看下,row_number, rank, dense_rank 的运行效果:
select uid,
dt,
pv,
row_number() over (partition by uid order by pv desc) as row_number_pv,
rank() over (partition by uid order by pv desc) as rank_pv,
dense_rank() over (partition by uid order by pv desc) dense_rank_pv
from user_pv
order by uid,pv desc;
总结来说,ROW_NUMBER函数为每一行分配唯一的行号,而RANK函数和DENSE_RANK函数在处理具有相同排序值的行时有所不同。RANK函数会跳过下一个排名,而DENSE_RANK函数会紧随其后。
选择使用哪个函数取决于具体的需求和对重复值的处理方式。
4.cume dist()
累积分布cume_dist()函数,用于计算当前行在排序结果中的累积分布比例。
计算公式=前面的行数/窗口分区中的总行数
#4、5的合并案例
select uid,
dt,
pv,
cume_dist() over (partition by uid order by pv) cume_dist_pv
from user_pv
order by uid, pv;
5. percent_rank()
非常类似于cume_dist函数。同样用于计算当前行在排序结果中的累积分布比例。
计算公式=前面的行数-1/窗口分区中的总行数-1
select uid,
dt,
pv,
percent_rank() over (partition by uid order by pv) as percent_rank_uv
from user_pv
order by uid, pv;
#计算机专业 #数据分析我在行 #程序员 #数据库 #sql语句 #sql刷题 #sql学习 #SQL #数据分析师 #数据分析我在行
阅读原文:原文链接
该文章在 2025/8/25 21:47:10 编辑过