Adaptive query optimization is the extension of standard PostgreSQL cost-based query optimizer. Its basic principle is to use query execution statistics for improving cardinality estimation. Experimental evaluation shows that this improvement sometimes provides optimal perfomance. The installation of AQO is described here.
To know about AQO, you can read articles, documentation and see presentations from these sources:
- AQO documentation
- Oleg Ivanov, Sergey Bartunov :Adaptive Cardinality Estimation
- Oleg Ivanov: Adaptive query optimization in PostgreSQL
- Oleg Ivanov, Yerzhaisang Taskali: Adaptive query optimization in PostgreSQL
AQO links (Russian version):
- Павел Толмачев: AQO — адаптивная оптимизация запросов в PostgreSQL
- Олег Иванов: Применение машинного обучения для увеличения производительности PostgreSQL
- Олег Иванов: Адаптивная оптимизация запросов
- Андрей Лепихов: Постгрессовый планнер с памятью
- Алена Рыбакина: Адаптивная оптимизация запросов в PostgreSQL
Database consists of four tables:
Students' table contains information about students - names, genders, learning groups.
Teachers' table contains information about teachers (name and gender) who teach the courses.
Course table describes courses. It contains the ID information of one specific teacher their assigned course.
Scores table collects information about students course grades/marks and analytics for essays (word count) and whether students have a course before an exam.
See the database schema link bellow:

Note: you can restore dump_students.sql dump using command:
pg_dump -d postgres > 'dump_students2.sql'
You can generate own data for experiments using database_students.sql script. Note, if you run it you can generate 100 teachers, 10 courses, 3000 students and 15000 data for score table. You can use below commands to generate more data (instead of N you set how many data you want to generate):
To generate students:
insert into student (sno, gen, sgroup) select id, CASE WHEN random()<0.25 THEN 'Female' ELSE 'Male' END,
CASE
WHEN random()>=0.89 THEN 'ClassE'
WHEN random()>=0.79 THEN 'ClassB'
WHEN random()>=0.58 THEN 'ClassA'
WHEN random()>=0.32 THEN 'ClassD'
ELSE 'ClassC'
END as sgroup
from generate_series(1,N) id;
update student set sname = random_string(gen);
To generate teachers:
insert into teacher (tno, gen) select id, CASE WHEN random()<0.5 THEN 'Female' ELSE 'Male' END from generate_series(1,N) id;
update teacher set tname = random_string(gen);
To generate courses:
select select_random_course(N);
To generate score data:
select random_cno(N, 0);
select random_cno(N, 1);
The second parameter is about having preparation course students or not.
Anyway, you'll likely have nonuniform distribution. There are some my analytical research about database.
- The result short analysis with pandas profilling. To do it I prepared csv file with collected all important information about student academic performance through this script:
psql -d postgres -c "\copy (select degree, essay_text_len, clevel, gen, sgroup, test_preparation from student, course, score where score.cno = course.cno and student.sno = score.sno) to '/home/alena/score.csv' DELIMITER ',' CSV HEADER"
After that I ran pandas profilling to analyse data. There are analytical results.
Breaf analytical information:
As you see on the picture above I had 30000 data at all and 4800 among of them was dublicate.
There are data distribution between degree, essay_text_len and test_preparation columns by gender:

There are data distribution between degree, essay_text_len and test_preparation columns by groups:
Degree column has nonuniform distribution, where we have a lot of values between 40 and 90 degree.
There are some more statistical information about this column (MCV):
MCV statistics of degree column:
Extreme MCV statistics of degree column:

The most interesting distribution of this column by groups:
Degree distribution for Group A:
Degree distribution for Group B:
Degree distribution for Group C:
Degree distribution for Group D:
Degree distribution for Group E:

It also have nonuniform distribution, where we have a lot of students who didn't write essay.
There are some more statistical information about this column (MCV):
Statistic info about essay_text_len column:
MCV statistics of essay_text_len column:
Extreme MCV statistics of essay_text_len column:

It looks exciting to observe the distribution of the number of letters in the essay for some groups:
Distribution of number of letter of essay for Group E:
Distribution of number of letter of essay for Group D:
Distribution of number of letter of essay for Group C:
Distribution of number of letter of essay for Group B:
Distribution of number of letter of essay for Group A:

As we see about 80% of them have it.

As you see from picture bellow, we have more students from 'A' and 'B' classes, but less from 'E' class.

Here are statistic information how much data columns have by groups:

If you wanna install pandas profilling you can read about it there
I saw high corellation between degree and test_preparation columns (0.73) and highest one between essay_text_len and degree (1.0):

Corellation analysis for group A:

Corellation analysis for group B:

Corellation analysis for group C:

Corellation analysis for group D:

Corellation analysis for group E:

I use such script to check it:
corr = df_copy.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
f, ax = plt.subplots(figsize=(8, 8))
ax = sns.heatmap(corr,mask=mask,square=True,linewidths=.8,cmap="autumn",annot=True)
All script analysis is in database_analyzes.ipynb
This repository is a fork of the project. Here we unite queries and data in one place. Main goal of this repository is simplifying of the deploy process.
Large tables divided into chunks each less than 100 MB.
Repository contains some scripts for quick deploy & run of the benchmark:
schema.sql - creates the data schema. copy.sql - copies data from csv files into the tables. Uses datadir variable to make absolute paths to csv files. Use psql -v to define it. parts.sql - script which splitted the tables. Just for future reusage. job - a template script for quick launch of the benchmark. Example below shows export procedure of all tables:
psql -f ~/join-order-benchmark/schema.sql
psql -f ~/join-order-benchmark/fkindexes.sql
psql -vdatadir="'/home/user/jo-bench'" -f ~/jo-bench/copy.sql
NB! Deploying tested on database with C-locale.
Please note that you should specify the absolute path to the data files when using the 'copy' functions, otherwise you may catch the error that the files were not found.
To test AQO with JOB you should gather statistics with (disabled)[https://postgrespro.ru/docs/postgrespro/14/aqo#AQO-CONFIGURATION] mode. To do it you should adjust your database applying gucs:
aqo.mode = 'disabled'
aqo.force_collect_stat = 'on'
force_collect_stat allows you to gather statistic and store it in aqo_query_stats in any modes. You can use it to consider planning and execution time and also an avarage of error of cardinality to see is ot better to use AQO for queries or not.
I used 15 iterations for disabled mode to estimate the average planning and execution time for every query. But I also gathered statistic from 'Explain' description of query and use both options to analyze AQO working on JOB database.
I used the script:
for (( i=1; i<=disabled_iters; i++ ))
do
for file in $QUERY_DIR/queries/*.sql
do
short_file=$(basename "$file")
echo -n "EXPLAIN (ANALYZE, VERBOSE, FORMAT JSON) " > test.sql
cat $file >> test.sql
result=$(psql -d postgres -f test.sql)
query_hash=$(echo $result | grep -Eo '"Query Identifier": [+-]?[0-9]*' | grep -Eo '[+-]?[0-9]*')
exec_time=$(echo $result | sed -n 's/.*"Execution Time": \([0-9]*\.[0-9]*\).*/\1/p')
plan_time=$(echo $result | sed -n 's/.*"Planning Time": \([0-9]*\.[0-9]*\).*/\1/p')
echo -n "EXPLAIN " > test.sql
cat $file >> test.sql
result=$(psql -d postgres -f test.sql)
echo -e "$i,$short_file,$exec_time,$plan_time,$query_hash" >> $file_output
echo -e "$i,$short_file,$exec_time,$plan_time,$query_hash"
done
done
psql -d postgres -c "\copy (select * from aqo_query_stat) to '${mode}_aqo_query_stat.csv' DELIMITER ',' CSV HEADER"
What you should do to learn queries on JOB database is to change GUC's mode to learn and repeat the script above. During learning, you should disabled parallel workers to better and speed up the results of learning. I applyed such gucs:
min_parallel_table_scan_size = 1
min_parallel_index_scan_size = 1
max_parallel_workers_per_gather = 1
max_parallel_maintenance_workers = 1
I needed 15 iterations to see the minimal cardinalty error, but initially I set 22 iterations.
In addition, you can see an error cardinality after every learning iteration and notice is there a convergence in cardinality in queries or not. Use this command to realize it:
psql -d postgres -c "SELECT error AS error_aqo FROM aqo_cardinality_error(true)"
Last stage is running AQO in controlled or frozen mode. The main difference between them is collect new queries or not to AQO data to learn it. For known queiries planner uses cardinality information stored in AQO. I applyed GUC:
aqo.mode = 'controlled'
And I returned parallel working there, applyed GUCs like:
min_parallel_table_scan_size = 1
min_parallel_index_scan_size = 1
max_parallel_workers_per_gather = 1
max_parallel_maintenance_workers = 1
You can see my test results in JOB folder for three modes: disabled, frozen and learn. Files like "1_report.csv" or "frozen_1_report.csv" ("learn_1_report") contain execution, planning time and query hash of queries. Files like "setup_learn_settings.sh" contain settings that I applyed on every stage of AQO testing. Files aqo_job_disabled, aqo_job_forced and aqo_job_learn consist of test script that I described before.
You should pay attention on paths set in script, particulary the values of variables enumerated bellow: INSTDIR - the path where install folder with "bin" is located QUERY_DIR - the path where JOB queries are folder_output - the catalog where you store test results.
To compare statistic of working planner with and without AQO, first of all, you should consider an error of cardinality on every iteration of learning. To be noted, an cardinality error will be descreassing only for learning process, in disabled and controlled it will be changing slightly. If query are learnable, an error will be dropping and in the last itration it will even reach 0. In learning mode, planning and execution time might be unstable because planner tries to generate new optimal plan again after using new information about cardinality.
The graphics there show how an error of cardinality is falling during 22 iteration when queries were learning.
As I mentioned before, during learning, planner will probably regenerate the plan of execution query, so planning time will be more.
See the picture with the planning and execution time for every query there
The next thing, that you should consider is time of execution query. If with AQO planner built better plan than it was, the execution time must be less.
See the difference between execution time with and without used AQO for every query there
I used analytic information from aqo_query_stats copied the results within query:
psql -d postgres -c "\copy (select * from aqo_query_stat) to 'aqo_query_stat.csv' DELIMITER ',' CSV HEADER"
So, after preprocessing (I used code bellow and my owned functions), I got the presentative view of data for analysis:
import functions
from preprocess_dataframe import get_preprocess_dataframe
df_query_stat=pd.read_csv('/home/alena/Test-AQO/JOB/disabled/aqo_query_stat.csv')
lst_stat_columns = ['execution_time_with_aqo', 'execution_time_without_aqo',
'planning_time_with_aqo', 'planning_time_without_aqo',
'cardinality_error_with_aqo', 'cardinality_error_without_aqo']
lst_stat_cut = [1, 1, 1, 1, 1, 1]
df_query_stat = get_preprocess_dataframe(df_query_stat, lst_stat_columns, lst_stat_cut)
I got the table looking like that:

For processing files with analytical information (like learn_1_report.csv) and file with cardinality error (like I use learn_aqo_query_err_1.csv) I used script bellow:
df_ex_pl_time = pd.DataFrame()
df_ex_pl_time_orig = pd.DataFrame()
df_query_err=pd.DataFrame()
filename_basics=glob.glob("{}/*_report.csv".format(path_folder, mode))
for file in filename_basics:
df_list = list(pd.read_csv(file) for file in filename_basics)
if len(df_list)>1:
df_list = []
for file in filename_basics:
df=pd.read_csv(file)
df['Query hash'] = df['Query hash'].astype(str)
df_list.append(df)
df_ex_pl_time = pd.concat(df_list, ignore_index=True)
else:
df_ex_pl_time = df_list[0]
#report
df_ex_pl_time = df_ex_pl_time.fillna(0)
df_ex_pl_time = df_ex_pl_time.rename(columns = {'Plan time': 'plan_time',
'Query Number':'query_number',
'Query Name':'query_name',
'Execution Time':'execution_time',
'Query hash':'query_hash'})
df_ex_pl_time['query_hash'] = df_ex_pl_time['query_hash'].astype(str)
df_ex_pl_time['plan_time'] = df_ex_pl_time['plan_time'].astype(float)
df_ex_pl_time_orig=df_ex_pl_time.copy()
filename_basics=glob.glob("{}/*_err*.csv".format(path_folder, mode))
lst=[]
df_list=[]
def f(lst,k,l):
lst_temp=[k]*l
lst = lst+lst_temp
df_list = list(pd.read_csv(file) for file in filename_basics)
if len(df_list)>1:
df_query_err = pd.concat(df_list, ignore_index=True)
else:
df_query_err = df_list[0]
df_query_err['id'] = df_query_err['id'].astype(str)
df_query_err['errdelta'] = df_query_err['errdelta'].astype(float)
To assign what query names are determinated to query hash I used script:
dict_hash = {}
for i in df_ex_pl_time.itertuples(index=False):
if i.query_hash in dict_hash:
dict_hash[str(i.query_hash)].append(i.query_name)
else:
dict_hash[i.query_hash] = []
dict_hash[str(i.query_hash)].append(i.query_name)
df_temp=pd.DataFrame(dict_hash.items())
df_temp.columns=['queryid', 'query_name']
df_temp['queryid'] = df_temp['queryid'].astype(str)
#print(df_query_stat.columns)
df_query_err=df_query_err.merge(df_temp,left_on='id', right_on='queryid')
df_query_err['query_name'] = df_query_err['query_name'].astype(str)
df_query_err=df_query_err.drop(columns=['queryid'])
df_query_stat['queryid'] = df_query_stat['queryid'].astype(int)
df_query_stat=df_query_stat.merge(df_temp, how='left', on='queryid')
Note, you need to collect information from explained queries to realize it.
Code for drawing graphics mentioned before:
fig = make_subplots(rows=11, cols=11, subplot_titles=name_query)
for it, i in enumerate(name_query):
df_temp=learn_df_query_err[learn_df_query_err['query_name']==name_query[it]]
df_temp2=stats[stats['query_name']==name_query[it]]['cardinality_error_without_aqo_split'].to_list()
if len(df_temp2)==0:
fig.add_trace(go.Scatter(name='cardinality_error', x = ox,
y = [2]*22, line = dict(color= 'black', dash='dash')),
row=(it//11)+1, col=(it%11)+1)
else:
fig.add_trace(go.Scatter(name='cardinality_error', x = ox,
y = [df_temp2[0][0]]*22, line = dict(color= 'black', dash='dash')),
row=(it//11)+1, col=(it%11)+1)
fig.add_trace(go.Scatter(name='cardinality_error', x = ox,
y = df_temp['errdelta_learn'].to_list()[:22], line = dict(color= 'red', dash='dot')),
row=(it//11)+1, col=(it%11)+1)
fig.update_layout(legend_orientation="h",
legend=dict(x=.5, xanchor="center"),
title_text="Error of cardinality")
fig.update_xaxes(title_text="Iterations", row=(it//11)+1, col=(it%11)+1)
fig.update_yaxes(title_text="Cardinality Error", row=(it//11)+1, col=(it%11)+1)
fig.update_layout(width=4000, height = 4200)
upload_pics(fig, '{}/{}_folder{}'.format(main_path, 'learn',cycle_test), 'Difference cardinality')
os.rename('temp-plot.html', '{}_learn_Difference cardinality.html'.format(cycle_test))
fig.show()
So, the benefits of using AQO under withut it you can see there or below:
The code to make the graphic:
stackData = {
"last execution time with aqo":lst_execution_time_with_aqo,
"last execution time without aqo": lst_execution_time_without_aqo,
# Use differece
"labels": name_query2
}
fig3 = go.Figure(
data=[
go.Bar(
name="frozen execution time",
x=stackData["labels"],
y=stackData["last execution time with aqo"],
offsetgroup=0,
marker_color = '#800080'
),
go.Bar(
name="disabled execution time",
x=stackData["labels"],
y=stackData["last execution time without aqo"],
offsetgroup=1,
marker_color = '#ff4d00'
)
],
layout=go.Layout(
title="Execution time with and without AQO",
yaxis_title="Execution time",
xaxis_title="queries",
width=1400, height = 800, font=dict(
family="Courier New, monospace",
size=16,
color="RebeccaPurple"
),
yaxis=dict(range=[0, 20]),
legend=dict(
orientation = 'h', xanchor = "center", x = 0.5, y= 1)
)
)
upload_pics(fig3, '{}/{}_folder{}'.format(main_path, 'learn',cycle_test), 'Stack')
fig3.show()
Evaluate the benefits as the ratio Execution time of query without AQO by Execution time with it:

Code to build it:
fig3 = go.Figure(
data=[
go.Scatter(name = 'Relative execution time', x = name_query2, y= [(b) / (m)-1 for b,m in zip(lst_execution_time_without_aqo, lst_execution_time_with_aqo)], line = dict(color= 'black', dash='solid')),
go.Scatter(name = 'No better no worse (1)', x = name_query2, y= [1]*len(name_query2), line = dict(color= 'red', dash='dash')),
go.Scatter(name = 'Better in 4 times (4)', x = name_query2, y= [4]*len(name_query2), line = dict(color= 'red', dash='dash')),
go.Scatter(name = 'Worse (0)', x = name_query2, y= [0.01]*len(name_query2), line = dict(color = '#800080', dash='dash'))],
layout=go.Layout(
title="The ratio between execution time without and with using AQO",
yaxis_title="Relative execution time",
xaxis_title="queries",
width=1400, height = 800, font=dict(
family="Courier New, monospace",
size=16,
color="Black"),
yaxis=dict(range=[-1, 10]),legend=dict(orientation = 'h', xanchor = "center", x = 0.5, y= 1)
),
)
fig3.update_traces(texttemplate='%{text:.2s}', textposition='top center')
upload_pics(fig3, '{}/{}_folder{}'.format(main_path, 'learn',cycle_test), 'Stack')
fig3.show()
The script file with all code to analyze AQO functionality is located [here](analyze AQO final version.ipynb).
