
This repository contains materials which were used for AQO testing

Content of repository:

  1. What AQO is
    1. Description and sources
    2. Extra links about AQO
  2. Database about students
    1. Database description
    2. Analyze data from database
  3. AQO tests on JOB database
    1. How to prepare database
    2. How to test the AQO
    3. How to analyze the AQO test

What AQO is

Description and sources

Adaptive query optimization is an extension of the standard PostgreSQL cost-based query optimizer. Its basic principle is to use query execution statistics to improve cardinality estimation. Experimental evaluation shows that this improvement sometimes provides optimal performance. The installation of AQO is described here.

Extra links about AQO

To learn more about AQO, you can read articles and documentation and watch presentations from these sources:

AQO links (Russian version):

Database about students

Database description

The database consists of four tables. The students table contains information about students: names, genders, and learning groups. The teachers table contains information about the teachers (name and gender) who teach the courses. The course table describes courses and links each course to the ID of its assigned teacher. The scores table collects students' course grades/marks, analytics for essays (word count), and whether students took a preparation course before an exam. See the database schema at the link below: Database schema about students

Note: you can restore the dump_students.sql dump using the command:

psql -d postgres -f dump_students.sql

You can generate your own data for experiments using the database_students.sql script. Note that running it generates 100 teachers, 10 courses, 3000 students and 15000 rows for the score table. You can use the commands below to generate more data (replace N with the number of rows you want to generate):

To generate students:

insert into student (sno, gen, sgroup)
select id,
       CASE WHEN random() < 0.25 THEN 'Female' ELSE 'Male' END,
       CASE
           WHEN random() >= 0.89 THEN 'ClassE'
           WHEN random() >= 0.79 THEN 'ClassB'
           WHEN random() >= 0.58 THEN 'ClassA'
           WHEN random() >= 0.32 THEN 'ClassD'
           ELSE 'ClassC'
       END as sgroup
from generate_series(1, N) id;

update student set sname = random_string(gen);
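Note that PostgreSQL re-evaluates random() in every WHEN clause, so the class shares are not simply the gaps between the thresholds. A small Python sketch (my own illustration, not part of the repository scripts) computes the probabilities the CASE actually implies:

```python
# Each WHEN clause in the SQL CASE above calls random() again, so the
# probability of reaching a branch is the product of failing all
# earlier checks times the chance of passing its own check.
def class_probabilities():
    # (threshold, class) pairs in the order the CASE checks them
    checks = [(0.89, "ClassE"), (0.79, "ClassB"),
              (0.58, "ClassA"), (0.32, "ClassD")]
    probs = {}
    reach = 1.0  # probability of reaching this WHEN clause
    for threshold, cls in checks:
        probs[cls] = reach * (1.0 - threshold)  # P(random() >= threshold)
        reach *= threshold                      # P(random() < threshold)
    probs["ClassC"] = reach                     # ELSE branch
    return probs

print(class_probabilities())
```

Under these semantics ClassA and ClassD end up as the most populated classes, which is consistent with the group statistics shown later in this README.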

To generate teachers:

insert into teacher (tno, gen) select id, CASE WHEN random()<0.5 THEN 'Female' ELSE 'Male' END from generate_series(1,N) id;
update teacher set tname = random_string(gen);

To generate courses:

select select_random_course(N);

To generate score data:

select random_cno(N, 0);
select random_cno(N, 1);

The second parameter indicates whether the generated students took a preparation course or not.

Analyze data from database

Anyway, you'll likely end up with a nonuniform distribution. Below is some of my analytical research on the database.

  1. A short analysis with pandas profiling. To produce it, I prepared a CSV file collecting all the important information about student academic performance with this script:
psql -d postgres -c "\copy (select degree, essay_text_len, clevel, gen, sgroup, test_preparation from student, course, score where score.cno = course.cno and student.sno = score.sno) to '/home/alena/score.csv' DELIMITER ',' CSV HEADER"

After that I ran pandas profiling to analyse the data. Here are the analytical results. Brief analytical information: Brief analytical information. As you can see in the picture above, there were 30000 rows in total, 4800 of which were duplicates.
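The duplicate count can be reproduced from the exported CSV with pandas. A minimal sketch (the inline frame stands in for pd.read_csv('score.csv'); adjust the path to your export):

```python
import pandas as pd

# Stand-in for pd.read_csv('/home/alena/score.csv'); the second row
# fully repeats the first, so it counts as a duplicate.
df = pd.DataFrame({
    "degree": [70, 70, 85],
    "sgroup": ["ClassA", "ClassA", "ClassB"],
})
n_duplicates = df.duplicated().sum()  # rows identical to an earlier row
print(len(df), n_duplicates)
```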

Here is the data distribution across the degree, essay_text_len and test_preparation columns by gender: Data distribution between degree, essay_text_len and test_preparation

Here is the data distribution across the degree, essay_text_len and test_preparation columns by group:

Degree column

The degree column has a nonuniform distribution, with most values between 40 and 90. Further statistical information about this column: Statistic info about degree column. MCV statistics of the degree column: MCV statistics of degree column. Extreme MCV statistics of the degree column: Extreme MCV statistics of degree column

The most interesting is the distribution of this column by group:
- Degree distribution for Group A
- Degree distribution for Group B
- Degree distribution for Group C
- Degree distribution for Group D
- Degree distribution for Group E

The essay length in letters (essay_text_len column)

It also has a nonuniform distribution, with many students who didn't write an essay at all. Distribution of essay_text_len column. Further statistical information about this column: Statistic info about essay_text_len column. MCV statistics of the essay_text_len column: MCV statistics of essay_text_len column. Extreme MCV statistics of the essay_text_len column: Extreme MCV statistics of essay_text_len column

It is also interesting to observe the distribution of the number of letters in the essay by group:
- Distribution of the number of letters of the essay for Group E
- Distribution of the number of letters of the essay for Group D
- Distribution of the number of letters of the essay for Group C
- Distribution of the number of letters of the essay for Group B
- Distribution of the number of letters of the essay for Group A

Whether students took a preparation course before the exam (test_preparation column)

As we can see, about 80% of students took it. Students had a preparation course

Groups of students

As you can see from the picture below, we have more students from the 'A' and 'B' classes and fewer from the 'E' class. Statistical information about groups

Here is statistical information on how much data each column has by group: Statistic information how much data columns have by groups

If you want to install pandas profiling, you can read about it there

Correlation between data

I saw a high correlation between the degree and test_preparation columns (0.73) and the highest one between essay_text_len and degree (1.0): Correlation between data

Correlation analysis for group A: Correlation between data for group A

Correlation analysis for group B: Correlation between data for group B

Correlation analysis for group C: Correlation between data for group C

Correlation analysis for group D: Correlation between data for group D

Correlation analysis for group E: Correlation between data for group E

I used the following script to check it:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# df_copy is the preprocessed dataframe loaded from score.csv
corr = df_copy.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(8, 8))
    ax = sns.heatmap(corr, mask=mask, square=True, linewidths=.8,
                     cmap="autumn", annot=True)

All the analysis scripts are in database_analyzes.ipynb

AQO tests on JOB database

How to prepare database

This repository is a fork of the project. Here we unite queries and data in one place. The main goal of this repository is to simplify the deploy process.

Large tables are divided into chunks of less than 100 MB each.

The repository contains some scripts for quick deploy & run of the benchmark:

- schema.sql - creates the data schema.
- copy.sql - copies data from csv files into the tables. Uses the datadir variable to build absolute paths to the csv files; use psql -v to define it.
- parts.sql - the script which split the tables. Kept for future reuse.
- job - a template script for a quick launch of the benchmark.

The example below shows the loading procedure for all tables:

psql -f ~/join-order-benchmark/schema.sql
psql -f ~/join-order-benchmark/fkindexes.sql
psql -vdatadir="'/home/user/jo-bench'" -f ~/jo-bench/copy.sql
NB! Deployment was tested on a database with the C locale.

Please note that you should specify the absolute path to the data files when using the 'copy' functions; otherwise you may get an error that the files were not found.

How to test the AQO

disabled mode

To test AQO with JOB you should gather statistics in [disabled](https://postgrespro.ru/docs/postgrespro/14/aqo#AQO-CONFIGURATION) mode. To do that, adjust your database by applying these GUCs:

aqo.mode = 'disabled'
aqo.force_collect_stat = 'on'

force_collect_stat allows you to gather statistics and store them in aqo_query_stat in any mode. You can use it to examine the planning and execution time, as well as the average cardinality error, to decide whether it is better to use AQO for the queries or not.
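As a sketch of that comparison (plain lists stand in for the per-iteration values stored in aqo_query_stat; the query names are illustrative), averaging the times per query shows whether AQO helped:

```python
# Toy per-query comparison of average execution time with and without AQO.
# In the real workflow these numbers come from the aqo_query_stat export.
stats = {
    "q1": {"with_aqo": [1.2, 0.9, 0.8], "without_aqo": [1.5, 1.4, 1.6]},
    "q2": {"with_aqo": [2.0, 2.1],      "without_aqo": [1.0, 1.1]},
}

def aqo_helps(times):
    avg = lambda xs: sum(xs) / len(xs)
    # AQO helps when the average time with AQO is lower
    return avg(times["with_aqo"]) < avg(times["without_aqo"])

helped = {q: aqo_helps(t) for q, t in stats.items()}
print(helped)
```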

I used 15 iterations in disabled mode to estimate the average planning and execution time for every query. I also gathered statistics from the EXPLAIN output of each query and used both to analyze how AQO works on the JOB database.

I used the script:

# disabled_iters, QUERY_DIR and file_output must be set beforehand, e.g.:
#   disabled_iters=15  QUERY_DIR=~/join-order-benchmark  file_output=disabled_report.csv
for (( i=1; i<=disabled_iters; i++ ))
do
  for file in $QUERY_DIR/queries/*.sql
  do
    short_file=$(basename "$file")
    echo -n "EXPLAIN (ANALYZE, VERBOSE, FORMAT JSON) " > test.sql
    cat $file >> test.sql
    result=$(psql -d postgres -f test.sql)

    query_hash=$(echo $result | grep -Eo '"Query Identifier": [+-]?[0-9]*' | grep -Eo '[+-]?[0-9]*')
    exec_time=$(echo $result | sed -n 's/.*"Execution Time": \([0-9]*\.[0-9]*\).*/\1/p')
    plan_time=$(echo $result | sed -n 's/.*"Planning Time": \([0-9]*\.[0-9]*\).*/\1/p')

    echo -n "EXPLAIN " > test.sql
    cat $file >> test.sql
    result=$(psql -d postgres -f test.sql)

    echo -e "$i,$short_file,$exec_time,$plan_time,$query_hash" >> $file_output
    echo -e "$i,$short_file,$exec_time,$plan_time,$query_hash"
  done
done

psql -d postgres -c "\copy (select * from aqo_query_stat) to '${mode}_aqo_query_stat.csv' DELIMITER ',' CSV HEADER"
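The grep/sed extraction in the script above is fragile against formatting changes; since the plan is already JSON, the same fields can be pulled out with a JSON parser instead. A Python sketch (the inline plan stands in for real psql output):

```python
import json

# Parse the output of EXPLAIN (ANALYZE, VERBOSE, FORMAT JSON);
# the inline string below is a stand-in for what psql returns.
explain_output = '''[
  {"Plan": {"Node Type": "Seq Scan"},
   "Query Identifier": -1234567890,
   "Planning Time": 0.251,
   "Execution Time": 12.345}
]'''

plan = json.loads(explain_output)[0]   # EXPLAIN JSON output is a one-element array
query_hash = plan["Query Identifier"]
plan_time = plan["Planning Time"]
exec_time = plan["Execution Time"]
print(query_hash, plan_time, exec_time)
```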

learn mode

To learn queries on the JOB database, change the mode GUC to learn and repeat the script above. During learning you should disable parallel workers to make learning faster and more stable. I applied these GUCs:

min_parallel_table_scan_size = 1
min_parallel_index_scan_size = 1
max_parallel_workers_per_gather = 1
max_parallel_maintenance_workers = 1

I needed 15 iterations to reach the minimal cardinality error, but initially I set 22 iterations.

In addition, you can check the cardinality error after every learning iteration and see whether the cardinality estimates for the queries converge or not. Use this command:

psql -d postgres -c "SELECT error AS error_aqo FROM aqo_cardinality_error(true)"
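One simple way to judge convergence from that per-iteration error series is to compare the average error at the start and at the end of learning. A rough sketch (the threshold factor is an arbitrary choice of mine):

```python
# Heuristic convergence check for a per-iteration cardinality error series:
# the tail average should be much smaller than the head average.
def converged(errors, tail=3, factor=0.1):
    if len(errors) <= tail:
        return False
    head_avg = sum(errors[:tail]) / tail
    tail_avg = sum(errors[-tail:]) / tail
    return tail_avg <= factor * head_avg

print(converged([5.0, 3.2, 1.1, 0.4, 0.05, 0.0]))  # error drops -> converged
print(converged([5.0, 4.9, 5.2, 5.1, 5.0, 4.8]))   # error flat -> not converged
```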

controlled mode

The last stage is running AQO in controlled or frozen mode. The main difference between them is whether new queries are collected into the AQO data for learning or not. For known queries the planner uses the cardinality information stored in AQO. I applied the GUC:

aqo.mode = 'controlled'

And I turned parallel workers back on there, applying GUCs like:

min_parallel_table_scan_size = 1
min_parallel_index_scan_size = 1
max_parallel_workers_per_gather = 1
max_parallel_maintenance_workers = 1

You can see my test results in the JOB folder for three modes: disabled, frozen and learn. Files like "1_report.csv" or "frozen_1_report.csv" ("learn_1_report") contain the execution time, planning time and query hash of the queries. Files like "setup_learn_settings.sh" contain the settings that I applied at every stage of AQO testing. The files aqo_job_disabled, aqo_job_forced and aqo_job_learn contain the test script described above.

You should pay attention to the paths set in the script, particularly the values of the variables below:
- INSTDIR - the path where the install folder with "bin" is located
- QUERY_DIR - the path where the JOB queries are
- folder_output - the catalog where you store test results

How to analyze the AQO test

To compare the planner's statistics with and without AQO, first of all you should examine the cardinality error at every learning iteration. Note that the cardinality error decreases only during the learning process; in disabled and controlled modes it changes only slightly. If a query is learnable, the error drops and at the last iteration it may even reach 0. In learn mode, planning and execution time may be unstable because the planner tries to generate a new optimal plan each time it receives new cardinality information.

The graphics there show how the cardinality error falls over 22 iterations while the queries were learning.

As I mentioned before, during learning the planner will probably regenerate the query execution plan, so the planning time will be higher.

See the picture with the planning and execution time for every query there

The next thing you should consider is the query execution time. If with AQO the planner built a better plan than before, the execution time must be lower.

See the difference between execution time with and without AQO for every query there
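That per-query benefit can be summarized as the ratio of execution time without AQO to execution time with it (a ratio above 1 means AQO helped). A trivial helper, as a sketch:

```python
# Speedup per query: execution time without AQO divided by time with AQO.
# ratio > 1 means AQO produced a faster plan; ratio < 1 means a regression.
def speedups(without_aqo, with_aqo):
    return [w / a for w, a in zip(without_aqo, with_aqo)]

print(speedups([4.0, 1.0, 0.5], [1.0, 1.0, 1.0]))  # [4.0, 1.0, 0.5]
```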

I used the analytic information from aqo_query_stat and copied the results with this query:

psql -d postgres -c "\copy (select * from aqo_query_stat) to 'aqo_query_stat.csv' DELIMITER ',' CSV HEADER"

So, after preprocessing (I used the code below and my own functions), I got a representative view of the data for analysis:

import pandas as pd

import functions
from preprocess_dataframe import get_preprocess_dataframe

df_query_stat = pd.read_csv('/home/alena/Test-AQO/JOB/disabled/aqo_query_stat.csv')
lst_stat_columns = ['execution_time_with_aqo', 'execution_time_without_aqo',
                    'planning_time_with_aqo', 'planning_time_without_aqo',
                    'cardinality_error_with_aqo', 'cardinality_error_without_aqo']
lst_stat_cut = [1, 1, 1, 1, 1, 1]
df_query_stat = get_preprocess_dataframe(df_query_stat, lst_stat_columns, lst_stat_cut)

I got a table that looks like this: aqo_query_stats

For processing the report files (like learn_1_report.csv) and the cardinality error files (like learn_aqo_query_err_1.csv) I used the script below:

import glob

import pandas as pd

# collect all per-iteration report files
filename_basics = glob.glob("{}/*_report.csv".format(path_folder))

df_list = []
for file in filename_basics:
    df = pd.read_csv(file)
    df['Query hash'] = df['Query hash'].astype(str)
    df_list.append(df)
df_ex_pl_time = pd.concat(df_list, ignore_index=True)

# report
df_ex_pl_time = df_ex_pl_time.fillna(0)
df_ex_pl_time = df_ex_pl_time.rename(columns={'Plan time': 'plan_time',
                                              'Query Number': 'query_number',
                                              'Query Name': 'query_name',
                                              'Execution Time': 'execution_time',
                                              'Query hash': 'query_hash'})
df_ex_pl_time['query_hash'] = df_ex_pl_time['query_hash'].astype(str)
df_ex_pl_time['plan_time'] = df_ex_pl_time['plan_time'].astype(float)
df_ex_pl_time_orig = df_ex_pl_time.copy()

# collect all cardinality error files
filename_basics = glob.glob("{}/*_err*.csv".format(path_folder))
df_list = [pd.read_csv(file) for file in filename_basics]
df_query_err = pd.concat(df_list, ignore_index=True)
df_query_err['id'] = df_query_err['id'].astype(str)
df_query_err['errdelta'] = df_query_err['errdelta'].astype(float)

To map query hashes to the corresponding query names I used this script:

dict_hash = {}
for i in df_ex_pl_time.itertuples(index=False):
    key = str(i.query_hash)
    if key not in dict_hash:
        dict_hash[key] = []
    dict_hash[key].append(i.query_name)
df_temp = pd.DataFrame(dict_hash.items())
df_temp.columns = ['queryid', 'query_name']
df_temp['queryid'] = df_temp['queryid'].astype(str)
df_query_err = df_query_err.merge(df_temp, left_on='id', right_on='queryid')
df_query_err['query_name'] = df_query_err['query_name'].astype(str)
df_query_err = df_query_err.drop(columns=['queryid'])
# keep queryid as str so the merge keys have matching types
df_query_stat['queryid'] = df_query_stat['queryid'].astype(str)
df_query_stat = df_query_stat.merge(df_temp, how='left', on='queryid')

Note that you need to collect information from the EXPLAIN'ed queries for this to work.

The code for drawing the graphics mentioned before:

fig = make_subplots(rows=11, cols=11, subplot_titles=name_query)
for it, i in enumerate(name_query):
    df_temp=learn_df_query_err[learn_df_query_err['query_name']==name_query[it]]
    df_temp2=stats[stats['query_name']==name_query[it]]['cardinality_error_without_aqo_split'].to_list()
    if len(df_temp2)==0:
        fig.add_trace(go.Scatter(name='cardinality_error', x = ox,
                         y = [2]*22, line = dict(color= 'black', dash='dash')),
              row=(it//11)+1, col=(it%11)+1)
    else:
        fig.add_trace(go.Scatter(name='cardinality_error', x = ox,
                         y = [df_temp2[0][0]]*22, line = dict(color= 'black', dash='dash')),
              row=(it//11)+1, col=(it%11)+1)
    fig.add_trace(go.Scatter(name='cardinality_error', x = ox,
                         y = df_temp['errdelta_learn'].to_list()[:22], line = dict(color= 'red', dash='dot')),
              row=(it//11)+1, col=(it%11)+1)
    fig.update_layout(legend_orientation="h",
                  legend=dict(x=.5, xanchor="center"),
                   title_text="Error of cardinality")
    fig.update_xaxes(title_text="Iterations", row=(it//11)+1, col=(it%11)+1)
    fig.update_yaxes(title_text="Cardinality Error", row=(it//11)+1, col=(it%11)+1)
fig.update_layout(width=4000, height = 4200)
upload_pics(fig, '{}/{}_folder{}'.format(main_path, 'learn',cycle_test), 'Difference cardinality')
os.rename('temp-plot.html', '{}_learn_Difference cardinality.html'.format(cycle_test))
fig.show()

So, you can see the benefits of using AQO over not using it there or below:

The difference between Execution time with and without AQO

The code to make the graphic:

stackData = {
    "last execution time with aqo":lst_execution_time_with_aqo,
    "last execution time without aqo": lst_execution_time_without_aqo,
    # Use the difference
    "labels": name_query2
}


fig3 = go.Figure(
    data=[
        go.Bar(
            name="frozen execution time",
            x=stackData["labels"],
            y=stackData["last execution time with aqo"],
            offsetgroup=0,
            marker_color = '#800080'
        ),
        go.Bar(
            name="disabled execution time",
            x=stackData["labels"],
            y=stackData["last execution time without aqo"],
            offsetgroup=1,
            marker_color = '#ff4d00'
        )
    ],
    layout=go.Layout(
        title="Execution time with and without AQO",
        yaxis_title="Execution time",
        xaxis_title="queries",
        width=1400, height = 800, font=dict(
        family="Courier New, monospace",
        size=16,
        color="RebeccaPurple"
    ),
        yaxis=dict(range=[0, 20]),
                 legend=dict(
    orientation = 'h', xanchor = "center", x = 0.5, y= 1)
    )
)
upload_pics(fig3, '{}/{}_folder{}'.format(main_path, 'learn',cycle_test), 'Stack')
fig3.show()

Evaluate the benefit as the ratio of the execution time without AQO to the execution time with it: The ratio between execution time without and with using AQO

Code to build it:

fig3 = go.Figure(
    data=[
        go.Scatter(name = 'Relative execution time', x = name_query2, y= [b / m for b, m in zip(lst_execution_time_without_aqo, lst_execution_time_with_aqo)], line = dict(color= 'black', dash='solid')),
        go.Scatter(name = 'No better no worse (1)', x = name_query2, y= [1]*len(name_query2), line = dict(color= 'red', dash='dash')),
        go.Scatter(name = 'Better in 4 times (4)', x = name_query2, y= [4]*len(name_query2), line = dict(color= 'red', dash='dash')),
    go.Scatter(name = 'Worse (0)', x = name_query2, y= [0.01]*len(name_query2), line = dict(color = '#800080', dash='dash'))],
layout=go.Layout(
        title="The ratio between execution time without and with using AQO",
        yaxis_title="Relative execution time",
        xaxis_title="queries",
        width=1400, height = 800, font=dict(
        family="Courier New, monospace",
        size=16,
        color="Black"),
    yaxis=dict(range=[-1, 10]),legend=dict(orientation = 'h', xanchor = "center", x = 0.5, y= 1)
    ),
)
fig3.update_traces(texttemplate='%{text:.2s}', textposition='top center')
upload_pics(fig3, '{}/{}_folder{}'.format(main_path, 'learn',cycle_test), 'Stack')
fig3.show()

The script file with all code to analyze AQO functionality is located [here](analyze AQO final version.ipynb).
