Merged
37 commits
777decc
chore: use consistent language for code blocks
mirdono Apr 30, 2026
afe79c8
chore(doc): remove incorrect comment concerning docker dev config
mirdono Apr 30, 2026
b3d4977
chore(doc): use links to for external resources
mirdono May 4, 2026
c5109f7
chore(doc): description of PDF pipeline
mirdono May 4, 2026
d527c15
chore(doc): add explicit TODO to clarify sections will be written
mirdono May 5, 2026
65f2987
chore(doc): add short entry on policy impact report
mirdono May 6, 2026
0180698
chore(doc): partner README placeholder
mirdono May 7, 2026
b85cee6
chore(doc): server requirements in partner README
mirdono May 5, 2026
cf5b019
chore(doc): short entry on updating application
mirdono May 18, 2026
02ce639
chore(doc): partner-specific configuration entry
mirdono May 18, 2026
5f68590
chore(drc): docker override configuration for Bamberg
mirdono May 7, 2026
4d78106
chore(doc,drc): PDF harvesting for Bamberg
mirdono May 8, 2026
6237a8a
chore(embedding): correct comment for set cron_schedule
mirdono May 8, 2026
8867a15
chore(doc): revise section on smart search setup
mirdono May 8, 2026
4091da0
chore(doc): add explanation to configure external LLMs
mirdono May 8, 2026
d448ae6
chore(doc): document server setup wrt identifier and dispatcher
mirdono May 11, 2026
49245c3
chore(doc): account management for pipeline dashboard
mirdono May 13, 2026
a7a5b1b
chore(doc): remove lingering diff lines
mirdono May 21, 2026
338142f
chore(doc): add reference to partner-specific configurations
mirdono May 22, 2026
47c38d0
fix(doc): put general overwrite file last in .env example
mirdono May 22, 2026
ec9a695
chore(doc): improved documentation on using external LLM providers
mirdono May 22, 2026
2d764b9
chore(doc): add some background information for the pipelines
mirdono May 22, 2026
c26be5f
chore(doc): remove VC section placeholder
mirdono May 22, 2026
16df169
chore(doc,drc): add segmentation API key for pdf-content service
mirdono May 26, 2026
659c19e
chore(drc): placholder for configuring an API key codelist-labeling
mirdono May 26, 2026
7a6af83
chore(drc): remove unused volume for `oparl-to-eli` service
mirdono May 22, 2026
cb0a30b
chore(doc): link to used services in OParl description
mirdono May 22, 2026
c64ddc4
chore(doc): typo
mirdono May 26, 2026
a9cc64b
chore(doc,drc): drc override configuration for Freiburg
mirdono May 22, 2026
928c83d
chore(doc): describe UC1
mirdono May 26, 2026
02ad884
chore(doc): use correct URI for SDG concept scheme
mirdono May 26, 2026
e7e261e
chore(doc): data harvesting example Freiburg
mirdono May 26, 2026
595699a
chore(doc,drc): drc overwrite configuration for Ghent
mirdono May 27, 2026
4839216
chore(doc): document OSLO harvesting for Ghent
mirdono May 27, 2026
84e63d9
chore(doc): add warnings for configuration of API keys
mirdono May 28, 2026
6432389
chore(doc,drc): add warning for API key configuration
mirdono May 28, 2026
e292eb7
chore(doc,drc): add warning for API key configuration
mirdono May 28, 2026
111 changes: 90 additions & 21 deletions README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion compose/oparl.yml
@@ -40,7 +40,6 @@ services:
oparl-to-eli:
image: lblod/oparl-to-eli-service:0.0.3
volumes:
- ../config/oparl-to-eli/:/config/
- ../data/files:/share
environment:
OPARL_ENDPOINT: 'https://ris.freiburg.de/oparl'
3 changes: 2 additions & 1 deletion config/embedding/config.py
@@ -48,4 +48,5 @@
#embedding_model = "embeddinggemma:300m-bf16" bigger, but slower
embedding_model = "embeddinggemma:300m-qat-q4_0"
#qwen3-embedding:0.6b has a larger context size, but is not recommended by AI advisory board
cron_schedule = "* * * * *" # every 5 minutes
# qwen3-embedding:0.6b has a larger context size, but is not recommended by AI advisory board
cron_schedule = "* * * * *" # every minute
202 changes: 202 additions & 0 deletions docs/README.md
@@ -0,0 +1,202 @@
# Additional documentation

This folder contains additional documentation, primarily aimed at configuring and setting up the application to support only a subset of the use cases.


## Server setup
### Requirements
#### Hardware
To run the full app, a sufficiently powerful server is advised. A GPU is only required if you want to run the LLM-based functionality locally. Otherwise, the relevant services should be configured to outsource such functionality to cloud services.

Our server has the following specifications:

- CPU: 13th Gen Intel(R) Core(TM) i5-13500
- GPU: NVIDIA RTX 4000 SFF Ada Generation
- Memory: 64GB
- Storage: 2TB

#### Software
This application is a [semantic.works](https://semantic.works/) app and therefore has limited dependencies. The following software is required to run the application:

- `git` to obtain the application source code
- `docker` and `docker compose` to configure and run the application's microservices
- A reverse proxy that forwards HTTP requests to the app's identifier service. We typically use [app-letsencrypt](https://github.com/redpencilio/app-letsencrypt) for this purpose.

### Updating the app
Generally, updating (parts of) the app consists of pulling the latest version from the remote repository via `git pull` and recreating and/or restarting the appropriate services.
For each service `A` that was added or updated (version bump or changed environment variables), run `docker compose up [-d] A`. For each service `B` whose configuration was updated in the `../config/B` folder, run `docker compose restart B`. Note that `up` on its own does **not** cause a service to reload its configuration.
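
The update steps above can be sketched as a small shell helper. This is a minimal sketch and not part of the repository: the function names `run` and `update_app` and the `DRY_RUN` switch are assumptions made for illustration, and the service names you pass in depend on what actually changed.

```bash
# Hypothetical helper (not part of this repository) sketching the update steps.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: update_app "SERVICES_TO_RECREATE" "SERVICES_TO_RESTART"
# Both arguments are space-separated lists of service names and may be empty.
update_app() {
  local recreate="${1:-}"
  local restart="${2:-}"
  local service

  run git pull

  # Added services, version bumps and changed environment variables
  # require the container to be (re)created.
  # The lists are intentionally unquoted so they split on spaces.
  for service in $recreate; do
    run docker compose up -d "$service"
  done

  # Changes in the service's config folder only require a restart.
  for service in $restart; do
    run docker compose restart "$service"
  done
}
```

For example, `DRY_RUN=1 update_app "identifier" "dispatcher"` prints the three commands that would be executed without touching the running app.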


## Service configuration

Member Author

Issue: Add a section for the ollama service.

The ollama service should pull in the appropriate LLMs, as described in #55. While the general README mentions this for the smart search setup it is currently not documented this is needed for other use cases as well.

Contributor

i flagged this with pedro and joachim, the service should have ollama load the models on startup if they aren't available yet in ollama on disk and crash if that fails. Running manual pulls by drc execing seems crazy

Comment thread
mirdono marked this conversation as resolved.
Most of the services in this app are configured via the docker compose configuration files and the appropriate configuration files in the `config` folder of this project. Note that the [gitbook page](https://app.gitbook.com/o/-MP9Yduzf5xu7wIebqPG/s/PzeOtGh2pfnNKyqa7G5w/decide-project/write-up-uc0.0-dataspace) on UC0.0 contains background on the overall architecture of a semantic.works application.

To simplify configuring the appropriate services we provide [partner-specific configurations](#partner-configurations).


### Identifier
The `identifier` service is an HTTP proxy that acts as the access point to the app. All external requests should be forwarded to this service for further processing in the app. On servers we typically use [app-letsencrypt](https://github.com/redpencilio/app-letsencrypt) as a reverse proxy to forward incoming requests to the correct app instance. To allow `app-letsencrypt` to forward requests to the correct app, the app's `identifier` service should

- expose the appropriate environment variables; and
- be part of `app-letsencrypt`'s default network.

This is most easily done in the app's `docker-compose.override.yml` configuration file. For example, the DECIDe app instance hosted by ABB has the following configuration entries:

```yaml
services:
  identifier:
    environment:
      VIRTUAL_HOST: "ds.decide.lblod.info,dashboard.decide.lblod.info,yasgui.decide.lblod.info,human-validator.decide.lblod.info"
      LETSENCRYPT_HOST: "ds.decide.lblod.info,dashboard.decide.lblod.info,yasgui.decide.lblod.info,human-validator.decide.lblod.info"
      LETSENCRYPT_EMAIL: "support+servers@redpencil.io"
  # Configuration for other services
  # ...

networks:
  proxy:
    name: letsencrypt_default
    external: true
```

The example docker compose override files in this folder contain commented template entries that can be used for your app.


### Subdomains used for different frontends
This app contains several frontends to which the `dispatcher` service forwards requests based on subdomains. This can be seen in the `dispatcher` service [configuration](../config/dispatcher/dispatcher.ex) in rules using `reverse_host` to match incoming requests. Should you use different subdomains in your app instance, make sure to update the appropriate rules in your app's dispatcher configuration.

| Frontend | Subdomain |
|---------------------------------------------------------------------------------------------------------------|------------------------|
| [Pipeline dashboard](https://github.com/lblod/frontend-harvesting-self-service/tree/feature/oparl-harvesting) | `dashboard` |
| [Yasgui](https://github.com/lblod/frontend-decide-yasgui) | `yasgui` |
| [dcat](https://github.com/lblod/frontend-decide-dcat) | `ds` |
| [Human Validation Tool](https://github.com/lblod/frontend-decide-human-validator) | `human-validator` |
| [Smart search](https://github.com/lblod/frontend-decide-question-answering)                                   | `smart-search`         |
| [Policy impact report](https://github.com/lblod/frontend-decide-policy-impact-report) | `policy-impact-report` |


### Outsource LLM to the cloud

Member Author

Issue: Clarify the need to obtain API keys for external services and how they are to be configured.

As part of LBRON-1423 the API keys will become environment variables to be configured in a non-versioned override file. This would be similar as the entity-linking configuration in the added example override.

Furthermore, it is probably best to emphasise these API keys are configured into a local docker-compose.override.yml instead of the versioned example override.

Member Author

Updated the example override entry for named-entity-recognition based on #43

Member Author

Updated the example override entry for pdf-content based on #24

> [!WARNING]
> Currently, API keys have to be configured in a (versioned) configuration file per service, as documented in the README of each service. This functionality is being reworked to allow configuring such API keys as environment variables. We will update the services' READMEs as well as the [partner-specific configurations](#partner-configurations) when this functionality is available.

By default, the AI services relying on LLMs use local models, but they can also be configured to outsource such computations to external providers in the cloud. This requires at least that you

- obtain the appropriate API keys (or other access tokens) from the providers; and
- configure the necessary services with these via their environment variables.

The README of each individual service describes the necessary configuration in more detail:

- The [named-entity-recognition (NER)](https://github.com/semantic-ai/decide-geocoding-service/blob/master/README.md#L39) service allows configuring providers for several of its features.
- The [entity-linking-backend](https://github.com/semantic-ai/entity-linking-backend/blob/master/README.md) service README documents how to configure external providers.
- The [codelist-labeling](https://github.com/semantic-ai/codelist-labeling-service/blob/master/README.md) service can be configured to use Mistral as an external provider. Using another external provider requires adding the appropriate `langchain-*` package to the service by editing its `requirements.txt` file and building your own image.
- The [Question-answering](https://github.com/semantic-ai/decide-question-answering/blob/master/README.md) service can be configured to use different providers. This does require adding the appropriate `langchain-*` package to the service by editing its `requirements.txt` file and building your own image.
- The [Embedding](https://github.com/semantic-ai/embedding-service/blob/master/README.md) service currently does **not** support using an external provider. Embeddings can be generated locally without a GPU, but this will take considerably longer.


### Login for pipeline dashboard
The app is configured with a [default account](../config/migrations/add-test-user/20251211000000-add-test-user.sparql) with username `test` for the pipeline dashboard. Accounts are managed by inserting and/or updating triples in the triplestore, typically using [migrations](https://github.com/mu-semtech/mu-migrations-service). The creation of migrations can be simplified using [mu-cli](https://github.com/mu-semtech/mu-cli).

#### Adding a new account
Creating a new account requires adding a migration similar to the already existing [account](../config/migrations/add-test-user/20251211000000-add-test-user.sparql). The [registration](https://github.com/mu-semtech/registration-service) service provides a mu-cli script to easily generate such a migration.

- Ensure [mu-cli](https://github.com/mu-semtech/mu-cli) is installed
- Uncomment the entry for the `registration` service in the appropriate override file and start the service: `docker compose up registration`
- Execute the script to generate a migration using mu-cli: `mu script registration generate-account --name NAME --account USERNAME --password PASSWORD`. This creates a migration in `config/migrations/TIMESTAMP-create-user-USERNAME.sparql`
- Restart the `migrations` service to execute the generated migration: `docker compose restart migrations`
- Stop the `registration` service and re-comment its entry in the override file
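
The steps above can be chained in a short shell sketch. This is only an illustration and not part of the repository: the function names `run` and `add_account` and the `DRY_RUN` switch are hypothetical, the `-d` flag is added here so that the `registration` service does not block the terminal, and `docker compose stop` is used as one way to stop the service afterwards. The sketch assumes the `registration` entry has already been uncommented in your override file.

```bash
# Hypothetical helper (not part of this repository) chaining the steps above.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: add_account NAME USERNAME PASSWORD
add_account() {
  local name="$1"
  local username="$2"
  local password="$3"

  # Start the registration service in the background (-d is an addition here).
  run docker compose up -d registration

  # Generate the migration with the mu-cli script of the registration service.
  run mu script registration generate-account \
    --name "$name" --account "$username" --password "$password"

  # Execute the generated migration.
  run docker compose restart migrations

  # The registration service is no longer needed once the migration exists.
  run docker compose stop registration
}
```

Keep in mind that a password passed on the command line may end up in your shell history. Re-commenting the `registration` entry in the override file remains a manual step.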

#### Disabling existing accounts
To disable an account, change its status to inactive via another migration. First, generate a new migration file using the script provided by the `migrations` service. The following command creates the file `config/migrations/TIMESTAMP-NAME.sparql`:

```bash
mu script migrations new sparql NAME
```

Second, the query below deactivates a given account. Copy this query into the generated migration file and replace `ACCOUNT_UUID` by the UUID of the `OnlineAccount` resource. This UUID can be found in the migration that initially added the account. For example, to disable the [default account](../config/migrations/add-test-user/20251211000000-add-test-user.sparql) the replacement for `ACCOUNT_UUID` would be `d011deb8-64b8-4497-81df-e32ff19cbdc5`.

```sparql
PREFIX account: <http://mu.semte.ch/vocabularies/account/>
PREFIX accounts: <http://ext.data.gift/accounts/>
# The foaf prefix is used in the WHERE clause and therefore has to be declared
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

DELETE {
  GRAPH <http://mu.semte.ch/graphs/users> {
    ?account account:status ?currentStatus .
  }
} INSERT {
  GRAPH <http://mu.semte.ch/graphs/users> {
    ?account account:status <http://mu.semte.ch/vocabularies/account/status/inactive> .
  }
} WHERE {
  GRAPH <http://mu.semte.ch/graphs/users> {
    VALUES ?account {
      accounts:ACCOUNT_UUID
    }
    ?account a foaf:OnlineAccount ;
      account:status ?currentStatus .
  }
}
```

Finally, restart the `migrations` service to execute the created migration.

```bash
docker compose restart migrations; docker compose logs -f migrations
```


## Partner configurations
This folder also contains some pre-configured docker compose configurations that disable services that are unnecessary for the use cases specific partners are interested in. The easiest way to include these configurations is to add them as the last entry in your `.env` file:

```bash
COMPOSE_FILE=docker-compose.yml:./docs/docker-compose.override.NAME.yml:docker-compose.override.yml
```

Take care **not** to include the `docker-compose.dev.yml` file here, as this can expose services to the outside world.
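
A quick sanity check of the `.env` file can catch an accidentally included development configuration. The following is a minimal sketch and not part of the repository: the function name `check_env_file` is hypothetical, and it assumes the `COMPOSE_FILE` value is unquoted and uses `:` as separator, as in the example above.

```bash
# Hypothetical check (not part of this repository).
# Prints the compose files listed in COMPOSE_FILE, one per line, and returns
# non-zero when docker-compose.dev.yml is among them.
# Usage: check_env_file [PATH_TO_ENV_FILE]
check_env_file() {
  local env_file="${1:-.env}"
  local value

  # Use the last COMPOSE_FILE assignment found in the file.
  value="$(grep '^COMPOSE_FILE=' "$env_file" | tail -n 1 | cut -d= -f2-)"

  case ":$value:" in
    *:docker-compose.dev.yml:*|*:./docker-compose.dev.yml:*)
      echo "warning: docker-compose.dev.yml is included in COMPOSE_FILE" >&2
      return 1
      ;;
  esac

  printf '%s\n' "$value" | tr ':' '\n'
}
```

Running `check_env_file .env` from the project root lists the files in the order in which they are merged, which also makes it easy to verify that your own `docker-compose.override.yml` comes last.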


### Bamberg
The city of Bamberg is mostly interested in use cases 0.1 and 2. Therefore, their [partner-specific configuration](./docker-compose.override.bamberg.yml) disables unnecessary services and provides some placeholders for configuring specific services. See the comments in the override file for more information.

#### Data harvesting
Due to technical limitations our `pdf-scraper` service cannot directly retrieve PDFs from the [web portal](https://www.stadt.bamberg.de/buergerinformationssystem/tr010) of the city of Bamberg. A workaround is to obtain the PDFs via another method and feed them into the app from disk using an additional service.

To this end, an `internal-files` service is configured in `docker-compose.override.bamberg.yml`. This service mounts the folder `data/internal-files`, in which the PDFs can be placed. Make sure to create this folder before starting the service.

In the pipeline dashboard you can use `http://internal-files/FILENAME.pdf` as input decision URLs. As municipality, select `Stadt Bamberg` from the options in the dropdown, as illustrated in the following screenshot.

![Example form for harvesting PDFs](./harvest-bamberg-form-example.png)
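
Preparing the folder and the matching URLs can be sketched in a few lines of shell. This is an illustration only and not part of the repository: the function name `stage_pdfs` is hypothetical, and it assumes file names without spaces or other characters that would need URL encoding.

```bash
# Hypothetical helper (not part of this repository).
# Copies PDFs into the folder mounted by the `internal-files` service and
# prints the URL to enter in the pipeline dashboard for each file.
# Usage: stage_pdfs APP_DIR FILE.pdf [FILE.pdf ...]
stage_pdfs() {
  local app_dir="$1"
  shift
  local target="$app_dir/data/internal-files"
  local pdf

  # Create the mounted folder if it does not exist yet.
  mkdir -p "$target"

  for pdf in "$@"; do
    cp -- "$pdf" "$target/"
    echo "http://internal-files/$(basename -- "$pdf")"
  done
}
```

For example, `stage_pdfs . ~/Downloads/decision.pdf`, run from the project root, copies the file and prints `http://internal-files/decision.pdf`.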


### Freiburg
The city of Freiburg is mostly interested in use case 1. Therefore, their [partner-specific configuration](./docker-compose.override.freiburg.yml) disables most services for other use cases and provides some placeholders for configuring relevant services. See the comments in the override file for more information.

#### Data harvesting
To harvest decisions from your OParl endpoint, use the pipeline dashboard to create a "Harvest OParl API & Publish as ELI" job. For example, the following screenshot illustrates the form to harvest all decisions from `https://ris.freiburg.de/oparl`.

![Example form for harvesting from an OParl endpoint](./harvest-freibur-form-example.png)

#### Nominatim
The `nominatim` service should be configured to retrieve the OpenStreetMap (OSM) Data Extracts for Germany instead of Belgium. To do this, set the `PBF_URL` environment variable to the correct URL, as illustrated in `docker-compose.override.freiburg.yml`.

> [!WARNING]
> The OSM Data Extracts are downloaded and processed only when starting the service for the first time. Be sure to configure the correct `PBF_URL` **before** starting the service. If you (accidentally) started the service with an incorrect `PBF_URL`, you can take the service down, remove the mounted volume, and bring the service up again with the correct configuration.

> [!NOTE]
> Downloading and processing the Data Extract for Germany takes a long time, on the order of hours on my development machine, and uses a lot of resources from your machine. You can follow its progress via the service's logs.

### Ghent
The city of Ghent is mostly interested in use cases 0.1 and 2. Therefore, their [partner-specific configuration](./docker-compose.override.ghent.yml) disables most services for other use cases and provides some placeholders for configuring relevant services. See the comments in the override file for more information.

#### Data harvesting
Decisions for Ghent are harvested from [Lokaal Beslist](https://lokaalbeslist.vlaanderen.be). The pipeline dashboard can be used to create the relevant jobs to gather data. To initially harvest all decisions, create a "Harvest Lokaal Beslist OSLO & Publish as ELI" job and select "Initial sync" as Sync mode in the form. The URL field will be automatically filled with the correct value.

![Example form for initial sync](./harvest-ghent-initial-sync-form-example.png)

> [!WARNING]
> The initial sync will take some time as it syncs all data from the configured harvester.

To update your data after an initial sync has completed, create another "Harvest Lokaal Beslist OSLO & Publish as ELI" job, but select "Delta sync" as Sync mode:

![Example form for delta sync](./harvest-ghent-delta-sync-form-example.png)

To periodically update your data automatically, you can create a Scheduled job for a delta sync. This can be done via the "Scheduled jobs" tab of the pipeline dashboard.

![Example form for scheduled delta sync](./harvest-ghent-delta-sync-scheduled-form-example.png)