Skip to content

Conversation

@harshach
Copy link
Collaborator

@harshach harshach commented Jan 14, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • New SFTP connector:
    • Python ingestion source in ingestion/src/metadata/ingestion/source/drive/sftp/ with paramiko-based connection handling
    • Supports username/password and SSH key authentication (RSA, Ed25519, ECDSA, DSS)
  • CSV/TSV schema extraction:
    • _extract_csv_schema in metadata.py uses pandas to detect column types and extract sample data from structured files
  • Backend File entity extension:
    • FileRepository.java adds support for columns field with database schema updates in migration scripts
  • Connection schema:
    • sftpConnection.json defines configuration with rootDirectories, structuredDataFilesOnly, and filter patterns
  • Comprehensive testing:
    • 616 lines of unit tests and 564 lines of integration tests using testcontainers with real SFTP server

This will update automatically on new commits.


@github-actions
Copy link
Contributor

The Java checkstyle failed.

Please run mvn spotless:apply in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Java code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Contributor

TypeScript types have been updated based on the JSON schema changes in the PR

@github-actions
Copy link
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 14, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (7)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
jaraco.context GHSA-58pv-8j8x-9vj2 🚨 HIGH 5.3.0 6.1.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

@github-actions
Copy link
Contributor

github-actions bot commented Jan 14, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (15)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
deepdiff CVE-2025-58367 🔥 CRITICAL 7.0.1 8.6.1
jaraco.context GHSA-58pv-8j8x-9vj2 🚨 HIGH 5.3.0 6.1.0
jaraco.context GHSA-58pv-8j8x-9vj2 🚨 HIGH 5.3.0 6.1.0
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

ulixius9
ulixius9 previously approved these changes Jan 19, 2026
@gitar-bot
Copy link

gitar-bot bot commented Jan 20, 2026

🔍 CI failure analysis for ac4d2bb: Five CI failures: (1) CodeQL pre-existing security issue, (2) Python SFTP unit test path bug, (3) Playwright infrastructure failure, (4) maven-collate-ci external test suite failure, (5) UI unit tests need updates for new schema tab.

Issue 1: CodeQL Security Analysis Failure (Pre-existing)

CodeQL detected clear-text logging of sensitive information in ingestion/src/metadata/ingestion/api/parser.py (lines 342, 407). This file was NOT modified in this PR - it's a pre-existing security issue.


Issue 2: Python SFTP Unit Test Failure (Critical - Bug in PR)

The py-run-tests (3.10) job failed with:

FAILED ingestion/tests/unit/topology/drive/test_sftp.py::TestSftpSource::test_fetch_directories
AssertionError: Lists differ: ['subdir1'] != ['data', 'subdir1']

The fix is to return the full path including the root directory in _build_directory_path().


Issue 3: Playwright CI Docker Build Failure (Infrastructure - Unrelated)

Network timeout (exit code 28) while downloading IBM i Access ODBC drivers. This is a transient infrastructure failure unrelated to the SFTP connector changes.


Issue 4: Maven Collate CI Failure (External Test Suite - Unrelated)

The maven-collate-ci job failed after triggering an external workflow:

Workflow triggered: https://github.com/open-metadata/openmetadata-collate/actions/runs/21156410687
Workflow completed with conclusion=failure

Root Cause: This job dispatches a workflow to the separate openmetadata-collate repository to run integration tests. The external workflow failed.

This failure is unrelated to the SFTP connector changes. The PR doesn't modify any collate-related code. This is either:

  • A pre-existing issue in the collate repository
  • Infrastructure problems in the external test suite
  • Flaky tests in the collate repository

Issue 5: UI Coverage Tests Failure (Related to PR)

The ui-coverage-tests job failed with 2 test errors:

FAIL src/utils/FileDetailsUtils.test.tsx
FAIL src/utils/FileClassBase.test.ts

Root Cause (UI Tests):

Test 1: FileDetailsUtils.test.tsx

  • Error: TypeError: (0 , LocalUtil_1.detectBrowserLanguage) is not a function
  • The fix is: Add a mock at the top of the test file:
jest.mock('utils/LocalUtil', () => ({
  detectBrowserLanguage: jest.fn(() => 'en-US'),
}));

Test 2: FileClassBase.test.ts

  • Error: expect(received).toHaveLength(expected) - Expected 5 tabs, received 6
  • Cause: The PR adds a new "schema" tab to File entities
  • The fix is: Update the test assertion from expect(result).toHaveLength(5) to expect(result).toHaveLength(6)

Summary of All Failures:

  • CodeQL: Pre-existing security issue (unmodified code)
  • py-run-tests: SFTP path bug (requires fixing)
  • playwright-ci: Infrastructure failure (IBM driver download timeout)
  • maven-collate-ci: External test suite failure (unrelated)
  • ui-coverage-tests: Two tests need updates (requires fixing)
Code Review 👍 Approved with suggestions 3 resolved / 4 findings

Solid new SFTP connector feature with comprehensive test coverage. One resource leak issue to address in the connection handling code.

⚠️ Bug: Resource leak: Transport not closed on SFTPClient failure

📄 ingestion/src/metadata/ingestion/source/drive/sftp/connection.py:96

In get_connection(), if SFTPClient.from_transport(transport) throws an exception after transport.connect() has succeeded, the Transport object is never closed, causing a resource leak.

Impact: Each failed connection attempt leaks a socket/transport, which can accumulate over time leading to resource exhaustion.

Suggested fix: Close the transport in the exception handler:

except Exception as exc:
    if transport:
        try:
            transport.close()
        except Exception:
            pass
    logger.debug(traceback.format_exc())
    raise SourceConnectionException(f"Failed to connect to SFTP server: {exc}")
More details ✅ 3 resolved
Bug: `extractSampleData` config option is defined but never used

📄 ingestion/src/metadata/ingestion/source/drive/sftp/metadata.py
The extractSampleData configuration option is defined in the JSON schema (sftpConnection.json) and generates corresponding TypeScript types, but it's never checked in the actual Python ingestion code.

Looking at metadata.py, the _extract_csv_schema method:

  1. Always extracts both columns AND sample data
  2. The sample data is discarded in both call sites: columns, _ = self._extract_csv_schema(file_info.full_path, file_info.name)

Expected behavior: When extractSampleData is False (default), only schema/columns should be extracted. When True, sample data should also be captured.

Current behavior: Sample data is always extracted internally (performance cost) and always discarded (never used).

Impact:

  • Users who set extractSampleData: true will expect sample data but won't get it
  • Users who leave it at default (false) still pay the performance cost of extracting sample data

Suggested fix:

  1. Check service_connection.extractSampleData before extracting sample data
  2. If extractSampleData is True, store/use the sample_data returned by _extract_csv_schema
  3. Pass sample_data to the CreateFileRequest when available
Bug: `extractSampleData` config option defined but never used

📄 ingestion/src/metadata/ingestion/source/drive/sftp/metadata.py
The extractSampleData configuration option is defined in the SFTP connection JSON schema (line 138-143 of sftpConnection.json) with a description indicating it should control sample data extraction. However, the Python ingestion code in metadata.py never checks this config option.

The _extract_csv_schema method extracts both columns and sample data and returns them as a tuple (columns, sample_data). But in yield_file, the sample data is always discarded with columns, _ = self._extract_csv_schema(...) regardless of the extractSampleData setting.

This means:

  1. Sample data is always extracted from SFTP files (performance overhead) even when disabled
  2. Sample data is never included in the CreateFileRequest, so users cannot benefit from this feature even if they enable it

Suggested fix:

  1. Check self.service_connection.extractSampleData before calling _extract_csv_schema
  2. When enabled, include the sample data in the CreateFileRequest or use the separate sample data API endpoint
Quality: Unit test file deleted without full replacement

📄 openmetadata-service/src/test/java/org/openmetadata/service/resources/drives/FileResourceTest.java
The FileResourceTest.java (882 lines) was completely deleted. While there is a FileResourceIT.java integration test that was modified to add tests for the new columns functionality, it's unclear if all the deleted unit tests have been replaced.

The deleted tests included important functionality like:

  • post_fileCreateWithInvalidService_400
  • post_fileCreateWithoutRequiredFields_400
  • test_listFilesWithRootParameter
  • test_listFilesWithRootParameterAndPagination
  • testBulk_PreservesUserEditsOnUpdate
  • etc.

Consider ensuring the integration tests cover these scenarios or keeping essential unit tests that don't require full integration setup.

What Works Well

Well-structured new SFTP connector with support for username/password and SSH key authentication. Comprehensive test suite with both unit tests (794 lines) and integration tests using testcontainers. Clean separation of concerns between connection handling, metadata extraction, and CSV schema parsing. Backend properly extends FileRepository with column support and sample data APIs.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off Gitar will not commit updates to this branch.
Display: compact Hiding non-applicable rules.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | This comment will update automatically (Docs)


transport.connect(username=auth_type.username, pkey=pkey)
else:
raise ValueError(f"Unsupported authentication type: {type(auth_type)}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: Resource leak: Transport not closed on SFTPClient failure

Details

In get_connection(), if SFTPClient.from_transport(transport) throws an exception after transport.connect() has succeeded, the Transport object is never closed, causing a resource leak.

Impact: Each failed connection attempt leaks a socket/transport, which can accumulate over time leading to resource exhaustion.

Suggested fix: Close the transport in the exception handler:

except Exception as exc:
    if transport:
        try:
            transport.close()
        except Exception:
            pass
    logger.debug(traceback.format_exc())
    raise SourceConnectionException(f"Failed to connect to SFTP server: {exc}")

Was this helpful? React with 👍 / 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants