Error restoring JSONs to MongoDB in 0 - Load Data #2

@garandria

I am trying to run the notebook 0 - Load Data and hit an error in the Restore JSONs to MongoDB cell. I have already downloaded repositories.json.gz and runs.json.gz as described in the README. More details below:

gha-dataset/jupyter $ ls
'0 - Load data.ipynb'  '1 - Dataset metrics.ipynb'   repositories.json.gz   runs.json.gz
gha-dataset/jupyter $ file repositories.json.gz
repositories.json.gz: gzip compressed data, from Unix, original size modulo 2^32 69208463
gha-dataset/jupyter $ du -sh repositories.json.gz
67M     repositories.json.gz
gha-dataset/jupyter $ file runs.json.gz
runs.json.gz: gzip compressed data, from Unix, original size modulo 2^32 1063390519 gzip compressed data, reserved method, has CRC, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 1063390519
gha-dataset/jupyter $ du -sh runs.json.gz
964M    runs.json.gz

Now, when I run the cell mentioned above, I get the following encoding error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[11], line 2
      1 with gzip.open("repositories.json.gz", "rt") as fd:
----> 2     for line in tqdm(fd):
      3         mongo_repositories.insert_one(json.loads(fd.read()))
      5 with gzip.open("runs.json.gz", "rt") as fd:

File ~/.py-venv/lib/python3.11/site-packages/tqdm/notebook.py:250, in tqdm_notebook.__iter__(self)
    248 try:
    249     it = super().__iter__()
--> 250     for obj in it:
    251         # return super(tqdm...) will not catch exception
    252         yield obj
    253 # NB: except ... [ as ...] breaks IPython async KeyboardInterrupt

File ~/.py-venv/lib/python3.11/site-packages/tqdm/std.py:1181, in tqdm.__iter__(self)
   1178 time = self._time
   1180 try:
-> 1181     for obj in iterable:
   1182         yield obj
   1183         # Update and possibly print the progressbar.
   1184         # Note: does not call self.update(1) for speed optimisation.

File <frozen codecs>:322, in decode(self, input, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any idea?
(cc @YuTeruya @kanalsop)
